[AccessD] Speed Problems when looping through records or the terrors of deduping

Stephen stephen at bondsoftware.co.nz
Sun Sep 14 03:09:25 CDT 2008


Kevin,

What if you built a query with the table occuring 2 times, with the
Title field from each of the tables as output, then as the third output
field, a call to your JaroDistance function with the 2 fields, and a
criteria on this field of >0.8 ??

I can see immediate problems with this approach though.  A cartesian
query on this data will produce a huge potential recordset (18K * 18K =
324M if they all met the criteria).  

As a test, I just cobbled together a 500 record table, did the above (my
JaroDistance function simply returned a random number between 0 and 1
using the RND function).  Of the potential 250K record output there were
about 50K meeting the criteria and the query took about 65 seconds on my
*slow* development computer.  I know, I know, 324M is about 1300x bigger
than 250K, but at the above figures, 23-24 hours should do it.

On second thought, there has to be a better way ... someone else ??

FWIW,
Stephen Bond






-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Kevin Waddle
Sent: Sunday, 14 September 2008 7:16 p.m.
To: Stephen
Subject: [AccessD] Speed Problems when looping through records or the
terrors of deduping

Hello,
 
I have a table with 18,000 records.  The records are titles stored as a
string.
 
I am trying compare each record to all of the other records in the table
using a Jaro-Winkler Distance function to find similar / potential
duplicate records.
 
So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
'Composite Science and Technology' and 'Composites Science and
Technology'
would flag as similar.
 
I have tried looping through the table twice:
    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
    Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
    rst2.MoveFirst
    rst1.MoveFirst
    Do Until rst1.EOF
            If JaroDistance(rst1![CitationTitle], rst2![CitationTitle])
>
0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
            DoEvents
            rst2.MoveNext
        Loop
        rst1.MoveNext
    Loop

I have tried wrapping the Table into a Class Module and Collection and
looping through the collection twice:
    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
    rst1.MoveFirst
    Do Until rst1.EOF
        Set cls = New clsCitation
        Set cls2 = New clsCitation2
            cls.CitID = rst1![CitationTitleID]
            cls.Citation = rst1![CitationTitle]
            cls.RecordCount = cnt
            Citation.Add cls

        cls2.CitID2 = rst1![CitationTitleID]
        cls2.Citation2 = rst1![CitationTitle]
        cls2.RecordCount2 = cnt
        Citation2.Add cls2
        rst1.MoveNext
    Loop
 
    For Each cls In Citation
        For Each cls2 In Citation2
            If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE"
            DoEvents
        Next
    Next
 
 
Essentially, I end up looping through the entire table 18,000 times!
No matter what I do it is taking from 45 to 60 seconds to make one pass
through the table.  
This means that to complete the entire process would take about 12 days,
24 hours a day!
 
Can someone point me to a faster method for applying fuzzy deduping?
 
Thanks,
Kevin
--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com






More information about the AccessD mailing list