[AccessD] Speed Problems when looping through records or theterrors of deduping

Sun Sep 14 07:49:07 CDT 2008

Kevin:

Is most of the time being used in the Jaro Function?  If so, could you make
a simple test or two which would rule in or rule out a match? Like if the
first N characters of the two string match then it passes the test and you
don't have to call the function?

Rocky Smolin
Beach Access Software
858-259-4334
www.e-z-mrp.com
www.bchacc.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
Sent: Sunday, September 14, 2008 1:26 AM
To: accessd at databaseadvisors.com
Subject: Re: [AccessD] Speed Problems when looping through records or
theterrors of deduping

Hi Kevin

Given the explanation here on the Jaro-Winkler Distance:

  http://en.wikipedia.org/wiki/Jaro-Winkler 

you ultimately will have to make 18000^2 comparisons.
You could perhaps gain some speed by moving the logic to the query but the
implementation would not be easy.

So I believe you are left with some brute-force methods to decrease the
count of comparisons. One I could think of is to only compare titles of
similar length. A title of length, say, 20 will not match one of length 40
while your two examples have either an exact match of length or a difference
of one.

The accepted difference in length could be found dynamically from the
minimum Jaro score to accept. With a difference of 1 (the example with
dwayne and duane from the link above), the Jaro score is above 0.8 thus you
would probably need to accept a difference of 2 or 3. My guess is that for a
score of 0.9 a difference of only 1 could be accepted while for a score of
0.7 the difference could be 4 or 5. A few experiments with your function
could reveal that.

With this simple method you could probably cut the processing time by 80% by
decreasing the count of iterations from 18000 * 18000 to perhaps only 18000
* 3000.

/gustav

>>> thewaddles at sbcglobal.net 14-09-2008 09:08 >>>
Hello,

I have a table with 18,000 records.  The records are titles stored as a
string.

I am trying compare each record to all of the other records in the table
using a Jaro-Winkler Distance function to find similar / potential duplicate
records.

So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
'Composite Science and Technology' and 'Composites Science and Technology'
would flag as similar.

I have tried looping through the table twice:
    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
    Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
    rst2.MoveFirst
    rst1.MoveFirst
    Do Until rst1.EOF
            If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) >
0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
            DoEvents
            rst2.MoveNext
        Loop
        rst1.MoveNext
    Loop

I have tried wrapping the Table into a Class Module and Collection and
looping through the collection twice:
    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
    rst1.MoveFirst
    Do Until rst1.EOF
        Set cls = New clsCitation
        Set cls2 = New clsCitation2
            cls.CitID = rst1![CitationTitleID]
            cls.Citation = rst1![CitationTitle]
            cls.RecordCount = cnt
            Citation.Add cls

        cls2.CitID2 = rst1![CitationTitleID]
        cls2.Citation2 = rst1![CitationTitle]
        cls2.RecordCount2 = cnt
        Citation2.Add cls2
        rst1.MoveNext
    Loop

    For Each cls In Citation
        For Each cls2 In Citation2
            If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE"
            DoEvents
        Next
    Next

Essentially, I end up looping through the entire table 18,000 times!
No matter what I do it is taking from 45 to 60 seconds to make one pass
through the table.  
This means that to complete the entire process would take about 12 days, 24
hours a day!

Can someone point me to a faster method for applying fuzzy deduping?

Thanks,
Kevin

--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com