Max Wanadoo
max.wanadoo at gmail.com
Sun Sep 14 06:33:03 CDT 2008
Would SOUNDEX be of any use? Probably not, but have a google at it.

Max

On Sun, Sep 14, 2008 at 9:25 AM, Gustav Brock <Gustav at cactus.dk> wrote:

> Hi Kevin
>
> Given the explanation here on the Jaro-Winkler Distance:
>
> http://en.wikipedia.org/wiki/Jaro-Winkler
>
> you ultimately will have to make 18000^2 comparisons.
> You could perhaps gain some speed by moving the logic into the query, but
> the implementation would not be easy.
>
> So I believe you are left with some brute-force methods to decrease the
> count of comparisons. One I could think of is to compare only titles of
> similar length. A title of length, say, 20 will not match one of length
> 40, while your two examples have either an exact match of length or a
> difference of one.
>
> The accepted difference in length could be found dynamically from the
> minimum Jaro score to accept. With a difference of 1 (the example with
> dwayne and duane from the link above), the Jaro score is above 0.8, thus
> you would probably need to accept a difference of 2 or 3. My guess is
> that for a score of 0.9 a difference of only 1 could be accepted, while
> for a score of 0.7 the difference could be 4 or 5. A few experiments
> with your function could reveal that.
>
> With this simple method you could probably cut the processing time by
> 80% by decreasing the count of iterations from 18000 * 18000 to perhaps
> only 18000 * 3000.
>
> /gustav
>
> >>> thewaddles at sbcglobal.net 14-09-2008 09:08 >>>
> Hello,
>
> I have a table with 18,000 records. The records are titles stored as
> strings.
>
> I am trying to compare each record to all of the other records in the
> table using a Jaro-Winkler Distance function to find similar / potential
> duplicate records.
>
> So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar, and
> 'Composite Science and Technology' and 'Composites Science and
> Technology' would flag as similar.
>
> I have tried looping through the table twice:
>
>     Set dbs = CurrentDb
>     Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
>     Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
>     rst1.MoveFirst
>     Do Until rst1.EOF
>         rst2.MoveFirst
>         Do Until rst2.EOF
>             If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) > 0.8 Then
>                 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
>             End If
>             DoEvents
>             rst2.MoveNext
>         Loop
>         rst1.MoveNext
>     Loop
>
> I have tried wrapping the table into a class module and collection and
> looping through the collection twice:
>
>     Set dbs = CurrentDb
>     Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
>     rst1.MoveFirst
>     Do Until rst1.EOF
>         Set cls = New clsCitation
>         Set cls2 = New clsCitation2
>         cls.CitID = rst1![CitationTitleID]
>         cls.Citation = rst1![CitationTitle]
>         cls.RecordCount = cnt
>         Citation.Add cls
>
>         cls2.CitID2 = rst1![CitationTitleID]
>         cls2.Citation2 = rst1![CitationTitle]
>         cls2.RecordCount2 = cnt
>         Citation2.Add cls2
>         rst1.MoveNext
>     Loop
>
>     For Each cls In Citation
>         For Each cls2 In Citation2
>             If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
>                 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
>             End If
>             DoEvents
>         Next
>     Next
>
> Essentially, I end up looping through the entire table 18,000 times!
> No matter what I do, it takes 45 to 60 seconds to make one pass through
> the table.
> This means that completing the entire process would take about 12 days,
> running 24 hours a day!
>
> Can someone point me to a faster method for applying fuzzy deduping?
>
> Thanks,
> Kevin
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
>
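[Editor's note: the Jaro and Jaro-Winkler scores discussed in the thread can be sketched as follows. This is a minimal Python sketch of the standard definitions from the linked Wikipedia article, not Kevin's actual JaroDistance function; the function names are illustrative.]

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: characters match if equal and within a window of
    floor(max_len / 2) - 1 positions; transpositions are matched characters
    in a different order, counted in halves."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: walk the matched characters of both strings
    # in order and count positions where they disagree, then halve.
    t = 0
    k = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler variant: boost the Jaro score by the length of the common
    prefix (capped at 4), scaled by p (conventionally 0.1)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# The thread's example: "dwayne" vs "duane" scores above 0.8 on Jaro alone.
print(round(jaro("dwayne", "duane"), 3))          # ≈ 0.822
print(round(jaro_winkler("dwayne", "duane"), 3))  # ≈ 0.840
```

This matches Gustav's observation that a one-character length difference can still score above the 0.8 threshold Kevin is using.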
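[Editor's note: Gustav's length-blocking idea, combined with visiting each unordered pair only once, can be sketched as below. The sketch uses difflib's SequenceMatcher ratio as a stand-in scorer (it is not Jaro-Winkler); `similar_pairs`, `max_len_diff`, and the 0.8 threshold are illustrative assumptions.]

```python
from difflib import SequenceMatcher  # stand-in similarity, not Jaro-Winkler


def similar_pairs(titles, max_len_diff=3, threshold=0.8):
    """Flag probable duplicates while skipping hopeless comparisons.

    Sorting by length puts candidates of similar length next to each
    other, so once a partner is more than max_len_diff characters longer
    we can stop scanning (Gustav's blocking idea). Iterating only over
    later elements visits each unordered pair once, halving the work and
    avoiding self-comparisons.
    """
    ordered = sorted(titles, key=len)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if len(b) - len(a) > max_len_diff:
                break  # list is length-sorted: all later titles are longer still
            score = SequenceMatcher(None, a, b).ratio()
            if score > threshold:
                pairs.append((a, b, score))
    return pairs


hits = similar_pairs([
    "Cat in the Hat",
    "Hat in the Cat",
    "Composite Science and Technology",
    "Composites Science and Technology",
])
for a, b, score in hits:
    print(f"{a!r} ~ {b!r}: {score:.2f}")
```

The same structure would apply in Kevin's VBA: sort tblCitationTitles by Len(CitationTitle) and exit the inner loop once the length gap exceeds the allowed difference, which is where Gustav's estimated 80% saving comes from.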