Kevin Waddle
thewaddles at sbcglobal.net
Sun Sep 14 02:08:07 CDT 2008
Hello,
I have a table with 18,000 records. The records are titles stored as a
string.
I am trying compare each record to all of the other records in the table
using a Jaro-Winkler Distance function to find similar / potential duplicate
records.
So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
'Composite Science and Technology' and 'Composites Science and Technology'
would flag as similar.
I have tried looping through the table twice:
Set dbs = CurrentDb
Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
rst2.MoveFirst
rst1.MoveFirst
Do Until rst1.EOF
If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) >
0.8 Then
'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
DoEvents
rst2.MoveNext
Loop
rst1.MoveNext
Loop
I have tried wrapping the Table into a Class Module and Collection and
looping through the collection twice:
Set dbs = CurrentDb
Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
rst1.MoveFirst
Do Until rst1.EOF
Set cls = New clsCitation
Set cls2 = New clsCitation2
cls.CitID = rst1![CitationTitleID]
cls.Citation = rst1![CitationTitle]
cls.RecordCount = cnt
Citation.Add cls
cls2.CitID2 = rst1![CitationTitleID]
cls2.Citation2 = rst1![CitationTitle]
cls2.RecordCount2 = cnt
Citation2.Add cls2
rst1.MoveNext
Loop
For Each cls In Citation
For Each cls2 In Citation2
If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE"
DoEvents
Next
Next
Essentially, I end up looping through the entire table 18,000 times!
No matter what I do it is taking from 45 to 60 seconds to make one pass
through the table.
This means that to complete the entire process would take about 12 days, 24
hours a day!
Can someone point me to a faster method for applying fuzzy deduping?
Thanks,
Kevin