Max Wanadoo
max.wanadoo at gmail.com
Sun Sep 14 06:33:03 CDT 2008
Would SOUNDEX be of any use? Probably not, but have a google at it.

Max

On Sun, Sep 14, 2008 at 9:25 AM, Gustav Brock <Gustav at cactus.dk> wrote:

> Hi Kevin
>
> Given the explanation here on the Jaro-Winkler Distance:
>
> http://en.wikipedia.org/wiki/Jaro-Winkler
>
> you ultimately will have to make 18000^2 comparisons.
> You could perhaps gain some speed by moving the logic into the query, but
> the implementation would not be easy.
>
> So I believe you are left with some brute-force methods to decrease the
> count of comparisons. One I could think of is to compare only titles of
> similar length. A title of length, say, 20 will not match one of length
> 40, while your two examples have either an exact match of length or a
> difference of one.
>
> The accepted difference in length could be found dynamically from the
> minimum Jaro score to accept. With a difference of 1 (the example with
> dwayne and duane from the link above), the Jaro score is above 0.8, thus
> you would probably need to accept a difference of 2 or 3. My guess is
> that for a score of 0.9 a difference of only 1 could be accepted, while
> for a score of 0.7 the difference could be 4 or 5. A few experiments
> with your function could reveal that.
>
> With this simple method you could probably cut the processing time by
> 80% by decreasing the count of iterations from 18000 * 18000 to perhaps
> only 18000 * 3000.
>
> /gustav
>
> >>> thewaddles at sbcglobal.net 14-09-2008 09:08 >>>
> Hello,
>
> I have a table with 18,000 records. The records are titles stored as
> strings.
>
> I am trying to compare each record to all of the other records in the
> table using a Jaro-Winkler Distance function to find similar / potential
> duplicate records.
>
> So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar, and
> 'Composite Science and Technology' and 'Composites Science and
> Technology' would flag as similar.
>
> I have tried looping through the table twice:
>
>     Set dbs = CurrentDb
>     Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
>     Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
>     rst1.MoveFirst
>     Do Until rst1.EOF
>         rst2.MoveFirst
>         Do Until rst2.EOF
>             If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) > 0.8 Then
>                 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
>             End If
>             DoEvents
>             rst2.MoveNext
>         Loop
>         rst1.MoveNext
>     Loop
>
> I have tried wrapping the table into a class module and collection and
> looping through the collection twice:
>
>     Set dbs = CurrentDb
>     Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
>     rst1.MoveFirst
>     Do Until rst1.EOF
>         Set cls = New clsCitation
>         Set cls2 = New clsCitation2
>         cls.CitID = rst1![CitationTitleID]
>         cls.Citation = rst1![CitationTitle]
>         cls.RecordCount = cnt
>         Citation.Add cls
>
>         cls2.CitID2 = rst1![CitationTitleID]
>         cls2.Citation2 = rst1![CitationTitle]
>         cls2.RecordCount2 = cnt
>         Citation2.Add cls2
>         rst1.MoveNext
>     Loop
>
>     For Each cls In Citation
>         For Each cls2 In Citation2
>             If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
>                 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
>             End If
>             DoEvents
>         Next
>     Next
>
> Essentially, I end up looping through the entire table 18,000 times!
> No matter what I do, it takes 45 to 60 seconds to make one pass through
> the table.
> This means that completing the entire process would take about 12 days,
> running 24 hours a day!
>
> Can someone point me to a faster method for applying fuzzy deduping?
>
> Thanks,
> Kevin
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
>
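[Editor's note: the Jaro and Jaro-Winkler scores discussed in the thread can be sketched as follows. This is a minimal Python sketch of the standard definitions from the linked Wikipedia article, not Kevin's actual JaroDistance function; the function names are illustrative.]

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: characters match if equal and within a window of
    floor(max_len / 2) - 1 positions; transpositions are matched characters
    in a different order, counted in halves."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: walk the matched characters of both strings
    # in order and count positions where they disagree, then halve.
    t = 0
    k = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler variant: boost the Jaro score by the length of the common
    prefix (capped at 4), scaled by p (conventionally 0.1)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# The thread's example: "dwayne" vs "duane" scores above 0.8 on Jaro alone.
print(round(jaro("dwayne", "duane"), 3))          # ≈ 0.822
print(round(jaro_winkler("dwayne", "duane"), 3))  # ≈ 0.840
```

This matches Gustav's observation that a one-character length difference can still score above the 0.8 threshold Kevin is using.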
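[Editor's note: Gustav's length-blocking idea, combined with visiting each unordered pair only once, can be sketched as below. The sketch uses difflib's SequenceMatcher ratio as a stand-in scorer (it is not Jaro-Winkler); `similar_pairs`, `max_len_diff`, and the 0.8 threshold are illustrative assumptions.]

```python
from difflib import SequenceMatcher  # stand-in similarity, not Jaro-Winkler


def similar_pairs(titles, max_len_diff=3, threshold=0.8):
    """Flag probable duplicates while skipping hopeless comparisons.

    Sorting by length puts candidates of similar length next to each
    other, so once a partner is more than max_len_diff characters longer
    we can stop scanning (Gustav's blocking idea). Iterating only over
    later elements visits each unordered pair once, halving the work and
    avoiding self-comparisons.
    """
    ordered = sorted(titles, key=len)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if len(b) - len(a) > max_len_diff:
                break  # list is length-sorted: all later titles are longer still
            score = SequenceMatcher(None, a, b).ratio()
            if score > threshold:
                pairs.append((a, b, score))
    return pairs


hits = similar_pairs([
    "Cat in the Hat",
    "Hat in the Cat",
    "Composite Science and Technology",
    "Composites Science and Technology",
])
for a, b, score in hits:
    print(f"{a!r} ~ {b!r}: {score:.2f}")
```

The same structure would apply in Kevin's VBA: sort tblCitationTitles by Len(CitationTitle) and exit the inner loop once the length gap exceeds the allowed difference, which is where Gustav's estimated 80% saving comes from.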