Steve Goodhall
steve at goodhall.info
Sun Sep 14 10:38:48 CDT 2008
My suggestion would be to do this in memory instead of comparing recordsets. To summarize: Load the table into an array. At a minimum you will need the title and the unique row ID. A table of types might be a good idea. Run through the table checking each entry against all subsequent entries in the table (avoids checking each pair twice). Write a table with the unique row ID's of each matching row. Use that table to drive subsequent processing. Intuitively, I think this should be much faster because it will not go through the database engine, but to be fair I have not subjected it to a formal test. Regards, Steve Goodhall, PMP 248-505-5204 mobile -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Kevin Waddle Sent: Sunday, September 14, 2008 3:08 AM To: accessd at databaseadvisors.com; ACCESS-L at PEACH.EASE.LSOFT.COM Subject: [AccessD] Speed Problems when looping through records or theterrors of deduping Hello, I have a table with 18,000 records. The records are titles stored as a string. I am trying compare each record to all of the other records in the table using a Jaro-Winkler Distance function to find similar / potential duplicate records. So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and 'Composite Science and Technology' and 'Composites Science and Technology' would flag as similar. I have tried looping through the table twice: Set dbs = CurrentDb Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset) Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset) rst2.MoveFirst rst1.MoveFirst Do Until rst1.EOF If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) > 0.8 Then 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE DoEvents rst2.MoveNext Loop rst1.MoveNext Loop I have tried wrapping the Table into a Class Module and Collection and looping through the collection twice: Set dbs = CurrentDb Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset) rst1.MoveFirst Do Until rst1.EOF Set cls = New clsCitation Set cls2 = New clsCitation2 cls.CitID = rst1![CitationTitleID] cls.Citation = rst1![CitationTitle] cls.RecordCount = cnt Citation.Add cls cls2.CitID2 = rst1![CitationTitleID] cls2.Citation2 = rst1![CitationTitle] cls2.RecordCount2 = cnt Citation2.Add cls2 rst1.MoveNext Loop For Each cls In Citation For Each cls2 In Citation2 If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE" DoEvents Next Next Essentially, I end up looping through the entire table 18,000 times! No matter what I do it is taking from 45 to 60 seconds to make one pass through the table. This means that to complete the entire process would take about 12 days, 24 hours a day! Can someone point me to a faster method for applying fuzzy deduping? Thanks, Kevin -- AccessD mailing list AccessD at databaseadvisors.com http://databaseadvisors.com/mailman/listinfo/accessd Website: http://www.databaseadvisors.com