Dan Waters
dwaters at usinternet.com
Sun Sep 14 11:08:52 CDT 2008
Kevin,

You don't need to make all 324M comparisons. Suppose you had 10 records to look for duplicates. Visually, if you have a 10 x 10 grid, you can remove the squares where record A is compared against record A. You can also remove the squares where you're checking record B against record A if you've already checked record A against record B. So starting with 100 comparisons, we can remove the 10 on the diagonal, then remove half of what's left, ending up with 45. In general that's (n^2 - n) / 2 comparisons, so for your 18,000 records it works out to (324M - 18K) / 2, or about 162M comparisons - just under half of the full grid.

The way to do this is, during your loop, to use a sequential record ID and compare your starting record only against records that have a larger sequential record ID. I'm guessing you still have a lot of hours to run, but it's less than checking the full 324M.
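Something like this rough, untested sketch. It assumes an autonumber key (CitationTitleID, as in your code below), and tblSimilarTitles with its TitleID1/TitleID2 columns is a made-up name for the output pairs table, so substitute your own:

    Sub FindLikelyDuplicates()
        Dim dbs As DAO.Database
        Dim rst As DAO.Recordset
        Dim ids() As Long, titles() As String
        Dim n As Long, i As Long, j As Long

        Set dbs = CurrentDb
        Set rst = dbs.OpenRecordset( _
            "SELECT CitationTitleID, CitationTitle FROM tblCitationTitles " & _
            "ORDER BY CitationTitleID", dbOpenSnapshot)

        ' Load everything into arrays once, so the comparison loops
        ' never have to walk a recordset.
        rst.MoveLast
        n = rst.RecordCount
        ReDim ids(1 To n)
        ReDim titles(1 To n)
        rst.MoveFirst
        For i = 1 To n
            ids(i) = rst![CitationTitleID]
            titles(i) = rst![CitationTitle] & ""  ' & "" guards against Nulls
            rst.MoveNext
        Next i
        rst.Close

        ' Upper triangle only: j always starts at i + 1, so each pair
        ' is compared exactly once and A-vs-A never happens.
        For i = 1 To n - 1
            For j = i + 1 To n
                If JaroDistance(titles(i), titles(j)) > 0.8 Then
                    ' tblSimilarTitles (TitleID1, TitleID2) is a made-up
                    ' pairs table - point this at your own.
                    dbs.Execute "INSERT INTO tblSimilarTitles " & _
                        "(TitleID1, TitleID2) VALUES (" & ids(i) & _
                        ", " & ids(j) & ")", dbFailOnError
                End If
            Next j
            DoEvents
        Next i
    End Sub

Halving the comparisons only halves the time, but pulling the titles into arrays should also take most of the recordset overhead out of each pass.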
Good Luck!
Dan

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Stephen
Sent: Sunday, September 14, 2008 3:09 AM
To: Access Developers discussion and problem solving
Subject: Re: [AccessD] Speed Problems when looping through records or the
terrors of deduping

Kevin,

What if you built a query with the table occurring 2 times, with the Title field from each of the tables as output, then as the third output field, a call to your JaroDistance function with the 2 fields, and a criterion on this field of > 0.8??

I can see immediate problems with this approach, though. A cartesian query on this data will produce a huge potential recordset (18K * 18K = 324M if they all met the criterion).

As a test, I just cobbled together a 500-record table and did the above (my JaroDistance function simply returned a random number between 0 and 1 using the Rnd function). Of the potential 250K-record output, about 50K rows met the criterion, and the query took about 65 seconds on my *slow* development computer. I know, I know, 324M is about 1300x bigger than 250K, but at the above figures, 23-24 hours should do it.

On second thought, there has to be a better way ... someone else??

FWIW,
Stephen Bond

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Kevin Waddle
Sent: Sunday, 14 September 2008 7:16 p.m.
To: Stephen
Subject: [AccessD] Speed Problems when looping through records or the
terrors of deduping

Hello,

I have a table with 18,000 records. The records are titles stored as a string. I am trying to compare each record to all of the other records in the table using a Jaro-Winkler Distance function to find similar / potential duplicate records. So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar, and 'Composite Science and Technology' and 'Composites Science and Technology' would flag as similar.

I have tried looping through the table twice:

    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
    Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
    rst1.MoveFirst
    Do Until rst1.EOF
        rst2.MoveFirst
        Do Until rst2.EOF
            If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) > 0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
            End If
            DoEvents
            rst2.MoveNext
        Loop
        rst1.MoveNext
    Loop

I have tried wrapping the table into a class module and collection and looping through the collections twice:

    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
    rst1.MoveFirst
    Do Until rst1.EOF
        Set cls = New clsCitation
        Set cls2 = New clsCitation2
        cls.CitID = rst1![CitationTitleID]
        cls.Citation = rst1![CitationTitle]
        cls.RecordCount = cnt
        Citation.Add cls
        cls2.CitID2 = rst1![CitationTitleID]
        cls2.Citation2 = rst1![CitationTitle]
        cls2.RecordCount2 = cnt
        Citation2.Add cls2
        rst1.MoveNext
    Loop

    For Each cls In Citation
        For Each cls2 In Citation2
            If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
            End If
            DoEvents
        Next
    Next

Essentially, I end up looping through the entire table 18,000 times! No matter what I do, it is taking from 45 to 60 seconds to make one pass through the table. This means that completing the entire process would take about 12 days, running 24 hours a day!

Can someone point me to a faster method for applying fuzzy deduping?

Thanks,
Kevin

--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com