[AccessD] Speed Problems when looping through records orthe terrors of deduping

A.D.Tejpal adtp at airtelmail.in
Mon Sep 15 10:02:04 CDT 2008


Kevin,

    Sample subroutine P_MatchAllRecords(), as given below, is expected to provide significant improvement in speed. Higher the incidence of matching values, more pronounced will be the gain in speed.

    Basically it involves one cycle of stepping through only a portion of total records. This portion gets smaller and smaller as more matches are found.

    For sample demonstration, T_A is the name of table having fields ID (autonumber P.K.), Title (text type) and Matched (number type). To start with, ID and Title of first record are stored in variables as master ID and master Title. An update query, stores the value of this master ID in field named Matched, for all records where Title matches the master Title, using function Fn_MatchTwoValues(). You can substitute your own function here. The scope of update query is additionally restricted to only those records where field Matched is blank and difference of lengths for Title and master Title is <= 2. You can change this value of 2 to whatever is felt appropriate.

    After the above update action, the recordset gets re-set to balance records where field Matched is still blank. ID and Title for first record of this new recordset become the fresh master values and the update query gets executed for all matching records as in previous para. Thus with every new cycle, the recordset continues to shrink, ignoring all records that stand matched.

    Finally, the results arranged in groups of matched records can be viewed via sample query named Q_MatchedRecords, also given below.

    You might like to try it out and confirm the outcome.

    Note - In the worst case scenario, if there is no matched record at all, there will be one cycle of stepping through all the records, with one execution of update query per record.

Best wishes,
A.D. Tejpal
------------

' Sample code in general module
'================================
Sub P_MatchAllRecords()
    ' Table T_A has fields ID (autonumber P.K.),
    ' Title (text type) and Matched (number type)
    ' Compares the contents of Title on each
    ' record with all other records for a match
    ' (satisfying function Fn_MatchTwoValues())
    ' Field named Matched gets updated with ID
    ' of master matching record.
    Dim db As DAO.Database
    Dim rst As DAO.Recordset
    Dim Rtv As Variant, Ln As Long, Idn As Long
    
    Set db = DBEngine(0)(0)
    
    ' Clear Matched field for all records
    db.Execute "UPDATE T_A SET " & _
                "Matched = Null;", dbFailOnError
    
    Do While True
        ' Set rst to cover only those records where
        ' Matched field is still blank
        Set rst = db.OpenRecordset("SELECT * " & _
                "FROM T_A WHERE " & _
                "Title Is Not Null And Matched Is Null")
        If rst.EOF Then
            Exit Do
        End If
        
        ' Get values for first record
        With rst
            Rtv = .Fields("Title")
            Ln = Len(Rtv)
            Idn = .Fields("ID")
        End With
        
        ' Update Matched field to Idn, only where it
        ' is blank, length differential of Title vs Rtv
        ' is within 2, and Fn_MatchTwoValues()
        ' comparing Title with Rtv returns True.
        db.Execute "UPDATE T_A SET " & _
                "Matched = " & Idn & _
                " WHERE Matched Is Null " & _
                "And Abs(Len(Title) - " & Ln & _
                ") <= 2 And " & _
                "Fn_MatchTwoValues(Title, '" & _
                Rtv & "') = True;", dbFailOnError
        
        ' Note:
        ' The statement above, treats 2 as max alowed
        ' difference in length, beyond which there is no
        ' need to test for Fn_MatchTwoValues().
        ' Depending upon the actual requirements,
        ' this value of 2 can be revised as desired.
    Loop
    
    rst.Close
    Set rst = Nothing
    Set db = Nothing
End Sub
'===================================

Q_MatchedRecords
(Sample query showing matched records)
'===================================
SELECT T_A.*
FROM T_A
ORDER BY T_A.Matched, T_A.ID;
'===================================

  -----Original Message-----
  From: accessd-bounces at databaseadvisors.com
  [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Kevin Waddle
  Sent: Sunday, 14 September 2008 7:16 p.m.
  To: Stephen
  Subject: [AccessD] Speed Problems when looping through records or the
  terrors of deduping

  Hello,
   
  I have a table with 18,000 records.  The records are titles stored as a
  string.
   
  I am trying compare each record to all of the other records in the table
  using a Jaro-Winkler Distance function to find similar / potential
  duplicate records.
   
  So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
  'Composite Science and Technology' and 'Composites Science and
  Technology'
  would flag as similar.
   
  I have tried looping through the table twice:
  <<>Snipped> 
   
  Essentially, I end up looping through the entire table 18,000 times!
  No matter what I do it is taking from 45 to 60 seconds to make one pass
  through the table.  
  This means that to complete the entire process would take about 12 days,
  24 hours a day!
   
  Can someone point me to a faster method for applying fuzzy deduping?
   
  Thanks,
  Kevin


More information about the AccessD mailing list