[AccessD] Speed Problems when looping through recordsorthe terrors of deduping

A.D.Tejpal adtp at airtelmail.in
Mon Sep 15 11:37:26 CDT 2008


    As implied in my previous post, the value of 2 as max permissible difference in length, so as to qualify as a candidate for comparison, is arbitrary. Kevin has to decide upon the most suitable safe value.

A.D. Tejpal
------------

  ----- Original Message ----- 
  From: Mark A Matte 
  To: Access Developers discussion and problem solving 
  Sent: Monday, September 15, 2008 21:21
  Subject: Re: [AccessD] Speed Problems when looping through recordsorthe terrors of deduping

  I think you may eliminate some matches with this approach...if he runs the code that he provided...and uses this technique to limit the number of times he calls the function...matches wil be missed.  with his code I found some matches with at least a difference of 4 in length.  

  For example:JaroDistance([bo_name],"matte,mark")
  RESULTS:
  bo_name  : Score  : Length
  Matte,Mark : 1 : 10
  mathews,marc  :  0.804  : 12

  This would eliminate both records from further comparisons.  So the next set would be ommitted

  EX:  JaroDistance([bo_name],"mathews,marc")
  RESULTS:
  bo_name  : Score  : Length
  Matte,Mark  : 0.804  :  10
  Matthews,Frances  :  0.825  :  16
  pathews,marcus  : 0.8927  : 14
  mathews,marc  : 1  : 12

  Frances Mathews and Marcus Pathews  would not be included due to Marc Mathews being matched in the first run.

  Good Luck,

  Mark A. Matte

  PS...Also, in looking at the original function provided...there possibly may be some unnecessary looping...and the function CleanString...seems to be doing some replacements twice?...example..."&" is replaced with "and" then " and " with " "...then " " with ""......the same with a few others to " "...then at the end the " " to "".  Not sure if it would save any time...but seems like it might be redundancy



  > From: adtp at airtelmail.in
  > To: accessd at databaseadvisors.com
  > Date: Mon, 15 Sep 2008 20:32:04 +0530
  > CC: adtejpal at gmail.com
  > Subject: Re: [AccessD] Speed Problems when looping through records orthe terrors of deduping
  >
  > Kevin,
  >
  > Sample subroutine P_MatchAllRecords(), as given below, is expected to provide significant improvement in speed. Higher the incidence of matching values, more pronounced will be the gain in speed.
  >
  > Basically it involves one cycle of stepping through only a portion of total records. This portion gets smaller and smaller as more matches are found.
  >
  > For sample demonstration, T_A is the name of table having fields ID (autonumber P.K.), Title (text type) and Matched (number type). To start with, ID and Title of first record are stored in variables as master ID and master Title. An update query, stores the value of this master ID in field named Matched, for all records where Title matches the master Title, using function Fn_MatchTwoValues(). You can substitute your own function here. The scope of update query is additionally restricted to only those records where field Matched is blank and difference of lengths for Title and master Title is <= 2. You can change this value of 2 to whatever is felt appropriate.
  >
  > After the above update action, the recordset gets re-set to balance records where field Matched is still blank. ID and Title for first record of this new recordset become the fresh master values and the update query gets executed for all matching records as in previous para. Thus with every new cycle, the recordset continues to shrink, ignoring all records that stand matched.
  >
  > Finally, the results arranged in groups of matched records can be viewed via sample query named Q_MatchedRecords, also given below.
  >
  > You might like to try it out and confirm the outcome.
  >
  > Note - In the worst case scenario, if there is no matched record at all, there will be one cycle of stepping through all the records, with one execution of update query per record.
  >
  > Best wishes,
  > A.D. Tejpal
  > ------------
  >
  > ' Sample code in general module
  > '================================
  > Sub P_MatchAllRecords()
  > ' Table T_A has fields ID (autonumber P.K.),
  > ' Title (text type) and Matched (number type)
  > ' Compares the contents of Title on each
  > ' record with all other records for a match
  > ' (satisfying function Fn_MatchTwoValues())
  > ' Field named Matched gets updated with ID
  > ' of master matching record.
  > Dim db As DAO.Database
  > Dim rst As DAO.Recordset
  > Dim Rtv As Variant, Ln As Long, Idn As Long
  >
  > Set db = DBEngine(0)(0)
  >
  > ' Clear Matched field for all records
  > db.Execute "UPDATE T_A SET " & _
  > "Matched = Null;", dbFailOnError
  >
  > Do While True
  > ' Set rst to cover only those records where
  > ' Matched field is still blank
  > Set rst = db.OpenRecordset("SELECT * " & _
  > "FROM T_A WHERE " & _
  > "Title Is Not Null And Matched Is Null")
  > If rst.EOF Then
  > Exit Do
  > End If
  >
  > ' Get values for first record
  > With rst
  > Rtv = .Fields("Title")
  > Ln = Len(Rtv)
  > Idn = .Fields("ID")
  > End With
  >
  > ' Update Matched field to Idn, only where it
  > ' is blank, length differential of Title vs Rtv
  > ' is within 2, and Fn_MatchTwoValues()
  > ' comparing Title with Rtv returns True.
  > db.Execute "UPDATE T_A SET " & _
  > "Matched = " & Idn & _
  > " WHERE Matched Is Null " & _
  > "And Abs(Len(Title) - " & Ln & _
  > ") <= 2 And " & _
  > "Fn_MatchTwoValues(Title, '" & _
  > Rtv & "') = True;", dbFailOnError
  >
  > ' Note:
  > ' The statement above, treats 2 as max alowed
  > ' difference in length, beyond which there is no
  > ' need to test for Fn_MatchTwoValues().
  > ' Depending upon the actual requirements,
  > ' this value of 2 can be revised as desired.
  > Loop
  >
  > rst.Close
  > Set rst = Nothing
  > Set db = Nothing
  > End Sub
  > '===================================
  >
  > Q_MatchedRecords
  > (Sample query showing matched records)
  > '===================================
  > SELECT T_A.*
  > FROM T_A
  > ORDER BY T_A.Matched, T_A.ID;
  > '===================================


More information about the AccessD mailing list