[AccessD] Speed Problems when looping through records orthe terrors of deduping

Mark A Matte markamatte at hotmail.com
Mon Sep 15 10:51:25 CDT 2008


I think you may eliminate some matches with this approach...if he runs the code that he provided...and uses this technique to limit the number of times he calls the function...matches wil be missed.  with his code I found some matches with at least a difference of 4 in length.  

For example:JaroDistance([bo_name],"matte,mark")
RESULTS:
bo_name  : Score  : Length
Matte,Mark : 1 : 10
mathews,marc  :  0.804  : 12

This would eliminate both records from further comparisons.  So the next set would be ommitted

EX:  JaroDistance([bo_name],"mathews,marc")
RESULTS:
bo_name  : Score  : Length
Matte,Mark  : 0.804  :  10
Matthews,Frances  :  0.825  :  16
pathews,marcus  : 0.8927  : 14
mathews,marc  : 1  : 12

Frances Mathews and Marcus Pathews  would not be included due to Marc Mathews being matched in the first run.

Good Luck,

Mark A. Matte

PS...Also, in looking at the original function provided...there possibly may be some unnecessary looping...and the function CleanString...seems to be doing some replacements twice?...example..."&" is replaced with "and" then " and " with " "...then " " with ""......the same with a few others to " "...then at the end the " " to "".  Not sure if it would save any time...but seems like it might be redundancy



> From: adtp at airtelmail.in
> To: accessd at databaseadvisors.com
> Date: Mon, 15 Sep 2008 20:32:04 +0530
> CC: adtejpal at gmail.com
> Subject: Re: [AccessD] Speed Problems when looping through records orthe terrors of deduping
>
> Kevin,
>
> Sample subroutine P_MatchAllRecords(), as given below, is expected to provide significant improvement in speed. Higher the incidence of matching values, more pronounced will be the gain in speed.
>
> Basically it involves one cycle of stepping through only a portion of total records. This portion gets smaller and smaller as more matches are found.
>
> For sample demonstration, T_A is the name of table having fields ID (autonumber P.K.), Title (text type) and Matched (number type). To start with, ID and Title of first record are stored in variables as master ID and master Title. An update query, stores the value of this master ID in field named Matched, for all records where Title matches the master Title, using function Fn_MatchTwoValues(). You can substitute your own function here. The scope of update query is additionally restricted to only those records where field Matched is blank and difference of lengths for Title and master Title is <= 2. You can change this value of 2 to whatever is felt appropriate.
>
> After the above update action, the recordset gets re-set to balance records where field Matched is still blank. ID and Title for first record of this new recordset become the fresh master values and the update query gets executed for all matching records as in previous para. Thus with every new cycle, the recordset continues to shrink, ignoring all records that stand matched.
>
> Finally, the results arranged in groups of matched records can be viewed via sample query named Q_MatchedRecords, also given below.
>
> You might like to try it out and confirm the outcome.
>
> Note - In the worst case scenario, if there is no matched record at all, there will be one cycle of stepping through all the records, with one execution of update query per record.
>
> Best wishes,
> A.D. Tejpal
> ------------
>
> ' Sample code in general module
> '================================
> Sub P_MatchAllRecords()
> ' Table T_A has fields ID (autonumber P.K.),
> ' Title (text type) and Matched (number type)
> ' Compares the contents of Title on each
> ' record with all other records for a match
> ' (satisfying function Fn_MatchTwoValues())
> ' Field named Matched gets updated with ID
> ' of master matching record.
> Dim db As DAO.Database
> Dim rst As DAO.Recordset
> Dim Rtv As Variant, Ln As Long, Idn As Long
>
> Set db = DBEngine(0)(0)
>
> ' Clear Matched field for all records
> db.Execute "UPDATE T_A SET " & _
> "Matched = Null;", dbFailOnError
>
> Do While True
> ' Set rst to cover only those records where
> ' Matched field is still blank
> Set rst = db.OpenRecordset("SELECT * " & _
> "FROM T_A WHERE " & _
> "Title Is Not Null And Matched Is Null")
> If rst.EOF Then
> Exit Do
> End If
>
> ' Get values for first record
> With rst
> Rtv = .Fields("Title")
> Ln = Len(Rtv)
> Idn = .Fields("ID")
> End With
>
> ' Update Matched field to Idn, only where it
> ' is blank, length differential of Title vs Rtv
> ' is within 2, and Fn_MatchTwoValues()
> ' comparing Title with Rtv returns True.
> db.Execute "UPDATE T_A SET " & _
> "Matched = " & Idn & _
> " WHERE Matched Is Null " & _
> "And Abs(Len(Title) - " & Ln & _
> ") <= 2 And " & _
> "Fn_MatchTwoValues(Title, '" & _
> Rtv & "') = True;", dbFailOnError
>
> ' Note:
> ' The statement above, treats 2 as max alowed
> ' difference in length, beyond which there is no
> ' need to test for Fn_MatchTwoValues().
> ' Depending upon the actual requirements,
> ' this value of 2 can be revised as desired.
> Loop
>
> rst.Close
> Set rst = Nothing
> Set db = Nothing
> End Sub
> '===================================
>
> Q_MatchedRecords
> (Sample query showing matched records)
> '===================================
> SELECT T_A.*
> FROM T_A
> ORDER BY T_A.Matched, T_A.ID;
> '===================================
>
> -----Original Message-----
> From: accessd-bounces at databaseadvisors.com
> [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Kevin Waddle
> Sent: Sunday, 14 September 2008 7:16 p.m.
> To: Stephen
> Subject: [AccessD] Speed Problems when looping through records or the
> terrors of deduping
>
> Hello,
>
> I have a table with 18,000 records. The records are titles stored as a
> string.
>
> I am trying compare each record to all of the other records in the table
> using a Jaro-Winkler Distance function to find similar / potential
> duplicate records.
>
> So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
> 'Composite Science and Technology' and 'Composites Science and
> Technology'
> would flag as similar.
>
> I have tried looping through the table twice:
> <<>Snipped>
>
> Essentially, I end up looping through the entire table 18,000 times!
> No matter what I do it is taking from 45 to 60 seconds to make one pass
> through the table.
> This means that to complete the entire process would take about 12 days,
> 24 hours a day!
>
> Can someone point me to a faster method for applying fuzzy deduping?
>
> Thanks,
> Kevin
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com

_________________________________________________________________
Stay up to date on your PC, the Web, and your mobile phone with Windows Live.
http://clk.atdmt.com/MRT/go/msnnkwxp1020093185mrt/direct/01/



More information about the AccessD mailing list