[AccessD] Speed Problems when looping through records or the terrors of deduping

Mark A Matte markamatte at hotmail.com
Mon Sep 15 09:04:25 CDT 2008



I ran a small test with about 2700 records...I called the function from a query and compared all names to my name...( JaroDistance([bo_name],"Mark Matte"))...for testing...this took out the cartersian join, or any join that may slow this down....and againts 2700 records it took 7-8 seconds.  

I believe Rocky hit on it earlier...it does not appear that your time is lost in getting the records, or joining them...its running them through the function.  

Your test:  18000 records= 45 seconds per
My test:  2700 records= 7 second per
Rationalization: 7*2700records=18900...7*7seconds=49

...its the same amount of time...bottom line is it doesn't matter, at this point, how you pull the records together to compare them...it just takes that long to compare.

If the speed is to be improved, I agree the 2 best approaches would be 1...work on the speed of the function...not the recordset/query.  2...call the function less times.

Good Luck,

Mark A. Matte


> Date: Mon, 15 Sep 2008 08:37:15 +0200
> From: Gustav at cactus.dk
> To: accessd at databaseadvisors.com
> Subject: Re: [AccessD] Speed Problems when looping through records or the terrors of deduping
>
> Hi Kevin
>
> .. So the following records would be considered duplicates:
> 'European Journal of Social Psychology' = 'Social Psychology'
> 'Applied Developmental Psychology' = 'Journal of Applied Developmental Psychology'
>
> But these comparison return a Jaro distance of 0.53 and 0.76, not fulfilling your requirement of 0.8 as the minimum.
> So how should these match?
>
> That said, such examples could be found much faster by a simple string comparison using InStr. This will return a positive value when the smaller string is found in the larger.
>
> /gustav
>
>>>> thewaddles at sbcglobal.net 15-09-2008 08:07>>>
> All,
> Thank you very much for your suggestions.
>
> I have spent way more time on this than I can recoup with the bid I placed.
> Now it has become more of an obsession than a project!
>
> I still haven't found a solution that really works, however I would like to
> try to answer all those that took the time to reply.
>
> Thank you,
> Kevin
>
> Dan suggested only looping through records with higher It's, so Record 1
> would be checked against records 2 to 18,000, Record 17,998 would be checked
> against 17,999 to 18,000.
> - Great idea. I have implemented this, but it's still not fast enough.
>
> Emilia and Steve suggested doing the comparison in memory rather than
> looping.
> - I created a Class Module (clsCitation) containing the TitleID and Title
> and a collection (Citation) to store the class:
> Do Until rst1.EOF
> Set cls = New clsCitation
> If rst1![Checked] = False Then
> cls.CitID = rst1![CitationTitleID]
> cls.Citation = rst1![CitationTitle]
> cls.RecordCount = cnt
> Citation.Add cls
> End If
> rst1.MoveNext
> Loop
> Set Citation2 = Citation
> - I then looped through the two collections. Still took about 45 seconds per loop
>
> Rocky asked if most of the time was in the function.
> - I took the function and resulting SQL commands out and the time
> for each cycle was about the same.
>
> Gustav suggested cutting down on the number of comparisons based on string length.
> - Unfortunately, there appears to have been no normalization in the table structure.
> So the following records would be considered duplicates:
> 'European Journal of Social Psychology' = 'Social Psychology'
> 'Applied Developmental Psychology' = 'Journal of Applied Developmental Psychology'
>
> Stephen suggested a Cartesian query and then hit immediately on the problem with that route.
> My problem is I get a "Not Enough Space on Temp Disk" error before the query finishes.
>
> David asked for the functions:
> The Jaro Function uses two functions: JaroDistance and CleanString. I have
> also included a Levenshtein Distance function. The Levenshtein distance
> between two strings is the minimum number of operations needed to transform
> one string into the other, where an operation is an insertion, deletion, or
> substitution of a single character.
>
> '*** Code Start ***
> Function JaroDistance(ByVal str1 As String, ByVal str2 As String) As Double
> Dim Len1 As Integer
> Dim Len2 As Integer
> Dim LenMin As Integer
> Dim LenMax As Integer
> Dim m As Integer
> Dim i As Integer
> Dim j As Integer
> Dim Common As Integer
> Dim tr As Double
> Dim a1 As String
> Dim a2 As String
> Dim Aux As String
> Dim AuxStr As String
> Dim f As Integer
> Dim l As Integer
> Dim wcd As Double
> Dim wrd As Double
> Dim wtr As Double
> Dim f1() As Boolean
> Dim f2() As Boolean
> str1 = CleanString(str1)
> str2 = CleanString(str2)
> Len1 = Len(str1)
> Len2 = Len(str2)
> If Len1> Len2 Then
> Aux = Len2
> Len2 = Len1
> Len1 = Aux
> AuxStr = str1
> str1 = str2
> str2 = AuxStr
> End If
> LenMin = Len1
> LenMax = Len2
> ReDim f1(Len1), f2(Len2)
> For i = 1 To Len1
> f1(i) = False
> Next i
> For j = 1 To Len2
> f2(j) = False
> Next j
> m = Int((LenMax / 2) - 1)
> Common = 0
> tr = 0
> For i = 1 To Len1
> a1 = Mid(str1, i, 1)
> If m>= i Then
> f = 1
> l = i + m
> Else
> f = i - m
> l = i + m
> End If
> If l> LenMax Then
> l = LenMax
> End If
> For j = f To l
> a2 = Mid(str2, j, 1)
> If (a2 = a1) And (f2(j) = False) Then
> Common = Common + 1
> f1(i) = True
> f2(j) = True
> GoTo linea_exit
> End If
> Next j
> linea_exit:
> Next i
> l = 1
> For i = 1 To Len1
> If f1(i) Then
> For j = l To Len2
> If f2(j) Then
> l = j + 1
> a1 = Mid(str1, i, 1)
> a2 = Mid(str2, j, 1)
> If a1 <> a2 Then
> tr = tr + 0.5
> End If
> Exit For
> End If
> Next j
> End If
> Next i
> wcd = 1 / 3
> wrd = 1 / 3
> wtr = 1 / 3
> If Common <> 0 Then
> JaroDistance = wcd * Common / Len1 + wrd * Common / Len2 + wtr * (Common - tr) / Common
> Else
> JaroDistance = 0
> End If
> End Function
> Function CleanString(str1 As String) As String
> str1 = Replace(str1, ".", "")
> str1 = Replace(str1, ",", "")
> str1 = Replace(str1, "-", "")
> str1 = Replace(str1, ";", "")
> str1 = Replace(str1, ":", "")
> str1 = Replace(str1, "Á", "A")
> str1 = Replace(str1, "É", "E")
> str1 = Replace(str1, "Í", "I")
> str1 = Replace(str1, "Ó", "O")
> str1 = Replace(str1, "Ú", "U")
> str1 = Replace(str1, "'", "")
> str1 = Replace(str1, Chr(34), "")
> str1 = Replace(str1, "&", "and")
> str1 = Replace(str1, " COMPANY ", " CO ")
> str1 = Replace(str1, " CORPORATION ", " CO ")
> str1 = Replace(str1, " a ", " ")
> str1 = Replace(str1, " an ", " ")
> str1 = Replace(str1, " and ", " ")
> str1 = Replace(str1, " the ", " ")
> str1 = Replace(str1, "the ", "")
> str1 = Replace(str1, " ", "")
> CleanString = str1
> End Function
>
> '********************************
> '*** Compute Levenshtein Distance
> '********************************
>
> Sub testLD()
> Dim i As Variant
> Dim str1 As String
> Dim str2 As String
> str1 = "Cat in the Hat"
> str2 = "the in Hat Cat"
> i = LD(str1, str2)
> Debug.Print i
> End Sub
>
> Public Function LD(ByVal s As String, ByVal t As String) As Integer
> Dim d() As Integer ' matrix
> Dim m As Integer ' length of t
> Dim n As Integer ' length of s
> Dim i As Integer ' iterates
> through s
> Dim j As Integer ' iterates
> through t
> Dim s_i As String ' ith character of s
> Dim t_j As String ' jth character of t
> Dim cost As Integer ' cost
> ' Step 1
> n = Len(s)
> m = Len(t)
> If n = 0 Then
> LD = m
> Exit Function
> End If
> If m = 0 Then
> LD = n
> Exit Function
> End If
> ReDim d(0 To n, 0 To m) As Integer
> ' Step 2
> For i = 0 To n
> d(i, 0) = i
> Next i
> For j = 0 To m
> d(0, j) = j
> Next j
> ' Step 3
> For i = 1 To n
> s_i = Mid$(s, i, 1)
> ' Step 4
> For j = 1 To m
> t_j = Mid$(t, j, 1)
> ' Step 5
> If s_i = t_j Then
> cost = 0
> Else
> cost = 1
> End If
> ' Step 6
> d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
> Next j
> Next i
> ' Step 7
> LD = d(n, m)
> Erase d
> End Function
> '*******************************
> '*** Get minimum of three values
> '*******************************
> Private Function Minimum(ByVal a As Integer, _
> ByVal b As Integer, _
> ByVal c As Integer) As Integer
> Dim mi As Integer
> mi = a
> If b < mi Then
> mi = b
> End If
> If c < mi Then
> mi = c
> End If
> Minimum = mi
> End Function
> '*** Code End ***
>
>
>
> Hypochondria is the only disease I haven't got.
>
> -----Original Message-----
> From: Microsoft Access Database Discussion List
> [mailto:ACCESS-L at PEACH.EASE.LSOFT.COM] On Behalf Of Kevin Waddle
> Sent: Sunday, September 14, 2008 12:08 AM
> To: ACCESS-L at PEACH.EASE.LSOFT.COM
> Subject: Speed Problems when looping through records or the terrors of
> deduping
>
> Hello,
>
> I have a table with 18,000 records. The records are titles stored as a
> string.
>
> I am trying compare each record to all of the other records in the table
> using a Jaro-Winkler Distance function to find similar / potential duplicate
> records.
>
> So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
> 'Composite Science and Technology' and 'Composites Science and Technology'
> would flag as similar.
>
> I have tried looping through the table twice:
> Set dbs = CurrentDb
> Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
> Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
> rst2.MoveFirst
> rst1.MoveFirst
> Do Until rst1.EOF
> If JaroDistance(rst1![CitationTitle], rst2![CitationTitle])> 0.8 Then
> 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
> DoEvents
> rst2.MoveNext
> Loop
> rst1.MoveNext
> Loop
>
> I have tried wrapping the Table into a Class Module and Collection and looping through the collection twice:
> Set dbs = CurrentDb
> Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
> rst1.MoveFirst
> Do Until rst1.EOF
> Set cls = New clsCitation
> Set cls2 = New clsCitation2
> cls.CitID = rst1![CitationTitleID]
> cls.Citation = rst1![CitationTitle]
> cls.RecordCount = cnt
> Citation.Add cls
>
> cls2.CitID2 = rst1![CitationTitleID]
> cls2.Citation2 = rst1![CitationTitle]
> cls2.RecordCount2 = cnt
> Citation2.Add cls2
> rst1.MoveNext
> Loop
>
> For Each cls In Citation
> For Each cls2 In Citation2
> If JaroDistance(cls.Citation, cls2.Citation2)> 0.8 Then
> 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE"
> DoEvents
> Next
> Next
>
>
> Essentially, I end up looping through the entire table 18,000 times!
> No matter what I do it is taking from 45 to 60 seconds to make one pass through the table.
> This means that to complete the entire process would take about 12 days, 24 hours a day!
>
> Can someone point me to a faster method for applying fuzzy deduping?
>
> Thanks,
> Kevin
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com

_________________________________________________________________
Get more out of the Web. Learn 10 hidden secrets of Windows Live.
http://windowslive.com/connect/post/jamiethomson.spaces.live.com-Blog-cns!550F681DAD532637!5295.entry?ocid=TXT_TAGLM_WL_domore_092008



More information about the AccessD mailing list