Kevin Waddle
thewaddles at sbcglobal.net
Mon Sep 15 01:07:55 CDT 2008
All, Thank you very much for your suggestions. I have spent way more time on this than I can recoup with the bid I placed. Now it has become more of an obsession than a project! I still havent found a solution that really works, however I would like to try to answer all those that took the time to reply. Thank you, Kevin Dan suggested only looping through records with higher Its, so Record 1 would be checked against records 2 to 18,000, Record 17,998 would be checked against 17,999 to 18,000. - Great idea. I have implemented this, but its still not fast enough. Emilia and Steve suggested doing the comparison in memory rather than looping. - I created a Class Module (clsCitation) containing the TitleID and Title and a collection (Citation) to store the class: Do Until rst1.EOF Set cls = New clsCitation If rst1![Checked] = False Then cls.CitID = rst1![CitationTitleID] cls.Citation = rst1![CitationTitle] cls.RecordCount = cnt Citation.Add cls End If rst1.MoveNext Loop Set Citation2 = Citation - I then looped through the two collections. Still took about 45 seconds per loop Rocky asked if most of the time was in the function. - I took the function and resulting SQL commands out and the time for each cycle was about the same. Gustav suggested cutting down on the number of comparisons based on string length. - Unfortunately, there appears to have been no normalization in the table structure. So the following records would be considered duplicates: European Journal of Social Psychology = Social Psychology Applied Developmental Psychology = Journal of Applied Developmental Psychology Stephen suggested a Cartesian query and then hit immediately on the problem with that route. My problem is I get a Not Enough Space on Temp Disk error before the query finishes. David asked for the functions: The Jaro Function uses two functions: JaroDistance and CleanString. I have also included a Levenshtein Distance function. The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. '*** Code Start *** Function JaroDistance(ByVal str1 As String, ByVal str2 As String) As Double Dim Len1 As Integer Dim Len2 As Integer Dim LenMin As Integer Dim LenMax As Integer Dim m As Integer Dim i As Integer Dim j As Integer Dim Common As Integer Dim tr As Double Dim a1 As String Dim a2 As String Dim Aux As String Dim AuxStr As String Dim f As Integer Dim l As Integer Dim wcd As Double Dim wrd As Double Dim wtr As Double Dim f1() As Boolean Dim f2() As Boolean str1 = CleanString(str1) str2 = CleanString(str2) Len1 = Len(str1) Len2 = Len(str2) If Len1 > Len2 Then Aux = Len2 Len2 = Len1 Len1 = Aux AuxStr = str1 str1 = str2 str2 = AuxStr End If LenMin = Len1 LenMax = Len2 ReDim f1(Len1), f2(Len2) For i = 1 To Len1 f1(i) = False Next i For j = 1 To Len2 f2(j) = False Next j m = Int((LenMax / 2) - 1) Common = 0 tr = 0 For i = 1 To Len1 a1 = Mid(str1, i, 1) If m >= i Then f = 1 l = i + m Else f = i - m l = i + m End If If l > LenMax Then l = LenMax End If For j = f To l a2 = Mid(str2, j, 1) If (a2 = a1) And (f2(j) = False) Then Common = Common + 1 f1(i) = True f2(j) = True GoTo linea_exit End If Next j linea_exit: Next i l = 1 For i = 1 To Len1 If f1(i) Then For j = l To Len2 If f2(j) Then l = j + 1 a1 = Mid(str1, i, 1) a2 = Mid(str2, j, 1) If a1 <> a2 Then tr = tr + 0.5 End If Exit For End If Next j End If Next i wcd = 1 / 3 wrd = 1 / 3 wtr = 1 / 3 If Common <> 0 Then JaroDistance = wcd * Common / Len1 + wrd * Common / Len2 + wtr * (Common - tr) / Common Else JaroDistance = 0 End If End Function Function CleanString(str1 As String) As String str1 = Replace(str1, ".", "") str1 = Replace(str1, ",", "") str1 = Replace(str1, "-", "") str1 = Replace(str1, ";", "") str1 = Replace(str1, ":", "") str1 = Replace(str1, "Á", "A") str1 = Replace(str1, "É", "E") str1 = Replace(str1, "Í", "I") str1 = Replace(str1, "Ó", "O") str1 = Replace(str1, "Ú", "U") str1 = Replace(str1, "'", "") str1 = Replace(str1, Chr(34), "") str1 = Replace(str1, "&", "and") str1 = Replace(str1, " COMPANY ", " CO ") str1 = Replace(str1, " CORPORATION ", " CO ") str1 = Replace(str1, " a ", " ") str1 = Replace(str1, " an ", " ") str1 = Replace(str1, " and ", " ") str1 = Replace(str1, " the ", " ") str1 = Replace(str1, "the ", "") str1 = Replace(str1, " ", "") CleanString = str1 End Function '******************************** '*** Compute Levenshtein Distance '******************************** Sub testLD() Dim i As Variant Dim str1 As String Dim str2 As String str1 = "Cat in the Hat" str2 = "the in Hat Cat" i = LD(str1, str2) Debug.Print i End Sub Public Function LD(ByVal s As String, ByVal t As String) As Integer Dim d() As Integer ' matrix Dim m As Integer ' length of t Dim n As Integer ' length of s Dim i As Integer ' iterates through s Dim j As Integer ' iterates through t Dim s_i As String ' ith character of s Dim t_j As String ' jth character of t Dim cost As Integer ' cost ' Step 1 n = Len(s) m = Len(t) If n = 0 Then LD = m Exit Function End If If m = 0 Then LD = n Exit Function End If ReDim d(0 To n, 0 To m) As Integer ' Step 2 For i = 0 To n d(i, 0) = i Next i For j = 0 To m d(0, j) = j Next j ' Step 3 For i = 1 To n s_i = Mid$(s, i, 1) ' Step 4 For j = 1 To m t_j = Mid$(t, j, 1) ' Step 5 If s_i = t_j Then cost = 0 Else cost = 1 End If ' Step 6 d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost) Next j Next i ' Step 7 LD = d(n, m) Erase d End Function '******************************* '*** Get minimum of three values '******************************* Private Function Minimum(ByVal a As Integer, _ ByVal b As Integer, _ ByVal c As Integer) As Integer Dim mi As Integer mi = a If b < mi Then mi = b End If If c < mi Then mi = c End If Minimum = mi End Function '*** Code End *** Hypochondria is the only disease I haven't got. -----Original Message----- From: Microsoft Access Database Discussion List [mailto:ACCESS-L at PEACH.EASE.LSOFT.COM] On Behalf Of Kevin Waddle Sent: Sunday, September 14, 2008 12:08 AM To: ACCESS-L at PEACH.EASE.LSOFT.COM Subject: Speed Problems when looping through records or the terrors of deduping Hello, I have a table with 18,000 records. The records are titles stored as a string. I am trying compare each record to all of the other records in the table using a Jaro-Winkler Distance function to find similar / potential duplicate records. So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and 'Composite Science and Technology' and 'Composites Science and Technology' would flag as similar. I have tried looping through the table twice: Set dbs = CurrentDb Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset) Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset) rst2.MoveFirst rst1.MoveFirst Do Until rst1.EOF If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) > 0.8 Then 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE DoEvents rst2.MoveNext Loop rst1.MoveNext Loop I have tried wrapping the Table into a Class Module and Collection and looping through the collection twice: Set dbs = CurrentDb Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset) rst1.MoveFirst Do Until rst1.EOF Set cls = New clsCitation Set cls2 = New clsCitation2 cls.CitID = rst1![CitationTitleID] cls.Citation = rst1![CitationTitle] cls.RecordCount = cnt Citation.Add cls cls2.CitID2 = rst1![CitationTitleID] cls2.Citation2 = rst1![CitationTitle] cls2.RecordCount2 = cnt Citation2.Add cls2 rst1.MoveNext Loop For Each cls In Citation For Each cls2 In Citation2 If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then 'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE" DoEvents Next Next Essentially, I end up looping through the entire table 18,000 times! No matter what I do it is taking from 45 to 60 seconds to make one pass through the table. This means that to complete the entire process would take about 12 days, 24 hours a day! Can someone point me to a faster method for applying fuzzy deduping? Thanks, Kevin -------------------------------------------------------------------------- The ACCESS-L list is hosted on a Windows(R) 2000 Server running L-Soft international's LISTSERV(R) software. For subscription/signoff info and archives, see http://peach.ease.lsoft.com/archives/access-l.html . COPYRIGHT INFO: http://peach.ease.lsoft.com/scripts/wa.exe?SHOWTPL=COPYRIGHT&L=ACCESS-L