[AccessD] Speed Problems when looping through records or the terrors of deduping

Gustav Brock Gustav at cactus.dk
Mon Sep 15 01:37:15 CDT 2008


Hi Kevin

..  So the following records would be considered duplicates:
 'European Journal of Social Psychology' = 'Social Psychology'
 'Applied Developmental Psychology' = 'Journal of Applied Developmental Psychology'

But these comparison return a Jaro distance of 0.53 and 0.76, not fulfilling your requirement of 0.8 as the minimum.
So how should these match?

That said, such examples could be found much faster by a simple string comparison using InStr. This will return a positive value when the smaller string is found in the larger.

/gustav

>>> thewaddles at sbcglobal.net 15-09-2008 08:07 >>>
All,
Thank you very much for your suggestions.

I have spent way more time on this than I can recoup with the bid I placed.
Now it has become more of an obsession than a project!

I still haven't found a solution that really works, however I would like to
try to answer all those that took the time to reply.

Thank you,
Kevin

Dan suggested only looping through records with higher It's, so Record 1
would be checked against records 2 to 18,000, Record 17,998 would be checked
against 17,999 to 18,000.
	- Great idea.  I have implemented this, but it's still not fast enough.

Emilia and Steve suggested doing the comparison in memory rather than
looping.
- I created a Class Module (clsCitation) containing the TitleID and Title
and a collection (Citation) to store the class:
    Do Until rst1.EOF
Set cls = New clsCitation
        If rst1![Checked] = False Then
            cls.CitID = rst1![CitationTitleID]
            cls.Citation = rst1![CitationTitle]
            cls.RecordCount = cnt
            Citation.Add cls
        End If
        rst1.MoveNext
Loop
    Set Citation2 = Citation
- I then looped through the two collections.  Still took about 45 seconds per loop

Rocky asked if most of the time was in the function.
	- I took the function and resulting SQL commands out and the time
for each cycle was about the same.

Gustav suggested cutting down on the number of comparisons based on string length.
- Unfortunately, there appears to have been no normalization in the table structure.  
      So the following records would be considered duplicates:
         'European Journal of Social Psychology' = 'Social Psychology'
         'Applied Developmental Psychology' = 'Journal of Applied Developmental Psychology'

Stephen suggested a Cartesian query and then hit immediately on the problem with that route.
My problem is I get a "Not Enough Space on Temp Disk" error before the query finishes.

David asked for the functions:
The Jaro Function uses two functions: JaroDistance and CleanString.  I have
also included a Levenshtein Distance function. The Levenshtein distance
between two strings is the minimum number of operations needed to transform
one string into the other, where an operation is an insertion, deletion, or
substitution of a single character.

 '*** Code Start ***
Function JaroDistance(ByVal str1 As String, ByVal str2 As String) As Double
    Dim Len1                                 As Integer
    Dim Len2                                 As Integer
    Dim LenMin                               As Integer
    Dim LenMax                               As Integer
    Dim m                                    As Integer
    Dim i                                    As Integer
    Dim j                                    As Integer
    Dim Common                               As Integer
    Dim tr                                   As Double
    Dim a1                                   As String
    Dim a2                                   As String
    Dim Aux                                  As String
    Dim AuxStr                               As String
    Dim f                                    As Integer
    Dim l                                    As Integer
    Dim wcd                                  As Double
    Dim wrd                                  As Double
    Dim wtr                                  As Double
    Dim f1()                                 As Boolean
    Dim f2()                                 As Boolean
    str1 = CleanString(str1)
    str2 = CleanString(str2)
    Len1 = Len(str1)
    Len2 = Len(str2)
    If Len1 > Len2 Then
        Aux = Len2
        Len2 = Len1
        Len1 = Aux
        AuxStr = str1
        str1 = str2
        str2 = AuxStr
    End If
    LenMin = Len1
    LenMax = Len2
    ReDim f1(Len1), f2(Len2)
    For i = 1 To Len1
        f1(i) = False
    Next i
    For j = 1 To Len2
        f2(j) = False
    Next j
    m = Int((LenMax / 2) - 1)
    Common = 0
    tr = 0
    For i = 1 To Len1
        a1 = Mid(str1, i, 1)
        If m >= i Then
            f = 1
            l = i + m
        Else
            f = i - m
            l = i + m
        End If
        If l > LenMax Then
            l = LenMax
        End If
        For j = f To l
            a2 = Mid(str2, j, 1)
            If (a2 = a1) And (f2(j) = False) Then
                Common = Common + 1
                f1(i) = True
                f2(j) = True
                GoTo linea_exit
            End If
        Next j
linea_exit:
    Next i
    l = 1
    For i = 1 To Len1
        If f1(i) Then
            For j = l To Len2
                If f2(j) Then
                    l = j + 1
                    a1 = Mid(str1, i, 1)
                    a2 = Mid(str2, j, 1)
                    If a1 <> a2 Then
                        tr = tr + 0.5
                    End If
                    Exit For
                End If
            Next j
        End If
    Next i
    wcd = 1 / 3
    wrd = 1 / 3
    wtr = 1 / 3
    If Common <> 0 Then
        JaroDistance = wcd * Common / Len1 + wrd * Common / Len2 + wtr * (Common - tr) / Common
    Else
        JaroDistance = 0
    End If
End Function
Function CleanString(str1 As String) As String
    str1 = Replace(str1, ".", "")
    str1 = Replace(str1, ",", "")
    str1 = Replace(str1, "-", "")
    str1 = Replace(str1, ";", "")
    str1 = Replace(str1, ":", "")
    str1 = Replace(str1, "Á", "A")
    str1 = Replace(str1, "É", "E")
    str1 = Replace(str1, "Í", "I")
    str1 = Replace(str1, "Ó", "O")
    str1 = Replace(str1, "Ú", "U")
    str1 = Replace(str1, "'", "")
    str1 = Replace(str1, Chr(34), "")
    str1 = Replace(str1, "&", "and")
    str1 = Replace(str1, " COMPANY ", " CO ")
    str1 = Replace(str1, " CORPORATION ", " CO ")
    str1 = Replace(str1, " a ", " ")
    str1 = Replace(str1, " an ", " ")
    str1 = Replace(str1, " and ", " ")
    str1 = Replace(str1, " the ", " ")
    str1 = Replace(str1, "the ", "")
    str1 = Replace(str1, " ", "")
    CleanString = str1
End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Sub testLD()
    Dim i                                    As Variant
    Dim str1                                 As String
    Dim str2                                 As String
    str1 = "Cat in the Hat"
    str2 = "the in Hat Cat"
    i = LD(str1, str2)
Debug.Print i
End Sub

Public Function LD(ByVal s As String, ByVal t As String) As Integer
    Dim d()                                  As Integer    ' matrix
    Dim m                                    As Integer    ' length of t
    Dim n                                    As Integer    ' length of s
    Dim i                                    As Integer    ' iterates
through s
    Dim j                                    As Integer    ' iterates
through t
    Dim s_i                                  As String    ' ith character of s
    Dim t_j                                  As String    ' jth character of t
    Dim cost                                 As Integer    ' cost
    ' Step 1
    n = Len(s)
    m = Len(t)
    If n = 0 Then
        LD = m
        Exit Function
    End If
    If m = 0 Then
        LD = n
        Exit Function
    End If
    ReDim d(0 To n, 0 To m) As Integer
    ' Step 2
    For i = 0 To n
        d(i, 0) = i
    Next i
    For j = 0 To m
        d(0, j) = j
    Next j
    ' Step 3
    For i = 1 To n
        s_i = Mid$(s, i, 1)
        ' Step 4
        For j = 1 To m
            t_j = Mid$(t, j, 1)
            ' Step 5
            If s_i = t_j Then
                cost = 0
            Else
                cost = 1
            End If
            ' Step 6
            d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
        Next j
    Next i
    ' Step 7
    LD = d(n, m)
    Erase d
End Function
'*******************************
'*** Get minimum of three values
'*******************************
Private Function Minimum(ByVal a As Integer, _
                         ByVal b As Integer, _
                         ByVal c As Integer) As Integer
    Dim mi                                   As Integer
    mi = a
    If b < mi Then
        mi = b
    End If
    If c < mi Then
        mi = c
    End If
    Minimum = mi
End Function
'*** Code End ***



Hypochondria is the only disease I haven't got.

-----Original Message-----
From: Microsoft Access Database Discussion List
[mailto:ACCESS-L at PEACH.EASE.LSOFT.COM] On Behalf Of Kevin Waddle
Sent: Sunday, September 14, 2008 12:08 AM
To: ACCESS-L at PEACH.EASE.LSOFT.COM 
Subject: Speed Problems when looping through records or the terrors of
deduping

Hello,
 
I have a table with 18,000 records.  The records are titles stored as a
string.
 
I am trying compare each record to all of the other records in the table
using a Jaro-Winkler Distance function to find similar / potential duplicate
records.
 
So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
'Composite Science and Technology' and 'Composites Science and Technology'
would flag as similar.
 
I have tried looping through the table twice:
    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
    Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
    rst2.MoveFirst
    rst1.MoveFirst
    Do Until rst1.EOF
            If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) > 0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
            DoEvents
            rst2.MoveNext
        Loop
        rst1.MoveNext
    Loop

I have tried wrapping the Table into a Class Module and Collection and looping through the collection twice:
    Set dbs = CurrentDb
    Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]", dbOpenDynaset)
    rst1.MoveFirst
    Do Until rst1.EOF
        Set cls = New clsCitation
        Set cls2 = New clsCitation2
            cls.CitID = rst1![CitationTitleID]
            cls.Citation = rst1![CitationTitle]
            cls.RecordCount = cnt
            Citation.Add cls

        cls2.CitID2 = rst1![CitationTitleID]
        cls2.Citation2 = rst1![CitationTitle]
        cls2.RecordCount2 = cnt
        Citation2.Add cls2
        rst1.MoveNext
    Loop
 
    For Each cls In Citation
        For Each cls2 In Citation2
            If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
                'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE"
            DoEvents
        Next
    Next
 
 
Essentially, I end up looping through the entire table 18,000 times!
No matter what I do it is taking from 45 to 60 seconds to make one pass through the table.  
This means that to complete the entire process would take about 12 days, 24 hours a day!
 
Can someone point me to a faster method for applying fuzzy deduping?
 
Thanks,
Kevin




More information about the AccessD mailing list