Kevin Waddle
thewaddles at sbcglobal.net
Mon Sep 15 01:07:55 CDT 2008
All,
Thank you very much for your suggestions.
I have spent way more time on this than I can recoup with the bid I placed.
Now it has become more of an obsession than a project!
I still havent found a solution that really works, however I would like to
try to answer all those that took the time to reply.
Thank you,
Kevin
Dan suggested only looping through records with higher Its, so Record 1
would be checked against records 2 to 18,000, Record 17,998 would be checked
against 17,999 to 18,000.
- Great idea. I have implemented this, but its still not fast
enough.
Emilia and Steve suggested doing the comparison in memory rather than
looping.
- I created a Class Module (clsCitation) containing the TitleID and Title
and a collection (Citation) to store the class:
Do Until rst1.EOF
Set cls = New clsCitation
If rst1![Checked] = False Then
cls.CitID = rst1![CitationTitleID]
cls.Citation = rst1![CitationTitle]
cls.RecordCount = cnt
Citation.Add cls
End If
rst1.MoveNext
Loop
Set Citation2 = Citation
- I then looped through the two collections. Still took about 45 seconds
per loop
Rocky asked if most of the time was in the function.
- I took the function and resulting SQL commands out and the time
for each cycle was about the same.
Gustav suggested cutting down on the number of comparisons based on string
length.
- Unfortunately, there appears to have been no normalization in the table
structure.
So the following records would be considered duplicates:
European Journal of Social Psychology = Social Psychology
Applied Developmental Psychology = Journal of Applied
Developmental Psychology
Stephen suggested a Cartesian query and then hit immediately on the problem
with that route.
My problem is I get a Not Enough Space on Temp Disk error before the query
finishes.
David asked for the functions:
The Jaro Function uses two functions: JaroDistance and CleanString. I have
also included a Levenshtein Distance function. The Levenshtein distance
between two strings is the minimum number of operations needed to transform
one string into the other, where an operation is an insertion, deletion, or
substitution of a single character.
'*** Code Start ***
Function JaroDistance(ByVal str1 As String, ByVal str2 As String) As Double
Dim Len1 As Integer
Dim Len2 As Integer
Dim LenMin As Integer
Dim LenMax As Integer
Dim m As Integer
Dim i As Integer
Dim j As Integer
Dim Common As Integer
Dim tr As Double
Dim a1 As String
Dim a2 As String
Dim Aux As String
Dim AuxStr As String
Dim f As Integer
Dim l As Integer
Dim wcd As Double
Dim wrd As Double
Dim wtr As Double
Dim f1() As Boolean
Dim f2() As Boolean
str1 = CleanString(str1)
str2 = CleanString(str2)
Len1 = Len(str1)
Len2 = Len(str2)
If Len1 > Len2 Then
Aux = Len2
Len2 = Len1
Len1 = Aux
AuxStr = str1
str1 = str2
str2 = AuxStr
End If
LenMin = Len1
LenMax = Len2
ReDim f1(Len1), f2(Len2)
For i = 1 To Len1
f1(i) = False
Next i
For j = 1 To Len2
f2(j) = False
Next j
m = Int((LenMax / 2) - 1)
Common = 0
tr = 0
For i = 1 To Len1
a1 = Mid(str1, i, 1)
If m >= i Then
f = 1
l = i + m
Else
f = i - m
l = i + m
End If
If l > LenMax Then
l = LenMax
End If
For j = f To l
a2 = Mid(str2, j, 1)
If (a2 = a1) And (f2(j) = False) Then
Common = Common + 1
f1(i) = True
f2(j) = True
GoTo linea_exit
End If
Next j
linea_exit:
Next i
l = 1
For i = 1 To Len1
If f1(i) Then
For j = l To Len2
If f2(j) Then
l = j + 1
a1 = Mid(str1, i, 1)
a2 = Mid(str2, j, 1)
If a1 <> a2 Then
tr = tr + 0.5
End If
Exit For
End If
Next j
End If
Next i
wcd = 1 / 3
wrd = 1 / 3
wtr = 1 / 3
If Common <> 0 Then
JaroDistance = wcd * Common / Len1 + wrd * Common / Len2 + wtr *
(Common - tr) / Common
Else
JaroDistance = 0
End If
End Function
Function CleanString(str1 As String) As String
str1 = Replace(str1, ".", "")
str1 = Replace(str1, ",", "")
str1 = Replace(str1, "-", "")
str1 = Replace(str1, ";", "")
str1 = Replace(str1, ":", "")
str1 = Replace(str1, "Á", "A")
str1 = Replace(str1, "É", "E")
str1 = Replace(str1, "Í", "I")
str1 = Replace(str1, "Ó", "O")
str1 = Replace(str1, "Ú", "U")
str1 = Replace(str1, "'", "")
str1 = Replace(str1, Chr(34), "")
str1 = Replace(str1, "&", "and")
str1 = Replace(str1, " COMPANY ", " CO ")
str1 = Replace(str1, " CORPORATION ", " CO ")
str1 = Replace(str1, " a ", " ")
str1 = Replace(str1, " an ", " ")
str1 = Replace(str1, " and ", " ")
str1 = Replace(str1, " the ", " ")
str1 = Replace(str1, "the ", "")
str1 = Replace(str1, " ", "")
CleanString = str1
End Function
'********************************
'*** Compute Levenshtein Distance
'********************************
Sub testLD()
Dim i As Variant
Dim str1 As String
Dim str2 As String
str1 = "Cat in the Hat"
str2 = "the in Hat Cat"
i = LD(str1, str2)
Debug.Print i
End Sub
Public Function LD(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim n As Integer ' length of s
Dim i As Integer ' iterates
through s
Dim j As Integer ' iterates
through t
Dim s_i As String ' ith character of
s
Dim t_j As String ' jth character of
t
Dim cost As Integer ' cost
' Step 1
n = Len(s)
m = Len(t)
If n = 0 Then
LD = m
Exit Function
End If
If m = 0 Then
LD = n
Exit Function
End If
ReDim d(0 To n, 0 To m) As Integer
' Step 2
For i = 0 To n
d(i, 0) = i
Next i
For j = 0 To m
d(0, j) = j
Next j
' Step 3
For i = 1 To n
s_i = Mid$(s, i, 1)
' Step 4
For j = 1 To m
t_j = Mid$(t, j, 1)
' Step 5
If s_i = t_j Then
cost = 0
Else
cost = 1
End If
' Step 6
d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j -
1) + cost)
Next j
Next i
' Step 7
LD = d(n, m)
Erase d
End Function
'*******************************
'*** Get minimum of three values
'*******************************
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim mi As Integer
mi = a
If b < mi Then
mi = b
End If
If c < mi Then
mi = c
End If
Minimum = mi
End Function
'*** Code End ***
Hypochondria is the only disease I haven't got.
-----Original Message-----
From: Microsoft Access Database Discussion List
[mailto:ACCESS-L at PEACH.EASE.LSOFT.COM] On Behalf Of Kevin Waddle
Sent: Sunday, September 14, 2008 12:08 AM
To: ACCESS-L at PEACH.EASE.LSOFT.COM
Subject: Speed Problems when looping through records or the terrors of
deduping
Hello,
I have a table with 18,000 records. The records are titles stored as a
string.
I am trying compare each record to all of the other records in the table
using a Jaro-Winkler Distance function to find similar / potential duplicate
records.
So 'Cat in the Hat' and 'Hat in the Cat' would flag as similar and
'Composite Science and Technology' and 'Composites Science and Technology'
would flag as similar.
I have tried looping through the table twice:
Set dbs = CurrentDb
Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
Set rst2 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
rst2.MoveFirst
rst1.MoveFirst
Do Until rst1.EOF
If JaroDistance(rst1![CitationTitle], rst2![CitationTitle]) >
0.8 Then
'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE
DoEvents
rst2.MoveNext
Loop
rst1.MoveNext
Loop
I have tried wrapping the Table into a Class Module and Collection and
looping through the collection twice:
Set dbs = CurrentDb
Set rst1 = dbs.OpenRecordset("SELECT * FROM [tblCitationTitles]",
dbOpenDynaset)
rst1.MoveFirst
Do Until rst1.EOF
Set cls = New clsCitation
Set cls2 = New clsCitation2
cls.CitID = rst1![CitationTitleID]
cls.Citation = rst1![CitationTitle]
cls.RecordCount = cnt
Citation.Add cls
cls2.CitID2 = rst1![CitationTitleID]
cls2.Citation2 = rst1![CitationTitle]
cls2.RecordCount2 = cnt
Citation2.Add cls2
rst1.MoveNext
Loop
For Each cls In Citation
For Each cls2 In Citation2
If JaroDistance(cls.Citation, cls2.Citation2) > 0.8 Then
'INSERT INTO ANOTHER TABLE RELATING THE TITLE TABLE"
DoEvents
Next
Next
Essentially, I end up looping through the entire table 18,000 times!
No matter what I do it is taking from 45 to 60 seconds to make one pass
through the table.
This means that to complete the entire process would take about 12 days, 24
hours a day!
Can someone point me to a faster method for applying fuzzy deduping?
Thanks,
Kevin
--------------------------------------------------------------------------
The ACCESS-L list is hosted on a Windows(R) 2000 Server running L-Soft
international's LISTSERV(R) software. For subscription/signoff info and
archives, see http://peach.ease.lsoft.com/archives/access-l.html .
COPYRIGHT INFO:
http://peach.ease.lsoft.com/scripts/wa.exe?SHOWTPL=COPYRIGHT&L=ACCESS-L