[AccessD] Speed Problems when looping through records or theterrors of deduping

Mon Sep 15 12:01:37 CDT 2008

I've got a little different approach for you to try.  Instead of running
every record through code, to compare to every other record A slightly
different approach would be to break down each title into words.

So, let's say tblTitles is your table with titles, with TitleID (long
integer ID), Title as the field with the title, and add a field called
WordCount (long integer).  Then create tblWords (WordID (Autonumber),
and Word (text)), and tblWordsInTitles (TitleID, WordID).  Then you
would run the following functions (actually, run GetWordsFromTitles,
RemovePunctuation is used inside that function):

Function RemovePunctuation(ByVal strWord As String) As String
Dim intChar As Long
intChar = Asc(Right(strWord, 1))
Do Until intChar >= 65 And intChar <= 90
    If Len(strWord) >= 1 Then
        RemovePunctuation = ""
        Exit Function
    End If
    strWord = Left(strWord, Len(strWord) - 1)
    intChar = Asc(Right(strWord, 1))
Loop
RemovePunctuation = strWord
End Function
Function GetWordsFromTitles()
Dim rs As ADODB.Recordset
Dim strSQL As String
Dim rsWord As ADODB.Recordset
Dim strArray() As String
Dim i As Long
Dim intWordCount As Long
Dim strWord As String
strSQL = "SELECT TitleID, Title, WordCount FROM tblTitles WHERE
WordCount=0"
Set rs = New ADODB.Recordset
rs.Open strSQL, CurrentProject.Connection, adOpenKeyset,
adLockOptimistic
If rs.EOF = False Then rs.MoveFirst
Do Until rs.EOF = True
    strArray = Split(rs.Fields(1).Value)
    intWordCount = 0
    For i = 0 To UBound(strArray)
        strWord = RemovePunctuation(UCase(strArray(i)))
        If strWord <> "" Then
            strSQL = "SELECT WordID FROM tblWords WHERE Word=""" &
strWord & """"
            Set rsWord = New ADODB.Recordset
            rsWord.Open strSQL, CurrentProject.Connection, adOpenKeyset,
adLockReadOnly
            If rsWord.EOF = True And rsWord.BOF = True Then
                rsWord.Close
                Set rsWord = Nothing
                Set rsWord = New ADODB.Recordset
                rsWord.Open "tblWords", CurrentProject.Connection,
adOpenKeyset, adLockOptimistic
                rsWord.AddNew
                rsWord.Fields("Word").Value = strWord
                rsWord.Update
            Else
                rsWord.MoveFirst
            End If
            strSQL = "INSERT INTO tblWordsInTitles ([TitleID],[WordID])
VALUES(" & rs.Fields(0).Value & "," & rsWord.Fields("WordID").Value &
")"
            rsWord.Close
            Set rsWord = Nothing
            DoCmd.SetWarnings False
            DoCmd.RunSQL strSQL
            DoCmd.SetWarnings True
            intWordCount = intWordCount + 1
        End If
    Next i
    rs.Fields("WordCount").Value = intWordCount
    rs.Update
    rs.MoveNext
Loop
rs.Close
Set rs = Nothing
MsgBox "Done"
End Function 

Now you have a table with all the words from your titles, and you have a
table with links between the words and the titles.  You can then do a
'reverse lookup' of sorts, based on the commonality of the words, and
the number of words.  For example:

SELECT tblTitles.TitleID, tblTitles.Title
FROM tblWordsInTitles AS tblWordsInTitles_1 INNER JOIN (tblWordsInTitles
INNER JOIN tblTitles ON tblWordsInTitles.TitleID = tblTitles.TitleID) ON
tblWordsInTitles_1.WordID = tblWordsInTitles.WordID
WHERE (((tblTitles.TitleID)<>1) AND ((tblWordsInTitles_1.TitleID)=1))
GROUP BY tblTitles.TitleID, tblTitles.Title
HAVING (((Count(tblTitles.TitleID))>(0.8*4)));

Would display any common titles with at least 80% of the words in the
title with ID 1 (the WHERE clause designates the ID for the original
title) and the HAVING clause is the .8 (80%) of 4 (WordCount for record
with ID of 1)).

If you need a running record of matched titles, then you would just have
to run this query against each record.  Now, this is in no way a perfect
example, you may need to tweak things a bit. Maybe tweak it so that it
ignores common words like in, the, a, etc. or catches words that are
plurals of existing words.  

The advantage of this approach would be if you need to list 'common
titles' on the fly.  The code above would need to be run whenever new
titles are added, but it'll only run against new titles (with
WordCount=0), so it might take a minute on your first 18k records, but
from then on it's a second to run for a handful of records.  And any
querying against the 'reverse word lookup' would be pretty quick since
it's all indexed numbers.

Just a thought.

Drew
The information contained in this transmission is intended only for the person or entity to which it is addressed and may contain II-VI Proprietary and/or II-VI Business Sensitive material. If you are not the intended recipient, please contact the sender immediately and destroy the material in its entirety, whether electronic or hard copy. You are notified that any review, retransmission, copying, disclosure, dissemination, or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited.