[AccessD] deduplication

A.D.TEJPAL adtp at hotmail.com
Tue May 29 12:45:14 CDT 2007


    You are most welcome John!  I am glad that you were successful in evolving a solution on the desired lines.

Best wishes,
A.D.Tejpal
---------------

  ----- Original Message ----- 
  From: jwcolby 
  To: 'Access Developers discussion and problem solving' 
  Sent: Tuesday, May 29, 2007 20:03
  Subject: Re: [AccessD] deduplication


  A.D.

  Your SQL worked as advertised once I got past the "SELECT *" issue.  The following is the actual code that I ran to do the Family level deduping:

  DELETE FROM tblData
  WHERE (SELECT Count(*) FROM tblData AS T1
  WHERE T1.HashFamily = tblData.HashFamily
  AND T1.PKID <= tblData.PKID) > 1;

  I created a query to pull just the dupes (at the person level) into a
  separate table, then ran the code against that table, deleting dupes against the HashPerson field.  When finished there were no dupes in HashPerson, as would be expected.  BTW ~360K person level dupes were deleted.  

  I then ran a query displaying all of the records with dupes in HashFamily. There are still 11.7K dupe COUNTS > 1 in Hash Family which is expected, these being different first names, same last names.  I then modified the query to use HashFamily field to do the deletes and 12,416 records were deleted.  This means that there were several cases where there were 3 or more people with the same last name different first name in the result set. Once the delete completed there are no dupes in HashFamily.

  That is the desired result for this order, deduped to the family level.

  I do have to perform one last check, which is to ensure that the delete
  query did not simply remove ALL members with a given person or family hash.

  I believe that I have accomplished this test by isolating all of the dupes
  into a separate table and performing the dedupe against just the duplicate result set.  If there are records left, then the delete function did not delete all the dupes, but left one record for each hash value for the chosen hash field.  Since the table still contains 311K records (from an initial 660K duplicate HashPerson record set) I believe that the method works correctly.

  Thanks to you I now have a generalized method that I can build into a stored procedure where I can feed in the table name, HashField name, PKID name and cause a dupe delete to happen to that duplication level.

  Once again A.D., your assistance (and patience) is most appreciated.  

  John W. Colby
  Colby Consulting
  www.ColbyConsulting.com 


More information about the AccessD mailing list