[AccessD] deduplication

Tue May 29 09:33:50 CDT 2007

A.D.

Your SQL worked as advertised once I got past the "SELECT *" issue.  The
following is the actual code that I ran to do the Family level deduping:

DELETE FROM tblData
WHERE (
	SELECT Count(*) FROM tblData AS T1
	WHERE T1.HashFamily = tblData.HashFamily
		AND T1.PKID <= tblData.PKID
)
> 1;

I created a query to pull just the dupes (at the person level) into a
separate table, then ran the code against that table, deleting dupes against
the HashPerson field.  When finished there were no dupes in HashPerson, as
would be expected.  BTW ~360K person level dupes were deleted.  

I then ran a query displaying all of the records with dupes in HashFamily.
There are still 11.7K dupe COUNTS > 1 in Hash Family which is expected,
these being different first names, same last names.  I then modified the
query to use HashFamily field to do the deletes and 12,416 records were
deleted.  This means that there were several cases where there were 3 or
more people with the same last name different first name in the result set.
Once the delete completed there are no dupes in HashFamily.

That is the desired result for this order, deduped to the family level.

I do have to perform one last check, which is to ensure that the delete
query did not simply remove ALL members with a given person or family hash.
I believe that I have accomplished this test by isolating all of the dupes
into a separate table and performing the dedupe against just the duplicate
result set.  If there are records left, then the delete function did not
delete all the dupes, but left one record for each hash value for the chosen
hash field.  Since the table still contains 311K records (from an initial
660K duplicate HashPerson record set) I believe that the method works
correctly.

Thanks to you I now have a generalized method that I can build into a stored
procedure where I can feed in the table name, HashField name, PKID name and
cause a dupe delete to happen to that duplication level.

Once again A.D., your assistance (and patience) is most appreciated.  

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
Sent: Tuesday, May 29, 2007 1:35 AM
To: Access Developers discussion and problem solving
Cc: ADT
Subject: Re: [AccessD] deduplication

John,

    PKID does not have to be sequential. All that is required is that it
should be unique. For all duplicate records pertaining to a given
HashPerson, the subquery generates calculated sequential numbers, enabling
deletion of all duplicates except the first one. 

    The delete query using count subquery, as suggested by me, was duly
tested and found to perform satisfactorily on Access 2003. The other
alternative, using a subquery with NOT IN clause, as suggested by Gustav in
his second post, is also found to perform smoothly. 

    Subquery with NOT IN clause is sometimes slow. However, in this
particular case, it might even prove faster than Count(*) subquery as the
cross-check involved is essentially against a single record.

    Either of the two alternative styles of delete query should have given
the desired results at your end. 

    If you are still stuck, and are in a position to send a zipped copy of
your table tblData with a few records (say 50 to 100 records), it could be
tested at my end. While doing so, the following tasks would get performed:

    (a) For each HashPerson, merge values across different records, so as to
fill in blanks if any. This refers to your other thread "Merge records".

    (b) After filling up blanks as per (a) above, delete all duplicates so
that there is only one record per person.

A.D.Tejpal
(adtp at airtelbroadband.in)