jwcolby
jwcolby at colbyconsulting.com
Tue May 29 09:33:50 CDT 2007
A.D. Your SQL worked as advertised once I got past the "SELECT *" issue. The following is the actual code that I ran to do the Family level deduping: DELETE FROM tblData WHERE ( SELECT Count(*) FROM tblData AS T1 WHERE T1.HashFamily = tblData.HashFamily AND T1.PKID <= tblData.PKID ) > 1; I created a query to pull just the dupes (at the person level) into a separate table, then ran the code against that table, deleting dupes against the HashPerson field. When finished there were no dupes in HashPerson, as would be expected. BTW ~360K person level dupes were deleted. I then ran a query displaying all of the records with dupes in HashFamily. There are still 11.7K dupe COUNTS > 1 in Hash Family which is expected, these being different first names, same last names. I then modified the query to use HashFamily field to do the deletes and 12,416 records were deleted. This means that there were several cases where there were 3 or more people with the same last name different first name in the result set. Once the delete completed there are no dupes in HashFamily. That is the desired result for this order, deduped to the family level. I do have to perform one last check, which is to ensure that the delete query did not simply remove ALL members with a given person or family hash. I believe that I have accomplished this test by isolating all of the dupes into a separate table and performing the dedupe against just the duplicate result set. If there are records left, then the delete function did not delete all the dupes, but left one record for each hash value for the chosen hash field. Since the table still contains 311K records (from an initial 660K duplicate HashPerson record set) I believe that the method works correctly. Thanks to you I now have a generalized method that I can build into a stored procedure where I can feed in the table name, HashField name, PKID name and cause a dupe delete to happen to that duplication level. Once again A.D., your assistance (and patience) is most appreciated. John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL Sent: Tuesday, May 29, 2007 1:35 AM To: Access Developers discussion and problem solving Cc: ADT Subject: Re: [AccessD] deduplication John, PKID does not have to be sequential. All that is required is that it should be unique. For all duplicate records pertaining to a given HashPerson, the subquery generates calculated sequential numbers, enabling deletion of all duplicates except the first one. The delete query using count subquery, as suggested by me, was duly tested and found to perform satisfactorily on Access 2003. The other alternative, using a subquery with NOT IN clause, as suggested by Gustav in his second post, is also found to perform smoothly. Subquery with NOT IN clause is sometimes slow. However, in this particular case, it might even prove faster than Count(*) subquery as the cross-check involved is essentially against a single record. Either of the two alternative styles of delete query should have given the desired results at your end. If you are still stuck, and are in a position to send a zipped copy of your table tblData with a few records (say 50 to 100 records), it could be tested at my end. While doing so, the following tasks would get performed: (a) For each HashPerson, merge values across different records, so as to fill in blanks if any. This refers to your other thread "Merge records". (b) After filling up blanks as per (a) above, delete all duplicates so that there is only one record per person. A.D.Tejpal (adtp at airtelbroadband.in)