jwcolby
jwcolby at colbyconsulting.com
Mon May 28 14:56:23 CDT 2007
I created HashID fields which use the SHA1 hash function built in to SQL Server 2007. That gives me one single field to index and pull HashIDs from which uniquely identifies duplicates using count() grouping queries. These hash fields work quite well, though they are unreadable. The nice part is that they reduce multiple field strings - Addr, zip5, zip4 and even Lname & fName which are potentially 60 or 70 (or more) bytes long down to a string only about 20 bytes long. I then index those. I am building a query that returns HashIDs where count(HashID) > 0. Using that I pull all the PKIDs for those HashIDs. Using that I pull Max(PKID). You get the picture. Even starting with the HashID fields it is a multi-step process, kinda the 'brute force' approach, and I was hoping to come up with something simpler. So far A.D.s suggestion has refused to compile, regardless of how I muck around with it. I got a delete method working, but it does require running the delete query until no records delete. There are cases where I have 4,5, or even more duplicates. Once I figure out what I am doing, I hope to create a stored procedure where I can input the table name and just have the process run against that table. John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock Sent: Monday, May 28, 2007 3:29 PM To: accessd at databaseadvisors.com Subject: Re: [AccessD] deduplication Hi John You may get better results by using NOT IN: DELETE * FROM tblData WHERE PKID NOT IN (SELECT TOP 1 FROM tblData As T1 WHERE T1.HashPerson = tblData.HashPerson AND T1.PKID <= tblData.PKID ORDER BY T1.PKID); Air code. Test with caution. /gustav >>> jwcolby at colbyconsulting.com 28-05-2007 21:00 >>> A.D. It seems to me that the subquery SELECT should be able to stand alone, and the subquery is giving a "Incorrect syntax near '>'" John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL Sent: Monday, May 28, 2007 1:29 PM To: Access Developers discussion and problem solving Subject: Re: [AccessD] deduplication John, Considering the table & field names mentioned by you, the SQL given below should get you the desired results. You might like to try it out & confirm the outcome. Best wishes, A.D.Tejpal --------------- ===================================== DELETE * FROM tblData WHERE (SELECT Count(*) FROM tblData As T1 WHERE T1.HashPerson = tblData.HashPerson AND T1.PKID <= tblData.PKID) > 1; ===================================== ----- Original Message ----- From: jwcolby To: 'Access Developers discussion and problem solving' Sent: Monday, May 28, 2007 20:36 Subject: Re: [AccessD] deduplication A.D. (or anyone willing to chime in) I am attempting to implement your solution. I created the following select which should be the sub query in the outer WHERE: SELECT count(*) FROM tblData AS tblData1 WHERE ((tblData1.HashPerson = tblData.HashPerson) AND (tblData1.PKID <=TblData.ID)) > 1; However I am getting the very infamous Incorrect syntax near '>'. Any thoughts? John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL Sent: Monday, January 15, 2007 12:05 AM To: Access Developers discussion and problem solving Subject: Re: [AccessD] deduplication Sample query as given below, would remove all duplicates (other than the first occurrence for each case). T_Data is the name of table, while F1, F2 & F3 are the names of fields, whose combined value determines whether a record is duplicate or not. ID is the primary key (number type). A.D.Tejpal --------------- ===================================== DELETE * FROM T_Data WHERE (SELECT Count(*) FROM T_Data As T1 WHERE (T1.F1 & T1.F2 & T1.F3 = T_Data.F1 & T_Data.F2 & T_Data.F3) AND (T1.ID <= T_Data.ID)) > 1; ===================================== ----- Original Message ----- From: JWColby To: 'Access Developers discussion and problem solving' ; dba-sqlserver at databaseadvisors.com Sent: Monday, January 15, 2007 09:19 Subject: [AccessD] deduplication Is there a SQL syntax or method for identifying duplicate records based on and automatically removing (or marking) the second and subsequent instances of that record. I am working with a database where there may be several records for the same person. Each record will have slightly different data in the record. I need to find all the records for a given person, remove or somehow flag the "duplicate" records, and eventually "merge" the duplicate records together. I know how to use groupby and count to find the dupes but I don't really know how to then merge fields F through M from records 2,3 and 4 back into fields F through M in record 1. This is going to be an ongoing project so I need to learn the methodology to get it done. I can do it with recordsets and code, but not with SQL, and given the large potential numbers of records SQL would be my first choice. John W. Colby Colby Consulting www.ColbyConsulting.com -- AccessD mailing list AccessD at databaseadvisors.com http://databaseadvisors.com/mailman/listinfo/accessd Website: http://www.databaseadvisors.com