[AccessD] deduplication

Jonathan Gold johnnyclock at gmail.com
Mon May 28 15:05:57 CDT 2007


This can't be it, but I notice that in your statement you have a capital 'T'
(<=TblData.ID) where I expect a lower-case one:

SELECT count(*)
FROM tblData
AS tblData1
WHERE ((tblData1.HashPerson = tblData.HashPerson) AND (tblData1.PKID
<=TblData.ID)) > 1;

On 5/28/07, jwcolby <jwcolby at colbyconsulting.com> wrote:
>
> I created HashID fields which use the SHA1 hash function built in to SQL
> Server 2007.  That gives me one single field to index and pull HashIDs
> from
> which uniquely identifies duplicates using count() grouping
> queries.  These
> hash fields work quite well, though they are unreadable.  The nice part is
> that they reduce multiple field strings - Addr, zip5, zip4 and even Lname
> &
> fName which are potentially 60 or 70 (or more) bytes long down to a string
> only about 20 bytes long.  I then index those.
>
> I am building a query that returns HashIDs where count(HashID) > 0.  Using
> that I pull all the PKIDs for those HashIDs.  Using that I pull Max(PKID).
> You get the picture.  Even starting with the HashID fields it is a
> multi-step process, kinda the 'brute force' approach, and I was hoping to
> come up with something simpler.
>
> So far A.D.s suggestion has refused to compile, regardless of how I muck
> around with it.
>
> I got a delete method working, but it does require running the delete
> query
> until no records delete.  There are cases where I have 4,5, or even more
> duplicates.
>
> Once I figure out what I am doing, I hope to create a stored procedure
> where
> I can input the table name and just have the process run against that
> table.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
> -----Original Message-----
> From: accessd-bounces at databaseadvisors.com
> [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
> Sent: Monday, May 28, 2007 3:29 PM
> To: accessd at databaseadvisors.com
> Subject: Re: [AccessD] deduplication
>
> Hi John
>
> You may get better results by using NOT IN:
>
> DELETE * FROM tblData
> WHERE PKID NOT IN (SELECT TOP 1 FROM tblData As T1 WHERE T1.HashPerson =
> tblData.HashPerson AND T1.PKID <= tblData.PKID ORDER BY T1.PKID);
>
> Air code. Test with caution.
>
> /gustav
>
>
> >>> jwcolby at colbyconsulting.com 28-05-2007 21:00 >>>
> A.D.
>
> It seems to me that the subquery SELECT should be able to stand alone, and
> the subquery is giving a "Incorrect syntax near '>'"
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
> -----Original Message-----
> From: accessd-bounces at databaseadvisors.com
> [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
> Sent: Monday, May 28, 2007 1:29 PM
> To: Access Developers discussion and problem solving
> Subject: Re: [AccessD] deduplication
>
> John,
>
>     Considering the table & field names mentioned by you, the SQL given
> below should get you the desired results.
>
>     You might like to try it out & confirm the outcome.
>
> Best wishes,
> A.D.Tejpal
> ---------------
>
> =====================================
> DELETE * FROM tblData
> WHERE (SELECT Count(*) FROM tblData As T1 WHERE T1.HashPerson =
> tblData.HashPerson AND T1.PKID <= tblData.PKID) > 1;
> =====================================
>
>   ----- Original Message -----
>   From: jwcolby
>   To: 'Access Developers discussion and problem solving'
>   Sent: Monday, May 28, 2007 20:36
>   Subject: Re: [AccessD] deduplication
>
>
>   A.D. (or anyone willing to chime in)
>
>   I am attempting to implement your solution.  I created the following
> select
>   which should be the sub query in the outer WHERE:
>
>   SELECT count(*)
>   FROM tblData
>   AS tblData1
>   WHERE ((tblData1.HashPerson = tblData.HashPerson) AND (tblData1.PKID
>   <=TblData.ID)) > 1;
>
>   However I am getting the very infamous Incorrect syntax near '>'.
>
>   Any thoughts?
>
>   John W. Colby
>   Colby Consulting
>   www.ColbyConsulting.com
>   -----Original Message-----
>   From: accessd-bounces at databaseadvisors.com
>   [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
>   Sent: Monday, January 15, 2007 12:05 AM
>   To: Access Developers discussion and problem solving
>   Subject: Re: [AccessD] deduplication
>
>       Sample query as given below, would remove all duplicates (other than
> the first occurrence for each case).
>
>       T_Data is the name of table, while F1, F2 & F3 are the names of
> fields, whose combined value determines whether a record is duplicate or
> not. ID is the primary key (number type).
>
>   A.D.Tejpal
>   ---------------
>
>   =====================================
>   DELETE * FROM T_Data
>   WHERE (SELECT Count(*) FROM T_Data As T1 WHERE (T1.F1 & T1.F2 & T1.F3 =
>   T_Data.F1 & T_Data.F2 & T_Data.F3) AND (T1.ID <= T_Data.ID)) > 1;
>   =====================================
>
>     ----- Original Message -----
>     From: JWColby
>     To: 'Access Developers discussion and problem solving' ;
>   dba-sqlserver at databaseadvisors.com
>     Sent: Monday, January 15, 2007 09:19
>     Subject: [AccessD] deduplication
>
>     Is there a SQL syntax or method for identifying duplicate records
> based
> on and automatically removing (or marking) the second and subsequent
> instances of that record.
>
>     I am working with a database where there may be several records for
> the
> same person.  Each record will have slightly different data in the record.
> I
> need to find all the records for a given person, remove or somehow flag
> the
> "duplicate" records, and eventually "merge" the duplicate records
> together.
>
>     I know how to use groupby and count to find the dupes but I don't
> really
> know how to then merge fields F through M from records 2,3 and 4 back into
> fields F through M in record 1.  This is going to be an ongoing project so
> I
> need to learn the methodology to get it done.  I can do it with recordsets
> and code, but not with SQL, and given the large potential numbers of
> records
> SQL would be my first choice.
>
>     John W. Colby
>     Colby Consulting
>     www.ColbyConsulting.com
>
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
>



-- 
Jonathan
Berkeley, CA
http://home.pacbell.net/jonnygee/



More information about the AccessD mailing list