[AccessD] deduplication

jwcolby jwcolby at colbyconsulting.com
Mon May 28 15:48:14 CDT 2007


A.D.

I typed in your delete query exactly as specified and it too gave me
"compile" errors.

For starters, my PKIDs are not necessarily sequential. The data originated
in a 65 million record table (tblHSID).  That table had a sequential PKID.
I subsequently ran the address portion of those 65 million records through
an address validation package, coming out with about 51 million records with
valid addresses.  I created a brand new table (tblAZHSID) with the original
PKIDs, plus the name / address info, plus other data added in by the address
validation software (delivery route, lat/long etc).  So right there you can
see that the PKIDs are missing 14 million numbers out of the original 65
million sequential numbers.

I went back in to the tblAZHSID (validated address table) and created three
new fields called HashAddr, HashFamily and HashPerson.  I then filled in
these fields using the hash function in SQL Server 2007 - Hash(Addr + Zip5 +
zip4) for HashAddr, Hash(Addr + Zip5 + zip4 + Lname) for HashFamily and
Hash(Addr + Zip5 + zip4 + Lname + Fname) for HashPerson.  I then indexed
these three fields.

I then get an "order" where I have to join the validated tblAZHSID back to
tblHSID using the PKID, and then run WHERE clauses on the various data
fields back in tblHSID.  This resulted (in this particular order) in a new
tblData of about 4.3 million records.  This new table has only PKID, Lname,
Fname, Addr, City, St, Zip5, Zip4, HashAddr, HashFamily and HashPerson
fields.  It is used for mailing labels and needs nothing more to fill the
order.

It is against this tblData that I am attempting to dedupe.  I will need to
dedupe to the family level, in other words there should be one and only one
record in tblData with any given HashFamily value.  The 4.3 million records
have about 600k records with duplicate HashPerson values, usually 2 of the
same person, but occasionally 3,4,5 or more of the same person.  Obviously,
any record with the same HashPerson value will also have the same HashFamily
value since they have the same last name.  However there will probably be
even more duplicates in HashFamily since there can be multiple family
members in the table, Mary Colby and John Colby, both at the same address.

Now, the PKIDs as I have already discussed are not even close to
consecutive.  I have a fairly random selection (though not intentionally so)
of 4.3 million PKs out of an original 65 million PKs.  

My job is to dedupe these 4.3 million records to the FAMILY level.  That may
in fact leave duplicates an the HashAddr level.  This would occur where
there are people with different last names living at the same address.  By
de-duping to the family level though I will in fact also dedupe to the
person level.  In order to dedupe to the person level I simply use the
HashPerson field in all of my queries where I am looking for duplicate Hash
IDs.  To dedupe to the Family level, I use HashFamily field.  To dedupe to
the address level (only one label per address) I simply use HashAddr as the
HashID field where I look for duplicates.

Regardless of which level of de-duping I am attempting to accomplish, the
strategy is the same.  

1) Select a Hash field to use as the duplicate search mechanism
2) Find all HASHIDs in the selected hash field where
count(SelectedHashField) > 1.  
3) Get PKIDs for all such HashIDs in step 2 above.  
4) Obtain Max(PKID) using grouping on HashID in step 3 above.  
5) Obtain PKIDs in step 3 above but not in Max(PKID) query step 4 above.  
6) Delete records with those PKIDs.

Or at least that is my brute force method.  

The brute force method seems to work but obviously takes some serious work
to make happen on a given table.  I am working on standardizing my field
names such that I can then use stored procedures to feed in a table name and
get a delete to happen.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
Sent: Monday, May 28, 2007 4:09 PM
To: Access Developers discussion and problem solving
Subject: Re: [AccessD] deduplication

John,

    If you are in a position to clarify as to what type of output you are
aiming at (by trying to convert the subquery into a stand-alone select
query), a way could be found.

    The syntax error faced by you is explained by the fact that in the
original SQL, the ">" operator was testing the sequential count of duplicate
records as given by the output of subquery (whether more than one). 

    Now, with removal of subquery part as attempted by you, the WHERE clause
has become mathematically flawed, missing an argument before the last
operator. It looks like:
WHERE a = b AND c <= d  > 1;

    If you attempt to rectify this part by putting:
WHERE a = b AND c <= d;
you will overcome the syntax problem, but run into a new one, getting
prompted for tblData.

    Moreover, as soon as you change the status from subquery to a stand
alone select query, alias table T1 is no longer able to perform  dynamically
out of synch with tblData. It is this dynamic feature, available via
subquery,  that enables us to get sequential count.

A.D.Tejpal
---------------

  ----- Original Message -----
  From: jwcolby
  To: 'Access Developers discussion and problem solving' 
  Sent: Tuesday, May 29, 2007 00:30
  Subject: Re: [AccessD] deduplication


  A.D.

  It seems to me that the subquery SELECT should be able to stand alone, and
the subquery is giving a "Incorrect syntax near '>'" 

  John W. Colby
  Colby Consulting
  www.ColbyConsulting.com 

  -----Original Message-----
  From: accessd-bounces at databaseadvisors.com
  [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
  Sent: Monday, May 28, 2007 1:29 PM
  To: Access Developers discussion and problem solving
  Subject: Re: [AccessD] deduplication

  John,

      Considering the table & field names mentioned by you, the SQL given
below should get you the desired results. 

      You might like to try it out & confirm the outcome.

  Best wishes,
  A.D.Tejpal
  ---------------

  =====================================
  DELETE * FROM tblData
  WHERE (SELECT Count(*) FROM tblData As T1 WHERE T1.HashPerson =
  tblData.HashPerson AND T1.PKID <= tblData.PKID) > 1;
  =====================================

    ----- Original Message -----
    From: jwcolby
    To: 'Access Developers discussion and problem solving' 
    Sent: Monday, May 28, 2007 20:36
    Subject: Re: [AccessD] deduplication

    A.D. (or anyone willing to chime in)

    I am attempting to implement your solution.  I created the following
  select
    which should be the sub query in the outer WHERE:

    SELECT count(*)
    FROM tblData
    AS tblData1
    WHERE ((tblData1.HashPerson = tblData.HashPerson) AND (tblData1.PKID
    <=TblData.ID)) > 1; 

    However I am getting the very infamous Incorrect syntax near '>'.

    Any thoughts?

    John W. Colby
    Colby Consulting
    www.ColbyConsulting.com
--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com




More information about the AccessD mailing list