[AccessD] deduplication

jwcolby jwcolby at colbyconsulting.com
Tue May 29 06:18:30 CDT 2007


A.D.

It is possible that the reason for the problem is the environment.  Due to
the size of the databases, I do all of this stuff in SQL Server.  All I know
is that when I do the "verify sql syntax" it returns an error.  

I am still working on it.  I pulled a table of just the duplicates.  I could
send just those in a separate database container, zipped up.  I'll let you
know if I need to do that.  I have to get this stuff working today or I will
be in trouble. 


John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
Sent: Tuesday, May 29, 2007 1:35 AM
To: Access Developers discussion and problem solving
Cc: ADT
Subject: Re: [AccessD] deduplication

John,

    PKID does not have to be sequential. All that is required is that it
should be unique. For all duplicate records pertaining to a given
HashPerson, the subquery generates calculated sequential numbers, enabling
deletion of all duplicates except the first one. 

    The delete query using count subquery, as suggested by me, was duly
tested and found to perform satisfactorily on Access 2003. The other
alternative, using a subquery with NOT IN clause, as suggested by Gustav in
his second post, is also found to perform smoothly. 

    Subquery with NOT IN clause is sometimes slow. However, in this
particular case, it might even prove faster than Count(*) subquery as the
cross-check involved is essentially against a single record.

    Either of the two alternative styles of delete query should have given
the desired results at your end. 

    If you are still stuck, and are in a position to send a zipped copy of
your table tblData with a few records (say 50 to 100 records), it could be
tested at my end. While doing so, the following tasks would get performed:

    (a) For each HashPerson, merge values across different records, so as to
fill in blanks if any. This refers to your other thread "Merge records".

    (b) After filling up blanks as per (a) above, delete all duplicates so
that there is only one record per person.

A.D.Tejpal
(adtp at airtelbroadband.in)

  ----- Original Message -----
  From: jwcolby
  To: 'Access Developers discussion and problem solving' 
  Sent: Tuesday, May 29, 2007 02:18
  Subject: Re: [AccessD] deduplication


  A.D.

  I typed in your delete query exactly as specified and it too gave me
  "compile" errors.

  For starters, my PKIDs are not necessarily sequential. The data originated
  in a 65 million record table (tblHSID).  That table had a sequential PKID.
  I subsequently ran the address portion of those 65 million records through
  an address validation package, coming out with about 51 million records
with
  valid addresses.  I created a brand new table (tblAZHSID) with the
original
  PKIDs, plus the name / address info, plus other data added in by the
address
  validation software (delivery route, lat/long etc).  So right there you
can
  see that the PKIDs are missing 14 million numbers out of the original 65
  million sequential numbers.

  I went back in to the tblAZHSID (validated address table) and created
three
  new fields called HashAddr, HashFamily and HashPerson.  I then filled in
  these fields using the hash function in SQL Server 2007 - Hash(Addr + Zip5
+
  zip4) for HashAddr, Hash(Addr + Zip5 + zip4 + Lname) for HashFamily and
  Hash(Addr + Zip5 + zip4 + Lname + Fname) for HashPerson.  I then indexed
  these three fields.

  I then get an "order" where I have to join the validated tblAZHSID back to
  tblHSID using the PKID, and then run WHERE clauses on the various data
  fields back in tblHSID.  This resulted (in this particular order) in a new
  tblData of about 4.3 million records.  This new table has only PKID,
Lname,
  Fname, Addr, City, St, Zip5, Zip4, HashAddr, HashFamily and HashPerson
  fields.  It is used for mailing labels and needs nothing more to fill the
  order.

  It is against this tblData that I am attempting to dedupe.  I will need to
  dedupe to the family level, in other words there should be one and only
one
  record in tblData with any given HashFamily value.  The 4.3 million
records
  have about 600k records with duplicate HashPerson values, usually 2 of the
  same person, but occasionally 3,4,5 or more of the same person.
Obviously,
  any record with the same HashPerson value will also have the same
HashFamily
  value since they have the same last name.  However there will probably be
  even more duplicates in HashFamily since there can be multiple family
  members in the table, Mary Colby and John Colby, both at the same address.

  Now, the PKIDs as I have already discussed are not even close to
  consecutive.  I have a fairly random selection (though not intentionally
so)
  of 4.3 million PKs out of an original 65 million PKs.  

  My job is to dedupe these 4.3 million records to the FAMILY level.  That
may
  in fact leave duplicates an the HashAddr level.  This would occur where
  there are people with different last names living at the same address.  By
  de-duping to the family level though I will in fact also dedupe to the
  person level.  In order to dedupe to the person level I simply use the
  HashPerson field in all of my queries where I am looking for duplicate
Hash
  IDs.  To dedupe to the Family level, I use HashFamily field.  To dedupe to
  the address level (only one label per address) I simply use HashAddr as
the
  HashID field where I look for duplicates.

  Regardless of which level of de-duping I am attempting to accomplish, the
  strategy is the same.  

  1) Select a Hash field to use as the duplicate search mechanism
  2) Find all HASHIDs in the selected hash field where
  count(SelectedHashField) > 1.  
  3) Get PKIDs for all such HashIDs in step 2 above.  
  4) Obtain Max(PKID) using grouping on HashID in step 3 above.  
  5) Obtain PKIDs in step 3 above but not in Max(PKID) query step 4 above.  
  6) Delete records with those PKIDs.

  Or at least that is my brute force method.  

  The brute force method seems to work but obviously takes some serious work
  to make happen on a given table.  I am working on standardizing my field
  names such that I can then use stored procedures to feed in a table name
and
  get a delete to happen.

  John W. Colby
  Colby Consulting
  www.ColbyConsulting.com
  -----Original Message-----
  From: accessd-bounces at databaseadvisors.com
  [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of A.D.TEJPAL
  Sent: Monday, May 28, 2007 4:09 PM
  To: Access Developers discussion and problem solving
  Subject: Re: [AccessD] deduplication

  John,

      If you are in a position to clarify as to what type of output you are
  aiming at (by trying to convert the subquery into a stand-alone select
  query), a way could be found.

      The syntax error faced by you is explained by the fact that in the
  original SQL, the ">" operator was testing the sequential count of
duplicate records as given by the output of subquery (whether more than
one). 

      Now, with removal of subquery part as attempted by you, the WHERE
clause has become mathematically flawed, missing an argument before the last
operator. It looks like:
  WHERE a = b AND c <= d  > 1;

      If you attempt to rectify this part by putting:
  WHERE a = b AND c <= d;
  you will overcome the syntax problem, but run into a new one, getting
  prompted for tblData.

      Moreover, as soon as you change the status from subquery to a stand
alone select query, alias table T1 is no longer able to perform  dynamically
out of synch with tblData. It is this dynamic feature, available via
subquery,  that enables us to get sequential count.

  A.D.Tejpal
  ---------------
--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com




More information about the AccessD mailing list