[AccessD] deduplication

Gustav Brock Gustav at cactus.dk
Mon Jan 15 06:20:50 CST 2007


Hi John

Ouch, it's that monster again. That explains it.

You might consider extracting the data to temp tables - with only the fields you need.
Again, DAO may be very fast, even with 50 million records - but, of course, not with 700 fields. With JET's 255-field-per-table limit, it would take three tables just to hold them.

If you need to aggregate from 50 mio. records more than a few times, I think you will have to put the data on a server engine.
However, reading through the records writing these to a second table with the ID and the calculated match codes, you should be able from the second table to write to a third table either the IDs to keep or those to delete. Then run a final delete query on the master table.
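The staged approach above can be sketched end to end. This is a minimal illustration using Python's sqlite3 as a stand-in engine; the table names, field names, and match-code formula are all hypothetical, not John's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical master table: a unique ID plus the "person" fields
# that feed the match code.
cur.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, first TEXT, last TEXT, addr TEXT)")
cur.executemany("INSERT INTO master VALUES (?, ?, ?, ?)", [
    (1, "John",  "Colby", "123 Main St 28711 1234"),
    (2, "John",  "Colby", "123 Main St 28711 1234"),  # duplicate person
    (3, "Susan", "Colby", "123 Main St 28711 1234"),  # same family, different person
])

# Second table: the ID and the calculated person match code.
cur.execute("""
    CREATE TABLE matchcodes AS
    SELECT id,
           substr(first, 1, 5) || substr(last, 1, 5) || addr AS person_mc
    FROM master
""")

# Third table: the IDs to keep (here: the lowest ID per match code).
cur.execute("""
    CREATE TABLE keepers AS
    SELECT MIN(id) AS id FROM matchcodes GROUP BY person_mc
""")

# Final delete query against the master table.
cur.execute("DELETE FROM master WHERE id NOT IN (SELECT id FROM keepers)")
con.commit()

remaining = [row[0] for row in cur.execute("SELECT id FROM master ORDER BY id")]
print(remaining)  # → [1, 3]
```

The same three statements translate directly to JET or T-SQL make-table and delete queries.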

/gustav

>>> jwcolby at colbyconsulting.com 15-01-2007 12:47:33 >>>
Gustav,

My problem here is that the database in question is (I believe) outside the
capabilities of Access to link to and understand.  This is that monster db I
have discussed in the past.  It will contain (after preliminary validation
processing) about 50 million records and ~700 fields.  I do have to more
closely define what I mean, but in general, the table has a small handful of
"person" data fields - the typical name address stuff.  It will be this that
defines duplicates.  

The industry that provided the table uses a system called "match codes",
where you take the first X characters of a set of relevant fields and merge
them all into one field.  For example, to create an "address match code",
you might take the Address field plus the Zip5 plus the Zip4.  Given that
the addresses have been validated as deliverable (they have), this
uniquely identifies an address.

The data actually describes people, however, so in addition to the address
you take the first 5 or so characters of the last name plus the
address match code to create a "family" match code, then add the first 5 or
so characters of the first name plus the "family" match code to create a
"person" match code.  All three match codes are useful in various scenarios,
but for "deduping" the database you want to perform this operation at the
"person" level.

So in essence, you create these three different match code fields.  You then
compare the person match code to all other person match codes to find "the
same person" in multiple records.  It is a single-field compare across the
database.
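The three nested match codes can be sketched as plain functions. The 5-character widths and the normalization (uppercase, strip spaces) are illustrative assumptions, not the industry's exact specification:

```python
# Hypothetical match-code builders; widths and normalization are assumed.
def address_mc(address, zip5, zip4):
    """Address match code: address + Zip5 + Zip4, normalized."""
    return f"{address}{zip5}{zip4}".upper().replace(" ", "")

def family_mc(last, address, zip5, zip4):
    """Family match code: first 5 chars of last name + address code."""
    return last[:5].upper() + address_mc(address, zip5, zip4)

def person_mc(first, last, address, zip5, zip4):
    """Person match code: first 5 chars of first name + family code."""
    return first[:5].upper() + family_mc(last, address, zip5, zip4)

a = person_mc("John", "Colby", "123 Main St", "28711", "1234")
b = person_mc("Jon",  "Colby", "123 Main St", "28711", "1234")
print(a == b)  # different first names give different person codes → False
```

Note that the nesting means a person match makes the family and address codes redundant for dedup purposes, which is why the compare happens on the person code alone.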

What you DO is then the crux of the matter.  "Deduplication" is a
catchphrase used to denote an actual process of removing duplicate records
from the database.  However in my life at least, it is a little more
complicated than that.  The reason is that people get in the database
because they filled out a survey somewhere.  Someone took all of these
surveys and started denormalizing them, i.e. adding all of the fields into
one big table (thus the ~700 fields in the table).  

Unfortunately, as often happens in such cases, things don't always go as
intended and people end up in the database several times, with only the
fields for a specific survey filled in for each time they are in the
database.  They might have John Colby in there 3 times, once because he
filled in a survey about what soft drinks he liked, another because of a
survey about what automobile he drove, and another because of a survey
about what electronics he buys.  What is SUPPOSED to happen is that all
three are merged into one "John Colby" record, but alas, this is not always
the case.

Thus my job is to now identify all three of these "John Colby" records and
merge all of the "survey" fields back into a single record, deleting the two
now un-needed extra records in the process.  If it weren't for the fact that
there are about 650 "survey" fields, this would not be such a big task.
Because there are so many survey fields, I really need to do this in the
least human-time-consuming way possible.

So in essence I need to create a view of the data with just the PKID, the
People Match Code, and the survey fields.  I then need to take "john Colby"
record ONE and update the SURVEY FIELDS with the survey fields from "John
Colby" record 2 and 3, then delete "John Colby" record 2 and 3.

Out of ~50 million records, find only the duplicate "sets" for any people in
the database more than once, then "dedupe" those "sets" using the procedure
described above.
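The merge-then-delete step John describes can be sketched as a coalesce over each duplicate set: keep the lowest-PKID record as the survivor and fill its empty survey fields from the later records. Field names here are made up for illustration; the real table has ~650 survey fields:

```python
# A sketch of the merge step, assuming records are dicts keyed by field
# name and that an empty survey field is None or "".
def merge_dupes(records, key="person_mc"):
    """Collapse duplicate records (same match code) into one per person."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["PKID"]):
        mc = rec[key]
        if mc not in merged:
            merged[mc] = dict(rec)           # first record becomes the survivor
        else:
            survivor = merged[mc]
            for field, value in rec.items():
                # Fill only survey fields the survivor is missing.
                if survivor.get(field) in (None, "") and value not in (None, ""):
                    survivor[field] = value
    return list(merged.values())

rows = [
    {"PKID": 1, "person_mc": "JOHNCOLBY123MAINST", "soft_drink": "cola", "auto": None},
    {"PKID": 2, "person_mc": "JOHNCOLBY123MAINST", "soft_drink": None,   "auto": "Ford"},
]
result = merge_dupes(rows)
print(result)  # → one record, with both survey fields filled
```

In SQL terms this is an UPDATE of record 1 from a self-join on the match code, followed by a delete of the non-survivors; the dict version just makes the coalesce rule explicit.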

It is specifically the ~650 fields that cause the issue for Access, as (I
believe) Access cannot handle that many fields at a time.  Of course I could
figure out a way to break the table down into smaller sets of fields, but
that just adds to the human time required to solve the problem.  So I am
trying to discover whether this is a problem that SQL syntax can directly
handle.  If so, then I can have SQL Server do it instead of going to
programming.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 

-----Original Message-----
From: accessd-bounces at databaseadvisors.com 
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
Sent: Monday, January 15, 2007 5:37 AM
To: accessd at databaseadvisors.com 
Subject: Re: [AccessD] deduplication

Hi John

I think you need to tighten this a bit ...

Deleting dupes is quite easy, and so is flagging (updating) the dupes, but -
of course - the outcome would be quite different.
If the dupes are deleted, nothing is left to be merged. Further, what do you
mean by "merge"? Concatenate the field contents? Add more fields to the
remaining single occurrence of every record? Or build a new parent table
with a child table holding a record for each dupe?

As far as I know, there is no general SQL method or syntax for any of these
situations except for deleting dupes if you have a unique key in addition to
the fields of interest containing the duplicated information.
I believe you could easily figure out how to accomplish what you wish in
SQL. That said, I'm not so sure SQL will be fastest; if you can move through
the recordset in a single pass, reading each record and writing to another
recordset as needed, that will be very fast, even with ADO. Wrap it in a
transaction and it may run even faster.
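The single-pass idea Gustav describes - stream through the source once, writing each record to the output only the first time its key is seen - is engine-agnostic. A sketch in Python rather than DAO/ADO, with an assumed key function:

```python
# Single-pass dedup: O(n) over the source, one membership test per row.
# In DAO/ADO this would be a loop over a forward-only recordset writing
# to a second recordset, ideally inside a transaction.
def dedupe_stream(rows, keyfunc):
    seen = set()
    for row in rows:
        k = keyfunc(row)
        if k not in seen:
            seen.add(k)
            yield row

src = [("a", 1), ("b", 2), ("a", 3)]
out = list(dedupe_stream(src, keyfunc=lambda r: r[0]))
print(out)  # → [('a', 1), ('b', 2)]
```

The caveat at 50 million records is the memory for the `seen` set; with match codes written to an indexed table first (as Gustav suggests above), the engine's index does that job instead.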

/gustav

>>> jwcolby at colbyconsulting.com 15-01-2007 04:49:28 >>>
Is there a SQL syntax or method for identifying duplicate records and
automatically removing (or marking) the second and subsequent instances of
each record?
 
I am working with a database where there may be several records for the same
person.  Each record will have slightly different data in the record.  I
need to find all the records for a given person, remove or somehow flag the
"duplicate" records, and eventually "merge" the duplicate records together.

 
I know how to use GROUP BY and COUNT to find the dupes, but I don't really
know how to then merge fields F through M from records 2, 3 and 4 back into
fields F through M in record 1.  This is going to be an ongoing project, so I
need to learn the methodology to get it done.  I can do it with recordsets
and code, but not with SQL, and given the large potential number of records,
SQL would be my first choice.
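The GROUP BY / COUNT step John mentions, run against a toy table via Python's sqlite3 (table and field names are illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE people (pkid INTEGER PRIMARY KEY, person_mc TEXT)")
cur.executemany("INSERT INTO people VALUES (?, ?)",
                [(1, "JOHNCOLBY"), (2, "JOHNCOLBY"), (3, "SUSANCOLBY")])

# HAVING COUNT(*) > 1 restricts the result to the duplicate sets only.
dupes = cur.execute("""
    SELECT person_mc, COUNT(*) AS n
    FROM people
    GROUP BY person_mc
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # → [('JOHNCOLBY', 2)]
```

This finds the duplicate sets; the merge of fields F through M is the part plain SQL does not give you for free, as the rest of the thread works through.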
 
John W. Colby
Colby Consulting
www.ColbyConsulting.com 


