[AccessD] deduplication

Stuart McLachlan stuart at lexacorp.com.pg
Mon Jan 15 06:31:44 CST 2007


How about creating a new table with the final data and then deleting all of 
the initial data.

It looks as though you want a "Select Distinct" on the match codes and a 
MAX() on all the other fields. 


On 15 Jan 2007 at 6:47, JWColby wrote:

> Gustav,
> 
> My problem here is that the database in question is (I believe) outside the
> capabilities of Access to link to and understand.  This is that monster db I
> have discussed in the past.  It will contain (after preliminary validation
> processing) about 50 million records and ~700 fields.  I do have to more
> closely define what I mean, but in general, the table has a small handful of
> "person" data fields - the typical name address stuff.  It will be this that
> defines duplicates.  
> 
> The industry that provided the table uses a system called "match codes"
> where you take the first X characters of a set of relevant fields and merge
> them all into one field.  For example, to create an "Address match code",
> you might take the Address field, plus the zip5 plus the zip 4.  Given that
> the addresses have been validated as deliverable (they have) then this
> uniquely identifies an address.
> 
> The data actually discusses people however so in addition to the address,
> you have to take the first 5 or so characters of the last name plus the
> address match code to create a "family" match code, then add the first 5 or
> so characters of the first name plus the "family" match code to create a
> person match code.  All three match codes are useful in various scenarios,
> but for "deduping" of the database you want to perform this operation at the
> "person" level.  
> 
> So in essence, you create these three different match code fields.  You then
> compare the person match code to all other person match codes to fine "the
> same person" in multiple records.  It is a single field compare across the
> database.
> 
> What you DO is then the crux of the matter.  "Deduplication" is a
> catchphrase used to denote an actual process of removing duplicate records
> from the database.  However in my life at least, it is a little more
> complicated than that.  The reason is that people get in the database
> because they filled out a survey somewhere.  Someone took all of these
> surveys and started denormalizing them, i.e. adding all of the fields into
> one big table (thus the ~700 fields in the table).  
> 
> Unfortunately, as often happens in such cases, things don't always go as
> intended and people end up in the database several times, with only the
> fields for a specific survey filled in for each time they are in the
> database.  They might have John Colby in there 3 times, once because he
> filled in a survey about what soft drinks he liked, another because of a
> surveys about what automobile he drove, and another because of a survey
> about what electronics he buys.  What is SUPPOSED to happen is that all
> three are merged into one "john Colby" record, but alas, this is not always
> the case.  
> 
> Thus my job is to now identify all three of these "john Colby" records and
> merge all of the "survey" fields back into a single record, deleting the two
> now un-needed extra records in the process.  If it weren't for the fact that
> there are about 650 "survey" fields this would not be such a big task.
> Because there are so many survey fields, I really need to do this in the
> least human time consuming way possible.  
> 
> So in essence I need to create a view of the data with just the PKID, the
> People Match Code, and the survey fields.  I then need to take "john Colby"
> record ONE and update the SURVEY FIELDS with the survey fields from "John
> Colby" record 2 and 3, then delete "John Colby" record 2 and 3.
> 
> Out of ~50 million records, find only the duplicate "sets" for any people in
> the database more than once, then "dedupe" those "sets" using the procedure
> described above.
> 
> It is specifically the ~650 fields that causes the issue for Access as (I
> believe) Access cannot handle that many fields at a time.  Of course I could
> figure out a way to break the table down into smaller sets of fields but
> that just adds to the human time required to solve the problem.  So I am
> trying to discover whether this is a problem that SQL syntax can directly
> handle.  If so, then I can have SQL Server do it instead of going to
> programming.
> 
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
> 
> -----Original Message-----
> From: accessd-bounces at databaseadvisors.com
> [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
> Sent: Monday, January 15, 2007 5:37 AM To: accessd at databaseadvisors.com
> Subject: Re: [AccessD] deduplication
> 
> Hi John
> 
> I think you need to tighten this a bit ...
> 
> Deleting dupes is quite easy, so is to flag (update) the dupes, but - of
> course - the outcome would be quite different. If the dupes are deleted,
> nothing is to be merged. Further, what do you mean by "merge"? Concatenate
> the field contents? Add more fields to the remaining single occurrence of
> every record? Or build a new parent table with a child table with a record
> for each dupe?
> 
> As far as I know, there is no general SQL method or syntax for any of these
> situations except for deleting dupes if you have a unique key in addition to
> the fields of interest containing the duplicated information. I believe you
> easily could figure out how to accomplish what you wish in SQL. That said,
> I'm not so sure SQL will be fastest; if you can move through the recordset
> in a single run reading each record and write to another recordset as
> needed, that will be very fast, even with ADO. Wrap it in a transaction and
> it may run even faster.
> 
> /gustav
> 
> >>> jwcolby at colbyconsulting.com 15-01-2007 04:49:28 >>>
> Is there a SQL syntax or method for identifying duplicate records based on
> and automatically removing (or marking) the second and subsequent instances
> of that record.
> 
> I am working with a database where there may be several records for the same
> person.  Each record will have slightly different data in the record.  I
> need to find all the records for a given person, remove or somehow flag the
> "duplicate" records, and eventually "merge" the duplicate records together.
> 
> 
> I know how to use groupby and count to find the dupes but I don't really
> know how to then merge fields F through M from records 2,3 and 4 back into
> fields F through M in record 1.  This is going to be an ongoing project so I
> need to learn the methodology to get it done.  I can do it with recordsets
> and code, but not with SQL, and given the large potential numbers of records
> SQL would be my first choice.
> 
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com 
> 
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
> 
> -- 
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com

-- 
Stuart





More information about the AccessD mailing list