JWColby
jwcolby at colbyconsulting.com
Mon Jan 15 05:47:33 CST 2007
Gustav,

My problem here is that the database in question is (I believe) outside the capabilities of Access to link to and understand. This is that monster db I have discussed in the past. After preliminary validation processing it will contain about 50 million records and ~700 fields.

I do have to define "duplicate" more closely, but in general the table has a small handful of "person" data fields - the typical name and address stuff - and it is these that define duplicates. The industry that provided the table uses a system called "match codes", where you take the first X characters of a set of relevant fields and merge them all into one field. For example, to create an "address match code" you might take the first characters of the Address field plus the Zip5 plus the Zip4. Given that the addresses have been validated as deliverable (they have), this uniquely identifies an address. The data actually describes people, however, so in addition to the address you take the first 5 or so characters of the last name plus the address match code to create a "family" match code, then the first 5 or so characters of the first name plus the family match code to create a "person" match code. All three match codes are useful in various scenarios, but for deduping the database you want to work at the "person" level. So in essence you create these three match code fields, then compare each person match code to all other person match codes to find "the same person" in multiple records. It is a single-field compare across the database.

What you DO then is the crux of the matter. "Deduplication" is a catchphrase for the process of removing duplicate records from a database, but in my life at least it is a little more complicated than that. People get into this database because they filled out a survey somewhere. Someone took all of these surveys and started denormalizing them, i.e. adding all of the fields into one big table (thus the ~700 fields). Unfortunately, as often happens in such cases, things don't always go as intended, and people end up in the database several times, with only the fields for one specific survey filled in each time. John Colby might be in there three times: once because he filled in a survey about what soft drinks he likes, again because of a survey about what automobile he drives, and again because of a survey about what electronics he buys. What is SUPPOSED to happen is that all three are merged into one "John Colby" record, but alas, this is not always the case.

Thus my job is to identify all three of these "John Colby" records, merge all of the "survey" fields back into a single record, and delete the two now-unneeded extra records in the process. If it weren't for the fact that there are about 650 survey fields, this would not be such a big task. Because there are so many survey fields, I really need to do this with as little human time as possible. So in essence I need to create a view of the data with just the PKID, the person match code, and the survey fields. I then need to take "John Colby" record one, update its survey fields with the survey fields from "John Colby" records two and three, and then delete records two and three. Out of ~50 million records: find only the duplicate "sets" for any people in the database more than once, then dedupe those sets using the procedure described above.

It is specifically the ~650 fields that cause the issue for Access, as (I believe) Access cannot handle that many fields at a time. Of course I could figure out a way to break the table down into smaller sets of fields, but that just adds to the human time required to solve the problem. So I am trying to discover whether this is a problem that SQL syntax can handle directly. If so, then I can have SQL Server do it instead of going to programming.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
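A minimal sketch of the three match codes described above, using Python string slicing to stand in for the equivalent SQL expressions. The field names and prefix lengths (10 characters of address, 5 of last and first name) are illustrative assumptions, not the actual column names or lengths used in the real table.

```python
# Hypothetical match-code builders; prefix lengths are assumptions.

def address_match_code(address: str, zip5: str, zip4: str) -> str:
    """First 10 characters of the deliverable address plus Zip5 plus Zip4."""
    return (address[:10] + zip5 + zip4).upper().replace(" ", "")

def family_match_code(last_name: str, addr_code: str) -> str:
    """First 5 characters of the last name plus the address match code."""
    return last_name[:5].upper() + addr_code

def person_match_code(first_name: str, fam_code: str) -> str:
    """First 5 characters of the first name plus the family match code."""
    return first_name[:5].upper() + fam_code

addr   = address_match_code("123 Main Street", "12345", "6789")
fam    = family_match_code("Colby", addr)
person = person_match_code("John", fam)
```

Note that because only a prefix of each field is used, minor spelling variants of the same deliverable address (e.g. "123 Main St" vs. "123 Main Street") can still collapse to the same code, which is the point of the scheme.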
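The find-dupes / merge / delete cycle described above can be sketched in plain SQL. The following toy sketch runs against an in-memory SQLite database as a stand-in for SQL Server (SQL Server syntax differs in places, e.g. `UPDATE ... FROM`); the table and column names are invented, and two survey fields stand in for the ~650 real ones.

```python
import sqlite3

# Invented schema: pkid, a person match code, and two stand-in survey fields.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE people (
    pkid      INTEGER PRIMARY KEY,
    person_mc TEXT,   -- person match code
    survey_a  TEXT,   -- stand-ins for the ~650 survey fields
    survey_b  TEXT)""")
cur.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", [
    (1, "JOHNCOLBY", "cola", None),     # soft-drink survey
    (2, "JOHNCOLBY", None,   "sedan"),  # automobile survey
    (3, "JANEDOE",   "tea",  None),     # not a duplicate
])

# 1. Find the duplicate sets: person match codes occurring more than once.
dupes = cur.execute("""SELECT person_mc FROM people
                       GROUP BY person_mc HAVING COUNT(*) > 1""").fetchall()

# 2. For each set, fold the survey fields of the extra records into the
#    lowest-pkid record (COALESCE keeps an existing value, otherwise pulls
#    the value from a duplicate), then delete the now-redundant records.
for (mc,) in dupes:
    cur.execute("""
        UPDATE people SET
            survey_a = COALESCE(survey_a,
                (SELECT p2.survey_a FROM people p2
                 WHERE p2.person_mc = ? AND p2.pkid <> people.pkid
                   AND p2.survey_a IS NOT NULL LIMIT 1)),
            survey_b = COALESCE(survey_b,
                (SELECT p2.survey_b FROM people p2
                 WHERE p2.person_mc = ? AND p2.pkid <> people.pkid
                   AND p2.survey_b IS NOT NULL LIMIT 1))
        WHERE person_mc = ?
          AND pkid = (SELECT MIN(pkid) FROM people WHERE person_mc = ?)""",
        (mc, mc, mc, mc))
    cur.execute("""DELETE FROM people WHERE person_mc = ?
                   AND pkid <> (SELECT MIN(pkid) FROM people
                                WHERE person_mc = ?)""", (mc, mc))
con.commit()
```

With ~650 survey fields, the SET list would of course not be written by hand; on SQL Server one would presumably generate it from the column catalog (INFORMATION_SCHEMA.COLUMNS) as dynamic SQL.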
-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
Sent: Monday, January 15, 2007 5:37 AM
To: accessd at databaseadvisors.com
Subject: Re: [AccessD] deduplication

Hi John

I think you need to tighten this a bit ... Deleting dupes is quite easy, and so is flagging (updating) the dupes, but - of course - the outcome would be quite different. If the dupes are deleted, nothing is left to be merged. Further, what do you mean by "merge"? Concatenate the field contents? Add more fields to the remaining single occurrence of every record? Or build a new parent table with a child table holding a record for each dupe?

As far as I know, there is no general SQL method or syntax for any of these situations, except for deleting dupes when you have a unique key in addition to the fields of interest containing the duplicated information. I believe you could easily figure out how to accomplish what you wish in SQL. That said, I'm not so sure SQL will be fastest; if you can move through the recordset in a single run, reading each record and writing to another recordset as needed, that will be very fast, even with ADO. Wrap it in a transaction and it may run even faster.

/gustav

>>> jwcolby at colbyconsulting.com 15-01-2007 04:49:28 >>>
Is there a SQL syntax or method for identifying duplicate records and automatically removing (or marking) the second and subsequent instances of each? I am working with a database where there may be several records for the same person, each with slightly different data in it. I need to find all the records for a given person, remove or somehow flag the "duplicate" records, and eventually "merge" the duplicate records together.

I know how to use GROUP BY and COUNT to find the dupes, but I don't really know how to then merge fields F through M from records 2, 3 and 4 back into fields F through M in record 1. This is going to be an ongoing project, so I need to learn the methodology to get it done. I can do it with recordsets and code, but not with SQL, and given the potentially large number of records, SQL would be my first choice.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

--
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com