Stuart McLachlan
stuart at lexacorp.com.pg
Mon Jan 15 06:31:44 CST 2007
How about creating a new table with the final data and then deleting all
of the initial data? It looks as though you want a "Select Distinct" on
the match codes and a MAX() on all the other fields.

On 15 Jan 2007 at 6:47, JWColby wrote:

> Gustav,
>
> My problem here is that the database in question is (I believe) outside
> the capabilities of Access to link to and understand. This is that
> monster db I have discussed in the past. It will contain (after
> preliminary validation processing) about 50 million records and ~700
> fields. I do have to more closely define what I mean, but in general,
> the table has a small handful of "person" data fields - the typical
> name and address stuff. It is this that defines duplicates.
>
> The industry that provided the table uses a system called "match
> codes", where you take the first X characters of a set of relevant
> fields and merge them all into one field. For example, to create an
> "address match code", you might take the Address field plus the Zip5
> plus the Zip4. Given that the addresses have been validated as
> deliverable (they have), this uniquely identifies an address.
>
> The data actually describes people, however, so in addition to the
> address, you take the first 5 or so characters of the last name plus
> the address match code to create a "family" match code, then add the
> first 5 or so characters of the first name plus the "family" match
> code to create a "person" match code. All three match codes are useful
> in various scenarios, but for "deduping" the database you want to
> perform this operation at the "person" level.
>
> So in essence, you create these three match code fields. You then
> compare the person match code to all other person match codes to find
> "the same person" in multiple records. It is a single-field compare
> across the database.
>
> What you DO then is the crux of the matter.
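The match-code construction described above (the first few characters of several fields concatenated into one comparison field) can be sketched as follows. This is a minimal illustration using Python's sqlite3; the table and column names are hypothetical, not taken from the actual database.

```python
import sqlite3

# Hypothetical schema with the handful of "person" fields described
# above; names are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE people (
    pkid INTEGER PRIMARY KEY,
    firstname TEXT, lastname TEXT,
    address TEXT, zip5 TEXT, zip4 TEXT)""")
con.execute("INSERT INTO people VALUES "
            "(1, 'John', 'Colby', '123 Main St', '27501', '1234')")

# Build the three match codes: address, then family (last name prefix +
# address code), then person (first name prefix + family code).
row = con.execute("""
    SELECT address || zip5 || zip4 AS addr_mc,
           substr(lastname, 1, 5) || address || zip5 || zip4 AS family_mc,
           substr(firstname, 1, 5) || substr(lastname, 1, 5)
               || address || zip5 || zip4 AS person_mc
    FROM people WHERE pkid = 1""").fetchone()
print(row[2])  # → JohnColby123 Main St275011234
```

Two records for the same person at the same address then produce identical person match codes, which makes the dupe search a single-field comparison.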
> "Deduplication" is a catchphrase used to denote an actual process of
> removing duplicate records from the database. However, in my life at
> least, it is a little more complicated than that. The reason is that
> people get into the database because they filled out a survey
> somewhere. Someone took all of these surveys and started denormalizing
> them, i.e. adding all of the fields into one big table (thus the ~700
> fields in the table).
>
> Unfortunately, as often happens in such cases, things don't always go
> as intended, and people end up in the database several times, with
> only the fields for a specific survey filled in each time they appear.
> They might have John Colby in there 3 times: once because he filled in
> a survey about what soft drinks he liked, another because of a survey
> about what automobile he drove, and another because of a survey about
> what electronics he buys. What is SUPPOSED to happen is that all three
> are merged into one "John Colby" record, but alas, this is not always
> the case.
>
> Thus my job is now to identify all three of these "John Colby" records
> and merge all of the "survey" fields back into a single record,
> deleting the two now-unneeded extra records in the process. If it
> weren't for the fact that there are about 650 "survey" fields, this
> would not be such a big task. Because there are so many survey fields,
> I really need to do this in the least human-time-consuming way
> possible.
>
> So in essence I need to create a view of the data with just the PKID,
> the person match code, and the survey fields. I then need to take
> "John Colby" record ONE and update the SURVEY FIELDS with the survey
> fields from "John Colby" records 2 and 3, then delete "John Colby"
> records 2 and 3.
>
> Out of ~50 million records, find only the duplicate "sets" for any
> people in the database more than once, then "dedupe" those "sets"
> using the procedure described above.
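The merge-then-delete procedure described above can be sketched in SQL. This is a toy version (three survey fields instead of ~650, hypothetical names, sqlite3 for demonstration); it leans on the fact that MAX() ignores NULLs, so for each duplicate set it collapses the filled-in survey values onto the lowest PKID and then deletes the rest.

```python
import sqlite3

# Toy duplicate set: three rows share a person match code, each with a
# different survey field filled in.  Names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE people (
    pkid INTEGER PRIMARY KEY, person_mc TEXT,
    softdrink TEXT, auto TEXT, electronics TEXT)""")
con.executemany("INSERT INTO people VALUES (?,?,?,?,?)", [
    (1, 'JCOLBY123', 'Cola', None,    None),
    (2, 'JCOLBY123', None,   'Sedan', None),
    (3, 'JCOLBY123', None,   None,    'TV'),
])

# Collapse each duplicate set onto its lowest PKID: MAX() skips NULLs,
# so it picks up whichever record has that survey field filled in.
con.execute("""
    UPDATE people SET
      softdrink   = (SELECT MAX(softdrink)   FROM people p2
                     WHERE p2.person_mc = people.person_mc),
      auto        = (SELECT MAX(auto)        FROM people p2
                     WHERE p2.person_mc = people.person_mc),
      electronics = (SELECT MAX(electronics) FROM people p2
                     WHERE p2.person_mc = people.person_mc)
    WHERE pkid = (SELECT MIN(pkid) FROM people p3
                  WHERE p3.person_mc = people.person_mc)""")

# Delete the now-redundant duplicates.
con.execute("""DELETE FROM people
    WHERE pkid > (SELECT MIN(pkid) FROM people p2
                  WHERE p2.person_mc = people.person_mc)""")

print(con.execute("SELECT * FROM people").fetchall())
# → [(1, 'JCOLBY123', 'Cola', 'Sedan', 'TV')]
```

With 650 survey fields the SET list would be generated programmatically rather than typed, but the statement shape stays the same.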
> It is specifically the ~650 fields that cause the issue for Access,
> as (I believe) Access cannot handle that many fields at a time. Of
> course I could figure out a way to break the table down into smaller
> sets of fields, but that just adds to the human time required to solve
> the problem. So I am trying to discover whether this is a problem that
> SQL syntax can directly handle. If so, then I can have SQL Server do
> it instead of resorting to programming.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: accessd-bounces at databaseadvisors.com
> [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
> Sent: Monday, January 15, 2007 5:37 AM
> To: accessd at databaseadvisors.com
> Subject: Re: [AccessD] deduplication
>
> Hi John
>
> I think you need to tighten this a bit ...
>
> Deleting dupes is quite easy, as is flagging (updating) the dupes,
> but - of course - the outcome would be quite different. If the dupes
> are deleted, nothing is left to be merged. Further, what do you mean
> by "merge"? Concatenate the field contents? Add more fields to the
> remaining single occurrence of every record? Or build a new parent
> table with a child table holding a record for each dupe?
>
> As far as I know, there is no general SQL method or syntax for any of
> these situations except for deleting dupes, if you have a unique key
> in addition to the fields of interest containing the duplicated
> information. I believe you could easily figure out how to accomplish
> what you wish in SQL. That said, I'm not so sure SQL will be fastest;
> if you can move through the recordset in a single run, reading each
> record and writing to another recordset as needed, that will be very
> fast, even with ADO. Wrap it in a transaction and it may run even
> faster.
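The one case Gustav calls easy, deleting dupes when a unique key exists alongside the duplicated fields, can be sketched as a single statement. Again a minimal sqlite3 illustration with hypothetical names:

```python
import sqlite3

# With a unique key (pkid) alongside the duplicated field (person_mc),
# dupe deletion is one statement.  Schema is illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (pkid INTEGER PRIMARY KEY, person_mc TEXT)")
con.executemany("INSERT INTO people VALUES (?,?)",
                [(1, 'A'), (2, 'A'), (3, 'B'), (4, 'A')])

# Keep the lowest pkid per match code; delete everything else.
con.execute("""DELETE FROM people
    WHERE pkid NOT IN (SELECT MIN(pkid) FROM people
                       GROUP BY person_mc)""")

print(con.execute("SELECT pkid FROM people ORDER BY pkid").fetchall())
# → [(1,), (3,)]
```

As Gustav notes, this deletes rather than merges: the survey data on records 2 and 4 would be lost, which is exactly why the merge step has to run first.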
> /gustav
>
> >>> jwcolby at colbyconsulting.com 15-01-2007 04:49:28 >>>
> Is there a SQL syntax or method for identifying duplicate records
> based on one or more fields, automatically removing (or marking) the
> second and subsequent instances of that record?
>
> I am working with a database where there may be several records for
> the same person. Each record will have slightly different data in it.
> I need to find all the records for a given person, remove or somehow
> flag the "duplicate" records, and eventually "merge" the duplicate
> records together.
>
> I know how to use GROUP BY and COUNT to find the dupes, but I don't
> really know how to then merge fields F through M from records 2, 3 and
> 4 back into fields F through M in record 1. This is going to be an
> ongoing project, so I need to learn the methodology to get it done. I
> can do it with recordsets and code, but not with SQL, and given the
> large potential number of records, SQL would be my first choice.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com

-- 
Stuart
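For completeness, the GROUP BY / COUNT dupe-finding step John says he already knows looks like this (sqlite3 sketch, hypothetical names): it returns each match code that appears more than once, i.e. the duplicate "sets" that the merge procedure then has to process.

```python
import sqlite3

# Find the duplicate "sets": match codes appearing more than once.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (pkid INTEGER PRIMARY KEY, person_mc TEXT)")
con.executemany("INSERT INTO people VALUES (?,?)",
                [(1, 'A'), (2, 'A'), (3, 'B')])

dupes = con.execute("""
    SELECT person_mc, COUNT(*) AS n
    FROM people
    GROUP BY person_mc
    HAVING COUNT(*) > 1""").fetchall()
print(dupes)  # → [('A', 2)]
```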