jwcolby
jwcolby at colbyconsulting.com
Sat Jul 21 23:51:17 CDT 2007
LOL. I have been doing this stuff in SQL Server and VB.Net for about a year now and so far have not even used a form except to have a button to push. So much for bound. Bound is only applicable when the data is being entered / displayed. Mine is not. When you deal with 90 million records you don't "view it" (though I can if I want to, it is in SQL Server). >As for the quantity of record handling capabilities of XML, Banks use XML... XML is designed for pushing data around between systems. I would be very surprised if banks did much if any data storage or manipulation in XML. Banks have many different systems, none of which talk to each other. XML is a widely recognized standard, designed for pushing data around between systems that otherwise don't talk well, and is a natural fit (for that). Banks also have entire departments with monthly budgets for servers, people and entire buildings to house said departments which could buy my entire house (in CASH) every week. I am one person, with one client using this stuff. I run all my stuff on dual core desktop machines. If you think this is not an issue let me put it this way. AN ACTUAL DATA FILE FROM A CUSTOMER: ACTUAL FIELD NAMES - Data size (characters) FIPS CODE - 5 FIPS SUB CODE 3 MUNICIPALITY CODE 3 APN UNFORMATTED 13 APN FORMATTED 13 MORTGAGE INTEREST RATE TYPE CODE 3 There are NINETY columns with names like this. Now, you can see that the data is MINISCULE compared to the size of the field names. NOW, turn that into XML. Now store NINETY MILLION RECORDS LIKE THAT. Now MANIPULATE NINETY MILLION XML RECORDS like that in a desktop machine with 4 gig of ram. >As for the quantity of record handling capabilities of XML, Banks use XML... That is an incredibly ignorant statement. enough said. I actually use XML as we speak, to store process log data out to files on my machine. It is quick, easy and standard. But it is tiny amounts of data, perhaps a Kbyte per file processed. John <> XML Not true. I just choose not to take 70 gigabytes of raw data in a nice neat | delimited CSV file and turn it into 700 gigabytes of XML and then try to process that in 2 gigabytes of process memory on a desktop machine. THAT sounds like a recipe for tying up my server for the next... Oh... Two weeks. It takes me roughly 90 seconds per file to import each CSV file into a SQL Server database. I have to get it in to SQL Server, that is what my client hires me to do. The fact that I export the name / address back out for address processing immediately is irrelevant, I will do so every month for the rest of eternity. Why on god's green earth would I screw around turning this into a MONSTER XML file (or files) just so I could preprocess a handful of fields? John W. Colby Colby Consulting www.ColbyConsulting.com -----Original Message----- From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence Sent: Saturday, July 21, 2007 10:14 PM To: dba-vb at databaseadvisors.com Subject: Re: [dba-VB] How I'm approaching the problem Hi John: Given: XML = unbound John = Bound Therefore: John <> XML As for the quantity of record handling capabilities of XML, Banks use XML... enough said. Jim -----Original Message----- From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby Sent: Thursday, July 19, 2007 2:02 PM To: dba-vb at databaseadvisors.com Subject: Re: [dba-VB] How I'm approaching the problem My view of XML is that it just isn't viable for large data sets. These data sets contain 5 to 100 MILLION records, with 10 to 700 fields. Now think about XML where each field is wrapped with begin / end field name tags. Any given data table starts out at 300 megs of DATA. Now wrap that in 2 Gigs of XML trash... Now multiply by 100 files... I actually do end up parking the rejects, the client wants them for some reason. Eventually I will quietly delete them (they have never asked for me to use them in any way). In the end though the name / address stuff has to be processed separately. I cannot simply merge it back in because (remember the 600 other fields) it turns the inevitable table scan into a 24 hour experience. Also the original address may be valid and they just moved. Stuff like that. This is a HUGE process, although each individual piece is not so big. It is the sheer size of the data that makes it hard to manage. It turns out that the import into SQL server is time consuming but not tough once I bought a library to do that. At least the ones I have done so far are now easy. The lib pulls the data into arrays and processes chunks. I haven't seen the code but I suspect that it does X records at a time. The resulting tables are large. My biggest is 65 million records, 740 fields. My next biggest is 98 million records, 149 fields. In the end, the name / address table is the same size regardless of which raw table the data came from.