[dba-VB] How I'm approaching the problem

Sat Jul 21 23:51:17 CDT 2007

LOL.  I have been doing this stuff in SQL Server and VB.Net for about a year
now and so far have not even used a form except to have a button to push.
So much for bound.  Bound is only applicable when the data is being entered
/ displayed.  Mine is not.  When you deal with 90 million records you don't
"view it" (though I can if I want to, it is in SQL Server).

>As for the quantity of record handling capabilities of XML, Banks use
XML...

XML is designed for pushing data around between systems.  I would be very
surprised if banks did much if any data storage or manipulation in XML.
Banks have many different systems, none of which talk to each other.  XML is
a widely recognized standard, designed for pushing data around between
systems that otherwise don't talk well, and is a natural fit (for that).
Banks also have entire departments with monthly budgets for servers, people
and entire buildings to house said departments which could buy my entire
house (in CASH) every week.  I am one person, with one client using this
stuff.  I run all my stuff on dual core desktop machines.  If you think this
is not an issue let me put it this way.  

AN ACTUAL DATA FILE FROM A CUSTOMER:

ACTUAL FIELD NAMES	- 	Data size (characters)
FIPS CODE - 				5
FIPS SUB CODE				3
MUNICIPALITY CODE				3
APN UNFORMATTED				13
APN FORMATTED				13
MORTGAGE INTEREST RATE TYPE CODE	3

There are NINETY columns with names like this.  Now, you can see that the
data is MINISCULE compared to the size of the field names.

NOW, turn that into XML.  Now store NINETY MILLION RECORDS LIKE THAT.  Now
MANIPULATE NINETY MILLION XML RECORDS like that in a desktop machine with 4
gig of ram.

>As for the quantity of record handling capabilities of XML, Banks use
XML...

That is an incredibly ignorant statement.

enough said.

I actually use XML as we speak, to store process log data out to files on my
machine.  It is quick, easy and standard.  But it is tiny amounts of data,
perhaps a Kbyte per file processed.

John <> XML

Not true.  I just choose not to take 70 gigabytes of raw data in a nice neat
| delimited CSV file and turn it into 700 gigabytes of XML and then try to
process that in 2 gigabytes of process memory on a desktop machine.  THAT
sounds like a recipe for tying up my server for the next... Oh... Two weeks.
It takes me roughly 90 seconds per file to import each CSV file into a SQL
Server database.  I have to get it in to SQL Server, that is what my client
hires me to do.  The fact that I export the name / address back out for
address processing immediately is irrelevant, I will do so every month for
the rest of eternity.  Why on god's green earth would I screw around turning
this into a MONSTER XML file (or files) just so I could preprocess a handful
of fields?

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence
Sent: Saturday, July 21, 2007 10:14 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Hi John:

Given:
XML = unbound
John = Bound

Therefore:
John <> XML

As for the quantity of record handling capabilities of XML, Banks use XML...
enough said.

Jim

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Thursday, July 19, 2007 2:02 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

My view of XML is that it just isn't viable for large data sets.  These data
sets contain 5 to 100 MILLION records, with 10 to 700 fields.  Now think
about XML where each field is wrapped with begin / end field name tags.  Any
given data table starts out at 300 megs of DATA.  Now wrap that in 2 Gigs of
XML trash...  Now multiply by 100 files...

I actually do end up parking the rejects, the client wants them for some
reason.  Eventually I will quietly delete them (they have never asked for me
to use them in any way).

In the end though the name / address stuff has to be processed separately.
I cannot simply merge it back in because (remember the 600 other fields) it
turns the inevitable table scan into a 24 hour experience.  Also the
original address may be valid and they just moved.  Stuff like that.

This is a HUGE process, although each individual piece is not so big.  It is
the sheer size of the data that makes it hard to manage.

It turns out that the import into SQL server is time consuming but not tough
once I bought a library to do that.  At least the ones I have done so far
are now easy.  The lib pulls the data into arrays and processes chunks.  I
haven't seen the code but I suspect that it does X records at a time.  The
resulting tables are large.  My biggest is 65 million records, 740 fields.
My next biggest is 98 million records, 149 fields.  In the end, the name /
address table is the same size regardless of which raw table the data came
from.