[dba-VB] How I'm approaching the problem

jwcolby jwcolby at colbyconsulting.com
Sun Jul 22 20:53:57 CDT 2007


;-)

Sorry if I snapped buddy.  This whole system is just a tad overwhelming.
There are soooo many pieces and steps and things to do.  I am writing a
system in VB.Net to automate the process, where on a form I can specify the
server, name of a new database and table, and a directory where the files
are stored and the software will do the import from all these files into SQL
Server.  I am building another piece that exports a table (or fields in a
table) out to a set of files in a directory, kind of the inverse of the
first piece.  By running those two pieces in order, I can import, export,
address validate, and re-import all in one operation.  Then I can also schedule
the export / address validate / import on a periodic basis, with luck
completely automated.  All of this has to have process logging so that if
anything fails I can go see what failed, and where in the process.  It also
has to do logging to my billing database so that all this stuff gets billed
to my client automatically, whenever any piece of the process runs.

I am perhaps overly sensitive for a variety of reasons starting with the
fact that I have gotten a lot of flack on the SQL Server list about not
understanding enough SQL Server to do this stuff (true, but when has that
ever stopped me), how the wizards are toys meant for beginners and my needs
far exceed their capabilities (also true) etc.  I am struggling with
learning two entire new systems - SQL Server and VB.Net / ADO.Net AND doing
it on hardware / software that truly is inadequate (or barely adequate) for
the task.  These databases are HUGE compared to any dataset I have encountered
in the past.  I am accustomed to doing systems with hundreds of tables but
under a million records in the largest table.  Here it is a handful of
tables but tens of millions of records in each one.  Desktop machines with
32 bit OS / Sql Server just don't cut it.

On the bright side the quad core machines are out and a price war is on.
The price of memory is dropping like a rock, and I can now build a dual
processor 8 core system with 32 GB and up to 64 GB of RAM for a "reasonable"
price.  It appears that I will be doing so before the end of the year.  I
found 64 bit SQL Server at a price I could afford and now if I can get a
copy of Windows 2003 x64 at a price I can afford (and get it to install and
run - drivers are still an issue) I should finally have a SQL Server system
that will have the oomph to handle my data.

I am a one man show, trying to do a pretty huge job (in my universe anyway)
and I am a little stressed.  But things are finally coming together.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence
Sent: Sunday, July 22, 2007 6:42 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Hi John:

Just a note; I have been working on bank hardware and software installation
for about two years and have worked with and in conversation with many of
the senior software techs when updating their systems. And yes, they do use
XML almost exclusively for transferring data from cashiers' stations, desktops
and bank machines to their centralized storage and banking systems, so that
regardless of what type of OS or hardware is being used, information can be
transferred seamlessly back and forth. The transport and/or VPN are all
heavily encrypted, as you would expect.

I am sorry you felt insulted by the comments but it was totally meant in a
fun way and not in any way designed as a disparaging remark. At worst it was
supposed to bring a smile... but I was hoping you would get a laugh out of
it. I will definitely be more careful in the future when a comment could
potentially be taken in more than one way.

I think in your case it would not be of any advantage to use XML for data
manipulation, as it would just add an extra layer of complexity for no good
reason.

Jim
     

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Saturday, July 21, 2007 9:51 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

LOL.  I have been doing this stuff in SQL Server and VB.Net for about a year
now and so far have not even used a form except to have a button to push.
So much for bound.  Bound is only applicable when the data is being entered
/ displayed.  Mine is not.  When you deal with 90 million records you don't
"view it" (though I can if I want to, it is in SQL Server).

>As for the quantity of record handling capabilities of XML, Banks use
XML...

XML is designed for pushing data around between systems.  I would be very
surprised if banks did much if any data storage or manipulation in XML.
Banks have many different systems, none of which talk to each other.  XML is
a widely recognized standard, designed for pushing data around between
systems that otherwise don't talk well, and is a natural fit (for that).
Banks also have entire departments with monthly budgets for servers, people
and entire buildings to house said departments which could buy my entire
house (in CASH) every week.  I am one person, with one client using this
stuff.  I run all my stuff on dual core desktop machines.  If you think this
is not an issue let me put it this way.  

AN ACTUAL DATA FILE FROM A CUSTOMER:

ACTUAL FIELD NAME                     Data size (characters)
FIPS CODE                             5
FIPS SUB CODE                         3
MUNICIPALITY CODE                     3
APN UNFORMATTED                       13
APN FORMATTED                         13
MORTGAGE INTEREST RATE TYPE CODE      3

There are NINETY columns with names like this.  Now, you can see that the
data is MINUSCULE compared to the size of the field names.

NOW, turn that into XML.  Now store NINETY MILLION RECORDS LIKE THAT.  Now
MANIPULATE NINETY MILLION XML RECORDS like that in a desktop machine with 4
gig of ram.
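To put a number on that blow-up, here is back-of-envelope arithmetic for one record, using just the six fields quoted above. This is a sketch: it ignores the XML declaration and per-record wrapper elements, and real XML element names would need underscores in place of spaces (same lengths either way).

```python
# Character widths of the six sample fields from the customer file above.
fields = {
    "FIPS CODE": 5,
    "FIPS SUB CODE": 3,
    "MUNICIPALITY CODE": 3,
    "APN UNFORMATTED": 13,
    "APN FORMATTED": 13,
    "MORTGAGE INTEREST RATE TYPE CODE": 3,
}
data_bytes = sum(fields.values())
# XML wraps every value as <Name>value</Name>: the name appears twice,
# plus five bracket/slash characters per field.
xml_bytes = sum(2 * len(name) + 5 + width for name, width in fields.items())
print(data_bytes, xml_bytes, round(xml_bytes / data_bytes, 1))
```

For these six fields alone, 40 bytes of data becomes 268 bytes of XML, nearly a 7x blow-up, and that ratio only gets worse across ninety columns with names this long.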

>As for the quantity of record handling capabilities of XML, Banks use
XML...

That is an incredibly ignorant statement.
 
enough said.

I actually use XML as we speak, to store process log data out to files on my
machine.  It is quick, easy and standard.  But it is tiny amounts of data,
perhaps a Kbyte per file processed.
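That kind of small per-file log record is exactly where XML earns its keep. A minimal sketch of the idea, assuming a hypothetical schema (the element names here are mine, not the author's actual log format, and the real code is VB.Net):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def log_entry(process, status, detail):
    """Build one small process-log record as XML: kilobytes, not gigabytes.

    Element names are illustrative assumptions, not the author's schema.
    """
    entry = ET.Element("ProcessLog")
    ET.SubElement(entry, "Process").text = process
    ET.SubElement(entry, "Status").text = status
    ET.SubElement(entry, "Detail").text = detail
    ET.SubElement(entry, "When").text = datetime.now(timezone.utc).isoformat()
    return ET.tostring(entry, encoding="unicode")
```

One such record per file processed is trivial to write, read, and hand to any other tool, which is the use XML was designed for.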

John <> XML

Not true.  I just choose not to take 70 gigabytes of raw data in a nice neat
pipe (|) delimited CSV file and turn it into 700 gigabytes of XML and then try
to process that in 2 gigabytes of process memory on a desktop machine.  THAT
sounds like a recipe for tying up my server for the next... Oh... Two weeks.
It takes me roughly 90 seconds per file to import each CSV file into a SQL
Server database.  I have to get it in to SQL Server, that is what my client
hires me to do.  The fact that I export the name / address back out for
address processing immediately is irrelevant, I will do so every month for
the rest of eternity.  Why on god's green earth would I screw around turning
this into a MONSTER XML file (or files) just so I could preprocess a handful
of fields?

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence
Sent: Saturday, July 21, 2007 10:14 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem


Hi John:

Given:
XML = unbound
John = Bound

Therefore:
John <> XML

As for the quantity of record handling capabilities of XML, Banks use XML...
enough said.

Jim

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Thursday, July 19, 2007 2:02 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

My view of XML is that it just isn't viable for large data sets.  These data
sets contain 5 to 100 MILLION records, with 10 to 700 fields.  Now think
about XML where each field is wrapped with begin / end field name tags.  Any
given data table starts out at 300 megs of DATA.  Now wrap that in 2 Gigs of
XML trash...  Now multiply by 100 files...

I actually do end up parking the rejects, the client wants them for some
reason.  Eventually I will quietly delete them (they have never asked for me
to use them in any way).

In the end though the name / address stuff has to be processed separately.
I cannot simply merge it back in because (remember the 600 other fields) it
turns the inevitable table scan into a 24 hour experience.  Also the
original address may be valid and they just moved.  Stuff like that.

This is a HUGE process, although each individual piece is not so big.  It is
the sheer size of the data that makes it hard to manage.

It turns out that the import into SQL server is time consuming but not tough
once I bought a library to do that.  At least the ones I have done so far
are now easy.  The lib pulls the data into arrays and processes chunks.  I
haven't seen the code but I suspect that it does X records at a time.  The
resulting tables are large.  My biggest is 65 million records, 740 fields.
My next biggest is 98 million records, 149 fields.  In the end, the name /
address table is the same size regardless of which raw table the data came
from.

_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com
