[dba-VB] How I'm approaching the problem

jwcolby jwcolby at colbyconsulting.com
Thu Jul 19 16:02:26 CDT 2007


My view of XML is that it just isn't viable for large data sets.  These data
sets contain 5 to 100 MILLION records, with 10 to 700 fields.  Now think
about XML where each field is wrapped with begin / end field name tags.  Any
given data table starts out at 300 megs of DATA.  Now wrap that in 2 Gigs of
XML trash...  Now multiply by 100 files...

I actually do end up parking the rejects, the client wants them for some
reason.  Eventually I will quietly delete them (they have never asked for me
to use them in any way).

In the end though the name / address stuff has to be processed separately.
I cannot simply merge it back in because (remember the 600 other fields) it
turns the inevitable table scan into a 24 hour experience.  Also the
original address may be valid and they just moved.  Stuff like that.

This is a HUGE process, although each individual piece is not so big.  It is
the sheer size of the data that makes it hard to manage.

It turns out that the import into SQL server is time consuming but not tough
once I bought a library to do that.  At least the ones I have done so far
are now easy.  The lib pulls the data into arrays and processes chunks.  I
haven't seen the code but I suspect that it does X records at a time.  The
resulting tables are large.  My biggest is 65 million records, 740 fields.
My next biggest is 98 million records, 149 fields.  In the end, the name /
address table is the same size regardless of which raw table the data came
from.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Charlotte Foust
Sent: Thursday, July 19, 2007 4:04 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

I see.  You had mentioned that before but I forgot (blame my age LOL).
I know about that kind of list from bitter experience, although the biggest
ones I ever had to import were less than 100K names per list.
But I also didn't have the benefits of .Net and had to work strictly in
Access 97, which didn't make it fun.  I still don't see the need to import
the data into a database immediately when you could manipulate them in xml
form until you got them the way you wanted and THEN import the data once and
for all.  You could park the rejects in another database for reference.

Charlotte

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Thursday, July 19, 2007 12:56 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

>Why import them and then export them back out right away?

Because these lists can come from anywhere.  I don't know until they are in
the database what the field names even are.  ATM I am building a table that
maps "their" field name to my field name, then using that table to do the
export back out to CSV for address processing.  

Furthermore these lists ALWAYS have name / address fields PLUS 10 to 600
OTHER fields which are demographics.  Stuff like age / income, race, age,
number of children, preferences in products etc.  These are marketing lists,
collected from all over the universe.  Thus the data has to be imported into
a raw data table because I use that demographics information for where
clauses in selection queries.  The name / address has to go back out for
address processing because who knows if the addresses were EVER valid?  We
get mis-spellings, but we also get intentional bad data - John Colby at 1234
None of your business lane.
Obviously we don't want to spend money sending mail to "none of your
business Lane".

Remember, these lists come from all over the (United States) universe.
We have no clue as to the validity of the address.  If the address is bad
(non-deliverable) then the entire data record is tossed.

Once I export and do address validation, then we know that we can at least
deliver mail to the address.  We still don't know that the person really
lives there, or even that the name is valid.  It might be "Micky Mouse" at
1723 Twin Pines Drive.  Or the person might have lived there 10 years ago
when they took a survey about smoking (for example) but moved 3 months
later...

My job is to attempt to filter the trash as best I can, and address
validation is the one major filter point that really works, but it only
works on the address itself.  OTOH, some of the lists I get are
"pre-validated" in some fashion.  For example I just got a list of 100
million deeds.  The data is "guaranteed to be correct" although even then
the guarantee is not absolute.  But it is far better than "micky mouse at
1234 none of your business way".

Have you ever joined a web site where they INSISTED on getting your full
name, address, phone, email, etc. and wouldn't save the form unless you
filled in each and every field?  And have you ever said to yourself - "I
want to join this place but I do not want them having all of that".  I have,
and guess what I use?  Micky mouse at 1234 None of your business way...

So when they sell that name to the wrong person I don't get mail from them.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Charlotte Foust
Sent: Thursday, July 19, 2007 3:37 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Why import them and then export them back out right away?  You could build
the hash in a calculated column of a dataset before you import it in the
first place, I think.  Read it into a typeddataset and add a calculated
column for the hash.  The import the enhanced typeddataset into your table,
which includes a column for the hash code.

Charlotte 

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Thursday, July 19, 2007 12:32 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

I will be doing a lot of processing.  There are several scenarios as to the
"source".  It all starts with raw data files brought in from DVDs containing
text files.  Once imported to a database they are immediately exported back
out for address processing.  Once processed they are immediately reimported,
in my current method I import into a separate database.  I also build a hash
of the address, family and person which goes into 3 fields in the address
database / table.  Sometime in the future, those hashes will be compared
across the databases to see where the various objects match, a person in
this table is the same as a person in that table.  An address in this table
is the same as an address in that table.  That will allow me to build a
normalized system where I can "see" data about the same person, family or
address across lists.

Additionally, the address lists will be exported back out for address
processing periodically (monthly?) to find moves.


John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Charlotte Foust
Sent: Thursday, July 19, 2007 3:07 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Are you doing any processing on the data, besides importing it into a table?

Charlotte 

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Thursday, July 19, 2007 11:42 AM
To: dba-vb at databaseadvisors.com
Subject: [dba-VB] How I'm approaching the problem

The problem with that is it takes a LOT of explaining to even get started.

I am trying to develop a system where I can select a SQL Server instance on
the network.  Once I select a server, I need to select a database if one
exists, or build one if it does not exist.  Once I select (or build) a
database I need to select a table if one exists or build one if it does not
exist.  

Imagine the following scenario:

I have a directory of CSV files that I need to import into SQL Server.
I need to specify the server instance, the database, and a table into which
I want to import the data.  The server instance has to already exist
obviously.  However the database may or may not already exist.  I need to
select the server instance and create the db if none already else select the
db.  Either way I have then selected a db and then need to see if a table
exists, if not create it.  In the end select it.

So...

I have a form.  On the form I have:

1) A control to specify the path to the database files.
2) A control to select a server instance.
3) A control to select a database.
4) A control to select a table.

5) A control to specify a path to the DATA files to be imported.


#2 above needs code (and a combo as the control) to select the server.

#3 above needs code and a combo as the control to select a database within
that server

#4 above needs code and a combo to select a table within that database

1 and 5 already have controls and code to select a directory.

The entire form needs (already has as of today) code to store the current
settings as the form closes and read back out when the form opens.

A byproduct of the way I work (I need to get real work done) is that I often
build systems like this bottom up instead of top down, though I think top
down.  Thus I build code that does the bottom piece, passing in hard coded
values for server / database / table and then come back and either EDIT
those values for the next set of data or BUILD the pieces to select those
data dynamically (where I am now).

So I have a program already in place that imports an entire directory of CSV
files into a server / database / table.  The server / database / table
pieces are passed in but hard coded in the form classes header.  

I am now trying to create the user interface to allow me to use controls to
browse the server / database / tables.  This seems like something that .NET
would just have as a class.  Dim a clsDBBrowser tree control and click on
the node, and back comes the values in a form that could be used in an ADO
connect string.  Given that I do not see that this is true, I am building my
own (not a tree control atm, but separate combos).  A tree that looks
identical to what already exists in SQL Server (except with a server node)
would be so much cleaner.

So there you have it, what I am trying to accomplish and how I am going
about it.  I have found code to obtain a server list and am working on
getting it functioning.  Why does it NEVER "just work"?  Once that is
working I have seen hints of code that obtains a list of databases in a
server, tables in a database etc.  I am working my way down the food chain
as we speak.  In each case except the server I need to be able to "type in"
a value if I don't see what I need.  The database will just be created (I
have code to do that) but the table has to be created based on the first row
of a CSV file (I have code written to do that as well).  Bits and pieces
written, filling in the rest.  LOTS of work.

However once I get this finished, I will have a system where I can "find /
describe" a server / database / table structure, specify where the files
come from and sit back as the files import.  I do this import on a weekly
basis, often more often than that.  It has to be easy.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Charlotte Foust
Sent: Thursday, July 19, 2007 1:50 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] VB.Net - Option Strict disallows implicit conversion

Well, if you were REALLY lucky, you'd have someone saying "that works, but I
think I'd do it this way".  There isn't a single right way, there are just
more suitable ways for the present problem.  That's what can get really
maddening.  Of course you do have this list to play in, and it might help
others to see how you're approaching a problem too.

Charlotte 

_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com


_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com

_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com


_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com

_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com


_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com




More information about the dba-VB mailing list