[dba-SQLServer] Infutor Statistics - was RE: Bulk insert

JWColby jwcolby at colbyconsulting.com
Wed May 9 12:01:01 CDT 2007


Just an FYI.  The table that I have been building this whole time contains
97.5 million records, exactly 149 (imported) fields and requires 62.6
Gigabytes of data space inside of SQL Server. It took 2 hours and 28 minutes
just to build the auto increment PK field after the table was finished
importing records.  The index space for the table (with just this single
index) is 101 Megabytes.  

There were 56 raw data files which required 75 gigabytes of disk space to
hold.  There were 56 CSV files created after stripping out the spaces, which
required 40.8 gigabytes of disk space to hold.  Thus by my calculations,
roughly 34 gigabytes of disk space was required to hold JUST THE SPACES in
the original fixed width files, with the real data occupying 40.8 GB.  It is
interesting to note that the raw data in the CSV files was about 41 GB, while
the data space required inside of SQL Server is 62.6 GB.

As the process was built up over time, I do not have accurate specs for each
and every file, but the process of stripping the spaces off of the fields ran
at about 1K records / second.  Given 97.5 million records, that equates to
roughly 97.5 thousand seconds of space stripping, which is about 27 hours.
That of course is done in a VBA application.  Again I don't have accurate
specs for all of the bulk inserts; however, those that I recorded the times
for summed to 71.2 million records, which took 4674 seconds (1.3 hours) to
import using a BULK INSERT statement, which equates to approximately 15K
records / second.  Remember that this BULK INSERT is importing precleaned
data with pipe delimiters.  Also remember that while the BULK INSERT itself
took 1.3 hours, the lack of automation in feeding the Sproc file names meant
I had to manually edit the Sproc each time I wanted to import a new file, so
the actual import took much longer, since I wasn't necessarily watching the
computer as the last Sproc run finished.
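
For anyone who hasn't used it, the BULK INSERT statement itself is nothing
exotic.  A minimal sketch of what each run looks like (the table and file
names here are stand-ins; the pipe delimiter and the header row of field
names are what my preprocessor writes out):

BULK INSERT dbo.Infutor
FROM 'D:\Imports\Infutor01.csv'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2,    -- skip the header row of field names
    TABLOCK          -- allows minimal logging, much faster on big loads
);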

So there you go, that is what I have been trying to accomplish these last few
weeks.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JWColby
Sent: Wednesday, May 09, 2007 11:02 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] Bulk insert

Robert,

That is much appreciated.  Why don't you work with me instead?  I am all for
doing it "in SQL Server" (I own and use SQL Server 2005) but it has to be
something that I understand and can modify or I will not be able to use it
in the end, which would make your efforts wasted.  

My high level spec:

1) Use a "from directory" which is monitored to pick up files from.  
1a) The from directory changes for each import data set.
2) Each file in the "from directory" will contain identically formatted data,
but the number of records in the file may vary.
3) Each file will be defined by an import spec table, which contains Field
Name, Field Width and Data Type.  If no data type is specified, then
VarChar() is used, with the size of the VarChar() field determined by the
Field Width in the spec table.  This allows me to spec in advance only the
data types that I know are not VarChar(), while still letting me spec as
many as I need before the import.  (See the sketch just after this list.)
4) Once the table is built and populated, a new field called PKID needs to
be built.  PKID will be INT (32 bit) PRIMARY KEY, IDENTITY.
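
Roughly what I have in mind for the spec table and the generated table, as a
sketch (the table and column names here are only placeholders, not the real
thing):

-- Import spec: one row per field in the incoming file
CREATE TABLE dbo.ImportSpec (
    FieldOrder  INT          NOT NULL,
    FieldName   VARCHAR(128) NOT NULL,
    FieldWidth  INT          NOT NULL,
    DataType    VARCHAR(50)  NULL  -- NULL means default to VarChar(FieldWidth)
);

-- Build the column list for the target table from the spec
DECLARE @cols VARCHAR(MAX);
SELECT @cols = COALESCE(@cols + ', ', '')
    + QUOTENAME(FieldName) + ' '
    + COALESCE(DataType, 'VARCHAR(' + CAST(FieldWidth AS VARCHAR(10)) + ')')
FROM dbo.ImportSpec
ORDER BY FieldOrder;

EXEC ('CREATE TABLE dbo.Infutor (' + @cols + ')');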

Understand that at this time I actually have a functioning system.  It takes
a pair of tables which specify the file info (from / to directories, field
delimiter etc) and the field info (field name / start position in the fixed
width data / field length).  This program (written in VBA) does a conversion
from fixed width to a pipe delimited "CSV" file, reading a line, stripping
off the spaces, and writing the stripped data lines back out to a CSV file
in the "TO Directory", complete with the first line containing field names.
At the moment I have a hand constructed table in a hand constructed
database, created by an initial use of the import wizard from inside of
SQL Server, pulling in the first of the CSV files created by my program.
Once that table is created, I use a hand created BULK INSERT Sproc to import
the CSV files.  Once the table is fully populated with the contents of all
the files, I hand build an ALTER TABLE query to add a PKID INT PRIMARY KEY,
IDENTITY field.
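
For reference, that PKID step boils down to a single ALTER TABLE, roughly
like this (dbo.Infutor again stands in for whatever the import created):

-- Add the identity primary key after the load is complete, so the load
-- itself does not pay the identity / index maintenance cost
ALTER TABLE dbo.Infutor
    ADD PKID INT IDENTITY(1,1) NOT NULL
    CONSTRAINT PK_Infutor PRIMARY KEY;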

As you can see, anywhere you see "hand created", that is an area that needs
to be automated.  My thought is that creating the table initially will be
relatively easy, and in fact I know how: build up a make table (CREATE TABLE)
statement with the field names and widths taken from the spec table, roughly
as sketched above.  I just have not done that yet because I did not have the
time.  The next issue is dynamically creating the Sproc that does the BULK
INSERT.
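
That Sproc part should reduce to passing the file name in and building the
BULK INSERT as a string.  Something like this sketch (the procedure and
parameter names are made up, and error handling is omitted):

CREATE PROCEDURE dbo.usp_BulkInsertFile
    @TableName SYSNAME,
    @FilePath  VARCHAR(260)
AS
BEGIN
    DECLARE @sql VARCHAR(MAX);
    -- Build the BULK INSERT statement for whatever file the caller hands us
    SET @sql = 'BULK INSERT ' + QUOTENAME(@TableName)
             + ' FROM ''' + @FilePath + ''''
             + ' WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'','
             + ' FIRSTROW = 2, TABLOCK)';
    EXEC (@sql);
END

Driving it from the VBA side would then just be a matter of looping through
the files in the "from directory" and calling the Sproc once per file, e.g.
EXEC dbo.usp_BulkInsertFile 'Infutor', 'D:\Imports\Infutor02.csv'.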

Now, whether the process of importing the data (inside of SQL Server) strips
off the spaces is really somewhat irrelevant at this point, since I have
working code to do this.  It is not blazing fast at about 1000 lines /
second (for 150 fields) but it is "fast enough".  If I port that to VB.Net I
hope / expect to get a speed increase.  The BULK INSERT Sproc that I hand
build is currently running at about 12K records / sec (for 150 fields).

In the end, this really needs to be an external application driving SQL
Server functionality.  I need a place to go to fill in the import spec
table, set the from / to directories, set up the name of the table, etc.  My
heartache to this point has been my inability to get the SQL Server built-in
import stuff to import the data without the external strip / rebuild step, or
at least without an elaborate dance to work around SQL Server's limitations
in doing that stuff for me.

The very next thing I have to do is start exporting just the name / address
(and PK) of this 100 million record table back out for CASS / NCOA
processing.  This requires exporting 2 million records at a time, to a
destination directory, with a unique file name (Infutor01.csv,
Infutor02.csv...).  Once my CASS program finishes processing I will have a
new set of files in yet another directory that I need to pull back into SQL
Server.  Those files will not require the space stripping piece since they
will not be fixed width.
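
One way I could slice that export is by PKID ranges, shelling out to bcp for
each 2 million record chunk.  A rough sketch only (it assumes xp_cmdshell is
enabled, and the database, server, export path and the dbo.InfutorNameAddr
view holding just the PK / name / address columns are all placeholder names):

DECLARE @start INT, @max INT, @batch INT, @fileNum INT, @cmd VARCHAR(1000);

SELECT @start = 1, @batch = 2000000, @fileNum = 1, @max = MAX(PKID)
FROM dbo.InfutorNameAddr;

WHILE @start <= @max
BEGIN
    -- Export one 2 million record slice to a uniquely numbered CSV via bcp
    SET @cmd = 'bcp "SELECT * FROM MyDB.dbo.InfutorNameAddr WHERE PKID '
             + 'BETWEEN ' + CAST(@start AS VARCHAR(12)) + ' AND '
             + CAST(@start + @batch - 1 AS VARCHAR(12))
             + '" queryout "D:\Export\Infutor'
             + RIGHT('0' + CAST(@fileNum AS VARCHAR(3)), 2)
             + '.csv" -c -t, -T -S MYSERVER';
    EXEC master.dbo.xp_cmdshell @cmd;

    SELECT @start = @start + @batch, @fileNum = @fileNum + 1;
END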

I do appreciate all of the advice from all the folks out there that have
contributed.  I am slowly but surely learning the pieces and parts that I
need to do this part of my job.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Robert L.
Stewart
Sent: Wednesday, May 09, 2007 9:31 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] Bulk insert

John,

I am working on an example of doing this with SQL Server for you.  But,
since I have the same amount of free time as you do, it is going to take a
week or so to complete.  I am using the same concept as you did, with a
couple of tables to hold the import spec and the column definitions for the
spec.  I am going to build it for only one table, to show how it can be done.
The rest will be up to you if you want to expand it.

Robert

At 09:27 PM 5/8/2007, you wrote:
>Date: Tue, 8 May 2007 13:40:59 -0400
>From: "JWColby" <jwcolby at colbyconsulting.com>
>Subject: Re: [dba-SQLServer] Bulk insert
>To: <dba-sqlserver at databaseadvisors.com>
>Message-ID: <00d901c79198$0a703210$657aa8c0 at m6805>
>Content-Type: text/plain;       charset="us-ascii"
>
>Jim,
>
>At this point it is not necessary.  I built a preprocessor in a few 
>hours using my toy (Access).  My toy application handles everything 
>exactly as described.  Someday (soon I hope) I will port that to VB.Net 
>which I hope will be much quicker in the preprocessing department.
>Then I will be considered by some as being a real man, playing with 
>real tools.  ;-) Others will still consider me a child, playing with 
>toys because I didn't take it straight to C#.  SOMEDAY (far in the
>future) perhaps I will embed those pieces directly in CLR programming 
>inside of SQL Server 2005.  Then I will be able to look down my nose at
>those children still playing with toys.
>
>For now, it works, and with the addition of driving the Sproc from the
>VBA it will be an integrated application like what I described.
>
>John W. Colby


_______________________________________________
dba-SQLServer mailing list
dba-SQLServer at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
http://www.databaseadvisors.com
