Eric Barro
ebarro at verizon.net
Wed May 9 13:17:01 CDT 2007
VB's weakness is string manipulation, and I believe that is where the speed issue in your white-space stripping operation lies. It is especially evident when you have a loop that concatenates strings. .NET's StringBuilder class is much more efficient.

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of JWColby
Sent: Wednesday, May 09, 2007 10:53 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert

Gustav,

>It stresses that BULK INSERT is mandatory for data volumes at this level.

Oh yea!!! I haven't found any other way that makes an import of this size doable.

>One thing that strikes me, however, is the slow performance of your
>space stripping; 1000 lines/s is not very much. But I guess you do more
>than just removing spaces.

Nope, I just remove the spaces and write the result back out to a pipe-delimited file. I have a pair of classes that do this. One, clsFile, loads the file spec info (file name stuff, from / to dirs, etc.); the other, clsField, uses one instance per field and loads one field of the field spec table. The basic operation is to load the file spec class, then a collection of field spec classes. Each field spec class knows its field name, where in the string its field starts, and how wide its field is.

clsFile then opens a stream object and does a ReadLine into a strLineIn variable. clsFile iterates its collection of clsField instances, and strLineIn is passed in turn to each field class instance. The field class does a Mid$() to pull out precisely the data section that it has to process, stores it in a strData variable / property (pData), and then strips off the leading and trailing spaces.

Once clsFile has read strLineIn and passed it in turn to each clsField, it has a collection of clsField instances, each holding a stripped section of the original strLineIn. clsFile then iterates that clsField collection, appending each clsField.pData plus a "|" to strLineOut. When it has processed every clsField instance it is done assembling strLineOut, which it then writes to an output stream. Line in, parse / strip, line out, repeat until done.

I do a little logging of the file name, the time to do the entire operation on the file, etc. 99.99% of the time is in the parse / strip operation out in the clsField instances.
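In code terms the loop is roughly this (simplified; the real classes hold more state, and the property names here are stand-ins):

    ' Sketch of the per-line parse / strip pass. mcolFields stands in
    ' for the collection of clsField instances loaded from the field
    ' spec table (loading code omitted); pStart / pWidth / pData are
    ' illustrative property names, not the actual ones.
    Dim fso As Object, tsIn As Object, tsOut As Object
    Dim strLineIn As String, strLineOut As String
    Dim mcolFields As Collection
    Dim fld As clsField

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set tsIn = fso.OpenTextFile("D:\Raw\File01.txt", 1)         ' 1 = ForReading
    Set tsOut = fso.CreateTextFile("D:\Clean\File01.csv", True)

    Do While Not tsIn.AtEndOfStream
        strLineIn = tsIn.ReadLine
        strLineOut = ""
        For Each fld In mcolFields
            ' pull exactly this field's slice and strip the padding
            fld.pData = Trim$(Mid$(strLineIn, fld.pStart, fld.pWidth))
            strLineOut = strLineOut & fld.pData & "|"
        Next fld
        tsOut.WriteLine strLineOut
    Loop
    tsIn.Close: tsOut.Close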
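Note that the strLineOut concatenation itself is a cost: each "&" in that inner loop allocates a new string and copies the old one into it. The usual VBA workaround, much the same idea as .NET's StringBuilder, is a buffer allocated once and written in place with the Mid$ statement. Again a sketch, not what the production classes do:

    ' Sketch: build the output line in a preallocated buffer instead of
    ' concatenating, the classic VBA answer to slow "&" in tight loops.
    Dim strBuf As String
    Dim strPiece As String
    Dim lngPos As Long

    strBuf = Space$(4096)            ' assumed worst-case output line width
    lngPos = 1
    For Each fld In mcolFields
        strPiece = Trim$(Mid$(strLineIn, fld.pStart, fld.pWidth)) & "|"
        Mid$(strBuf, lngPos, Len(strPiece)) = strPiece    ' overwrite in place
        lngPos = lngPos + Len(strPiece)
    Next fld
    strLineOut = Left$(strBuf, lngPos - 1)

On a 149-field line that replaces 149 allocate-and-copy operations per output line with one.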
Remember that the time to do this varies with the data and the data file. The first file I did had well over SEVEN HUNDRED fields per line; this specific file had 149. The lines-per-second rate is most heavily influenced by the number of fields per line. Not all of the fields have spaces, but how do I tell? This is a generic solution, so that I can use it on the next file, not custom programming for one specific file.

I think this application will port quite easily to VB.Net, though I haven't done so yet. When I do, I will run the thing again and give comparison numbers. I do hope / expect that VB.Net will be significantly faster in processing the field parse / strip.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
Sent: Wednesday, May 09, 2007 1:18 PM
To: accessd at databaseadvisors.com
Subject: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert

Hi John

Thanks for sharing. Quite a story. It stresses that BULK INSERT is mandatory for data volumes at this level.

One thing that strikes me, however, is the slow performance of your space stripping; 1000 lines/s is not very much. But I guess you do more than just removing spaces.

/gustav

>>> jwcolby at colbyconsulting.com 09-05-2007 19:01 >>>

Just an FYI. The table that I have been building this whole time contains 97.5 million records with exactly 149 (imported) fields, and requires 62.6 GB of data space inside SQL Server. It took 2 hours and 28 minutes just to build the autoincrement PK field after the table finished importing records. The index space for the table (with just this single index) is 101 MB.

There were 56 raw data files, which required 75 GB of disk space to hold. The 56 CSV files created after stripping out the spaces required 40.8 GB of disk space. Thus, by my calculations, about 34 GB of disk space was needed to hold JUST THE SPACES in the original fixed-width files, with the real data occupying 40.8 GB. It is interesting to note that the raw data in the CSV files was 41 GB, while the data space required in SQL Server is 62 GB.

As the process was built over time, I do not have accurate specs for each and every file, but the stripping of spaces off the fields ran at about 1K records / second. Given 97.5 million records, that equates to 97,500 seconds, or about 27 hours, to do the space stripping. That, of course, is done in a VBA application.

Again, I don't have accurate specs for all of the bulk inserts; however, those that I recorded times for summed to 71.2 million records, which took 4674 seconds (1.3 hours) to import using a BULK INSERT statement. That equates to approximately 15K records / second. Remember that this BULK INSERT is importing precleaned data with pipe delimiters. Also remember that while the BULK INSERT itself took 1.3 hours, there was no automation for feeding file names to the sproc; I had to manually edit the sproc each time I wanted to import a new file, so the actual import took much longer, since I wasn't necessarily watching the computer as the last sproc run finished.

So there you go, that is what I have been trying to accomplish these last few weeks.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
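For reference, the import John describes would look something like the sketch below, driven from VBA over ADO so the file names do not have to be hand-edited into the sproc each time. Server, database, table, and path names are all placeholders, and this stands in for, rather than reproduces, his actual sproc:

    ' Sketch: loop the cleaned pipe-delimited files and issue one
    ' BULK INSERT per file. All names here are made up.
    Dim cn As Object
    Dim strFile As String
    Dim strSQL As String

    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=SQLOLEDB;Data Source=MYSERVER;" & _
            "Initial Catalog=Infutor;Integrated Security=SSPI;"
    cn.CommandTimeout = 0                ' these loads run for a long time

    strFile = Dir$("D:\Clean\*.csv")
    Do While Len(strFile) > 0
        strSQL = "BULK INSERT dbo.tblInfutor " & _
                 "FROM 'D:\Clean\" & strFile & "' " & _
                 "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')"
        cn.Execute strSQL                ' one bulk load per cleaned file
        strFile = Dir$                   ' next matching file
    Loop
    cn.Close

Note that BULK INSERT resolves the FROM path on the SQL Server machine, not the client, so the cleaned files have to sit somewhere the server can see.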