William Hindman
wdhindman at dejpolsystems.com
Wed May 9 15:50:39 CDT 2007
...go ahead ...tell us you did this fixed price :)

William Hindman

----- Original Message -----
From: "JWColby" <jwcolby at colbyconsulting.com>
To: <dba-sqlserver at databaseadvisors.com>; "'Access Developers discussion and problem solving'" <accessd at databaseadvisors.com>
Sent: Wednesday, May 09, 2007 1:01 PM
Subject: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert

> Just an FYI. The table that I have been building this whole time contains 97.5 million records, exactly 149 (imported) fields, and requires 62.6 GB of data space inside of SQL Server. It took 2 hours and 28 minutes just to build the auto-increment PK field after the table finished importing records. The index space for the table (with just this single index) is 101 MB.
>
> There were 56 raw data files, which required 75 GB of disk space to hold. There were 56 CSV files created after stripping out the spaces, which required 40.8 GB of disk space to hold. Thus, by my calculations, roughly 34 GB of disk space was required to hold JUST THE SPACES in the original fixed-width files, with the real data occupying 40.8 GB. It is interesting to note that the raw data in the CSV files was 41 GB, while the data space required in SQL Server is 62 GB.
>
> As the process was built over time, I do not have accurate specs for each and every file, but the process of stripping the spaces off of the fields ran at about 1K records/second. Given 97.5 million records, this equates to 97.5 thousand seconds to do the space stripping, which is about 27 hours. That, of course, is done in a VBA application. Again, I don't have accurate specs for all of the bulk inserts; however, those that I recorded the times for summed to 71.2 million records, which took 4674 seconds (1.3 hours) to import using a BULK INSERT statement — approximately 15K records/second.
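[Editor's note: the space-stripping step John describes — slicing fixed-width fields, trimming the padding, and writing pipe-delimited lines — can be sketched as follows. His preprocessor was written in VBA; this is a minimal Python stand-in, and the field names and widths are hypothetical, not from his actual import-spec table.]

```python
# Sketch of the fixed-width -> pipe-delimited preprocessing step.
# FIELD_SPEC rows are (name, start_position, width); these example
# fields are illustrative stand-ins for the import-spec table.
FIELD_SPEC = [("FName", 0, 15), ("LName", 15, 20), ("Zip", 35, 5)]

def strip_record(line, spec):
    """Slice each fixed-width field out of a line and strip its padding spaces."""
    return "|".join(line[start:start + width].strip()
                    for _, start, width in spec)

def convert_file(src_path, dst_path, spec):
    """Convert one fixed-width file to a pipe-delimited 'CSV' with a header row."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write("|".join(name for name, _, _ in spec) + "\n")
        for line in src:
            dst.write(strip_record(line.rstrip("\n"), spec) + "\n")
```

Since the per-record work is just slicing and trimming, throughput in a compiled or JIT-compiled language should comfortably beat the ~1K records/second the VBA version achieved.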
> Remember that this BULK INSERT is importing precleaned data with pipe delimiters. Also remember that while the BULK INSERT itself took 1.3 hours, due to the lack of automation in feeding the sproc file names I had to manually edit the sproc each time I wanted to import a new file, so the actual import took much longer, since I wasn't necessarily watching the computer as the last sproc run finished.
>
> So there you go, that is what I have been trying to accomplish these last few weeks.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JWColby
> Sent: Wednesday, May 09, 2007 11:02 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] Bulk insert
>
> Robert,
>
> That is much appreciated. Why don't you work with me instead? I am all for doing it "in SQL Server" (I own and use SQL Server 2005), but it has to be something that I understand and can modify, or I will not be able to use it in the end, which would make your efforts wasted.
>
> My high-level spec:
>
> 1) Use a "from directory" which is monitored to pick up files from.
> 1a) The from directory changes for each import data set.
> 2) Each file in the "from directory" will contain identically formatted data, but the number of records in each file may vary.
> 3) Each file will be defined by an import spec table, which contains field name, field width, and data type. If no data type is specified, then VarChar() is used, with the size of the VarChar() field determined by the spec table. This allows me to only have to spec data types (in advance) that I know are not VarChar(), while letting me spec as many as I need before the import.
> 4) Once the table is built and populated, a new field called PKID needs to be built. PKID will be INT (32 bit) PRIMARY KEY, IDENTITY.
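[Editor's note: point 3 of the spec — every column defaulting to VarChar(width) unless a type is given — lends itself to generating the CREATE TABLE DDL from the spec rows. A minimal sketch, in Python generating T-SQL; the table and column names are hypothetical examples, not John's actual schema.]

```python
# Generate a CREATE TABLE statement from import-spec rows of
# (field_name, field_width, data_type_or_None). A None data type
# falls back to VARCHAR(width), per point 3 of the spec.
def build_create_table(table, spec_rows):
    cols = [f"[{name}] {dtype or f'VARCHAR({width})'}"
            for name, width, dtype in spec_rows]
    return f"CREATE TABLE [{table}] (\n    " + ",\n    ".join(cols) + "\n)"

# Illustrative spec: one untyped (defaults to VARCHAR) and one typed column.
ddl = build_create_table("Infutor", [("FName", 15, None), ("Age", 3, "INT")])
```

Point 4 is then a single follow-up statement once all files are loaded, e.g. `ALTER TABLE [Infutor] ADD PKID INT IDENTITY(1,1) PRIMARY KEY`.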
> Understand that at this time I actually have a functioning system. It takes a pair of tables which specify the file info (from / to directories, field delimiter, etc.) and the field info (field name / start position in the fixed-width data / field length). This program (written in VBA) does a conversion from fixed width to a pipe-delimited "CSV" file, reading a line, stripping off the spaces, and writing the stripped data lines back out to a CSV file in the "to directory", complete with a first line containing the field names.
>
> At the moment I have a hand-constructed table in a hand-constructed database, created by an initial use of the wizard from inside of SQL Server, pulling in the first CSV file created by my program. Once that table is created, I use a hand-created BULK INSERT sproc to import the CSV files. Once the table is fully populated with the contents of all the files, I hand-build an ALTER TABLE query to add a PKID INT PRIMARY KEY, IDENTITY field.
>
> As you can see, anywhere you see "hand-created", that is an area that needs to be automated. My thought is that creating the table initially will be relatively easy, and in fact I know how: building up a make-table query with the field names and widths taken from the spec table. I just have not done that because I did not have the time. The next issue is dynamically creating the sproc that does the BULK INSERT.
>
> Now, whether the process of importing the data (inside of SQL Server) strips off the spaces is really somewhat irrelevant at this point, since I have working code to do this. It is not blazing fast, at about 1000 lines/second (for 150 fields), but it is "fast enough". If I port that to VB.Net I hope / expect to get a speed increase.
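[Editor's note: the "hand-created BULK INSERT sproc" John has to edit for every file is the automation gap here. One way to close it is to generate the BULK INSERT statement per file instead of editing a stored procedure. A hedged sketch, building the T-SQL as a string in Python; the table name, path, and option values are assumptions matching his pipe-delimited, header-row file format, not his actual sproc.]

```python
# Generate one BULK INSERT statement per incoming file, so no sproc
# needs hand-editing. FIRSTROW = 2 skips the field-name header line;
# the pipe FIELDTERMINATOR matches the preprocessor's output format.
def build_bulk_insert(table, csv_path):
    return (
        f"BULK INSERT [{table}]\n"
        f"FROM '{csv_path}'\n"
        "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\\n', FIRSTROW = 2)"
    )
```

A driver loop over the monitored "from directory" could then execute one such statement per file via ADO or any SQL Server client, which removes the manual edit-and-wait cycle between files.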
> The BULK INSERT sproc that I hand-build is currently running at about 12K records/sec (for 150 fields).
>
> In the end, this really needs to be an external application driving SQL Server functionality. I need a place to go to fill in the import spec table, set the from / to directories, set up the name of the table, etc. My heartache to this point has been the inability to get the SQL Server built-in import stuff to import the data without the external strip / rebuild step, or at least without an elaborate dance to get around any limitations of SQL Server to do that stuff for me.
>
> The very next thing I have to do is start exporting just the name / address (and PK) of this 100 million record table back out for CASS / NCOA processing. This requires exporting 2 million records at a time, to a destination directory, with unique file names (Infutor01.csv, Infutor02.csv, ...). Once my CASS program finishes processing, I will have a new set of files in yet another directory that I need to pull back in to SQL Server. Those files will not require the space-stripping piece, since they will not be fixed width.
>
> I do appreciate all of the advice from all the folks out there who have contributed. I am slowly but surely learning the pieces and parts that I need to do this part of my job.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Robert L. Stewart
> Sent: Wednesday, May 09, 2007 9:31 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] Bulk insert
>
> John,
>
> I am working on an example of doing this with SQL Server for you. But, since I have the same amount of free time as you do, it is going to take a week or so to complete.
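[Editor's note: the CASS/NCOA export John describes above — 100 million rows out in 2-million-record batches with sequentially numbered file names — is mostly bookkeeping. A minimal sketch of that bookkeeping in Python; the "Infutor" stem and chunk size come from his description, while in practice each batch would page through SQL Server keyed on the PKID range.]

```python
CHUNK = 2_000_000  # records per export file, per John's description

def chunk_file_names(total_rows, chunk=CHUNK, stem="Infutor"):
    """Sequential export file names: Infutor01.csv, Infutor02.csv, ..."""
    n_files = (total_rows + chunk - 1) // chunk   # ceiling division
    return [f"{stem}{i:02d}.csv" for i in range(1, n_files + 1)]

def pkid_ranges(total_rows, chunk=CHUNK):
    """(first_pkid, last_pkid) per batch, assuming PKID is a dense 1-based IDENTITY."""
    return [(lo, min(lo + chunk - 1, total_rows))
            for lo in range(1, total_rows + 1, chunk)]
```

Pairing the two lists gives one `WHERE PKID BETWEEN lo AND hi` export query per output file; 100 million rows yields 50 files.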
> I am using the same concept as you did, with a couple of tables to hold the import spec and the column definitions for the spec. I am only going to build it for one table, to show how it can be done. The rest will be up to you if you want to expand it.
>
> Robert
>
> At 09:27 PM 5/8/2007, you wrote:
>> Date: Tue, 8 May 2007 13:40:59 -0400
>> From: "JWColby" <jwcolby at colbyconsulting.com>
>> Subject: Re: [dba-SQLServer] Bulk insert
>> To: <dba-sqlserver at databaseadvisors.com>
>> Message-ID: <00d901c79198$0a703210$657aa8c0 at m6805>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Jim,
>>
>> At this point it is not necessary. I built a preprocessor in a few hours using my toy (Access). My toy application handles everything exactly as described. Someday (soon, I hope) I will port that to VB.Net, which I hope will be much quicker in the preprocessing department. Then I will be considered by some as being a real man, playing with real tools. ;-) Others will still consider me a child playing with toys, because I didn't take it straight to C#. SOMEDAY (far in the future) perhaps I will embed those pieces directly in CLR programming inside of SQL Server 2005. Then I will be able to look down my nose at those children still playing with toys.
>>
>> For now it works, and with the addition of driving the sproc from the VBA it will be an integrated application like what I described.
>>
>> John W.
>> Colby
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com