[AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulkinsert

JWColby jwcolby at colbyconsulting.com
Wed May 9 15:59:37 CDT 2007


Whadayathinkimstupid?

However... Once I am done with my import widget I will do these fixed price,
and make a killing.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of William Hindman
Sent: Wednesday, May 09, 2007 4:51 PM
To: Access Developers discussion and problem solving
Subject: Re: [AccessD] Infutor Statistics - was RE: [dba-SQLServer]
Bulkinsert

...go ahead ...tell us you did this fixed price :)

William Hindman

----- Original Message -----
From: "JWColby" <jwcolby at colbyconsulting.com>
To: <dba-sqlserver at databaseadvisors.com>; "'Access Developers discussion and
problem solving'" <accessd at databaseadvisors.com>
Sent: Wednesday, May 09, 2007 1:01 PM
Subject: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert


> Just an FYI.  The table that I have been building this whole time 
> contains
> 97.5 million records, exactly 149 (imported) fields and requires 62.6 
> Gigabytes of data space inside of SQL Server. It took 2 hours and 28 
> minutes just to build the auto increment PK field after the table was 
> finished importing records.  The index space for the table (with just 
> this single
> index) is 101 Megabytes.
>
> There were 56 raw data files which required 75 gigabytes of disk space 
> to hold.  There were 56 CSV files created after stripping out the 
> spaces, which required 40.8 Gigabytes of disk space to hold.  Thus by 
> my calculations,
> 35
> gigs of disk space was required to hold JUST THE SPACES in the 
> original fixed width file, with the real data occupying 40.8 GB.  It 
> is interesting to note that the raw data in the CSV file was 41gb 
> while the data space required in SQL Server is 62 gb.
>
> As the process was built over time, I do not have accurate specs for 
> each and every file, but the process of stripping the spaces off of 
> the fields happened at about 1K records / second.  Given 97.5 million 
> records, this equates to 97.5 thousand seconds to do the space 
> stripping, which is 27.77 hours.  That of course is done in a VBA 
> application.  Again I don't have accurate specs for all of the bulk 
> inserts, however those that I recorded the times for summed to 71.2 
> million records, which took 4674 seconds (1.3
> hours) to import using a BULK INSERT statement, which equates to 
> approximately 15K records / second.  Remember that this BULK INSERT is 
> importing precleaned data with pipe delimiters.  Also remember that 
> the BULK INSERT itself took 1.3 hours but due to the lack of 
> automation in feeding the Sproc file names, I had to manually edit the 
> SPROC each time I wanted to import a new file so the actual import 
> took much longer, since I wasn't necessarily watching the computer as 
> the last SPROC run finished.
>
> So there you go, that is what I have been trying to accomplish this 
> last few weeks.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com
> [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of 
> JWColby
> Sent: Wednesday, May 09, 2007 11:02 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] Bulk insert
>
> Robert,
>
> That is much appreciated.  Why don't you work with me instead?  I am 
> all for doing it "in SQL Server" (I own and use SQL Server 2005) but 
> it has to be something that I understand and can modify or I will not 
> be able to use it in the end, which would make your efforts wasted.
>
> My high level spec:
>
> 1) Use a "from directory" which is monitored to pick up files from.
> 1a) The from directory changes for each import data set.
> 2) Each file in the "from directory" will contain identical formatted 
> data, but the number of records in the file may vary.
> 3) Each file will be defined by an import spec table, which contains 
> Field Name, Field Width and data type.  If no data type is specified, 
> then
> VarChar() is used.  The size of the VarChar() field is determined by 
> the spec table.  This allows me to only have to spec data types (in 
> advance) that I know is not VarChar(), but allows me to spec as many 
> as I need in advance of the import.
> 4) Once the table is built and populated, a new field called PKID 
> needs to be built.  PKID will be INT (32 bit) PRIMARY KEY, IDENTITY.
>
> Understand that at this time I actually have a functioning system.  It 
> takes a pair of tables which specify the file info (from / to 
> directories, field delimiter etc) and the field info (field name / 
> start position in the fixed width data / field length).  This program 
> (written in VBA) does a conversion from fixed width to a pipe 
> delimited "CSV" file, reading a line, stripping off the spaces, and 
> writing the stripped data lines back out to a CSV file in the "TO 
> Directory", complete with the first line containing field names.
> At the moment I have a hand constructed table in a hand constructed 
> database, which is created by an initial use of the wizard from inside 
> of SQL Server, pulling in the first CSV files created by my program.  
> Once that table is created, I use a hand created BULK INSERT Sproc to 
> import the CSV files.  Once the table is fully populated with the 
> contents of all the files, I hand build an ALTER TABLE query to build 
> a PKID INT PRIMARY KEY, IDENTITY.
>
> As you can see, anywhere you see "hand created", that is an area that 
> needs to be automated.  My thoughts are that creating the table 
> initially will be relatively easy, and in fact I know how, building up 
> a make table query with the field names and widths taken from the spec 
> table.  I just did not do that because I did not have the time.  The 
> next issue is dynamically creating the Sproc that does the Bulk 
> Insert.
>
> Now, as to whether the process of importing the data (inside of SQL
> Server)
> strips off the spaces is really somewhat irrelevant at this point 
> since I have working code to do this.  It is not blazing fast at about 
> 1000 lines / second (for 150 fields) but it is "fast enough".  If I 
> port that to VB.Net I hope / expect to get a speed increase.  The BULK 
> INSERT SProc that I hand build is currently running about 12K records 
> / sec (for 150 fields)
>
> In the end, this really needs to be an external application driving 
> SQL Server functionality.  I need a place to go to fill in the import 
> spec table, set the from / to directories, set up the name of the table
etc.
> My
> heartache to this point has been the inability to get the SQL Server 
> built-in import stuff to import the data without the external strip / 
> rebuild step, or at least without an elaborate dance to get around any 
> limitations of SQL Server to do that stuff for me.
>
> The very next thing I have to do is start exporting just the name / 
> address (and PK) of this 100 million record table back out for CASS / 
> NCOA processing.  This requires exporting 2 million records at a time, 
> to a destination directory, with a unique file name (Infutor01.csv, 
> Infutor02.csv...).  Once my CASS program finishes processing I will 
> have a new set of files in yet another directory that I need to pull 
> back in to SQL Server.  Those files will not require the space 
> stripping piece since they will not be fixed width.
>
> I do appreciate all of the advice from all the folks out there that 
> have contributed.  I am slowly but surely learning the pieces and 
> parts that I need to do this part of my job.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com
> [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Robert L.
> Stewart
> Sent: Wednesday, May 09, 2007 9:31 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] Bulk insert
>
> John,
>
> I am working on an example of doing this with SQL Server for you.  
> But, since I have the same amount of free time as you do, it is going 
> to take a week or so to complete.  I am using the same concept as you 
> did with a couple of table to hold the Import spec and the column 
> definitions for the spec. I am going to only build it for one table to 
> show how it can be done.
> The rest will be up to you if you want to expand it.
>
> Robert
>
> At 09:27 PM 5/8/2007, you wrote:
>>Date: Tue, 8 May 2007 13:40:59 -0400
>>From: "JWColby" <jwcolby at colbyconsulting.com>
>>Subject: Re: [dba-SQLServer] Bulk insert
>>To: <dba-sqlserver at databaseadvisors.com>
>>Message-ID: <00d901c79198$0a703210$657aa8c0 at m6805>
>>Content-Type: text/plain;       charset="us-ascii"
>>
>>Jim,
>>
>>At this point it is not necessary.  I built a preprocessor in a few 
>>hours using my toy (Access).  My toy application handles everything 
>>exactly as described.  Someday (soon I hope) I will port that to 
>>VB.Net which I hope will be much quicker in the preprocessing department.
>>Then I will be considered by some as being a real man, playing with 
>>real tools.  ;-) Others will still consider me a child, playing with 
>>toys because I didn't take it straight to C#.  SOMEDAY (far in the
>>future) perhaps I will embed those pieces directly in CLR programming 
>>inside of SQL Server 2005.  Then I will be able to look down my nose 
>>at
> those children still playing with toys.
>>
>>For now, it works and with the addition of driving the Sproc from the 
>>vba will be an integrated application like what I described.
>>
>>John W. Colby
>
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
> 



-- 
AccessD mailing list
AccessD at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/accessd
Website: http://www.databaseadvisors.com




More information about the AccessD mailing list