JWColby
jwcolby at colbyconsulting.com
Wed May 9 12:01:01 CDT 2007
Just an FYI. The table that I have been building this whole time contains 97.5 million records, exactly 149 (imported) fields, and requires 62.6 gigabytes of data space inside of SQL Server. It took 2 hours and 28 minutes just to build the autoincrement PK field after the table finished importing records. The index space for the table (with just this single index) is 101 megabytes.

There were 56 raw data files, which required 75 gigabytes of disk space to hold. There were 56 CSV files created after stripping out the spaces, which required 40.8 gigabytes of disk space to hold. Thus, by my calculations, roughly 34 gigabytes of disk space was required to hold JUST THE SPACES in the original fixed-width files, with the real data occupying 40.8 GB. It is interesting to note that the raw data in the CSV files was 41 GB, while the data space required in SQL Server is 62 GB.

As the process was built over time, I do not have accurate specs for each and every file, but the process of stripping the spaces off of the fields ran at about 1K records / second. Given 97.5 million records, this equates to 97.5 thousand seconds to do the space stripping, which is roughly 27 hours. That, of course, is done in a VBA application.

Again, I don't have accurate specs for all of the bulk inserts; however, those that I recorded the times for summed to 71.2 million records, which took 4,674 seconds (1.3 hours) to import using a BULK INSERT statement, or approximately 15K records / second. Remember that this BULK INSERT is importing precleaned data with pipe delimiters. Also remember that while the BULK INSERT itself took 1.3 hours, due to the lack of automation in feeding the sproc file names I had to manually edit the sproc each time I wanted to import a new file, so the actual import took much longer, since I wasn't necessarily watching the computer as the last sproc run finished.

So there you go, that is what I have been trying to accomplish these last few weeks.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JWColby
Sent: Wednesday, May 09, 2007 11:02 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] Bulk insert

Robert,

That is much appreciated. Why don't you work with me instead? I am all for doing it "in SQL Server" (I own and use SQL Server 2005), but it has to be something that I understand and can modify, or I will not be able to use it in the end, which would make your efforts wasted.

My high-level spec:

1) Use a "from directory" which is monitored to pick up files from.
1a) The from directory changes for each import data set.
2) Each file in the "from directory" will contain identically formatted data, but the number of records in the file may vary.
3) Each file will be defined by an import spec table, which contains Field Name, Field Width and Data Type. If no data type is specified, then VarChar() is used. The size of the VarChar() field is determined by the spec table. This allows me to only have to spec data types (in advance) that I know are not VarChar(), but allows me to spec as many as I need in advance of the import.
4) Once the table is built and populated, a new field called PKID needs to be built. PKID will be INT (32 bit), PRIMARY KEY, IDENTITY. (A sketch of this step follows below.)
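A minimal sketch of step 4, assuming a hypothetical table name (dbo.Infutor) for the table that the bulk inserts populate:

-- Hypothetical table name; adds the autoincrement PK described in step 4.
ALTER TABLE dbo.Infutor
    ADD PKID INT IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_Infutor PRIMARY KEY CLUSTERED;

Adding an IDENTITY column to a table that is already populated has to write a value into every existing row, which is consistent with the 2-hour-28-minute run time reported above for 97.5 million records.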
Understand that at this time I actually have a functioning system. It takes a pair of tables which specify the file info (from / to directories, field delimiter, etc.) and the field info (field name / start position in the fixed-width data / field length). This program (written in VBA) does a conversion from fixed width to a pipe-delimited "CSV" file, reading a line, stripping off the spaces, and writing the stripped data lines back out to a CSV file in the "to directory", complete with a first line containing the field names.

At the moment I have a hand-constructed table in a hand-constructed database, created by an initial use of the wizard from inside of SQL Server, pulling in the first of the CSV files created by my program. Once that table is created, I use a hand-created BULK INSERT sproc to import the CSV files. Once the table is fully populated with the contents of all the files, I hand build an ALTER TABLE query to build a PKID INT PRIMARY KEY, IDENTITY.

As you can see, anywhere you see "hand created", that is an area that needs to be automated. My thoughts are that creating the table initially will be relatively easy, and in fact I know how: building up a make-table query with the field names and widths taken from the spec table. I just did not do that because I did not have the time. The next issue is dynamically creating the sproc that does the BULK INSERT (see the sketch after this message).

Now, whether the process of importing the data (inside of SQL Server) strips off the spaces is really somewhat irrelevant at this point, since I have working code to do this. It is not blazing fast at about 1,000 lines / second (for 150 fields), but it is "fast enough". If I port that to VB.Net I hope / expect to get a speed increase. The BULK INSERT sproc that I hand build is currently running about 12K records / sec (for 150 fields).

In the end, this really needs to be an external application driving SQL Server functionality. I need a place to go to fill in the import spec table, set the from / to directories, set up the name of the table, etc. My heartache to this point has been the inability to get the SQL Server built-in import stuff to import the data without the external strip / rebuild step, or at least without an elaborate dance to get around any limitations of SQL Server to do that stuff for me.

The very next thing I have to do is start exporting just the name / address (and PK) of this 100 million record table back out for CASS / NCOA processing. This requires exporting 2 million records at a time, to a destination directory, with a unique file name (Infutor01.csv, Infutor02.csv...). Once my CASS program finishes processing, I will have a new set of files in yet another directory that I need to pull back in to SQL Server. Those files will not require the space-stripping piece since they will not be fixed width.

I do appreciate all of the advice from all the folks out there who have contributed. I am slowly but surely learning the pieces and parts that I need to do this part of my job.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
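A minimal sketch of the dynamically created BULK INSERT sproc described above, with hypothetical names (dbo.usp_ImportCsvFile, dbo.Infutor); parameterizing the file name removes the hand editing of the sproc for each file:

-- Hypothetical names throughout; loads one precleaned, pipe-delimited CSV.
CREATE PROCEDURE dbo.usp_ImportCsvFile
    @FilePath NVARCHAR(260)  -- full path to one CSV file in the "to directory"
AS
BEGIN
    DECLARE @sql NVARCHAR(MAX);

    -- BULK INSERT will not accept a variable as the file name,
    -- so the statement is built and executed dynamically.
    SET @sql = N'
        BULK INSERT dbo.Infutor
        FROM ''' + REPLACE(@FilePath, N'''', N'''''') + N'''
        WITH (
            FIELDTERMINATOR = ''|'',  -- precleaned, pipe-delimited data
            ROWTERMINATOR   = ''\n'',
            FIRSTROW        = 2,      -- skip the header row of field names
            TABLOCK
        );';

    EXEC sys.sp_executesql @sql;
END;

-- Usage, one call per file:
-- EXEC dbo.usp_ImportCsvFile @FilePath = N'D:\ImportTo\File01.csv';

The external application (the VBA or VB.Net front end described above) can then loop through the files in the directory and call the sproc once per file instead of editing it by hand.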
-----Original Message-----
From: dba-sqlserver-bounces at databaseadvisors.com
[mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Robert L. Stewart
Sent: Wednesday, May 09, 2007 9:31 AM
To: dba-sqlserver at databaseadvisors.com
Subject: Re: [dba-SQLServer] Bulk insert

John,

I am working on an example of doing this with SQL Server for you. But, since I have the same amount of free time as you do, it is going to take a week or so to complete.

I am using the same concept as you did, with a couple of tables to hold the import spec and the column definitions for the spec. I am only going to build it for one table, to show how it can be done. The rest will be up to you if you want to expand it.

Robert

At 09:27 PM 5/8/2007, you wrote:
>Date: Tue, 8 May 2007 13:40:59 -0400
>From: "JWColby" <jwcolby at colbyconsulting.com>
>Subject: Re: [dba-SQLServer] Bulk insert
>To: <dba-sqlserver at databaseadvisors.com>
>Message-ID: <00d901c79198$0a703210$657aa8c0 at m6805>
>Content-Type: text/plain; charset="us-ascii"
>
>Jim,
>
>At this point it is not necessary. I built a preprocessor in a few
>hours using my toy (Access). My toy application handles everything
>exactly as described. Someday (soon I hope) I will port that to VB.Net,
>which I hope will be much quicker in the preprocessing department.
>Then I will be considered by some as being a real man, playing with
>real tools. ;-) Others will still consider me a child, playing with
>toys, because I didn't take it straight to C#. SOMEDAY (far in the
>future) perhaps I will embed those pieces directly in CLR programming
>inside of SQL Server 2005. Then I will be able to look down my nose at
>those children still playing with toys.
>
>For now it works, and with the addition of driving the sproc from the
>VBA it will be an integrated application like what I described.
>
>John W. Colby
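A minimal sketch of the pair of spec tables described in the messages above (one for the file info, one for the field definitions), with hypothetical names and columns inferred from those descriptions:

-- Hypothetical schema; one row per import data set.
CREATE TABLE dbo.ImportSpec (
    ImportSpecID   INT IDENTITY(1, 1) PRIMARY KEY,
    SpecName       VARCHAR(100) NOT NULL,
    FromDirectory  VARCHAR(260) NOT NULL,   -- monitored directory of raw fixed-width files
    ToDirectory    VARCHAR(260) NOT NULL,   -- where the cleaned pipe-delimited CSVs are written
    FieldDelimiter CHAR(1)      NOT NULL DEFAULT '|',
    TargetTable    VARCHAR(128) NOT NULL    -- table the BULK INSERT loads
);

-- Hypothetical schema; one row per field in the fixed-width layout.
CREATE TABLE dbo.ImportSpecField (
    ImportSpecFieldID INT IDENTITY(1, 1) PRIMARY KEY,
    ImportSpecID      INT NOT NULL REFERENCES dbo.ImportSpec (ImportSpecID),
    FieldName         VARCHAR(128) NOT NULL,
    StartPosition     INT NOT NULL,          -- 1-based offset in the fixed-width record
    FieldLength       INT NOT NULL,
    DataType          VARCHAR(50)  NULL      -- NULL means default to VARCHAR(FieldLength)
);

From ImportSpecField, the CREATE TABLE statement for the target table (and the field-name header line of the CSV) can be generated by concatenating FieldName with either DataType or VARCHAR(FieldLength) for each row, which is the make-table query built from the spec table that John describes.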