William Hindman
wdhindman at dejpolsystems.com
Wed May 9 15:50:39 CDT 2007
...go ahead ...tell us you did this fixed price :)

William Hindman

----- Original Message -----
From: "JWColby" <jwcolby at colbyconsulting.com>
To: <dba-sqlserver at databaseadvisors.com>; "'Access Developers discussion and problem solving'" <accessd at databaseadvisors.com>
Sent: Wednesday, May 09, 2007 1:01 PM
Subject: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert

> Just an FYI. The table that I have been building this whole time contains 97.5 million records, exactly 149 (imported) fields, and requires 62.6 GB of data space inside of SQL Server. It took 2 hours and 28 minutes just to build the auto-increment PK field after the table finished importing records. The index space for the table (with just this single index) is 101 MB.
>
> There were 56 raw data files, which required 75 GB of disk space to hold. There were 56 CSV files created after stripping out the spaces, which required 40.8 GB of disk space to hold. Thus, by my calculations, roughly 34 GB of disk space was required to hold JUST THE SPACES in the original fixed-width files, with the real data occupying 40.8 GB. It is interesting to note that the raw data in the CSV files was 41 GB, while the data space required in SQL Server is 62 GB.
>
> As the process was built over time, I do not have accurate specs for each and every file, but the process of stripping the spaces off of the fields ran at about 1K records/second. Given 97.5 million records, this equates to 97.5 thousand seconds to do the space stripping, which is about 27 hours. That, of course, is done in a VBA application. Again, I don't have accurate specs for all of the bulk inserts; however, those that I recorded the times for summed to 71.2 million records, which took 4674 seconds (1.3 hours) to import using a BULK INSERT statement — approximately 15K records/second.
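[Editor's note: the space-stripping step John describes — slicing fixed-width fields, trimming the padding, and writing pipe-delimited lines — can be sketched as follows. His preprocessor was written in VBA; this is a minimal Python stand-in, and the field names and widths are hypothetical, not from his actual import-spec table.]

```python
# Sketch of the fixed-width -> pipe-delimited preprocessing step.
# FIELD_SPEC rows are (name, start_position, width); these example
# fields are illustrative stand-ins for the import-spec table.
FIELD_SPEC = [("FName", 0, 15), ("LName", 15, 20), ("Zip", 35, 5)]

def strip_record(line, spec):
    """Slice each fixed-width field out of a line and strip its padding spaces."""
    return "|".join(line[start:start + width].strip()
                    for _, start, width in spec)

def convert_file(src_path, dst_path, spec):
    """Convert one fixed-width file to a pipe-delimited 'CSV' with a header row."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write("|".join(name for name, _, _ in spec) + "\n")
        for line in src:
            dst.write(strip_record(line.rstrip("\n"), spec) + "\n")
```

Since the per-record work is just slicing and trimming, throughput in a compiled or JIT-compiled language should comfortably beat the ~1K records/second the VBA version achieved.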
> Remember that this BULK INSERT is importing precleaned data with pipe delimiters. Also remember that while the BULK INSERT itself took 1.3 hours, due to the lack of automation in feeding the sproc file names I had to manually edit the sproc each time I wanted to import a new file, so the actual import took much longer, since I wasn't necessarily watching the computer as the last sproc run finished.
>
> So there you go, that is what I have been trying to accomplish these last few weeks.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of JWColby
> Sent: Wednesday, May 09, 2007 11:02 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] Bulk insert
>
> Robert,
>
> That is much appreciated. Why don't you work with me instead? I am all for doing it "in SQL Server" (I own and use SQL Server 2005), but it has to be something that I understand and can modify, or I will not be able to use it in the end, which would make your efforts wasted.
>
> My high-level spec:
>
> 1) Use a "from directory" which is monitored to pick up files from.
> 1a) The from directory changes for each import data set.
> 2) Each file in the "from directory" will contain identically formatted data, but the number of records in each file may vary.
> 3) Each file will be defined by an import spec table, which contains field name, field width, and data type. If no data type is specified, then VarChar() is used, with the size of the VarChar() field determined by the spec table. This allows me to only have to spec data types (in advance) that I know are not VarChar(), while letting me spec as many as I need before the import.
> 4) Once the table is built and populated, a new field called PKID needs to be built. PKID will be INT (32 bit) PRIMARY KEY, IDENTITY.
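[Editor's note: point 3 of the spec — every column defaulting to VarChar(width) unless a type is given — lends itself to generating the CREATE TABLE DDL from the spec rows. A minimal sketch, in Python generating T-SQL; the table and column names are hypothetical examples, not John's actual schema.]

```python
# Generate a CREATE TABLE statement from import-spec rows of
# (field_name, field_width, data_type_or_None). A None data type
# falls back to VARCHAR(width), per point 3 of the spec.
def build_create_table(table, spec_rows):
    cols = [f"[{name}] {dtype or f'VARCHAR({width})'}"
            for name, width, dtype in spec_rows]
    return f"CREATE TABLE [{table}] (\n    " + ",\n    ".join(cols) + "\n)"

# Illustrative spec: one untyped (defaults to VARCHAR) and one typed column.
ddl = build_create_table("Infutor", [("FName", 15, None), ("Age", 3, "INT")])
```

Point 4 is then a single follow-up statement once all files are loaded, e.g. `ALTER TABLE [Infutor] ADD PKID INT IDENTITY(1,1) PRIMARY KEY`.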
> Understand that at this time I actually have a functioning system. It takes a pair of tables which specify the file info (from / to directories, field delimiter, etc.) and the field info (field name / start position in the fixed-width data / field length). This program (written in VBA) does a conversion from fixed width to a pipe-delimited "CSV" file, reading a line, stripping off the spaces, and writing the stripped data lines back out to a CSV file in the "to directory", complete with a first line containing the field names.
>
> At the moment I have a hand-constructed table in a hand-constructed database, created by an initial use of the wizard from inside of SQL Server, pulling in the first CSV file created by my program. Once that table is created, I use a hand-created BULK INSERT sproc to import the CSV files. Once the table is fully populated with the contents of all the files, I hand-build an ALTER TABLE query to add a PKID INT PRIMARY KEY, IDENTITY field.
>
> As you can see, anywhere you see "hand-created", that is an area that needs to be automated. My thought is that creating the table initially will be relatively easy, and in fact I know how: building up a make-table query with the field names and widths taken from the spec table. I just have not done that because I did not have the time. The next issue is dynamically creating the sproc that does the BULK INSERT.
>
> Now, whether the process of importing the data (inside of SQL Server) strips off the spaces is really somewhat irrelevant at this point, since I have working code to do this. It is not blazing fast, at about 1000 lines/second (for 150 fields), but it is "fast enough". If I port that to VB.Net I hope / expect to get a speed increase.
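[Editor's note: the "hand-created BULK INSERT sproc" John has to edit for every file is the automation gap here. One way to close it is to generate the BULK INSERT statement per file instead of editing a stored procedure. A hedged sketch, building the T-SQL as a string in Python; the table name, path, and option values are assumptions matching his pipe-delimited, header-row file format, not his actual sproc.]

```python
# Generate one BULK INSERT statement per incoming file, so no sproc
# needs hand-editing. FIRSTROW = 2 skips the field-name header line;
# the pipe FIELDTERMINATOR matches the preprocessor's output format.
def build_bulk_insert(table, csv_path):
    return (
        f"BULK INSERT [{table}]\n"
        f"FROM '{csv_path}'\n"
        "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\\n', FIRSTROW = 2)"
    )
```

A driver loop over the monitored "from directory" could then execute one such statement per file via ADO or any SQL Server client, which removes the manual edit-and-wait cycle between files.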
> The BULK INSERT sproc that I hand-build is currently running at about 12K records/sec (for 150 fields).
>
> In the end, this really needs to be an external application driving SQL Server functionality. I need a place to go to fill in the import spec table, set the from / to directories, set up the name of the table, etc. My heartache to this point has been the inability to get the SQL Server built-in import stuff to import the data without the external strip / rebuild step, or at least without an elaborate dance to get around any limitations of SQL Server to do that stuff for me.
>
> The very next thing I have to do is start exporting just the name / address (and PK) of this 100 million record table back out for CASS / NCOA processing. This requires exporting 2 million records at a time, to a destination directory, with unique file names (Infutor01.csv, Infutor02.csv, ...). Once my CASS program finishes processing, I will have a new set of files in yet another directory that I need to pull back in to SQL Server. Those files will not require the space-stripping piece, since they will not be fixed width.
>
> I do appreciate all of the advice from all the folks out there who have contributed. I am slowly but surely learning the pieces and parts that I need to do this part of my job.
>
> John W. Colby
> Colby Consulting
> www.ColbyConsulting.com
>
> -----Original Message-----
> From: dba-sqlserver-bounces at databaseadvisors.com [mailto:dba-sqlserver-bounces at databaseadvisors.com] On Behalf Of Robert L. Stewart
> Sent: Wednesday, May 09, 2007 9:31 AM
> To: dba-sqlserver at databaseadvisors.com
> Subject: Re: [dba-SQLServer] Bulk insert
>
> John,
>
> I am working on an example of doing this with SQL Server for you. But, since I have the same amount of free time as you do, it is going to take a week or so to complete.
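[Editor's note: the CASS/NCOA export John describes above — 100 million rows out in 2-million-record batches with sequentially numbered file names — is mostly bookkeeping. A minimal sketch of that bookkeeping in Python; the "Infutor" stem and chunk size come from his description, while in practice each batch would page through SQL Server keyed on the PKID range.]

```python
CHUNK = 2_000_000  # records per export file, per John's description

def chunk_file_names(total_rows, chunk=CHUNK, stem="Infutor"):
    """Sequential export file names: Infutor01.csv, Infutor02.csv, ..."""
    n_files = (total_rows + chunk - 1) // chunk   # ceiling division
    return [f"{stem}{i:02d}.csv" for i in range(1, n_files + 1)]

def pkid_ranges(total_rows, chunk=CHUNK):
    """(first_pkid, last_pkid) per batch, assuming PKID is a dense 1-based IDENTITY."""
    return [(lo, min(lo + chunk - 1, total_rows))
            for lo in range(1, total_rows + 1, chunk)]
```

Pairing the two lists gives one `WHERE PKID BETWEEN lo AND hi` export query per output file; 100 million rows yields 50 files.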
> I am using the same concept as you did, with a couple of tables to hold the import spec and the column definitions for the spec. I am only going to build it for one table, to show how it can be done. The rest will be up to you if you want to expand it.
>
> Robert
>
> At 09:27 PM 5/8/2007, you wrote:
>> Date: Tue, 8 May 2007 13:40:59 -0400
>> From: "JWColby" <jwcolby at colbyconsulting.com>
>> Subject: Re: [dba-SQLServer] Bulk insert
>> To: <dba-sqlserver at databaseadvisors.com>
>> Message-ID: <00d901c79198$0a703210$657aa8c0 at m6805>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Jim,
>>
>> At this point it is not necessary. I built a preprocessor in a few hours using my toy (Access). My toy application handles everything exactly as described. Someday (soon, I hope) I will port that to VB.Net, which I hope will be much quicker in the preprocessing department. Then I will be considered by some as being a real man, playing with real tools. ;-) Others will still consider me a child playing with toys, because I didn't take it straight to C#. SOMEDAY (far in the future) perhaps I will embed those pieces directly in CLR programming inside of SQL Server 2005. Then I will be able to look down my nose at those children still playing with toys.
>>
>> For now it works, and with the addition of driving the sproc from the VBA it will be an integrated application like what I described.
>>
>> John W.
>> Colby
>
> _______________________________________________
> dba-SQLServer mailing list
> dba-SQLServer at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/dba-sqlserver
> http://www.databaseadvisors.com
>
> --
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com