Eric Barro
ebarro at verizon.net
Wed May 9 13:17:01 CDT 2007
VB's weakness is string manipulation, and I believe that is where the speed issue in your white-space stripping operation lies. It is especially evident when you have a loop that concatenates strings. .NET's StringBuilder class is much more efficient.

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of JWColby
Sent: Wednesday, May 09, 2007 10:53 AM
To: 'Access Developers discussion and problem solving'
Subject: Re: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert

Gustav,

>It stresses that BULK INSERT is mandatory for data volumes at this level.

Oh yea!!! I haven't found any other way that makes an import of this size doable.

>One thing that strikes me, however, is the slow performance of your
>space stripping; 1000 lines/s is not very much. But I guess you do more
>than just removing spaces.

Nope, I just remove the spaces and write the result back out to a pipe-delimited file. I have a pair of classes that do this. One, clsFile, loads the file spec info (file name stuff, from / to dirs, etc.); the other, clsField, uses one instance per field and loads one field of the field spec table. The basic operation is to load the file spec class, then a collection of field spec classes. Each field spec class knows its field name, where in the string its field starts, and how wide its field is.

clsFile then opens a stream object and does a ReadLine into a strLineIn variable. clsFile iterates its collection of clsField instances, and strLineIn is passed in turn to each field class instance. The field class does a Mid$() to pull out precisely the data section that it has to process, stores it in a strData variable / property (pData), and then strips off the leading and trailing spaces.

Once clsFile has read strLineIn and passed it in turn to each clsField, it has a collection of clsField instances, each holding a stripped section of the original strLineIn. clsFile then iterates that clsField collection, appending each clsField.pData plus a "|" to strLineOut. When it has processed every clsField instance it is done assembling strLineOut, which it then writes to an output stream. Line in, parse / strip, line out, repeat until done.

I do a little logging of the file name, the time to do the entire operation on the file, etc. 99.99% of the time is in the parse / strip operation out in the clsField instances.
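In code terms the loop is roughly this (simplified; the real classes hold more state, and the property names here are stand-ins):

    ' Sketch of the per-line parse / strip pass. mcolFields stands in
    ' for the collection of clsField instances loaded from the field
    ' spec table (loading code omitted); pStart / pWidth / pData are
    ' illustrative property names, not the actual ones.
    Dim fso As Object, tsIn As Object, tsOut As Object
    Dim strLineIn As String, strLineOut As String
    Dim mcolFields As Collection
    Dim fld As clsField

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set tsIn = fso.OpenTextFile("D:\Raw\File01.txt", 1)         ' 1 = ForReading
    Set tsOut = fso.CreateTextFile("D:\Clean\File01.csv", True)

    Do While Not tsIn.AtEndOfStream
        strLineIn = tsIn.ReadLine
        strLineOut = ""
        For Each fld In mcolFields
            ' pull exactly this field's slice and strip the padding
            fld.pData = Trim$(Mid$(strLineIn, fld.pStart, fld.pWidth))
            strLineOut = strLineOut & fld.pData & "|"
        Next fld
        tsOut.WriteLine strLineOut
    Loop
    tsIn.Close: tsOut.Close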
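Note that the strLineOut concatenation itself is a cost: each "&" in that inner loop allocates a new string and copies the old one into it. The usual VBA workaround, much the same idea as .NET's StringBuilder, is a buffer allocated once and written in place with the Mid$ statement. Again a sketch, not what the production classes do:

    ' Sketch: build the output line in a preallocated buffer instead of
    ' concatenating, the classic VBA answer to slow "&" in tight loops.
    Dim strBuf As String
    Dim strPiece As String
    Dim lngPos As Long

    strBuf = Space$(4096)            ' assumed worst-case output line width
    lngPos = 1
    For Each fld In mcolFields
        strPiece = Trim$(Mid$(strLineIn, fld.pStart, fld.pWidth)) & "|"
        Mid$(strBuf, lngPos, Len(strPiece)) = strPiece    ' overwrite in place
        lngPos = lngPos + Len(strPiece)
    Next fld
    strLineOut = Left$(strBuf, lngPos - 1)

On a 149-field line that replaces 149 allocate-and-copy operations per output line with one.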
Remember that the time to do this varies with the data and the data file. The first file I did had well over SEVEN HUNDRED fields per line; this specific file had 149. The lines-per-second rate is most heavily influenced by the number of fields per line. Not all of the fields have spaces, but how do I tell? This is a generic solution, so that I can use it on the next file, not custom programming for one specific file.

I think this application will port quite easily to VB.Net, though I haven't done so yet. When I do, I will run the thing again and give comparison numbers. I do hope / expect that VB.Net will be significantly faster in processing the field parse / strip.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Gustav Brock
Sent: Wednesday, May 09, 2007 1:18 PM
To: accessd at databaseadvisors.com
Subject: [AccessD] Infutor Statistics - was RE: [dba-SQLServer] Bulk insert

Hi John

Thanks for sharing. Quite a story. It stresses that BULK INSERT is mandatory for data volumes at this level.

One thing that strikes me, however, is the slow performance of your space stripping; 1000 lines/s is not very much. But I guess you do more than just removing spaces.

/gustav

>>> jwcolby at colbyconsulting.com 09-05-2007 19:01 >>>

Just an FYI. The table that I have been building this whole time contains 97.5 million records with exactly 149 (imported) fields, and requires 62.6 GB of data space inside SQL Server. It took 2 hours and 28 minutes just to build the autoincrement PK field after the table finished importing records. The index space for the table (with just this single index) is 101 MB.

There were 56 raw data files, which required 75 GB of disk space to hold. The 56 CSV files created after stripping out the spaces required 40.8 GB of disk space. Thus, by my calculations, about 34 GB of disk space was needed to hold JUST THE SPACES in the original fixed-width files, with the real data occupying 40.8 GB. It is interesting to note that the raw data in the CSV files was 41 GB, while the data space required in SQL Server is 62 GB.

As the process was built over time, I do not have accurate specs for each and every file, but the stripping of spaces off the fields ran at about 1K records / second. Given 97.5 million records, that equates to 97,500 seconds, or about 27 hours, to do the space stripping. That, of course, is done in a VBA application.

Again, I don't have accurate specs for all of the bulk inserts; however, those that I recorded times for summed to 71.2 million records, which took 4674 seconds (1.3 hours) to import using a BULK INSERT statement. That equates to approximately 15K records / second. Remember that this BULK INSERT is importing precleaned data with pipe delimiters. Also remember that while the BULK INSERT itself took 1.3 hours, there was no automation for feeding file names to the sproc; I had to manually edit the sproc each time I wanted to import a new file, so the actual import took much longer, since I wasn't necessarily watching the computer as the last sproc run finished.

So there you go, that is what I have been trying to accomplish these last few weeks.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
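For reference, the import John describes would look something like the sketch below, driven from VBA over ADO so the file names do not have to be hand-edited into the sproc each time. Server, database, table, and path names are all placeholders, and this stands in for, rather than reproduces, his actual sproc:

    ' Sketch: loop the cleaned pipe-delimited files and issue one
    ' BULK INSERT per file. All names here are made up.
    Dim cn As Object
    Dim strFile As String
    Dim strSQL As String

    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=SQLOLEDB;Data Source=MYSERVER;" & _
            "Initial Catalog=Infutor;Integrated Security=SSPI;"
    cn.CommandTimeout = 0                ' these loads run for a long time

    strFile = Dir$("D:\Clean\*.csv")
    Do While Len(strFile) > 0
        strSQL = "BULK INSERT dbo.tblInfutor " & _
                 "FROM 'D:\Clean\" & strFile & "' " & _
                 "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n')"
        cn.Execute strSQL                ' one bulk load per cleaned file
        strFile = Dir$                   ' next matching file
    Loop
    cn.Close

Note that BULK INSERT resolves the FROM path on the SQL Server machine, not the client, so the cleaned files have to sit somewhere the server can see.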