Shamil Salakhetdinov
shamil at users.mns.ru
Tue Jul 24 14:37:04 CDT 2007
Hello John,

<<< I don't understand the context - would what have any economical sense? >>>

I meant: does your customer need bulk loading to be any faster or not? One can fly Paris - NY on an ordinary Boeing; one could (in the past) use Concorde or even the TU-144(!?) (http://en.wikipedia.org/wiki/Tupolev_Tu-144). IOW, does your customer still need a "Concorde" for this task, or is the current (very good) "Boeing" solution good enough for them now and for a long time into the future?

I also mean we can try to find the bottleneck (if any) of the current solution and "break" it.

Some tests (Dual Core 3GHz, 2GB RAM, Windows 2003 Server, simple IDE/SATA 7200 rpm HDD) - the raw performance of reading a TAB-delimited CSV file in .NET:

Total Counter = 1011852 (1 million...)
ElapsedTime = 3.359375 seconds
ElapsedTime = 0.0559895833333333 minutes

Total Counter = 10006092 (10 million...)
ElapsedTime = 32.890625 seconds
ElapsedTime = 0.548177083333333 minutes

Total Counter = 100010952 (100 million...)
ElapsedTime = 327.875 seconds
ElapsedTime = 5.46458333333333 minutes

With splitting of the CSV file lines into an array added:

Total Counter = 1011852 (1 million...)
ElapsedTime = 8.375 seconds
ElapsedTime = 0.139583333333333 minutes

(I didn't have time to test the 10 and 100 million line cases - please do if you have time...)

Of course that is just reading the file - the open question is how closely this raw file-reading performance can be approached when such a file is bulk loaded into an MS SQL database.

John, I'm sorry, I have to suspend my participation in this thread for several days - a very urgent project is here waiting for release... I must say I'm very interested to return here next week; sorry for any inconvenience...

--
Shamil

P.S.
Simple C# code used for testing (watch line wraps):

using System;
using System.Collections.Generic;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            long MAX_LIMIT = 1000000; // 00;
            string s = @"F:\Temp\XL\ConsoleApplication1\IN\testfile.txt";
            long totalCounter = 0;
            long lineCounter = 0;
            DateTime startTime = DateTime.Now;
            while (totalCounter < MAX_LIMIT)
            {
                // re-read the same file until the target line count is reached
                System.IO.TextReader rdr = new System.IO.StreamReader(s);
                lineCounter = 0;
                string inline = "";
                char delimiter = (char)9; // TAB
                while ((inline = rdr.ReadLine()) != null)
                {
                    string[] fields = inline.Split(delimiter);
                    ++totalCounter;
                    ++lineCounter;
                }
                rdr.Close();
            }
            DateTime endTime = DateTime.Now;
            TimeSpan elapsedTime = endTime - startTime;
            Console.WriteLine("File Line Counter = {0}", lineCounter.ToString());
            Console.WriteLine("Total Counter = {0}", totalCounter.ToString());
            Console.WriteLine("ElapsedTime = {0} seconds", elapsedTime.TotalSeconds.ToString());
            Console.WriteLine("ElapsedTime = {0} minutes", elapsedTime.TotalMinutes.ToString());
        }
    }
}

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, July 24, 2007 7:02 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Shamil,

>Would that have any (economical) sense for your customer to improve DB bulk loading time?

I don't understand the context - would what have any economical sense?

>Why do you not use the 64 bit .Net Framework to run your code on the server side?

My understanding is that in order to run anything in 64 bit mode I have to have 64 bit Windows and 64 bit SQL Server. I only have 32 bit installed at this time. I have obtained a copy of 64 bit SQL Server and am looking for 64 bit Windows 2003.

I have several problems here. First, I am a one man show with a month by month budget. Second, in order to get maximum bang for my hard earned dollar I build my own systems.
I was talking to the owner of the company that does my address validation, and they just spent $25,000 for a DELL server with 32 gigs of RAM and Windows 2003 x64. I don't have that kind of money. When you buy a DELL (as an example), getting Windows x64 to install is a simple click of a button on the order page. Getting Windows x64 to install on a desktop you built yourself is not so simple, with issues for ALL of the drivers - video, network, disk, etc. When I built the workstations I use to run SQL Server I did not understand the size of the task; now I do.

I will be building a new server this fall with a server motherboard, designed and certified to run the x64 versions of Windows, with drivers supplied, etc. It will be a dual processor / 8 core machine which will contain 32 gigs of RAM to start, 64 gigs eventually. I MAY be able to get Windows 2003 x64 to run on the current servers, and if I can then I will, but I am not holding my breath. Even if I can, they max out at 8 gigs of RAM that they can address, so they will still be underpowered for my purpose.

>Do you need to do any mapping in your code between the third-party data reader (IDataReader) object and SqlBulkCopy's SqlBulkCopyColumnMappingCollection, or is the sequence of the fields in the source CSV file the same as the sequence of the fields in the target database table?

I am still working through that. The third party lib, DataStreams from www.csvreader.com, has a set of objects (readers) which can read and write streams to and from files. The docs are complete but the examples are sparse! However, the author is available on his forums to answer questions and generally responds within a few hours.

Basically what happens is that you instantiate one of the reader objects, tell the object what data type each column will be, and then pass a stream from that object to the SqlBulkCopy object.
The DataStreams reader object handles all of the opening of the file, parsing the CSV file into an array, and writing the array data into the stream. It appears to handle the data in chunks so that the entire file does not have to fit into memory at once.

With a CSV file there is typically a header line in the first row. DataStreams reads that in and does the mapping for you if that row exists. He also has methods for obtaining the collection of field names etc., so I am actually building code to automatically build a table in SQL Server based on the names in the field names collection, plus a PK column of my own (long autoincrement) out at the end. His code correctly feeds the data into the table IF the columns from the file are the first columns, i.e. the PK is at the end of the table, but it does not correctly handle the data feed if the PK is the first field in the table (and there is no PK field in the data).

Remember that I have two distinct cases:

1) Data (lists) from vendors. These may have some "PK" of their own, but I do not know what it is or whether there will be one (typically not), so I just routinely build my own.

2) Data that I export from SQL Server, process, and get back from the address validation software. Those CSVs already have a PKID because my source tables have one and I export that PKID into the CSV file when I do the export from SQL Server.

Given the cost of the lib, I am very impressed with the implementation so far. He provides a record event to allow preprocessing the data within each record being read, before sending it off to the SqlBulkCopy. You can do anything you might imagine with such control - look for specific data values in a given field, skip the record based on such evaluations, modify the data in specific fields, etc., all on-the-fly as the data moves from CSV to SQL Server. I believe it also handles fixed-width files, which are also on my plate.

So basically I just outsourced one small but complex part of the project.
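[Editor's note: the automatic table-building step described above can be sketched roughly as follows. This is NOT the DataStreams API - the method name, the VARCHAR(100) default column type, and the PKID column name are illustrative assumptions; only the idea (CSV header names plus a trailing autoincrement PK) comes from the post.]

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch (not the DataStreams API): build a CREATE TABLE
// statement from a CSV header's field names, appending an autoincrement
// PK column at the end - the column order that feeds correctly above.
// VARCHAR(100) and the name PKID are assumptions for illustration.
class TableBuilderSketch
{
    public static string BuildCreateTable(string tableName, IList<string> fieldNames)
    {
        List<string> cols = new List<string>();
        foreach (string f in fieldNames)
            cols.Add("[" + f + "] VARCHAR(100) NULL"); // all text to start; refine per column
        cols.Add("[PKID] BIGINT IDENTITY(1,1) PRIMARY KEY"); // PK goes at the end
        return "CREATE TABLE [" + tableName + "] (" + string.Join(", ", cols.ToArray()) + ")";
    }

    static void Main()
    {
        Console.WriteLine(BuildCreateTable("tblList", new string[] { "First", "Last", "Zip" }));
    }
}
```

The generated DDL can then be executed once via an ordinary SqlCommand before handing the stream to SqlBulkCopy.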
It was easy to get working and seems pretty fast, so I am happy with the results so far.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Tuesday, July 24, 2007 3:12 AM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Hello John,

Would that have any (economical) sense for your customer to improve DB bulk loading time?

Why do you not use the 64 bit .Net Framework to run your code on the server side?

Do you need to do any mapping in your code between the third-party data reader (IDataReader) object and SqlBulkCopy's SqlBulkCopyColumnMappingCollection, or is the sequence of the fields in the source CSV file the same as the sequence of the fields in the target database table?

--
Shamil

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, July 24, 2007 12:17 AM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Well...

The SQL Server system is an AMD X64 3.8 GHz with 4 gig of RAM running Windows 2003. It is using a RAID 6 array with a dedicated controller. The actual VB.Net program is running on my laptop, a Dell M90, which is a dual proc Intel with 2 gig of RAM running Windows XP Pro. Thus the raw data files come off the server into the laptop across a 1 gbit LAN; the laptop does the text processing and feeds the data back to the SQL Server for storing in the database.

All of this will be affected by things like the number of fields, the type of fields written to in SQL Server (what the data actually is and the data type of the destination field), indexes that exist on the table, etc. Mine are very rough timings done by using the clock at the beginning / end of the bulk import.
The custom library that I use was optimized (according to the author) by using arrays to store the data coming out of the text file on the way to the SQL Server. Again, I haven't seen his code (I could pay more money to get source, but I don't need that), so I can't really comment on the "how" of what he is doing. I do know that I open the file using his lib and pass a resulting stream (a property of his reader object) to the SqlBulkCopy.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Monday, July 23, 2007 3:57 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

<<< I am doing tables of 50 to 100 million records so the total process can still run 60 to 90 minutes but it was taking 2 days. >>>

Yes, that's speedy! Do you use any special hardware?

Here are some results of using SQLDMO.BulkCopy under C#/VB.NET:

SQLDMO.BulkCopy
===============
BCP export - 141 seconds, 1,216,138 reported rows
BCP import - 174 seconds, 1,256,819 reported rows

Extrapolating to 100 million records:

BCP export: 141 * 100,000,000 / 1,216,138 = 11,594.17 seconds = 3.22 hours ~= 3h13m
BCP import: 174 * 100,000,000 / 1,256,819 = 13,844.48 seconds = 3.85 hours ~= 3h51m

Well, performance will probably degrade and the total time will be higher... These results are not as quick as the SqlBulkCopy results on your system, but my test PC is "pretty simple" for these days: Dual Core Pentium 3GHz, 2GB RAM, and a simple 7200 rpm hard disk...
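[Editor's note: as a sanity check on the extrapolation arithmetic above - the import figure works out to about 13,844 seconds, i.e. 3.85 hours - the proportional scaling can be reproduced with a few lines of C#. The timing inputs are the measured figures from the post; everything else is illustrative.]

```csharp
using System;

// Reproduce the proportional extrapolation used above:
// seconds_for_N_rows = measured_seconds * N / measured_rows
class ExtrapolationSketch
{
    static double Extrapolate(double measuredSeconds, double measuredRows, double targetRows)
    {
        return measuredSeconds * targetRows / measuredRows;
    }

    static void Main()
    {
        double exportSec = Extrapolate(141, 1216138, 100000000); // ~11594 s
        double importSec = Extrapolate(174, 1256819, 100000000); // ~13844 s
        Console.WriteLine("BCP export: {0:F0} s = {1:F2} h", exportSec, exportSec / 3600);
        Console.WriteLine("BCP import: {0:F0} s = {1:F2} h", importSec, importSec / 3600);
    }
}
```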
--
Shamil

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Monday, July 23, 2007 9:38 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Shamil,

I am likewise using SqlBulkCopy to import data from CSV files, using a third party library to handle the conversion of the file to a stream and vice versa. I am quite impressed. Using my hand written code for doing something similar I was getting around 1K records / second loading "row by row", whereas using SqlBulkCopy and this third party lib I am getting about 16K records / second - about 60 seconds for 1 million records. That is quite usable. I am doing tables of 50 to 100 million records, so the total process can still run 60 to 90 minutes, but it was taking 2 days.

John W. Colby
Colby Consulting
www.ColbyConsulting.com

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Monday, July 23, 2007 1:25 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

All,

To keep the "XML - go/no go" discussion alive, I just wanted to post some real stats on MS SQL 2000/2005 and XML that I got today out of the project I'm currently working on - this is a real database (84 tables), real production data; the quantity of source rows in the DB = [RowsQty]/3.

The test gets data from MS SQL db tables via stored procedures into in-memory List<...> collections via SqlDataReader (.NET) etc. - the usual modern multi-tier architecture. The test PC is an Intel Dual Core 3GHz + 2GB RAM with MS Windows 2003 Server - nothing special for these days.
In more detail, the test does the following:

MS SQL 2000 Database -> SqlDataReader (.NET) -> DL custom classes/list (C#) -> BL custom classes/list (C#) -> XmlSerializer (.NET) -> XML file (written via TextWriter (.NET)) -> deserialize using XmlSerializer and TextReader (.NET) -> serialize into another XML file using XmlSerializer and TextWriter (.NET)

Below are some results, sorted descending by elapsed time in seconds. (Note: table names are "censored" - it's a real project and I'm not sure the customer would like to see the real table names published.)

Performance in this test looks very impressive - have a look at the stats. (Note: results are cross tested by using the serialized data to store back into the db, then extract again, etc. - no mistakes here.) And BTW, loading these data using stored procedures row by row takes quite some time (several hours), but when loading them using a special .NET Framework class (System.Data.SqlClient.SqlBulkCopy), which works much like BCP, the loading of this db takes less than 10 minutes - the total qty of rows in all tables is 1,131,004 (one million...).

Elapsed time for this test: 122.421875 seconds (about 2 minutes); total elements (rows) processed - 3,393,012 (three million...).

I will be experimenting more in the coming days (end of August probably) and I will post more results... And this is not pure experimenting/"mad scientist games" - this code will be used in production for stress testing, for data replication, etc.
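[Editor's note: the serialize -> deserialize -> serialize round trip described above can be shown in miniature with a self-contained example. The Order class is a hypothetical stand-in for the project's real BL classes, and strings replace the actual files for brevity.]

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Miniature version of the round trip above:
// object -> XmlSerializer -> XML text -> XmlSerializer -> object again.
// "Order" is a stand-in for the project's real BL classes;
// XmlSerializer serializes its public members.
public class Order
{
    public int Id;
    public string Customer;
}

class RoundTripSketch
{
    static void Main()
    {
        XmlSerializer ser = new XmlSerializer(typeof(Order));

        Order original = new Order();
        original.Id = 42;
        original.Customer = "ACME";

        // serialize to XML (in memory here; the test above wrote via TextWriter to a file)
        StringWriter sw = new StringWriter();
        ser.Serialize(sw, original);
        string xml = sw.ToString();

        // deserialize back and confirm the round trip preserved the data
        Order copy = (Order)ser.Deserialize(new StringReader(xml));
        Console.WriteLine("{0} {1}", copy.Id, copy.Customer);
    }
}
```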
Finally, the stats:

DB -> Memory -> XML file -> Memory -> another XML file (used in the first cross-test):

TableName  ElapsedTime(s)  RowsQty
=========  ==============  =======
Table53    20.7500000      260286
Table71    12.6562500      327819
Table29    08.2500000      105000
Table79    07.4843750      204243
Table80    07.3750000      299556
Table84    06.7656250      262908
Table82    06.6718750      263337
Table75    06.5000000      250155
Table37    05.4062500      318555
Table68    04.0156250      189810
Table69    03.7500000      83490
Table72    03.1562500      74598
Table81    03.0312500      199287
Table73    02.8281250      135150
Table38    02.6406250      108636
Table30    01.8593750      108726

Here are the individual steps' stats:

Database -> custom objects in List<...>

ElapsedTime(s)  RowsQty
==============  =======
05.9218750      86762
03.7656250      109273
02.6406250      35000
02.2968750      68081
02.2812500      99852
02.2343750      87636
02.1875000      87779
02.0312500      83385
01.7656250      106185
01.4062500      24866
01.3906250      63270
01.2343750      27830
01.0468750      45050
01.0312500      66429

Memory (custom objects) -> XML files

ElapsedTime(s)  RowsQty
==============  =======
08.8750000      86762
05.0468750      109273
03.2812500      35000
02.9375000      99852
02.7187500      83385
02.6718750      68081
02.5156250      87779
02.4687500      87636
02.1250000      106185
01.5000000      27830
01.4218750      63270
01.1562500      66429
01.0000000      24866
00.9687500      45050
00.8750000      36212
00.6718750      36242

XML file -> Memory (custom objects)

ElapsedTime(s)  RowsQty
==============  =======
05.9531250      86762
03.8437500      109273
02.5156250      68081
02.3281250      35000
02.1562500      99852
02.0625000      87636
01.9687500      87779
01.7500000      83385
01.5156250      106185
01.2031250      63270
01.0156250      27830
00.8437500      66429
00.8125000      45050
00.7812500      36212

And here is the "microscopic" generic DAL, which is built using only three(!)
ADO.NET classes - SqlConnection, SqlCommand, IDataReader:

public int ExecuteNonQuery(DbCommand cmd)
{
    return ExecuteNonQuery(cmd, false);
}

public IDataReader ExecuteReader(DbCommand cmd)
{
    return ExecuteReader(cmd, CommandBehavior.Default);
}

public IDataReader ExecuteReader(DbCommand cmd, CommandBehavior behavior)
{
    return cmd.ExecuteReader(behavior);
}

That's it.

--
Shamil

-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com [mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence
Sent: Monday, July 23, 2007 7:01 AM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Hi John:

You have taken on a massively complex project single-handedly. When I was working full-time for a company and a similar sized project appeared, I would assign at least 2 people to it. It seems that 2 people can do the work of three when they work together.

MS SQL people tend to think they're a little better than the standard Access grunts. Why that is so I have no idea. MS SQL developers have the luxury of working with a faster and better product, so it is much easier to obtain positive results than from an equally complex project written totally in Access. That is why I write most of my new apps in a combination of an Access FE and an MS SQL BE - I get the best of all worlds.

MS SQL is more rugged than the MDB and handles unbound connections without the absolute need for a complex locking scheme, as MS SQL is designed to work in this type of environment. It internally handles locking and multi-user access to a single record or group of records. It is a professional level DB and is actually easier to work with.

Unfortunately, ADO is the best connection protocol for performance and reliability, but if you do not know it, it is just another major item to learn. Throw learning .Net from scratch into the mix, and you have to hold on with both hands just to keep your sanity.
I am amazed at how far you have come in such a short time. Nothing like a baptism by fire... If you are a little stressed, it is to be expected. Hope your day has gone well.

Regards
Jim

<<< tail of thread trimmed >>>

_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com