[dba-VB] How I'm approaching the problem

Shamil Salakhetdinov shamil at users.mns.ru
Tue Jul 24 14:37:04 CDT 2007


Hello John,

<<<
I don't understand the context - would what have any economic sense?
>>>
I meant: does your customer actually need bulk loading to be any faster? One
can fly Paris - NY on an ordinary Boeing, or (in the past) on Concorde or even
the Tu-144 (!?) (http://en.wikipedia.org/wiki/Tupolev_Tu-144) - IOW, does your
customer still need a Concorde for this task, or is the current, very good
Boeing solution good enough for them now and for a long time to come?

I also mean we could try to find the bottleneck (if any) of the current
solution and "break" it.

Some tests (Dual Core 3 GHz, 2 GB RAM, Windows 2003 Server, simple IDE/SATA
7200 rpm HDD) - the raw performance of reading a TAB-delimited CSV file in
.NET:

Total Counter = 1011852 (1 million...)
ElapsedTime = 3,359375 seconds
ElapsedTime = 0,0559895833333333 minutes

Total Counter = 10006092 (10 million...)
ElapsedTime = 32,890625 seconds
ElapsedTime = 0,548177083333333 minutes

Total Counter = 100010952 (100 million...)
ElapsedTime = 327,875 seconds
ElapsedTime = 5,46458333333333 minutes


Splitting of the CSV file lines into an array added...

Total Counter = 1011852 (1 million...)
ElapsedTime = 8,375 seconds
ElapsedTime = 0,139583333333333 minutes

... (I didn't have time to test the 10 and 100 million line cases with
splitting - please do if you have time...)


Of course that is just reading the file - the open question is how closely
this raw file-reading performance can be approached when such a file is bulk
loaded into an MS SQL database.
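
For what it's worth, here is a minimal sketch of what that next test might
look like with System.Data.SqlClient.SqlBulkCopy - just a sketch, assuming a
hypothetical target table dbo.TestImport with three varchar columns matching
three TAB-delimited fields in the test file (connection string and table name
are placeholders):

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

namespace ConsoleApplication1
{
    class BulkLoadTest
    {
        static void Main()
        {
            string csvPath = @"F:\Temp\XL\ConsoleApplication1\IN\testfile.txt";
            string connStr = "Data Source=.;Initial Catalog=TestDb;Integrated Security=SSPI"; // placeholder

            // Staging table: three text columns assumed to match the test file.
            DataTable table = new DataTable();
            table.Columns.Add("Field1", typeof(string));
            table.Columns.Add("Field2", typeof(string));
            table.Columns.Add("Field3", typeof(string));

            using (SqlConnection cn = new SqlConnection(connStr))
            {
                cn.Open();
                using (SqlBulkCopy bcp = new SqlBulkCopy(cn))
                using (StreamReader rdr = new StreamReader(csvPath))
                {
                    bcp.DestinationTableName = "dbo.TestImport"; // hypothetical table
                    bcp.BatchSize = 10000;

                    string line;
                    while ((line = rdr.ReadLine()) != null)
                    {
                        string[] fields = line.Split('\t');
                        table.Rows.Add(fields[0], fields[1], fields[2]);

                        // Flush in chunks so the whole file never sits in memory.
                        if (table.Rows.Count == bcp.BatchSize)
                        {
                            bcp.WriteToServer(table);
                            table.Clear();
                        }
                    }
                    if (table.Rows.Count > 0)
                        bcp.WriteToServer(table);
                }
            }
        }
    }
}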

John, I'm sorry I have to suspend my participation in this thread for
several days - a very urgent project is waiting for release here...


I must say I'm very keen to return to this next week - sorry for any
inconvenience...

--
Shamil

P.S. Simple C# code used for testing:


using System;
using System.Collections.Generic;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            long MAX_LIMIT = 1000000; // append two zeros for the 100 million line run
            string s = @"F:\Temp\XL\ConsoleApplication1\IN\testfile.txt";
            long totalCounter = 0;
            long lineCounter = 0;
            DateTime startTime = DateTime.Now;

            // Re-read the test file until MAX_LIMIT lines have been processed in total.
            while (totalCounter < MAX_LIMIT)
            {
                System.IO.TextReader rdr = new System.IO.StreamReader(s);
                lineCounter = 0;
                string inline = "";
                char delimiter = (char)9; // TAB
                while ((inline = rdr.ReadLine()) != null)
                {
                    string[] fields = inline.Split(delimiter); // split each line into an array
                    ++totalCounter;
                    ++lineCounter;
                }
                rdr.Close();
            }

            DateTime endTime = DateTime.Now;
            TimeSpan elapsedTime = endTime - startTime;
            Console.WriteLine("File Line Counter = {0}", lineCounter);
            Console.WriteLine("Total Counter = {0}", totalCounter);
            Console.WriteLine("ElapsedTime = {0} seconds", elapsedTime.TotalSeconds);
            Console.WriteLine("ElapsedTime = {0} minutes", elapsedTime.TotalMinutes);
        }
    }
}
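
A side note on the timing itself: DateTime.Now only updates roughly every
15 ms on Windows, so for short runs System.Diagnostics.Stopwatch gives cleaner
numbers - a minimal substitution for the code above:

// Drop-in replacement for the DateTime.Now / TimeSpan timing above.
System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
// ... the read/split loop goes here ...
sw.Stop();
Console.WriteLine("ElapsedTime = {0} seconds", sw.Elapsed.TotalSeconds);
Console.WriteLine("ElapsedTime = {0} minutes", sw.Elapsed.TotalMinutes);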
 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, July 24, 2007 7:02 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Shamil,

>Would it make any (economic) sense for your customer to improve the DB bulk
loading time?

I don't understand the context - would what have any economic sense?

>Why don't you use the 64-bit .NET Framework to run your code on the server
side?

My understanding is that in order to run anything in 64 bit mode I have to
have 64 bit Windows and 64 bit SQL Server.  I only have 32 bit installed at
this time.  I have obtained a copy of 64 bit SQL Server and am looking for
64 bit Windows 2003.  I have several problems here.  

First, I am a one-man show with a month-by-month budget.  Second, in order
to get maximum bang for my hard-earned dollar I build my own systems.

I was talking to the owner of the company that does my address validation
and they just spent $25,000 on a Dell server with 32 gigs of RAM and
Windows 2003 x64.  I don't have that kind of money.

When you buy a Dell (as an example), getting Windows x64 to install is a
simple click of a button on the order page.  Getting Windows x64 to install on
a desktop is not so simple, with driver issues for everything from video to
network to disk.  When I built the workstations I use to run SQL Server I did
not understand the size of that task; now I do.  I will be building a new
server this fall with a server motherboard, designed and certified to run the
x64 versions of Windows, with drivers supplied etc.  It will be a dual
processor / 8 core machine which will contain 32 gigs of RAM to start, 64 gigs
eventually.  I MAY be able to get Windows 2003 x64 to run on the current
servers, and if I can then I will, but I am not holding my breath.  Even if I
can, they max out at 8 gigs of RAM, so they will still be underpowered for my
purpose.

>Do you need to do any mapping in your code between the third-party data
reader (IDataReader) object and SqlBulkCopy's
SqlBulkCopyColumnMappingCollection, or is the sequence of fields in the source
CSV file the same as the sequence of fields in the target database table?

I am still working through that.  The third-party lib, DataStreams from
www.csvreader.com, has a set of objects (readers) which can read and write
streams to and from files.  The docs are complete but the examples are sparse!
However, the author is available on his forums to answer questions and
generally responds within a few hours.

Basically what happens is that you instantiate one of the reader objects, tell
it what data type each column will be, and then pass a stream from that object
to the SqlBulkCopy object.  The DataStreams reader object handles opening the
file, parsing the CSV into an array, and writing the array data into the
stream.  It appears to handle the data in chunks so that the entire file does
not have to fit into memory at once.  With a CSV file there is typically a
header line in the first row; DataStreams reads that in and does the mapping
for you if that row exists.  He also has methods for obtaining the collection
of field names etc., so I am actually building code to automatically create a
table in SQL Server based on the names in that field names collection, plus a
PK column of my own (long autoincrement) out at the end.  His code correctly
feeds the data into the table IF the columns from the file come first, i.e.
the PK is at the end of the table, but it does not correctly handle the feed
if the PK is the first field in the table (and there is no PK field in the
data).
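
(If it helps: SqlBulkCopy itself can be told explicitly which source column
feeds which destination column, so the PK-first layout should also be
workable.  A minimal sketch, assuming the third-party reader really is an
IDataReader and the CSV header names match the SQL column names - table,
column and variable names here are made up:)

using (SqlBulkCopy bcp = new SqlBulkCopy(connection))
{
    // Hypothetical table: PKID (identity) first, then the CSV columns.
    bcp.DestinationTableName = "dbo.ImportedList";

    // Map by name instead of by ordinal, so it no longer matters where PKID sits.
    bcp.ColumnMappings.Add("FirstName", "FirstName");
    bcp.ColumnMappings.Add("LastName", "LastName");
    bcp.ColumnMappings.Add("Zip", "Zip");
    // PKID is deliberately not mapped; SQL Server fills the identity value.

    bcp.WriteToServer(csvReader); // csvReader: the third-party IDataReader over the CSV
}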

Remember that I have two distinct cases.

1) Data (lists) from vendors.  These may have some "PK" of their own, but I
do not know what it is or whether there will be one (typically not) so I
just routinely build my own.
2) Data that I export from SQL Server, process, and get back from the
address validation software.  Those CSVs already have a PKID because my
source tables have one and I export that PKID into the CSV file when I do
the export from SQL Server.
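
(For case 1, a rough sketch of the "build the table from the header names plus
my own PK" step described above - every field is assumed to be varchar(255)
and all names/sizes are placeholders, not the actual DataStreams or project
code:)

using System.Text;

static class TableBuilder
{
    // Builds a CREATE TABLE statement from the CSV header names, all as
    // varchar(255) (a guess - real widths would come from inspecting the data),
    // with the long autoincrement PK appended at the end.
    public static string BuildCreateTable(string tableName, string[] headerNames)
    {
        StringBuilder sql = new StringBuilder();
        sql.AppendFormat("CREATE TABLE [{0}] (", tableName);
        foreach (string name in headerNames)
            sql.AppendFormat("[{0}] varchar(255) NULL, ", name);
        sql.Append("[PKID] bigint IDENTITY(1,1) NOT NULL PRIMARY KEY)");
        return sql.ToString();
    }
}

// Usage (cn is an open SqlConnection; names are invented):
//   new SqlCommand(TableBuilder.BuildCreateTable("tblVendorList", fieldNames), cn).ExecuteNonQuery();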

Given the cost of the lib, I am very impressed with the implementation so far.
He provides a record event to allow preprocessing of the data within each
record being read, before sending it off to the SqlBulkCopy.  You can do
anything you might imagine with that kind of control - look for specific data
values in a given field, skip the record based on such evaluations, modify the
data in specific fields etc., all on the fly as the data moves from CSV to SQL
Server.  I believe it also handles fixed-width files, which are also on my
plate.
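
(The same kind of per-record hook can be done generically, without the
vendor's event, by filtering or fixing rows in a staging DataTable before the
bulk copy.  Just a sketch - stagingTable, bcp and the "State" column are
assumptions, and this is not the DataStreams API:)

// Per-record preprocessing before the bulk copy.
foreach (DataRow row in stagingTable.Rows)
{
    string state = row["State"] as string;     // hypothetical column
    if (state == null || state.Trim().Length == 0)
        row.Delete();                          // skip the record
    else
        row["State"] = state.Trim().ToUpper(); // or modify it on the fly
}
stagingTable.AcceptChanges();                  // physically drops the deleted rows
bcp.WriteToServer(stagingTable);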

So basically I just outsourced one small but complex part of the project.
It was easy to get working and seems pretty fast so I am happy with the
results so far.

John W. Colby
Colby Consulting
www.ColbyConsulting.com 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Shamil
Salakhetdinov
Sent: Tuesday, July 24, 2007 3:12 AM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Hello John,

Would it make any (economic) sense for your customer to improve the DB bulk
loading time?

Why don't you use the 64-bit .NET Framework to run your code on the server
side?

Do you need to do any mapping in your code between the third-party data
reader (IDataReader) object and SqlBulkCopy's
SqlBulkCopyColumnMappingCollection, or is the sequence of fields in the source
CSV file the same as the sequence of fields in the target database table?

--
Shamil

 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Tuesday, July 24, 2007 12:17 AM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Well... The SQL Server system is an AMD x64 3.8 GHz with 4 gigs of RAM running
Windows 2003.  It uses a RAID 6 array with a dedicated controller.

The actual VB.NET program is running on my laptop, a Dell M90, which is a
dual-proc Intel with 2 gigs of RAM running Windows XP Pro.  Thus the raw data
files come off the server to the laptop across a 1 Gbit LAN; the laptop does
the text processing and feeds the data back to SQL Server for storage in the
database.

All of this will be affected by things like the number of fields, the type of
fields written to in SQL Server (what the data actually is and the data type
of the destination field), the indexes that exist on the table, etc.
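
(For what it's worth, the SqlBulkCopy knobs that usually interact with those
factors are batch size, table locking and the timeout - a hedged sketch rather
than a recommendation, with connectionString, sourceReader and the table name
as placeholders:)

using (SqlBulkCopy bcp = new SqlBulkCopy(connectionString,
       SqlBulkCopyOptions.TableLock))   // bulk-load lock; fast, but blocks other writers
{
    bcp.DestinationTableName = "dbo.TargetTable"; // placeholder
    bcp.BatchSize = 10000;                        // rows per batch sent to the server
    bcp.BulkCopyTimeout = 0;                      // 0 = no timeout, for very large loads
    bcp.WriteToServer(sourceReader);              // sourceReader: any IDataReader over the CSV
}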

Mine are very rough timings, done by using the clock at the beginning and end
of the bulk import.  The custom library that I use was optimized (according to
the author) by using arrays to store the data coming out of the text file on
the way to SQL Server.  Again, I haven't seen his code (I could pay more money
to get the source but I don't need that), so I can't really comment on the
"how" of what he is doing.  I do know that I open the file using his lib and
pass a resulting stream (a property of his reader object) to the SqlBulkCopy.
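
(Side note: SqlBulkCopy.WriteToServer accepts an IDataReader or a DataTable
rather than a raw Stream, so presumably that "stream" property is really an
IDataReader over the file - which would also tie back to Shamil's mapping
question above.  Guessing at the shape of the call, with csvReader standing in
for the third-party object:)

bulkCopy.DestinationTableName = "dbo.TargetTable"; // placeholder
bulkCopy.WriteToServer(csvReader);                 // the WriteToServer(IDataReader) overload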

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Shamil
Salakhetdinov
Sent: Monday, July 23, 2007 3:57 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

<<<
I am doing tables of 50 to 100 million records so the total process can
still run 60 to 90 minutes but it was taking 2 days.
>>>
Yes, that's speedy!

Do you use any special hardware?

Here are some results of using SQLDMO.BulkCopy under C#/VB.NET:

SQLDMO.BulkCopy
================

BCP export - 141 seconds 1,216,138 reported rows

BCP Import - 174 seconds 1,256,819 reported rows

Extrapolating to 100 million records:

BCP Export: 141 * 100,000,000 / 1,216,138 = 11,594 seconds = 3.22 hours ~=
3h13m

BCP Import: 174 * 100,000,000 / 1,256,819 = 13,844 seconds = 3.85 hours ~=
3h50m

Well, performance will probably degrade and the total time will be higher...

These results are not as quick as SqlBulkCopy on your system, but this test PC
of mine is "pretty simple" for these days: a Dual Core Pentium 3 GHz with 2 GB
RAM and a simple 7200 rpm hard disk...


--
Shamil
 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of jwcolby
Sent: Monday, July 23, 2007 9:38 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Shamil,

I am likewise using SqlBulkCopy to import data from CSV files, using a
third-party library to handle the conversion of the file to a stream and vice
versa.  I am quite impressed.  Using my hand-written code for doing something
similar I was getting around 1K records / second loading "row by row", whereas
using SqlBulkCopy and this third-party lib I am getting about 16K records /
second = about 60 seconds for 1 million records.  That is quite usable.

I am doing tables of 50 to 100 million records so the total process can
still run 60 to 90 minutes but it was taking 2 days.

John W. Colby
Colby Consulting
www.ColbyConsulting.com
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Shamil
Salakhetdinov
Sent: Monday, July 23, 2007 1:25 PM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

All,

To keep "XML - go/no go discussion on" I just wanted to post some real stats
on MS SQL 2000/2005 and XML I have got today out of the project I'm working
currently on - this is a real database (84 tables), real production data,
quantity of source rows in DB = [RowsQty]/3. The test gets data from MS SQL
db tables via stored procedures into in memory List<...> collections via
SqlReader (.NET) etc. - usual modern multi-tier architecture.

The test PC is an Intel Dual Core 3 GHz + 2 GB RAM with MS Windows 2003
Server - nothing special these days.

In more details the test does the following:

MS SQL 2000 Database
 -> SqlDataReader (.NET)
  -> DL custom classes/list (C#)
    -> BL custom classes/list (C#)
     -> XmlSerializer (.NET)
       -> XML file (written via TextWriter (.NET))
        -> deserialize using XmlSerializer and TextReader (.NET)
         -> serialize into another XML file using XmlSerializer and
TextWriter (.NET)
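
For anyone who wants to reproduce the shape of the test, here is a
stripped-down sketch of the serialize/deserialize round trip - Customer is a
made-up stand-in for the real BL classes and the file path is a placeholder:

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Customer            // stand-in for a real BL class
{
    public int Id;
    public string Name;
}

class XmlRoundTrip
{
    static void Main()
    {
        List<Customer> rows = new List<Customer>();
        Customer c = new Customer();
        c.Id = 1;
        c.Name = "Test";
        rows.Add(c);

        XmlSerializer ser = new XmlSerializer(typeof(List<Customer>));

        // BL objects -> XML file
        using (TextWriter wrt = new StreamWriter(@"C:\Temp\customers.xml"))
            ser.Serialize(wrt, rows);

        // XML file -> BL objects (the cross-check step)
        List<Customer> back;
        using (TextReader rdr = new StreamReader(@"C:\Temp\customers.xml"))
            back = (List<Customer>)ser.Deserialize(rdr);

        System.Console.WriteLine("Rows read back: {0}", back.Count);
    }
}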

Below are some results, sorted descending by elapsed time in seconds.  (Note:
table names are "censored" - it's a real project and I'm not sure the customer
would like to see the real table names published.)

The performance of this test looks very impressive - have a look at the stats.
(Note: the results are cross-tested by storing the serialized data back into
the db, extracting it again, etc. - no mistakes here.  And BTW, loading these
data using stored procedures row by row takes quite some time (several hours),
but loading them using the special .NET Framework class
System.Data.SqlClient.SqlBulkCopy, which works much like BCP, takes less than
10 minutes - the total qty of rows in all tables is 1,131,004 (one
million...).)
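
(For scale: 1,131,004 rows in under 10 minutes works out to at least
1,131,004 / 600 ~= 1,900 rows per second through SqlBulkCopy, whereas the same
rows spread over several hours of row-by-row stored procedure calls is on the
order of only 100 rows per second.)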

Elapsed time for this test: 122,421875 seconds (2 minutes), total elements
(rows) processed - 3,393,012 (three million...)

I will be experimenting more in the coming days (end of August probably) and
I will post more results... 

And this is not pure experimenting/"mad scientist games" - this code will be
used in production for stress testing, for data replication etc. etc.

Finally, the stats: DB -> Memory -> XML file -> Memory -> another XML file
(used in the first cross-test):

TableName       Elapsed Time (s)    RowsQty
=========       ================    =======
Table53	  20.7500000	260286
Table71	  12.6562500	327819
Table29	  08.2500000	105000
Table79	  07.4843750	204243
Table80	  07.3750000	299556
Table84	   06.7656250	262908
Table82	   06.6718750	263337
Table75	   06.5000000	250155
Table37	   05.4062500	318555
Table68	   04.0156250	189810
Table69	   03.7500000	83490
Table72	   03.1562500	74598
Table81	   03.0312500	199287
Table73	   02.8281250	135150
Table38	   02.6406250	108636
Table30	   01.8593750	108726


Here are the stats for the individual steps:

Database -> custom objects in List<...>

Elapsed Time (s)    RowsQty
================    =======
05.9218750	86762
03.7656250	109273
02.6406250	35000
02.2968750	68081
02.2812500	99852
02.2343750	87636
02.1875000	87779
02.0312500	83385
01.7656250	106185
01.4062500	24866
01.3906250	63270
01.2343750	27830
01.0468750	45050
01.0312500	66429


Memory (custom objects) -> XML files

Elapsed Time (s)    RowsQty
================    =======
08.8750000	86762
05.0468750	109273
03.2812500	35000
02.9375000	99852
02.7187500	83385
02.6718750	68081
02.5156250	87779
02.4687500	87636
02.1250000	106185
01.5000000	27830
01.4218750	63270
01.1562500	66429
01.00000 	24866
00.9687500	45050
00.8750000	36212
00.6718750	36242


XML File -> Memory (custom objects)

Elapsed Time (s)    RowsQty
================    =======
05.9531250	86762
03.8437500	109273
02.5156250	68081
02.3281250	35000
02.1562500	99852
02.0625000	87636
01.9687500	87779
01.7500000	83385
01.5156250	106185
01.2031250	63270
01.0156250	27830
00.8437500	66429
00.8125000	45050
00.7812500	36212


And here is "microscopic" generic DAL, which is built using only three(!)
ADO.NET classes SqlConnection, SqlCommand, IDataReader  used here:


public int ExecuteNonQuery(DbCommand cmd) {
    return ExecuteNonQuery(cmd, false);
}

public IDataReader ExecuteReader(DbCommand cmd) {
    return ExecuteReader(cmd, CommandBehavior.Default);
}

public IDataReader ExecuteReader(DbCommand cmd, CommandBehavior behavior) {
    return cmd.ExecuteReader(behavior);
}

That's it.
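
A minimal usage sketch for that DAL, reading a stored procedure's result set
into a List<> the way the test above describes - the proc name, column layout
and the Customer stand-in class are invented, and dal/cn are assumed to be the
DAL instance and an open SqlConnection:

// Assumes: using System.Data; using System.Data.SqlClient;
//          using System.Collections.Generic;
List<Customer> customers = new List<Customer>();
using (SqlCommand cmd = new SqlCommand("usp_GetCustomers", cn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    IDataReader rdr = dal.ExecuteReader(cmd);
    try
    {
        while (rdr.Read())
        {
            Customer c = new Customer();
            c.Id = rdr.GetInt32(0);     // assumed column order
            c.Name = rdr.GetString(1);
            customers.Add(c);
        }
    }
    finally
    {
        rdr.Close();
    }
}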

--
Shamil
 
-----Original Message-----
From: dba-vb-bounces at databaseadvisors.com
[mailto:dba-vb-bounces at databaseadvisors.com] On Behalf Of Jim Lawrence
Sent: Monday, July 23, 2007 7:01 AM
To: dba-vb at databaseadvisors.com
Subject: Re: [dba-VB] How I'm approaching the problem

Hi John:

You have taken on a massively complex project single-handedly.  When I was
working full-time for a company and a similarly sized project appeared, I
would assign at least 2 people to it.  It seems that 2 people can do the work
of three when they work together.

MS SQL people tend to think they're a little better than the standard Access
grunts.  Why that is so I have no idea.  MS SQL developers have the luxury of
working with a faster and better product, so it is much easier to obtain
positive results than from an equally complex project written entirely in
Access.  That is why I write most of my new apps as a combination of an Access
FE and an MS SQL BE - I get the best of all worlds.

MS SQL is more rugged than the MDB and handles unbound connections without the
absolute need for a complex locking scheme, as MS SQL is designed to work in
this type of environment.  It internally handles locking and multi-user access
to a single record or group of records.  It is a professional-level DB and is
actually easier to work with.  Unfortunately, while ADO is the best connection
protocol for performance and reliability, if you do not know it, it is just
another major item to learn.

If we throw learning .NET from scratch into the mix, you have to hold on with
both hands just to keep your sanity.  I am amazed at how far you have come in
such a short time.  Nothing like a baptism by fire...  If you are a little
stressed, it is to be expected.

Hope your day has gone well.

Regards
Jim
    
<<< tail of thread trimmed >>>

_______________________________________________
dba-VB mailing list
dba-VB at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-vb
http://www.databaseadvisors.com
