[dba-VB] Merge rows

jwcolby jwcolby at colbyconsulting.com
Wed Sep 23 07:36:19 CDT 2009


Shamil,

I have to say that the idea of simply adding a field or two or three to hold groups of these 
existing "Y" / "" fields makes the whole thing much more palatable.

I have already written about the types of queries.  It turns out there are currently 584 total 
fields in the database.  One is the PK (long).  The next 100 fields hold various codes such as 
name, age, gender, dates, income range and so forth.  So there are about 480 "Y"/"" fields.
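
To make that shape concrete, roughly something like this (a sketch only; the table and column 
names below are made up to illustrate the layout, not the real schema):

CREATE TABLE dbo.People (
    PKID             BIGINT  NOT NULL PRIMARY KEY,  -- the long PK
    Gender           CHAR(1) NULL,                  -- a few of the ~100 "code" fields
    AgeBracket       CHAR(1) NULL,
    IncomeBracket    CHAR(1) NULL,
    -- ...roughly 97 more code fields...
    Walking          CHAR(1) NULL,                  -- a few of the ~480 "Y" / "" flag fields
    PhysicalExercise CHAR(1) NULL,
    Hunting          CHAR(1) NULL
    -- ...roughly 477 more flag fields...
);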

Specific fields among the first 100 are used a LOT: income bracket, gender, age bracket, child 
age brackets.

The client says "I need 2.5 million records WHERE..."

(GENDER = 1 AND
Income in (9,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t) AND
(Adults_25_34 in (1,3) OR
Adults_35_44 in (1,3)))

AND

(Walking = Y OR
Physical Exercise = Y OR
Hunting = Y OR
.
.
)

You can see a pattern here: they must fit in a gender / age / income bracket AND

they do, have, like, or own specific things.

The part that you are discussing, Shamil, neatly falls into the Do / Have / Like / Own part of the query.

To give you a clue how I implement this, I actually use two queries.  The first query implements 
the first part - age / income / gender, and sometimes also type of residence and the like.  The 
second query implements the Do / Have / Like / Own select.  I then JOIN those two queries on PKID 
to implement the AND between the two sets of selects.  I then JOIN that result on PKID to a query 
that pulls names / addresses from a separate table.
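
The whole thing looks roughly like this (a sketch only; table and column names are illustrative, 
not the real schema):

SELECT na.PKID, na.FirstName, na.LastName, na.Address
FROM (SELECT PKID                               -- query 1: age / income / gender
      FROM dbo.People
      WHERE Gender = '1'
        AND IncomeBracket IN ('9','a','b','c')
        AND (Adults_25_34 IN ('1','3') OR Adults_35_44 IN ('1','3'))
     ) AS demo
JOIN (SELECT PKID                               -- query 2: Do / Have / Like / Own
      FROM dbo.People
      WHERE Walking = 'Y' OR PhysicalExercise = 'Y' OR Hunting = 'Y'
     ) AS dhlo
  ON dhlo.PKID = demo.PKID                      -- the AND between the two sets of selects
JOIN dbo.NameAddress AS na                      -- the separate name / address table
  ON na.PKID = demo.PKID;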

The resulting data set may or may not have the required number of records.  If not, the client 
throws in other Do/Have/Like/Own fields to raise the count.  Once we have enough to fill the 
order, I select the TOP(N) records ordered by a random number field and ship the resulting names / 
addresses.
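
The final cut is just a TOP (N) ordered by that stored random number, something like this (again, 
illustrative names; OrderResult stands in for whatever holds the matching PKIDs):

SELECT TOP (2500000) na.PKID, na.FirstName, na.LastName, na.Address
FROM dbo.OrderResult AS r                -- the matched set from the queries above
JOIN dbo.NameAddress AS na ON na.PKID = r.PKID
ORDER BY r.RandomNumber;                 -- the pre-assigned random number field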

As you can see, you are proposing replacing the Do/Have/Like/Own query with custom binary 
processing.  In the end we have to get back to a recordset inside of SQL Server that can be JOINED 
on the PKID so that we can pull data from the people table, so the end result would be a table 
written back out to the database holding the PKIDs of the records that match the array manipulation.
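
In other words, the hand-off from the array code back to SQL Server would look something like this 
(ArrayMatches is a hypothetical landing table, shown only to illustrate the idea):

CREATE TABLE dbo.ArrayMatches (PKID BIGINT NOT NULL PRIMARY KEY);

-- bulk-load the PKIDs selected by the in-memory bit crunching, then:
SELECT na.PKID, na.FirstName, na.LastName, na.Address
FROM dbo.ArrayMatches AS am
JOIN dbo.NameAddress AS na
  ON na.PKID = am.PKID;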

You are of course correct: the in-memory processing of the binary array would be lightning fast, 
and the array could be left in memory forever.

The "left in memory forever" doesn't really do me any good though since I process anywhere from one 
or two orders a month to one or two a week, usually one a week or less.  Given that these servers 
use considerable juice, I also tend to turn them off unless they are in use.

That said, loading the array one time would be pretty fast anyway, I suspect; way faster than the 
whole process of defining indexes when there isn't one for the required fields, creating those 
indexes, and then processing that second query.
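
For comparison, today each flag field the order touches may need an index built before that second 
query runs well, e.g. (illustrative names again):

CREATE NONCLUSTERED INDEX IX_People_Walking ON dbo.People (Walking);
CREATE NONCLUSTERED INDEX IX_People_Hunting ON dbo.People (Hunting);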

I wouldn't say this isn't doable.  I would say it is not a trivial task / development effort.  If 
I processed a dozen orders a day it would be a no-brainer to investigate doing this.  For a 
once-a-week process it becomes less so.

I like the sound of it.  No indexes for that entire section of the table.  Multi-second response 
for that section of the queries.  The benefits would indeed be awesome, but it would take a long 
time to pay back the development cost.

John W. Colby
www.ColbyConsulting.com


Shamil Salakhetdinov wrote:
> Stuart,
> 
> I did mean that once loaded bit array will be kept "warm" in memory (John
> has enough memory for that).
> 
> It's unclear what the main types of queries from JC's customer are (John, can
> you clarify this?)
> 
> - if those queries select a few thousand individual records, as you
> suggest(?), and every query ANDs at least one "Y/N" field with the rest of
> the WHERE filtering criteria, then "brute force bits crunching" will
> outperform MS SQL, which would have to scan the whole 50 million records;
> the bit crunching would return the few thousand matching record IDs in
> ~3 seconds(?), and those IDs could then be used to select the records from
> the "DB from Hell" and apply the rest of the WHERE filtering criteria to
> select a subset;
> 
> - if those queries select a few million records and calculate some stats on
> them, and every query again ANDs at least one "Y/N" field with the rest of
> the WHERE filtering criteria, then "brute force bits crunching" will again
> outperform MS SQL for the bit operations; as for the other part of the WHERE
> clause, it could be possible to use LINQ to Objects here if all the
> compacted records can be loaded into memory on JC's huge computer, and if
> that is possible then LINQ to Objects (or just custom programming of the
> selection criteria evaluation plus stats counting) should outperform MS SQL,
> I expect...
> 
> Recap: it all depends on the type of queries run against the "DB From
> Hell" - if 80-90% of them fit the first or the second type of query
> mentioned above, then it does make sense to use the "brute force bits
> crunching" approach on the compacted records, as it might considerably
> shorten the time the queries run.
> If it doesn't matter whether a query runs within a minute or within an hour
> or two, then "brute force bits crunching" isn't needed there...
> 
> Or do you have your queries already running over "DB from Hell" within a
> minute/several minutes, John?
> 
> Thank you.
> 
> --
> Shamil



