[AccessD] ElasticSearch my hind leg...

John W Colby jwcolby at gmail.com
Mon Mar 10 19:25:57 CDT 2014


 >>Note: This is not Cloud computing specifically...it can be done on any server(s) you have in the 
office.

I simply went to the ElasticSearch site and clicked every single menu item, poking around. The site 
is not particularly well designed, mostly marketing fluff.  It is not cloud computing specifically, 
it is distributed computing.

The data I deal with is NOT web page kind of stuff.  It is not Google / Facebook / bing / search 
kind of stuff.  It is pretty much straight up regular old data, properties about people used to 
select interests and therefore marketing.  The vast majority of the data is single character fields.

Has cat = 'Y'/'N'
Has Dog = 'Y'/'N'
Gender = 'M'/'F'/'U'
Birthdate = some date
Income = some long int
ReligiousPreference = some single character - 'J' (Jewish)

And of course
FName Varchar(25)
LName Varchar(25)
Addr Varchar(50)
City Varchar(50)
St (char(2)
Zip5 Char(5)
Zip4 Char(4)

Say what you will but the way I see this, this is straight up SQL stuff.

Count (*) Where age (which is a dynamic calculation based on current date and the birthdate) > 30 
and <=45 AND Gender = 'F' and HasDog = 'Y' and ReligiousPreference = 'J'

Or alternately "Give me the names / addresses for the people with these properties in a specific zip 
list"

These are precisely the kinds of queries I am asked to perform day in and day out.

I have a third party Zip code list which has attributes of "Number of households in zip" and 
"Population in zip", as well as geo location (lat/lon).  I am handed a list of 15 thousand zips and 
asked how many households and population in these zips.  Or I am handed a single zip and asked for 
the population in a five mile radius of that one zip.

These are very straight up SQL kind of questions.

I know that you have a love of and hankering for these distributed processing systems but my data is 
very standard (and old fashioned) marketing stuff.  SQL was designed precisely in order to 
manipulate these kinds of questions, and that is what I use.  I am not searching a million 
paragraphs of input to blog pages looking for the word "KISS" or anything remotely like that.

My data is huge, but only because the tables have anywhere from 5 million to 225 million records and 
anywhere from 20 fields to 640 fields.  But the fields are mostly just Y/N or single character 
values 0-9/A-T (codes).

Whatever database engine I use has to have a SQL kind of language behind it.  I have to be able to 
pull name / address records based on these kinds of questions.  SQL is NOT going away.  I have read 
a ton of the interviews with execs from these data engines optimized for text search and to a man 
NONE of these folks say their systems work well for SQL type applications.  They are not designed to 
do so.  They do completely different kinds of things, and the things they do do not map to SQL apps, 
or what I am doing.

SQL Server in fact does a great job of breaking down the queries into tasks which it feeds off to 
the cores.  That is precisely why I went with a high core count machine.  SQL Server very often pegs 
every core of my machine for minutes at a time as it performs the task of selecting records based on 
huge criteria lists.  This is quite fascinating to watch as well.

The biggest problem I face really is that it is simply impossible to index 640 fields in a 
meaningful way such that a question against any of these 640 fields always hits an index.  What I do 
is index the most used fields, the zip, income, gender, religion etc, and then put up with the table 
scan of most of the other fields.  I do occasionally create a cover index for a particularly large 
set of fields in very specific cases, but mostly not.  And some of the smaller databases (smaller 
number of fields total) I may very well index more fields.

John W. Colby

Reality is what refuses to go away
when you do not believe in it

On 3/10/2014 7:20 PM, Jim Lawrence wrote:
> Hi John:
>
> Note: This is not Cloud computing specifically...it can be done on any server(s) you have in the office.
>


---
This email is free from viruses and malware because avast! Antivirus protection is active.
http://www.avast.com



More information about the AccessD mailing list