[dba-Tech] New interesting software

Thu Jun 11 12:51:57 CDT 2009

There is a bit of interesting software around:

First, NUTCH (http://lucene.apache.org/nutch/index.html), which is
open-source web search software. According to the Lucene site, it adds
web-specifics such as a crawler, a link-graph database, parsers for HTML
and other document formats, etc.

Second, there is a distribution of HADOOP now available at
http://developer.yahoo.com. It is being touted as a "... framework for running
data-intensive applications on large clusters of commodity hardware." It is
also supposed to be fully distributed: "It maps data-crunching tasks
across distributed machines, splitting them into tiny sub-tasks, before
reducing the results into one master calculation." Around this product there
is a whole group of spin-off applications. It even comes with a full set of
tools for development purposes. This is a super high-end OS application and,
according to the product developers, it is capable of splitting a request
across 10,000 processor cores simultaneously. It is also supposed to be 16
times faster than the Google search engine.
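
To make the "map, then reduce" idea quoted above a little more concrete, here
is a minimal single-machine sketch in plain Python. It is not Hadoop code and
uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase,
word_count) are made up for illustration, and the word-count task is just the
usual toy example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a cluster each chunk could be mapped on a different machine;
    # this sketch simply loops over them in one process.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    chunks = ["hadoop maps tasks", "hadoop reduces results", "tasks tasks"]
    print(word_count(chunks))
```

The point of the split is that map_phase never looks at more than one chunk and
reduce_phase never looks at more than one key, which is what lets a framework
spread both steps across many machines.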

Third; there is a product called HIVE (http://hadoop.apache.org/hive).
According to the web site:
<quote>
HIVE is a data warehouse infrastructure built on top of Hadoop that provides
tools to enable easy data summarization, adhoc querying and analysis of
large datasets data stored in Hadoop files. It provides a mechanism to put
structure on this data and it also provides a simple query language called
Hive QL which is based on SQL and which enables users familiar with SQL to
query this data. At the same time, this language also allows traditional
map/reduce programmers to be able to plug in their custom mappers and
reducers to do more sophisticated analysis which may not be supported by the
built-in capabilities of the language.
</quote>
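
As a rough idea of what a "custom mapper" might look like, here is a small
Python sketch written in the streaming style, where rows arrive as
tab-separated text lines and transformed rows are emitted the same way. The
column layout (user id, URL) and the function name map_rows are hypothetical,
and how such a script gets wired into a Hive QL query is not shown here.

```python
import io


def map_rows(lines):
    """Hypothetical custom mapper: (user_id, url) rows -> (host, 1) rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        _user_id, url = fields
        # Drop the scheme (if any) and keep only the host part of the URL.
        host = url.split("://", 1)[-1].split("/", 1)[0]
        yield f"{host}\t1"


if __name__ == "__main__":
    # Stand-in for the rows a real job would feed the script on stdin.
    sample = io.StringIO(
        "u1\thttp://example.com/a\n"
        "u2\thttp://example.org/b\n"
    )
    for out in map_rows(sample):
        print(out)
```

In an actual streaming script the loop would read from sys.stdin instead of
the in-memory sample.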

Microsoft has also introduced its own parallel or grid processing package,
called DRYAD (http://research.microsoft.com/en-us/projects/Dryad).

<quote>
Dryad is an infrastructure which allows a programmer to use the resources of
a computer cluster or a data center for running data-parallel programs. A
Dryad programmer can use thousands of machines, each of them with multiple
processors or cores, without knowing anything about concurrent programming.
</quote>

The application has its own programming language, called DryadLINQ, which is
similar to LINQ.

Microsoft has changed much of its attitude towards the OpenSource community,
at least on a usage basis. MS has reversed its decision to replace the back
end of Hotmail (FreeBSD). BING (http://www.bing.com/) is using much of its
code from the recent acquisition of Powerset (http://www.Powerset.com), the
core of which is mostly OpenSource; Powerset was an original contributor to
Hadoop's HBase project... small world.

Google, for its part, has its customized/optimized file system, GFS
(http://labs.google.com/papers/gfs.html), and its MapReduce processing
framework (http://labs.google.com/papers/mapreduce.html and
http://labs.google.com/papers/mapreduce-osdi04-slides/index.html); both are
proprietary systems.

How the performance of these applications compares is going to be interesting
to watch... Though it is unlikely that I will have any direct need to
implement this type of software, I would very much like to do some serious
testing. 8-)

Jim