Jim Lawrence
accessd at shaw.ca
Thu Jun 11 12:51:57 CDT 2009
There is a bit of interesting software around:

First, NUTCH (http://lucene.apache.org/nutch/index.html), which is open-source web search software. According to the Lucene site, it adds web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Second, there is a distribution of HADOOP now available at http://developer.yahoo.com. It is being touted as a "... framework for running data-intensive applications on large clusters of commodity hardware." It is also supposed to be fully distributed: "It maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation." Around this product there is a whole group of spin-off applications. It even comes with a full set of tools for development purposes. This is a super high end OS application and, according to the product developers, it is capable of splitting a request across 10,000 processor cores simultaneously. It is also supposed to be 16 times faster than the Google search engine.

Third, there is a product called HIVE (http://hadoop.apache.org/hive). According to the web site:

<quote> HIVE is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language. </quote>

Microsoft has also introduced its own parallel or grid processing package called DRYAD (http://research.microsoft.com/en-us/projects/Dryad).
<quote> Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming. </quote>

The application has its own programming language called DryadLINQ, similar to LINQ.

Microsoft has changed much of its attitude towards the OpenSource community, at least on a usage basis. MS has reversed its decision to replace the back end of Hotmail (FreeBSD). BING (http://www.bing.com/) is using much of its code from a recent acquisition, Powerset (http://www.Powerset.com), the core of which is mostly OpenSource, and this code traces back to Powerset being an original contributor to Hadoop's HBase project... small world.

Google likewise has its customized/optimized file system, GFS (http://labs.google.com/papers/gfs.html), and its MapReduce processing framework (http://labs.google.com/papers/mapreduce.html and http://labs.google.com/papers/mapreduce-osdi04-slides/index.html), which sits behind its search engine; both are proprietary systems.

How the performance of these applications compares is going to be interesting to watch... Though it is unlikely that I will have any direct need to implement this type of software, I would very much like to do some serious testing. 8-)

Jim
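
P.S. For anyone who has not met the map/reduce idea quoted above ("maps data-crunching tasks ... before reducing the results into one master calculation"), here is a minimal single-machine sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up purely for illustration, and the word-count task is just an assumed example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # in this sketch the chunks are simply processed one after another.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["hadoop hive nutch", "hive hadoop", "hadoop"]))
```

The point of the split is that map_phase only ever looks at one chunk, so the chunks can be farmed out independently, and only the grouped results have to be brought together for the reduce.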