The NYC Hadoop meetup on April 21st was great. Many thanks as always to the Cloudera folks and The Winter Wyman Companies for the pizza. Also thanks to Hiveat55 for use of their office and the Datameer folks for a good time afterwards.
The first part of the meetup was a presentation by Azza Abouzeid and Kamil Bajda-Pawlikowski (Yale University) on HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html). Their vision is to combine the best of both worlds: the MapReduce scalability over large datasets that Hadoop provides, and the complex data analysis capabilities of a DBMS.
The basic idea behind HadoopDB is to give Hadoop access to multiple single-node DBMS servers (e.g. PostgreSQL or MySQL) deployed across the cluster. HadoopDB pushes as much data processing as possible into the database engine by issuing SQL queries (usually most of the Map/Combine phase logic is expressible in SQL). This in turn results in a system that resembles a shared-nothing parallel database. Applying techniques taken from the database world leads to a performance boost, especially in more complex data analysis. At the same time, the fact that HadoopDB relies on the MapReduce framework gives it scalability and fault/heterogeneity tolerance comparable to Hadoop's.
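To make the SQL-pushdown idea concrete, here is a minimal, hypothetical sketch (not HadoopDB's actual code): it uses in-memory sqlite3 databases in place of per-node PostgreSQL/MySQL instances, expresses the "map" side as a GROUP BY query executed locally against each node's database engine, and merges the partial aggregates in a "reduce" step. The table and data are made up for illustration.

```python
import sqlite3
from collections import Counter

def make_node(rows):
    """Stand-in for a single-node DBMS holding one shard of a visits table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (url TEXT)")
    db.executemany("INSERT INTO visits VALUES (?)", [(r,) for r in rows])
    return db

# Two "nodes", each with its own local shard of the data.
nodes = [
    make_node(["/home", "/about", "/home"]),
    make_node(["/home", "/blog"]),
]

# "Map" phase: instead of a mapper scanning raw files, push the
# aggregation into each node's SQL engine as a GROUP BY query.
partials = [
    db.execute("SELECT url, COUNT(*) FROM visits GROUP BY url").fetchall()
    for db in nodes
]

# "Reduce" phase: merge the per-node partial aggregates.
totals = Counter()
for partial in partials:
    for url, count in partial:
        totals[url] += count

print(dict(totals))  # '/home': 3, '/about': 1, '/blog': 1 (order may vary)
```

The point of the design is visible even in this toy: the per-node queries run inside an optimized database engine, and only small partial aggregates cross the network to the reduce step.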
They have spent a lot of time thinking through, finding, and resolving the tradeoffs that occur, and continue to make progress on this front. They have had 2,200 downloads as of this posting and are actively looking for developers to contribute to their project. I think it is great to see a university involved at this level in Open Source in general, and more specifically doing work related to Hadoop. The audience was very engaged and it made for a very lively discussion. Their paper tells all the gory details: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf.
The rest of the meetup was off the hook. Stefan Groschupf got off to a quick start, throwing down some pretty serious street cred as a long-standing committer on Nutch, Hadoop, Katta, Bixo and more. He was very engaging, with a good set of anecdotes around the question that drives the Hadoop community: "What do you want to do with your data?" It is always processing it or querying it, and there is no one silver-bullet solution. He then demoed Datameer's product (which has one of the best user interface concepts I have seen).
In short, the Datameer Analytic Solution (DAS) provides a spreadsheet user interface that lets users take a sample of data and, like any good spreadsheet (with 15 existing data connections and over 120 functions), pull the data into an aggregated format. The product then translates that spreadsheet logic, pushing it down into Hadoop (e.g. through Hive), where it runs as a map/reduce job.
So, end to end, you can have worthy analytic folks (spreadsheet types) do their job against limitless data. Wicked.
From their website (http://datameer.com):
With DAS, business users no longer have to rely on intuition or a “best guess” based on what’s happened in the past to make business decisions. DAS makes data assets available as they are needed regardless of format or location so that users have the facts they need to reach the best possible conclusions.
DAS includes a familiar interactive spreadsheet that is easy to use, but also powerful so that business users don’t need to turn to developers for analytics. The spreadsheet is specifically designed for visualization of big data and includes more than 120 built-in functions for exploring and discovering complex relationships. In addition, because DAS is extensible, business analysts can use functions from third-party tools or they can write their own commands.
Drag & drop reporting allows users to quickly create their own personalized dashboard. Users simply select the information they want to view and how to display it on the dashboard – tables, charts, or graphs.
The portfolio of analytical and reporting tools in organizations can be broad. Business users can easily share data in DAS with these tools to either extend their analysis or to give other users access.
After the quick demo, Stefan walked us through a solution for using Hadoop to pull the "signal from the noise" in social data, using Twitter as an example. He used a really interesting graph exploration tool (I'm going to give it a try myself), Gephi (http://gephi.org/). Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. He then talked a bit about X-RIME (http://xrime.sourceforge.net/), which is an open-source, Hadoop-based toolkit for large-scale social network analysis.