Archive for the ‘Open Source Projects’ Category

Cloudera, Yahoo and the Apache Hadoop Community Security Branch Release Update

May 5, 2011 1 comment

Since Yahoo! announced that they would discontinue their Hadoop distribution and focus their efforts on Apache Hadoop (http://yhoo.it/i9Ww8W), the landscape has become tumultuous.

Yahoo! engineers have spent their time and effort contributing back to the Apache Hadoop security branch (a branch of 0.20) and have proposed release candidates.

Currently being voted on and discussed is “Release candidate 0.20.203.0-rc1”. If you are following the VOTE and the DISCUSSION threads then maybe you are like me: it just cannot be done without a bowl of popcorn ready before opening the emails. It is getting heated in a good and constructive kind of way. At http://mail-archives.apache.org/mod_mbox/hadoop-general/201105.mbox/thread there are already more emails in 5 days of May than there were in all of April. woot!

My take? Has it become Cloudera vs Yahoo!, and will Apache Hadoop releases become fragmented because of it? Well, it is kind of like that already. 0.21 is the latest release, but can anyone who is not a committer quickly find out the difference between it and the other release branches? It is esoteric :( 0.22, which is a release from trunk, is right around the corner too.

Let’s take HBase as an example (a Hadoop project). Do you know which HDFS releases can support HBase in production without losing data? Even if you do, you may not realize that many people still don’t know the append branch exists. Now that CDH3 is out you can use that (thanks Cloudera!); otherwise it is highly recommended not to run HBase in production unless you use the append branch of 0.20 http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/ which makes you miss out on other changes in trunk releases…

__ eyes crossing inwards and sideways with what branch does what and when the trunk release has everything __

Hadoop is becoming à la carte: which features and fixes can I live without, given all of what I really need to deploy? The alternatives are hiring a committer … or a bunch of folks that do nothing but Hadoop day in and day out (sounds like Oracle, ahhhhhh) … or going with the Cloudera Distribution (which is what I do, and I don’t look back). The barrier to entry feels like it has increased over the last year. However, stepping back from that, the system overall has had a lot of improvements! A lot of great work has been done by a lot of dedicated folks putting in their time and effort towards making Hadoop (in whatever form the elephant stampedes through its data) a reality.

Big shops that have teams of “Hadoop Engineers” (Yahoo, Facebook, eBay, LinkedIn, etc.) with contributors and/or committers on the team should not feel much impact, because ultimately they are able to roll their own releases with whatever they need or want in production and support it themselves. Not all are so endowed.

Now, all of that having been said, I write this because the discussion is REALLY good and has a lot of folks (including those from Yahoo! and Cloudera) bringing up pain points and proposing some great solutions that hopefully will contribute to the continued growth and success of the Apache Hadoop Community http://hadoop.apache.org/…. Still, if you want to run it in your company (and don’t have a committer on staff) then go download CDH3 (http://www.cloudera.com). It will get you going with the latest and greatest of all the releases, branches, etc. Great documentation too!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

NoSQL HBase and Hadoop with Todd Lipcon from Cloudera

September 6, 2010 2 comments

Episode #6 of the Podcast is a talk with Todd Lipcon from Cloudera discussing HBase.

We talked about NoSQL and how it should stand for “Not Only SQL”, about the tight integration between Hadoop and HBase, and about how systems like Cassandra (which is eventually consistent rather than strongly consistent like HBase) are complementary, since each has its place within the big data ecosystem depending on your use cases.

With the strong consistency of HBase you get features like incrementing counters, and the tight integration with Hadoop means faster loads through HDFS thanks to “bulk loads”, a new feature in the 0.89 development preview release that is described in its docs folder.

We covered a lot more unique features, talked about what is coming in upcoming releases, and went over some tips for HBase, so subscribe to the podcast and listen to all of what Todd had to say.

/*
Joe Stein
http://www.medialets.com
*/

Pre-Release from Pentaho – HIVE JDBC Adapter

August 15, 2010 Leave a comment

Pentaho’s Jordan Ganoff, Software Engineer, has open sourced some HIVE JDBC Adapters as part of what they are doing for their BI server

http://forums.pentaho.com/showthread.php?77826-Hive-amp-Hadoop

Not sure what state they are in, but I will try to check them out this week.

To use from Maven:
<dependency>
  <groupId>org.apache.hadoop.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>0.5.0-pentaho-SNAPSHOT</version>
</dependency>

You must also add the repository information to either the pom.xml or your local settings:
<repository>
  <id>pentaho</id>
  <name>Pentaho External Repository</name>
  <url>http://repo.pentaho.org/artifactory/repo</url>
</repository>

/*

Joe Stein
http://medialets.com

*/

Categories: Hive, Open Source Projects

Hadoop and Pig with Alan Gates from Yahoo

Episode 4 of our Podcast is with Alan Gates, Senior Software Engineer @ Yahoo! and Pig committer. Click here to listen.

Hadoop is a really important part of Yahoo’s infrastructure because processing and analyzing big data is increasingly important for their business. Hadoop allows Yahoo to connect their consumer products with their advertisers and users for a better user experience. They have been involved with Hadoop for many years now and have their own distribution. Yahoo also sponsors/hosts a user group meeting which has grown to hundreds of attendees every month.

We talked about what Pig is now, the future of Pig, and other projects like Oozie http://github.com/tucu00/oozie1 which Yahoo uses (and is open source) to automate workflows of MapReduce jobs and Pig scripts. We also talked about Zebra http://wiki.apache.org/pig/zebra, Owl http://wiki.apache.org/pig/owl, and Elephant Bird http://github.com/kevinweil/elephant-bird

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Ruby Streaming for Hadoop with Wukong: a talk with Flip Kromer from Infochimps

Another great discussion on our Podcast. Click here to listen. For this episode our guest was Flip Kromer from Infochimps http://www.infochimps.org. Infochimps.org’s mission is to increase the world’s access to structured data. They have been working since the start of 2008 to build the world’s most interesting data commons, and since the start of 2009 to build the world’s first data marketplace. Their founding team consists of two physicists (Flip Kromer and Dhruv Bansal) and one entrepreneur (Joseph Kelly).

We talked about Ruby streaming with Hadoop and why to use the open source project Wukong to simplify implementing Hadoop jobs in Ruby. There are some great examples http://github.com/infochimps/wukong/tree/master/examples that are just awesome, like the web log analysis that builds the paths (chains of pages) that users follow during a visit.
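The path-building idea can be sketched as a small map/reduce pair. This is not Wukong's code (Wukong is Ruby); it is a Python illustration under assumptions of my own: a tab-separated log format of user, epoch seconds and page, and a 30-minute gap that ends a session. The function names are hypothetical.

```python
from itertools import groupby

# Assumption: a gap of more than 30 minutes between hits starts a new session.
SESSION_GAP_SECONDS = 30 * 60


def map_line(line):
    """Parse 'user<TAB>epoch_seconds<TAB>page' into (user, (timestamp, page))."""
    user, ts, page = line.rstrip("\n").split("\t")
    return user, (int(ts), page)


def reduce_user(user, visits):
    """Yield (user, path) per session; path is the ordered list of pages."""
    path, last_ts = [], None
    for ts, page in sorted(visits):
        if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
            yield user, path
            path = []
        path.append(page)
        last_ts = ts
    if path:
        yield user, path


def session_paths(lines):
    """Run map, shuffle (sort by user) and reduce locally, as streaming would."""
    mapped = sorted(map_line(line) for line in lines)
    for user, group in groupby(mapped, key=lambda kv: kv[0]):
        yield from reduce_user(user, [value for _, value in group])
```

On a real cluster the sort and grouping by user would be done by Hadoop between the map and reduce steps; here they are simulated in-process so the sketch can be run on a handful of lines.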

It was interesting to learn some of the new implementations and projects that he has going on like using Cassandra to help with storing unique values for social network analysis.  This new project is called Cluster Chef http://github.com/infochimps/cluster_chef.  ClusterChef will help you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more — use as many or as few as you like.

Example clusters you can build:

  • A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
  • A spot-priced, ebs-backed cluster for unattended computing at rock-bottom prices
  • A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS for

Why Chef and Poolparty:

  • With Chef, you declare a final state for each node, not a procedure to follow. Administration is more efficient, robust and maintainable.
  • You get a nice central dashboard to manage clients
  • You can easily roll out configuration changes across all your machines
  • Chef is actively developed and has well-written recipes for webservers, databases, development tools, and a ton of different software packages.
  • Poolparty makes creating amazon cloud machines concise and easy: you can specify spot instances, ebs-backed volumes, disable-api-termination, and more.

Recipes included so far:

  • Hadoop
  • NFS
  • Persistent HDFS on EBS volumes
  • Zookeeper (in progress)
  • Cassandra (in progress)

Another couple of good links we got from Flip were for Peter Norvig’s “Unreasonable Effectiveness of Data”: http://bit.ly/effectofdata and http://bit.ly/norvigtalk

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Hadoop, BigData and Cassandra with Jonathan Ellis

Today I spoke with Jonathan Ellis who is the Project Chair of the Apache Cassandra project http://cassandra.apache.org/ and co-founder of Riptano, the source for professional Cassandra support http://riptano.com.  It was a great discussion about Hadoop, BigData, Cassandra and Open Source.

We talked about the recent Cassandra 0.6 NoSQL integration and support for Hadoop Map/Reduce against the data stored in Cassandra and some of what is coming up in the 0.7 release.

We touched on how Pig is currently supported and why there may not be enough motivation for Hive integration to get support in Cassandra in the future.

We also got a bit into a discussion of HBase vs Cassandra and some of the benefits & drawbacks as they live in your ecosystem (e.g. HBase is to OLAP as Cassandra is to OLTP).

This was the second Podcast and you can click here to listen.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc/
*/

Hadoop NYC Meetup With Yale University and Datameer

April 22, 2010 Leave a comment

The NYC Hadoop meetup on April 21st was great.  Many thanks as always to the Cloudera folks and The Winter Wyman Companies for the pizza.  Also thanks to Hiveat55 for use of their office and the Datameer folks for a good time afterwards.

The first part of the meetup was a presentation by Azza Abouzeid and Kamil Bajda-Pawlikowski (Yale University) on HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.html). Their vision is to take the best of both worlds: the MapReduce bliss for lots of data that we get from Hadoop, and the complex data analysis capabilities of a DBMS.

The basic idea behind HadoopDB is to give Hadoop access to multiple single-node DBMS servers (e.g. PostgreSQL or MySQL) deployed across the cluster. HadoopDB pushes as much data processing as possible into the database engine by issuing SQL queries (usually most of the Map/Combine phase logic is expressible in SQL). This in turn results in a system that resembles a shared-nothing parallel database. Applying techniques taken from the database world leads to a performance boost, especially in more complex data analysis. At the same time, the fact that HadoopDB relies on the MapReduce framework ensures scalability and fault/heterogeneity tolerance similar to Hadoop.
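To make the pushdown idea concrete, here is a minimal sketch. It is not HadoopDB's API: in-memory SQLite databases stand in for the single-node PostgreSQL or MySQL servers, and the `visits` table, its columns and the function names are all hypothetical. Each "node" runs the map/combine step as one SQL aggregate, and a reduce step merges the partial results.

```python
import sqlite3
from collections import Counter


def make_node(rows):
    """Create a stand-in node: an in-memory SQLite database with one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_combine(conn):
    """Push the map/combine logic into the database as a single SQL aggregate."""
    return conn.execute(
        "SELECT url, SUM(hits) FROM visits GROUP BY url"
    ).fetchall()


def reduce_partials(partials):
    """Merge the per-node partial sums into global totals."""
    totals = Counter()
    for rows in partials:
        for url, hits in rows:
            totals[url] += hits
    return dict(totals)


# Two partitions of the same logical table, one per node.
nodes = [
    make_node([("/home", 3), ("/cart", 1), ("/home", 2)]),
    make_node([("/home", 4), ("/search", 5)]),
]
totals = reduce_partials(map_combine(node) for node in nodes)
```

The point of the design is that each database only ships back one row per URL instead of every raw record, so the reduce side has far less data to merge.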

They have spent a lot of time thinking through, finding and resolving the tradeoffs that occur, and they continue to make progress on this end. They have had 2,200 downloads as of this posting and are actively looking for developers to contribute to their project. I think it is great to see a university involved at this level in Open Source in general and, more specifically, doing work related to Hadoop. The audience was very engaged, which made for a very lively discussion. Their paper tells all the gory details: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf

The rest of the meetup was off the hook. Stefan Groschupf got off to a quick start, throwing down some pretty serious street cred as a long-standing committer for Nutch, Hadoop, Katta, Bixo and more. He was very engaging, with a good assortment of anecdotes for the question that drives the Hadoop community: “What do you want to do with your data?” It is always processing it or querying it, and there is no silver bullet solution. We were then given a demo of Datameer’s product (which is one of the best user interface concepts I have seen).

In short, the Datameer Analytics Solution (DAS) is a spreadsheet user interface that lets users take a sample of data and (with 15 existing data connections and over 120 functions), like any good spreadsheet, pull the data into an aggregated format. Their product then takes that format and pushes it down into Hadoop (like going through Hive), where it runs as a map/reduce job.

So end to end you can have worthy analytic folks (spreadsheet types) do their job against limitless data.  wicked.

From their website http://datameer.com:

With DAS, business users no longer have to rely on intuition or a “best guess” based on what’s happened in the past to make business decisions. DAS makes data assets available as they are needed regardless of format or location so that users have the facts they need to reach the best possible conclusions.

DAS includes a familiar interactive spreadsheet that is easy to use, but also powerful so that business users don’t need to turn to developers for analytics. The spreadsheet is specifically designed for visualization of big data and includes more than 120 built-in functions for exploring and discovering complex relationships. In addition, because DAS is extensible, business analysts can use functions from third-party tools or they can write their own commands.

Drag & drop reporting allows users to quickly create their own personalized dashboard. Users simply select the information they want to view and how to display it on the dashboard: tables, charts, or graphs.

The portfolio of analytical and reporting tools in organizations can be broad. Business users can easily share data in DAS with these tools to either extend their analysis or to give other users access.

After the quick demo Stefan walked us through a solution for using Hadoop to pull the “signal from the noise” in social data and used Twitter as an example. He used a really interesting graph exploration tool (going to give it a try myself) http://gephi.org/. Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. He then talked a bit about X-RIME http://xrime.sourceforge.net/ which is an open source, Hadoop-based tool for large-scale social network analysis.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Categories: Meetups, Open Source Projects

Nifty Tool To Export Files From HDFS Into MySQL

April 19, 2010 2 comments

This is an interesting open source project I have recently heard about http://code.google.com/p/hiho/.

What is very interesting to me about this project is the export utility which takes data from HDFS and loads it into MySQL.

It also has a nice way of querying and importing data from a JDBC database directly into HDFS. It looks much more robust than the out of the box DBInputFormat that Hadoop provides. You can import the data as delimited records, with a choice of delimiter. You can also import the data and save it as Avro records. It supports queries, so you can, say, join two tables. It splits on user-specified column ranges instead of using LIMIT and OFFSET. It does no code generation or ORM mapping.
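Splitting on column ranges rather than LIMIT and OFFSET can be sketched as follows. This is not HIHO's code, only a Python illustration of the idea; the function name, the table and column names, and the assumption of an integer column with known bounds are all mine.

```python
def range_splits(table, column, low, high, num_splits):
    """Build one query per split, each covering a half-open range of `column`.

    Together the queries cover every integer value in [low, high] exactly once.
    Identifiers are interpolated directly, which is fine for an illustration
    but should never be done with untrusted input.
    """
    if num_splits < 1 or high < low:
        raise ValueError("need num_splits >= 1 and high >= low")
    total = high - low + 1
    step = -(-total // num_splits)  # ceiling division
    queries = []
    start = low
    while start <= high:
        end = min(start + step, high + 1)
        queries.append(
            f"SELECT * FROM {table} WHERE {column} >= {start} AND {column} < {end}"
        )
        start = end
    return queries


# Hypothetical example: ids 1..10 in an `orders` table, split across 3 mappers.
for query in range_splits("orders", "id", 1, 10, 3):
    print(query)
```

With a range predicate, each mapper's query can use an index on the split column, whereas an OFFSET-based split makes the database read and discard every skipped row first.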

There are other ETL tools out there (e.g. Sqoop http://www.cloudera.com/developers/downloads/sqoop/). In Cloudera’s Distribution for Hadoop Version 3 (CDH3), Sqoop now also supports exporting from HDFS back into MySQL.

I am definitely going to have to give this utility a try. I hear from the HIHO project folks that the next to-do is support for exporting to more databases.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc/
*/
