Hadoop is a critical part of Yahoo’s infrastructure, because processing and analyzing big data is increasingly important to their business. Hadoop lets Yahoo connect their consumer products with their advertisers and users for a better user experience. They have been involved with Hadoop for many years now and maintain their own distribution. Yahoo also sponsors and hosts a user group meeting that has grown to hundreds of attendees every month.
We talked about what Pig is today, the future of Pig, and related projects like Oozie http://github.com/tucu00/oozie1, an open-source workflow engine Yahoo uses to automate chains of MapReduce jobs and Pig scripts. We also talked about Zebra http://wiki.apache.org/pig/zebra, Owl http://wiki.apache.org/pig/owl, and Elephant Bird http://github.com/kevinweil/elephant-bird.
Another great discussion on our podcast. Click here to listen. For this episode our guest was Flip Kromer from Infochimps http://www.infochimps.org. Infochimps.org’s mission is to increase the world’s access to structured data. They have been working since the start of 2008 to build the world’s most interesting data commons, and since the start of 2009 to build the world’s first data marketplace. Their founding team consists of two physicists (Flip Kromer and Dhruv Bansal) and one entrepreneur (Joseph Kelly).
We talked about Ruby streaming with Hadoop and why the open source project Wukong simplifies implementing Hadoop jobs in Ruby. There are some great examples http://github.com/infochimps/wukong/tree/master/examples, like the web log analysis that reconstructs the paths (chains of pages) users follow during a visit.
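Under the hood, Hadoop Streaming just reads lines on stdin and writes tab-separated key/value lines to stdout; Wukong wraps that contract in friendlier Ruby. Here is a minimal hand-rolled sketch of the session-path idea (the log format and field layout are assumptions for illustration, not Wukong’s actual example code):

```ruby
# Sketch of a Hadoop Streaming job in plain Ruby. Hypothetical log
# format: "ip timestamp page" per line. Wukong layers a nicer API on
# top of this same stdin/stdout contract.

# Mapper: emit "ip<TAB>timestamp<TAB>page" so Hadoop shuffles by ip.
def map_line(line)
  ip, time, page = line.split
  "#{ip}\t#{time}\t#{page}"
end

# Reducer: Hadoop delivers input sorted by key (the ip), so all of one
# visitor's lines arrive together and can be folded into a page chain.
def reduce_lines(sorted_lines)
  paths = Hash.new { |h, k| h[k] = [] }
  sorted_lines.each do |line|
    ip, _time, page = line.split("\t")
    paths[ip] << page
  end
  paths.map { |ip, pages| "#{ip}\t#{pages.join(' > ')}" }
end

# Hadoop Streaming invokes this script once as the mapper and once as
# the reducer, e.g. `ruby session_paths.rb map`.
if __FILE__ == $0
  case ARGV.first
  when "map"    then STDIN.each_line { |l| puts map_line(l.chomp) }
  when "reduce" then puts reduce_lines(STDIN.readlines.map(&:chomp))
  end
end
```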
It was interesting to learn about some of the new implementations and projects he has going on, like using Cassandra to help store unique values for social network analysis. He also has a new project called ClusterChef http://github.com/infochimps/cluster_chef. ClusterChef will help you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more; use as many or as few as you like.
- A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
- A spot-priced, EBS-backed cluster for unattended computing at rock-bottom prices
- A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS
- With Chef, you declare a final state for each node, not a procedure to follow. Administration is more efficient, robust and maintainable.
- You get a nice central dashboard to manage clients
- You can easily roll out configuration changes across all your machines
- Chef is actively developed and has well-written recipes for webservers, databases, development tools, and a ton of different software packages.
- Poolparty makes creating Amazon cloud machines concise and easy: you can specify spot instances, EBS-backed volumes, disable-api-termination, and more.
- Persistent HDFS on EBS volumes
- Zookeeper (in progress)
- Cassandra (in progress)
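To make the declarative point above concrete: a Chef recipe names the state you want, and Chef converges each node to it on every run. This is only an illustrative sketch; the package, template and service names are hypothetical, not ClusterChef’s actual cookbooks:

```ruby
# Illustrative Chef recipe (resource names are made up for this sketch).
# You state the desired end state; Chef figures out what to change.

package "hadoop-0.20" do
  action :install            # ensure the package is present
end

template "/etc/hadoop/conf/core-site.xml" do
  source "core-site.xml.erb" # rendered from cluster-wide attributes
  notifies :restart, "service[hadoop-datanode]"
end

service "hadoop-datanode" do
  action [:enable, :start]   # running now, and after every reboot
end
```

Because each resource is idempotent, rolling out a config change across all machines is just editing the template and letting every node re-converge.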
Today I spoke with Jonathan Ellis who is the Project Chair of the Apache Cassandra project http://cassandra.apache.org/ and co-founder of Riptano, the source for professional Cassandra support http://riptano.com. It was a great discussion about Hadoop, BigData, Cassandra and Open Source.
We talked about the recent Cassandra 0.6 release, its support for running Hadoop MapReduce jobs against data stored in Cassandra, and some of what is coming up in the 0.7 release.
We touched on how Pig is currently supported, and why Hive integration with Cassandra may not materialize in the future.
We also got into a discussion of HBase vs. Cassandra and some of the benefits and drawbacks of each in your ecosystem (e.g. HBase is to OLAP as Cassandra is to OLTP).
For those folks just getting started, or already in the daily trenches of MapReduce development, Karmasphere has come about to help developers and technical professionals make Hadoop MapReduce easier http://www.karmasphere.com/. Karmasphere Studio is a desktop IDE for graphically prototyping MapReduce jobs and deploying, monitoring and debugging them on Hadoop clusters in private and public clouds.
* Runs on Linux, Apple Mac OS and Windows.
* Works with all major distributions and versions of Hadoop including Apache, Yahoo! and Cloudera.
* Works with Amazon Elastic MapReduce.
* Supports local, networked, HDFS and Amazon S3 file systems.
* Supports Cascading.
* Enables job submission from all major platforms including Windows.
* Operates with clusters and file systems behind firewalls.
So, what can you do with it?
- Prototype on the desktop: Get going with MapReduce job development quickly. No need for a cluster since Hadoop emulation is included.
- Deploy to a private or cloud-based cluster: Whether you’re using a cluster in your own network or in a cloud, deploy your jobs easily.
- Debug on the cluster: One of the most challenging areas in MapReduce programming is debugging your job on the cluster. Visual tools deliver real-time insight into your job, including support for viewing and charting Hadoop job and task counters.
- Graphically visualize and manipulate: Whether it’s clusters, file systems, job configuration, counters, log files or other debugging information, save time and get better insight by accessing it all in one place.
- Monitor and analyze your jobs in real time: Get a real-time, workflow view of inputs, outputs and intermediate results, including the map, partition, sort and reduce phases.
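The map, partition, sort and reduce phases the studio charts can be sketched in a few lines of plain Ruby for a toy word count. This is only an illustration of the data flow between phases, not Karmasphere or Hadoop code:

```ruby
# Toy walk-through of the MapReduce phases, using word count.
# Pure Ruby, no Hadoop needed; each step mirrors one phase.

input = ["big data is big", "data is fun"]

# Map: each record becomes (word, 1) pairs
mapped = input.flat_map { |line| line.split.map { |w| [w, 1] } }

# Partition: hash each key to one of two reducers
partitions = mapped.group_by { |word, _| word.hash % 2 }

# Sort: within each partition, order by key so equal keys are adjacent
sorted = partitions.transform_values { |pairs| pairs.sort_by(&:first) }

# Reduce: sum the counts for each key
counts = sorted.values.flatten(1)
               .group_by(&:first)
               .transform_values { |pairs| pairs.sum { |_, n| n } }

# counts == {"big"=>2, "data"=>2, "is"=>2, "fun"=>1} (key order may vary)
```

Karmasphere’s value is letting you watch exactly these intermediate results (the mapped pairs, the partition assignments, the sorted runs) on a real cluster instead of reasoning about them blind.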
Whether you’re new to Hadoop and want to explore MapReduce programming easily, like the sound of a tool that helps you prototype, deploy and manage in one integrated environment, or are already using Hadoop and could use more insight into your jobs running on a cluster, there’s something here for you.
All you need is NetBeans (version 6.7 or 6.8) and Java 1.6 and you’ll be ready to give Karmasphere Studio a whirl.
You do NOT need any kind of Hadoop cluster set up to begin prototyping. But when you are ready to deploy your job on a large data set, you’ll need a virtual or real cluster in your data center or a public cloud such as Amazon Web Services.
An Eclipse version is in progress.