Archive for the ‘Podcast’ Category

Ruby Streaming for Hadoop with Wukong a talk with Flip Kromer from Infochimps

Another great discussion on our PodcastClick here to listen.  For this episode our guest was Flip Kromer from Infochimps’s mission is to increase the world’s access to structured data.  They have been working since the start of 2008 to build the world’s most interesting data commons, and since the start of 2009 to build the world’s first data marketplace. Our founding team consists of two physicists (Flip Kromer and Dhruv Bansal) and one entrepreneur (Joseph Kelly).

We talked about Ruby streaming with Hadoop and why to use the open source project Wukong to simplify implementation of Hadoop using Ruby.  There are some great examples that are just awesome like the web log analysis that creates the paths (chain of pages) that users go through during their visited session.

It was interesting to learn some of the new implementations and projects that he has going on like using Cassandra to help with storing unique values for social network analysis.  This new project is called Cluster Chef  ClusterChef will help you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more — use as many or as few as you like.

  • A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
  • A spot-priced, ebs-backed cluster for unattended computing at rock-bottom prices
  • A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS for
  • With Chef, you declare a final state for each node, not a procedure to follow. Adminstration is more efficient, robust and maintainable.
  • You get a nice central dashboard to manage clients
  • You can easily roll out configuration changes across all your machines
  • Chef is actively developed and has well-written recipes for webservers, databases, development tools, and a ton of different software packages.
  • Poolparty makes creating amazon cloud machines concise and easy: you can specify spot instances, ebs-backed volumes, disable-api-termination, and more.
  • Hadoop
  • NFS
  • Persistent HDFS on EBS volumes
  • Zookeeper (in progress)
  • Cassandra (in progress)

Another couple of good links we got from Flip were Peter Norvig’s “Unreasonable Effectiveness of Data” thing I mentioned:


Joe Stein