Hadoop, Mesos, Cascading, Scalding, Cascalog and Data Science with Paco Nathan
Episode #9 of the podcast is a talk with Paco Nathan. Available also on iTunes
We talked about how he got started with Hadoop back in 2007, using it for Natural Language Processing and text analytics.
We then started talking about Mesos http://mesos.apache.org/
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
We talked a little about the difference between YARN and Mesos. Paco talked about how Mesos is lower in the stack and part of the operating system, whereas YARN is higher up in the stack and built to support the Hadoop ecosystem in the JVM. He talked about the future of Mesos and touched on how it contrasts with Google Borg. For more information on Google Borg and Mesos, here is a great article: http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/
Then we got into Cascading, which was started by Chris Wensel – http://www.cascading.org/ and talked about the enterprise use cases for Cascading. He talked about how Cascading has always been geared toward enterprise use cases: not slicing and dicing data, but building an application on top of it and being able to debug it and see where it is running, because it is deterministic. He talked about how this contrasts with Hive and Pig. He brought up Steve Yegge’s post “Notes from the Mystery Machine Bus” https://plus.google.com/110981030061712822816/posts/KaSKeg4vQtz and talked a bit about how Cascading applies.
We got into design patterns for big batch workflows in the enterprise, breaking them up into five parts:
1) Different data sources (structured and unstructured data)
2) ETL
3) Custom data preparation and business logic to clean up the data
4) Analytics or predictive modeling to enrich the data
5) Integration with end use cases that consume the data products
Cascading addresses all of these points and Paco talked in more detail about them.
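To make the five parts concrete, here is a minimal sketch in plain Python. It does not use the Cascading API, and it is not something Paco showed on the podcast. Every function name, field name, and sample record below is a hypothetical stand-in invented for illustration, with tiny in-memory lists taking the place of real data sources.

```python
# A toy batch workflow in five stages. This is NOT Cascading code; all names
# and sample data are hypothetical and exist only to illustrate the pattern.


def load_sources():
    """1) Data sources: one structured, one unstructured (both made up)."""
    orders = [
        {"user": "ann", "amount": "10"},
        {"user": "bob", "amount": "n/a"},
        {"user": "ann", "amount": "5"},
    ]
    notes = ["ann: great service", "bob: slow delivery"]
    return orders, notes


def etl(orders, notes):
    """2) ETL: parse the free text and join it to the structured rows."""
    note_by_user = dict(line.split(": ", 1) for line in notes)
    return [dict(row, note=note_by_user.get(row["user"], "")) for row in orders]


def prepare(records):
    """3) Data preparation: drop rows whose amount is not a whole number."""
    clean = []
    for rec in records:
        if rec["amount"].isdigit():
            clean.append(dict(rec, amount=int(rec["amount"])))
    return clean


def enrich(records):
    """4) Analytics: a toy rule standing in for a predictive model."""
    return [dict(rec, positive="great" in rec["note"]) for rec in records]


def deliver(records):
    """5) Integration: aggregate into a data product for a consumer."""
    totals = {}
    for rec in records:
        totals[rec["user"]] = totals.get(rec["user"], 0) + rec["amount"]
    return totals


def run_workflow():
    """Chain the five stages in order, each feeding the next."""
    orders, notes = load_sources()
    return deliver(enrich(prepare(etl(orders, notes))))


if __name__ == "__main__":
    print(run_workflow())
```

Because each stage is a plain function of its input, the same input gives the same result on every run, and any intermediate stage can be called on its own when debugging. That loosely mirrors the determinism point made above, though a real Cascading application would express these stages with its own constructs.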
We finished up the podcast with him talking about the future of these technologies and of data science.
Subscribe to the podcast and listen to what Paco had to say. Available also on iTunes
/*
Joe Stein
Big Data Open Source Security LLC
http://www.stealth.ly
*/