
Hadoop, Mesos, Cascading, Scalding, Cascalog and Data Science with Paco Nathan

Episode #9 of the podcast is a talk with Paco Nathan.  Available also on iTunes

We talked about how he got started with Hadoop back in 2007, using it for natural language processing and text analytics.

Then we started talking about Mesos: http://mesos.apache.org/

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

We talked a little about the difference between YARN and Mesos.  Paco explained that Mesos sits lower in the stack, at the operating system level, whereas YARN sits higher up in the stack and is built to support the Hadoop ecosystem on the JVM.  He talked about the future of Mesos and touched on how it contrasts with Google Borg. For more information on Google Borg and Mesos, here is a great article: http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/

Then we got into Cascading, which was started by Chris Wensel (http://www.cascading.org/), and talked about the enterprise use cases for it.  Paco explained that Cascading has always been geared toward enterprise use cases: not slicing and dicing data, but building an application on top of it and being able to debug that application and see where it is running, because it is deterministic. He talked about how this contrasts with Hive and Pig. He also brought up Steve Yegge's post "Notes from the Mystery Machine Bus" https://plus.google.com/110981030061712822816/posts/KaSKeg4vQtz and talked a bit about how it applies to Cascading.

We got into design patterns for big batch workflows in the enterprise, which Paco broke up into five parts:

1) Different data sources (structured and unstructured data)
2) ETL
3) Custom data preparation and business logic to clean up the data
4) Analytics or predictive modeling to enrich the data
5) Integration with end use cases that consume the data products

Cascading addresses all of these points and Paco talked in more detail about them.
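
To make the five parts a bit more concrete, here is a minimal sketch of such a batch workflow in plain Python. It is not Cascading code and does not use the Cascading API. The sample data, the function names and the "analytics" step (a simple per-user mean) are all assumptions made up for illustration. Each stage is a pure function, so the same input always gives the same output, which is the kind of deterministic behavior that makes a workflow easier to debug.

```python
import csv
import io
from statistics import mean

# Hypothetical inputs: one structured (CSV) source and one unstructured (log) source.
STRUCTURED = "user_id,amount\n1,10.0\n2,\n1,30.0\n3,5.0\n"
UNSTRUCTURED = ["user=1 action=click", "garbage line", "user=3 action=view"]


def extract(structured_text, log_lines):
    """Parts 1 and 2: read both sources into plain records (ETL)."""
    rows = list(csv.DictReader(io.StringIO(structured_text)))
    events = []
    for line in log_lines:
        # Keep only key=value tokens; lines without a user field are dropped.
        fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
        if "user" in fields:
            events.append(fields)
    return rows, events


def prepare(rows):
    """Part 3: business logic that drops rows with a missing amount."""
    return [
        {"user_id": r["user_id"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"]
    ]


def enrich(clean_rows, events):
    """Part 4: a stand-in for analytics, a per-user mean plus an event count."""
    by_user = {}
    for r in clean_rows:
        by_user.setdefault(r["user_id"], []).append(r["amount"])
    event_counts = {}
    for e in events:
        event_counts[e["user"]] = event_counts.get(e["user"], 0) + 1
    return {
        user: {"mean_amount": mean(amounts), "events": event_counts.get(user, 0)}
        for user, amounts in by_user.items()
    }


def publish(data_product):
    """Part 5: hand the result to a consumer, here as sorted CSV lines."""
    return [
        f"{user},{stats['mean_amount']},{stats['events']}"
        for user, stats in sorted(data_product.items())
    ]


def run_workflow():
    rows, events = extract(STRUCTURED, UNSTRUCTURED)
    return publish(enrich(prepare(rows), events))


if __name__ == "__main__":
    for line in run_workflow():
        print(line)
```

Because every stage takes its input as an argument and returns a new value, any single stage can be run and inspected on its own when something looks wrong.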

We finished up the podcast with him talking about the future of these technologies and also data science.

Subscribe to the podcast and listen to what Paco had to say.  Available also on iTunes

/*
Joe Stein
Big Data Open Source Security LLC
http://www.stealth.ly
*/
