Archive for the ‘Open Source Projects’ Category

Apache BigTop and how packaging infrastructure binds the Hadoop ecosystem together

August 12, 2013 Leave a comment

Episode #12 of the podcast is a talk with Mark Grover and Roman Shaposhnik  Available also on iTunes

Apache Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

BigTop makes it easier to deploy Hadoop Ecosystem projects including:

  • Apache Zookeeper

  • Apache Flume

  • Apache HBase

  • Apache Pig

  • Apache Hive

  • Apache Sqoop

  • Apache Oozie

  • Apache Whirr

  • Apache Mahout

  • Apache Solr (SolrCloud)

  • Apache Crunch (incubating)

  • Apache HCatalog

  • Apache Giraph

  • LinkedIn DataFu

  • Cloudera Hue

The list of supported Linux platforms has expanded to include:

  • CentOS/RHEL 5 and 6

  • Fedora 17 and 18

  • SuSE Linux Enterprise 11

  • OpenSUSE 12.2

  • Ubuntu LTS Lucid (10.04) and Precise (12.04)

  • Ubuntu Quantal (12.10)

Subscribe to the podcast and listen to what Mark and Roman had to say.  Available also on iTunes

Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
Twitter: @allthingshadoop

Hadoop, The Cloudera Development Kit, Parquet, Apache BigTop and more with Tom White

August 2, 2013 Leave a comment

Episode #10 of the podcast is a talk with Tom White.  Available also on iTunes

We talked a lot about The Cloudera Development Kit, or CDK for short, which is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

The goals of the CDK are:

  • Codify expert patterns and practices for building data-oriented systems and applications.
  • Let developers focus on business logic, not plumbing or infrastructure.
  • Provide smart defaults for platform choices.
  • Support piecemeal adoption via loosely-coupled modules.

Eric Sammer recorded a webinar in which he talks about the goals of the CDK.

This project is organized into modules. Modules may be independent or have dependencies on other modules within the CDK. When possible, dependencies on external projects are minimized.

We also talked about Parquet which was created  to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.  Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.

Tom talked about Apache BigTop too Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.  The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Subscribe to the podcast and listen to what Tom had to say.  Available also on iTunes

Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
Twitter: @allthingshadoop

Hadoop, Mesos, Cascading, Scalding, Cascalog and Data Science with Paco Nathan

July 30, 2013 Leave a comment

Episode #9 of the podcast is a talk with Paco Nathon.  Available also on iTunes

We talked about how he got started with Hadoop with Natural Language Processing back in 2007 with text analytics.

And then starting talking about Mesos

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

We talked a little about the difference between YARN and Mesos.  Paco talked about how Mesos is lower in the stack and part of the operating system where YARN is higher up in the stack and built to support the Hadoop ecosystem in the JVM.  He talked about the future of Mesos and touched on its contrast to Google Borg … for some more information on Google Borg and Mesos here is a great article

Then we got into Cascading which was started by Chris Wensel – and talked about the enterprise use cases for Cascading.  He talked about how Cascading has always been geared to satisfy enterprise use cases and not slice and dice but build an application on top of it and be able to debug it to see where it is running because it is deterministic. He talked about how this contrasts to Hive and Pig. He brought up Steve Yegeg’s post “Notes from the Mystery Machine Bus” and talked a bit how Cascading applied.

We got into design patterns for the enterprise with big batch workflow breaking it up into five parts:

1) Different data sources (structured and unstructured data)
2) ETL
3) Custom data preparation and business logic to clean up the data
4) Analytics or predictive modeling to enrich the data
5) integration with end use cases that consume the data products

Cascading addresses all of these points and Paco talked in more detail about them.

We finished up the podcast with him talking about the future of these technologies and also data science.

Subscribe to the podcast and listen to what Paco had to say.  Available also on iTunes

Joe Stein
Big Data Open Source Security LLC

Distributed System Development Considerations

June 16, 2013 Leave a comment

There are a number of factors to take into account while developing distributed software systems. If you don’t even know what I am talking about in the first sentence then let me give you some insight, examples and for instances of what distributed systems are.


A distributed system is when multiple physical hardware devices interact with separate and discrete users and collaborate together through these hardware devices to accomplishes different and also similar goals for these discrete and seperate users.  Sometimes these devices work in a pier-to-pier mode using servers as hubs for knowledge of connectivity to each other while others collaborate and coordinate through a single or set of centralized servers e.g.

An extranet like system where supply side users are reviewing documents through a web browser and click through a workflow in their browsers that ultimatly fax documents to another set of users that provide services that are sufficiently labeled… i.e. with a bar code.  This loop makes for a closed system to design and develop distributed software for.  The recipient of the faxes fill out and do whatever they need as part of their paper based world (including receiving many faxes over time and sending them all back in at once, which is another set of cases for another blog post on e-signing).  The receiving fax servers on receipt of the inbound original documents “hands” the faxed images received over for collation and processing for yet another set of users (or even the same ones that originated the outbound fax) to review and work with them under an intake workflow again in their browser.

Here you have many systems working together in the data center to accomplish different goals. You have some web servers for the user interface. Some data persistance layer for the workflow systems, analytics and reporting.  You have fax servers for management of the incoming/outgoing fax transmissions (I like Dialogic Brooktrout Fax Boards myself ).  You have the inbound and outbound labeling, stamping, collation, scanning and object character recognition components (usually a few different servers together depending on design likely not each a separate instance but likely a half dozen or so depending on scale).

The fabric underneath all of those components are what is consistent across all of them together.   This starts at the OSI Layer because first and for most distributed systems are connected through physical layers of constraints so lets start there.

The Physical Layer Constraints

The OSI Layer has either 7 or 9 layers depending on whom you speak with.  In some teams layers 8 and 9 are politics and religion (respectively) because social structures, methodologies and human behavior around all interactions to design, develop, deploy and maintain these systems have to be taken into account.  If you don’t appreciate this then your application layer user experience is likely going to be wrong also.  Layers 1-7 (the more commonly known and often better understood layers) classify how computers move data through an interface for a user or computer and an interface to another computer for another computer or another user.  You can never do anything outside of the underlying layers without refactoring or creating new software in those layers.  So, its good to know how they work. You can do that some in your own software program writing works (please, please, pretty please) to better balance out processing constraints.

Figure 30-1 of maps many of the protocols of the Internet protocol suite and their corresponding OSI layers.

Where software continues within the hardware device

The operating system and its implementation for the local machine hardware is next and foremost.  I am hard pressed to pick many or any other layers or systems that are truelly part of the underly fabric depending on your hosting provider which ultimately is a vendor so thats a different story for another time too… unless they run software your wrote also via open source project work you have done but, I digress… The Linux Programming Interface: A Linux and UNIX System Programming Handbook is the resource I like for Linux.

Now What?

Over the last nine years a lot of classic computer concepts have been more readily achievable because of the inexpensive (sometimes cheep) commodity hardware that has become our foreshadowed reality.  This has made parallel and distributed computing platforms, distributed document and key/value stores, distributed pub/sub broker, publisher and consumer systems and even actor patterns with immutable structures start to become part of this fabric.  The rub here is the competition and infancy of these market segments arguably yet to even have “crossed the chasm“.

Now, having said all of that … There is a stand out. “ZooKeeper provides key points of correctness and coordination that must have higher transactional guarantees than all of these types of system that want to build intrinsically into their own key logic”.  Camille Fournier made me see this in one of her blog posts because of how Zookeeper is tied into so many of these existing types of systems.  Since these systems have yet to mature in the marketplace (while having sufficiently matured with early adopters, by far) there are some consistencies within and across them we have to start to take a look at, support and commit to (literally).

What is also nice now about Zookeeper is that there is an open source library that has been developed and (all things being equal) continuously maintained to make developing the repeatable patterns of Zookeeper development easier  If you have never worked with Zookeeper I recommend starting here


Joe Stein


Big Data Open Source Security

May 25, 2013 1 comment

In security there has never (IMHO) been enough open source solutions and Bruce Schneier has written about this several times in the past, and there’s no need to rewrite the arguments again.

Now with “NoSQL” and “Big Data” Open Source trends in the market place Security finally has an intersection… a union if I may where new solutions to solve problems that have plagued our society can finally begin to arrise (and have already in many cases). Fraud, Malware, Phishing, Spam, etc all can be tackled now with new Security solutions because of Big Data and Open Source.

At the front lines of this is Apache Accumulo which is a Big Data, Open Source and Secure NoSQL Database that runs on top of Apache Hadoop. It was originally developed by the United States National Security Agency and submitted to the Apache Foundation as Open Source in 2011 with 3 years of development and production operation already having occurred.

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.


When mutations are applied, users can specify a security label for each value. This is done as the Mutation is created by passing a ColumnVisibility object to the put() method:

Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();

Value value = new Value("myValue");

Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);


Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.

For example, suppose within our organization we want to label our data values with security labels defined in terms of user roles. We might have tokens such as:

These can be specified alone or combined using logical operators:

// Users must have admin privileges:

// Users must have admin and audit privileges

// Users with either admin or audit privileges

// Users must have audit and one or both of admin or system

When both | and & operators are used, parentheses must be used to specify precedence of the operators.


When clients attempt to read data from Accumulo, any security labels present are examined against the set of authorizations passed by the client code when the Scanner or BatchScanner are created. If the authorizations are determined to be insufficient to satisfy the security label, the value is suppressed from the set of results sent back to the client.

Authorizations are specified as a comma-separated list of tokens the user possesses:

// user possess both admin and system level access
Authorization auths = new Authorization("admin","system");

Scanner s = connector.createScanner("table", auths);


Each accumulo user has a set of associated security labels. To manipulate these in the shell use the setuaths and getauths commands. These may also be modified using the java security operations API.

When a user creates a scanner a set of Authorizations is passed. If the authorizations passed to the scanner are not a subset of the users authorizations, then an exception will be thrown.

To prevent users from writing data they can not read, add the visibility constraint to a table. Use the -evc option in the createtable shell command to enable this constraint. For existing tables use the following shell command to enable the visibility constraint. Ensure the constraint number does not conflict with any existing constraints.

config -t table -s

Any user with the alter table permission can add or remove this constraint. This constraint is not applied to bulk imported data, if this a concern then disable the bulk import permission.


For applications serving many users, it is not expected that an accumulo user will be created for each application user. In this case an accumulo user with all authorizations needed by any of the applications users must be created. To service queries, the application should create a scanner with the application users authorizations. These authorizations could be obtained from a trusted 3rd party.

Often production systems will integrate with Public-Key Infrastructure (PKI) and designate client code within the query layer to negotiate with PKI servers in order to authenticate users and retrieve their authorization tokens (credentials). This requires users to specify only the information necessary to authenticate themselves to the system. Once user identity is established, their credentials can be accessed by the client code and passed to Accumulo outside of the reach of the user.


Since the primary method of interaction with Accumulo is through the Java API, production environments often call for the implementation of a Query layer. This can be done using web services in containers such as Apache Tomcat, but is not a requirement. The Query Services Layer provides a mechanism for providing a platform on which user facing applications can be built. This allows the application designers to isolate potentially complex query logic, and enables a convenient point at which to perform essential security functions.

Several production environments choose to implement authentication at this layer, where users identifiers are used to retrieve their access credentials which are then cached within the query layer and presented to Accumulo through the Authorizations mechanism.

Typically, the query services layer sits between Accumulo and user workstations.

Apache Accumulo version 1.5 just came out for download with docs

New software as a service solutions will start to spring up into the market as will new out of the box open source solutions. Whether we are trying to prevent health care fraud, protect individuals from identify theft or corporations from intrusion all without comprimsing the (C)onfidentiality, (I)ntegrity and the (A)vailability of the data and distributes systems.

Joe Stein

Hortonworks HDP1, Apache Hadoop 2.0, NextGen MapReduce (YARN), HDFS Federation and the future of Hadoop with Arun C. Murthy

July 23, 2012 2 comments

Episode #8 of the Podcast is a talk with Arun C. Murthy.

We talked about Hortonworks HDP1, the first release from Hortonworks, Apache Hadoop 2.0, NextGen MapReduce (YARN) and HDFS Federations

subscribe to the podcast and listen to all of what Arun had to share.

Some background to what we discussed:

Hortonworks Data Platform (HDP)

from their website:

Hortonworks Data Platform (HDP) is a 100% open source data management platform based on Apache Hadoop. It allows you to load, store, process and manage data in virtually any format and at any scale. As the foundation for the next generation enterprise data architecture, HDP includes all of the necessary components to begin uncovering business insights from the quickly growing streams of data flowing into and throughout your business.

Hortonworks Data Platform is ideal for organizations that want to combine the power and cost-effectiveness of Apache Hadoop with the advanced services required for enterprise deployments. It is also ideal for solution providers that wish to integrate or extend their solutions with an open and extensible Apache Hadoop-based platform.

Key Features
  • Integrated and Tested Package – HDP includes stable versions of all the critical Apache Hadoop components in an integrated and tested package.
  • Easy Installation – HDP includes an installation and provisioning tool with a modern, intuitive user interface.
  • Management and Monitoring Services – HDP includes intuitive dashboards for monitoring your clusters and creating alerts.
  • Data Integration Services – HDP includes Talend Open Studio for Big Data, the leading open source integration tool for easily connecting Hadoop to hundreds of data systems without having to write code.
  • Metadata Services – HDP includes Apache HCatalog, which simplifies data sharing between Hadoop applications and between Hadoop and other data systems.
  • High Availability – HDP has been extended to seamlessly integrate with proven high availability solutions.

Apache Hadoop 2.0

from their website:

Apache Hadoop 2.x consists of significant improvements over the previous stable release (hadoop-1.x).

Here is a short overview of the improvments to both HDFS and MapReduce.

  • HDFS FederationIn order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handles commands from the Namenodes.More details are available in the HDFS Federation document.
  • MapReduce NextGen aka YARN aka MRv2The new architecture introduced in hadoop-0.23, divides the two major functions of the JobTracker: resource management and job life-cycle management into separate components.The new ResourceManager manages the global assignment of compute resources to applications and the per-application ApplicationMaster manages the application‚Äôs scheduling and coordination.An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs.The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric.The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.More details are available in the YARN document.
Getting Started

The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the Single Node Setup which shows you how to set up a single-node Hadoop installation. Then move on to the Cluster Setup to learn how to set up a multi-node Hadoop installation.

Apache Hadoop NextGen MapReduce (YARN)

from their website:

MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

MapReduce NextGen Architecture

The ResourceManager has two main components: Scheduler and ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, cpu, disk, network etc. In the first version, only memory is supported.

The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.

The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources

The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.

The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

MRV2 maintains API compatibility with previous stable release (hadoop-0.20.205). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.

HDFS Federation

from their website:


HDFS LayersHDFS has two main layers:

  • Namespace
    • Consists of directories, files and blocks
    • It supports all the namespace related file system operations such as create, delete, modify and list files and directories.
  • Block Storage Service has two parts
    • Block Management (which is done in Namenode)
      • Provides datanode cluster membership by handling registrations, and periodic heart beats.
      • Processes block reports and maintains location of blocks.
      • Supports block related operations such as create, delete, modify and get block location.
      • Manages replica placement and replication of a block for under replicated blocks and deletes blocks that are over replicated.
    • Storage – is provided by datanodes by storing blocks on the local file system and allows read/write access.

    The prior HDFS architecture allows only a single namespace for the entire cluster. A single Namenode manages this namespace. HDFS Federation addresses limitation of the prior architecture by adding support multiple Namenodes/namespaces to HDFS file system.

Multiple Namenodes/Namespaces

In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handles commands from the Namenodes.

HDFS Federation ArchitectureBlock Pool

A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster. It is managed independently of other block pools. This allows a namespace to generate Block IDs for new blocks without the need for coordination with the other namespaces. The failure of a Namenode does not prevent the datanode from serving other Namenodes in the cluster.

A Namespace and its block pool together are called Namespace Volume. It is a self-contained unit of management. When a Namenode/namespace is deleted, the corresponding block pool at the datanodes is deleted. Each namespace volume is upgraded as a unit, during cluster upgrade.


A new identifier ClusterID is added to identify all the nodes in the cluster. When a Namenode is formatted, this identifier is provided or auto generated. This ID should be used for formatting the other Namenodes into the cluster.

Key Benefits

  • Namespace Scalability – HDFS cluster storage scales horizontally but the namespace does not. Large deployments or deployments using lot of small files benefit from scaling the namespace by adding more Namenodes to the cluster
  • Performance – File system operation throughput is limited by a single Namenode in the prior architecture. Adding more Namenodes to the cluster scales the file system read/write operations throughput.
  • Isolation – A single Namenode offers no isolation in multi user environment. An experimental application can overload the Namenode and slow down production critical applications. With multiple Namenodes, different categories of applications and users can be isolated to different namespaces.

subscribe to the podcast and listen to all of what Arun had to share.


Joe Stein

Unified analytics and large scale machine learning with Milind Bhandarkar

June 1, 2012 1 comment

Episode #7 of the Podcast is a talk with Milind Bhandarkar.

We talked about unified analytics, machine learning, data science, some great history of Hadoop, the future of Hadoop and a lot more!

subscribe to the podcast and listen to all of what Milind had to share.


Joe Stein

Cloudera, Yahoo and the Apache Hadoop Community Security Branch Release Update

May 5, 2011 1 comment

In the wake of Yahoo! having announced that they would discontinue their Hadoop distribution and focus their efforts into Apache Hadoop the landscape has become tumultuous.

Yahoo! engineers have spent their time and effort contributing back to the Apache Hadoop security branch (branch of 0.20) and have proposed release candidates.

Currently being voted and discussed is “Release candidate”. If you are following the VOTE and the DISCUSSION then maybe you are like me it just cannot be done without a bowl of popcorn before opening the emails. It is getting heated in a good and constructive kind of way. there are already more emails in 5 days of May than there were in all of April. woot!

My take? Has it become Cloudera vs Yahoo! and Apache Hadoop releases will become fragmented because of it? Well, it is kind of like that already. 0.21 is the latest and can anyone that is not a committer quickly know or find out the difference between that and the other release branches? It is esoteric 😦 0.22 is right around the corner too which is a release from trunk.

Lets take HBase as an example (a Hadoop project). Do you know what version of HDFS releases can support HBase in production without losing data? If you do then maybe you don’t realize that many people still don’t even know about the branch. And, now that CDH3 is out you can use that (thanks Cloudera!) otherwise it is highly recommended to not be in production with HBase unless you use the append branch of 0.20 which makes you miss out on other changes in trunk releases…

__ eyes crossing inwards and sideways with what branch does what and when the trunk release has everything __

Hadoop is becoming an a la cart which features and fixes can I live without for all of what I really need to deploy … or requiring companies to hire a committer … or a bunch of folks that do nothing but Hadoop day in and day out (sounds like Oracle, ahhhhhh)… or going with the Cloudera Distribution (which is what I do and don’t look back). The barrier to entry feels like it has increased over the last year. However, stepping back from that the system overall has had a lot of improvements! A lot of great work by a lot of dedicated folks putting in their time and effort towards making Hadoop (in whatever form the elephant stampedes through its data) a reality.

Big shops that have teams of “Hadoop Engineers” (Yahoo, Facebook, eBay, LinkedIn, etc) with contributors and/or committers on that team should not have lots of impact because ultimately they are able to role their own releases for whatever they need/want themselves in production and just support it. Not all are so endowed.

Now, all of that having been said I write this because the discussion is REALLY good and has a lot of folks (including those from Yahoo! and Cloudera) bringing up pain points and proposing some great solutions that hopefully will contribute to the continued growth and success of the Apache Hadoop Community…. still if you want to run it in your company (and don’t have a committer on staff) then go download CDH3 it will get you going with the latest and greatest of all the releases, branches, etc, etc, etc. Great documentation too!


Joe Stein

NoSQL HBase and Hadoop with Todd Lipcon from Cloudera

September 6, 2010 3 comments

Episode #6 of the Podcast is a talk with Todd Lipcon from Cloudera discussing HBase.

We talked about NoSQL and how it should stand for “Not Only SQL” and the tight integration between Hadoop and HBase and how systems like Cassandra (which is eventually consistent and not strongly consistent like HBase) is complementary as these systems have applicability within big data eco system depending on your use cases.

With the strong consistency of HBase you get features like incrementing counters and the tight integration with Hadoop means faster loads with HDFS thanks to a new feature in the 0.89 development preview release in the doc folders called “bulk loads”.

We covered a lot more unique features, talked about more of what is coming in upcoming releases as well as some tips with HBase so subscribe to the podcast and listen to all of what Todd had to say.


Joe Stein

Pre-Release from Pentaho – HIVE JDBC Adapter

August 15, 2010 Leave a comment

Pentaho’s Jordan Ganoff, Software Engineer, has open sourced some HIVE JDBC Adapters in what they are doing for their BI server

Not sure what state they are in but will try to check it on this week.

To use from maven:

You must also add the repository information to either the pom.xml or
your local settings:
<name>Pentaho External Repository</name>



Joe Stein


Categories: Hive, Open Source Projects

Hadoop and Pig with Alan Gates from Yahoo

Episode 4 of our Podcast is with Alan Gates, Senior Software Engineer @ Yahoo! and Pig committer. Click here to listen.

Hadoop is a really important part of Yahoo’s infrastructure because processing and analyzing big data is increasingly important for their business. Hadoop allows Yahoo to connect their consumer products with their advertisers and users for a better user experience. They have been involved with Hadoop for many years now and have their own distribution. Yahoo also sponsors/hosts a user group meeting which has grown to hundreds of attendees every month.

We talked about what Pig is now, the future of Pig and other projects like Oozie which Yahoo uses (and is open source) for workflow of MapReduce & Pig script automation. We also talked about Zebra, Owl, and Elephant Bird


Joe Stein

Ruby Streaming for Hadoop with Wukong a talk with Flip Kromer from Infochimps

Another great discussion on our PodcastClick here to listen.  For this episode our guest was Flip Kromer from Infochimps’s mission is to increase the world’s access to structured data.  They have been working since the start of 2008 to build the world’s most interesting data commons, and since the start of 2009 to build the world’s first data marketplace. Our founding team consists of two physicists (Flip Kromer and Dhruv Bansal) and one entrepreneur (Joseph Kelly).

We talked about Ruby streaming with Hadoop and why to use the open source project Wukong to simplify implementation of Hadoop using Ruby.  There are some great examples that are just awesome like the web log analysis that creates the paths (chain of pages) that users go through during their visited session.

It was interesting to learn some of the new implementations and projects that he has going on like using Cassandra to help with storing unique values for social network analysis.  This new project is called Cluster Chef  ClusterChef will help you create a scalable, efficient compute cluster in the cloud. It has recipes for Hadoop, Cassandra, NFS and more — use as many or as few as you like.

  • A small 1-5 node cluster for development or just to play around with Hadoop or Cassandra
  • A spot-priced, ebs-backed cluster for unattended computing at rock-bottom prices
  • A large 30+ machine cluster with multiple EBS volumes per node running Hadoop and Cassandra, with optional NFS for
  • With Chef, you declare a final state for each node, not a procedure to follow. Adminstration is more efficient, robust and maintainable.
  • You get a nice central dashboard to manage clients
  • You can easily roll out configuration changes across all your machines
  • Chef is actively developed and has well-written recipes for webservers, databases, development tools, and a ton of different software packages.
  • Poolparty makes creating amazon cloud machines concise and easy: you can specify spot instances, ebs-backed volumes, disable-api-termination, and more.
  • Hadoop
  • NFS
  • Persistent HDFS on EBS volumes
  • Zookeeper (in progress)
  • Cassandra (in progress)

Another couple of good links we got from Flip were Peter Norvig’s “Unreasonable Effectiveness of Data” thing I mentioned:


Joe Stein

Hadoop, BigData and Cassandra with Jonathan Ellis

Today I spoke with Jonathan Ellis who is the Project Chair of the Apache Cassandra project and co-founder of Riptano, the source for professional Cassandra support  It was a great discussion about Hadoop, BigData, Cassandra and Open Source.

We talked about the recent Cassandra 0.6 NoSQL integration and support for Hadoop Map/Reduce against the data stored in Cassandra and some of what is coming up in the 0.7 release.

We touched on how Pig is currently supported and why the motivation for Hive integration may not have any support with Cassandra in the future.

We also got a bit into a discussion of HBase vs Cassandra and some of the benefits & drawbacks as they live in your ecosystem (e.g. HBase is to OLAP as Cassandra is to OLTP).

This was the second Podcast and you can click here to listen.


Joe Stein

Hadoop NYC Meetup With Yale University and Datameer

April 22, 2010 Leave a comment

The NYC Hadoop meetup on April 21st was great.  Many thanks as always to the Cloudera folks and The Winter Wyman Companies for the pizza.  Also thanks to Hiveat55 for use of their office and the Datameer folks for a good time afterwards.

The first part of the meetup was presentation by Azza Abouzeid and Kamil Bajda-Pawlikowski (Yale University) on Hadoop DB (  Their vision is to take the best of both worlds from the Map Reduce bliss for lots of data that we get from Hadoop as well as the DBMS complex data analysis capabilities.

The basic idea behind HadoopDB is to give Hadoop access to multiple single-node DBMS servers (eg. PostgreSQL or MySQL) deployed across the cluster. HadoopDB pushes as much as possible data processing into the database engine by issuing SQL queries (usually most of the Map/Combine phase logic is expressible in SQL). This in turn results in creating a system that resembles a shared-nothing parallel database. Applying techniques taken from the database world leads to a performance boost, especially in more complex data analysis. At the same time, the fact that HadoopDB relies on MapReduce framework ensures scores on scalability and fault/heterogeneity tolerance similar to Hadoop.

They have a spent a lot of time thinking through, finding and resolving the tradeoffs that occur and continue to make progress on this end.  They have had 2,200 downloads as of this posting and are actively looking for developers to contribute to their project.   I think it is great to see a University involved at this level for Open Source in general and more specifically doing work related to Hadoop.  The audience was very engaging and it made for a very lively discussion.  Their paper tells all the gory details

The rest of the meetup was off the hook.  Stefan Groschupf got off to a quick start throwing down some pretty serious street cred as a long-standing commit-er for Nutch, Hadoop, Katta, Bixo and more.    He was very engaging with a good sort of anecdotes for the question that drives the Hadoop community “What do you want to-do with your data?”.  It is always processing it or querying it and there is not one golden bullet solution.  We were then demoed Datameer’s product (which is one of the best user interface concept solutions I have seen).

In short the Datameer Analytic Solution (DAS) is a spreadsheet user interface allowing users to take a sample of data and (with 15 existing data connections and over 120 functions) like any good spreadsheet pull the data into an aggregated format.  Their product then turns that format pushing it down into Hadoop (like through Hive) which then goes into a map/reduce job in Hadoop.

So end to end you can have worthy analytic folks (spreadsheet types) do their job against limitless data.  wicked.

From their website

With DAS, business users no longer have to rely on intuition or a “best guess” based on what’s happened in the past to make business decisions. DAS makes data assets available as they are needed regardless of format or location so that users have the facts they need to reach the best possible conclusions.

DAS includes a familiar interactive spreadsheet that is easy to use, but also powerful so that business users don’t need to turn to developers for analytics. The spreadsheet is specifically designed for visualization of big data and includes more than 120 built-in functions for exploring and discovering complex relationships. In addition, because DAS is extensible, business analysts can use functions from third-party tools or they can write their own commands.

Drag & drop reporting allow users to quickly create their own personalized dashboard    Users simply select the information they want to view and how to display it on the dashboard – tables, charts, or graphs.

The portfolio of analytical and reporting tools in organizations can be broad. Business users can easily share data in DAS with these tools to either extend their analysis or to give other users access.

After the quick demo Stefan walked us through a solution for using Hadoop to pull the “signal from the noise” in social data and used twitter as an example. He used a really interesting graph exploration tool (going to give it a try myself)  Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.  He then talked a bit about X-RIME which is Hadoop based large-scale social network analysis (Open Source).


Joe Stein

Categories: Meetups, Open Source Projects Tags:

Nifty Tool To Export Files From HDFS Into MySQL

April 19, 2010 5 comments

This is an interesting open source project I have recently heard about

What is very interesting to me about this project is the export utility which takes data from HDFS and loads it into MySQL.

It also has a nice way for querying and importing data from a JDBC database directly into HDFS.  It looks much more robust than the out of the box DBInputFormat that Hadoop provides.  You can import the data as delimited records, with choice of delimiter. You can also import the data and save them as Avro records. It supports queries – you can say join two tables. It splits on user specified column ranges, instead of using LIMIT and OFFSET. It does no code generation or ORM mapping.

There are other ETL tools out there (e.g. Sqoop  In Cloudera’s Distrobution for Hadoop Version 3 (CDH3) Sqoop supports HDFS back into MySQL also now.

I am definitely going to have to give this utility a try. I here from their (HIHO) project folks that the next to-do is support for more databases for export.


Joe Stein