Hadoop distribution bake-off and my experience with Cloudera and MapR
A few months back we set out to build a new Hadoop cluster at Medialets.
We have been live with Hadoop in production since April 2010 and we are still running CDH2. Our current hosting provider's setup is far from ideal for us: our 36 nodes are spread out across an entire data center and 5 networks, each with a 1 Gb link. While there are issues with this type of setup, we have been able to grow our cluster organically (it started at 4 nodes), and it powers 100% of our batch analytics for what is now hundreds of millions of mobile devices.
One of our MapReduce jobs processes 30+ billion objects (about 3 TB of uncompressed data) and takes about 90 minutes to run. This job runs continuously all day long; each run ingests the data that was received while the previous run was in progress. One of the primary goals of our new cluster was to reduce the time these types of jobs take without having to make any code changes or increase our investment in hardware. We figured that, besides the infrastructure changes we needed and wanted to make, running an old version of Hadoop meant we were not taking advantage of all the awesome work that folks have put in over the last 2 years to do things like improve performance.
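As a rough sanity check on those numbers, here is a minimal Python sketch of the throughput they imply. It uses only the rounded figures quoted above; the decimal units (1 TB = 1e12 bytes), the even spread across all 36 production nodes, and the `throughput` helper itself are assumptions made for illustration, not measurements from the cluster.

```python
# Back-of-the-envelope throughput for the job described above.
# Inputs are the rounded figures from the post; decimal units
# (1 TB = 1e12 bytes) and an even spread over 36 nodes are assumptions.

OBJECTS = 30e9             # "30+ billion objects"
DATA_BYTES = 3e12          # "about 3 TB of uncompressed data"
RUNTIME_SECONDS = 90 * 60  # "about 90 minutes"
NODES = 36                 # size of the production cluster


def throughput(objects, data_bytes, seconds, nodes):
    """Return aggregate and per-node rates for one run of the job."""
    mb_per_sec = data_bytes / seconds / 1e6
    return {
        "objects_per_sec": objects / seconds,
        "mb_per_sec": mb_per_sec,
        "mb_per_sec_per_node": mb_per_sec / nodes,
        "avg_object_bytes": data_bytes / objects,
    }


rates = throughput(OBJECTS, DATA_BYTES, RUNTIME_SECONDS, NODES)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

On these rounded inputs the job works out to roughly 5.6 million objects per second, or about 555 MB/s of uncompressed data across the cluster, with an average object size of about 100 bytes.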
So we embarked on what seems to have been coined “The Hadoop Distribution Bake-off”. We wanted not only to see how newer versions of the Cloudera distribution would run our jobs but also to evaluate other distributions that have emerged since we first started with Hadoop. When we did this, Hortonworks’ distribution had not been released yet; otherwise we would have added their distro to the possible choices.
First we found a new vendor to set up a test cluster for us: http://www.logicworks.com. It was a four node cluster, each node with a 2 Gb (dual bonded 1 Gb) NIC, 12GB of RAM, 4 x 1TB drives (3 of the drives for data and one for the OS) and 2x Westmere 5645 2.4GHz Hex-Core CPUs. While this was not going to be the exact configuration we would end up with, it was what they had in inventory, and for the purposes of this test it was all about keeping the same hardware running the same job with the same data and only changing the distro and configuration. Performance was the primary goal of the bake-off and testing, but as part of our due diligence it was not the only point we were interested in. We also reviewed other aspects of the distributions and companies, which ultimately led to our final decision to go with CDH4 for our new cluster.
To start, we wanted to create a baseline to see how our data and job did with the existing distribution (CDH2) we run in production, using our existing production configuration. Next we wanted to give MapR a shot. We engaged with their team and they lent their time and assistance to configure and optimize the cluster for the job’s test run. Once that was done we wanted to give CDH3 and CDH4 (which was still in beta at the time) a try, and the Cloudera folks also lent their time and helped configure and optimize the cluster.
CDH2 = 12 hours 12 min (our production configuration)
MapR = 4 hours 31 min (configuration done by MapR team)
CDH3 = 6 hours 8 min (our production configuration)
CDH4 = 4 hours 20 min (configuration done by Cloudera team)
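To put the four runs on a common scale, the timings above can be turned into speedups over the CDH2 baseline. This is only a sketch: the `RUNS` table is copied from the results listed here, each entry is the single run reported, and the `to_minutes` and `speedups` helpers are hypothetical names introduced for illustration.

```python
# Wall-clock times from the bake-off, as (hours, minutes).
RUNS = {
    "CDH2": (12, 12),  # our production configuration
    "MapR": (4, 31),   # configuration done by MapR team
    "CDH3": (6, 8),    # our production configuration
    "CDH4": (4, 20),   # configuration done by Cloudera team
}


def to_minutes(hours, minutes):
    """Convert an (hours, minutes) pair to total minutes."""
    return hours * 60 + minutes


def speedups(runs, baseline="CDH2"):
    """Return each run's speedup factor relative to the baseline run."""
    base = to_minutes(*runs[baseline])
    return {name: base / to_minutes(*hm) for name, hm in runs.items()}


for name, factor in speedups(RUNS).items():
    print(f"{name}: {to_minutes(*RUNS[name])} min, {factor:.2f}x vs CDH2")

# Gap between the two fastest runs, in minutes.
gap = to_minutes(*RUNS["MapR"]) - to_minutes(*RUNS["CDH4"])
print(f"MapR minus CDH4: {gap} min")
```

On these numbers MapR and CDH4 come out at roughly 2.7x and 2.8x over the CDH2 baseline, and CDH3 at about 2x.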
With MapR and CDH4 finishing within 11 minutes of each other on a run of more than four hours, this told us that the decision between running CDH4 or MapR was not going to be made based on the performance of the distribution with our data and MapReduce jobs.
So, we had to look at the other things that were important to us.
MapR has a couple of really nice features that are unique to their platform: NFS and Snapshots. Both are cool, so let's go through them quickly. It is MapR’s underlying proprietary file system that makes these features possible in the Hadoop ecosystem.
The NFS feature basically allows you to copy to an NFS share that is distributed across the entire cluster (with a VIP, so it is highly available). This means that you can use the cluster for saving data from your applications and then, without any additional copies, run map-reduce over it. Data is compressed under the hood, though this did not mean much to us since we already compress all of our data in sequence files using block compression.
Snapshots (and mirroring of those snapshots to other clusters) are nifty. Being able to take a point-in-time cut of things makes the cluster feel and operate like our SAN. Still, the same end result is achievable with a distcp, which sure takes longer but is technically feasible, so there were not a lot of other benefits for us or our business. Nifty nonetheless.
The main issue we had with all of this was that all of the features that were attractive required us to license their product. Their product also is not open source, so we would not be able to build the code or make changes, and would always have to rely on them for support and maintenance. We met a lot of great folks from MapR but only 2 of them were Apache committers (they may have more on staff, I only met two though) and this is important to us from a support & maintenance perspective… for them it probably is not a huge deal since their platform is not open source and proprietary ( I think I just repeated myself here but did so on purpose ).
Cloudera… tried, true and trusted. I have been running CDH2 in production for 2 years without ever having to upgrade, and I know lots of folks who can say the same thing. Everything is Open Source with a very healthy and active community. A handful of times this has been very helpful in my development cycles: I could read the source to see what the container I was running in was doing, which helped me resolve problems I was finding in my own code… or I could simply shoot a question over to the mailing list and get a response. As far as the distribution goes, it costs nothing to get it running and to run it in production with all of the features we wanted. If we ever decide to pay for support there are a boat load (a large boat) of Apache committers, not just to the Hadoop project but to lots of projects within the Hadoop ecosystem, all of whom are available to help answer questions, make code changes, etc. The philosophy of their distribution (besides just being open source) is to cherry pick changes from Apache Hadoop as soon as they can (or should, or want to) be introduced, to make their distribution the best it can be.
I can think of a lot of industries and companies where MapR would be a good choice over Cloudera.
We decided what was best for us was to go with CDH4 for our new cluster. And, if we ever decide to purchase support we would get it from Cloudera.