Archive
Resource scheduling and task launching with Apache Mesos and Apache Aurora at Twitter
Episode #23 of the podcast was a talk with Bill Farner
Bill explained how Twitter, using Apache Mesos and Apache Aurora, gets more out of its hardware spend and saves engineering time (both development and operations) by using fine-grained resource scheduling across its infrastructure. Bill talked a bit about how what he saw and experienced at Google with Borg shaped how they wanted to run things at Twitter, and why they built Aurora. Now, after years of running in production at Twitter, Aurora is open source, part of the Apache foundation and available for use. Lots of new use cases that they didn’t see coming have become very powerful for their teams, and Bill went into more detail about that too.
Bill also talked about the instrumentation and features that were added to Aurora to get to a place where all new systems, and almost all legacy systems, at Twitter now run on top of Aurora. Bill went into detail about how that works with Twitter’s cache and how the SLA features of Aurora make this a reality. Aurora is amazing, giving end users (everyone from engineers to analysts) full access to the potential resources of their hardware clusters. Aurora provides features like quotas and preemption so that any user can be given access to the compute resources of the entire hardware infrastructure without worry of someone hogging resources, and with production always kept as the priority.
Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
Apache Aurora is a Mesos framework. A Mesos framework is a scheduler of resources and a launcher of tasks. Aurora provides a Job abstraction consisting of a Task template and instructions for creating near-identical replicas of that Task. Typically a Task is a single Process corresponding to a single command line, such as python2.6 my_script.py. However, sometimes you must colocate separate Processes together within a single Task, which runs within a single container and chroot, often referred to as a “sandbox”; for example, when you run multiple cooperating agents together such as logrotate, installer, and master or slave processes. Thermos provides a Process abstraction underneath Mesos Tasks.
To use and get up to speed on Aurora, you should look at the docs in this directory in the following order:
- How to deploy Aurora, or how to install Aurora on virtual machines on your private machine (the Tutorial uses the virtual machine approach).
- As a user, get started quickly with a Tutorial.
- For an overview of Aurora’s process flow under the hood, see the User Guide.
- To learn how to write a configuration file, look at our Configuration Tutorial. From there, look at the Aurora + Thermos Reference.
- Then read up on the Aurora Command Line Client.
- Find out general information and useful tips about how Aurora does Resource Isolation.
For some more great background on Mesos and Aurora please check out these three videos.
Datacenter Management with Apache Mesos
An intro video to Apache Aurora
Past, Present, Future of Apache Aurora
To hear everything that Bill had to say please subscribe to the podcast.
Apache Solr real-time live index updates at scale with Apache Hadoop
Episode #22 of the podcast was a talk with Patrick Hunt
We talked about the new work that has gone into Apache Solr (upstream) that allows it to work on Apache Hadoop. Solr has support for writing and reading its index and transaction log files to the HDFS distributed filesystem. This does not use Hadoop Map-Reduce to process Solr data, rather it only uses the HDFS filesystem for index and transaction log file storage. https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
We also talked about Solr Cloud and how the sharding features allow Solr to scale with a Hadoop cluster https://cwiki.apache.org/confluence/display/solr/SolrCloud.
Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, it provides distributed indexing and search, supporting the following features:
- Central configuration for the entire cluster
- Automatic load balancing and fail-over for queries
- ZooKeeper integration for cluster coordination and configuration.
SolrCloud is flexible distributed search and indexing, without a master node to allocate nodes, shards and replicas. Instead, Solr uses ZooKeeper to manage these locations, depending on configuration files and schemas. Documents can be sent to any server and ZooKeeper will figure it out.
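To make that concrete, here is a minimal SolrJ sketch of indexing through ZooKeeper rather than through a specific Solr node (the ZooKeeper ensemble and collection name are placeholders):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrCloudIndexSketch {
    public static void main(String[] args) throws Exception {
        // Connect via ZooKeeper; SolrJ discovers the live Solr nodes from the cluster state.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1"); // placeholder collection name

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "hello solrcloud");

        // The document can go to any node; SolrCloud routes it to the correct shard leader.
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```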
Patrick introduced me to Morphlines (part of the Cloudera Development Kit for Hadoop) http://cloudera.github.io/cdk/docs/current/cdk-morphlines/index.html
Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build and change Hadoop ETL stream processing applications that extract, transform and load data into Apache Solr, HBase, HDFS, Enterprise Data Warehouses, or Analytic Online Dashboards. Want to build or facilitate ETL jobs without programming and without substantial MapReduce skills? Get the job done with a minimum amount of fuss and support costs? Here is how to get started.
A morphline is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing and maintaining custom ETL projects.
Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third party systems in a straightforward manner.
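As a rough sketch of what embedding a morphline in Java looks like (the config file name, morphline id and sample field are assumptions, and the packages later moved from com.cloudera.cdk to org.kitesdk, so treat the imports as indicative):

```java
import java.io.File;

import com.cloudera.cdk.morphline.api.Command;
import com.cloudera.cdk.morphline.api.MorphlineContext;
import com.cloudera.cdk.morphline.api.Record;
import com.cloudera.cdk.morphline.base.Compiler;

public class MorphlineEmbeddingSketch {
    public static void main(String[] args) {
        // Build a context and compile a morphline definition from a config file.
        MorphlineContext context = new MorphlineContext.Builder().build();
        Command morphline = new Compiler().compile(
                new File("morphline.conf"), // assumed config file
                "morphline1",               // assumed morphline id inside that file
                context,
                null);                      // no final child command in this sketch

        // A record is just an in-memory bag of named fields; commands transform it in place.
        Record record = new Record();
        record.put("message", "sample log line to be parsed by the command chain");

        // Pipe the record through the chain (e.g. readLine, grok, loadSolr).
        boolean success = morphline.process(record);
        System.out.println("processed: " + success);
    }
}
```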
The morphline commands were developed as part of Cloudera Search. Morphlines power ETL data flows from Flume and MapReduce and HBase into Apache Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case. Since the launch of Cloudera Search morphline development graduated into the Cloudera Development Kit (CDK) in order to make the technology accessible to a wider range of users and products, beyond Search. The CDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem. The CDK is hosted on GitHub and encourages involvement by the community. For example, morphlines could be embedded into Crunch, HBase, Impala, Pig, Hive, or Sqoop. Let us know where you want to take it!
Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline is an efficient way to consume records (e.g. Flume events, HDFS files, RDBMS tables or Avro objects), turn them into a stream of records, and pipe the stream of records through a set of easily configurable transformations on the way to a target application such as Solr, for example as outlined in the following figure:
In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
The Morphline framework ships with a set of frequently used high level transformation and I/O commands that can be combined in application specific ways. The plugin system allows the adding of new transformations and I/O commands and integrates existing functionality and third party systems in a straightforward manner.
This integration enables rapid Hadoop ETL application prototyping, complex stream and event processing in real time, flexible log file analysis, integration of multiple heterogeneous input schemas and file formats, as well as reuse of ETL logic building blocks across Hadoop ETL applications.
The CDK ships an efficient runtime that compiles a morphline on the fly. The runtime executes all commands of a given morphline in the same thread. Piping a record from one command to another implies just a cheap Java method call. In particular, there are no queues, no handoffs among threads, no context switches and no serialization between commands, which minimizes performance overheads.
Morphlines manipulate continuous or arbitrarily large streams of records. A command transforms a record into zero or more records. The data model can be described as follows: A record is a set of named fields where each field has an ordered list of one or more values. A value can be any Java Object. That is, a record is essentially a hash table where each hash table entry contains a String key and a list of Java Objects as values. Note that a field can have multiple values and any two records need not use common field names. This flexible data model corresponds exactly to the characteristics of the Solr/Lucene data model.
Not only structured data, but also binary data can be passed into and processed by a morphline. By convention, a record can contain an optional field named _attachment_body, which can be a Java java.io.InputStream or Java byte[]. Optionally, such binary input data can be characterized in more detail by setting the fields named _attachment_mimetype (such as “application/pdf”) and _attachment_charset (such as “UTF-8”) and _attachment_name (such as “cars.pdf”), which assists in detecting and parsing the data type. This is similar to the way email works.
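To make the data model concrete, here is a small sketch of building such a record by hand; the multi-valued field and the sample bytes are made up, and only the _attachment_* field names follow the convention described above:

```java
import java.io.ByteArrayInputStream;

import com.cloudera.cdk.morphline.api.Record;

public class RecordModelSketch {
    public static void main(String[] args) {
        Record record = new Record();

        // A field holds an ordered list of values; calling put() again appends another value.
        record.put("tags", "finance");
        record.put("tags", "quarterly");

        // Binary payloads travel in the conventional attachment fields.
        byte[] pdfBytes = new byte[] { '%', 'P', 'D', 'F' }; // stand-in for real file contents
        record.put("_attachment_body", new ByteArrayInputStream(pdfBytes));
        record.put("_attachment_mimetype", "application/pdf");
        record.put("_attachment_name", "cars.pdf");

        System.out.println(record);
    }
}
```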
This generic data model is useful to support a wide range of applications. For example, the Apache Flume Morphline Solr Sink embeds the morphline library and executes a morphline to convert flume events into morphline records and load them into Solr. This sink fills the body of the Flume event into the _attachment_body field of the morphline record, as well as copies the headers of the Flume event into record fields of the same name. As another example, the Mappers of the MapReduceIndexerTool fill the Java java.io.InputStream referring to the currently processed HDFS file into the _attachment_body field of the morphline record. The Mappers of the MapReduceIndexerTool also fill metadata about the HDFS file into record fields, such as the file’s name, path, size, last modified time, etc. This way a morphline can act on all data received from Flume and HDFS. As yet another example, the Morphline Lily HBase Indexer fills a HBase Result Java POJO into the _attachment_body field of the morphline record. This way morphline commands such as extractHBaseCells can extract data from HBase updates and correspondingly update a Solr index.
We also talked a good deal about Apache Zookeeper and some of the history back from when Zookeeper was originally at Yahoo! and Patrick’s experience since then. To hear everything that Patrick had to say please subscribe to the podcast.
Higher Availability, Increased Scale and Enhanced Security on Apache HBase
Episode #21 of the Podcast was a talk with Michael Stack, Lars Hofhansl and Andrew Purtell.
Having these guests from the Apache HBase PMC allowed us to talk about HBase 0.96, 0.98, some use cases, HBaseCon and 1.0.
The highlights from 0.96 were around stability and longer-term scale (moving all internal data exchange and persistence to protobufs).
0.98 introduced some exciting new security features and a new HFile format, with both encryption at rest and cell-level security labels.
HBaseCon has all new speakers and new use cases, with new and familiar faces in the audience. A must-attend if you can make it.
1.0 is focusing on SLAs, more in-memory database features and general cleanup.
Listen to the podcast to hear all of what they talked about together.
Beyond MapReduce and Apache Hadoop 2.X with Bikas Saha and Arun Murthy
Episode 20 of the podcast was with Bikas Saha and Arun Murthy.
When I spoke with Arun a year or so ago, YARN was NextGen Hadoop, and there have been a lot of updates, work done and production experience since.
Besides Yahoo!, other multi-thousand-node clusters have been and are running in production with YARN. These clusters have shown 2x capacity throughput, which resulted in reduced hardware cost (and in some cases being able to shut down co-los) while still gaining overall performance improvements over previous Hadoop 1.X clusters.
I got to hear about some of what is in 2.4 and coming in 2.5 of Hadoop:
- Application timeline server: a repository and API for application-specific metrics (Tez, Spark, whatever).
- A web service API to put and get metrics, with some aggregation.
- A pluggable NoSQL store (HBase, Accumulo) to scale it.
- Preemption in the capacity scheduler.
- Support for multiple resource types (CPU, RAM and disk).
- Node labels: nodes can be tagged however you like (for example some Windows and some Linux), and resources can be requested only from nodes with those labels, with ACLs.
- Hypervisor support as a key part of the topology.
- Hoya generalized for YARN (a game changer), now proposed to the Apache Incubator as Slider.
We talked about Tez, which lets you express what you want to do on Hadoop as complex DAGs of queries without the workarounds needed to make it run in MapReduce. MapReduce was not designed to be reworked outside of the parts of the job it gave you (map, split, shuffle, combine, reduce, etc.), whereas Tez is more expressive, exposing a DAG API.
There were also some updates on Hive v13 coming out with subqueries, low-latency queries (through Tez), high-precision decimals and more!
Subscribe to the podcast and listen to all of what Bikas and Arun had to say.
Big Data with Apache Accumulo Preserving Security with Open Source
Episode 19 of the podcast was a talk with Adam Fuchs.
Adam talked about Apache Accumulo, which is a system built for doing random I/O with petabytes of data.
Distributing the computation to the data with cell level security is where Accumulo really shines.
Accumulo provides a richer data model than simple key-value stores, but is not a fully relational database. Data is represented as key-value pairs, where the key and value are comprised of the following elements:
- Key: Row ID, Column (Family, Qualifier, Visibility), Timestamp
- Value: Value
All elements of the Key and the Value are represented as byte arrays except for Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically in ascending order. Timestamps are sorted in descending order so that later versions of the same Key appear first in a sequential scan. Tables consist of a set of sorted key-value pairs.
Accumulo stores data in tables, which are partitioned into tablets. Tablets are partitioned on row boundaries so that all of the columns and values for a particular row are found together within the same tablet. The Master assigns Tablets to one TabletServer at a time. This enables row-level transactions to take place without using distributed locking or some other complicated synchronization mechanism. As clients insert and query data, and as machines are added and removed from the cluster, the Master migrates tablets to ensure they remain available and that the ingest and query load is balanced across the cluster.
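Here is a hedged sketch of what writing one key-value pair with a visibility label looks like with the Accumulo Java client (instance name, ZooKeepers, credentials and table are placeholders):

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class AccumuloWriteSketch {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper (instance name and credentials are placeholders).
        Connector connector = new ZooKeeperInstance("accumulo", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = connector.createBatchWriter("mytable", new BatchWriterConfig());

        // Row ID, column family, qualifier, visibility, timestamp and value map directly
        // onto the key-value model described above.
        Mutation mutation = new Mutation(new Text("row-0001"));
        mutation.put(new Text("attributes"), new Text("color"),
                new ColumnVisibility("public|admin"),
                System.currentTimeMillis(),
                new Value("blue".getBytes()));

        writer.addMutation(mutation);
        writer.close();
    }
}
```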


Impala and SQL on Hadoop
The origins of Impala can be found in F1 – The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business.
One of many differences between MapReduce and Impala is that in Impala the intermediate data moves directly from process to process instead of being stored on HDFS for processes to pick up the data needed for processing. This provides a HUGE performance advantage while consuming fewer cluster resources. Less hardware to do more!
There are many advantages to this approach over alternative approaches for querying Hadoop data, including:
- Thanks to local processing on data nodes, network bottlenecks are avoided.
- A single, open, and unified metadata store can be utilized.
- Costly data format conversion is unnecessary and thus no overhead is incurred.
- All data is immediately query-able, with no delays for ETL.
- All hardware is utilized for Impala queries as well as for MapReduce.
- Only a single machine pool is needed to scale.
We encourage you to read the documentation for further exploration!
There are still transformation steps required to optimize queries, but Impala can help you do this with the Parquet file format. Better compression and optimized runtime performance are realized using Parquet, though many other file formats are supported.
This and a whole lot more was discussed with Marcel Kornacker, the Cloudera architect behind Impala, on Episode 18 of the All Things Hadoop Podcast.
Using Apache Drill for Large Scale, Interactive, Real-Time Analytic Queries
Episode #17 of the podcast is a talk with Jacques Nadeau available also on iTunes
Apache Drill http://incubator.apache.org/drill/ is a modern interactive query engine that runs on top of Hadoop.
Jacques talked about how Apache Drill is a modern query engine that is meant to be a query layer on top of all big data open source systems. Apache Drill is being designed with a pluggable storage engine so it can be the interface to any big data storage engine. The first release came out recently to allow developers to understand the data pipeline.
Leveraging an efficient columnar storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Drill is blazing fast. Coordination, query planning, optimization, scheduling, and execution are all distributed throughout nodes in a system to maximize parallelization.
Perform interactive analysis on all of your data, including nested and schema-less data. Drill supports querying against many different schema-less data sources including HBase, Cassandra and MongoDB. Naturally flat records are included as a special case of nested data.
Strongly defined tiers and APIs for straightforward integration with a wide array of technologies.
Subscribe to the podcast and listen to what Jacques had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Real-Time Data Pipelines and Analytics with Apache Kafka and Apache Samza
Episode #16 of the podcast is a talk with Jay Kreps available also on iTunes
Jay talked about the open source work he has done while at LinkedIn, including:
- Project Voldemort – http://www.project-voldemort.com/voldemort/ – a distributed database
- Apache Kafka – http://kafka.apache.org – A high-throughput distributed data pipeline and messaging system
- Apache Samza – http://samza.incubator.apache.org/ – Apache Samza is a distributed stream processing framework
- Azkaban – http://data.linkedin.com/opensource/azkaban – a batch job scheduler for Apache Hadoop
Most of the conversation was about the Apache Kafka pipeline and the use of Apache Samza for processing it.
Apache Kafka
http://kafka.apache.org/documentation.html#introduction
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
What does all that mean?
First let’s review some basic messaging terminology:
- Kafka maintains feeds of messages in categories called topics.
- We’ll call processes that publish messages to a Kafka topic producers.
- We’ll call processes that subscribe to topics and process the feed of published messages consumers.
- Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
So, at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers like this:

Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. We provide a Java client for Kafka, but clients are available in many languages.
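As a minimal sketch, a Java producer only needs the broker list, serializers, a topic and a key (the names below are placeholders, and this uses the newer Java producer client rather than the original Scala client):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Messages with the same key land in the same partition of the topic.
        producer.send(new ProducerRecord<>("page-views", "user-42", "clicked /home"));
        producer.close();
    }
}
```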
Topics and Logs
Let’s first dive into the high-level abstraction Kafka provides—the topic.
A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:

Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size so retaining lots of data is not a problem.
In fact the only metadata retained on a per-consumer basis is the position of the consumer in the log, called the “offset”. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads messages, but in fact the position is controlled by the consumer and it can consume messages in any order it likes. For example a consumer can reset to an older offset to reprocess.
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.
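Because the consumer owns its offset, replaying a topic is just a matter of seeking backwards. A sketch with the newer Java consumer API (broker, topic and group are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-example");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(partition));

            // Rewind to the beginning of the partition to reprocess older messages.
            consumer.seekToBeginning(Collections.singletonList(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}
```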
Apache Samza
http://samza.incubator.apache.org/
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
- Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API that should be familiar to anyone who’s used Map/Reduce (see the sketch after this list).
- Managed state: Samza manages snapshotting and restoration of a stream processor’s state. Samza will restore a stream processor’s state to a snapshot consistent with the processor’s last read messages when the processor is restarted.
- Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
- Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
- Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
- Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
- Processor isolation: Samza works with Apache YARN, which supports processor security through Hadoop’s security model, and resource isolation through Linux CGroups.
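As a sketch of how small that API surface is, a StreamTask that filters messages and forwards them to another Kafka topic might look like this (system and stream names are placeholders):

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class FilterTask implements StreamTask {
    // Output goes to another Kafka topic; "kafka" is the system name, "filtered-page-views" the stream.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-page-views");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();

        // Drop anything that is not a page view; forward the rest downstream.
        if (message != null && message.contains("page-view")) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
        }
    }
}
```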
Check out Hello Samza to try Samza. Read the Background page to learn more about Samza.
Subscribe to the podcast and listen to what Jay had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Big Data, Open Source and Analytics
Episode #15 of the podcast is a talk with Stefan Groschupf available also on iTunes
Stefan is the CEO of Datameer and talked about how the company started and where it is now. Founded in 2009 by some of the original contributors to Apache Hadoop, Datameer has grown to a global team, advancing big data analytics. After several implementations of Hadoop analytics solutions at Global 500 companies, the founders were determined to build the next generation analytics application to solve the new use cases created by the explosion of structured and unstructured data. Datameer is the single application for big data analytics, combining data integration, data transformation and data visualization. Customers love Datameer, and the team works to make it even better each day.
Datameer provides the most complete solution to analyze structured and unstructured data. Not limited by a pre-built schema, the point and click functions mean your analytics are only limited by your imagination. Even the most complex nested joins of a large number of datasets can be performed using an interactive dialog. Mix and match analytics and data transformations in an unlimited number of data processing pipelines. Leave the raw data untouched.
Datameer turbocharges time series analytics by correlating multiple sets of complex, disparate data. Resulting analytics are endless including correlation of credit card transactions with card holder authorizations, network traffic data, marketing interaction data and many more. The end game is a clear window into the operations of your business, giving you the actionable insights you need to make business decisions.
Data is the raw materials of insight and the more data you have, the deeper and broader the possible insights. Not just traditional, transaction data but all types of data so that you can get a complete view of your customers, better understand business processes and improve business performance.
Datameer ignores the limitations of ETL and static schemas to empower business users to integrate data from any source into Hadoop. Pre-built data connector wizards for all common structured and unstructured data sources mean that data integration is an easy, three-step process of where, what and when.
Now you never have to waste precious time by starting from scratch. Anyone can simply browse the Analytics App Market, download an app, connect to data, and get instant results. But why stop there? Every application is completely open so you can customize it, extend it, or even mash it up with other applications to get the insights you need.
Built by data scientists, analysts, or subject matter experts, analytic apps range from horizontal use cases like email and social sentiment analysis to vertical or even product-specific applications like advanced Salesforce.com sales-cycle analysis.
Check out the Datameer app market.
Subscribe to the podcast and listen to what Stefan had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
SQL Compatibility in Hadoop with Hive
Episode #14 of the podcast is a talk with Alan Gates available also on iTunes
The Stinger initiative is a collection of development threads in the Hive community that will deliver 100X performance improvements as well as SQL compatibility.
- Fast Interactive Query: an immediate aim of a 100x performance increase for Hive is more ambitious than any other effort.
- SQL Compatibility: based on industry standard SQL, the Stinger Initiative improves HiveQL to deliver SQL compatibility.
Apache Hive is the de facto standard for SQL-in-Hadoop today with more enterprises relying on this open source project than any alternative. As Hadoop gains in popularity, enterprise requirements for Hive to become more real time or interactive have evolved… and the Hive community has responded.
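Because Hive speaks SQL over standard interfaces, existing tooling can connect through JDBC against HiveServer2. A minimal sketch (host, database, table and query are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host and database in the URL are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```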
He spoke in detail about the Stinger initiative, who is contributing to it, why they decided to improve upon Hive and not create a new system and more.
He talked about how Microsoft is contributing in the open source community to improve upon Hive.
Hadoop is so much more than just SQL; one of the wonderful things about Big Data is the power it gives users to bring in different processing models, such as real-time streaming with Storm, graph processing with Giraph and ETL with Pig, and to do all sorts of things beyond just SQL compatibility.
Alan also talked about YARN and Tez and the benefits of the Stinger initiative to other Hadoop ecosystem tools too.
Subscribe to the podcast and listen to what Alan had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Apache Zookeeper, Distributed Systems, Open Source and more with Camille Fournier
Episode #13 of the podcast is a talk with Camille Fournier Available also on iTunes
Apache Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
Camille talked about discovery services and distributed locking, as well as some tips for developing against and operating ZooKeeper in production, including how to build a global, highly available service discovery infrastructure with ZooKeeper, which she also wrote about on her blog http://whilefalse.blogspot.com/2012/12/building-global-highly-available.html.
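As a hedged sketch of the ephemeral-node pattern such a discovery service is built on (the paths, timeout and address payload are made up for illustration, and the parent path is assumed to already exist):

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DiscoverySketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the watcher here simply ignores session events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> { });

        // Each instance registers an ephemeral sequential node under an existing parent path;
        // the node disappears automatically if the session dies.
        String path = zk.create("/services/search/instance-",
                "host1:8983".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered " + path);

        // Clients discover live instances by listing (and watching) the children.
        List<String> instances = zk.getChildren("/services/search", true);
        System.out.println("live instances: " + instances);

        zk.close();
    }
}
```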
Camille gave some great insights about how to apply Open Source community practices to an organization’s SDLC to foster a better culture for better products and services where all developers need to own more parts of their software (like it is in Open Source projects). #devops #qaops #userops
Subscribe to the podcast and listen to what Camille had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Apache BigTop and how packaging infrastructure binds the Hadoop ecosystem together
Episode #12 of the podcast is a talk with Mark Grover and Roman Shaposhnik Available also on iTunes
Apache Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
BigTop makes it easier to deploy Hadoop Ecosystem projects including:
- Apache Zookeeper
- Apache Flume
- Apache HBase
- Apache Pig
- Apache Hive
- Apache Sqoop
- Apache Oozie
- Apache Whirr
- Apache Mahout
- Apache Solr (SolrCloud)
- Apache Crunch (incubating)
- Apache HCatalog
- Apache Giraph
- LinkedIn DataFu
- Cloudera Hue
The list of supported Linux platforms has expanded to include:
- CentOS/RHEL 5 and 6
- Fedora 17 and 18
- SuSE Linux Enterprise 11
- OpenSUSE 12.2
- Ubuntu LTS Lucid (10.04) and Precise (12.04)
- Ubuntu Quantal (12.10)
Subscribe to the podcast and listen to what Mark and Roman had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Hadoop as a Service cloud platform with the Mortar Framework and Pig
Episode #11 of the podcast is a talk with K Young. Available also on iTunes
Mortar is the fastest and easiest way to work with Pig and Python on Hadoop in the Cloud.
Mortar’s platform is for everything from joining and cleansing large data sets to machine learning and building recommender systems.
Mortar makes it easy for developers and data scientists to do powerful work with Hadoop. The main advantages of Mortar are:
- Zero Setup Time: Mortar takes only minutes to set up (or no time at all on the web), and you can start running Pig jobs immediately. No need for painful installation or configuration.
- Powerful Tooling: Mortar provides a rich suite of tools to aid in Pig development, including the ability to Illustrate a script before running it, and an extremely fast and free local development mode.
- Elastic Clusters: We spin up Hadoop clusters as you need them, so you don’t have to predict your needs in advance, and you don’t pay for machines you don’t use.
- Solid Support: Whether the issue is in your script or in Hadoop, we’ll help you figure out a solution.
We talked about the Open Source Mortar Framework and their new Open Source tool for visualizing data while writing Pig scripts, called Watchtower.
The Mortar Blog has a great video demo on Watchtower.
There are no two ways around it: Hadoop development iterations are slow. Traditional programmers have always had the benefit of re-compiling their app, running it, and seeing the results within seconds. They have near-instant validation that what they’re building is actually working. When you’re working with Hadoop, dealing with gigabytes of data, your development iteration time is more like hours.
Subscribe to the podcast and listen to what K Young had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Hadoop, The Cloudera Development Kit, Parquet, Apache BigTop and more with Tom White
Episode #10 of the podcast is a talk with Tom White. Available also on iTunes
We talked a lot about The Cloudera Development Kit http://github.com/cloudera/cdk, or CDK for short, which is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
The goals of the CDK are:
- Codify expert patterns and practices for building data-oriented systems and applications.
- Let developers focus on business logic, not plumbing or infrastructure.
- Provide smart defaults for platform choices.
- Support piecemeal adoption via loosely-coupled modules.
Eric Sammer recorded a webinar in which he talks about the goals of the CDK.
This project is organized into modules. Modules may be independent or have dependencies on other modules within the CDK. When possible, dependencies on external projects are minimized.
We also talked about Parquet http://parquet.io/ which was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.
Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.
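As a rough sketch of writing a Parquet file through the Avro object model (the schema and output path are made up, and the parquet-avro builder API has shifted between releases, so treat this as indicative rather than definitive):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        // A tiny Avro schema; nested and repeated types are where Parquet's encoding shines.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/users.parquet"))
                .withSchema(schema)
                .build()) {
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);
            writer.write(user); // values are laid out column by column on disk
        }
    }
}
```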
Tom talked about Apache BigTop too http://bigtop.apache.org/ Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
Subscribe to the podcast and listen to what Tom had to say. Available also on iTunes
/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/
Hadoop, Mesos, Cascading, Scalding, Cascalog and Data Science with Paco Nathan
Episode #9 of the podcast is a talk with Paco Nathan. Available also on iTunes
We talked about how he got started with Hadoop back in 2007, doing natural language processing and text analytics.
And then we started talking about Mesos http://mesos.apache.org/
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
We talked a little about the difference between YARN and Mesos. Paco talked about how Mesos is lower in the stack and part of the operating system where YARN is higher up in the stack and built to support the Hadoop ecosystem in the JVM. He talked about the future of Mesos and touched on its contrast to Google Borg … for some more information on Google Borg and Mesos here is a great article http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/
Then we got into Cascading, which was started by Chris Wensel – http://www.cascading.org/ – and talked about the enterprise use cases for Cascading. He talked about how Cascading has always been geared to satisfy enterprise use cases: not slicing and dicing, but building an application on top of it and being able to debug it to see where it is running, because it is deterministic. He talked about how this contrasts with Hive and Pig. He brought up Steve Yegge’s post “Notes from the Mystery Machine Bus” https://plus.google.com/110981030061712822816/posts/KaSKeg4vQtz and talked a bit about how Cascading applies.
We got into design patterns for the enterprise with big batch workflows, breaking them up into five parts:
1) Different data sources (structured and unstructured data)
2) ETL
3) Custom data preparation and business logic to clean up the data
4) Analytics or predictive modeling to enrich the data
5) Integration with end use cases that consume the data products
Cascading addresses all of these points and Paco talked in more detail about them.
We finished up the podcast with him talking about the future of these technologies and also data science.
Subscribe to the podcast and listen to what Paco had to say. Available also on iTunes
/*
Joe Stein
Big Data Open Source Security LLC
http://www.stealth.ly
*/
Hortonworks HDP1, Apache Hadoop 2.0, NextGen MapReduce (YARN), HDFS Federation and the future of Hadoop with Arun C. Murthy
Episode #8 of the Podcast is a talk with Arun C. Murthy.
We talked about Hortonworks HDP1, the first release from Hortonworks, Apache Hadoop 2.0, NextGen MapReduce (YARN) and HDFS Federation.
Subscribe to the podcast and listen to all of what Arun had to share.
Some background to what we discussed:
Hortonworks Data Platform (HDP)
from their website: http://hortonworks.com/products/hortonworksdataplatform/
Hortonworks Data Platform (HDP) is a 100% open source data management platform based on Apache Hadoop. It allows you to load, store, process and manage data in virtually any format and at any scale. As the foundation for the next generation enterprise data architecture, HDP includes all of the necessary components to begin uncovering business insights from the quickly growing streams of data flowing into and throughout your business.
Hortonworks Data Platform is ideal for organizations that want to combine the power and cost-effectiveness of Apache Hadoop with the advanced services required for enterprise deployments. It is also ideal for solution providers that wish to integrate or extend their solutions with an open and extensible Apache Hadoop-based platform.
Key Features
- Integrated and Tested Package – HDP includes stable versions of all the critical Apache Hadoop components in an integrated and tested package.
- Easy Installation – HDP includes an installation and provisioning tool with a modern, intuitive user interface.
- Management and Monitoring Services – HDP includes intuitive dashboards for monitoring your clusters and creating alerts.
- Data Integration Services – HDP includes Talend Open Studio for Big Data, the leading open source integration tool for easily connecting Hadoop to hundreds of data systems without having to write code.
- Metadata Services – HDP includes Apache HCatalog, which simplifies data sharing between Hadoop applications and between Hadoop and other data systems.
- High Availability – HDP has been extended to seamlessly integrate with proven high availability solutions.
Apache Hadoop 2.0
from their website: http://hadoop.apache.org/common/docs/current/
Apache Hadoop 2.x consists of significant improvements over the previous stable release (hadoop-1.x).
Here is a short overview of the improvements to both HDFS and MapReduce.
- HDFS Federation – In order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes. More details are available in the HDFS Federation document.
- MapReduce NextGen aka YARN aka MRv2 – The new architecture, introduced in hadoop-0.23, divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components. The new ResourceManager manages the global assignment of compute resources to applications and the per-application ApplicationMaster manages the application’s scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. More details are available in the YARN document.
Getting Started
The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the Single Node Setup which shows you how to set up a single-node Hadoop installation. Then move on to the Cluster Setup to learn how to set up a multi-node Hadoop installation.
Apache Hadoop NextGen MapReduce (YARN)
from their website: http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, cpu, disk, network etc. In the first version, only memory is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.
The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster and providing the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
MRv2 maintains API compatibility with the previous stable release (hadoop-0.20.205). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.
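To illustrate that compatibility, a classic word-count driver written against the org.apache.hadoop.mapreduce API is exactly the same code whether it runs on MRv1 or on YARN. A minimal sketch using the built-in mapper and reducer classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountOnYarn {
    public static void main(String[] args) throws Exception {
        // The same MapReduce driver code runs on MRv1 or on YARN; only the cluster config changes.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountOnYarn.class);

        job.setMapperClass(TokenCounterMapper.class);   // built-in mapper: token -> (token, 1)
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```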
HDFS Federation
from their website: http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html
Background
HDFS has two main layers:
- Namespace
  - Consists of directories, files and blocks.
  - Supports all the namespace related file system operations such as create, delete, modify and list files and directories.
- Block Storage Service, which has two parts:
  - Block Management (which is done in the Namenode)
    - Provides datanode cluster membership by handling registrations and periodic heartbeats.
    - Processes block reports and maintains location of blocks.
    - Supports block related operations such as create, delete, modify and get block location.
    - Manages replica placement and replication of a block for under-replicated blocks, and deletes blocks that are over-replicated.
  - Storage – is provided by datanodes by storing blocks on the local file system and allows read/write access.
The prior HDFS architecture allows only a single namespace for the entire cluster. A single Namenode manages this namespace. HDFS Federation addresses this limitation of the prior architecture by adding support for multiple Namenodes/namespaces to the HDFS file system.
Multiple Namenodes/Namespaces
In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes.
Block Pool
A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster. It is managed independently of other block pools. This allows a namespace to generate Block IDs for new blocks without the need for coordination with the other namespaces. The failure of a Namenode does not prevent the datanode from serving other Namenodes in the cluster.
A Namespace and its block pool together are called Namespace Volume. It is a self-contained unit of management. When a Namenode/namespace is deleted, the corresponding block pool at the datanodes is deleted. Each namespace volume is upgraded as a unit, during cluster upgrade.
ClusterID
A new identifier ClusterID is added to identify all the nodes in the cluster. When a Namenode is formatted, this identifier is provided or auto generated. This ID should be used for formatting the other Namenodes into the cluster.
Key Benefits
- Namespace Scalability – HDFS cluster storage scales horizontally but the namespace does not. Large deployments, or deployments using lots of small files, benefit from scaling the namespace by adding more Namenodes to the cluster.
- Performance – File system operation throughput is limited by a single Namenode in the prior architecture. Adding more Namenodes to the cluster scales the file system read/write operations throughput.
- Isolation – A single Namenode offers no isolation in a multi-user environment. An experimental application can overload the Namenode and slow down production-critical applications. With multiple Namenodes, different categories of applications and users can be isolated to different namespaces.
Subscribe to the podcast and listen to all of what Arun had to share.
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/
Unified analytics and large scale machine learning with Milind Bhandarkar
Episode #7 of the Podcast is a talk with Milind Bhandarkar.
We talked about unified analytics, machine learning, data science, some great history of Hadoop, the future of Hadoop and a lot more!
subscribe to the podcast and listen to all of what Milind had to share.
/*
Joe Stein
http://www.medialets.com
*/
NoSQL HBase and Hadoop with Todd Lipcon from Cloudera
Episode #6 of the Podcast is a talk with Todd Lipcon from Cloudera discussing HBase.
We talked about NoSQL and how it should stand for “Not Only SQL”, the tight integration between Hadoop and HBase, and how systems like Cassandra (which is eventually consistent, not strongly consistent like HBase) are complementary, since these systems have applicability within the big data ecosystem depending on your use cases.
With the strong consistency of HBase you get features like incrementing counters, and the tight integration with Hadoop means faster loads with HDFS thanks to a new feature called “bulk loads”, documented in the doc folders of the 0.89 development preview release.
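A small sketch of the atomic counter feature using the HBase client API of that era (table, row, family and qualifier names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "page_metrics"); // placeholder table name

        // Strong consistency makes this increment atomic on the region server that owns the row.
        long views = table.incrementColumnValue(
                Bytes.toBytes("page:/home"),   // row key
                Bytes.toBytes("counters"),     // column family
                Bytes.toBytes("views"),        // qualifier
                1L);                           // amount to add

        System.out.println("views so far: " + views);
        table.close();
    }
}
```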
We covered a lot more unique features, talked about more of what is coming in upcoming releases as well as some tips with HBase so subscribe to the podcast and listen to all of what Todd had to say.
/*
Joe Stein
http://www.medialets.com
*/
Hadoop Development Tools By Karmasphere
In Episode #5 of the Hadoop Podcast https://allthingshadoop.com/podcast/ I speak with Shevek, the CTO of Karmasphere http://karmasphere.com/. To subscribe to the Podcast click here.
We talk a bit about their existing Community Edition (supporting NetBeans & Eclipse):
- For developing, debugging and deploying Hadoop Jobs
- Desktop MapReduce Prototyping
- GUI to manipulate clusters, file systems and jobs
- Easy deployment to any Hadoop version, any distribution in any cloud
- Works through firewalls
As well as the new products they have launched:
Karmasphere Client:
The Karmasphere Client is a cross-platform library for ensuring MapReduce jobs can work from any desktop environment to any Hadoop cluster in any enterprise data network. By isolating the Big Data professional from the version of Hadoop, Karmasphere Client simplifies the process of switching between data centers and the cloud and enables Hadoop jobs to be independent of the version of the underlying cluster.
Unlike the standard Hadoop client, Karmasphere Client works from Microsoft Windows as well as Linux and MacOS, and works through SSH-based firewalls. Karmasphere Client provides a cloud-independent environment that makes it easy and predictable to maintain a business operation reliant on Hadoop.
- Ensures Hadoop distribution and version independence
- Works from Windows (unlike Hadoop Client)
- Supports any cloud environment: private or public cloud service.
- Provides:
- Job portability
- Operating system portability
- Firewall hopping
- Fault tolerant API
- Synchronous and Asynchronous API
- Clean Object Oriented Design
- Making it easy and predictable to maintain a business operation reliant on Hadoop
Karmasphere Studio Professional Edition
Karmasphere Studio Professional Edition includes all the functionality of the Community Edition, plus a range of deeper functionality required to simplify the developer’s task of making a MapReduce job robust, efficient and production-ready.
For a MapReduce job to be robust, its functioning on the cluster has to be well understood in terms of time, processing, and storage requirements, as well as in terms of its behavior when implemented within well-defined “bounds.” Karmasphere Studio Professional Edition incorporates the tools and a predefined set of rules that make it easy for the developer to understand how his or her job is performing on the cluster and where there is room for improvement.
- Enhanced cluster visualization and debugging
- Execution diagnostics
- Job performance timelines
- Job charting
- Job profiling
- Job Export
- For easy production deployment
- Support
Karmasphere Studio Analyst Edition
- SQL interface for ad hoc analysis
- Karmasphere Application Framework + Hive + GUI =
- No cluster changes
- Works over proxies and firewalls
- Integrated Hadoop monitoring
- Interactive syntax checking
- Detailed diagnostics
- Enhanced schema browser
- Full JDBC4 compliance
- Multi-threaded & concurrent
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/
Hadoop and Pig with Alan Gates from Yahoo
Episode 4 of our Podcast is with Alan Gates, Senior Software Engineer @ Yahoo! and Pig committer. Click here to listen.
Hadoop is a really important part of Yahoo’s infrastructure because processing and analyzing big data is increasingly important for their business. Hadoop allows Yahoo to connect their consumer products with their advertisers and users for a better user experience. They have been involved with Hadoop for many years now and have their own distribution. Yahoo also sponsors/hosts a user group meeting which has grown to hundreds of attendees every month.
We talked about what Pig is now, the future of Pig and other projects like Oozie http://github.com/tucu00/oozie1 which Yahoo uses (and is open source) for workflow of MapReduce & Pig script automation. We also talked about Zebra http://wiki.apache.org/pig/zebra, Owl http://wiki.apache.org/pig/owl, and Elephant Bird http://github.com/kevinweil/elephant-bird
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/