Hadoop, The Cloudera Development Kit, Parquet, Apache Bigtop and more with Tom White
We talked a lot about the Cloudera Development Kit (http://github.com/cloudera/cdk), or CDK for short, a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
The goals of the CDK are:
- Codify expert patterns and practices for building data-oriented systems and applications.
- Let developers focus on business logic, not plumbing or infrastructure.
- Provide smart defaults for platform choices.
- Support piecemeal adoption via loosely-coupled modules.
This project is organized into modules. Modules may be independent or have dependencies on other modules within the CDK. When possible, dependencies on external projects are minimized.
We also talked about Parquet (http://parquet.io/), which was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. The Parquet project considers this approach superior to simple flattening of nested namespaces.
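To make the repetition/definition level idea a little more concrete, here is a minimal Python sketch. It assumes a toy schema with a single repeated integer field per record, so both levels max out at 1. The names shred and assemble are made up for illustration and are not part of Parquet or any library. Real Parquet handles arbitrarily deep nesting and stores the levels in a compact binary form.

```python
from typing import List, Optional, Tuple

# One column entry: (repetition level, definition level, value).
Triple = Tuple[int, int, Optional[int]]


def shred(records: List[List[int]]) -> List[Triple]:
    """Flatten records (each a list of ints) into one column of triples."""
    column: List[Triple] = []
    for values in records:
        if not values:
            # Empty list: the field is not defined, so definition level is 0.
            column.append((0, 0, None))
            continue
        for i, value in enumerate(values):
            # Repetition level 0 starts a new record; 1 continues the list.
            column.append((0 if i == 0 else 1, 1, value))
    return column


def assemble(column: List[Triple]) -> List[List[int]]:
    """Rebuild the original records from the column of triples."""
    records: List[List[int]] = []
    for rep, definition, value in column:
        if rep == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


records = [[1, 2, 3], [], [4]]
column = shred(records)
restored = assemble(column)
```

The point of the two levels is that the column alone is enough to recover the record boundaries and the empty list, so nothing is lost by storing the data column by column.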
Parquet is built to support very efficient compression and encoding schemes. It allows compression schemes to be specified at a per-column level and is designed so that more encodings can be added as they are invented and implemented. The concepts of encoding and compression are kept separate, which lets Parquet consumers implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible.
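As a rough illustration of why keeping encoding and compression separate is useful, the Python sketch below dictionary-encodes a column of strings, counts matches by comparing integer codes instead of decoded strings, and applies zlib compression as an independent step. All function names here are hypothetical, and the byte layout has nothing to do with Parquet's actual file format. In this toy version the zlib layer still has to be undone to reach the codes; what the operator avoids is decoding codes back into strings.

```python
import zlib
from array import array
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Encoding step: replace each string with a small integer code."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def count_equal(dictionary: List[str], codes: List[int], target: str) -> int:
    """Operator on encoded data: compare codes, never rebuild the strings."""
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(1 for c in codes if c == code)


def compress_codes(codes: List[int]) -> bytes:
    """Compression step: independent of how the values were encoded."""
    return zlib.compress(array("I", codes).tobytes())


def decompress_codes(blob: bytes) -> List[int]:
    """Undo the compression step only; the result is still encoded."""
    out = array("I")
    out.frombytes(zlib.decompress(blob))
    return list(out)


countries = ["us", "de", "us", "us", "fr", "de"]
dictionary, codes = dictionary_encode(countries)
blob = compress_codes(codes)
matches = count_equal(dictionary, decompress_codes(blob), "us")
```

Because the two steps are independent, a different compressor could be swapped in for one column without touching the encoding or the operator.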
Tom talked about Apache Bigtop (http://bigtop.apache.org/) too. Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.