Archive for August, 2013

Big Data, Open Source and Analytics

August 26, 2013

Episode #15 of the podcast is a talk with Stefan Groschupf, also available on iTunes.

Stefan is the CEO of Datameer and talked about how the company started and where it is now. Founded in 2009 by some of the original contributors to Apache Hadoop, Datameer has grown into a global team advancing big data analytics. After several implementations of Hadoop analytics solutions at Global 500 companies, the founders were determined to build the next generation analytics application to solve the new use cases created by the explosion of structured and unstructured data. Datameer is a single application for big data analytics that combines data integration, data transformation and data visualization. In the company's words, customers love the product and the team works to make Datameer even better each day.

Datameer provides the most complete solution to analyze structured and unstructured data. Because it is not limited by a pre-built schema, the point-and-click functions mean your analytics are only limited by your imagination. Even the most complex nested joins of a large number of datasets can be performed using an interactive dialog. Mix and match analytics and data transformations in an unlimited number of data processing pipelines. Leave the raw data untouched.

Datameer turbocharges time series analytics by correlating multiple sets of complex, disparate data. The possible analytics are endless, including correlation of credit card transactions with card holder authorizations, network traffic data, marketing interaction data and many more. The end game is a clear window into the operations of your business, giving you the actionable insights you need to make business decisions.

Data is the raw material of insight, and the more data you have, the deeper and broader the possible insights. That means not just traditional transaction data but all types of data, so that you can get a complete view of your customers, better understand business processes and improve business performance.

Datameer ignores the limitations of ETL and static schemas to empower business users to integrate data from any source into Hadoop. Pre-built data connector wizards for all common structured and unstructured data sources mean that data integration is an easy, three-step process of where, what and when.

App Market Infographics

Now you never have to waste precious time by starting from scratch. Anyone can simply browse the Analytics App Market, download an app, connect to data, and get instant results. But why stop there? Every application is completely open so you can customize it, extend it, or even mash it up with other applications to get the insights you need.

Built by data scientists, analysts, or subject matter experts, analytic apps range from horizontal use cases like email and social sentiment analysis to vertical or even product-specific applications like advanced Salesforce.com sales-cycle analysis.

Check out the Datameer app market.

Subscribe to the podcast and listen to what Stefan had to say. Also available on iTunes.

/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/

Categories: Hadoop, Podcast

SQL Compatibility in Hadoop with Hive

August 15, 2013

Episode #14 of the podcast is a talk with Alan Gates, also available on iTunes.

The Stinger initiative is a collection of development threads in the Hive community that aims to deliver 100x performance improvements as well as SQL compatibility.

  • Fast Interactive Query: An immediate aim of a 100x performance increase for Hive is more ambitious than any other effort.
  • SQL Compatibility: Based on industry standard SQL, the Stinger initiative improves HiveQL to deliver SQL compatibility.

Apache Hive is the de facto standard for SQL-in-Hadoop today with more enterprises relying on this open source project than any alternative. As Hadoop gains in popularity, enterprise requirements for Hive to become more real time or interactive have evolved… and the Hive community has responded.

Alan spoke in detail about the Stinger initiative, who is contributing to it, why they decided to improve upon Hive rather than create a new system, and more.

He also talked about how Microsoft is contributing in the open source community to improve Hive.

Hadoop is so much more than just SQL. One of the wonderful things about Big Data is the power it gives users to bring in different processing models, such as real-time streaming with Storm, graph processing with Giraph and ETL with Pig, for all the things there are to do beyond SQL compatibility.

Alan also talked about YARN and Tez, and about the benefits of the Stinger initiative to other Hadoop ecosystem tools.

Subscribe to the podcast and listen to what Alan had to say. Also available on iTunes.

Categories: Hadoop, Hive, Podcast

Apache ZooKeeper, Distributed Systems, Open Source and more with Camille Fournier

August 13, 2013

Episode #13 of the podcast is a talk with Camille Fournier, also available on iTunes.

Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications usually skimp on them at first, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Camille talked about discovery services and distributed locking, as well as some tips for developing against and operating ZooKeeper in production. That included how to build a global, highly available service discovery infrastructure with ZooKeeper, which she also wrote about on her blog: http://whilefalse.blogspot.com/2012/12/building-global-highly-available.html
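To make the distributed locking idea a little more concrete, here is a toy, in-memory Python model of the commonly described lock recipe built on ephemeral sequential nodes. It is only a sketch: it does not talk to a real ZooKeeper ensemble, it is not taken from the podcast, and `ToyCoordinator`, `acquire`, `holder`, `predecessor` and `expire_session` are hypothetical names invented for this illustration rather than ZooKeeper API calls.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (NOT the ZooKeeper API)."""

    def __init__(self):
        self._counter = itertools.count()  # source of increasing sequence numbers
        self._nodes = {}                   # sequence number -> owning session id

    def acquire(self, session_id):
        """Create an 'ephemeral sequential' node and return its sequence number."""
        seq = next(self._counter)
        self._nodes[seq] = session_id
        return seq

    def holder(self):
        """The session owning the lowest-numbered node holds the lock."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]

    def predecessor(self, seq):
        """The node a waiter would watch: the next-lowest sequence number."""
        lower = [s for s in self._nodes if s < seq]
        return max(lower) if lower else None

    def expire_session(self, session_id):
        """Ephemeral nodes vanish when their session ends (crash or release)."""
        self._nodes = {s: o for s, o in self._nodes.items() if o != session_id}


coord = ToyCoordinator()
a = coord.acquire("client-a")
b = coord.acquire("client-b")
c = coord.acquire("client-c")

first_holder = coord.holder()         # client-a created the lowest node
watched_by_c = coord.predecessor(c)   # client-c watches only client-b's node

coord.expire_session("client-a")      # client-a crashes or releases the lock
second_holder = coord.holder()        # the lock passes to the next in line
```

Having each waiter watch only its immediate predecessor, rather than the lock holder, is what keeps a release from waking every waiting client at once.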

Camille gave some great insights into how to apply Open Source community practices to an organization’s SDLC to foster a better culture for better products and services, where all developers need to own more parts of their software (as they do in Open Source projects). #devops #qaops #userops

Subscribe to the podcast and listen to what Camille had to say. Also available on iTunes.

Categories: Podcast, Zookeeper

Apache Bigtop and how packaging infrastructure binds the Hadoop ecosystem together

August 12, 2013

Episode #12 of the podcast is a talk with Mark Grover and Roman Shaposhnik, also available on iTunes.

Apache Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Bigtop makes it easier to deploy Hadoop ecosystem projects including:

  • Apache ZooKeeper
  • Apache Flume
  • Apache HBase
  • Apache Pig
  • Apache Hive
  • Apache Sqoop
  • Apache Oozie
  • Apache Whirr
  • Apache Mahout
  • Apache Solr (SolrCloud)
  • Apache Crunch (incubating)
  • Apache HCatalog
  • Apache Giraph
  • LinkedIn DataFu
  • Cloudera Hue

The list of supported Linux platforms has expanded to include:

  • CentOS/RHEL 5 and 6
  • Fedora 17 and 18
  • SuSE Linux Enterprise 11
  • OpenSUSE 12.2
  • Ubuntu LTS Lucid (10.04) and Precise (12.04)
  • Ubuntu Quantal (12.10)

Subscribe to the podcast and listen to what Mark and Roman had to say. Also available on iTunes.

Hadoop as a Service cloud platform with the Mortar Framework and Pig

August 9, 2013

Episode #11 of the podcast is a talk with K Young, also available on iTunes.

Mortar is the fastest and easiest way to work with Pig and Python on Hadoop in the Cloud.

Mortar’s platform is for everything from joining and cleansing large data sets to machine learning and building recommender systems.

Mortar makes it easy for developers and data scientists to do powerful work with Hadoop. The main advantages of Mortar are:

  • Zero Setup Time: Mortar takes only minutes to set up (or no time at all on the web), and you can start running Pig jobs immediately. No need for painful installation or configuration.
  • Powerful Tooling: Mortar provides a rich suite of tools to aid in Pig development, including the ability to Illustrate a script before running it, and an extremely fast and free local development mode.
  • Elastic Clusters: We spin up Hadoop clusters as you need them, so you don’t have to predict your needs in advance, and you don’t pay for machines you don’t use.
  • Solid Support: Whether the issue is in your script or in Hadoop, we’ll help you figure out a solution.

We talked about the Open Source Mortar Framework and Watchtower, their new Open Source tool for visualizing data while writing Pig scripts.

The Mortar Blog has a great video demo on Watchtower.

There are no two ways about it: Hadoop development iterations are slow. Traditional programmers have always had the benefit of re-compiling their app, running it, and seeing the results within seconds. They have near-instant validation that what they’re building is actually working. When you’re working with Hadoop, dealing with gigabytes of data, your development iteration time is more like hours.

Subscribe to the podcast and listen to what K Young had to say. Also available on iTunes.

Categories: Hadoop, Pig, Podcast

Hadoop, The Cloudera Development Kit, Parquet, Apache Bigtop and more with Tom White

August 2, 2013

Episode #10 of the podcast is a talk with Tom White, also available on iTunes.

We talked a lot about the Cloudera Development Kit (http://github.com/cloudera/cdk), or CDK for short, which is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

The goals of the CDK are:

  • Codify expert patterns and practices for building data-oriented systems and applications.
  • Let developers focus on business logic, not plumbing or infrastructure.
  • Provide smart defaults for platform choices.
  • Support piecemeal adoption via loosely-coupled modules.

Eric Sammer recorded a webinar in which he talks about the goals of the CDK.

This project is organized into modules. Modules may be independent or have dependencies on other modules within the CDK. When possible, dependencies on external projects are minimized.

We also talked about Parquet (http://parquet.io/), which was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.
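The repetition/definition level idea can be hard to picture, so here is a minimal Python sketch of the simplest possible case: a single top-level repeated field, where both levels have a maximum of 1. It is an illustration of the concept only, not Parquet's on-disk format or any Parquet library API, and the function names `shred` and `assemble` are hypothetical.

```python
def shred(records):
    """Flatten records of one repeated field into (value, r, d) triples.

    r (repetition level): 0 starts a new record, 1 continues the current list.
    d (definition level): 0 means the list is empty, 1 means a value is present.
    """
    column = []
    for values in records:
        if not values:
            # An empty list still needs an entry so the record is not lost.
            column.append((None, 0, 0))
            continue
        for i, value in enumerate(values):
            column.append((value, 0 if i == 0 else 1, 1))
    return column


def assemble(column):
    """Rebuild the original records from the flat column of triples."""
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])       # repetition level 0 opens a new record
        if d == 1:
            records[-1].append(value)  # only defined values are materialized
    return records


records = [[10, 20], [], [30]]
column = shred(records)
restored = assemble(column)
```

Here `column` comes out as `[(10, 0, 1), (20, 1, 1), (None, 0, 0), (30, 0, 1)]`, and `assemble` recovers the original records, including the empty one. Deeper nesting works the same way with larger maximum levels.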

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible.
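As a rough sketch of why separating encoding from compression matters, the Python below run-length encodes one column, answers a query directly on the encoded runs, and only then applies a general-purpose compressor to the encoded bytes. This is an assumption-laden toy: the run-length scheme, the `rle_encode` and `count_equal` helpers, and the use of `zlib` are all chosen for illustration and do not reflect Parquet's actual encodings or file layout.

```python
import zlib


def rle_encode(values):
    """Encoding step: collapse consecutive repeats into (value, run_length)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def count_equal(runs, target):
    """Operator that works on encoded runs without expanding them to rows."""
    return sum(n for v, n in runs if v == target)


country = ["US", "US", "US", "DE", "DE", "US"]
runs = rle_encode(country)
us_rows = count_equal(runs, "US")

# Compression step: a separate, generic layer applied to the encoded bytes.
payload = repr(runs).encode("utf-8")
compressed = zlib.compress(payload)
round_trip = zlib.decompress(compressed)
```

The count is computed from three runs rather than six rows, and the compressor never needs to know how the column was encoded.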

Tom talked about Apache Bigtop (http://bigtop.apache.org/) too. Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Subscribe to the podcast and listen to what Tom had to say. Also available on iTunes.
