Archive

Archive for the ‘Hive’ Category

Beyond MapReduce and Apache Hadoop 2.X with Bikas Saha and Arun Murthy

April 2, 2014 Leave a comment

Episode 20 of the podcast was with Bikas Saha and Arun Murthy.

When I spoke with Arun a year or so a go YARN was NextGen Hadoop and there have been a lot of updates, work done and production experience since.

Besides Yahoo! other multi thousand node clusters have been and are running in production with YARN. These clusters have shown 2x capacity throughput which resulted in reduced cost for hardware (and in some cases being able to shut down co-los) while still gaining performance improvements overall to previous clusters of Hadoop 1.X.

I got to hear about some of what is in 2.4 and coming in 2.5 of Hadoop:

  • Application timeline server repository and api for application specific metrics (Tez, Spark, Whatever).
    • web service API to put and get with some aggregation.
    • plugable nosql store (hbase, accumulo) to scale it.
  • Preemption capacity scheduler.
  • Multiple resource support (CPU, RAM and Disk).
  • Labels tag nodes with labels can be labeled however so some windows and some linux and ask for resources with only those labels with ACLS.
  • Hypervisor support as a key part of the topology.
  • Hoya generalize for YARN (game changer) and now proposed as Slider to the Apache incubator.

We talked about Tez which provides complex DAGs of queries to translate what you want to-do on Hadoop without the work arounds for making it have to run in MapReduce.  MapReduce was not designed to be re-workable out side of the parts of the Job it gave you for Map, Split, Shuffle, Combine, Reduce, Etc and Tez is more expressible exposing a DAG API.

PigHiveQueryOnMR

Now becomes with Tez:

PigHiveQueryOnTez

 

There were also some updates on Hive v13 coming out with sub queries, low latency queries (through Tez), high precision decimal points and more!

Subscribe to the podcast and listen to all of what Bikas and Arun had to say.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 Twitter: @allthingshadoop
********************************************/

 

 

SQL Compatibility in Hadoop with Hive

August 15, 2013 Leave a comment

Episode #14 of the podcast is a talk with Alan Gates available also on iTunes

The Stinger initiative is a collection of development threads in the Hive community that will deliver 100X performance improvements as well as SQL compatibility.

Fast Interactive Query
An immediate aim of 100x performance increase for Hive is more ambitious than any other effort.
SQL Compatibility
Based on industry standard SQL, the Stinger Initiative improves HiveQL to deliver SQL compatibility.

Apache Hive is the de facto standard for SQL-in-Hadoop today with more enterprises relying on this open source project than any alternative. As Hadoop gains in popularity, enterprise requirements for Hive to become more real time or interactive have evolved… and the Hive community has responded.

He spoke in detail about the Stinger initiative, who is contributing to it, why they decided to improve upon Hive and not create a new system and more.

He talked about how Microsoft is contributing in the open source community to improve upon Hive.

Hadoop is so much more than just SQL, one of the wonderful things about Big Data is the power it brings for users to bring different processing models such as realtime streaming with Storm, Graph processing with Giraph and ETL with Pig and all different things to-do beyond just this SQL compatibility.

Alan also talked about YARN and Tez and the benefits of the Stinger initiative to other Hadoop ecosystem tools too.

Subscribe to the podcast and listen to what Alan had to say.  Available also on iTunes

/*********************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
**********************************/

Categories: Hadoop, Hive, Podcast

Pre-Release from Pentaho – HIVE JDBC Adapter

August 15, 2010 Leave a comment

Pentaho’s Jordan Ganoff, Software Engineer, has open sourced some HIVE JDBC Adapters in what they are doing for their BI server

http://forums.pentaho.com/showthread.php?77826-Hive-amp-Hadoop

Not sure what state they are in but will try to check it on this week.

To use from maven:
<dependency>
<groupId>org.apache.hadoop.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>0.5.0-pentaho-SNAPSHOT</version>
</dependency>

You must also add the repository information to either the pom.xml or
your local settings:
<repository>
<id>pentaho</id>
<name>Pentaho External Repository</name>
<url>http://repo.pentaho.org/artifactory/repo</url&gt;
</repository>

/*

Joe Stein
http://medialets.com

*/

Categories: Hive, Open Source Projects
Follow

Get every new post delivered to your Inbox.

Join 49 other followers