<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>All Things Hadoop</title>
	<atom:link href="http://allthingshadoop.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://allthingshadoop.com</link>
	<description>Scalable &#38; Distributed Computing for noobs, nerds and the elite Hadooper and Hadooperette.</description>
	<lastBuildDate>Thu, 02 May 2013 21:53:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='allthingshadoop.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/c6d1ce6389fbc4c5c50fe33c968530fc?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>All Things Hadoop</title>
		<link>http://allthingshadoop.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://allthingshadoop.com/osd.xml" title="All Things Hadoop" />
	<atom:link rel='hub' href='http://allthingshadoop.com/?pushpress=hub'/>
		<item>
		<title>Using Scala To Work With Hadoop</title>
		<link>http://allthingshadoop.com/2013/05/02/using-scala-to-work-with-hadoop/</link>
		<comments>http://allthingshadoop.com/2013/05/02/using-scala-to-work-with-hadoop/#comments</comments>
		<pubDate>Thu, 02 May 2013 21:53:38 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=524</guid>
		<description><![CDATA[Cloudera has a great toolkit to work with Hadoop.  Specifically it is focused on building distributed systems and services on top of the Hadoop Ecosystem. http://cloudera.github.io/cdk/docs/0.2.0/cdk-data/guide.html And the examples are in Scala!!!! Here is how you work with generic records on the file system, including reading and writing Avro files. https://github.com/cloudera/cdk/blob/master/cdk-examples/src/main/scala/creategeneric.scala /** * Copyright [...]]]></description>
				<content:encoded><![CDATA[<p>Cloudera has a great toolkit to work with Hadoop.  Specifically it is focused on building distributed systems and services on top of the Hadoop Ecosystem.</p>
<p><a href="http://cloudera.github.io/cdk/docs/0.2.0/cdk-data/guide.html">http://cloudera.github.io/cdk/docs/0.2.0/cdk-data/guide.html</a></p>
<p>And the examples are in Scala!!!!</p>
<p>Here is how you work with generic records on the file system, including reading and writing Avro files.</p>
<p><a href="https://github.com/cloudera/cdk/blob/master/cdk-examples/src/main/scala/creategeneric.scala">https://github.com/cloudera/cdk/blob/master/cdk-examples/src/main/scala/creategeneric.scala</a></p>
<p><code>/**<br />
* Copyright 2013 Cloudera Inc.<br />
*<br />
* Licensed under the Apache License, Version 2.0 (the "License");<br />
* you may not use this file except in compliance with the License.<br />
* You may obtain a copy of the License at<br />
*<br />
* <a href="http://www.apache.org/licenses/LICENSE-2.0" rel="nofollow">http://www.apache.org/licenses/LICENSE-2.0</a><br />
*<br />
* Unless required by applicable law or agreed to in writing, software<br />
* distributed under the License is distributed on an "AS IS" BASIS,<br />
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.<br />
* See the License for the specific language governing permissions and<br />
* limitations under the License.<br />
*/<br />
import com.cloudera.data.{DatasetDescriptor, DatasetWriter}<br />
import com.cloudera.data.filesystem.FileSystemDatasetRepository<br />
import java.io.FileInputStream<br />
import org.apache.avro.Schema<br />
import org.apache.avro.Schema.Parser<br />
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}<br />
import org.apache.hadoop.conf.Configuration<br />
import org.apache.hadoop.fs.{FileSystem, Path}<br />
import scala.compat.Platform<br />
import scala.util.Random</p>
<p>// Construct a local filesystem dataset repository rooted at /tmp/data<br />
val repo = new FileSystemDatasetRepository(<br />
FileSystem.get(new Configuration()),<br />
new Path("/tmp/data")<br />
)</p>
<p>// Read an Avro schema from the user.avsc file under src/main/resources<br />
val schema = new Parser().parse(new FileInputStream("src/main/resources/user.avsc"))</p>
<p>// Create a dataset of users with the Avro schema in the repository<br />
val descriptor = new DatasetDescriptor.Builder().schema(schema).get()<br />
val users = repo.create("users", descriptor)</p>
<p>// Get a writer for the dataset and write some users to it<br />
val writer = users.getWriter().asInstanceOf[DatasetWriter[GenericRecord]]<br />
writer.open()<br />
val colors = Array("green", "blue", "pink", "brown", "yellow")<br />
val rand = new Random()<br />
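// NOTE: the loop header was lost in the feed; writing 100 users is an assumption<br />
for (i &lt;- 0 until 100) {<br />
val builder = new GenericRecordBuilder(schema)<br />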
val record = builder.set("username", "user-" + i)<br />
.set("creationDate", Platform.currentTime)<br />
.set("favoriteColor", colors(rand.nextInt(colors.length))).build()<br />
writer.write(record)<br />
}<br />
writer.close()</p>
<p></code></p>
<p>Big ups to the Cloudera team!</p>
<p>/*<br />
Joe Stein<br />
<a href="https://twitter.com/allthingshadoop">https://twitter.com/allthingshadoop</a><br />
*/</p>
]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2013/05/02/using-scala-to-work-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Hortonworks HDP1, Apache Hadoop 2.0, NextGen MapReduce (YARN), HDFS Federation and the future of Hadoop with Arun C. Murthy</title>
		<link>http://allthingshadoop.com/2012/07/23/hortonworks-hdp1-apache-hadoop-2-0-nextgen-mapreduce-yarn-hdfs-federation-and-the-future-of-hadoop-with-arun-c-murthy/</link>
		<comments>http://allthingshadoop.com/2012/07/23/hortonworks-hdp1-apache-hadoop-2-0-nextgen-mapreduce-yarn-hdfs-federation-and-the-future-of-hadoop-with-arun-c-murthy/#comments</comments>
		<pubDate>Mon, 23 Jul 2012 12:25:36 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open Source Projects]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=496</guid>
		<description><![CDATA[Episode #8 of the Podcast is a talk with Arun C. Murthy. We talked about Hortonworks HDP1, the first release from Hortonworks, Apache Hadoop 2.0, NextGen MapReduce (YARN) and HDFS Federation. Subscribe to the podcast and listen to all of what Arun had to share. Some background to what we discussed: Hortonworks Data Platform (HDP) [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">Episode #8</a> of the <a href="http://allthingshadoop.com/podcast" target="_self">Podcast</a> is a talk with <a href="http://twitter.com/acmurthy" target="_blank">Arun C. Murthy</a>.</p>
<p>We talked about <a href="http://hortonworks.com/products/hortonworksdataplatform/" target="_blank">Hortonworks HDP1</a>, the first release from Hortonworks, <a href="http://hadoop.apache.org/common/docs/current/" target="_blank">Apache Hadoop 2.0</a>, <a href="http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html" target="_blank">NextGen MapReduce (YARN)</a> and <a href="http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/Federation.html" target="_blank">HDFS Federation</a>.</p>
<p><a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">Subscribe to the podcast</a> and listen to all of what Arun had to share.</p>
<p>Some background to what we discussed:</p>
<h2>Hortonworks Data Platform (HDP)</h2>
<p>from their website: <a href="http://hortonworks.com/products/hortonworksdataplatform/" target="_blank">http://hortonworks.com/products/hortonworksdataplatform/</a></p>
<p>Hortonworks Data Platform (HDP) is a 100% open source data management platform based on Apache Hadoop. It allows you to load, store, process and manage data in virtually any format and at any scale. As the foundation for the next generation enterprise data architecture, HDP includes all of the necessary components to begin uncovering business insights from the quickly growing streams of data flowing into and throughout your business.</p>
<p>Hortonworks Data Platform is ideal for organizations that want to combine the power and cost-effectiveness of Apache Hadoop with the advanced services required for enterprise deployments. It is also ideal for solution providers that wish to integrate or extend their solutions with an open and extensible Apache Hadoop-based platform.</p>
<h5>Key Features</h5>
<ul>
<li><strong>Integrated and Tested Package</strong> – HDP includes stable versions of all the critical Apache Hadoop components in an integrated and tested package.</li>
<li><strong>Easy Installation</strong> – HDP includes an installation and provisioning tool with a modern, intuitive user interface.</li>
<li><strong>Management and Monitoring Services</strong> – HDP includes intuitive dashboards for monitoring your clusters and creating alerts.</li>
<li><strong>Data Integration Services</strong> – HDP includes Talend Open Studio for Big Data, the leading open source integration tool for easily connecting Hadoop to hundreds of data systems without having to write code.</li>
<li><strong>Metadata Services</strong> – HDP includes Apache HCatalog, which simplifies data sharing between Hadoop applications and between Hadoop and other data systems.</li>
<li><strong>High Availability</strong> – HDP has been extended to seamlessly integrate with proven high availability solutions.</li>
</ul>
<h2>Apache Hadoop 2.0</h2>
<p>from their website: <a href="http://hadoop.apache.org/common/docs/current/" target="_blank">http://hadoop.apache.org/common/docs/current/</a></p>
<div>
<p>Apache Hadoop 2.x consists of significant improvements over the previous stable release (hadoop-1.x).</p>
<p>Here is a short overview of the improvements to both HDFS and MapReduce.</p>
<ul>
<li><a name="HDFS_Federation"></a><strong>HDFS Federation</strong><br />
In order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated, that is, the Namenodes are independent and don&#8217;t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes. More details are available in the <a href="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html">HDFS Federation</a> document.</li>
<li><a name="MapReduce_NextGen_aka_YARN_aka_MRv2"></a><strong>MapReduce NextGen aka YARN aka MRv2</strong><br />
The new architecture, introduced in hadoop-0.23, divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components. The new ResourceManager manages the global assignment of compute resources to applications and the per-application ApplicationMaster manages the application&#8217;s scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs. The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
<p>More details are available in the <a href="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a> document.</p></li>
</ul>
</div>
<div>
<h5>Getting Started<a name="Getting_Started"></a></h5>
<p>The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the <a href="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/SingleCluster.html">Single Node Setup</a> which shows you how to set up a single-node Hadoop installation. Then move on to the <a href="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html">Cluster Setup</a> to learn how to set up a multi-node Hadoop installation.</p>
</div>
<h2>Apache Hadoop NextGen MapReduce (YARN)<a name="Apache_Hadoop_NextGen_MapReduce_YARN"></a></h2>
<p>from their website: <a href="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html" target="_blank">http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html</a></p>
<p>MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.</p>
<p>The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (<em>RM</em>) and per-application ApplicationMaster (<em>AM</em>). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.</p>
<p>The ResourceManager and per-node slave, the NodeManager (<em>NM</em>), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.</p>
<p>The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.</p>
<p><img src="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif" alt="MapReduce NextGen Architecture" /></p>
<p>The ResourceManager has two main components: Scheduler and ApplicationsManager.</p>
<p>The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource <em>Container</em> which incorporates elements such as memory, cpu, disk, network etc. In the first version, only <tt>memory</tt> is supported.</p>
<p>The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.</p>
<p>The CapacityScheduler supports <tt>hierarchical queues</tt> to allow for more predictable sharing of cluster resources.</p>
<p>The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and providing the service for restarting the ApplicationMaster container on failure.</p>
<p>The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting it to the ResourceManager/Scheduler.</p>
<p>The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.</p>
<p>MRv2 maintains <strong>API compatibility</strong> with the previous stable release (hadoop-0.20.205). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.</p>
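<p>To make the Scheduler's role a bit more concrete, here is a minimal toy sketch in plain Scala. It is not the YARN API: <tt>ToyScheduler</tt>, <tt>Request</tt>, <tt>Container</tt> and the queue names are all hypothetical, and the only resource modeled is memory, as in the first version described above. It grants container requests in arrival order while each queue stays within its capacity, and does no monitoring or restarting.</p>
<pre><code>object ToyScheduler {
  // A container request: the queue it is charged to and the memory it needs
  final case class Request(app: String, queue: String, memoryMb: Int)
  final case class Container(app: String, queue: String, memoryMb: Int)

  // Grants requests in arrival order while each queue stays within its
  // memory capacity. It only hands out containers; it never tracks task
  // status or restarts anything.
  def allocate(queueCapacityMb: Map[String, Int], requests: Seq[Request]): Seq[Container] = {
    val start = (Map.empty[String, Int], Vector.empty[Container])
    val (_, granted) = requests.foldLeft(start) { case ((used, out), r) =>
      val inUse = used.getOrElse(r.queue, 0)
      val capacity = queueCapacityMb.getOrElse(r.queue, 0)
      if (capacity >= inUse + r.memoryMb)
        (used.updated(r.queue, inUse + r.memoryMb), out :+ Container(r.app, r.queue, r.memoryMb))
      else
        (used, out) // request is not granted in this pass
    }
    granted
  }

  def main(args: Array[String]): Unit = {
    val granted = allocate(
      Map("prod" -> 4096, "adhoc" -> 1024),
      Seq(
        Request("etl", "prod", 2048),
        Request("report", "adhoc", 1024),
        Request("explore", "adhoc", 512) // does not fit: adhoc is already full
      )
    )
    granted.foreach(println)
  }
}
</code></pre>
<p>In the real system the per-application ApplicationMaster is the one sending these requests, and the policy that decides who gets what is the pluggable part (CapacityScheduler, FairScheduler).</p>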
<h2>HDFS Federation</h2>
<p>from their website: <a href="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html" target="_blank">http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html</a></p>
<div>
<h3><a name="Background"></a>Background</h3>
<p><img src="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/federation-background.gif" alt="HDFS Layers" />HDFS has two main layers:</p>
<ul>
<li><strong>Namespace</strong>
<ul>
<li>Consists of directories, files and blocks</li>
<li>It supports all the namespace related file system operations such as create, delete, modify and list files and directories.</li>
</ul>
</li>
<li><strong>Block Storage Service</strong> has two parts
<ul>
<li>Block Management (which is done in Namenode)
<ul>
<li>Provides datanode cluster membership by handling registrations and periodic heartbeats.</li>
<li>Processes block reports and maintains location of blocks.</li>
<li>Supports block related operations such as create, delete, modify and get block location.</li>
<li>Manages replica placement, replicates under-replicated blocks and deletes blocks that are over-replicated.</li>
</ul>
</li>
<li>Storage &#8211; is provided by datanodes by storing blocks on the local file system and allows read/write access.</li>
</ul>
<p>The prior HDFS architecture allows only a single namespace for the entire cluster. A single Namenode manages this namespace. HDFS Federation addresses this limitation of the prior architecture by adding support for multiple Namenodes/namespaces to the HDFS file system.</p></li>
</ul>
</div>
<div>
<h3><a name="Multiple_NamenodesNamespaces"></a>Multiple Namenodes/Namespaces</h3>
<p>In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces. The Namenodes are federated, that is, the Namenodes are independent and don’t require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes.</p>
<p><img src="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/federation.gif" alt="HDFS Federation Architecture" /><strong>Block Pool</strong></p>
<p>A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster. Each block pool is managed independently of other block pools. This allows a namespace to generate Block IDs for new blocks without the need for coordination with the other namespaces. The failure of a Namenode does not prevent the datanode from serving other Namenodes in the cluster.</p>
<p>A Namespace and its block pool together are called a Namespace Volume. It is a self-contained unit of management. When a Namenode/namespace is deleted, the corresponding block pool at the datanodes is deleted. Each namespace volume is upgraded as a unit during cluster upgrade.</p>
<p><strong>ClusterID</strong></p>
<p>A new identifier <strong>ClusterID</strong> is added to identify all the nodes in the cluster. When a Namenode is formatted, this identifier is provided or auto generated. This ID should be used for formatting the other Namenodes into the cluster.</p>
<div>
<h4>Key Benefits<a name="Key_Benefits"></a></h4>
<ul>
<li>Namespace Scalability &#8211; HDFS cluster storage scales horizontally but the namespace does not. Large deployments or deployments using a lot of small files benefit from scaling the namespace by adding more Namenodes to the cluster.</li>
<li>Performance &#8211; File system operation throughput is limited by a single Namenode in the prior architecture. Adding more Namenodes to the cluster scales the file system read/write operations throughput.</li>
<li>Isolation &#8211; A single Namenode offers no isolation in a multi-user environment. An experimental application can overload the Namenode and slow down production critical applications. With multiple Namenodes, different categories of applications and users can be isolated to different namespaces.</li>
</ul>
</div>
</div>
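<p>The federation ideas above (independent Namenodes, one block pool per namespace, datanodes shared by all of them) can be sketched as a small toy model in plain Scala. This is not HDFS code: the class names, the <tt>BP-</tt> pool naming and the namespace names are hypothetical, and heartbeats, block reports and the ClusterID are left out.</p>
<pre><code>object ToyFederation {
  // Each Namenode owns one namespace and one block pool, and generates
  // block IDs on its own, without coordinating with other Namenodes.
  final class Namenode(val namespace: String) {
    val blockPoolId: String = "BP-" + namespace // hypothetical naming
    private var nextBlockId = 0L
    def allocateBlock(): (String, Long) = {
      nextBlockId += 1
      (blockPoolId, nextBlockId)
    }
  }

  // A datanode stores blocks for every block pool in the cluster.
  final class Datanode(val host: String) {
    private var pools = Map.empty[String, Set[Long]]

    def register(nn: Namenode): Unit = {
      pools = pools.updated(nn.blockPoolId, pools.getOrElse(nn.blockPoolId, Set.empty[Long]))
    }

    def store(block: (String, Long)): Unit = {
      val (pool, id) = block
      pools = pools.updated(pool, pools.getOrElse(pool, Set.empty[Long]) + id)
    }

    def blocksIn(pool: String): Set[Long] = pools.getOrElse(pool, Set.empty[Long])

    // Deleting a namespace deletes its block pool on the datanode
    def dropPool(pool: String): Unit = {
      pools = pools - pool
    }

    def registeredPools: Set[String] = pools.keySet
  }

  def main(args: Array[String]): Unit = {
    val namenodes = Seq(new Namenode("users"), new Namenode("logs"))
    val datanodes = Seq(new Datanode("dn1"), new Datanode("dn2"))
    // Each datanode registers with all the Namenodes in the cluster
    datanodes.foreach(dn => namenodes.foreach(dn.register))
    // Both namespaces hand out block IDs independently
    namenodes.foreach(nn => datanodes.head.store(nn.allocateBlock()))
    println(datanodes.head.registeredPools)
    // Deleting the "logs" namespace removes only its block pool
    datanodes.foreach(_.dropPool("BP-logs"))
    println(datanodes.head.registeredPools)
  }
}
</code></pre>
<p>Note that both Namenodes can hand out the same numeric block ID; the blocks never collide because the datanode keys them by block pool.</p>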
<p>/*<br />
Joe Stein<br />
<a href="http://www.linkedin.com/in/charmalloc" target="_blank">http://www.linkedin.com/in/charmalloc</a><br />
*/</p>
]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2012/07/23/hortonworks-hdp1-apache-hadoop-2-0-nextgen-mapreduce-yarn-hdfs-federation-and-the-future-of-hadoop-with-arun-c-murthy/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif" medium="image">
			<media:title type="html">MapReduce NextGen Architecture</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/federation-background.gif" medium="image">
			<media:title type="html">HDFS Layers</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/common/docs/current/hadoop-yarn/hadoop-yarn-site/federation.gif" medium="image">
			<media:title type="html">HDFS Federation Architecture</media:title>
		</media:content>
	</item>
		<item>
		<title>Hadoop distribution bake-off and my experience with Cloudera and MapR</title>
		<link>http://allthingshadoop.com/2012/07/10/hadoop-distribution-bake-off-my-experience-with-cloudera-and-mapr/</link>
		<comments>http://allthingshadoop.com/2012/07/10/hadoop-distribution-bake-off-my-experience-with-cloudera-and-mapr/#comments</comments>
		<pubDate>Wed, 11 Jul 2012 02:26:37 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=472</guid>
		<description><![CDATA[A few months back we set out to build a new Hadoop cluster @ medialets. We have been live with Hadoop in production since April 2010 and we are still running CDH2. Our current hosting provider's setup is less than ideal for us: our 36 nodes are spread out across an entire [...]]]></description>
				<content:encoded><![CDATA[<p>A few months back we started to endeavor on a new Hadoop cluster @ <a href="http://www.medialets.com" target="_blank">medialets</a></p>
<p>We have been live with Hadoop in production since April 2010 and we are still running CDH2.  Our current hosting provider does not have a very ideal implementation for us where our 36 nodes are spread out across an entire data center and 5 networks each with 1 GB link.  While there are issues with this type of setup we have been able to organically grow our cluster (started at 4 nodes) which powers 100% of our batch analytics for what is now hundreds of millions of mobile devices.</p>
<p>One of our mapreduce jobs processes 30+ billion objects (about 3 TB of uncompressed data) and takes about 90 minutes to run.  This jobs runs all day long contiguously.  Each run ingests the data that was received while the previous job was running.  One of the primary goals of our new cluster was to reduce the time these type of jobs take without having to make any code changes or increase our investment in hardware.  We figured besides the infrastructure changes we needed/wanted to make that running an old version of Hadoop meant that we were not taking advantage of all the awesome work that folks have been putting in over the last 2 years to do things like increasing performance. </p>
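<p>For a sense of scale, the figures above work out to a rough sustained throughput. The sketch below is just that arithmetic in Scala; it assumes decimal terabytes, treats "30+ billion" as exactly 30 billion (so the results are lower bounds) and assumes the load is spread evenly across the 36 production nodes.</p>
<pre><code>object JobThroughput {
  // Figures quoted in the post
  val objects: Double = 30e9           // "30+ billion", so a lower bound
  val uncompressedBytes: Double = 3e12 // about 3 TB, assuming decimal terabytes
  val runtimeSeconds: Double = 90 * 60 // about 90 minutes
  val nodes: Int = 36

  val objectsPerSecond: Double = objects / runtimeSeconds
  val megabytesPerSecond: Double = uncompressedBytes / runtimeSeconds / 1e6
  // Assumes the work is spread evenly across the cluster
  val megabytesPerSecondPerNode: Double = megabytesPerSecond / nodes

  def main(args: Array[String]): Unit = {
    println(f"$objectsPerSecond%.0f objects/s")
    println(f"$megabytesPerSecond%.1f MB/s across the cluster")
    println(f"$megabytesPerSecondPerNode%.1f MB/s per node")
  }
}
</code></pre>
<p>That is roughly 5.6 million objects and about 556 MB of uncompressed data per second for the whole cluster, or around 15 MB per second per node.</p>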
<p>So we embarked on what seems to have been coined &#8220;The Hadoop Distribution Bake-off&#8221;.  We wanted not only to see how newer versions of the Cloudera distribution would run our jobs but also to evaluate other distributions that have emerged since we first started with Hadoop.  When we did this, Hortonworks&#8217; distribution had not been released yet; otherwise we would have added it to the possible choices.</p>
<p>First we found a new vendor to set up a test cluster for us: <a href="http://www.logicworks.com" target="_blank">http://www.logicworks.com</a>.  It was a four-node cluster, each node with a 2 Gb (dual bonded 1 Gb) NIC, 12 GB of RAM, 4 x 1 TB drives (3 of the drives for data and one for the OS) and 2 x Westmere 5645 2.4 GHz hex-core CPUs.  While this was not going to be the exact configuration we would end up with, it was what they had in inventory, and for the purposes of this test it was all about keeping the same hardware running the same job with the same data and only changing the distro and configuration.  Performance was not the only point we were interested in as part of our due diligence, but it was the primary goal of the bake-off and testing.  We also reviewed other aspects of the distributions and companies, which ultimately led to our final decision to go with CDH4 for our new cluster.</p>
<p>First, we wanted to create a baseline to see how our data and job did with the existing distribution (CDH2) we run in production, using our existing production configuration.  Next we wanted to give MapR a shot.  We engaged with their team and they gave their time and assistance to help configure and optimize for the job&#8217;s test run.  Once that was done we wanted to give CDH3 and CDH4 (which was still in beta at the time) a try, and the Cloudera folks also lent their time and helped configure and optimize the cluster.</p>
<p>CDH2 = 12 hours 12 min (our production configuration)<br />
MapR = 4 hours 31 min (configuration done by MapR team)<br />
CDH3 = 6 hours 8 min (our production configuration)<br />
CDH4 = 4 hours 20 min (configuration done by Cloudera team)</p>
<p>This told us that the decision between running CDH4 or MapR was not going to be made based on performance of the distribution with our data and mapreduce jobs.  </p>
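<p>To put numbers on that, here is the same data as a small Scala sketch that converts the run times to minutes and computes each distribution's speedup over the CDH2 baseline, plus the gap between the two fastest runs. The object and value names are only illustrative; the times are the ones listed above.</p>
<pre><code>object BakeOff {
  // Wall-clock times from the test runs, converted to minutes
  val runMinutes: Seq[(String, Int)] = Seq(
    "CDH2" -> (12 * 60 + 12),
    "MapR" -> (4 * 60 + 31),
    "CDH3" -> (6 * 60 + 8),
    "CDH4" -> (4 * 60 + 20)
  )

  val times: Map[String, Int] = runMinutes.toMap

  // Speedup of each run relative to the CDH2 baseline
  val speedups: Map[String, Double] =
    runMinutes.map { case (name, minutes) => name -> times("CDH2").toDouble / minutes }.toMap

  // Gap between the two fastest runs, in minutes and relative to the CDH4 run
  val gapMinutes: Int = times("MapR") - times("CDH4")
  val gapFraction: Double = gapMinutes.toDouble / times("CDH4")

  def main(args: Array[String]): Unit = {
    runMinutes.foreach { case (name, minutes) =>
      println(f"$name%-4s $minutes%4d min, ${speedups(name)}%.2fx the CDH2 baseline speed")
    }
    println(f"MapR vs CDH4: $gapMinutes min apart (${gapFraction * 100}%.1f%% of the CDH4 run)")
  }
}
</code></pre>
<p>MapR and CDH4 both come in at roughly 2.7x to 2.8x the baseline, and the two are only 11 minutes (about 4%) apart on a run of more than four hours.</p>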
<p>So, we had to look at the other things that were important to us.</p>
<p>MapR has a couple of really nice features that are unique to their platform: NFS and snapshots, both built on their file system.  Both are cool, so let&#8217;s go through them quickly.  MapR&#8217;s underlying proprietary file system is what allows for these features, which are unique in the Hadoop ecosystem.  The NFS feature basically lets you copy to an NFS share that is distributed across the entire cluster (with a VIP, so it is highly available).  This means you can use the cluster to save data from your applications and then map-reduce over it without any additional copies.  Data can be compressed under the hood, though this did not mean much to us since we already compress all of our data in sequence files using block compression.  Snapshots (and mirroring those snapshots to other clusters) are nifty.  Being able to take a point-in-time cut of things makes the cluster feel and operate like our SAN.  Still, the same end result is possible with a distcp, which sure takes longer but is technically feasible, so there were not a lot of other benefits for us or our business; nifty nonetheless.  The main issue we had with all of this was that all of the features that were attractive required us to license their product.  Their product is also not open source, so we would not be able to build the code or make changes ourselves and would always have to rely on them for support and maintenance.  We met a lot of great folks from MapR but only 2 of them were Apache committers (they may have more on staff, I only met two though), and this is important to us from a support &amp; maintenance perspective&#8230; for them it probably is not a huge deal since their platform is proprietary and not open source (I think I just repeated myself here but did so on purpose).</p>
<p>Cloudera&#8230; tried, true and trusted (I have been running CDH2 for 2 years in production without ever having to upgrade), and I know lots of folks that can say the same thing.  Everything is open source with a very healthy and active community.  A handful of times this has been very helpful in my development cycles: I could see what the container I was running in was doing to help me resolve the problems I was finding in my own code&#8230; or simply shoot a question over to the mailing list and get a response.  As far as the distribution goes, it costs nothing to get it running and to run it in production with all of the features we wanted.  If we ever decide to pay for support, there are a boatload (a large boat) of Apache committers, not just on the Hadoop project but on lots of projects within the Hadoop ecosystem, all of whom are available to help answer questions, make code changes, etc.  The philosophy of their distribution (besides just being open source) is to cherry-pick changes from Apache Hadoop as soon as they can (or should, or want to) be introduced, to make their distribution the best.</p>
<p>I can think of a lot of industries and companies where MapR would be a good choice over Cloudera.  </p>
<p>We decided what was best for us was to go with CDH4 for our new cluster.  And, if we ever decide to purchase support we would get it from Cloudera.</p>
<p>/*<br />
Joe Stein<br />
<a href="http://www.linkedin.com/in/charmalloc" target="_blank">http://www.linkedin.com/in/charmalloc</a><br />
*/</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/charmalloc.wordpress.com/472/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/charmalloc.wordpress.com/472/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=472&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2012/07/10/hadoop-distribution-bake-off-my-experience-with-cloudera-and-mapr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Unified analytics and large scale machine learning with Milind Bhandarkar</title>
		<link>http://allthingshadoop.com/2012/06/01/unified-analytics-and-large-scale-machine-learning-with-milind-bhandarkar/</link>
		<comments>http://allthingshadoop.com/2012/06/01/unified-analytics-and-large-scale-machine-learning-with-milind-bhandarkar/#comments</comments>
		<pubDate>Fri, 01 Jun 2012 22:42:10 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open Source Projects]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=465</guid>
		<description><![CDATA[Episode #7 of the Podcast is a talk with Milind Bhandarkar. We talked about unified analytics, machine learning, data science, some great history of Hadoop, the future of Hadoop and a lot more! subscribe to the podcast and listen to all of what Milind had to share. /* Joe Stein http://www.medialets.com */<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=465&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">Episode #7</a> of the <a href="http://allthingshadoop.com/podcast" target="_self">Podcast</a> is a talk with <a href="http://twitter.com/techmilind" target="_blank">Milind Bhandarkar</a>.</p>
<p>We talked about unified analytics, machine learning, data science, some great history of Hadoop, the future of Hadoop and a lot more!</p>
<p><a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">Subscribe to the podcast</a> and listen to everything Milind had to share.</p>
<p>/*<br />
Joe Stein<br />
<a href="http://www.medialets.com" target="_blank">http://www.medialets.com</a><br />
*/</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/charmalloc.wordpress.com/465/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/charmalloc.wordpress.com/465/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=465&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2012/06/01/unified-analytics-and-large-scale-machine-learning-with-milind-bhandarkar/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Hadoop Streaming Made Simple using Joins and Keys with Python</title>
		<link>http://allthingshadoop.com/2011/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/</link>
		<comments>http://allthingshadoop.com/2011/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/#comments</comments>
		<pubDate>Sat, 17 Dec 2011 00:20:46 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=355</guid>
		<description><![CDATA[There are a lot of different ways to write MapReduce jobs!!! Sample code for this post https://github.com/joestein/amaunet I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or are creating new ones) and enjoy the lifecycle when the initial elaboration of the data sets lead [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=355&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>There are a lot of different ways to write MapReduce jobs!!!</p>
<p>Sample code for this post <a href="https://github.com/joestein/amaunet" target="_blank">https://github.com/joestein/amaunet</a></p>
<p>I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or am creating new ones) and I enjoy the lifecycle where the initial elaboration of the data sets leads to the construction of the finalized scripts for an entire job (or series of jobs, as is often the case).</p>
<p>When doing streaming with Hadoop you do have a few library options.  If you are a Ruby programmer then <a target="_blank" href="http://mrflip.github.com/wukong/moreinfo.html">wukong</a> is awesome! For Python programmers there is <a target="_blank" href="https://github.com/klbostee/dumbo/wiki">dumbo</a> and the more recently released <a target="_blank" href="http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html">mrjob</a>.  </p>
<p>I like working under the hood myself and getting down and dirty with the data and here is how you can too.</p>
<p>Let&#8217;s start by defining two simple sample data sets.</p>
<p>Data set 1:  <strong>countries.dat</strong></p>
<p>name|key</p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
United States|US
Canada|CA
United Kingdom|UK
Italy|IT
</pre>
<p>Data set 2: <strong>customers.dat</strong></p>
<p>name|type|country</p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA
</pre>
<p><strong>The requirements:</strong> you need to find out, grouped by type of customer, how many customers of each type are in each country, with the country name from countries.dat in the final result (and not the two-letter country code).</p>
<p><strong>To do this you need to:</strong></p>
<pre>
1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results</pre>
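<p>As a sanity check before going to streaming, the four steps above can be sketched in plain Python on the sample data. This is only an in-memory reference, under the assumption that both data sets fit in memory; the function name <code>count_types_by_country</code> and the inline sample lists are made up for illustration and are not part of the sample code for this post.</p>

```python
from collections import Counter

# Sample rows copied from countries.dat and customers.dat above.
COUNTRIES = [
    "United States|US",
    "Canada|CA",
    "United Kingdom|UK",
    "Italy|IT",
]
CUSTOMERS = [
    "Alice Bob|not bad|US",
    "Sam Sneed|valued|CA",
    "Jon Sneed|valued|CA",
    "Arnold Wesise|not so good|UK",
    "Henry Bob|not bad|US",
    "Yo Yo Ma|not so good|CA",
    "Jon York|valued|CA",
    "Alex Ball|valued|UK",
    "Jim Davis|not so bad|JA",
]


def count_types_by_country(countries, customers):
    """Join customers to countries and count customer types per country."""
    # 1) build the join table: two-letter code -> country name
    names = {}
    for row in countries:
        name, code = row.split("|")
        names[code] = name

    # 2) and 3) key on country, then count each type within it
    counts = Counter()
    for row in customers:
        _person, ctype, code = row.split("|")
        # fallback label for codes missing from countries.dat,
        # spelled the way this post's results spell it
        country = names.get(code, "%s - Unkown Country" % code)
        counts[(country, ctype)] += 1
    return counts


if __name__ == "__main__":
    # 4) output the results, tab separated
    result = count_types_by_country(COUNTRIES, CUSTOMERS)
    for (country, ctype), n in sorted(result.items()):
        print("%s\t%s\t%s" % (country, ctype, n))
```

<p>Having a tiny reference like this around makes it easy to diff against whatever the streaming job produces on the same input.</p>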
<p>So first let&#8217;s code up a quick mapper called <strong>smplMapper.py</strong> (you can decide if smpl is short for simple or sample).</p>
<p>The basics of coding a mapper and reducer in Python are explained nicely here <a target="_blank" href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/</a> but I am going to dive a bit deeper to tackle our example with some more tactics.</p>
<pre class="brush: python; gutter: false; title: ; notranslate">
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
	try: #sometimes bad data can cause errors use this how you like to deal with lint and bad data
        
		personName = &quot;-1&quot; #default sorted as first
		personType = &quot;-1&quot; #default sorted as first
		countryName = &quot;-1&quot; #default sorted as first
		country2digit = &quot;-1&quot; #default sorted as first
		
		# remove leading and trailing whitespace
		line = line.strip()
	 	
		splits = line.split(&quot;|&quot;)
		
		if len(splits) == 2: #country data
			countryName = splits[0]
			country2digit = splits[1]
		else: #people data
			personName = splits[0]
			personType = splits[1]
			country2digit = splits[2]			
		
		print('%s^%s^%s^%s' % (country2digit,personType,personName,countryName))
	except Exception: #bad lines are skipped here; without this an error would make your job fail, which you may or may not want
		pass

</pre>
<p><strong>Don&#8217;t forget:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
chmod a+x smplMapper.py
</pre>
<p>Great! We just took care of #1, but now it is time to test and see what is going to the reducer.</p>
<p><strong>From the command line run:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
cat customers.dat countries.dat|./smplMapper.py|sort
</pre>
<p><strong>Which will result in:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1
</pre>
<p>Notice how this is sorted so the country is first and the people in that country after it (so we can grab the correct country name as we loop) and with the type of customer also sorted (but within country) so we can properly count the types within the country. =8^)</p>
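<p>To see why the &#8220;-1&#8221; defaults matter, here is a small sketch that sorts a few mapper output lines in plain code point order. It assumes byte-wise ordering (what <code>LC_ALL=C sort</code> gives you; other locales may order punctuation differently), and the helper name <code>first_line_per_country</code> is made up for illustration.</p>

```python
# A few lines as smplMapper.py would emit them, deliberately out of order.
MAPPED = [
    "CA^valued^Sam Sneed^-1",
    "US^not bad^Alice Bob^-1",
    "CA^-1^-1^Canada",
    "CA^not so good^Yo Yo Ma^-1",
    "US^-1^-1^United States",
]


def first_line_per_country(lines):
    """Return {country code: first line seen for that code} after sorting."""
    first = {}
    for line in sorted(lines):  # code point order, byte-wise for ASCII
        code = line.split("^")[0]
        first.setdefault(code, line)
    return first


if __name__ == "__main__":
    for code, line in sorted(first_line_per_country(MAPPED).items()):
        print("%s starts with: %s" % (code, line))
```

<p>Because &#8220;-&#8221; sorts before any letter, the country mapping line always lands ahead of the customer lines that share its code.</p>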
<p>Let us hold off on #2 for a moment (just hang in there it will all come together soon I promise) and get <strong>smplReducer.py</strong> working first.</p>
<pre class="brush: python; gutter: false; title: ; notranslate">
#!/usr/bin/env python
 
import sys
 
# state carried from one input line to the next
foundKey = &quot;&quot;
foundValue = &quot;&quot;
isFirst = 1
currentCount = 0
currentCountry2digit = &quot;-1&quot;
currentCountryName = &quot;-1&quot;
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
	# remove leading and trailing whitespace
	line = line.strip()
	
	try:
		# parse the input we got from smplMapper.py
		country2digit,personType,personName,countryName = line.split('^')
		
		#the first line should be a mapping line, otherwise we need to set the currentCountryName to not known
		if personName == &quot;-1&quot;: #this is a new country which may or may not have people in it
			currentCountryName = countryName
			currentCountry2digit = country2digit
			isCountryMappingLine = True
		else:
			isCountryMappingLine = False # this is a person we want to count
		
		if not isCountryMappingLine: #we only want to count people but use the country line to get the right name 

			#first check to see if the 2digit country info matches up, might be unknown country
			if currentCountry2digit != country2digit:
				currentCountry2digit = country2digit
				currentCountryName = '%s - Unkown Country' % currentCountry2digit
			
			currentKey = '%s\t%s' % (currentCountryName,personType) 
			
			if foundKey != currentKey: #new combo of keys to count
				if isFirst == 0:
					print('%s\t%s' % (foundKey,currentCount))
					currentCount = 0 #reset the count
				else:
					isFirst = 0
			
				foundKey = currentKey #make the found key what we see so when we loop again can see if we increment or print out
			
			currentCount += 1 # we increment anything not in the map list
	except:
		pass

if isFirst == 0: #print the last key, but only if we counted at least one person
	print('%s\t%s' % (foundKey,currentCount))

</pre>
<p><strong>Don&#8217;t forget:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
chmod a+x smplReducer.py
</pre>
<p><strong>And then run:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
cat customers.dat countries.dat|./smplMapper.py|sort|./smplReducer.py
</pre>
<p>And voila!</p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
Canada	not so good	1
Canada	valued	3
JA - Unkown Country	not so bad	1
United Kingdom	not so good	1
United Kingdom	valued	1
United States	not bad	2
</pre>
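<p>If keeping track of the flags feels fiddly, the same counting could also be sketched with <code>itertools.groupby</code>. This is an alternative illustration, not the reducer from the sample code; the function name <code>reduce_sorted</code> is made up, and it assumes the input is already sorted the way the mapper test above shows.</p>

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (country name, customer type, count) from sorted mapper lines."""
    rows = (line.strip().split("^") for line in lines)
    # group on the two-letter code, the first field of every line
    for code, group in groupby(rows, key=lambda r: r[0]):
        # fallback label, spelled the way the output above spells it
        name = "%s - Unkown Country" % code
        people = []
        for _code, ctype, person, country in group:
            if person == "-1":   # country mapping line, sorted first
                name = country
            else:                # customer line
                people.append(ctype)
        # types arrive sorted within a country, so groupby can count runs
        for ctype, same in groupby(people):
            yield name, ctype, len(list(same))


if __name__ == "__main__":
    sample = [
        "CA^-1^-1^Canada",
        "CA^not so good^Yo Yo Ma^-1",
        "CA^valued^Jon Sneed^-1",
        "CA^valued^Jon York^-1",
        "JA^not so bad^Jim Davis^-1",
    ]
    for name, ctype, n in reduce_sorted(sample):
        print("%s\t%s\t%s" % (name, ctype, n))
```

<p>The trade-off is that this version holds the customer types of one country in memory at a time, while the flag-based loop only ever holds a single counter.</p>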
<p>So now #3 and #4 are done but what about #2?  </p>
<p><strong>First put the files into Hadoop:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
hadoop fs -put ~/mayo/customers.dat .
hadoop fs -put ~/mayo/countries.dat .
</pre>
<p><strong>And now run it like this (assuming you are running as hadoop in the bin directory):</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
hadoop jar ../contrib/streaming/hadoop-0.20.1+169.89-streaming.jar -D mapred.reduce.tasks=4 -file ~/mayo/smplMapper.py -mapper smplMapper.py -file ~/mayo/smplReducer.py -reducer smplReducer.py -input customers.dat -input countries.dat -output mayo -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner -jobconf stream.map.output.field.separator=^ -jobconf stream.num.map.output.key.fields=4 -jobconf map.output.key.field.separator=^ -jobconf num.key.fields.for.partition=1
</pre>
<p><strong>Let us look at what we did:</strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
hadoop fs -cat mayo/part*
</pre>
<p><strong>Which results in: </strong></p>
<pre class="brush: plain; gutter: false; title: ; notranslate">
Canada	not so good	1
Canada	valued	3
United Kingdom	not so good	1
United Kingdom	valued	1
United States	not bad	2
JA - Unkown Country	not so bad	1
</pre>
<p>So #2 is the <strong>partitioner</strong>, KeyFieldBasedPartitioner, explained further in the <a target="_blank" href="http://hadoop.apache.org/common/docs/r0.20.1/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29">Hadoop streaming documentation</a>. It lets you partition on whatever leading columns of the <em>key</em> you choose (in our case the country), configurable through the command line options, so that all lines for the same country are sent to the same reducer together while the rest of the <em>key</em> is still used for sorting within it.</p>
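<p>A rough Python sketch of the idea: take the first few fields of the key, hash them, and take that modulo the number of reduce tasks. The function name <code>partition_for</code> is made up, and <code>zlib.crc32</code> is only a stand-in hash chosen for illustration; the real partitioner is a Java class with its own hash, so the actual reducer numbers will differ.</p>

```python
import zlib


def partition_for(key, num_reducers, separator="^", num_fields=1):
    """Pick a reducer index from the first num_fields fields of the key."""
    prefix = separator.join(key.split(separator)[:num_fields])
    # crc32 is only a stable stand-in for the hash the Java class computes
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    keys = [
        "CA^-1^-1^Canada",
        "CA^valued^Sam Sneed^-1",
        "UK^-1^-1^United Kingdom",
        "UK^valued^Alex Ball^-1",
    ]
    for key in keys:
        print("%s -> reducer %d" % (key, partition_for(key, 4)))
```

<p>Whatever the hash, the point is that every line starting with the same country code gets the same reducer index, so the country mapping line and its customers always meet in one place.</p>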
<p>And there you go &#8230; Simple Python Scripting Implementing Streaming in Hadoop.   </p>
<p>Grab the tar <a href="http://www.gencolee.com/smpl.py.stream.tgz">here</a> and give it a spin.</p>
<p>/*<br />
Joe Stein<br />
Twitter: <a target="_blank" href="http://www.twitter.com/allthingshadoop">@allthingshadoop</a><br />
Connect: <a target="_blank" href="http://www.linkedin.com/in/charmalloc">On Linked In</a><br />
*/</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/charmalloc.wordpress.com/355/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/charmalloc.wordpress.com/355/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=355&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2011/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Faster Datanodes with less wait io using df instead of du</title>
		<link>http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/</link>
		<comments>http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/#comments</comments>
		<pubDate>Sat, 21 May 2011 04:37:22 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=446</guid>
		<description><![CDATA[I have noticed often that the check Hadoop uses to calculate usage for the data nodes causes a fair amount of wait io on them driving up load. Every cycle we can get from every spindle we want! So I came up with a nice little hack to use df instead of du. Here is [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=446&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I have noticed often that the check Hadoop uses to calculate usage for the data nodes causes a fair amount of wait io on them driving up load.</p>
<p>Every cycle we can get from every spindle we want!</p>
<p>So I came up with a nice little hack to use df instead of du.</p>
<p>Here is basically what I did so you can do it too.</p>
<p><code><br />
mv /usr/bin/du /usr/bin/bak_du<br />
vi /usr/bin/du </code></p>
<p>and save this inside of it<br />
<code><br />
#!/bin/sh<br />
<br />
mydf=$(df -kP "$2" | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $3 }')<br />
printf '%s\t%s\n' "$mydf" "$2"<br />
</code></p>
<p>then give it execute permission<br />
<code><br />
chmod a+x /usr/bin/du<br />
</code></p>
<p>Restart your data node, check the log for errors and make sure it starts back up.</p>
<p>voila</p>
<p>Now when Hadoop calls &#8220;du -sk /yourhdfslocation&#8221; it will be expedient with its results.</p>
<p>What&#8217;s wrong with this?</p>
<p>1) I assume you have nothing else stored on your disks, so df is really close to du since almost all of your data is in HDFS.</p>
<p>2) If you have more than 1 volume holding your HDFS blocks this is not exactly accurate, since you are skewing the size of each volume by only calculating one of them and using that result for the others&#8230; This is simple to fix: just parse your df result differently and use the path passed in as the second parameter to know which volume to grep for in your df result&#8230; Your first volume is most likely going to be larger anyway, and you should be monitoring disk space another way, so it is not going to be very harmful if you just check and report the first volume&#8217;s size.</p>
<p>3) You might not have your HDFS blocks on your first volume&#8230; see #2, you can just grep for the volume you want to report.</p>
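<p>For the per-volume fix mentioned in #2 and #3, one option is to ask the filesystem that holds the given path directly. Here is a minimal sketch in Python, assuming a Unix-like system where <code>os.statvfs</code> is available; the function name <code>used_kb</code> is made up, and this illustrates the idea rather than replacing the script above.</p>

```python
import os
import sys


def used_kb(path):
    """Used space, in 1K blocks, of the filesystem that contains path."""
    st = os.statvfs(path)
    used_bytes = (st.f_blocks - st.f_bfree) * st.f_frsize
    return used_bytes // 1024


if __name__ == "__main__":
    # mimic the "du -sk /yourhdfslocation" output: size, a tab, then the path
    target = sys.argv[-1]
    print("%d\t%s" % (used_kb(target), target))
```

<p>Since the path is the last argument of the du call, each volume gets its own number instead of sharing the first volume&#8217;s size.</p>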
<p>/*<br />
Joe Stein<br />
<a href="http://www.linkedin.com/in/charmalloc" target="_blank">http://www.linkedin.com/in/charmalloc</a><br />
*/</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/charmalloc.wordpress.com/446/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/charmalloc.wordpress.com/446/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=446&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Cloudera, Yahoo and the Apache Hadoop Community Security Branch Release Update</title>
		<link>http://allthingshadoop.com/2011/05/05/cloudera-yahoo-and-the-apache-hadoop-community-security-branch-release-update/</link>
		<comments>http://allthingshadoop.com/2011/05/05/cloudera-yahoo-and-the-apache-hadoop-community-security-branch-release-update/#comments</comments>
		<pubDate>Fri, 06 May 2011 02:03:39 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Open Source Projects]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=423</guid>
		<description><![CDATA[In the wake of Yahoo! having announced that they would discontinue their Hadoop distribution and focus their efforts into Apache Hadoop http://yhoo.it/i9Ww8W the landscape has become tumultuous. Yahoo! engineers have spent their time and effort contributing back to the Apache Hadoop security branch (branch of 0.20) and have proposed release candidates. Currently being voted and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=423&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>In the wake of Yahoo! having announced that they would discontinue their Hadoop distribution and focus their efforts into Apache Hadoop <a target="_blank" href="http://yhoo.it/i9Ww8W">http://yhoo.it/i9Ww8W</a> the landscape has become tumultuous.</p>
<p>Yahoo! engineers have spent their time and effort contributing back to the Apache Hadoop security branch (branch of 0.20) and have proposed release candidates.  </p>
<p>Currently being voted on and discussed is &#8220;Release candidate 0.20.203.0-rc1&#8221;.  If you are following the VOTE and the DISCUSSION then maybe you are like me and it just cannot be done without a bowl of popcorn before opening the emails.  It is getting heated in a good and constructive kind of way. <a target="_blank" href="http://mail-archives.apache.org/mod_mbox/hadoop-general/201105.mbox/thread">http://mail-archives.apache.org/mod_mbox/hadoop-general/201105.mbox/thread</a> There are already more emails in 5 days of May than there were in all of April. woot!</p>
<p>My take?  Has it become Cloudera vs Yahoo!, and will Apache Hadoop releases become fragmented because of it? Well, it is kind of like that already.  0.21 is the latest, and can anyone who is not a committer quickly know or find out the difference between that and the other release branches? It is esoteric <img src='http://s0.wp.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' />  0.22, which is a release from trunk, is right around the corner too.</p>
<p>Let&#8217;s take HBase as an example (a Hadoop project).  Do you know which HDFS releases can support HBase in production without losing data? If you do, then maybe you don&#8217;t realize that many people still don&#8217;t even know about the branch. Now that CDH3 is out you can use that (thanks Cloudera!); otherwise it is highly recommended not to be in production with HBase unless you use the append branch <a target="_blank" href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/</a> of 0.20, which makes you miss out on other changes in trunk releases&#8230;</p>
<p>__ eyes crossing inwards and sideways with what branch does what and when the trunk release has everything __</p>
<p>Hadoop is becoming an a la carte menu of &#8220;which features and fixes can I live without for all of what I really need to deploy&#8221;&#8230; or requiring companies to hire a committer&#8230; or a bunch of folks that do nothing but Hadoop day in and day out (sounds like Oracle, ahhhhhh)&#8230; or going with the Cloudera Distribution (which is what I do and don&#8217;t look back).  The barrier to entry feels like it has increased over the last year. However, stepping back from that, the system overall has had a lot of improvements!  A lot of great work has been done by a lot of dedicated folks putting in their time and effort towards making Hadoop (in whatever form the elephant stampedes through its data) a reality.</p>
<p>Big shops that have teams of &#8220;Hadoop Engineers&#8221; (Yahoo, Facebook, eBay, LinkedIn, etc) with contributors and/or committers on that team should not feel much impact because ultimately they are able to roll their own releases for whatever they need/want themselves in production and just support it.  Not all are so endowed.</p>
<p>Now, all of that having been said, I write this because the discussion is REALLY good and has a lot of folks (including those from Yahoo! and Cloudera) bringing up pain points and proposing some great solutions that hopefully will contribute to the continued growth and success of the Apache Hadoop Community <a href="http://hadoop.apache.org/" target="_blank">http://hadoop.apache.org/</a>&#8230; Still, if you want to run it in your company (and don&#8217;t have a committer on staff) then go download CDH3 <a href="http://www.cloudera.com" target="_blank">http://www.cloudera.com</a>; it will get you going with the latest and greatest of all the releases, branches, etc, etc, etc.  Great documentation too!</p>
<p>/*<br />
Joe Stein<br />
<a target="_blank" href="http://www.linkedin.com/in/charmalloc">http://www.linkedin.com/in/charmalloc</a><br />
*/</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/charmalloc.wordpress.com/423/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/charmalloc.wordpress.com/423/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=423&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2011/05/05/cloudera-yahoo-and-the-apache-hadoop-community-security-branch-release-update/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>NoSQL HBase and Hadoop with Todd Lipcon from Cloudera</title>
		<link>http://allthingshadoop.com/2010/09/06/nosql-hbase-hadoop-todd-lipcon-cloudera/</link>
		<comments>http://allthingshadoop.com/2010/09/06/nosql-hbase-hadoop-todd-lipcon-cloudera/#comments</comments>
		<pubDate>Tue, 07 Sep 2010 02:46:47 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Open Source Projects]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=330</guid>
		<description><![CDATA[Episode #6 of the Podcast is a talk with Todd Lipcon from Cloudera discussing HBase. We talked about NoSQL and how it should stand for &#8220;Not Only SQL&#8221; and the tight integration between Hadoop and HBase and how systems like Cassandra (which is eventually consistent and not strongly consistent like HBase) is complementary as these [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=330&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">Episode #6</a> of the <a href="http://allthingshadoop/podcast" target="_self">Podcast</a> is a talk with <a href="http://twitter.com/tlipcon" target="_blank">Todd Lipcon </a>from <a href="http://cloudera.com" target="_blank">Cloudera</a> discussing HBase.</p>
<p>We talked about NoSQL and how it should stand for &#8220;Not Only SQL&#8221; and the tight integration between Hadoop and HBase and how systems like Cassandra (which is eventually consistent and not strongly consistent like HBase) is complementary as these systems have applicability within big data eco system depending on your use cases.</p>
<p>With the strong consistency of HBase you get features like incrementing counters and the tight integration with Hadoop means faster loads with HDFS thanks to a new feature in the 0.89 development preview release in the doc folders called &#8220;bulk loads&#8221;.</p>
<p>We covered a lot more unique features, talked about more of what is coming in upcoming releases as well as some tips with HBase so <a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">subscribe to the podcast</a> and listen to all of what Todd had to say.</p>
<p>/*<br />
Joe Stein<br />
<a href="http://www.medialets.com" target="_blank">http://www.medialets.com</a><br />
*/</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/charmalloc.wordpress.com/330/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/charmalloc.wordpress.com/330/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=330&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2010/09/06/nosql-hbase-hadoop-todd-lipcon-cloudera/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Pre-Release from Pentaho &#8211; HIVE JDBC Adapter</title>
		<link>http://allthingshadoop.com/2010/08/15/pre-release-from-pentaho-hive-jdbc-adapter/</link>
		<comments>http://allthingshadoop.com/2010/08/15/pre-release-from-pentaho-hive-jdbc-adapter/#comments</comments>
		<pubDate>Sun, 15 Aug 2010 21:34:50 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hive]]></category>
		<category><![CDATA[Open Source Projects]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=323</guid>
		<description><![CDATA[Pentaho&#8217;s Jordan Ganoff, Software Engineer, has open sourced some HIVE JDBC Adapters in what they are doing for their BI server http://forums.pentaho.com/showthread.php?77826-Hive-amp-Hadoop Not sure what state they are in but will try to check it on this week. To use from maven: &#60;dependency&#62; &#60;groupId&#62;org.apache.hadoop.hive&#60;/groupId&#62; &#60;artifactId&#62;hive-jdbc&#60;/artifactId&#62; &#60;version&#62;0.5.0-pentaho-SNAPSHOT&#60;/version&#62; &#60;/dependency&#62; You must also add the repository information to [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=allthingshadoop.com&#038;blog=13223440&#038;post=323&#038;subd=charmalloc&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Jordan Ganoff, a software engineer at Pentaho, has open sourced some Hive JDBC adapters as part of the work they are doing on their BI server:</p>
<p><a href="http://forums.pentaho.com/showthread.php?77826-Hive-amp-Hadoop" target="_blank">http://forums.pentaho.com/showthread.php?77826-Hive-amp-Hadoop</a></p>
<p>I&#8217;m not sure what state they are in, but I will try to check them out this week.</p>
<p><strong>To use from Maven:</strong><br />
&lt;dependency&gt;<br />
&lt;groupId&gt;org.apache.hadoop.hive&lt;/groupId&gt;<br />
&lt;artifactId&gt;hive-jdbc&lt;/artifactId&gt;<br />
&lt;version&gt;0.5.0-pentaho-SNAPSHOT&lt;/version&gt;<br />
&lt;/dependency&gt;</p>
<p>You must also add the repository information to either the pom.xml or<br />
your local settings:<br />
&lt;repository&gt;<br />
&lt;id&gt;pentaho&lt;/id&gt;<br />
&lt;name&gt;Pentaho External Repository&lt;/name&gt;<br />
&lt;url&gt;http://repo.pentaho.org/artifactory/repo&lt;/url&gt;<br />
&lt;/repository&gt;</p>
<p>/*</p>
<p>Joe Stein<br />
<a href="http://medialets.com" target="_blank">http://medialets.com</a></p>
<p>*/</p>
]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2010/08/15/pre-release-from-pentaho-hive-jdbc-adapter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>
	</item>
		<item>
		<title>Hadoop Development Tools By Karmasphere</title>
		<link>http://allthingshadoop.com/2010/06/29/hadoop-development-tools-by-karmasphere/</link>
		<comments>http://allthingshadoop.com/2010/06/29/hadoop-development-tools-by-karmasphere/#comments</comments>
		<pubDate>Tue, 29 Jun 2010 10:07:45 +0000</pubDate>
		<dc:creator>charmalloc</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://allthingshadoop.com/?p=297</guid>
		<description><![CDATA[In Episode #5 of the Hadoop Podcast http://allthingshadoop.com/podcast/ I speak with Shevek, the CTO of Karmasphere http://karmasphere.com/.  To subscribe to the Podcast click here. We talk a bit about their existing Community Edition (supports NetBeans &#38; Eclipse) For developing, debugging and deploying Hadoop Jobs Desktop MapReduce Prototyping GUI to manipulate clusters, file systems and jobs [...]]]></description>
				<content:encoded><![CDATA[<p>In Episode #5 of the Hadoop Podcast <a href="http://allthingshadoop.com/podcast/" target="_blank">http://allthingshadoop.com/podcast/</a> I speak with Shevek, the CTO of Karmasphere <a href="http://karmasphere.com/" target="_blank">http://karmasphere.com/</a>.  To subscribe to the Podcast <a href="http://feeds.feedburner.com/allthingshadoop/kjGc" target="_blank">click here</a>.</p>
<p>We talk a bit about their existing Community Edition (supports NetBeans &amp; Eclipse):</p>
<ul>
<li>For developing, debugging and deploying Hadoop Jobs</li>
<li>Desktop MapReduce Prototyping</li>
<li>GUI to manipulate clusters, file systems and jobs</li>
<li>Easy deployment to any Hadoop version, any distribution in any cloud</li>
<li>Works through firewalls</li>
</ul>
<p>As well as the new products they have launched:</p>
<h2>Karmasphere Client</h2>
<p>The <a href="http://karmasphere.com/Products-Information/karmasphere-client.html" target="_blank">Karmasphere Client</a> is a cross-platform library that ensures MapReduce jobs can run from any desktop environment against any Hadoop cluster in any enterprise data network. By isolating the Big Data professional from the version of Hadoop, Karmasphere Client simplifies switching between data centers and the cloud and makes Hadoop jobs independent of the version of the underlying cluster.</p>
<p>Unlike the standard Hadoop client, Karmasphere Client works from Microsoft Windows as well as Linux and MacOS, and works through SSH-based firewalls. Karmasphere Client provides a cloud-independent environment that makes it easy and predictable to maintain a business operation reliant on Hadoop.</p>
<p><a href="http://charmalloc.files.wordpress.com/2010/06/application-framework-3.gif"><img class="aligncenter size-full wp-image-299" title="Application-Framework" src="http://charmalloc.files.wordpress.com/2010/06/application-framework-3.gif?w=595" alt=""   /></a></p>
<ul>
<li>Ensures Hadoop distribution and version independence</li>
<li>Works from Windows (unlike Hadoop Client)</li>
<li>Supports any cloud environment: public or private.</li>
<li>Provides:
<ul>
<li>Job portability</li>
<li>Operating system portability</li>
<li>Firewall hopping</li>
<li>Fault tolerant API</li>
<li>Synchronous and Asynchronous API</li>
<li>Clean Object Oriented Design</li>
</ul>
</li>
<li>Makes it easy and predictable to maintain a business operation reliant on Hadoop</li>
</ul>
<h2>Karmasphere Studio Professional Edition</h2>
<p><a href="http://karmasphere.com/Products-Information/karmasphere-studio-professional-edition.html" target="_blank">Karmasphere Studio Professional Edition</a> includes all the functionality of the Community Edition, plus a range of deeper functionality required to simplify the developer&#8217;s task of making a MapReduce job robust, efficient and production-ready.</p>
<p>For a MapReduce job to be robust, its functioning on the cluster has to be well understood in terms of time, processing, and storage requirements, as well as in terms of its behavior when implemented within well-defined &#8220;bounds.&#8221; Karmasphere Studio Professional Edition incorporates the tools and a predefined set of rules that make it easy for the developer to understand how his or her job is performing on the cluster and where there is room for improvement.</p>
<ul>
<li>Enhanced cluster visualization and debugging
<ul>
<li>Execution diagnostics</li>
<li>Job performance timelines</li>
<li>Job charting</li>
<li>Job profiling</li>
</ul>
</li>
<li>Job Export
<ul>
<li>For easy production deployment</li>
</ul>
</li>
<li>Support</li>
</ul>
<h2>Karmasphere Studio Analyst Edition</h2>
<ul>
<li>SQL interface for ad hoc analysis</li>
<li>Karmasphere Application Framework + Hive + GUI =
<ul>
<li>No cluster changes</li>
<li>Works over proxies and firewalls</li>
<li>Integrated Hadoop monitoring</li>
<li>Interactive syntax checking</li>
<li>Detailed diagnostics</li>
<li>Enhanced schema browser</li>
<li>Full JDBC4 compliance</li>
<li>Multi-threaded &amp; concurrent</li>
</ul>
</li>
</ul>
<p>/*<br />
Joe Stein<br />
<a href="http://www.linkedin.com/in/charmalloc" target="_blank">http://www.linkedin.com/in/charmalloc<br />
</a>*/</p>
]]></content:encoded>
			<wfw:commentRss>http://allthingshadoop.com/2010/06/29/hadoop-development-tools-by-karmasphere/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/c5949edcf9e35a9aeb2584b6d4a58dcf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">charmalloc</media:title>
		</media:content>

		<media:content url="http://charmalloc.files.wordpress.com/2010/06/application-framework-3.gif" medium="image">
			<media:title type="html">Application-Framework</media:title>
		</media:content>
	</item>
	</channel>
</rss>
