
Reporting Metrics to Apache Kafka and Monitoring with Consumers

April 18, 2014

Organizations have been using Apache Kafka for some time now to consume not only all of the data within their infrastructure from an application perspective, but also the server statistics of the running applications and infrastructure. Apache Kafka is great for this.

Coda Hale’s Metrics library has become a leading way to instrument your JVM applications, capturing what the application is doing and reporting it to different servers like Graphite, Ganglia, Riemann and other systems.  The main problem with this is the tight coupling between your application (and infrastructure) metrics and how you are charting, trending and alerting on them.  Now let’s insert Apache Kafka to do the decoupling, which it does best.


The systems sending data and the systems reading the data become decoupled through the Kafka brokers.


 

Once this decoupling happens, we can plug in new systems (producers) to send metrics and then have multiple different systems consuming them.  This means that you can not only monitor and alert on the metrics, but also (and at the same time) do real-time analysis, or analysis over the larger dataset consumed off somewhere else.
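To make the decoupling concrete, here is a minimal, self-contained sketch (plain Scala, not the Kafka API) of the idea: producers append to a shared log on the broker side, and each consumer reads at its own offset, so a monitoring consumer and an analytics consumer can process the same stream without knowing about each other.

```scala
import scala.collection.mutable.ArrayBuffer

// Minimal in-memory analogue of a Kafka topic: producers append,
// and each consumer keeps its own offset into the same log.
class MiniTopic {
  private val log = ArrayBuffer[String]()
  def produce(msg: String): Unit = log += msg
  def consumeFrom(offset: Int): Seq[String] = log.drop(offset).toSeq
}

object DecouplingSketch {
  def main(args: Array[String]): Unit = {
    val metrics = new MiniTopic
    metrics.produce("""{"name":"jvm.heap.used","value":123456}""")
    metrics.produce("""{"name":"requests.rate","value":42.5}""")

    // Two independent consumers read the same stream.
    val monitoring = metrics.consumeFrom(0) // e.g. alerting
    val analytics  = metrics.consumeFrom(0) // e.g. batch analysis
    println(monitoring.size == analytics.size) // both see every message
  }
}
```

The point is that neither consumer is wired to the producer; adding a third reader changes nothing upstream.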

So, how does this all work? Well, we created a Metrics reporter https://github.com/stealthly/metrics-kafka that you can use within your own applications, so any application supporting Coda Hale’s Metrics can now report to a Kafka topic.  We will be creating more producers (such as scraping CPU, memory, disk, etc.) and adding them to this project.  To use it, set up a Metrics reporter like you normally would, but do so like this:

import ly.stealth.kafka.metrics.KafkaReporter
val producer = KafkaReporter.builder(registry, kafkaConnection, topic).build()
producer.report()
It was developed in Java, so you don’t have to include the Scala dependencies if you are not using Scala. The source code for the reporter is available here: https://github.com/stealthly/metrics-kafka/tree/master/metrics/src/main/java/ly/stealth/kafka/metrics.
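For illustration only, here is a sketch of rendering a snapshot of metric values as a JSON string, the kind of payload a reporter could publish to a metrics topic.  The metric names and the schema here are made up for this example; they are not the actual metrics-kafka wire format.

```scala
// Illustrative only: turn a snapshot of gauge values into a JSON string.
// The names and schema are assumptions, not the metrics-kafka format.
object MetricsPayloadSketch {
  def toJson(snapshot: Map[String, Double]): String =
    snapshot
      .map { case (name, value) => s""""$name":$value""" }
      .mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    val snapshot = Map("jvm.heap.used" -> 105.5, "requests.m1_rate" -> 42.5)
    println(toJson(snapshot)) // a small JSON object, one key per metric
  }
}
```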
We also set up an example of how to read the produced metrics with Riemann. Riemann monitors distributed systems and provides low-latency, transient shared state for systems with many moving parts.  There is a Riemann metrics consumer that reads from Kafka and posts to Riemann: https://github.com/stealthly/metrics-kafka/blob/master/riemann/src/main/scala/ly/stealth/kafka/riemann/RiemannMetricsConsumer.scala.
1) Install Vagrant http://www.vagrantup.com/
2) Install VirtualBox https://www.virtualbox.org/
3) Clone the repository and launch the virtual machines:

git clone https://github.com/stealthly/metrics-kafka.git
cd metrics-kafka
vagrant up

Once the virtual machines are launched, go to http://192.168.86.55:4567/, where we set up a sample dashboard.  Then, from your command prompt, run the test cases:

./gradlew test
And the dashboard will populate automatically.
[Screenshot: the sample Riemann dashboard]
You can modify what shows up in Riemann, and learn how to use this, from the test cases: https://github.com/stealthly/metrics-kafka/blob/master/riemann/src/test/scala/ly/stealth/kafka/metrics/RiemannMetricsConsumerSpec.scala.
 
/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 Twitter: @allthingshadoop
********************************************/

 

Categories: Open Source Projects

Higher Availability, Increased Scale and Enhanced Security on Apache HBase

April 3, 2014

Episode #21 of the Podcast was a talk with Michael Stack, Lars Hofhansl and Andrew Purtell.

Having these guests from the Apache HBase PMC allowed us to talk about HBase 0.96, 0.98, some use cases, HBaseCon and 1.0.

The highlights from 0.96 were around stability and longer-term scale (moving all internal data exchange and persistence to protobufs).

0.98 introduced some exciting new security features and a new HFile format, with both encryption at rest and cell-level security labels.

HBaseCon has all-new speakers and new use cases, with new and familiar faces in the audience. A must-attend if you can make it.

1.0 is focusing on SLAs, more in-memory database features and general cleanup.

Listen to the podcast to hear all of what they talked about together.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 Twitter: @allthingshadoop
********************************************/

 

Categories: HBase, Podcast

Beyond MapReduce and Apache Hadoop 2.X with Bikas Saha and Arun Murthy

April 2, 2014

Episode 20 of the podcast was with Bikas Saha and Arun Murthy.

When I spoke with Arun a year or so ago, YARN was NextGen Hadoop, and there have been a lot of updates, work done and production experience since.

Besides Yahoo!, other multi-thousand-node clusters have been and are running in production with YARN. These clusters have shown 2x capacity throughput, which resulted in reduced hardware cost (and in some cases being able to shut down co-los) while still gaining overall performance improvements over previous Hadoop 1.X clusters.

I got to hear about some of what is in 2.4 and coming in 2.5 of Hadoop:

  • Application timeline server: a repository and API for application-specific metrics (Tez, Spark, whatever).
    • A web service API to put and get, with some aggregation.
    • A pluggable NoSQL store (HBase, Accumulo) to scale it.
  • Preemption in the capacity scheduler.
  • Multiple resource support (CPU, RAM and disk).
  • Labels: nodes can be labeled however you like (so some Windows and some Linux), and you can request resources carrying only those labels, with ACLs.
  • Hypervisor support as a key part of the topology.
  • Hoya generalized for YARN (game changer), now proposed to the Apache Incubator as Slider.

We talked about Tez, which lets you express complex DAGs of queries for what you want to do on Hadoop, without the workarounds needed to make them run in MapReduce.  MapReduce was not designed to be reworked outside of the parts of the job it gave you (Map, Split, Shuffle, Combine, Reduce, etc.); Tez is more expressive, exposing a DAG API.
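As a conceptual sketch only (plain Scala, not the Tez API), the difference is between a fixed map → shuffle → reduce chain and an arbitrary DAG of processing vertices that the engine can schedule in dependency order.  The vertex names below are invented for illustration:

```scala
// Conceptual sketch: a query plan as a DAG of named vertices,
// instead of being forced into a chain of MapReduce jobs.
object DagSketch {
  // edges: vertex -> downstream vertices (names are made up)
  val edges: Map[String, List[String]] = Map(
    "scan"      -> List("filter", "aggregate"),
    "filter"    -> List("join"),
    "aggregate" -> List("join"),
    "join"      -> Nil
  )

  // Reverse depth-first post-order gives a valid execution order.
  def topo(edges: Map[String, List[String]]): List[String] = {
    var visited = Set.empty[String]
    var order = List.empty[String]
    def visit(v: String): Unit =
      if (!visited(v)) {
        visited += v
        edges.getOrElse(v, Nil).foreach(visit)
        order = v :: order
      }
    edges.keys.foreach(visit)
    order
  }

  def main(args: Array[String]): Unit =
    println(topo(edges)) // prints one valid dependency-respecting order
}
```

An engine working from a DAG like this can run "filter" and "aggregate" in parallel and skip the extra materialization a chain of MR jobs would force between them.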

[Diagram: a Pig/Hive query running on MapReduce]

With Tez, this becomes:

[Diagram: the same Pig/Hive query running on Tez]

 

There were also some updates on Hive v13 coming out, with sub-queries, low-latency queries (through Tez), high-precision decimals and more!

Subscribe to the podcast and listen to all of what Bikas and Arun had to say.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 Twitter: @allthingshadoop
********************************************/