Archive

Archive for the ‘Uncategorized’ Category

Distributed Trace for Distributed Systems

March 23, 2016 Leave a comment

zipkin.ui

Originally posted on https://www.linkedin.com/pulse/distributed-trace-systems-joe-stein?trk=pulse_spock-articles

For a while now folks have known about and have been using distributed trace within their infrastructure systems. If you aren’t familiar with the concept take a look at Google’s paper about it. Basically, it gives you macro level insight into how a distributed system behaves and allows you to optimize your “Data Center Apps” like Chrome Dev Tools lets you optimize your web apps.

Ok, so what do you need?

1) A Mesos Cluster. Sorry if you don’t have one of these yet, you should. Apache Mesos and Mesosphere’s DCOS have revolutionized how businesses utilize their compute resources. Instead of managing each resource individually, the technology unifies the compute in your data center into one centrally managed resource. Don’t worry if you are new to Mesos, hang tight and we will come back to it after discussing the other required elements.

2) Stack Deploy. This software  is one of the open source components of Elodina’s Sawfly Platform as a service product and is available athttps://github.com/elodina/stack-deploy. Stack deploy was developed to run and manage schedulers (Kafka, Cassandra, HDFS, etc) in orchestrated ways with other applications. These can be state-full or state-less (yes with Docker too but not required) and are intended to be deployed in a multi tenant and micro segmented software defined data center model.

3) Run This Stack!

https://github.com/elodina/stack-deploy/blob/master/stacks/zipkin-full.stack

The stack will deploy:

  • 1x Exhibitor-Mesos
  • 1x Exhibitor
  • 1x DSE-Mesos
  • 1x Cassandra node
  • 1x Kafka-Mesos
  • 1x Kafka 0.8 broker
  • 1x Zipkin-Mesos
  • 1x Zipkin Collector
  • 1x Zipkin Query
  • 1x Zipkin Web

Should you want to customize the number of Kafka brokers, Cassandra nodes, or Zipkin topic name, do so by modifying the corresponding fields (e.g. node count, cpu, mem, etc) in the stack file.

In order to deploy this stack the only thing that needs to be specified is the zookeeper url:

export SD_API=stackdeploy_url>
# stackdeploy_user would most be probably “admin”
export SD_USER=stackdeploy_user>
export SD_KEY=stackdeploy_key>

./stack-deploy add –file zipkin-full.stack
# for “Mesos Consul”-enabled clusters zk_url would probably be zookeeper.service:2181
./stack-deploy run zipkin-full –var “zk_url=zk_url>”

Note: if you want to restart your stack for some reason, make sure to clean up theExhibitor Mesos framework znode storage in Zookeeper if the Exhibitor phase was already completed. Failure to do this step may cause subsequent deployments of the stack to fail.

Launching from existing infrastructure

If you want to include Zipkin in already running infrastructure, use the following file: zipkin-standalone.stack

In order to zipkin stand alone, you will need a Cassandra cluster and 0.8 Kafka brokers accesible from your clusterfor Zipkin to use.

# Presuming you have SD_API, SD_USER and SD_KEY already set

./stack-deploy add –file stacks/zipkin-standalone.stack
./stack-deploy run zipkin-standalone –var “kafka_zk_url=kafka_zk_url>” –var “cassandra_url=cassandra_contact_points>”

Launching from existing infrastructure

If you want to include Zipkin in already running infrastructure, use the following file: zipkin-standalone.stack

In order to run, you will need a Cassandra node and 0.8 Kafka broker on your cluster which you get in the full stack or use the setup you have is ok too.

Is it working?

The Easiest way to verify if the Zipkin stack has launched successfully is to use the trace generator provided by Zipkin Mesos framework. Please refer to the corresponding documentation section on details:https://github.com/elodina/zipkin-mesos-framework#verifying-all-components-running

After the traces aresent, you can check them out by opening Zipkin web UI in your browser. If you plan on using  stack file on Consul-enabled clusters, the link will look like this:

http://zipkin-web-0.service.{cluster_name}:31983

And you are good to go!

If you need a client example how to generate these trace events, take a look at our Go Zipkin sample for that utility here https://github.com/elodina/go-zipkin.

To learn more about Elodina please contact us or come meet us in person at theApache Mesos NYC or Apache Kafka NYC Meetups about getting started.

~ Joestein

 

Categories: Uncategorized

Apache Mesos at Bloomberg

November 6, 2015 Leave a comment

http://www.elodina.net/2015/11/05/mesosbloomberg/

In the video recording of the talk Mr. Gupta discusses BVault, a massive, scalable, archiving and e-discovery solution for communications that has been adopted by more than 800 enterprises, processing more than 220 million daily messages with an archive containing more than 90 billion communication objects. He describes how the service is optimized for fast deployment of data-centric and processing-intensive applications using elastic cloud computing strategies. Mr. Gupta further explains how his team uses Apache Mesos to abstract heterogeneous data center assets as a homogeneous set of resources, prefabricated hardware in secure and geographically distributed data centers to provide on-demand capacity management, and Continuous development and integration using containers as an emerging standard in cloud infrastructure.

Mr. Gupta encourages anyone who is interested in working on Mesos @ Bloomberg to view Bloomberg’s relevant job listings and contact him on Twitter @skandsg.

Tweet @elodinadotnet If you are interested in learning more about the Apache Mesos NYC Meetup. We are always looking for speakers and sponsors for upcoming events.

~ Joe Stein

Categories: Uncategorized

How to choose the number of topics/partitions in a Kafka cluster?

March 12, 2015 Leave a comment

Confluent

This is a common question asked by many Kafka users. The goal of this post is to explain a few important determining factors and provide a few simple formulas.

More partitions lead to higher throughput

The first thing to understand is that a topic partition is the unit of parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So expensive operations such as compression can utilize more hardware resources. On the consumer side, Kafka always gives a single partition’s data to one consumer thread. Thus, the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed. Therefore, in general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve.

A rough formula for picking the number of partitions is based on throughput. You measure the throughout…

View original post 1,439 more words

Categories: Uncategorized

Using Scala To Work With Hadoop

Cloudera has a great toolkit to work with Hadoop.  Specifically it is focused on building distributed systems and services on top of the Hadoop Ecosystem.

http://cloudera.github.io/cdk/docs/0.2.0/cdk-data/guide.html

And the examples are in Scala!!!!

Here is how you you work with generic stuff on the file system including Avro files reading and writing.

https://github.com/cloudera/cdk/blob/master/cdk-examples/src/main/scala/creategeneric.scala

/**
* Copyright 2013 Cloudera Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import com.cloudera.data.{DatasetDescriptor, DatasetWriter}
import com.cloudera.data.filesystem.FileSystemDatasetRepository
import java.io.FileInputStream
import org.apache.avro.Schema
import org.apache.avro.Schema.Parser
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.compat.Platform
import scala.util.Random

// Construct a local filesystem dataset repository rooted at /tmp/data
val repo = new FileSystemDatasetRepository(
FileSystem.get(new Configuration()),
new Path("/tmp/data")
)

// Read an Avro schema from the user.avsc file on the classpath
val schema = new Parser().parse(new FileInputStream("src/main/resources/user.avsc"))

// Create a dataset of users with the Avro schema in the repository
val descriptor = new DatasetDescriptor.Builder().schema(schema).get()
val users = repo.create("users", descriptor)

// Get a writer for the dataset and write some users to it
val writer = users.getWriter().asInstanceOf[DatasetWriter[GenericRecord]]
writer.open()
val colors = Array("green", "blue", "pink", "brown", "yellow")
val rand = new Random()
for (i val builder = new GenericRecordBuilder(schema)
val record = builder.set("username", "user-" + i)
.set("creationDate", Platform.currentTime)
.set("favoriteColor", colors(rand.nextInt(colors.length))).build()
writer.write(record)
}
writer.close()

Big ups to the Cloudera team!

/*
Joe Stein
https://twitter.com/allthingshadoop
*/

Categories: Uncategorized