Real-Time Data Pipelines and Analytics with Apache Kafka and Apache Samza
Jay talked about the open source work he has done while @ LinkedIn
- Project Voldermort – http://www.project-voldemort.com/voldemort/ – a distributed database
- Apache Kafka – http://kafka.apache.org – A high-throughput distributed data pipeline and messaging system
- Apache Samza – http://samza.incubator.apache.org/ – Apache Samza is a distributed stream processing framework
- Azkaban – http://data.linkedin.com/opensource/azkaban – a batch job scheduler for Apache Hadoop
Most of the conversation was about the Apache Kafka pipeline and the use of Apache Samza for processing it.
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
What does all that mean?
First let’s review some basic messaging terminology:
- Kafka maintains feeds of messages in categories called topics.
- We’ll call processes that publish messages to a Kafka topic producers.
- We’ll call processes that subscribe to topics and process the feed of published messages consumers..
- Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
So, at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers like this:
Communication between the clients and the servers is done with a simple, high-performance, language agnosticTCP protocol. We provide a java client for Kafka, but clients are available in many languages.
Topics and Logs
Let’s first dive into the high-level abstraction Kafka provides—the topic.
A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time. For example if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size so retaining lots of data is not a problem.
In fact the only metadata retained on a per-consumer basis is the position of the consumer in in the log, called the “offset”. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads messages, but in fact the position is controlled by the consumer and it can consume messages in any order it likes. For example a consumer can reset to an older offset to reprocess.
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
- Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple call-back based “process message” API that should be familiar to anyone that’s used Map/Reduce.
- Managed state: Samza manages snapshotting and restoration of a stream processor’s state. Samza will restore a stream processor’s state to a snapshot consistent with the processor’s last read messages when the processor is restarted.
- Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
- Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
- Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
- Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
- Processor isolation: Samza works with Apache YARN, which supports processor security through Hadoop’s security model, and resource isolation through Linux CGroups.