Multi Datacenter Replication with Apache Kafka
When folks need multi datacenter replication with Apache Kafka, they most often rely on the project’s MirrorMaker tool. MirrorMaker works well, but it was designed for a specific set of use cases, and covering needs outside that set takes extra work. We kept hearing from the community about additional use cases, so we built a new MirrorMaker tool on top of our Go Kafka Client to support those needs and more!
The Go Kafka Mirror Maker supports:
- No JVM required when consuming from the source cluster and producing to the destination cluster.
- Guaranteed at-least-once mirroring of data from source to destination.
- Preservation of ordering from source partition to destination partition.
- Ability to prefix destination topics to avoid collisions of topic names between clusters.
- Everything else the existing MirrorMaker tool supports.
Running the mirror maker looks like this:
go run mirror_maker.go --consumer.config sourceCluster1Consumer.config --consumer.config sourceCluster2Consumer.config --num.streams 2 --producer.config targetClusterProducer.config --whitelist=".*"
--whitelist, --blacklist – whitelist or blacklist of topics to mirror. Exactly one of the two is allowed; passing both whitelist and blacklist will cause a panic. One of them is required.
--consumer.config – consumer property files to consume from a source cluster. You can pass this flag multiple times, like this: --consumer.config sourceCluster1Consumer.config --consumer.config sourceCluster2Consumer.config. At least one consumer config is required.
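The consumer property files are plain key=value files. A minimal sketch of what one might contain (the key names below are illustrative assumptions, not the tool’s exact configuration schema):

```properties
# Illustrative example only -- actual keys depend on the Go Kafka Client's consumer config
zookeeper.connect=source-zk-1:2181
group.id=mirror-maker-group
auto.offset.reset=smallest
```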
--producer.config – property file to configure embedded producers. This parameter is required.
--num.producers – number of producer instances. This can be used to increase throughput: each producer’s requests are effectively handled by a single thread on the receiving Kafka broker, so even with multiple consumption streams (see --num.streams), throughput can be bottlenecked at the broker’s handling of the mirror maker’s producer requests. Defaults to 1.
--num.streams – number of mirror consumer goroutines to create. If the number of consumption streams is higher than the number of available partitions, some of the mirroring routines will be idle by virtue of the consumer rebalancing algorithm (if they do not end up owning any partitions for consumption). Defaults to 1.
--preserve.partitions – flag to preserve the partition number. E.g. if a message was read from partition 5, it will be written to partition 5. Note that this can affect performance. Defaults to false.
--preserve.order – flag to preserve message order. E.g. the message sequence 1, 2, 3, 4, 5 will remain 1, 2, 3, 4, 5 in the destination topic. Note that this can affect performance. Defaults to false.
--prefix – destination topic prefix. E.g. if a message was read from topic “test” and the prefix is “dc1_”, it will be written to topic “dc1_test”. Defaults to the empty string.
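Taken together, --prefix and --preserve.partitions determine where a mirrored message lands in the destination cluster. A minimal Go sketch of that routing decision (destinationFor is a hypothetical helper for illustration, not the tool’s actual API):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// destinationFor sketches how --prefix and --preserve.partitions could
// combine when routing a mirrored message. Hypothetical helper, not the
// mirror maker's real code.
func destinationFor(srcTopic string, srcPartition int32, key []byte,
	prefix string, preservePartitions bool, destPartitions int32) (string, int32) {
	topic := prefix + srcTopic
	if preservePartitions {
		// Partition 5 in the source maps to partition 5 in the destination.
		return topic, srcPartition
	}
	// Otherwise pick a destination partition by hashing the key,
	// the way a default Kafka partitioner would.
	h := fnv.New32a()
	h.Write(key)
	return topic, int32(h.Sum32() % uint32(destPartitions))
}

func main() {
	topic, part := destinationFor("test", 5, []byte("k"), "dc1_", true, 8)
	fmt.Println(topic, part) // dc1_test 5
}
```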
--queue.size – number of messages that are buffered between the consumer and producer. Defaults to 10000.
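The queue between the consumption streams and the producer side maps naturally onto a buffered Go channel. A minimal sketch of that pipeline under stated assumptions, with plain string slices standing in for real Kafka consumption streams:

```go
package main

import (
	"fmt"
	"sync"
)

// mirror wires one goroutine per consumption stream (--num.streams) to a
// single producer loop through a buffered channel of queueSize
// (--queue.size). Simplified sketch: no Kafka clients involved.
func mirror(streams [][]string, queueSize int) []string {
	queue := make(chan string, queueSize) // the --queue.size buffer

	var wg sync.WaitGroup
	for _, stream := range streams {
		wg.Add(1)
		go func(msgs []string) {
			defer wg.Done()
			for _, m := range msgs {
				queue <- m // blocks when the buffer is full, applying backpressure
			}
		}(stream)
	}

	// Close the queue once every consumption stream has drained.
	go func() {
		wg.Wait()
		close(queue)
	}()

	// Producer side: drain the queue and "produce" to the destination.
	var produced []string
	for m := range queue {
		produced = append(produced, m)
	}
	return produced
}

func main() {
	out := mirror([][]string{{"a", "b"}, {"c"}}, 10000)
	fmt.Println(len(out), "messages mirrored") // 3 messages mirrored
}
```

A larger queue smooths out bursts from the consumers, at the cost of more messages held in memory between consume and produce.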
Big Data Open Source Security LLC provides professional services and product solutions for the collection, storage, transfer, real-time analytics, batch processing and reporting of complex data streams, data sets and distributed systems. BDOSS is all about the “glue”: helping companies not only figure out which Big Data infrastructure components to use, but also how to change their existing systems (or build new ones) to work with them. The focus of our services and solutions is end to end, including architecture, development, implementation, documentation, training and support for complex data streams, data sets and distributed systems using Open Source Software.