Kafka Mirror Maker Best Practices
Article
Kafka's mirroring feature makes it possible to maintain a replica of an existing Kafka cluster. This tool uses Kafka consumer to consume messages from the source cluster, and re-publishes those messages to the target cluster using an embedded Kafka producer.
Running Mirror Maker
To set up a mirror, run kafka.tools.MirrorMaker
At a minimum, MirrorMaker requires one or more consumer configuration files, a producer configuration file, and either a whitelist or a blacklist of topics. In the consumer and producer configuration files, point the consumer to the the source cluster, and point the producer to the destination (mirror) cluster, respectively.
- bin/kafka-run-class.sh kafka.tools.MirrorMaker--consumer.config sourceCluster1Consumer.config --consumer.config sourceCluster2Consumer.config --num.streams 2--producer.config targetClusterProducer.config --whitelist=".*"
Parameter | Description | Examples |
|
Specifies a file that contains configuration settings for the source cluster. For more information about this file, see the "Consumer Configuration File" subsection. |
|
|
Specifies the file that contains configuration settings for the target cluster. For more information about this file, see the "Producer Configuration File" subsection. |
|
|
(Optional) For a partial mirror, you can specify exactly one comma-separated list of topics to include (--whitelist) or exclude (--blacklist). In general, these options accept Java regex patterns. For caveats, see the note after this table. |
|
|
Specifies the number of consumer stream threads to create. |
|
|
Specifies the number of producer instances. Setting this to a value greater than one establishes a producer pool that can increase throughput. |
|
|
Queue size: number of messages that are buffered, in terms of number of messages between the consumer and producer. Default = 10000. |
|
|
List MirrorMaker command-line options. |
- A comma (',') is interpreted as the regex-choice symbol ('|') for convenience.
- If you specify -
-white-list=".*"
, MirrorMaker tries to fetch data from the system-level topic__consumer-offsets
and produce that data to the target cluster. Make sure you addedexclude.internal.topics=true
in consumer propertiesWorkaround: Specify topic names, or to replicate all topics, specify
--blacklist="__consumer-offsets"
.
Example Consumer config file
Consumer config bootstrap.servers should point to source cluster
Here is a sample consumer configuration file:
- bootstrap.servers=kafka-source-server1:6667,kafka-source-server2:6667,kafka-source-server3-6667
- groupid=dp-MirrorMaker-group
- exclude.internal.topics=true
- mirror.topics.whitelist=app_log
- client.id=mirror_maker_consumer
Example Producer Configuration File
Producer config bootstrap.servers should point to target cluster
Here is a sample producer configuration file:
- bootstrap.servers=kafka-target-server1:6667,kafka-target-server2:6667,kafka-target-server3-6667
- acks=1
- batch.size=100
- client.id=mirror_maker_producer
Best Practices
Create topics in target cluster
If you have consumers that are going to consume data from target cluster and your parallelism requirement for a consumer is same as your source cluster, Its important that you create a same topic in target cluster with same no.of partitions.
Example:
If you have a topic name called "click-logs" with 6 partitions in source cluster , make sure you have same no.of partitions in the target cluster. If you are using a target cluster as more of a backup, not active this might not need to be same.
If users didn't create a topic in target cluster, producer in mirrormaker will attempt to create a topic and target cluster broker will create a topic with configured num.partitions and num.replicas, this may not be the partitions and replication that the user wants.
Where to run MirrorMaker
We recommend to run MirrorMaker on target cluster.
No data loss
Make sure you've following configs in consumer config and producer config for No data loss.
For Consumer, set auto.commit.enabled=false in consumer.properties
For Producer
- max.in.flight.requests.per.connection=1
- retries=Int.MaxValue
- acks=-1
- block.on.buffer.full=true
For MirrorMaker, set --abortOnSendFail
The following actions will be taken by MirrorMaker
- Mirror maker will only send one request to a broker at any given point.
- If any exception is caught in mirror maker thread, mirror maker will try to commit the acked offsets then exit immediately.
- For RetriableException in producer, producer will retry indefinitely. If retry did not work, eventually the entire mirror maker will block on producer buffer full.
- For None-retriable exception, if --abort.on.send.fail is specified, stops the mirror maker. Otherwise producer callback will record the message that was not successfully sent but let the mirror maker move on. In this case, that message will be lost in target cluster.
As the last point stated if there is any error occurred your mirror maker process will be killed. So users are recommend to use a watchdog process like supervisord to restart the killed mirrormaker process.
Mirror Maker Sizing
num.streams
num.streams option in mirror-maker allows you to create specified no.of consumers.
Mirror-Maker deploys the specified no.of threads in num.streams
- Each thread instantiates and uses one consumer. That is a 1:1 mapping between mirror-maker threads and consumers.
- Each thread shares the same producer. That is a N:1 mapping between threads and producers.
Keep in mind, a topic-partition is the unit of parallelism in Kafka. If you have a topic called "click-logs" with 6 partitions then max no.of consumers you can run is 6. If you run more than 6 , additional consumers will be idle and if you run less than 6 , all 6 partitions will be distributed among available consumers. More partitions leads to more throughput.
So before going further into num.streams, we recommend you to run multiple instances of mirror-maker across the machines with same "groupId" in consumer.config. This will help if a mirror-maker process goes down for any reason and the topic-partitions owned by killed mirror-maker will be re-balanced among other running mirror-maker processes. So this will give high-availability of mirror-maker.
Coming back to choosing num.streams. Lets say you've 3 topics with 4 partitions each and you are running 3 mirror maker processes. You can choose 4 as your num.streams this way each instance of mirror-maker starts 4 consumers reading 4 topic-partitions each and writing to target cluster.
If you just run 1 mirror maker by choosing 4 as num.streams then 4 consumers will be reading from all 12 topic-partitions . This means lot more traffic into a single machine and it will be slower. Also if the mirror-maker process is stopped there are no other mirror-maker processes to take over.
For maximum performance, total number of num.streams should match all of the topic partitions that the mirror maker trying to copy to target cluster.
co-locate
One co-locate more than one mirror maker in a single machine. Always run more than one mirror-maker processes. Make sure you use the same groupId in consumer config.
Socket buffer sizes
In general, you should set a high value for the socket buffer size on the mirror-maker's consumer configuration (socket.buffersize) and the source cluster's broker configuration (socket.send.buffer). Also, the mirror-maker consumer's fetch size (fetch.size) should be higher than the consumer's socket buffer size. Note that the socket buffer size configurations are a hint to the underlying platform's networking code.
Check the health of Mirror-Maker
The consumer offset checker tool is useful to gauge how well your mirror is keeping up with the source cluster. Note that the --zkconnect argument should point to the source cluster's ZooKeeper. Also, if the topic is not specified, then the tool prints information for all topics under the given consumer group.
For example:
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group KafkaMirror --zkconnect dc1-zookeeper:2181 --topic test-topic
minimal lag here indicates healthy Mirror-Maker
Running in Secure Clusters
SSL
We recommend to use SSL in mirror-maker from kafka 0.10.x version onwards. You can learn more details about setting SSL for brokers, producers and consumers here
Users can share the same key & trust stores for both producer & consumer in Mirror-Maker. Make sure you give read & write permissions for the certificate/hostname if you are using a authorizer with SSL.
KERBEROS
In kafka 0.9.x and 0.10.0.1, 0.10.1.0 , consumers and producers in mirror-maker cannot run with different principals/keytabs as they both run inside a single JVM. So the users need to use single principal to configure both consumer and producer. This means same principal needs to have at least read & describe access on the source cluster topics and write & describe access to topics on target cluster.
In future version of kafka users can configure different principal/keytabs for consumer & producer in mirror-maker.