I've been reading some documentation on Kafka replication and decided to summarize my notes in this blog post.
MirrorMaker
Data replication across Kafka clusters is handled by MirrorMaker (specifically MirrorMaker 2). It is a solution based on Kafka Connect that uses a set of Producers and Consumers to read from source clusters and write to target clusters.
Basic Configuration
The basic MirrorMaker configuration specifies aliases for all of the clusters.
These aliases are then used to specify the broker addresses for each cluster.
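As a minimal sketch, an mm2.properties file might look like this (the primary/backup aliases and broker addresses are illustrative):

```properties
# Aliases for the clusters taking part in replication
clusters = primary, backup

# Broker addresses for each alias
primary.bootstrap.servers = primary-broker1:9092,primary-broker2:9092
backup.bootstrap.servers = backup-broker1:9092,backup-broker2:9092
```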
Replication Flows
The aliases specified in the basic configuration can be used to define a replication flow.
Each replication flow needs to be enabled explicitly, and additional properties can be configured per flow. A full list of properties can be found here.
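For example, using the illustrative aliases from above (the topic filter is just an example):

```properties
# Enable the flow from primary to backup
primary->backup.enabled = true

# Per-flow properties, e.g. which topics to replicate
primary->backup.topics = .*
```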
Since MirrorMaker is based on Connect, it inherits Connect properties too. A full list of those is available here.
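For instance, Connect-level settings such as tasks.max can be set in the same properties file (the value below is just a placeholder):

```properties
# Maximum number of tasks used for replication
tasks.max = 4
```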
Replication Strategies
Active-Passive
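In MirrorMaker terms this amounts to a one-way flow, sketched here with the illustrative aliases from earlier:

```properties
# Replicate from the active cluster to the passive one only
primary->backup.enabled = true
backup->primary.enabled = false
```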
Active-Active
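Here both clusters accept writes, so flows are enabled in both directions (again using the illustrative aliases):

```properties
# Replicate in both directions
primary->backup.enabled = true
backup->primary.enabled = true
```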
Multi-Cluster Geo-Replication
In a geo-replication setup, a leader cluster in each region replicates to the leader cluster of another region. Clusters within a region replicate amongst themselves.
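A sketch of such a topology, assuming one leader and one local cluster per region (all aliases are made up):

```properties
clusters = us-leader, us-local, eu-leader, eu-local

# Cross-region flows between the leader clusters
us-leader->eu-leader.enabled = true
eu-leader->us-leader.enabled = true

# Intra-region flows from each leader to its local cluster
us-leader->us-local.enabled = true
eu-leader->eu-local.enabled = true
```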
Target Topic Names
By default, the topic names on the source and target clusters will not match: the target topic will have the source cluster's alias prefixed to the topic name, along with a delimiter that can be customized.
You can also set replication.policy.class to IdentityReplicationPolicy to avoid renaming topics with the prefix. This makes sense if you are replicating a topic one-way or migrating from a legacy Kafka cluster. When using this policy, MirrorMaker cannot prevent cycles, so it is necessary to ensure that the replication topology is acyclic.
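For example, with the default policy a topic named orders on primary would show up on backup as primary.orders. A sketch of both options, the separator override and the identity policy, using the illustrative aliases from earlier:

```properties
# Customize the delimiter used by the default policy
# (use "_" instead of the default ".")
replication.policy.separator = _

# Keep the original topic name on the target (no prefix).
# Only safe for acyclic, typically one-way, topologies.
replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
```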
Consuming from Topics in Active-Active Replicated Clusters
As seen in the previous section, topics are renamed when replicated to another cluster. Consumers would need to aggregate data from both the original topic and the replicated topic(s).
Alternatively, data from these multiple topics can be aggregated into a single topic via a solution like ksqlDB. This simplifies the consumer side.
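A hypothetical ksqlDB sketch of this aggregation, assuming an orders topic replicated between primary and backup and a simple JSON schema (topic, stream, and column names are all made up):

```sql
-- Streams over the local topic and the topic replicated from the other cluster
CREATE STREAM orders_local (order_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');

CREATE STREAM orders_replicated (order_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='backup.orders', VALUE_FORMAT='JSON');

-- Merge both streams into a single aggregated stream that consumers can read
CREATE STREAM orders_all AS SELECT * FROM orders_local;
INSERT INTO orders_all SELECT * FROM orders_replicated;
```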
Best Practices
Consistent Configuration
All MirrorMaker instances that share the replication load should run with the same configuration, since it is the shared configuration that lets them coordinate with each other (see the section on starting and stopping below).
Consuming from the Source Rather than Producing to the Target
Kafka MirrorMaker is based on Connect and uses a set of Producers and Consumers. Kafka Consumers are more resilient at recovering from failure, while Producers can run into issues when network latency is high. It therefore makes sense to host the MirrorMaker instances closer to the target cluster, so that they consume from the source over the high-latency link and produce to the target locally.
We can enforce producing only to a nearby or local cluster by using the special clusters flag together with the target cluster's alias.
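For example (cluster alias and file name are illustrative, assuming you run from the Kafka installation directory):

```sh
# Run this MirrorMaker instance near the backup cluster and only produce to it
./bin/connect-mirror-maker.sh connect-mirror-maker.properties --clusters backup
```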
Starting and Stopping MirrorMaker
MirrorMaker is based on Kafka Connect, and multiple instances can coordinate with each other (via Kafka) to share the replication load. This makes scaling horizontally a matter of adding new instances with the same configuration.
MirrorMaker can be stopped by killing the process.
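A minimal sketch, assuming the same properties file as in the earlier examples:

```sh
# Start an additional instance with the same configuration to scale out
./bin/connect-mirror-maker.sh connect-mirror-maker.properties

# Stop an instance by killing its process
kill <mirrormaker-pid>
```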
Monitoring
- record-count - number of records transferred / replicated
- replication-latency-ms - time taken (in milliseconds) for a record to be replicated to the target
- checkpoint-latency-ms - time taken (in milliseconds) to replicate consumer offsets