7.2 Kafka Replication¶
This page covers bidirectional cross-cluster Kafka replication between the testing and proxmox environments using Strimzi MirrorMaker 2 (MM2). Each cluster runs its own MM2 instance responsible for one replication direction — proxmox's MM2 handles proxmox→testing, and testing's MM2 handles testing→proxmox. This avoids duplicate data and provides resilience: if one cluster goes down, only its outbound replication stops.
Key Decisions¶
- Topology: Active-Active (bidirectional, testing ↔ proxmox)
- Replication Tool: Strimzi KafkaMirrorMaker2 CRD (one per cluster, one direction each)
- Replication Policy: DefaultReplicationPolicy with . separator (topics prefixed with the source cluster alias)
- Topics: .* (all topics, excluding internal topics)
- Consumer Groups: all groups (offset synchronisation enabled)
- Authentication: TLS with mTLS (cluster CA certificates)
- Networking: Cilium Cluster Mesh with global services
Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ Testing Cluster (Active — Kafka 4.1.0) │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ ESB │───>│ Kafka Broker │ │ mm2-testing-to-proxmox │ │
│ │ Producer │ │ (local │ │ (connectCluster: │ │
│ └──────────┘ │ topics) │ │ testing) │ │
│ └──────┬───────┘ │ │ │
│ │ │ testing ──────► proxmox │ │
│ │ Port 9094 └───────────┬────────────┘ │
│ │ (TLS + │ │
│ │ Cilium Global Svc) │ │
└─────────────────────────┼────────────────────────┼──────────────────┘
│ │
│ Cilium Cluster Mesh │
│ │
┌─────────────────────────┼────────────────────────┼──────────────────┐
│ │ Port 9094 │ │
│ │ (TLS + │ │
│ │ Cilium Global Svc) │ │
│ ┌──────────┐ ┌──────┴───────┐ ┌───────────┴────────────┐ │
│ │ ESB │───>│ Kafka Broker │ │ mm2-proxmox-to-testing │ │
│ │ Producer │ │ (local │ │ (connectCluster: │ │
│ └──────────┘ │ topics) │ │ proxmox) │ │
│ └──────────────┘ │ │ │
│ │ proxmox ──────► testing │ │
│ └────────────────────────┘ │
│ Proxmox Cluster (Active — Kafka 3.9.0) │
└─────────────────────────────────────────────────────────────────────┘
MirrorMaker 2 Components¶
Each MM2 instance runs three connectors for its single replication direction:
| Component | Purpose | Internal Topic |
|---|---|---|
| MirrorSourceConnector | Consumes from source, produces to target with DefaultReplicationPolicy (topics prefixed with source alias, e.g. proxmox.rect). | mm2-offset-syncs.<target>.internal |
| MirrorCheckpointConnector | Translates consumer group offsets from source to target. Emits checkpoints every 60 seconds. | <source>.checkpoints.internal |
| MirrorHeartbeatConnector | Monitors replication health. Emits a heartbeat every 10 seconds. | heartbeats |
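The checkpoint mechanism can be illustrated with a toy sketch. This is not Kafka code, and the offset pairs are invented; it assumes offsets advance in lockstep past the last sync, which is a simplification (real MM2 translation is more conservative, since source and target offsets need not advance together):

```shell
# Toy model of offset translation. MirrorSourceConnector records
# "upstream downstream" offset pairs in mm2-offset-syncs.<target>.internal;
# MirrorCheckpointConnector uses the nearest pair at or below the committed
# source offset to derive a target offset for the mirrored consumer group.
offset_syncs='0 0
100 40
200 140'

translate() {
  # $1 = consumer group's committed offset on the source cluster.
  # Picks the last sync pair with upstream <= $1, then adds the remaining
  # delta (the lockstep simplification mentioned above).
  printf '%s\n' "$offset_syncs" | awk -v up="$1" \
    '$1 <= up { d = $2 + (up - $1) } END { print d }'
}

translate 250   # nearest sync is 200->140, plus a delta of 50: prints 190
```

The gap between upstream and downstream offsets here mimics topics whose target copy started from offset 0 while the source already had data, which is the normal case for a freshly mirrored topic.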
Topic Naming¶
With DefaultReplicationPolicy (separator: .), replicated topics are prefixed with the source cluster alias:
| Cluster | Local Topic | Replicated To Remote As |
|---|---|---|
| proxmox | rect | proxmox.rect (on testing cluster) |
| testing | rect | testing.rect (on proxmox cluster) |
Infinite Loop Prevention¶
Loop prevention relies on two mechanisms. Since replicated topics carry the source alias as a prefix (e.g. proxmox.rect), and DefaultReplicationPolicy never replicates topics that already carry a remote prefix, infinite loops are inherently prevented. The topicsExcludePattern additionally filters out MM2 internal topics (mm2-offset-syncs.*.internal, *.checkpoints.internal) and Kafka system topics (__consumer_offsets, __transaction_state).
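The exclude pattern's behaviour can be checked offline. A minimal sketch using grep, approximating the CRD's Java regex with POSIX ERE; the topic names are illustrative, with rect taken from this page's examples:

```shell
# Approximate the MM2 topicsExcludePattern ".*[\-\.]internal|__.*" with
# POSIX ERE. Kafka matches the full topic name, so anchor with ^( )$.
pattern='.*[-.]internal|__.*'

check_topic() {
  # Prints EXCLUDED if the topic matches the exclude pattern, else REPLICATED.
  if printf '%s\n' "$1" | grep -Eq "^(${pattern})$"; then
    echo "EXCLUDED"
  else
    echo "REPLICATED"
  fi
}

for topic in rect proxmox.rect mm2-offset-syncs.testing.internal \
             testing.checkpoints.internal __consumer_offsets; do
  printf '%-10s %s\n' "$(check_topic "$topic")" "$topic"
done
```

Note that proxmox.rect passes this pattern; it is DefaultReplicationPolicy's remote-prefix check, not the exclude pattern, that stops it from being re-replicated.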
Prerequisites¶
Cilium Cluster Mesh¶
- Cilium installed on both clusters
- Cilium Cluster Mesh connected between clusters
- Pod-to-pod connectivity verified across clusters
- DNS resolution working for *.svc.clustermesh.local
# Verify Cluster Mesh status
cilium clustermesh status
# Expected:
# Service "clustermesh-apiserver" of type "LoadBalancer" found
# Cluster access information is available:
# - testing-cluster: available
# - proxmox-cluster: available
Network Requirements¶
- Non-overlapping Pod CIDRs between clusters
- Firewall rules allow Kafka traffic (port 9094 TCP)
- Network latency <150ms (measured between clusters)
Strimzi Operator¶
- Strimzi operator deployed in both clusters
- One operator per cluster
- proxmox: Kafka 3.9.0 running in KRaft mode
- testing: Kafka 4.1.0 running in KRaft mode
Storage and Resources¶
- Both clusters have sufficient storage for replicated data
- Storage class supports ReadWriteOnce (RWO) volumes
- Sufficient CPU/Memory for MM2 pods (2 replicas x 4Gi memory per cluster)
Critical Fix: Resource Sizing¶
Fix Before Deploying MM2
The current Kafka configuration has a JVM max heap (-Xmx: 8192m) that exceeds the pod memory limit (4Gi). Brokers will be OOMKilled when the JVM attempts to allocate 8 GB heap. This must be fixed before proceeding.
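A sketch of the corrected sizing in the Kafka CRD's spec.kafka section, assuming the 4Gi pod limit is kept. The values below are illustrative, not the repository's actual configuration; the heap should sit well below the container limit to leave headroom for off-heap memory and page cache:

```yaml
# Illustrative fix: keep the JVM heap below the pod memory limit.
spec:
  kafka:
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 4Gi
    jvmOptions:
      -Xms: 3072m
      -Xmx: 3072m   # was 8192m, which exceeded the 4Gi limit and caused OOMKills
```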
Apply to both clusters:
- Edit apps/rciis/kafka/proxmox/extra/kafka.yaml
- Edit apps/rciis/kafka/testing/extra/kafka.yaml
- Apply changes via FluxCD reconciliation or kubectl apply
- Wait for rolling restart to complete
- Verify brokers are healthy:
kubectl get pods -n rciis-prod -l strimzi.io/cluster=kafka-rciis-prod
kubectl get pods -n rciis-testing -l strimzi.io/cluster=kafka-rciis-testing
kubectl get events -n rciis-prod --sort-by='.lastTimestamp' | grep OOMKilled
kubectl get events -n rciis-testing --sort-by='.lastTimestamp' | grep OOMKilled
Implementation Phases¶
Phase 1: Add External TLS Listener¶
Configure a cross-cluster TLS listener on port 9094 in both Kafka clusters. The listener uses cluster-ip type with Cilium global service annotations so each cluster's Kafka services are discoverable from the other cluster.
Files:
- apps/rciis/kafka/proxmox/extra/kafka.yaml
- apps/rciis/kafka/testing/extra/kafka.yaml
Add the following external listener to the spec.kafka.listeners array:
listeners:
# Existing listeners...
- name: scram
port: 9092
tls: false
type: internal
authentication:
type: scram-sha-512
- name: plain
port: 9093
tls: false
type: internal
# Cross-cluster TLS listener
- name: external
port: 9094
type: cluster-ip
tls: true
authentication:
type: tls
configuration:
bootstrap:
annotations:
service.cilium.io/global: "true"
brokers:
- broker: 0
annotations:
service.cilium.io/global: "true"
- broker: 1
annotations:
service.cilium.io/global: "true"
- broker: 2
annotations:
service.cilium.io/global: "true"
Verify (repeat for both clusters):
# Check new services created
kubectl get svc -n rciis-prod | grep external
# Expected:
# kafka-rciis-prod-kafka-external-bootstrap ClusterIP ... 9094/TCP
# kafka-rciis-prod-kafka-external-0 ClusterIP ... 9094/TCP
# kafka-rciis-prod-kafka-external-1 ClusterIP ... 9094/TCP
# kafka-rciis-prod-kafka-external-2 ClusterIP ... 9094/TCP
# Verify Cilium global service annotation
kubectl get svc kafka-rciis-prod-kafka-external-bootstrap -n rciis-prod -o yaml | grep global
# Expected: service.cilium.io/global: "true"
Phase 2: Cilium Global Services & DNS¶
Verify Kafka services are discoverable across clusters via Cilium Cluster Mesh.
# From testing cluster pod, resolve proxmox cluster bootstrap service
nslookup kafka-rciis-prod-kafka-external-bootstrap.rciis-prod.svc.clustermesh.local
# From proxmox cluster pod, resolve testing cluster bootstrap service
nslookup kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local
# Test TLS connection from proxmox to testing
kubectl run -it --rm debug --image=alpine --restart=Never -n rciis-prod -- sh
apk add openssl
openssl s_client -connect kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local:9094 -showcerts
# Monitor Cilium Cluster Mesh
cilium clustermesh status --context proxmox-cluster
cilium clustermesh status --context testing-cluster
Phase 3: Exchange CA Certificates¶
Each cluster's MM2 needs to trust the remote cluster's CA and authenticate with a remote user certificate. This requires copying secrets bidirectionally.
proxmox cluster → testing namespace:
# Copy proxmox CA cert to testing namespace
kubectl get secret kafka-rciis-prod-cluster-ca-cert -n rciis-prod -o yaml \
| sed 's/namespace: rciis-prod/namespace: rciis-testing/' \
| kubectl apply --context testing-cluster -f -
# Verify
kubectl get secret kafka-rciis-prod-cluster-ca-cert -n rciis-testing --context testing-cluster
testing cluster → proxmox namespace:
# Copy testing CA cert to proxmox namespace
kubectl get secret kafka-rciis-testing-cluster-ca-cert -n rciis-testing -o yaml \
| sed 's/namespace: rciis-testing/namespace: rciis-prod/' \
| kubectl apply --context proxmox-cluster -f -
# Verify
kubectl get secret kafka-rciis-testing-cluster-ca-cert -n rciis-prod --context proxmox-cluster
Phase 4: KafkaUser Resources¶
Each cluster has a single mm2-user with wildcard ACLs for all topics, consumer groups, and MM2 internal topics.
Files:
- apps/rciis/kafka/proxmox/extra/kafka-mm2-user.yaml
- apps/rciis/kafka/testing/extra/kafka-mm2-user.yaml
Both files are identical except for the strimzi.io/cluster label. Example (proxmox):
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: mm2-user
labels:
strimzi.io/cluster: kafka-rciis-prod
spec:
authentication:
type: tls
authorization:
type: simple
acls:
# Read/Write all application topics
- resource:
type: topic
name: "*"
patternType: literal
operations: [Read, Write, Create, Describe]
# Consumer groups
- resource:
type: group
name: "*"
patternType: literal
operations: [Read, Write, Describe]
# MM2 internal topics
- resource:
type: topic
name: mm2-offset-syncs
patternType: prefix
operations: [Read, Write, Create, Describe]
- resource:
type: topic
name: heartbeats
patternType: literal
operations: [Read, Write, Create, Describe]
- resource:
type: topic
name: mirrormaker2-cluster
patternType: prefix
operations: [Read, Write, Create, Describe]
# Kafka Connect internal topics
- resource:
type: topic
name: connect-cluster
patternType: prefix
operations: [Read, Write, Create, Describe]
Apply and copy user secrets cross-cluster:
# Apply KafkaUser in each cluster
kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mm2-user.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mm2-user.yaml -n rciis-testing
# Wait for Strimzi User Operator to generate TLS certificates
kubectl get secret mm2-user -n rciis-prod
kubectl get secret mm2-user -n rciis-testing
# Copy proxmox mm2-user secret → testing namespace (as mm2-user-proxmox)
kubectl get secret mm2-user -n rciis-prod -o json \
| jq '.metadata.name = "mm2-user-proxmox" | .metadata.namespace = "rciis-testing" | del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
| kubectl apply --context testing-cluster -f -
# Copy testing mm2-user secret → proxmox namespace (as mm2-user-testing)
kubectl get secret mm2-user -n rciis-testing -o json \
| jq '.metadata.name = "mm2-user-testing" | .metadata.namespace = "rciis-prod" | del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
| kubectl apply --context proxmox-cluster -f -
# Verify all required secrets in proxmox namespace
kubectl get secrets -n rciis-prod | grep -E '(mm2|cluster-ca-cert)'
# Expected:
# kafka-rciis-prod-cluster-ca-cert (local cluster CA)
# kafka-rciis-testing-cluster-ca-cert (remote cluster CA)
# mm2-user (local user cert)
# mm2-user-testing (remote user cert)
Phase 5: Deploy KafkaMirrorMaker2 CRD¶
Each cluster runs its own MM2 instance responsible for one replication direction only. This ensures no duplicate data and provides resilience — if one cluster goes down, only its outbound replication stops while inbound replication from the other cluster's MM2 continues unaffected.
Files:
- apps/rciis/kafka/proxmox/extra/kafka-mirrormaker2.yaml — mm2-proxmox-to-testing (version 3.9.0, proxmox→testing only)
- apps/rciis/kafka/testing/extra/kafka-mirrormaker2.yaml — mm2-testing-to-proxmox (version 4.1.0, testing→proxmox only)
Example — proxmox cluster CRD (testing cluster is symmetric with swapped aliases, version 4.1.0, and direction reversed):
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
name: mm2-proxmox-to-testing
labels:
app: kafka-mirrormaker2
environment: proxmox
spec:
version: 3.9.0
replicas: 2
connectCluster: "proxmox"
clusters:
- alias: "proxmox"
bootstrapServers: kafka-rciis-prod-kafka-external-bootstrap.rciis-prod.svc:9094
tls:
trustedCertificates:
- secretName: kafka-rciis-prod-cluster-ca-cert
certificate: ca.crt
authentication:
type: tls
certificateAndKey:
secretName: mm2-user
certificate: user.crt
key: user.key
config:
config.storage.replication.factor: 3
offset.storage.replication.factor: 3
status.storage.replication.factor: 3
- alias: "testing"
bootstrapServers: kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local:9094
tls:
trustedCertificates:
- secretName: kafka-rciis-testing-cluster-ca-cert
certificate: ca.crt
authentication:
type: tls
certificateAndKey:
secretName: mm2-user-testing
certificate: user.crt
key: user.key
mirrors:
- sourceCluster: "proxmox"
targetCluster: "testing"
sourceConnector:
tasksMax: 5
autoRestart:
enabled: true
config:
replication.factor: 3
offset-syncs.topic.replication.factor: 3
offset-syncs.topic.location: "target"
refresh.topics.interval.seconds: 60
replication.policy.class: "org.apache.kafka.connect.mirror.DefaultReplicationPolicy"
replication.policy.separator: "."
sync.topic.acls.enabled: "false"
producer.override.acks: all
producer.override.compression.type: lz4
producer.override.batch.size: 327680
producer.override.linger.ms: 100
checkpointConnector:
autoRestart:
enabled: true
config:
checkpoints.topic.replication.factor: 3
sync.group.offsets.enabled: true
sync.group.offsets.interval.seconds: 60
emit.checkpoints.interval.seconds: 60
refresh.groups.interval.seconds: 600
replication.policy.class: "org.apache.kafka.connect.mirror.DefaultReplicationPolicy"
heartbeatConnector:
config:
heartbeats.topic.replication.factor: 3
emit.heartbeats.interval.seconds: 10
topicsPattern: ".*"
topicsExcludePattern: ".*[\\-\\.]internal|__.*"
groupsPattern: ".*"
resources:
requests:
cpu: "500m"
memory: 1Gi
limits:
cpu: "2000m"
memory: 4Gi
jvmOptions:
-Xms: 1024m
-Xmx: 3072m
logging:
type: inline
loggers:
connect.root.logger.level: INFO
log4j.logger.org.apache.kafka.connect.mirror: DEBUG
livenessProbe:
initialDelaySeconds: 60
timeoutSeconds: 5
readinessProbe:
initialDelaySeconds: 30
timeoutSeconds: 5
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: mm2-jmx-metrics
key: mm2-metrics-config.yml
Key differences between the two CRDs:
| Field | proxmox CRD | testing CRD |
|---|---|---|
| metadata.name | mm2-proxmox-to-testing | mm2-testing-to-proxmox |
| spec.version | 3.9.0 | 4.1.0 |
| spec.connectCluster | proxmox | testing |
| mirrors[0] direction | proxmox → testing | testing → proxmox |
| Local cluster bootstrapServers | ...rciis-prod.svc:9094 | ...rciis-testing.svc:9094 |
| Remote cluster bootstrapServers | ...rciis-testing.svc.clustermesh.local:9094 | ...rciis-prod.svc.clustermesh.local:9094 |
| Local user secret | mm2-user | mm2-user |
| Remote user secret | mm2-user-testing | mm2-user-proxmox |
Deploy:
kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mirrormaker2.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mirrormaker2.yaml -n rciis-testing
Verify:
# Check pods in both clusters (2 replicas each)
kubectl get pods -n rciis-prod -l app=kafka-mirrormaker2
kubectl get pods -n rciis-testing -l app=kafka-mirrormaker2
# Check connector status (proxmox cluster — 3 connectors for proxmox→testing)
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors | jq
# Expected:
# proxmox->testing.MirrorSourceConnector
# proxmox->testing.MirrorCheckpointConnector
# proxmox->testing.MirrorHeartbeatConnector
# Check connector status (testing cluster — 3 connectors for testing→proxmox)
kubectl exec -n rciis-testing mm2-testing-to-proxmox-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors | jq
# Expected:
# testing->proxmox.MirrorSourceConnector
# testing->proxmox.MirrorCheckpointConnector
# testing->proxmox.MirrorHeartbeatConnector
# Check individual connector status
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/status | jq
# Expected: "state": "RUNNING"
Phase 6: JMX Metrics ConfigMap¶
Deploy the JMX metrics configuration for Prometheus scraping of MM2 connector metrics.
Files:
- apps/rciis/kafka/proxmox/extra/kafka-mm2-metrics-configmap.yaml
- apps/rciis/kafka/testing/extra/kafka-mm2-metrics-configmap.yaml
Both files are identical:
apiVersion: v1
kind: ConfigMap
metadata:
name: mm2-jmx-metrics
data:
mm2-metrics-config.yml: |
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: kafka.connect<type=connect-worker-metrics><>(connector-count|task-count|connector-startup-attempts-total|...)
name: kafka_connect_$1
type: GAUGE
- pattern: kafka.connect.mirror<type=MirrorSourceConnector, target=(.+), topic=(.+), partition=([0-9]+)><>([a-z-]+)
name: kafka_connect_mirror_source_connector_$4
labels:
target: "$1"
topic: "$2"
partition: "$3"
type: GAUGE
- pattern: kafka.connect.mirror<type=MirrorSourceConnector, target=(.+), topic=(.+), partition=([0-9]+)><>replication-latency-ms
name: kafka_connect_mirror_replication_latency_ms
labels:
target: "$1"
topic: "$2"
partition: "$3"
type: GAUGE
- pattern: kafka.connect.mirror<type=MirrorCheckpointConnector, source=(.+), target=(.+)><>([a-z-]+)
name: kafka_connect_mirror_checkpoint_connector_$3
labels:
source: "$1"
target: "$2"
type: GAUGE
kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mm2-metrics-configmap.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mm2-metrics-configmap.yaml -n rciis-testing
Phase 7: PrometheusRule Alerts¶
Deploy alerting rules for MM2 health and replication lag in both clusters.
Files:
- apps/rciis/kafka/proxmox/extra/kafka-mm2-prometheus-rules.yaml
- apps/rciis/kafka/testing/extra/kafka-mm2-prometheus-rules.yaml
Both files are identical:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: kafka-mirrormaker2-alerts
labels:
release: prometheus
spec:
groups:
- name: mirrormaker2
interval: 30s
rules:
- alert: MM2ConnectorDown
expr: kafka_connect_connector_status{state="RUNNING"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "MirrorMaker 2 connector {{ $labels.connector }} is down"
- alert: MM2ReplicationLagHigh
expr: |
(kafka_connect_source_task_source_record_poll_total -
kafka_connect_source_task_source_record_write_total) > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "MirrorMaker 2 replication lag exceeds 10,000 messages"
- alert: MM2ReplicationLagCritical
expr: |
(kafka_connect_source_task_source_record_poll_total -
kafka_connect_source_task_source_record_write_total) > 50000
for: 2m
labels:
severity: critical
annotations:
summary: "MirrorMaker 2 replication lag critical (>50k messages)"
- alert: MM2CheckpointStale
expr: time() - kafka_connect_source_task_last_checkpoint_timestamp_seconds > 300
for: 5m
labels:
severity: warning
annotations:
summary: "MirrorMaker 2 checkpoints are stale (no checkpoint in 5 minutes)"
kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mm2-prometheus-rules.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mm2-prometheus-rules.yaml -n rciis-testing
Grafana dashboards — import from the Strimzi GitHub repository:
- strimzi-kafka.json (broker metrics)
- strimzi-kafka-exporter.json (consumer lag)
- strimzi-kafka-connect.json (MM2 connector metrics)
Operational Runbooks¶
Pause/Resume One Replication Direction¶
Since both clusters are active, you may need to temporarily pause replication in one direction (e.g. during maintenance on a single cluster).
# Pause proxmox→testing replication (on proxmox cluster MM2)
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -X PUT http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/pause
# Verify paused
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/status | jq '.connector.state'
# Expected: "PAUSED"
# ... perform maintenance ...
# Resume
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -X PUT http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/resume
To pause all replication from a cluster, scale down its MM2 instance:
# Scale down proxmox MM2 (pauses proxmox→testing replication)
kubectl scale kafkamirrormaker2/mm2-proxmox-to-testing --replicas=0 -n rciis-prod
# Scale back up
kubectl scale kafkamirrormaker2/mm2-proxmox-to-testing --replicas=2 -n rciis-prod
Network Partition Recovery¶
Symptoms: MM2ConnectorDown alert, logs showing "Connection reset by peer" or "Timed out waiting for".
# Diagnose
cilium clustermesh status --context proxmox-cluster
cilium clustermesh status --context testing-cluster
# Test cross-cluster connectivity
kubectl exec -it -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
nc -zv kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local 9094
# Resolve — if Cilium Cluster Mesh is down, restart API server
kubectl rollout restart deployment/clustermesh-apiserver -n kube-system
# MM2 will automatically resume once connectivity is restored (autoRestart: enabled)
# Monitor catch-up progress:
watch 'kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/status | jq'
High Replication Lag Troubleshooting¶
Symptoms: MM2ReplicationLagHigh alert, lag >10,000 messages.
# Check connector status on both clusters
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors | jq
kubectl exec -n rciis-testing mm2-testing-to-proxmox-mirrormaker2-connect-0 -- \
curl -s http://localhost:8083/connectors | jq
# Check MM2 logs for errors
kubectl logs -n rciis-prod -l app=kafka-mirrormaker2 --tail=100 | grep -i error
kubectl logs -n rciis-testing -l app=kafka-mirrormaker2 --tail=100 | grep -i error
Common causes:
| Cause | Resolution |
|---|---|
| Burst of messages on source | Wait for burst to complete; optionally scale MM2 replicas to 3 |
| Network congestion | Check bandwidth utilisation; investigate Cilium Cluster Mesh health |
| Target broker failure | Check broker health: kubectl get pods -n rciis-testing -l strimzi.io/cluster=kafka-rciis-testing |
| Insufficient MM2 task parallelism | Increase tasksMax from 5 to 10 in KafkaMirrorMaker2 CRD |
References¶
- Strimzi Kafka Operator Documentation
- Apache Kafka MirrorMaker 2 (KIP-382)
- Cilium Cluster Mesh Documentation
Key Files¶
| File | Description |
|---|---|
| apps/rciis/kafka/proxmox/extra/kafka.yaml | Proxmox cluster Kafka CRD (v3.9.0) |
| apps/rciis/kafka/testing/extra/kafka.yaml | Testing cluster Kafka CRD (v4.1.0) |
| apps/rciis/kafka/proxmox/extra/kafka-mirrormaker2.yaml | Proxmox MM2 CRD (proxmox → testing) |
| apps/rciis/kafka/testing/extra/kafka-mirrormaker2.yaml | Testing MM2 CRD (testing → proxmox) |
| apps/rciis/kafka/proxmox/extra/kafka-mm2-user.yaml | Proxmox cluster KafkaUser for MM2 |
| apps/rciis/kafka/testing/extra/kafka-mm2-user.yaml | Testing cluster KafkaUser for MM2 |
| apps/rciis/kafka/proxmox/extra/kafka-mm2-metrics-configmap.yaml | Proxmox JMX metrics ConfigMap |
| apps/rciis/kafka/testing/extra/kafka-mm2-metrics-configmap.yaml | Testing JMX metrics ConfigMap |
| apps/rciis/kafka/proxmox/extra/kafka-mm2-prometheus-rules.yaml | Proxmox PrometheusRule for MM2 alerts |
| apps/rciis/kafka/testing/extra/kafka-mm2-prometheus-rules.yaml | Testing PrometheusRule for MM2 alerts |