7.2 Kafka Replication

This page covers bidirectional cross-cluster Kafka replication between the testing and proxmox environments using Strimzi MirrorMaker 2 (MM2). Each cluster runs its own MM2 instance responsible for one replication direction — proxmox's MM2 handles proxmox→testing, and testing's MM2 handles testing→proxmox. This avoids duplicate data and provides resilience: if one cluster goes down, only its outbound replication stops.


Key Decisions

  • Topology: Active-Active (bidirectional, testing ↔ proxmox)
  • Replication Tool: Strimzi KafkaMirrorMaker2 CRD (one per cluster, one direction each)
  • Replication Policy: DefaultReplicationPolicy with . separator (topics prefixed with source cluster alias)
  • Topics: .* (all topics, excluding internal topics)
  • Consumer Groups: All groups (offset synchronisation enabled)
  • Authentication: TLS with mTLS (cluster CA certificates)
  • Networking: Cilium Cluster Mesh with global services

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                   Testing Cluster (Active — Kafka 4.1.0)            │
│                                                                     │
│  ┌──────────┐    ┌──────────────┐    ┌────────────────────────┐     │
│  │ ESB      │───>│ Kafka Broker │    │ mm2-testing-to-proxmox │     │
│  │ Producer │    │ (local       │    │ (connectCluster:       │     │
│  └──────────┘    │  topics)     │    │  testing)              │     │
│                  └──────┬───────┘    │                        │     │
│                         │            │ testing ──────► proxmox│     │
│                         │ Port 9094  └───────────┬────────────┘     │
│                         │ (TLS +                 │                  │
│                         │  Cilium Global Svc)    │                  │
└─────────────────────────┼────────────────────────┼──────────────────┘
                          │                        │
                          │  Cilium Cluster Mesh   │
                          │                        │
┌─────────────────────────┼────────────────────────┼──────────────────┐
│                         │ Port 9094              │                  │
│                         │ (TLS +                 │                  │
│                         │  Cilium Global Svc)    │                  │
│  ┌──────────┐    ┌──────┴───────┐    ┌───────────┴────────────┐     │
│  │ ESB      │───>│ Kafka Broker │    │ mm2-proxmox-to-testing │     │
│  │ Producer │    │ (local       │    │ (connectCluster:       │     │
│  └──────────┘    │  topics)     │    │  proxmox)              │     │
│                  └──────────────┘    │                        │     │
│                                      │ proxmox ──────► testing│     │
│                                      └────────────────────────┘     │
│                   Proxmox Cluster (Active — Kafka 3.9.0)            │
└─────────────────────────────────────────────────────────────────────┘

MirrorMaker 2 Components

Each MM2 instance runs three connectors for its single replication direction:

| Component | Purpose | Internal Topic |
| --- | --- | --- |
| MirrorSourceConnector | Consumes from source, produces to target with DefaultReplicationPolicy (topics prefixed with source alias, e.g. proxmox.rect). | mm2-offset-syncs.<target>.internal |
| MirrorCheckpointConnector | Translates consumer group offsets from source to target. Emits checkpoints every 60 seconds. | <source>.checkpoints.internal |
| MirrorHeartbeatConnector | Monitors replication health. Emits a heartbeat every 10 seconds. | heartbeats |
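MM2 registers these connectors in the Kafka Connect REST API under names of the form source->target.ConnectorType, which is what the Phase 5 verification steps look for. A quick sketch of the names to expect, using this setup's aliases:

```shell
# Illustrative: the connector names MM2 registers for one direction
# (proxmox -> testing); the reverse direction swaps the aliases.
src=proxmox
dst=testing
for c in MirrorSourceConnector MirrorCheckpointConnector MirrorHeartbeatConnector; do
  printf '%s->%s.%s\n' "$src" "$dst" "$c"
done
# -> proxmox->testing.MirrorSourceConnector
# -> proxmox->testing.MirrorCheckpointConnector
# -> proxmox->testing.MirrorHeartbeatConnector
```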

Topic Naming

With DefaultReplicationPolicy (separator: .), replicated topics are prefixed with the source cluster alias:

| Cluster | Local Topic | Replicated To Remote As |
| --- | --- | --- |
| proxmox | rect | proxmox.rect (on testing cluster) |
| testing | rect | testing.rect (on proxmox cluster) |
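The renaming rule is plain string concatenation of source alias, separator, and topic name. A minimal shell sketch (remote_topic is an illustrative helper, not an MM2 API):

```shell
# DefaultReplicationPolicy with separator "." prepends the source
# cluster alias to the topic name on the target cluster.
remote_topic() {
  src_alias="$1"; topic="$2"
  printf '%s.%s\n' "$src_alias" "$topic"
}

remote_topic proxmox rect   # -> proxmox.rect
remote_topic testing rect   # -> testing.rect
```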

Infinite Loop Prevention

The topicsExcludePattern prevents re-replicating already-replicated topics:

topicsPattern: ".*"
topicsExcludePattern: ".*[\\-\\.]internal|__.*"

Since replicated topics contain the source alias (e.g. proxmox.rect), and DefaultReplicationPolicy only replicates topics that don't already carry a remote prefix, infinite loops are inherently prevented. The exclude pattern additionally filters out MM2 internal topics (mm2-offset-syncs.*.internal, *.checkpoints.internal) and Kafka system topics (__consumer_offsets, __transaction_state).
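You can sanity-check the exclude pattern locally with grep's extended regexes (the doubled backslashes in the YAML collapse to the character class [-.]; the sample topic names below are illustrative, and -x approximates the full-string matching applied to these patterns):

```shell
# Apply the exclude pattern to sample topic names; -x matches the
# whole name, mimicking full-string regex matching.
pattern='.*[-.]internal|__.*'

for t in rect proxmox.rect heartbeats \
         mm2-offset-syncs.testing.internal __consumer_offsets; do
  if printf '%s\n' "$t" | grep -Eqx "$pattern"; then
    echo "excluded:   $t"
  else
    echo "replicated: $t"
  fi
done
# -> replicated: rect
# -> replicated: proxmox.rect
# -> replicated: heartbeats
# -> excluded:   mm2-offset-syncs.testing.internal
# -> excluded:   __consumer_offsets
```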


Prerequisites

Cilium Cluster Mesh

  • Cilium installed on both clusters
  • Cilium Cluster Mesh connected between clusters
  • Pod-to-pod connectivity verified across clusters
  • DNS resolution working: *.svc.clustermesh.local

# Verify Cluster Mesh status
cilium clustermesh status

# Expected:
# Service "clustermesh-apiserver" of type "LoadBalancer" found
# Cluster access information is available:
#   - testing-cluster: available
#   - proxmox-cluster: available

Network Requirements

  • Non-overlapping Pod CIDRs between clusters
  • Firewall rules allow Kafka traffic (port 9094 TCP)
  • Network latency <150ms (measured between clusters)

Strimzi Operator

  • Strimzi operator deployed in both clusters
  • One operator per cluster
  • proxmox: Kafka 3.9.0 running in KRaft mode
  • testing: Kafka 4.1.0 running in KRaft mode

Storage and Resources

  • Both clusters have sufficient storage for replicated data
  • Storage class supports ReadWriteOnce (RWO) volumes
  • Sufficient CPU/Memory for MM2 pods (2 replicas x 4Gi memory per cluster)

Critical Fix: Resource Sizing

Fix Before Deploying MM2

The current Kafka configuration has a JVM max heap (-Xmx: 8192m) that exceeds the pod memory limit (4Gi). Brokers will be OOMKilled when the JVM attempts to allocate the 8 GB heap. Fix this before proceeding, using either option below.

Option A: raise the pod memory so the existing 8 GB heap fits (heap at 50% of pod memory):

resources:
  requests:
    memory: 16Gi
    cpu: "2"
  limits:
    memory: 16Gi
    cpu: "4"

jvmOptions:
  -Xms: 8192m       # Initial heap: 8 GB
  -Xmx: 8192m       # Max heap: 8 GB (50% of pod memory for OS page cache)

Option B: keep the 4Gi pod limit and shrink the heap to fit:

resources:
  requests:
    memory: 4Gi
    cpu: "1"
  limits:
    memory: 4Gi
    cpu: "2"

jvmOptions:
  -Xms: 3072m       # Initial heap: 3 GB
  -Xmx: 3072m       # Max heap: 3 GB (leave 1 GB for OS page cache)
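A quick arithmetic check of the 4Gi sizing above: the heap must leave headroom inside the pod limit for the OS page cache and JVM off-heap overhead.

```shell
# Heap vs. pod limit headroom, using the numbers from the block above.
pod_limit_mi=4096   # 4Gi memory limit
heap_mi=3072        # -Xmx 3072m
headroom_mi=$((pod_limit_mi - heap_mi))
echo "headroom: ${headroom_mi}Mi"   # -> headroom: 1024Mi
[ "$headroom_mi" -gt 0 ] && echo "heap fits inside the pod limit"
```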

Apply to both clusters:

  1. Edit apps/rciis/kafka/proxmox/extra/kafka.yaml
  2. Edit apps/rciis/kafka/testing/extra/kafka.yaml
  3. Apply changes via FluxCD reconciliation or kubectl apply
  4. Wait for rolling restart to complete
  5. Verify brokers are healthy:
kubectl get pods -n rciis-prod -l strimzi.io/cluster=kafka-rciis-prod
kubectl get pods -n rciis-testing -l strimzi.io/cluster=kafka-rciis-testing
kubectl get events -n rciis-prod --sort-by='.lastTimestamp' | grep OOMKilled
kubectl get events -n rciis-testing --sort-by='.lastTimestamp' | grep OOMKilled

Implementation Phases

Phase 1: Add External TLS Listener

Configure a cross-cluster TLS listener on port 9094 in both Kafka clusters. The listener uses cluster-ip type with Cilium global service annotations so each cluster's Kafka services are discoverable from the other cluster.

Files:

  • apps/rciis/kafka/proxmox/extra/kafka.yaml
  • apps/rciis/kafka/testing/extra/kafka.yaml

The external listener in the spec.kafka.listeners array:

listeners:
  # Existing listeners...
  - name: scram
    port: 9092
    tls: false
    type: internal
    authentication:
      type: scram-sha-512
  - name: plain
    port: 9093
    tls: false
    type: internal

  # Cross-cluster TLS listener
  - name: external
    port: 9094
    type: cluster-ip
    tls: true
    authentication:
      type: tls
    configuration:
      bootstrap:
        annotations:
          service.cilium.io/global: "true"
      brokers:
        - broker: 0
          annotations:
            service.cilium.io/global: "true"
        - broker: 1
          annotations:
            service.cilium.io/global: "true"
        - broker: 2
          annotations:
            service.cilium.io/global: "true"

Verify (repeat for both clusters):

# Check new services created
kubectl get svc -n rciis-prod | grep external

# Expected:
# kafka-rciis-prod-kafka-external-bootstrap    ClusterIP   ...   9094/TCP
# kafka-rciis-prod-kafka-external-0            ClusterIP   ...   9094/TCP
# kafka-rciis-prod-kafka-external-1            ClusterIP   ...   9094/TCP
# kafka-rciis-prod-kafka-external-2            ClusterIP   ...   9094/TCP

# Verify Cilium global service annotation
kubectl get svc kafka-rciis-prod-kafka-external-bootstrap -n rciis-prod -o yaml | grep global
# Expected: service.cilium.io/global: "true"

Phase 2: Cilium Global Services & DNS

Verify Kafka services are discoverable across clusters via Cilium Cluster Mesh.

# From testing cluster pod, resolve proxmox cluster bootstrap service
nslookup kafka-rciis-prod-kafka-external-bootstrap.rciis-prod.svc.clustermesh.local

# From proxmox cluster pod, resolve testing cluster bootstrap service
nslookup kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local

# Test TLS connection from proxmox to testing
kubectl run -it --rm debug --image=alpine --restart=Never -n rciis-prod -- sh
apk add openssl
openssl s_client -connect kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local:9094 -showcerts

# Monitor Cilium Cluster Mesh
cilium clustermesh status --context proxmox-cluster
cilium clustermesh status --context testing-cluster

Phase 3: Exchange CA Certificates

Each cluster's MM2 needs to trust the remote cluster's CA and authenticate with a remote user certificate. This requires copying secrets bidirectionally.

proxmox cluster → testing namespace:

# Copy proxmox CA cert to testing namespace
kubectl get secret kafka-rciis-prod-cluster-ca-cert -n rciis-prod -o yaml \
  | sed 's/namespace: rciis-prod/namespace: rciis-testing/' \
  | kubectl apply --context testing-cluster -f -

# Verify
kubectl get secret kafka-rciis-prod-cluster-ca-cert -n rciis-testing --context testing-cluster

testing cluster → proxmox namespace:

# Copy testing CA cert to proxmox namespace
kubectl get secret kafka-rciis-testing-cluster-ca-cert -n rciis-testing -o yaml \
  | sed 's/namespace: rciis-testing/namespace: rciis-prod/' \
  | kubectl apply --context proxmox-cluster -f -

# Verify
kubectl get secret kafka-rciis-testing-cluster-ca-cert -n rciis-prod --context proxmox-cluster
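The sed rewrite above can be dry-run against a stub manifest before touching real secrets (the stub content below is illustrative; only the metadata lines matter for the substitution):

```shell
# Dry-run the namespace rewrite on a stub Secret manifest; pipe the
# real kubectl get output into kubectl apply only once this looks right.
cat > /tmp/ca-secret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: kafka-rciis-prod-cluster-ca-cert
  namespace: rciis-prod
EOF

sed 's/namespace: rciis-prod/namespace: rciis-testing/' /tmp/ca-secret.yaml
# Output is identical except the line: namespace: rciis-testing
```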

Phase 4: KafkaUser Resources

Each cluster has a single mm2-user with wildcard ACLs for all topics, consumer groups, and MM2 internal topics.

Files:

  • apps/rciis/kafka/proxmox/extra/kafka-mm2-user.yaml
  • apps/rciis/kafka/testing/extra/kafka-mm2-user.yaml

Both files are identical except for the strimzi.io/cluster label. Example (proxmox):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: mm2-user
  labels:
    strimzi.io/cluster: kafka-rciis-prod
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      # Read/Write all application topics
      - resource:
          type: topic
          name: "*"
          patternType: literal
        operations: [Read, Write, Create, Describe]
      # Consumer groups
      - resource:
          type: group
          name: "*"
          patternType: literal
        operations: [Read, Write, Describe]
      # MM2 internal topics
      - resource:
          type: topic
          name: mm2-offset-syncs
          patternType: prefix
        operations: [Read, Write, Create, Describe]
      - resource:
          type: topic
          name: heartbeats
          patternType: literal
        operations: [Read, Write, Create, Describe]
      - resource:
          type: topic
          name: mirrormaker2-cluster
          patternType: prefix
        operations: [Read, Write, Create, Describe]
      # Kafka Connect internal topics
      - resource:
          type: topic
          name: connect-cluster
          patternType: prefix
        operations: [Read, Write, Create, Describe]

Apply and copy user secrets cross-cluster:

# Apply KafkaUser in each cluster
kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mm2-user.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mm2-user.yaml -n rciis-testing

# Wait for Strimzi User Operator to generate TLS certificates
kubectl get secret mm2-user -n rciis-prod
kubectl get secret mm2-user -n rciis-testing

# Copy proxmox mm2-user secret → testing namespace (as mm2-user-proxmox)
kubectl get secret mm2-user -n rciis-prod -o json \
  | jq '.metadata.name = "mm2-user-proxmox" | .metadata.namespace = "rciis-testing" | del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
  | kubectl apply --context testing-cluster -f -

# Copy testing mm2-user secret → proxmox namespace (as mm2-user-testing)
kubectl get secret mm2-user -n rciis-testing -o json \
  | jq '.metadata.name = "mm2-user-testing" | .metadata.namespace = "rciis-prod" | del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
  | kubectl apply --context proxmox-cluster -f -

# Verify all required secrets in proxmox namespace
kubectl get secrets -n rciis-prod | grep -E '(mm2|cluster-ca-cert)'
# Expected:
# kafka-rciis-prod-cluster-ca-cert       (local cluster CA)
# kafka-rciis-testing-cluster-ca-cert    (remote cluster CA)
# mm2-user                                (local user cert)
# mm2-user-testing                        (remote user cert)

Phase 5: Deploy KafkaMirrorMaker2 CRD

Each cluster runs its own MM2 instance responsible for one replication direction only. This ensures no duplicate data and provides resilience — if one cluster goes down, only its outbound replication stops while inbound replication from the other cluster's MM2 continues unaffected.

Files:

  • apps/rciis/kafka/proxmox/extra/kafka-mirrormaker2.yaml: mm2-proxmox-to-testing (version 3.9.0, proxmox→testing only)
  • apps/rciis/kafka/testing/extra/kafka-mirrormaker2.yaml: mm2-testing-to-proxmox (version 4.1.0, testing→proxmox only)

Example — proxmox cluster CRD (testing cluster is symmetric with swapped aliases, version 4.1.0, and direction reversed):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mm2-proxmox-to-testing
  labels:
    app: kafka-mirrormaker2
    environment: proxmox
spec:
  version: 3.9.0
  replicas: 2
  connectCluster: "proxmox"

  clusters:
    - alias: "proxmox"
      bootstrapServers: kafka-rciis-prod-kafka-external-bootstrap.rciis-prod.svc:9094
      tls:
        trustedCertificates:
          - secretName: kafka-rciis-prod-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-user
          certificate: user.crt
          key: user.key
      config:
        config.storage.replication.factor: 3
        offset.storage.replication.factor: 3
        status.storage.replication.factor: 3

    - alias: "testing"
      bootstrapServers: kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local:9094
      tls:
        trustedCertificates:
          - secretName: kafka-rciis-testing-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-user-testing
          certificate: user.crt
          key: user.key

  mirrors:
    - sourceCluster: "proxmox"
      targetCluster: "testing"
      sourceConnector:
        tasksMax: 5
        autoRestart:
          enabled: true
        config:
          replication.factor: 3
          offset-syncs.topic.replication.factor: 3
          offset-syncs.topic.location: "target"
          refresh.topics.interval.seconds: 60
          replication.policy.class: "org.apache.kafka.connect.mirror.DefaultReplicationPolicy"
          replication.policy.separator: "."
          sync.topic.acls.enabled: "false"
          producer.override.acks: all
          producer.override.compression.type: lz4
          producer.override.batch.size: 327680
          producer.override.linger.ms: 100
      checkpointConnector:
        autoRestart:
          enabled: true
        config:
          checkpoints.topic.replication.factor: 3
          sync.group.offsets.enabled: true
          sync.group.offsets.interval.seconds: 60
          emit.checkpoints.interval.seconds: 60
          refresh.groups.interval.seconds: 600
          replication.policy.class: "org.apache.kafka.connect.mirror.DefaultReplicationPolicy"
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 3
          emit.heartbeats.interval.seconds: 10
      topicsPattern: ".*"
      topicsExcludePattern: ".*[\\-\\.]internal|__.*"
      groupsPattern: ".*"

  resources:
    requests:
      cpu: "500m"
      memory: 1Gi
    limits:
      cpu: "2000m"
      memory: 4Gi
  jvmOptions:
    -Xms: 1024m
    -Xmx: 3072m
  logging:
    type: inline
    loggers:
      connect.root.logger.level: INFO
      log4j.logger.org.apache.kafka.connect.mirror: DEBUG
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 5
  metricsConfig:
    type: jmxPrometheusExporter
    valueFrom:
      configMapKeyRef:
        name: mm2-jmx-metrics
        key: mm2-metrics-config.yml

Key differences between the two CRDs:

| Field | proxmox CRD | testing CRD |
| --- | --- | --- |
| metadata.name | mm2-proxmox-to-testing | mm2-testing-to-proxmox |
| spec.version | 3.9.0 | 4.1.0 |
| spec.connectCluster | proxmox | testing |
| mirrors[0] direction | proxmox → testing | testing → proxmox |
| Local cluster bootstrapServers | ...rciis-prod.svc:9094 | ...rciis-testing.svc:9094 |
| Remote cluster bootstrapServers | ...rciis-testing.svc.clustermesh.local:9094 | ...rciis-prod.svc.clustermesh.local:9094 |
| Local user secret | mm2-user | mm2-user |
| Remote user secret | mm2-user-testing | mm2-user-proxmox |

Deploy:

kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mirrormaker2.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mirrormaker2.yaml -n rciis-testing

Verify:

# Check pods in both clusters (2 replicas each)
kubectl get pods -n rciis-prod -l app=kafka-mirrormaker2
kubectl get pods -n rciis-testing -l app=kafka-mirrormaker2

# Check connector status (proxmox cluster — 3 connectors for proxmox→testing)
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq
# Expected:
# proxmox->testing.MirrorSourceConnector
# proxmox->testing.MirrorCheckpointConnector
# proxmox->testing.MirrorHeartbeatConnector

# Check connector status (testing cluster — 3 connectors for testing→proxmox)
kubectl exec -n rciis-testing mm2-testing-to-proxmox-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq
# Expected:
# testing->proxmox.MirrorSourceConnector
# testing->proxmox.MirrorCheckpointConnector
# testing->proxmox.MirrorHeartbeatConnector

# Check individual connector status
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/status | jq
# Expected: "state": "RUNNING"

Phase 6: JMX Metrics ConfigMap

Deploy the JMX metrics configuration for Prometheus scraping of MM2 connector metrics.

Files:

  • apps/rciis/kafka/proxmox/extra/kafka-mm2-metrics-configmap.yaml
  • apps/rciis/kafka/testing/extra/kafka-mm2-metrics-configmap.yaml

Both files are identical:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mm2-jmx-metrics
data:
  mm2-metrics-config.yml: |
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
    - pattern: kafka.connect<type=connect-worker-metrics><>(connector-count|task-count|connector-startup-attempts-total|...)
      name: kafka_connect_$1
      type: GAUGE

    - pattern: kafka.connect.mirror<type=MirrorSourceConnector, target=(.+), topic=(.+), partition=([0-9]+)><>([a-z-]+)
      name: kafka_connect_mirror_source_connector_$4
      labels:
        target: "$1"
        topic: "$2"
        partition: "$3"
      type: GAUGE

    - pattern: kafka.connect.mirror<type=MirrorSourceConnector, target=(.+), topic=(.+), partition=([0-9]+)><>replication-latency-ms
      name: kafka_connect_mirror_replication_latency_ms
      labels:
        target: "$1"
        topic: "$2"
        partition: "$3"
      type: GAUGE

    - pattern: kafka.connect.mirror<type=MirrorCheckpointConnector, source=(.+), target=(.+)><>([a-z-]+)
      name: kafka_connect_mirror_checkpoint_connector_$3
      labels:
        source: "$1"
        target: "$2"
      type: GAUGE

Apply:

kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mm2-metrics-configmap.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mm2-metrics-configmap.yaml -n rciis-testing

Phase 7: PrometheusRule Alerts

Deploy alerting rules for MM2 health and replication lag in both clusters.

Files:

  • apps/rciis/kafka/proxmox/extra/kafka-mm2-prometheus-rules.yaml
  • apps/rciis/kafka/testing/extra/kafka-mm2-prometheus-rules.yaml

Both files are identical:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-mirrormaker2-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: mirrormaker2
      interval: 30s
      rules:
        - alert: MM2ConnectorDown
          expr: kafka_connect_connector_status{state="RUNNING"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MirrorMaker 2 connector {{ $labels.connector }} is down"

        - alert: MM2ReplicationLagHigh
          expr: |
            (kafka_connect_source_task_source_record_poll_total -
             kafka_connect_source_task_source_record_write_total) > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "MirrorMaker 2 replication lag exceeds 10,000 messages"

        - alert: MM2ReplicationLagCritical
          expr: |
            (kafka_connect_source_task_source_record_poll_total -
             kafka_connect_source_task_source_record_write_total) > 50000
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MirrorMaker 2 replication lag critical (>50k messages)"

        - alert: MM2CheckpointStale
          expr: time() - kafka_connect_source_task_last_checkpoint_timestamp_seconds > 300
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "MirrorMaker 2 checkpoints are stale (no checkpoint in 5 minutes)"

Apply:

kubectl apply -f apps/rciis/kafka/proxmox/extra/kafka-mm2-prometheus-rules.yaml -n rciis-prod
kubectl apply -f apps/rciis/kafka/testing/extra/kafka-mm2-prometheus-rules.yaml -n rciis-testing
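The lag expressions in the rules above are simple counter differences: records polled from the source minus records written to the target. With hypothetical counter values:

```shell
# Lag as computed by the alerts: source records polled minus written.
polled=120000    # kafka_connect_source_task_source_record_poll_total
written=105000   # kafka_connect_source_task_source_record_write_total
lag=$((polled - written))
echo "lag: $lag messages"   # -> lag: 15000 messages

# Against the thresholds in the PrometheusRule above:
[ "$lag" -gt 10000 ] && echo "MM2ReplicationLagHigh would fire"
[ "$lag" -gt 50000 ] || echo "MM2ReplicationLagCritical would not fire"
```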

Grafana dashboards — import from the Strimzi GitHub repository:

  • strimzi-kafka.json (broker metrics)
  • strimzi-kafka-exporter.json (consumer lag)
  • strimzi-kafka-connect.json (MM2 connector metrics)

Operational Runbooks

Pause/Resume One Replication Direction

Since both clusters are active, you may need to temporarily pause replication in one direction (e.g. during maintenance on a single cluster).

# Pause proxmox→testing replication (on proxmox cluster MM2)
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -X PUT http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/pause

# Verify paused
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/status | jq '.connector.state'
# Expected: "PAUSED"

# ... perform maintenance ...

# Resume
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -X PUT http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/resume

To pause all replication from a cluster, scale down its MM2 instance:

# Scale down proxmox MM2 (pauses proxmox→testing replication)
kubectl scale kafkamirrormaker2/mm2-proxmox-to-testing --replicas=0 -n rciis-prod

# Scale back up
kubectl scale kafkamirrormaker2/mm2-proxmox-to-testing --replicas=2 -n rciis-prod

Network Partition Recovery

Symptoms: MM2ConnectorDown alert, logs showing "Connection reset by peer" or "Timed out waiting for".

# Diagnose
cilium clustermesh status --context proxmox-cluster
cilium clustermesh status --context testing-cluster

# Test cross-cluster connectivity
kubectl exec -it -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  nc -zv kafka-rciis-testing-kafka-external-bootstrap.rciis-testing.svc.clustermesh.local 9094

# Resolve — if Cilium Cluster Mesh is down, restart API server
kubectl rollout restart deployment/clustermesh-apiserver -n kube-system

# MM2 will automatically resume once connectivity is restored (autoRestart: enabled)
# Monitor catch-up progress:
watch 'kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors/proxmox->testing.MirrorSourceConnector/status | jq'

High Replication Lag Troubleshooting

Symptoms: MM2ReplicationLagHigh alert, lag >10,000 messages.

# Check connector status on both clusters
kubectl exec -n rciis-prod mm2-proxmox-to-testing-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq
kubectl exec -n rciis-testing mm2-testing-to-proxmox-mirrormaker2-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq

# Check MM2 logs for errors
kubectl logs -n rciis-prod -l app=kafka-mirrormaker2 --tail=100 | grep -i error
kubectl logs -n rciis-testing -l app=kafka-mirrormaker2 --tail=100 | grep -i error

Common causes:

| Cause | Resolution |
| --- | --- |
| Burst of messages on source | Wait for the burst to complete; optionally scale MM2 replicas to 3 |
| Network congestion | Check bandwidth utilisation; investigate Cilium Cluster Mesh health |
| Target broker failure | Check broker health: kubectl get pods -n rciis-testing -l strimzi.io/cluster=kafka-rciis-testing |
| Insufficient MM2 task parallelism | Increase tasksMax from 5 to 10 in the KafkaMirrorMaker2 CRD |

Key Files

| File | Description |
| --- | --- |
| apps/rciis/kafka/proxmox/extra/kafka.yaml | Proxmox cluster Kafka CRD (v3.9.0) |
| apps/rciis/kafka/testing/extra/kafka.yaml | Testing cluster Kafka CRD (v4.1.0) |
| apps/rciis/kafka/proxmox/extra/kafka-mirrormaker2.yaml | Proxmox MM2 CRD (proxmox → testing) |
| apps/rciis/kafka/testing/extra/kafka-mirrormaker2.yaml | Testing MM2 CRD (testing → proxmox) |
| apps/rciis/kafka/proxmox/extra/kafka-mm2-user.yaml | Proxmox cluster KafkaUser for MM2 |
| apps/rciis/kafka/testing/extra/kafka-mm2-user.yaml | Testing cluster KafkaUser for MM2 |
| apps/rciis/kafka/proxmox/extra/kafka-mm2-metrics-configmap.yaml | Proxmox JMX metrics ConfigMap |
| apps/rciis/kafka/testing/extra/kafka-mm2-metrics-configmap.yaml | Testing JMX metrics ConfigMap |
| apps/rciis/kafka/proxmox/extra/kafka-mm2-prometheus-rules.yaml | Proxmox PrometheusRule for MM2 alerts |
| apps/rciis/kafka/testing/extra/kafka-mm2-prometheus-rules.yaml | Testing PrometheusRule for MM2 alerts |