
6.2 Configure Geo Load Balancing

Cloudflare Load Balancing provides automatic traffic failover between the primary EAC DC cluster and the DR cluster in AWS Cape Town. Each RCIIS service hostname has its own load balancer, but all share the same two origin pools.

The load balancer operates in active-passive mode: all traffic goes to the primary pool (EAC DC) unless a health check failure triggers failover to the standby pool (AWS).


Architecture

                        EAC Partner States
    ┌──────────┬──────────┬──────────┬──────────┐
    │ Kenya    │ Tanzania │ Uganda   │ Rwanda   │
    │ (Nairobi)│ (DSM)    │ (Kampala)│ (Kigali) │
    ├──────────┼──────────┼──────────┴──────────┤
    │ Burundi  │ DR Congo │ South Sudan         │
    │(Bujumb.) │(Kinshasa)│ (Juba)              │
    └────┬─────┴────┬─────┴────┬────────────────┘
         │          │          │
         │   Public Internet   │
         └──────────┼──────────┘
         ┌──────────┴──────────┐
         │   Cloudflare Edge   │
         │  (Anycast network)  │
         │  DNS + Load Balancer│
         └──────────┬──────────┘
    ┌───────────────┴───────────────┐
    │ Per-Service Load Balancers    │
    │  auth.rciis.africa            │
    │  keycloak.rciis.africa        │
    │  grafana.rciis.africa         │
    │  kafka.rciis.africa   ...     │
    └───────┬───────────────┬───────┘
            │               │
     ┌──────┴──────┐ ┌─────┴────────┐
     │ Primary Pool│ │ Failover Pool│
     │  (EAC DC)   │ │  (AWS ZA)    │
     │ ┌─────────┐ │ │ ┌──────────┐ │
     │ │Origin:  │ │ │ │Origin:   │ │
     │ │<pub-ip> │ │ │ │NLB DNS   │ │
     │ └─────────┘ │ │ └──────────┘ │
     └──────┬──────┘ └──────┬───────┘
            │               │
     ┌──────┴──────┐ ┌─────┴────────┐
     │ Cilium GW   │ │ Cilium GW    │
     │ (EAC DC)    │ │ (AWS)        │
     │ cert-manager│ │ cert-manager │
     └─────────────┘ └──────────────┘

All service hostnames share the same origin pools. The Cilium Gateway on each cluster routes requests to the correct backend pod based on the Host HTTP header.


Origin Pools

Create two pools, one per site. All eight load balancers reference the same two pools.

Pool 1 — EAC DC (Primary)

Navigate to Traffic > Load Balancing > Pools > Create Pool.

| Setting | Value | Explanation |
|---------|-------|-------------|
| Pool name | eac-dc | Descriptive name for the on-premise data centre |
| Pool description | EAC Data Centre - Primary | |
| Endpoint steering | Random | Only one endpoint in the pool, so steering is irrelevant |
| Endpoint | | |
| Endpoint name | eac-dc-gateway | Name for the origin server |
| Endpoint address | <EAC DC public IP> | The public IP of the EAC DC cluster's ingress (Cilium Gateway) |
| Port | 443 | HTTPS port — TLS terminated by the cluster's cert-manager certificates |
| Weight | (leave empty) | Single origin, no weighting needed |
| Health threshold | 1 | Pool is marked unhealthy if the single endpoint fails |
| Monitor | See Health Checks | Attach after creating the monitor |
| Health check regions | EMEA | Closest Cloudflare PoPs to the EAC region |

Pool 2 — AWS Cape Town (Failover)

| Setting | Value | Explanation |
|---------|-------|-------------|
| Pool name | aws-af-south-1 | Named after the AWS region |
| Pool description | AWS South Africa | |
| Endpoint steering | Random | Single endpoint |
| Endpoint | | |
| Endpoint name | aws-gateway-nlb | The Cilium Gateway's AWS NLB |
| Endpoint address | <NLB DNS hostname> | e.g., k8s-kubesyst-ciliumga-xxxx.elb.af-south-1.amazonaws.com — the external address of the cilium-gateway-aws-gateway service |
| Port | 443 | |
| Weight | (leave empty) | |
| Health threshold | 1 | |
| Monitor | See Health Checks | |
| Health check regions | EMEA | |

Finding the AWS NLB hostname

kubectl get svc cilium-gateway-aws-gateway -n kube-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Tip

Each pool contains a single origin since both sites use a single entry point (a public IP for EAC DC, an NLB for AWS). The failover is between pools, not between origins within a pool.
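For scripted setups, the same pool definitions can be created through the Cloudflare API instead of the dashboard. A minimal sketch, assuming the v4 Load Balancing pools endpoint; make_pool_payload is a hypothetical helper, and ACCOUNT_ID/API_TOKEN are placeholders:

```shell
# Hypothetical helper: build the JSON body for a single-origin pool.
# Field names follow the Cloudflare v4 Load Balancing API ("origins" in
# the API corresponds to "Endpoints" in the dashboard); the values mirror
# the tables above (minimum_origins=1, one enabled origin).
make_pool_payload() {
  local name="$1" address="$2" description="$3"
  printf '{"name":"%s","description":"%s","enabled":true,"minimum_origins":1,"origins":[{"name":"%s-origin","address":"%s","enabled":true}]}\n' \
    "$name" "$description" "$name" "$address"
}

# Example: POST the primary pool definition:
#   make_pool_payload eac-dc "<EAC DC public IP>" "EAC Data Centre - Primary" \
#     | curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/load_balancer/pools" \
#         -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data @-
```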


Health Checks

Health checks (called Monitors in Cloudflare) probe each origin pool to determine availability. When the primary pool fails consecutive checks, Cloudflare automatically routes traffic to the failover pool.

Monitor Configuration

Navigate to Traffic > Load Balancing > Monitors > Create.

| Setting | Value | Explanation |
|---------|-------|-------------|
| Type | TCP | TCP check on port 443 — verifies the Cilium Gateway NLB is accepting connections. Use TCP rather than HTTPS because the health check runs independently of any specific hostname/certificate. |
| Port | 443 | The HTTPS port on the Cilium Gateway |
| Interval | 30s | How often Cloudflare probes the origin |
| Timeout | 10s | How long to wait for a response before marking the check as failed |
| Retries | 3 | Number of retries within a single check before marking it failed |

Attach this monitor to both origin pools.

Why TCP instead of HTTPS?

An HTTPS health check requires a specific Host header and valid TLS certificate. Since the same origin pool serves multiple hostnames (auth, grafana, kafka, etc.), a TCP check is simpler and equally effective — if the Cilium Gateway is accepting TCP connections on port 443, all services behind it are reachable. The individual service health is monitored by Kubernetes liveness/readiness probes.
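When debugging a failover, the monitor's behaviour can be approximated locally. A rough sketch (tcp_check is a hypothetical helper, not part of any tooling here; it relies on bash's /dev/tcp and the coreutils timeout command):

```shell
# Rough local equivalent of the TCP monitor: attempt a TCP connection
# with a timeout and report healthy/unhealthy, nothing more.
tcp_check() {
  local host="$1" port="$2" timeout_s="${3:-10}"
  if timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}
```

Running `tcp_check <NLB hostname> 443` with the default 10 s timeout mirrors what Cloudflare's probe does from its EMEA PoPs.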

Bootstrapping caveat

During initial cluster setup, the TCP health check will fail until the Cilium Gateway is fully programmed with TLS certificates. The Gateway requires cert-manager to issue certificates via DNS-01 challenge, which requires the DNS records to exist. To break this chicken-and-egg cycle:

  1. Create the load balancers without a monitor first
  2. Wait for cert-manager to issue certificates (verify with kubectl get certificates -n kube-system)
  3. Verify TCP connectivity: nc -vz <NLB hostname> 443
  4. Attach the monitor to the pools

Failure Detection Timeline

With the above settings, the worst-case detection time is:

3 consecutive failed checks × 30s interval = ~90 seconds

After 3 consecutive failures, Cloudflare marks the pool as unhealthy and begins routing traffic to the failover pool.

Recovery

When a pool comes back online, Cloudflare requires 2 consecutive successful checks before marking it healthy again:

2 successes × 30s interval = ~60 seconds

Traffic automatically returns to the primary pool once it is marked healthy.
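Both windows follow directly from the monitor settings; as a sanity check, in plain shell arithmetic (values taken from the monitor table above):

```shell
# Detection and recovery windows implied by the monitor configuration:
# checks run every 30 s; 3 consecutive failures mark a pool unhealthy,
# 2 consecutive successes mark it healthy again.
interval=30
failures_to_unhealthy=3
successes_to_healthy=2
echo "worst-case detection: $((failures_to_unhealthy * interval))s"
echo "recovery:             $((successes_to_healthy * interval))s"
```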


Load Balancers

Create one load balancer per service hostname. All load balancers share the same pools and settings — only the hostname differs.

Create Load Balancers

Navigate to Traffic > Load Balancing > Create Load Balancer for each hostname.

Step 1: Hostname

| Setting | Value |
|---------|-------|
| Hostname | e.g., auth.rciis.africa |
| Proxy status | DNS only (grey cloud) |

Why DNS only?

TLS is terminated at the cluster by cert-manager (Let's Encrypt). Enabling Cloudflare's proxy would require separate certificate management at the Cloudflare edge and would conflict with the cluster's certificates. See DNS Zones > Proxied vs DNS-Only for details.

Step 2: Origin Pools

Add both pools in priority order:

| Priority | Pool | Role |
|----------|------|------|
| 1 | eac-dc | Primary — all traffic goes here when healthy |
| 2 | aws-af-south-1 | Failover — receives traffic when primary is unhealthy |

Set the Fallback pool to aws-af-south-1.

Step 3: Monitors

Attach the TCP monitor created in Health Checks to both pools.

Step 4: Traffic Steering

| Setting | Value | Explanation |
|---------|-------|-------------|
| Steering policy | Off (Failover) | Routes all traffic to the highest-priority healthy pool. No load distribution — strictly active-passive. |

Step 5: Session Affinity

| Setting | Value | Explanation |
|---------|-------|-------------|
| Session affinity | Enabled | Pins a user to one pool for the session duration |
| Affinity type | By Cloudflare cookie only | Uses a __cflb cookie to track pool assignment. Does not use client IP, which avoids issues with users behind shared NAT or changing IPs. |
| Session TTL | 82800 (23 hours) | How long the affinity cookie is valid. Set to ~23 hours so users stay pinned for a full working day, avoiding mid-session pool switches. |
| Endpoint drain duration | 300 (5 minutes) | During failover, existing sessions are drained over 5 minutes. This gives active requests time to complete before connections are moved to the new pool. |
| Zero-downtime failover | Sticky | When the pinned pool goes unhealthy, the user is moved to the new pool and the cookie is updated. The user stays on the new pool even after the original recovers — preventing session flapping. The user only returns to the original pool after the session TTL expires. |

Why Sticky instead of Temporary?

Temporary would move the user back to the original pool as soon as it recovers. For RCIIS, this causes problems because:

  • Keycloak sessions and tokens are cluster-local — switching back mid-session forces re-authentication
  • Kafka consumer offsets may differ between clusters during split-brain recovery
  • Database state may not be fully synchronized yet after a failover event

Sticky keeps users on the failover pool until their session naturally expires, giving the operations team time to verify data consistency before traffic returns.

Step 6: Adaptive Routing

| Setting | Value | Explanation |
|---------|-------|-------------|
| Failover across pools | Enabled | Critical for active-passive failover. Without this, zero-downtime failover only operates between endpoints within a pool. Since each pool has only one endpoint, failover would never trigger. Enabling this allows Cloudflare to fail over from the EAC DC pool to the AWS pool. |

Hostnames to Create

Repeat the above steps for each hostname, using identical settings:

| # | Hostname |
|---|----------|
| 1 | auth.rciis.africa |
| 2 | keycloak.rciis.africa |
| 3 | grafana.rciis.africa |
| 4 | kafka.rciis.africa |
| 5 | gateway.rciis.africa |
| 6 | api.gateway.rciis.africa |
| 7 | flux.rciis.africa |
| 8 | esb.rciis.africa |

Faster setup via API

After creating the first load balancer manually, use the Cloudflare API to create the remaining seven. The only parameter that changes is the hostname:

# List existing LBs to get the configuration
curl -s "https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/load_balancers" \
  -H "Authorization: Bearer <API_TOKEN>" | jq '.result[0]'

Then POST the same configuration with a different name field for each hostname.
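One way to script that step, sketched under the assumption that jq is available: strip the server-assigned fields from the template and rewrite the hostname before POSTing. set_lb_hostname is a hypothetical helper, and lb-template.json / ZONE_ID / API_TOKEN are placeholders:

```shell
# Hypothetical helper: take an existing load balancer's JSON on stdin,
# drop the server-assigned fields, and rewrite name/description for a
# new hostname (e.g. "keycloak" -> keycloak.rciis.africa).
set_lb_hostname() {
  jq --arg host "$1" \
     'del(.id, .created_on, .modified_on)
      | .name = ($host + ".rciis.africa")
      | .description = ("RCIIS " + $host + " Load Balancer")'
}

# Example: clone the saved template for the remaining hostnames:
#   for host in keycloak grafana kafka gateway api.gateway flux esb; do
#     jq '.result[0]' lb-template.json | set_lb_hostname "$host" \
#       | curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/load_balancers" \
#           -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data @-
#   done
```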


Failover Behaviour

What Happens When EAC DC Goes Down

Normal:  States → CF LB → Primary Pool (EAC DC) ✓

Failover:
  1. CF TCP health check → EAC DC ✗ (3 failures, ~90s)
  2. CF marks eac-dc pool unhealthy
  3. CF routes traffic → aws-af-south-1 pool ✓
  4. Sticky session cookie updated — user stays on AWS
  5. Ops team runs data layer failover (manual)

Recovery:
  1. EAC DC restored, TCP check passes (2 successes, ~60s)
  2. CF marks eac-dc pool healthy
  3. Existing users stay on AWS (sticky cookie)
  4. New users (no cookie) route to EAC DC
  5. Existing users return to EAC DC after session TTL expires (~23h)

Step-by-Step Failover Sequence

  1. Health check failure detected — Cloudflare's TCP health checks fail against the EAC DC origin 3 consecutive times over approximately 90 seconds.

  2. Primary pool marked unhealthy — Cloudflare removes the EAC DC pool from the load balancer rotation.

  3. Traffic shifts to failover pool — New requests are routed to the AWS NLB. Existing sessions are drained over 5 minutes (endpoint drain duration), then their sticky cookie is updated to point to AWS.

  4. Data layer failover (manual) — The operations team executes the emergency failover procedure to promote the AWS cluster from standby to active. This includes:

    • Promoting CNPG replicas to primary
    • Scaling up Camel-K consumers
    • Verifying Kafka MirrorMaker2 offset sync
    • Promoting MS SQL replicas
  5. Recovery — When the EAC DC is restored:

    • The data layer is re-synchronized
    • Cloudflare detects the recovery via TCP health checks (~60 seconds)
    • New users (no cookie or expired cookie) route to EAC DC
    • Users with active sticky cookies stay on AWS until TTL expires
    • Operations team verifies data consistency before considering the failback complete

RTO Analysis

| Component | Estimated Recovery Time |
|-----------|-------------------------|
| Cloudflare traffic failover | ~90 seconds (automatic) |
| Data layer failover | ~5–10 minutes (manual) |
| Total RTO | ~6–11 minutes |
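The total is simple addition of the two layers; a quick check of the bounds (values from the estimates above, rounded down to whole minutes):

```shell
# Total RTO = automatic traffic failover + manual data-layer failover.
cf_failover_s=90
data_layer_min_s=$((5 * 60))
data_layer_max_s=$((10 * 60))
echo "RTO lower bound: $(( (cf_failover_s + data_layer_min_s) / 60 )) min"
echo "RTO upper bound: $(( (cf_failover_s + data_layer_max_s) / 60 )) min"
```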

Warning

The Cloudflare layer fails over automatically, but the data layer failover is a manual procedure. Traffic arriving at the AWS cluster before the data layer failover completes will hit a read-only or partially available system. Coordinate both layers during an incident.


Monitoring & Alerts

Cloudflare Dashboard

The load balancer analytics dashboard (Traffic > Load Balancing > Analytics) provides:

  • Request distribution across pools
  • Health check status history
  • Failover event timeline
  • Latency metrics per origin

Notifications

Configure pool health notifications under Notifications > Create:

| Notification | Trigger | Channel |
|--------------|---------|---------|
| Pool health change | eac-dc pool goes unhealthy | Email, webhook |
| Pool health change | aws-af-south-1 pool goes unhealthy | Email, webhook |
| Load balancer health | All pools unhealthy | Email, webhook, PagerDuty |

Warning

If both pools are unhealthy, Cloudflare will still attempt to route traffic to the fallback pool. Ensure alerting is configured so the operations team is immediately notified when any pool health changes.


Terraform Alternative

If managing Cloudflare via Terraform, the equivalent resources are:

resource "cloudflare_load_balancer_pool" "eac_dc" {
  account_id  = var.cloudflare_account_id
  name        = "eac-dc"
  description = "EAC Data Centre - Primary"

  origins {
    name    = "eac-dc-gateway"
    address = var.eac_dc_public_ip
    enabled = true
  }

  monitor            = cloudflare_load_balancer_monitor.tcp_health.id  # attach the TCP monitor
  notification_email = "[email protected]"
  minimum_origins    = 1
}

resource "cloudflare_load_balancer_pool" "aws_af_south_1" {
  account_id  = var.cloudflare_account_id
  name        = "aws-af-south-1"
  description = "AWS South Africa"

  origins {
    name    = "aws-gateway-nlb"
    address = var.aws_nlb_dns
    enabled = true
  }

  monitor            = cloudflare_load_balancer_monitor.tcp_health.id  # attach the TCP monitor
  notification_email = "[email protected]"
  minimum_origins    = 1
}

resource "cloudflare_load_balancer_monitor" "tcp_health" {
  account_id  = var.cloudflare_account_id
  type        = "tcp"
  port        = 443
  timeout     = 10
  interval    = 30
  retries     = 3
  description = "RCIIS TCP health check on port 443"
}

# Create one load balancer per service hostname
locals {
  service_hostnames = [
    "auth", "keycloak", "grafana", "kafka",
    "gateway", "api.gateway", "flux", "esb",
  ]
}

resource "cloudflare_load_balancer" "services" {
  for_each = toset(local.service_hostnames)

  zone_id          = var.cloudflare_zone_id
  name             = "${each.key}.rciis.africa"
  fallback_pool_id = cloudflare_load_balancer_pool.aws_af_south_1.id
  default_pool_ids = [
    cloudflare_load_balancer_pool.eac_dc.id,
    cloudflare_load_balancer_pool.aws_af_south_1.id,
  ]
  description     = "RCIIS ${each.key} Load Balancer"
  proxied         = false  # DNS only — TLS terminated at the cluster
  steering_policy = "off"  # Failover / priority-based

  session_affinity     = "cookie"
  session_affinity_ttl = 82800  # ~23 hours

  adaptive_routing {
    failover_across_pools = true
  }

  session_affinity_attributes {
    drain_duration         = 300    # 5 minutes
    zero_downtime_failover = "sticky"
    samesite               = "None"
    secure                 = "Always"
  }
}