# 6.2 Configure Geo Load Balancing
Cloudflare Load Balancing provides automatic traffic failover between the primary EAC DC cluster and the DR cluster in AWS Cape Town. Each RCIIS service hostname has its own load balancer, but all share the same two origin pools.
The load balancer operates in active-passive mode: all traffic goes to the primary pool (EAC DC) unless a health check failure triggers failover to the standby pool (AWS).
## Architecture
```text
             EAC Partner States
┌──────────┬──────────┬──────────┬──────────┐
│  Kenya   │ Tanzania │  Uganda  │  Rwanda  │
│ (Nairobi)│  (DSM)   │ (Kampala)│ (Kigali) │
├──────────┼──────────┼──────────┴──────────┤
│ Burundi  │ DR Congo │     South Sudan     │
│(Bujumbura│(Kinshasa)│       (Juba)        │
└────┬─────┴────┬─────┴────┬────────────────┘
     │          │          │
     │   Public Internet   │
     └──────────┼──────────┘
                │
     ┌──────────┴──────────┐
     │   Cloudflare Edge   │
     │  (Anycast network)  │
     │  DNS + Load Balancer│
     └──────────┬──────────┘
                │
┌───────────────┴───────────────┐
│  Per-Service Load Balancers   │
│  auth.rciis.africa            │
│  keycloak.rciis.africa        │
│  grafana.rciis.africa         │
│  kafka.rciis.africa  ...      │
└───────┬───────────────┬───────┘
        │               │
 ┌──────┴──────┐ ┌──────┴───────┐
 │ Primary Pool│ │ Failover Pool│
 │  (EAC DC)   │ │   (AWS ZA)   │
 │ ┌─────────┐ │ │ ┌──────────┐ │
 │ │Origin:  │ │ │ │Origin:   │ │
 │ │<pub-ip> │ │ │ │NLB DNS   │ │
 │ └─────────┘ │ │ └──────────┘ │
 └──────┬──────┘ └──────┬───────┘
        │               │
 ┌──────┴──────┐ ┌──────┴───────┐
 │  Cilium GW  │ │  Cilium GW   │
 │  (EAC DC)   │ │    (AWS)     │
 │ cert-manager│ │ cert-manager │
 └─────────────┘ └──────────────┘
```
All service hostnames share the same origin pools. The Cilium Gateway on each cluster routes requests to the correct backend pod based on the Host HTTP header.
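On the cluster side, this host-based routing is typically expressed as a Gateway API `HTTPRoute` attached to the Cilium Gateway. The sketch below is illustrative only — the route name, namespace, backend Service name, and port are assumptions, not values from this deployment; the parent Gateway name is inferred from the `cilium-gateway-aws-gateway` Service naming used elsewhere on this page:

```yaml
# Illustrative sketch — names and ports are assumptions, adjust for your cluster.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: auth                # hypothetical route for auth.rciis.africa
  namespace: rciis          # hypothetical application namespace
spec:
  parentRefs:
    - name: aws-gateway     # assumed Gateway name (cf. cilium-gateway-aws-gateway)
      namespace: kube-system
  hostnames:
    - auth.rciis.africa     # Host header the Gateway matches on
  rules:
    - backendRefs:
        - name: auth-service  # hypothetical backend Service
          port: 8080
```

Each service hostname gets its own `HTTPRoute`; the Gateway itself stays shared.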
## Origin Pools
You create two pools — one per environment. All 8 load balancers reference the same two pools.
### Pool 1 — EAC DC (Primary)
Navigate to Traffic > Load Balancing > Pools > Create Pool.
| Setting | Value | Explanation |
|---|---|---|
| Pool name | `eac-dc` | Descriptive name for the on-premise data centre |
| Pool description | `EAC Data Centre - Primary` | |
| Endpoint steering | Random | Only one endpoint in the pool, so steering is irrelevant |
| **Endpoint** | | |
| Endpoint name | `eac-dc-gateway` | Name for the origin server |
| Endpoint address | `<EAC DC public IP>` | The public IP of the EAC DC cluster's ingress (Cilium Gateway) |
| Port | 443 | HTTPS port — TLS terminated by the cluster's cert-manager certificates |
| Weight | (leave empty) | Single origin, no weighting needed |
| Health threshold | 1 | Pool is marked unhealthy if the single endpoint fails |
| Monitor | See Health Checks | Attach after creating the monitor |
| Health check regions | EMEA | Closest Cloudflare PoPs to the EAC region |
### Pool 2 — AWS Cape Town (Failover)
| Setting | Value | Explanation |
|---|---|---|
| Pool name | `aws-af-south-1` | Named after the AWS region |
| Pool description | `AWS South Africa` | |
| Endpoint steering | Random | Single endpoint |
| **Endpoint** | | |
| Endpoint name | `aws-gateway-nlb` | The Cilium Gateway's AWS NLB |
| Endpoint address | `<NLB DNS hostname>` | e.g., `k8s-kubesyst-ciliumga-xxxx.elb.af-south-1.amazonaws.com` — the external address of the `cilium-gateway-aws-gateway` service |
| Port | 443 | |
| Weight | (leave empty) | |
| Health threshold | 1 | |
| Monitor | See Health Checks | |
| Health check regions | EMEA | |
**Finding the AWS NLB hostname**
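One way to retrieve the NLB hostname is to read it off the Gateway's `LoadBalancer` Service. This assumes the Service is named `cilium-gateway-aws-gateway` in `kube-system`, as referenced in the endpoint table above; adjust if your cluster differs:

```shell
# Print the external hostname of the Cilium Gateway's AWS NLB.
# Assumed Service name/namespace — verify with `kubectl get svc -A | grep gateway`.
kubectl get svc cilium-gateway-aws-gateway -n kube-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```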
**Tip**
Each pool contains a single origin since both sites use a single entry point (a public IP for EAC DC, an NLB for AWS). The failover is between pools, not between origins within a pool.
## Health Checks
Health checks (called Monitors in Cloudflare) probe each origin pool to determine availability. When the primary pool fails consecutive checks, Cloudflare automatically routes traffic to the failover pool.
### Monitor Configuration
Navigate to Traffic > Load Balancing > Monitors > Create.
| Setting | Value | Explanation |
|---|---|---|
| Type | TCP | TCP check on port 443 — verifies the Cilium Gateway is accepting connections. Use TCP rather than HTTPS because the health check runs independently of any specific hostname/certificate. |
| Port | 443 | The HTTPS port on the Cilium Gateway |
| Interval | 30s | How often Cloudflare probes the origin |
| Timeout | 10s | How long to wait for a response before marking the check as failed |
| Retries | 3 | Number of retries within a single check before marking it failed |
Attach this monitor to both origin pools.
**Why TCP instead of HTTPS?**
An HTTPS health check requires a specific Host header and valid TLS certificate. Since the same origin pool serves multiple hostnames (auth, grafana, kafka, etc.), a TCP check is simpler and equally effective — if the Cilium Gateway is accepting TCP connections on port 443, all services behind it are reachable. The individual service health is monitored by Kubernetes liveness/readiness probes.
**Bootstrapping caveat**
During initial cluster setup, the TCP health check will fail until the Cilium Gateway is fully programmed with TLS certificates. The Gateway requires cert-manager to issue certificates via DNS-01 challenge, which requires the DNS records to exist. To break this chicken-and-egg cycle:
- Create the load balancers without a monitor first
- Wait for cert-manager to issue certificates (verify with `kubectl get certificates -n kube-system`)
- Verify TCP connectivity: `nc -vz <NLB hostname> 443`
- Attach the monitor to the pools
### Failure Detection Timeline

With the settings above, the worst-case detection time is 3 consecutive failed checks at a 30-second interval — roughly 90 seconds. After the third failure, Cloudflare marks the pool as unhealthy and begins routing traffic to the failover pool.
### Recovery

When a pool comes back online, Cloudflare requires 2 consecutive successful checks (about 60 seconds at a 30-second interval) before marking it healthy again. New traffic then automatically returns to the primary pool; users pinned by a sticky session cookie remain on the failover pool until their session TTL expires (see Session Affinity).
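The ~90-second and ~60-second figures used throughout this page follow directly from the monitor settings; a minimal arithmetic sketch:

```shell
# Detection windows derived from the monitor configuration:
# 30s probe interval, 3 failures to go unhealthy, 2 successes to recover.
INTERVAL=30
FAILURES=3
SUCCESSES=2

echo "worst-case failover detection: ~$(( INTERVAL * FAILURES ))s"   # ~90s
echo "recovery detection:            ~$(( INTERVAL * SUCCESSES ))s"  # ~60s
```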
## Load Balancers
Create one load balancer per service hostname. All load balancers share the same pools and settings — only the hostname differs.
### Create Load Balancers
Navigate to Traffic > Load Balancing > Create Load Balancer for each hostname.
#### Step 1: Hostname
| Setting | Value |
|---|---|
| Hostname | e.g., `auth.rciis.africa` |
| Proxy status | DNS only (grey cloud) |
Why DNS only?
TLS is terminated at the cluster by cert-manager (Let's Encrypt). Enabling Cloudflare's proxy would require separate certificate management at the Cloudflare edge and would conflict with the cluster's certificates. See DNS Zones > Proxied vs DNS-Only for details.
#### Step 2: Origin Pools
Add both pools in priority order:
| Priority | Pool | Role |
|---|---|---|
| 1 | `eac-dc` | Primary — all traffic goes here when healthy |
| 2 | `aws-af-south-1` | Failover — receives traffic when primary is unhealthy |
Set the Fallback pool to aws-af-south-1.
#### Step 3: Monitors
Attach the TCP monitor created in Health Checks to both pools.
#### Step 4: Traffic Steering
| Setting | Value | Explanation |
|---|---|---|
| Steering policy | Off (Failover) | Routes all traffic to the highest-priority healthy pool. No load distribution — strictly active-passive. |
#### Step 5: Session Affinity
| Setting | Value | Explanation |
|---|---|---|
| Session affinity | Enabled | Pins a user to one pool for the session duration |
| Affinity type | By Cloudflare cookie only | Uses a `__cflb` cookie to track pool assignment. Does not use client IP, which avoids issues with users behind shared NAT or changing IPs. |
| Session TTL | 82800 (23 hours) | How long the affinity cookie is valid. Set to ~23 hours so users stay pinned for a full working day, avoiding mid-session pool switches. |
| Endpoint drain duration | 300 (5 minutes) | During failover, existing sessions are drained over 5 minutes. This gives active requests time to complete before connections are moved to the new pool. |
| Zero-downtime failover | Sticky | When the pinned pool goes unhealthy, the user is moved to the new pool and the cookie is updated. The user stays on the new pool even after the original recovers — preventing session flapping. The user only returns to the original pool after the session TTL expires. |
**Why Sticky instead of Temporary?**
Temporary would move the user back to the original pool as soon as it recovers. For RCIIS, this causes problems because:
- Keycloak sessions and tokens are cluster-local — switching back mid-session forces re-authentication
- Kafka consumer offsets may differ between clusters during split-brain recovery
- Database state may not be fully synchronized yet after a failover event
Sticky keeps users on the failover pool until their session naturally expires, giving the operations team time to verify data consistency before traffic returns.
#### Step 6: Adaptive Routing
| Setting | Value | Explanation |
|---|---|---|
| Failover across pools | Enabled | Critical for active-passive failover. Without this, zero-downtime failover only operates between endpoints within a pool. Since each pool has only one endpoint, failover would never trigger. Enabling this allows Cloudflare to fail over from the EAC DC pool to the AWS pool. |
### Hostnames to Create
Repeat the above steps for each hostname, using identical settings:
| # | Hostname |
|---|---|
| 1 | auth.rciis.africa |
| 2 | keycloak.rciis.africa |
| 3 | grafana.rciis.africa |
| 4 | kafka.rciis.africa |
| 5 | gateway.rciis.africa |
| 6 | api.gateway.rciis.africa |
| 7 | flux.rciis.africa |
| 8 | esb.rciis.africa |
**Faster setup via API**
After creating the first load balancer manually, use the Cloudflare API to create the remaining seven. The only parameter that changes is the hostname:
```shell
# List existing LBs to get the configuration
curl -s "https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/load_balancers" \
  -H "Authorization: Bearer <API_TOKEN>" | jq '.result[0]'
```
Then POST the same configuration with a different name field for each hostname.
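The remaining seven can be scripted. The sketch below is a dry run under stated assumptions: the JSON template is a minimal assumed subset of the load balancer fields (a real config captured from `jq '.result[0]'` carries more), and `<ZONE_ID>`, `<API_TOKEN>`, and the pool IDs are placeholders. Remove the leading `echo` to actually send the requests:

```shell
#!/bin/sh
# Dry run: print one curl POST per remaining hostname (auth was created manually).
# The template is an assumed minimal field set — fill in real pool IDs first.
TEMPLATE='{"name":"__HOSTNAME__","fallback_pool":"<aws-pool-id>","default_pools":["<eac-pool-id>","<aws-pool-id>"],"steering_policy":"off","proxied":false,"session_affinity":"cookie","session_affinity_ttl":82800}'

for host in keycloak grafana kafka gateway api.gateway flux esb; do
  body=$(printf '%s' "$TEMPLATE" | sed "s/__HOSTNAME__/${host}.rciis.africa/")
  echo curl -s -X POST "https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/load_balancers" \
    -H "Authorization: Bearer <API_TOKEN>" \
    -H "Content-Type: application/json" \
    --data "$body"
done
```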
## Failover Behaviour
### What Happens When EAC DC Goes Down
```text
Normal:  States → CF LB → Primary Pool (EAC DC) ✓

Failover:
  1. CF TCP health check → EAC DC ✗  (3 failures, ~90s)
  2. CF marks eac-dc pool unhealthy
  3. CF routes traffic → aws-af-south-1 pool ✓
  4. Sticky session cookie updated — user stays on AWS
  5. Ops team runs data layer failover (manual)

Recovery:
  1. EAC DC restored, TCP check passes (2 successes, ~60s)
  2. CF marks eac-dc pool healthy
  3. Existing users stay on AWS (sticky cookie)
  4. New users (no cookie) route to EAC DC
  5. Existing users return to EAC DC after session TTL expires (~23h)
```
### Step-by-Step Failover Sequence

1. **Health check failure detected** — Cloudflare's TCP health checks fail against the EAC DC origin 3 consecutive times over approximately 90 seconds.
2. **Primary pool marked unhealthy** — Cloudflare removes the EAC DC pool from the load balancer rotation.
3. **Traffic shifts to failover pool** — New requests are routed to the AWS NLB. Existing sessions are drained over 5 minutes (endpoint drain duration), then their sticky cookie is updated to point to AWS.
4. **Data layer failover (manual)** — The operations team executes the emergency failover procedure to promote the AWS cluster from standby to active. This includes:
    - Promoting CNPG replicas to primary
    - Scaling up Camel-K consumers
    - Verifying Kafka MirrorMaker2 offset sync
    - Promoting MS SQL replicas
5. **Recovery** — When the EAC DC is restored:
    - The data layer is re-synchronized
    - Cloudflare detects the recovery via TCP health checks (~60 seconds)
    - New users (no cookie or expired cookie) route to EAC DC
    - Users with active sticky cookies stay on AWS until TTL expires
    - Operations team verifies data consistency before considering the failback complete
### RTO Analysis
| Component | Estimated Recovery Time |
|---|---|
| Cloudflare traffic failover | ~90 seconds (automatic) |
| Data layer failover | ~5–10 minutes (manual) |
| Total RTO | ~6–11 minutes |
**Warning**
The Cloudflare layer fails over automatically, but the data layer failover is a manual procedure. Traffic arriving at the AWS cluster before the data layer failover completes will hit a read-only or partially available system. Coordinate both layers during an incident.
## Monitoring & Alerts
### Cloudflare Dashboard
The load balancer analytics dashboard (Traffic > Load Balancing > Analytics) provides:
- Request distribution across pools
- Health check status history
- Failover event timeline
- Latency metrics per origin
### Notifications
Configure pool health notifications under Notifications > Create:
| Notification | Trigger | Channel |
|---|---|---|
| Pool health change | `eac-dc` pool goes unhealthy | Email, webhook |
| Pool health change | `aws-af-south-1` pool goes unhealthy | Email, webhook |
| Load balancer health | All pools unhealthy | Email, webhook, PagerDuty |
**Warning**
If both pools are unhealthy, Cloudflare will still attempt to route traffic to the fallback pool. Ensure alerting is configured so the operations team is immediately notified when any pool health changes.
## Terraform Alternative
If managing Cloudflare via Terraform, the equivalent resources are:
```hcl
resource "cloudflare_load_balancer_pool" "eac_dc" {
  account_id  = var.cloudflare_account_id
  name        = "eac-dc"
  description = "EAC Data Centre - Primary"

  origins {
    name    = "eac-dc-gateway"
    address = var.eac_dc_public_ip
    enabled = true
  }

  monitor            = cloudflare_load_balancer_monitor.tcp_health.id
  notification_email = "[email protected]"
  minimum_origins    = 1
}

resource "cloudflare_load_balancer_pool" "aws_af_south_1" {
  account_id  = var.cloudflare_account_id
  name        = "aws-af-south-1"
  description = "AWS South Africa"

  origins {
    name    = "aws-gateway-nlb"
    address = var.aws_nlb_dns
    enabled = true
  }

  monitor            = cloudflare_load_balancer_monitor.tcp_health.id
  notification_email = "[email protected]"
  minimum_origins    = 1
}

resource "cloudflare_load_balancer_monitor" "tcp_health" {
  account_id  = var.cloudflare_account_id
  type        = "tcp"
  port        = 443
  timeout     = 10
  interval    = 30
  retries     = 3
  description = "RCIIS TCP health check on port 443"
}

# Create one load balancer per service hostname
locals {
  service_hostnames = [
    "auth", "keycloak", "grafana", "kafka",
    "gateway", "api.gateway", "flux", "esb",
  ]
}

resource "cloudflare_load_balancer" "services" {
  for_each = toset(local.service_hostnames)

  zone_id          = var.cloudflare_zone_id
  name             = "${each.key}.rciis.africa"
  fallback_pool_id = cloudflare_load_balancer_pool.aws_af_south_1.id
  default_pool_ids = [
    cloudflare_load_balancer_pool.eac_dc.id,
    cloudflare_load_balancer_pool.aws_af_south_1.id,
  ]

  description          = "RCIIS ${each.key} Load Balancer"
  proxied              = false # DNS only — TLS terminated at the cluster
  steering_policy      = "off" # Failover / priority-based
  session_affinity     = "cookie"
  session_affinity_ttl = 82800 # ~23 hours

  adaptive_routing {
    failover_across_pools = true
  }

  session_affinity_attributes {
    drain_duration         = 300 # 5 minutes
    zero_downtime_failover = "sticky"
    samesite               = "None"
    secure                 = "Always"
  }
}
```

Note that the `monitor` attribute on both pools attaches the TCP monitor, matching the dashboard instruction in Health Checks.
## Related Pages
- Set Up DNS Zones — DNS record configuration and proxy status explanation