
6.2 Configure Geo Load Balancing

Cloudflare Load Balancing provides automatic traffic failover between the primary EAC DC cluster and the DR cluster in AWS Cape Town. Each RCIIS service hostname has its own load balancer, but all share the same two origin pools.

The load balancer operates in active-passive mode: all traffic goes to the primary pool (EAC DC) unless a health check failure triggers failover to the standby pool (AWS).


Architecture

                        EAC Partner States
    ┌──────────┬──────────┬──────────┬──────────┐
    │ Kenya    │ Tanzania │ Uganda   │ Rwanda   │
    │ (Nairobi)│ (DSM)    │ (Kampala)│ (Kigali) │
    ├──────────┼──────────┼──────────┴──────────┤
    │ Burundi  │ DR Congo │ South Sudan         │
    │(Bujumb.) │(Kinshasa)│ (Juba)              │
    └────┬─────┴────┬─────┴────┬────────────────┘
         │          │          │
         │   Public Internet   │
         └──────────┼──────────┘
         ┌──────────┴──────────┐
         │   Cloudflare Edge   │
         │  (Anycast network)  │
         │  DNS + Load Balancer│
         └──────────┬──────────┘
    ┌───────────────┴───────────────┐
    │ Per-Service Load Balancers    │
    │  auth.rciis.africa            │
    │  keycloak.rciis.africa        │
    │  grafana.rciis.africa         │
    │  kafka.rciis.africa   ...     │
    └───────┬───────────────┬───────┘
            │               │
     ┌──────┴──────┐ ┌─────┴────────┐
     │ Primary Pool│ │ Failover Pool│
     │  (EAC DC)   │ │  (AWS ZA)    │
     │ ┌─────────┐ │ │ ┌──────────┐ │
     │ │Origin:  │ │ │ │Origin:   │ │
     │ │<pub-ip> │ │ │ │NLB DNS   │ │
     │ └─────────┘ │ │ └──────────┘ │
     └──────┬──────┘ └──────┬───────┘
            │               │
     ┌──────┴──────┐ ┌─────┴────────┐
     │ Cilium GW   │ │ Cilium GW    │
     │ (EAC DC)    │ │ (AWS)        │
     │ cert-manager│ │ cert-manager │
     └─────────────┘ └──────────────┘

All service hostnames share the same origin pools. The Cilium Gateway on each cluster routes requests to the correct backend pod based on the Host HTTP header.


Origin Pools

Create two pools, one per site. All eight load balancers reference the same two pools.

Pool 1 — EAC DC (Primary)

Navigate to Traffic > Load Balancing > Pools > Create Pool.

| Setting | Value | Explanation |
|---------|-------|-------------|
| Pool name | eac-dc | Descriptive name for the on-premise data centre |
| Pool description | EAC Data Centre - Primary | |
| Endpoint steering | Random | Only one endpoint in the pool, so steering is irrelevant |
| Endpoint | | |
| Endpoint name | eac-dc-gateway | Name for the origin server |
| Endpoint address | <EAC DC public IP> | The public IP of the EAC DC cluster's ingress (Cilium Gateway) |
| Port | 443 | HTTPS port — TLS terminated by the cluster's cert-manager certificates |
| Weight | (leave empty) | Single origin, no weighting needed |
| Health threshold | 1 | Pool is marked unhealthy if the single endpoint fails |
| Monitor | See Health Checks | Attach after creating the monitor |
| Health check regions | EMEA | Closest Cloudflare PoPs to the EAC region |

Pool 2 — AWS Cape Town (Failover)

| Setting | Value | Explanation |
|---------|-------|-------------|
| Pool name | aws-af-south-1 | Named after the AWS region |
| Pool description | AWS South Africa | |
| Endpoint steering | Random | Single endpoint |
| Endpoint | | |
| Endpoint name | aws-gateway-nlb | The Cilium Gateway's AWS NLB |
| Endpoint address | <NLB DNS hostname> | e.g., k8s-kubesyst-ciliumga-xxxx.elb.af-south-1.amazonaws.com — the external address of the cilium-gateway-aws-gateway service |
| Port | 443 | |
| Weight | (leave empty) | |
| Health threshold | 1 | |
| Monitor | See Health Checks | |
| Health check regions | EMEA | |

Finding the AWS NLB hostname

kubectl get svc cilium-gateway-aws-gateway -n kube-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Tip

Each pool contains a single origin since both sites use a single entry point (a public IP for EAC DC, an NLB for AWS). The failover is between pools, not between origins within a pool.
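For scripted setups, the same pool definitions can be created through the Cloudflare API instead of the dashboard. A minimal sketch, assuming the v4 Load Balancing pools endpoint; make_pool_payload is a hypothetical helper, and ACCOUNT_ID/API_TOKEN are placeholders:

```shell
# Hypothetical helper: build the JSON body for a single-origin pool.
# Field names follow the Cloudflare v4 Load Balancing API ("origins" in
# the API corresponds to "Endpoints" in the dashboard); the values mirror
# the tables above (minimum_origins=1, one enabled origin).
make_pool_payload() {
  local name="$1" address="$2" description="$3"
  printf '{"name":"%s","description":"%s","enabled":true,"minimum_origins":1,"origins":[{"name":"%s-origin","address":"%s","enabled":true}]}\n' \
    "$name" "$description" "$name" "$address"
}

# Example: POST the primary pool definition:
#   make_pool_payload eac-dc "<EAC DC public IP>" "EAC Data Centre - Primary" \
#     | curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/load_balancer/pools" \
#         -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data @-
```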


Health Checks

Health checks (called Monitors in Cloudflare) probe each origin pool to determine availability. When the primary pool fails consecutive checks, Cloudflare automatically routes traffic to the failover pool.

Monitor Configuration

Navigate to Traffic > Load Balancing > Monitors > Create.

| Setting | Value | Explanation |
|---------|-------|-------------|
| Type | TCP | TCP check on port 443 — verifies the Cilium Gateway NLB is accepting connections. Use TCP rather than HTTPS because the health check runs independently of any specific hostname/certificate. |
| Port | 443 | The HTTPS port on the Cilium Gateway |
| Interval | 30s | How often Cloudflare probes the origin |
| Timeout | 10s | How long to wait for a response before marking the check as failed |
| Retries | 3 | Number of retries within a single check before marking it failed |

Attach this monitor to both origin pools.

Why TCP instead of HTTPS?

An HTTPS health check requires a specific Host header and valid TLS certificate. Since the same origin pool serves multiple hostnames (auth, grafana, kafka, etc.), a TCP check is simpler and equally effective — if the Cilium Gateway is accepting TCP connections on port 443, all services behind it are reachable. The individual service health is monitored by Kubernetes liveness/readiness probes.
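When debugging a failover, the monitor's behaviour can be approximated locally. A rough sketch (tcp_check is a hypothetical helper, not part of any tooling here; it relies on bash's /dev/tcp and the coreutils timeout command):

```shell
# Rough local equivalent of the TCP monitor: attempt a TCP connection
# with a timeout and report healthy/unhealthy, nothing more.
tcp_check() {
  local host="$1" port="$2" timeout_s="${3:-10}"
  if timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}
```

Running `tcp_check <NLB hostname> 443` with the default 10 s timeout mirrors what Cloudflare's probe does from its EMEA PoPs.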

Bootstrapping caveat

During initial cluster setup, the TCP health check will fail until the Cilium Gateway is fully programmed with TLS certificates. The Gateway requires cert-manager to issue certificates via DNS-01 challenge, which requires the DNS records to exist. To break this chicken-and-egg cycle:

  1. Create the load balancers without a monitor first
  2. Wait for cert-manager to issue certificates (verify with kubectl get certificates -n kube-system)
  3. Verify TCP connectivity: nc -vz <NLB hostname> 443
  4. Attach the monitor to the pools

Failure Detection Timeline

With the above settings, the worst-case detection time is:

3 consecutive failed checks × 30s interval = ~90 seconds

After 3 consecutive failures, Cloudflare marks the pool as unhealthy and begins routing traffic to the failover pool.

Recovery

When a pool comes back online, Cloudflare requires 2 consecutive successful checks before marking it healthy again:

2 successes × 30s interval = ~60 seconds

Traffic automatically returns to the primary pool once it is marked healthy.
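Both windows follow directly from the monitor settings; as a sanity check, in plain shell arithmetic (values taken from the monitor table above):

```shell
# Detection and recovery windows implied by the monitor configuration:
# checks run every 30 s; 3 consecutive failures mark a pool unhealthy,
# 2 consecutive successes mark it healthy again.
interval=30
failures_to_unhealthy=3
successes_to_healthy=2
echo "worst-case detection: $((failures_to_unhealthy * interval))s"
echo "recovery:             $((successes_to_healthy * interval))s"
```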


Load Balancers

Create one load balancer per service hostname. All load balancers share the same pools and settings — only the hostname differs.

Create Load Balancers

Navigate to Traffic > Load Balancing > Create Load Balancer for each hostname.

Step 1: Hostname

| Setting | Value |
|---------|-------|
| Hostname | e.g., auth.rciis.africa |
| Proxy status | DNS only (grey cloud) |

Why DNS only?

TLS is terminated at the cluster by cert-manager (Let's Encrypt). Enabling Cloudflare's proxy would require separate certificate management at the Cloudflare edge and would conflict with the cluster's certificates. See DNS Zones > Proxied vs DNS-Only for details.

Step 2: Origin Pools

Add both pools in priority order:

| Priority | Pool | Role |
|----------|------|------|
| 1 | eac-dc | Primary — all traffic goes here when healthy |
| 2 | aws-af-south-1 | Failover — receives traffic when primary is unhealthy |

Set the Fallback pool to aws-af-south-1.

Step 3: Monitors

Attach the TCP monitor created in Health Checks to both pools.

Step 4: Traffic Steering

| Setting | Value | Explanation |
|---------|-------|-------------|
| Steering policy | Off (Failover) | Routes all traffic to the highest-priority healthy pool. No load distribution — strictly active-passive. |

Step 5: Session Affinity

| Setting | Value | Explanation |
|---------|-------|-------------|
| Session affinity | Enabled | Pins a user to one pool for the session duration |
| Affinity type | By Cloudflare cookie only | Uses a __cflb cookie to track pool assignment. Does not use client IP, which avoids issues with users behind shared NAT or changing IPs. |
| Session TTL | 82800 (23 hours) | How long the affinity cookie is valid. Set to ~23 hours so users stay pinned for a full working day, avoiding mid-session pool switches. |
| Endpoint drain duration | 300 (5 minutes) | During failover, existing sessions are drained over 5 minutes. This gives active requests time to complete before connections are moved to the new pool. |
| Zero-downtime failover | Sticky | When the pinned pool goes unhealthy, the user is moved to the new pool and the cookie is updated. The user stays on the new pool even after the original recovers — preventing session flapping. The user only returns to the original pool after the session TTL expires. |

Why Sticky instead of Temporary?

Temporary would move the user back to the original pool as soon as it recovers. For RCIIS, this causes problems because:

  • Keycloak sessions and tokens are cluster-local — switching back mid-session forces re-authentication
  • Kafka consumer offsets may differ between clusters during split-brain recovery
  • Database state may not be fully synchronized yet after a failover event

Sticky keeps users on the failover pool until their session naturally expires, giving the operations team time to verify data consistency before traffic returns.

Step 6: Adaptive Routing

| Setting | Value | Explanation |
|---------|-------|-------------|
| Failover across pools | Enabled | Critical for active-passive failover. Without this, zero-downtime failover only operates between endpoints within a pool. Since each pool has only one endpoint, failover would never trigger. Enabling this allows Cloudflare to fail over from the EAC DC pool to the AWS pool. |

Hostnames to Create

Repeat the above steps for each hostname, using identical settings:

| # | Hostname |
|---|----------|
| 1 | auth.rciis.africa |
| 2 | keycloak.rciis.africa |
| 3 | grafana.rciis.africa |
| 4 | kafka.rciis.africa |
| 5 | gateway.rciis.africa |
| 6 | api.gateway.rciis.africa |
| 7 | flux.rciis.africa |
| 8 | esb.rciis.africa |

Faster setup via API

After creating the first load balancer manually, use the Cloudflare API to create the remaining seven. The only parameter that changes is the hostname:

# List existing LBs to get the configuration
curl -s "https://api.cloudflare.com/client/v4/zones/<ZONE_ID>/load_balancers" \
  -H "Authorization: Bearer <API_TOKEN>" | jq '.result[0]'

Then POST the same configuration with a different name field for each hostname.
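One way to script that step, sketched under the assumption that jq is available: strip the server-assigned fields from the template and rewrite the hostname before POSTing. set_lb_hostname is a hypothetical helper, and lb-template.json / ZONE_ID / API_TOKEN are placeholders:

```shell
# Hypothetical helper: take an existing load balancer's JSON on stdin,
# drop the server-assigned fields, and rewrite name/description for a
# new hostname (e.g. "keycloak" -> keycloak.rciis.africa).
set_lb_hostname() {
  jq --arg host "$1" \
     'del(.id, .created_on, .modified_on)
      | .name = ($host + ".rciis.africa")
      | .description = ("RCIIS " + $host + " Load Balancer")'
}

# Example: clone the saved template for the remaining hostnames:
#   for host in keycloak grafana kafka gateway api.gateway flux esb; do
#     jq '.result[0]' lb-template.json | set_lb_hostname "$host" \
#       | curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/load_balancers" \
#           -H "Authorization: Bearer $API_TOKEN" -H "Content-Type: application/json" --data @-
#   done
```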


Failover Behaviour

What Happens When EAC DC Goes Down

Normal:  States → CF LB → Primary Pool (EAC DC) ✓

Failover:
  1. CF TCP health check → EAC DC ✗ (3 failures, ~90s)
  2. CF marks eac-dc pool unhealthy
  3. CF routes traffic → aws-af-south-1 pool ✓
  4. Sticky session cookie updated — user stays on AWS
  5. Ops team runs data layer failover (manual)

Recovery:
  1. EAC DC restored, TCP check passes (2 successes, ~60s)
  2. CF marks eac-dc pool healthy
  3. Existing users stay on AWS (sticky cookie)
  4. New users (no cookie) route to EAC DC
  5. Existing users return to EAC DC after session TTL expires (~23h)

Step-by-Step Failover Sequence

  1. Health check failure detected — Cloudflare's TCP health checks fail against the EAC DC origin 3 consecutive times over approximately 90 seconds.

  2. Primary pool marked unhealthy — Cloudflare removes the EAC DC pool from the load balancer rotation.

  3. Traffic shifts to failover pool — New requests are routed to the AWS NLB. Existing sessions are drained over 5 minutes (endpoint drain duration), then their sticky cookie is updated to point to AWS.

  4. Data layer failover (manual) — The operations team executes the emergency failover procedure to promote the AWS cluster from standby to active. This includes:

    • Promoting CNPG replicas to primary
    • Scaling up Camel-K consumers
    • Verifying Kafka MirrorMaker2 offset sync
    • Promoting MS SQL replicas
  5. Recovery — When the EAC DC is restored:

    • The data layer is re-synchronized
    • Cloudflare detects the recovery via TCP health checks (~60 seconds)
    • New users (no cookie or expired cookie) route to EAC DC
    • Users with active sticky cookies stay on AWS until TTL expires
    • Operations team verifies data consistency before considering the failback complete

RTO Analysis

| Component | Estimated Recovery Time |
|-----------|-------------------------|
| Cloudflare traffic failover | ~90 seconds (automatic) |
| Data layer failover | ~5–10 minutes (manual) |
| Total RTO | ~6–11 minutes |
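The total is simple addition of the two layers; a quick check of the bounds (values from the estimates above, rounded down to whole minutes):

```shell
# Total RTO = automatic traffic failover + manual data-layer failover.
cf_failover_s=90
data_layer_min_s=$((5 * 60))
data_layer_max_s=$((10 * 60))
echo "RTO lower bound: $(( (cf_failover_s + data_layer_min_s) / 60 )) min"
echo "RTO upper bound: $(( (cf_failover_s + data_layer_max_s) / 60 )) min"
```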

Warning

The Cloudflare layer fails over automatically, but the data layer failover is a manual procedure. Traffic arriving at the AWS cluster before the data layer failover completes will hit a read-only or partially available system. Coordinate both layers during an incident.


Monitoring & Alerts

Cloudflare Dashboard

The load balancer analytics dashboard (Traffic > Load Balancing > Analytics) provides:

  • Request distribution across pools
  • Health check status history
  • Failover event timeline
  • Latency metrics per origin

Notifications

Configure pool health notifications under Notifications > Create:

| Notification | Trigger | Channel |
|--------------|---------|---------|
| Pool health change | eac-dc pool goes unhealthy | Email, webhook |
| Pool health change | aws-af-south-1 pool goes unhealthy | Email, webhook |
| Load balancer health | All pools unhealthy | Email, webhook, PagerDuty |

Warning

If both pools are unhealthy, Cloudflare will still attempt to route traffic to the fallback pool. Ensure alerting is configured so the operations team is immediately notified when any pool health changes.


Terraform Alternative

If managing Cloudflare via Terraform, the equivalent resources are:

resource "cloudflare_load_balancer_pool" "eac_dc" {
  account_id  = var.cloudflare_account_id
  name        = "eac-dc"
  description = "EAC Data Centre - Primary"

  origins {
    name    = "eac-dc-gateway"
    address = var.eac_dc_public_ip
    enabled = true
  }

  monitor            = cloudflare_load_balancer_monitor.tcp_health.id  # attach the TCP monitor
  notification_email = "[email protected]"
  minimum_origins    = 1
}

resource "cloudflare_load_balancer_pool" "aws_af_south_1" {
  account_id  = var.cloudflare_account_id
  name        = "aws-af-south-1"
  description = "AWS South Africa"

  origins {
    name    = "aws-gateway-nlb"
    address = var.aws_nlb_dns
    enabled = true
  }

  monitor            = cloudflare_load_balancer_monitor.tcp_health.id  # attach the TCP monitor
  notification_email = "[email protected]"
  minimum_origins    = 1
}

resource "cloudflare_load_balancer_monitor" "tcp_health" {
  account_id  = var.cloudflare_account_id
  type        = "tcp"
  port        = 443
  timeout     = 10
  interval    = 30
  retries     = 3
  description = "RCIIS TCP health check on port 443"
}

# Create one load balancer per service hostname
locals {
  service_hostnames = [
    "auth", "keycloak", "grafana", "kafka",
    "gateway", "api.gateway", "flux", "esb",
  ]
}

resource "cloudflare_load_balancer" "services" {
  for_each = toset(local.service_hostnames)

  zone_id          = var.cloudflare_zone_id
  name             = "${each.key}.rciis.africa"
  fallback_pool_id = cloudflare_load_balancer_pool.aws_af_south_1.id
  default_pool_ids = [
    cloudflare_load_balancer_pool.eac_dc.id,
    cloudflare_load_balancer_pool.aws_af_south_1.id,
  ]
  description     = "RCIIS ${each.key} Load Balancer"
  proxied         = false  # DNS only — TLS terminated at the cluster
  steering_policy = "off"  # Failover / priority-based

  session_affinity     = "cookie"
  session_affinity_ttl = 82800  # ~23 hours

  adaptive_routing {
    failover_across_pools = true
  }

  session_affinity_attributes {
    drain_duration         = 300    # 5 minutes
    zero_downtime_failover = "sticky"
    samesite               = "None"
    secure                 = "Always"
  }
}