
SRE & Observability Concepts

This page explains the Site Reliability Engineering (SRE) and observability concepts used throughout this documentation. It is intended as a learning resource for platform operators who are deploying and managing the RCIIS infrastructure.


Service Levels: SLI, SLO, SLA

These three terms form a hierarchy that defines, measures, and contractually guarantees service reliability.

Service Level Indicator (SLI)

An SLI is a quantitative measurement of a specific aspect of the service. It answers: "How is the service performing right now?"

SLIs are expressed as ratios — good events divided by total events — yielding a percentage:

SLI = (good events / total events) × 100%
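As a quick sanity check, the ratio can be computed directly (a minimal sketch; the event counts are illustrative):

```python
def sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage of good events."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally treated as meeting the SLI
    return 100.0 * good_events / total_events

# 9,994 non-5xx responses out of 10,000 requests
print(sli(9_994, 10_000))  # → 99.94
```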

RCIIS examples:

| SLI | Measurement | Good Event | Source |
|---|---|---|---|
| API availability | HTTP requests to the RCIIS ESB gateway | Response status is not 5xx | APISIX / ingress-nginx metrics |
| Request latency | HTTP requests to the Kubernetes API | Response time < 500ms | kube-apiserver metrics |
| Data freshness | Kafka consumer lag | Lag < 1000 messages | Strimzi / Kafka exporter metrics |
| Database availability | PostgreSQL health checks | Connection succeeds and query returns | CloudNativePG metrics |

Service Level Objective (SLO)

An SLO is a target value for an SLI over a rolling time window. It answers: "How reliable should the service be?"

SLO: 99.9% of API requests return a non-5xx response over a 30-day window
      ^^^^                                              ^^^^^^^^^^^^^^^^^
      target                                            time window

SLOs are internal commitments — the team agrees to maintain this level of reliability. They are deliberately set below 100% because perfect reliability is neither achievable nor cost-effective.

RCIIS examples:

| SLI | SLO Target | Window | Meaning |
|---|---|---|---|
| API availability | 99.9% | 30 days | At most 43 minutes of downtime per month |
| Request latency (P99) | < 500ms | 30 days | 99% of requests complete in under 500ms |
| Kafka consumer lag | < 1000 messages | 1 hour | Data propagation stays near real-time |

Service Level Agreement (SLA)

An SLA is a contractual commitment — a legal document between a service provider and its customers that defines consequences (refunds, penalties, escalation) if the SLO is not met.

SLI → SLO → SLA  (measurement → internal target → contractual guarantee)

SLI: "We measure API availability"
SLO: "We target 99.9% availability"
SLA: "We guarantee 99.5% availability; if breached, partner states may escalate"

SLAs are always set looser than SLOs to provide a safety margin. If the SLO is 99.9%, the SLA might guarantee 99.5%.

Practical guidance

Start by defining SLIs (what to measure), then set SLOs (internal targets). Only create SLAs when contractual obligations with partner states require them. Most teams operate effectively with SLIs and SLOs alone.


Error Budgets & Burn Rate

Error Budget

The error budget is the complement of the SLO — the amount of unreliability you can tolerate before breaching the objective.

Error budget = 100% - SLO target

Example: SLO = 99.9% → Error budget = 0.1%
Over 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes of allowed downtime
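The arithmetic above generalises to any SLO and window; a small illustrative helper (not part of the platform tooling):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    budget_fraction = 1.0 - slo_percent / 100.0
    return budget_fraction * window_days * 24 * 60

print(error_budget_minutes(99.9))   # → ~43.2 minutes over 30 days
print(error_budget_minutes(99.99))  # → ~4.3 minutes over 30 days
```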

The error budget is spent by:

  • Planned maintenance windows (Talos upgrades, Ceph rebalancing)
  • Unplanned incidents (pod crashes, network partitions, misconfigurations)
  • Deployments that cause brief errors (rolling updates, canary failures)

When the budget is exhausted, the team should:

  1. Freeze non-critical changes
  2. Focus engineering effort on reliability improvements
  3. Increase testing and validation before deploying
  4. Review recent incidents for systemic causes

Burn Rate

Burn rate measures how fast the error budget is being consumed, relative to the rate that would consume it exactly by the end of the window:

Burn rate = (observed error rate / allowed error rate)

Burn rate = 1.0 → Budget consumed evenly over the window (on track)
Burn rate = 2.0 → Budget consumed 2× faster than planned (will exhaust in 15 days instead of 30)
Burn rate = 10.0 → Budget consumed 10× faster (critical — will exhaust in 3 days)
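The same arithmetic can be sketched in a few lines (the 0.2% observed error rate is an illustrative figure):

```python
def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_percent / 100.0
    return observed_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the window's entire error budget is spent at this rate."""
    return window_days / rate

# 0.2% of requests failing against a 99.9% SLO
rate_now = burn_rate(0.002, 99.9)
print(rate_now, days_to_exhaustion(rate_now))  # ≈ 2.0, budget gone in ~15 days
```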

Alerting on burn rate (recommended over raw error rate):

| Burn Rate | Severity | Response | Alert Window |
|---|---|---|---|
| > 14.4 | Critical (P1) | Page on-call immediately | 5-minute rate over 1-hour window |
| > 6.0 | High (P2) | Respond within 30 minutes | 30-minute rate over 6-hour window |
| > 1.0 | Warning (P3) | Investigate during business hours | 6-hour rate over 3-day window |

These thresholds come from the Google SRE Workbook's multi-window, multi-burn-rate alerting approach.

PromQL example — burn rate alert for API availability:

# 5-minute error rate / allowed error rate
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) / (1 - 0.999)   # 0.999 = SLO target
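To turn this expression into the multi-window alerts from the table above, it is evaluated at two ranges and combined. A sketch of a PrometheusRule (the metric names and runbook URL reuse values from this page; the exact rule shape and thresholds follow the SRE Workbook pattern and should be adapted to RCIIS):

```yaml
groups:
  - name: rciis-slo-burn
    rules:
      # Critical burn: both the short and long window must exceed 14.4x,
      # so a brief spike alone does not page on-call.
      - alert: APIErrorBudgetBurnCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) / (1 - 0.999) > 14.4
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error budget burning at more than 14.4x"
          runbook_url: "https://docs.rciis.eac.int/runbooks/api-error-rate"
```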

Latency Percentiles (P50, P95, P99, P999)

Why Percentiles, Not Averages

Averages hide the experience of your worst-affected users. Consider two scenarios with the same 200ms average:

Scenario A: All requests take 200ms            → Average: 200ms
Scenario B: 99 requests at 100ms, 1 at 10,100ms (~10s) → Average: 200ms

Both have a 200ms average, but Scenario B has users experiencing 10-second responses. Percentiles reveal this.
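A few lines of Python make the gap concrete (purely illustrative; the index-based percentile here is a simple nearest-rank style calculation):

```python
import statistics

# Scenario B from above: 99 fast requests and one ~10-second outlier
latencies_ms = [100] * 99 + [10_100]

mean = statistics.mean(latencies_ms)

# The request slower than 99% of all others, i.e. what the tail reveals
idx = round(0.99 * len(latencies_ms))
tail = sorted(latencies_ms)[idx]

print(mean, tail)  # 200.0 10100
```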

What Percentiles Mean

A percentile answers: "What is the maximum latency experienced by X% of requests?"

| Percentile | Meaning | Tells You |
|---|---|---|
| P50 (median) | 50% of requests are faster than this value | Typical user experience |
| P90 | 90% of requests are faster | Experience for most users |
| P95 | 95% of requests are faster | Experience excluding edge cases |
| P99 | 99% of requests are faster | Worst 1-in-100 experience — the "tail latency" |
| P999 (P99.9) | 99.9% of requests are faster | Extreme tail — often dominated by GC pauses, cold caches, retries |

Visualising the Distribution

Number of
requests
  ██│
  ██│
  ██│██
  ██│██
  ██│██ ██
  ██│██ ██
  ██│██ ██ ██
  ██│██ ██ ██ ██
  ██│██ ██ ██ ██ ██
  ██│██ ██ ██ ██ ██ ██ ░░ ░░                     ░░
──┴─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──── Latency (ms)
  50 100 150 200 250 300 350 400 450 500     2000
       ▲                 ▲           ▲          ▲
      P50               P90        P99        P999

Most requests cluster around P50. The "long tail" to the right catches the slow outliers that P99 and P999 measure.

Why P99 Matters More Than P50

If your service handles 10,000 requests per minute:

  • P50 = 100ms means 5,000 requests are faster than 100ms — the "happy path"
  • P99 = 2,000ms means 100 requests per minute take over 2 seconds — that is 100 frustrated users every minute
  • P999 = 10,000ms means 10 requests per minute take over 10 seconds — likely timeouts and retries

For the RCIIS platform, customs declaration processing that hits the P99 tail causes visible delays at border checkpoints. SLOs should target P99, not P50.

PromQL — Calculating Percentiles

Prometheus stores latency data in histograms with configurable bucket boundaries. Use histogram_quantile() to compute percentiles:

# P50 latency for HTTP requests over the last 5 minutes
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P99 latency broken down by service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Histogram bucket boundaries

The accuracy of histogram_quantile depends on the bucket boundaries configured in the application. If the buckets are [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds, Prometheus interpolates within the bucket that contains the percentile. Finer buckets give better accuracy but consume more storage (higher cardinality).
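The interpolation can be sketched in pure Python (a simplified model of histogram_quantile; real Prometheus histograms expose cumulative counters per bucket, and the bucket counts below are invented):

```python
import bisect

def histogram_quantile(q, upper_bounds, cumulative_counts):
    """Approximate a quantile from cumulative bucket counts, linearly
    interpolating inside the bucket that contains the target rank."""
    total = cumulative_counts[-1]
    rank = q * total
    # first bucket whose cumulative count reaches the rank
    i = bisect.bisect_left(cumulative_counts, rank)
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    prev = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - prev
    if in_bucket == 0:
        return upper_bounds[i]
    return lower + (upper_bounds[i] - lower) * (rank - prev) / in_bucket

bounds = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]     # seconds
counts = [100, 400, 700, 900, 960, 990, 998, 999, 1000]  # cumulative observations
print(histogram_quantile(0.99, bounds, counts))  # ≈ 1.0
```

With these coarse buckets the P99 lands exactly at a bucket edge; if the true P99 were anywhere between 0.5s and 1s, Prometheus would report a value interpolated inside that bucket, which is why finer buckets improve accuracy.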


The Four Golden Signals

Defined in the Google SRE book, these are the four most important metrics to monitor for any service:

| Signal | What It Measures | RCIIS Example | Key Metric |
|---|---|---|---|
| Latency | Time to serve a request | APISIX gateway response time | histogram_quantile(0.99, ...) |
| Traffic | Demand on the system | Requests per second to the ESB API | sum(rate(http_requests_total[5m])) |
| Errors | Rate of failed requests | 5xx responses from the API gateway | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Saturation | How "full" the system is | CPU, memory, disk usage on worker nodes | node_cpu_seconds_total, node_memory_MemAvailable_bytes |

Latency should be measured separately for successful and failed requests — a fast error (e.g., a 500 returned in 2ms) should not artificially lower the latency metric for successful requests.

Saturation is the hardest to measure well. Look for:

  • CPU throttling (container_cpu_cfs_throttled_seconds_total)
  • Memory pressure (container_memory_working_set_bytes approaching limits)
  • Disk I/O wait (node_disk_io_time_seconds_total)
  • Ceph OSD utilisation (ceph_osd_pgs, ceph_osd_utilization)
  • Kafka consumer lag (messages produced minus consumed)

RED Method

The RED method is a simplified framework for monitoring request-driven services (APIs, web servers, gateways). It focuses on three metrics:

| Metric | Definition | PromQL Pattern |
|---|---|---|
| Rate | Requests per second | sum(rate(http_requests_total[5m])) |
| Errors | Failed requests per second | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Duration | Distribution of request latency | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) |

When to use RED: For any component that serves requests — APISIX gateway, Keycloak authentication endpoints, FluxCD controllers, Kubernetes API server.

RCIIS component mapping:

| Component | Rate | Errors | Duration |
|---|---|---|---|
| APISIX Gateway | Customs API requests/sec | 5xx + 4xx responses | Response time P99 |
| Keycloak | Auth token requests/sec | Failed logins | Token issuance latency |
| FluxCD | Reconciliation operations/min | Failed reconciliations | Reconciliation duration |
| Kubernetes API | API requests/sec | 5xx responses | API call latency |

USE Method

The USE method is a framework for monitoring infrastructure resources (CPU, memory, disk, network). Developed by Brendan Gregg, it measures:

| Metric | Definition | What Indicates a Problem |
|---|---|---|
| Utilization | Percentage of resource capacity in use | Sustained > 80% — approaching limits |
| Saturation | Work that is queued or waiting | Any non-zero value — resource is overloaded |
| Errors | Count of error events | Any errors — hardware or software fault |

When to use USE: For every physical or virtual resource — CPU, memory, disk, network interfaces, Ceph OSDs.

RCIIS infrastructure mapping:

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | rate(node_cpu_seconds_total{mode!="idle"}[5m]) | node_load15 > CPU count | Machine check exceptions |
| Memory | 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) | node_vmstat_pswpin (swap in) | ECC errors |
| Disk | rate(node_disk_io_time_seconds_total[5m]) | node_disk_io_time_weighted_seconds_total | Kernel I/O errors, SMART alerts |
| Network | rate(node_network_transmit_bytes_total[5m]) | node_network_transmit_drop_total | node_network_transmit_errs_total |
| Ceph OSD | ceph_osd_utilization | ceph_osd_pgs vs max | ceph_osd_down |

RED vs USE

Use RED for services that handle requests (APIs, web apps). Use USE for resources those services run on (nodes, disks, network). Together they cover both the application and infrastructure layers.


Availability & "Nines"

Availability is expressed as a percentage of uptime. Each additional "nine" represents a 10× improvement in reliability:

| Availability | Common Name | Downtime / Year | Downtime / Month | Downtime / Week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | Three and a half nines | 4.38 hours | 21.9 minutes | 5.0 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 6.05 seconds |
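The yearly column can be reproduced with one formula (illustrative; uses a 365-day year, matching the figures above):

```python
def downtime_per_year_hours(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability."""
    return (1.0 - availability_percent / 100.0) * 365 * 24

for nines in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year_hours(nines):.2f} h/year")
```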

Practical considerations for RCIIS:

  • 99.9% (three nines) is a realistic SLO for a multi-region customs platform with planned maintenance windows. It allows ~43 minutes of downtime per month.
  • 99.99% (four nines) requires zero-downtime deployments, automated failover, and minimal planned maintenance. Achievable with Cilium + HA deployments + geo-load balancing but demanding operationally.
  • 99.999% (five nines) requires active-active multi-region with automatic traffic shifting. The Cloudflare geo-load balancing layer helps achieve this for the external-facing API, but internal services are harder.

The error budget calculation from earlier ties directly to availability:

99.9% SLO → 0.1% error budget → 43.8 minutes/month of allowed downtime

Incident Metrics: MTTR, MTTF, MTBF

These metrics quantify the reliability and recoverability of the platform:

| Metric | Full Name | Definition | Formula |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Average time from incident detection to service restoration | sum(recovery times) / count(incidents) |
| MTTF | Mean Time to Failure | Average operating time before a failure occurs (for non-repairable items) | sum(uptime periods) / count(failures) |
| MTBF | Mean Time Between Failures | Average time between the start of one failure and the start of the next | MTBF = MTTF + MTTR |

         ┌──── MTTF ──────┐┌─ MTTR ─┐┌──── MTTF ──────┐┌─ MTTR ─┐
         │                ││        ││                ││        │
─────────┤    Running     ├┤  Down  ├┤    Running     ├┤  Down  ├────
         │                ││        ││                ││        │
         └────────────────┘└────────┘└────────────────┘└────────┘
         ├─────────── MTBF ─────────┤
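Given a list of outage intervals on a shared timeline, MTTR and MTBF fall out directly (a sketch; the timestamps are invented):

```python
def incident_metrics(incidents):
    """incidents: list of (start_hour, end_hour) outages, ordered by start.
    Returns (MTTR, MTBF) in hours; MTBF needs at least two incidents."""
    mttr = sum(end - start for start, end in incidents) / len(incidents)
    # MTBF: gap between the starts of consecutive failures
    gaps = [b[0] - a[0] for a, b in zip(incidents, incidents[1:])]
    mtbf = sum(gaps) / len(gaps)
    return mttr, mtbf

# Two 30-minute outages, 24 hours apart (hours since an arbitrary epoch)
mttr, mtbf = incident_metrics([(10.0, 10.5), (34.0, 34.5)])
print(mttr, mtbf)  # 0.5 24.0  (so MTTF = MTBF - MTTR = 23.5)
```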

Reducing MTTR (the most actionable metric):

| Action | Impact on MTTR | RCIIS Implementation |
|---|---|---|
| Faster detection | Reduces time-to-detect | Prometheus alerting with burn rate rules |
| Clear runbooks | Reduces time-to-diagnose | Alert annotations linking to runbook URLs |
| Automated remediation | Reduces time-to-fix | Kyverno auto-remediation, FluxCD drift correction |
| Practiced response | Reduces coordination overhead | Regular game days, documented incident response |

See Incident Response for the RCIIS incident management process.


Prometheus & PromQL Basics

Metric Types

Prometheus collects four types of metrics:

| Type | Description | Example | Operations |
|---|---|---|---|
| Counter | Monotonically increasing value (resets to 0 on restart) | http_requests_total, node_cpu_seconds_total | rate(), increase() |
| Gauge | Value that goes up and down | node_memory_MemAvailable_bytes, kube_pod_status_ready | Direct value, avg_over_time() |
| Histogram | Distribution of values in configurable buckets | http_request_duration_seconds_bucket | histogram_quantile(), rate() on _bucket |
| Summary | Pre-calculated percentiles (less common) | go_gc_duration_seconds | Direct quantile values |

Essential PromQL Patterns

Rate of change (for counters — always use rate(), never raw counter values):

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# CPU usage as a percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Aggregation:

# Total requests per second across all pods
sum(rate(http_requests_total[5m]))

# Average memory usage by namespace
avg by (namespace) (container_memory_working_set_bytes)

# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

Percentiles (from histograms):

# P99 request latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Alerting expressions:

# Alert if error rate exceeds 1% for 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01

# Alert if disk is >85% full
(node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15

Recording Rules

Recording rules pre-compute expensive PromQL queries and store the result as a new time series. This improves dashboard load time and alert evaluation performance:

prometheus-recording-rules.yaml
groups:
  - name: rciis-sli
    interval: 30s
    rules:
      # Pre-compute API availability SLI
      - record: rciis:api_availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )

      # Pre-compute P99 latency
      - record: rciis:api_latency_p99:seconds
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

Use the recorded metric name in dashboards and alerts instead of recomputing the full expression each time.


Loki & LogQL Basics

Loki is the log aggregation system deployed alongside Prometheus. LogQL queries follow a similar syntax to PromQL but operate on log streams.

Log Stream Selection

# All logs from the keycloak namespace
{namespace="keycloak"}

# Logs from a specific pod
{namespace="flux-system", pod=~"source-controller.*"}

# All error-level logs across all namespaces
{level="error"}

Filtering

# Lines containing "connection refused"
{namespace="keycloak"} |= "connection refused"

# Lines NOT containing "health"
{namespace="flux-system"} != "health"

# Regex match
{namespace="rciis"} |~ "status=(4|5)\\d\\d"

Log-Based Metrics

# Error log lines per second (useful for alerting)
sum(rate({namespace="keycloak", level="error"}[5m]))

# Error lines per pod over the last hour (grouping by a field inside the
# log line, e.g. the message, would first require a parser stage such as logfmt)
sum by (pod) (count_over_time({namespace="rciis"} |= "ERROR" [1h]))

When to use logs vs metrics

Use metrics (Prometheus) for numerical measurements: request rates, latencies, resource usage. Use logs (Loki) for event context: error messages, stack traces, audit trails, request details. Alert on metrics first; use logs for diagnosis.


Cardinality

Cardinality is the number of unique time series in Prometheus. Each unique combination of metric name and label values creates a separate time series.

Why High Cardinality Is Dangerous

http_requests_total{method="GET", status="200", path="/api/v1/health"}  → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users"}   → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users/123"} → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users/456"} → 1 series
...
http_requests_total{method="GET", status="200", path="/api/v1/users/999999"} → 1 series

If the path label includes user IDs, each unique user creates a new time series. With 100,000 users, that is 100,000 series per method/status combination — millions of series total.
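One common mitigation is to template paths before they become label values. A hedged sketch (the regexes and the ":id" convention are illustrative, not an RCIIS standard):

```python
import re

# Collapse numeric IDs and UUIDs into a placeholder before the path is used
# as a metric label, keeping the label's value set bounded.
NUMERIC_ID = re.compile(r"/\d+(?=/|$)")
UUID = re.compile(
    r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"
)

def normalize_path(path: str) -> str:
    path = UUID.sub("/:id", path)
    return NUMERIC_ID.sub("/:id", path)

print(normalize_path("/api/v1/users/123"))             # /api/v1/users/:id
print(normalize_path("/api/v1/users/123/orders/456"))  # /api/v1/users/:id/orders/:id
```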

Symptoms of cardinality explosion:

  • Prometheus memory usage spikes
  • Slow query performance in Grafana
  • TSDB compaction errors in Prometheus logs
  • Series churn warnings: "too many active series"

How to Prevent It

| Rule | Good Label | Bad Label |
|---|---|---|
| Bounded values only | status="200", method="GET" | user_id="12345", request_id="abc-def" |
| Use path templates | path="/api/v1/users/:id" | path="/api/v1/users/12345" |
| Keep label count low | 5-7 labels per metric | 15+ labels per metric |
| Drop high-cardinality labels | Use metric_relabel_configs | Ingest everything and hope |

Prometheus relabel config to drop a high-cardinality label:

metric_relabel_configs:
  # labeldrop matches label names against the regex and removes the label
  # from every scraped series (it cannot be scoped to one metric directly)
  - action: labeldrop
    regex: "request_id"

Check current cardinality:

# Top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))

Alerting Best Practices

Alert on Symptoms, Not Causes

| Symptom-Based (Good) | Cause-Based (Bad) |
|---|---|
| "API error rate > 1% for 5 minutes" | "Pod restarted" |
| "P99 latency > 2s for 10 minutes" | "CPU usage > 80%" |
| "Error budget burn rate > 6x" | "Disk usage > 70%" |

Cause-based alerts generate noise — a pod restart might be normal (rolling update), and high CPU might be expected (batch job). Symptom-based alerts fire only when users are actually affected.

Severity Levels

| Severity | Response Time | Notification | Example |
|---|---|---|---|
| P1 — Critical | Immediate (page) | PagerDuty / phone | Error budget burning 14× — service down |
| P2 — High | 30 minutes | Slack alert channel | Error budget burning 6× — degraded |
| P3 — Warning | Business hours | Slack / ticket | Error budget burning 1× — trending bad |
| P4 — Info | Next sprint | Dashboard only | Certificate expiring in 30 days |

Runbooks

Every alert should link to a runbook — a documented procedure for diagnosing and resolving the alert:

PrometheusRule annotation
annotations:
  summary: "High API error rate"
  description: "API error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
  runbook_url: "https://docs.rciis.eac.int/runbooks/api-error-rate"

A runbook should contain:

  1. What this alert means — plain language
  2. Impact — who is affected and how
  3. Diagnostic steps — specific commands to run
  4. Resolution steps — ordered actions to take
  5. Escalation — when and who to escalate to

Toil & Automation

Toil is work that is:

  • Manual — a human runs a command or clicks a button
  • Repetitive — done more than once or twice
  • Automatable — could be handled by a script or controller
  • Reactive — triggered by an event rather than planned
  • Without enduring value — does not permanently improve the system

Examples of toil in Kubernetes operations:

| Toil | Automation |
|---|---|
| Manually restarting crashed pods | Kubernetes liveness probes + automatic restart |
| Manually scaling replicas during traffic spikes | Horizontal Pod Autoscaler (HPA) |
| Manually rotating certificates | cert-manager automatic renewal |
| Manually approving Renovate PRs for patch versions | Auto-merge policy for patch updates |
| Manually checking for CVEs in images | Trivy Operator continuous scanning |
| Manually applying Kyverno policy exceptions | PolicyException CRs in Git (GitOps) |

The SRE principle is: spend no more than 50% of time on toil. If toil exceeds this, invest in automation. The RCIIS platform's GitOps approach, FluxCD drift correction, Kyverno auto-remediation, and cert-manager renewal are all examples of toil reduction.


Reliability Engineering Practices

Day-0, Day-1, Day-2 Operations

These terms describe the lifecycle phases of a platform:

| Phase | When | Activities |
|---|---|---|
| Day-0 | Before deployment | Architecture design, capacity planning, network design, security requirements — Phases 1-2 of this documentation |
| Day-1 | Initial deployment | Infrastructure build, Talos install, platform service deployment, validation — Phases 3-8 of this documentation |
| Day-2 | Ongoing operations | Upgrades, scaling, backup/recovery, incident response, certificate rotation — Phase 9 of this documentation |

Day-2 is where teams spend the most time and where reliability practices (SLOs, alerting, runbooks, automation) have the most impact.

Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into the system to verify that it handles them gracefully. The goal is to find weaknesses before they cause real incidents.

Examples for RCIIS:

| Experiment | What It Tests | Expected Outcome |
|---|---|---|
| Kill a random worker node | Pod rescheduling, Ceph rebalancing | Workloads migrate, storage remains available |
| Block network to Keycloak | Service degradation handling | Cached tokens still work, new logins show clear error |
| Fill a Ceph OSD disk to 85% | Near-full warnings, OSD auto-reweight | Alerts fire, Ceph rebalances data away |
| Inject 500ms latency on ingress | P99 SLO breach detection | Burn rate alert fires within expected window |

Note

Chaos experiments should be run in non-production environments first and only in production with explicit approval and during business hours with the team on standby.

Game Days

A game day is a planned exercise where the team practices responding to a simulated incident:

  1. Define a scenario (e.g., "a control plane node becomes unreachable")
  2. Inject the failure (cordon + drain, or network partition)
  3. The on-call team responds using normal incident procedures
  4. Run a post-mortem reviewing what worked and what did not

Game days build muscle memory for real incidents and expose gaps in runbooks, alerting, and communication.

Post-Mortems

After every significant incident, write a blameless post-mortem that documents:

  1. Timeline — when the incident started, was detected, escalated, and resolved
  2. Impact — which services, how many users, for how long
  3. Root cause — the underlying technical cause (not "human error")
  4. Contributing factors — what made detection or resolution slower
  5. Action items — specific, assigned tasks to prevent recurrence

The post-mortem should be stored in a shared location and reviewed as a team.


Further Reading

| Resource | Description |
|---|---|
| Google SRE Book | The foundational text on Site Reliability Engineering — free to read online |
| Google SRE Workbook | Practical companion to the SRE book with worked examples |
| Brendan Gregg — USE Method | The original USE method page with per-resource checklists |
| Tom Wilkie — RED Method | Grafana blog post explaining the RED method |
| Prometheus Documentation | Official Prometheus docs including PromQL reference |
| Grafana Loki — LogQL | LogQL query language reference |
| OpenSLO Specification | Open standard for defining SLOs as code |
OpenSLO Specification Open standard for defining SLOs as code