SRE & Observability Concepts¶
This page explains the Site Reliability Engineering (SRE) and observability concepts used throughout this documentation. It is intended as a learning resource for platform operators who are deploying and managing the RCIIS infrastructure.
Service Levels: SLI, SLO, SLA¶
These three terms form a hierarchy that defines, measures, and contractually guarantees service reliability.
Service Level Indicator (SLI)¶
An SLI is a quantitative measurement of a specific aspect of the service. It answers: "How is the service performing right now?"
SLIs are expressed as ratios — good events divided by total events — yielding a percentage:

SLI = (good events / total events) × 100%
RCIIS examples:
| SLI | Measurement | Good Event | Source |
|---|---|---|---|
| API availability | HTTP requests to the RCIIS ESB gateway | Response status is not 5xx | APISIX / ingress-nginx metrics |
| Request latency | HTTP requests to the Kubernetes API | Response time < 500ms | kube-apiserver metrics |
| Data freshness | Kafka consumer lag | Lag < 1000 messages | Strimzi / Kafka exporter metrics |
| Database availability | PostgreSQL health checks | Connection succeeds and query returns | CloudNativePG metrics |
Service Level Objective (SLO)¶
An SLO is a target value for an SLI over a rolling time window. It answers: "How reliable should the service be?"
SLO: 99.9% of API requests return a non-5xx response over a 30-day window
     ^^^^^                                           ^^^^^^^^^^^^^^^^^^^^
     target                                          time window
SLOs are internal commitments — the team agrees to maintain this level of reliability. They are deliberately set below 100% because perfect reliability is neither achievable nor cost-effective.
RCIIS examples:
| SLI | SLO Target | Window | Meaning |
|---|---|---|---|
| API availability | 99.9% | 30 days | At most 43 minutes of downtime per month |
| Request latency (P99) | < 500ms | 30 days | 99% of requests complete in under 500ms |
| Kafka consumer lag | < 1000 messages | 1 hour | Data propagation stays near real-time |
Service Level Agreement (SLA)¶
An SLA is a contractual commitment — a legal document between a service provider and its customers that defines consequences (refunds, penalties, escalation) if the SLO is not met.
SLI → SLO → SLA (measurement → internal target → contractual guarantee)
SLI: "We measure API availability"
SLO: "We target 99.9% availability"
SLA: "We guarantee 99.5% availability; if breached, partner states may escalate"
SLAs are always set looser than SLOs to provide a safety margin. If the SLO is 99.9%, the SLA might guarantee 99.5%.
Practical guidance
Start by defining SLIs (what to measure), then set SLOs (internal targets). Only create SLAs when contractual obligations with partner states require them. Most teams operate effectively with SLIs and SLOs alone.
Error Budgets & Burn Rate¶
Error Budget¶
The error budget is the inverse of the SLO — the amount of unreliability you can tolerate before breaching the objective.
Error budget = 100% - SLO target
Example: SLO = 99.9% → Error budget = 0.1%
Over 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes of allowed downtime
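The arithmetic above can be checked with a short helper (a minimal sketch; the function name is illustrative):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over a rolling window."""
    budget_fraction = 1.0 - slo_target      # e.g. 1 - 0.999 = 0.001
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return budget_fraction * window_minutes

print(error_budget_minutes(0.999))   # 43.2 minutes over a 30-day window
print(error_budget_minutes(0.9999))  # 4.32 minutes over a 30-day window
```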
The error budget is spent by:
- Planned maintenance windows (Talos upgrades, Ceph rebalancing)
- Unplanned incidents (pod crashes, network partitions, misconfigurations)
- Deployments that cause brief errors (rolling updates, canary failures)
When the budget is exhausted, the team should:
- Freeze non-critical changes
- Focus engineering effort on reliability improvements
- Increase testing and validation before deploying
- Review recent incidents for systemic causes
Burn Rate¶
Burn rate measures how fast the error budget is being consumed relative to the budget window:
Burn rate = (observed error rate / allowed error rate)
Burn rate = 1.0 → Budget consumed evenly over the window (on track)
Burn rate = 2.0 → Budget consumed 2× faster than planned (will exhaust in 15 days instead of 30)
Burn rate = 10.0 → Budget consumed 10× faster (critical — will exhaust in 3 days)
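The relationship between burn rate and budget exhaustion can be sketched in two lines (illustrative helper names, not part of any tooling):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Ratio of observed error rate to the error rate the SLO allows."""
    return observed_error_rate / (1.0 - slo_target)

def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """How long the error budget lasts at a constant burn rate."""
    return window_days / burn_rate

print(burn_rate(0.002, 0.999))   # ~2.0: erring twice as fast as allowed
print(days_to_exhaustion(2.0))   # 15.0 days instead of 30
print(days_to_exhaustion(10.0))  # 3.0 days — critical
```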
Alerting on burn rate (recommended over raw error rate):
| Burn Rate | Severity | Response | Alert Window |
|---|---|---|---|
| > 14.4 | Critical (P1) | Page on-call immediately | 5-minute rate over 1-hour window |
| > 6.0 | High (P2) | Respond within 30 minutes | 30-minute rate over 6-hour window |
| > 1.0 | Warning (P3) | Investigate during business hours | 6-hour rate over 3-day window |
These thresholds come from the Google SRE workbook's multi-window, multi-burn-rate alerting approach.
PromQL example — burn rate alert for API availability:
# 5-minute error rate / allowed error rate
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) / (1 - 0.999) # 0.999 = SLO target
Latency Percentiles (P50, P95, P99, P999)¶
Why Percentiles, Not Averages¶
Averages hide the experience of your worst-affected users. Consider two scenarios with the same 200ms average:
Scenario A: All requests take 200ms → Average: 200ms
Scenario B: 99 requests at 100ms, 1 at 10,100ms (10s) → Average: 200ms
Both have a 200ms average, but Scenario B has users experiencing 10-second responses. Percentiles reveal this.
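A few lines of Python reproduce the arithmetic (the scenario data mirrors the example above):

```python
import statistics

# Two request populations with identical means but very different tails
scenario_a = [200] * 100            # every request takes 200 ms
scenario_b = [100] * 99 + [10_100]  # 99 fast requests, one 10.1 s outlier

for name, latencies in [("A", scenario_a), ("B", scenario_b)]:
    print(f"Scenario {name}: mean={statistics.mean(latencies):.0f} ms  "
          f"max={max(latencies)} ms")
# Both means are 200 ms; only the tail (max) exposes the 10-second outlier.
```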
What Percentiles Mean¶
A percentile answers: "What latency do the fastest X% of requests stay under?"
| Percentile | Meaning | Tells You |
|---|---|---|
| P50 (median) | 50% of requests are faster than this value | Typical user experience |
| P90 | 90% of requests are faster | Experience for most users |
| P95 | 95% of requests are faster | Experience excluding edge cases |
| P99 | 99% of requests are faster | Worst 1-in-100 experience — the "tail latency" |
| P999 (P99.9) | 99.9% of requests are faster | Extreme tail — often dominated by GC pauses, cold caches, retries |
Visualising the Distribution¶
Number of
requests
│
██│
██│
██│██
██│██
██│██ ██
██│██ ██
██│██ ██ ██
██│██ ██ ██ ██
██│██ ██ ██ ██ ██
██│██ ██ ██ ██ ██ ██ ░░ ░░ ░░
──┴─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──── Latency (ms)
50 100 150 200 250 300 350 400 450 500 2000
▲ ▲ ▲ ▲
P50 P90 P99 P999
Most requests cluster around P50. The "long tail" to the right catches the slow outliers that P99 and P999 measure.
Why P99 Matters More Than P50¶
If your service handles 10,000 requests per minute:
- P50 = 100ms means 5,000 requests are faster than 100ms — the "happy path"
- P99 = 2,000ms means 100 requests per minute take over 2 seconds — that is 100 frustrated users every minute
- P999 = 10,000ms means 10 requests per minute take over 10 seconds — likely timeouts and retries
For the RCIIS platform, customs declaration processing that hits the P99 tail causes visible delays at border checkpoints. SLOs should target P99, not P50.
PromQL — Calculating Percentiles¶
Prometheus stores latency data in histograms with configurable bucket boundaries. Use histogram_quantile() to compute percentiles:
# P50 latency for HTTP requests over the last 5 minutes
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# P99 latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# P99 latency broken down by service
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
Histogram bucket boundaries
The accuracy of histogram_quantile depends on the bucket boundaries configured in the application. If the buckets are [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds, Prometheus interpolates within the bucket that contains the percentile. Finer buckets give better accuracy but consume more storage (higher cardinality).
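The interpolation can be sketched in a few lines (an illustrative re-implementation of the idea, not Prometheus source code; the bucket counts are made up):

```python
import bisect

def quantile_from_buckets(q, boundaries, cumulative_counts):
    """Approximate a quantile from cumulative histogram buckets, using
    linear interpolation inside the containing bucket — the same idea
    as PromQL's histogram_quantile()."""
    total = cumulative_counts[-1]
    rank = q * total
    # first bucket whose cumulative count reaches the rank
    i = bisect.bisect_left(cumulative_counts, rank)
    lower = boundaries[i - 1] if i > 0 else 0.0
    upper = boundaries[i]
    count_below = cumulative_counts[i - 1] if i > 0 else 0
    bucket_count = cumulative_counts[i] - count_below
    # interpolate linearly within the bucket
    return lower + (upper - lower) * (rank - count_below) / bucket_count

boundaries = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]  # seconds
cumulative = [100, 400, 700, 900, 970, 990, 998, 1000, 1000]
print(quantile_from_buckets(0.99, boundaries, cumulative))  # 1.0
```

Note the answer is only as precise as the bucket layout: the true P99 could be anywhere inside the 0.5–1 s bucket.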
The Four Golden Signals¶
Defined in the Google SRE book, these are the four most important metrics to monitor for any service:
| Signal | What It Measures | RCIIS Example | Key Metric |
|---|---|---|---|
| Latency | Time to serve a request | APISIX gateway response time | histogram_quantile(0.99, ...) |
| Traffic | Demand on the system | Requests per second to the ESB API | sum(rate(http_requests_total[5m])) |
| Errors | Rate of failed requests | 5xx responses from the API gateway | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Saturation | How "full" the system is | CPU, memory, disk usage on worker nodes | node_cpu_seconds_total, node_memory_MemAvailable_bytes |
Measure latency separately for successful and failed requests — a fast error (e.g., a 500 returned in 2ms) should not artificially lower the latency metric for successful requests.
Saturation is the hardest to measure well. Look for:
- CPU throttling (`container_cpu_cfs_throttled_seconds_total`)
- Memory pressure (`container_memory_working_set_bytes` approaching limits)
- Disk I/O wait (`node_disk_io_time_seconds_total`)
- Ceph OSD utilisation (`ceph_osd_pgs`, `ceph_osd_utilization`)
- Kafka consumer lag (messages produced minus consumed)
RED Method¶
The RED method is a simplified framework for monitoring request-driven services (APIs, web servers, gateways). It focuses on three metrics:
| Metric | Definition | PromQL Pattern |
|---|---|---|
| Rate | Requests per second | sum(rate(http_requests_total[5m])) |
| Errors | Failed requests per second | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Duration | Distribution of request latency | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) |
When to use RED: For any component that serves requests — APISIX gateway, Keycloak authentication endpoints, FluxCD controllers, Kubernetes API server.
RCIIS component mapping:
| Component | Rate | Errors | Duration |
|---|---|---|---|
| APISIX Gateway | Customs API requests/sec | 5xx + 4xx responses | Response time P99 |
| Keycloak | Auth token requests/sec | Failed logins | Token issuance latency |
| FluxCD | Reconciliation operations/min | Failed reconciliations | Reconciliation duration |
| Kubernetes API | API requests/sec | 5xx responses | API call latency |
USE Method¶
The USE method is a framework for monitoring infrastructure resources (CPU, memory, disk, network). Developed by Brendan Gregg, it measures:
| Metric | Definition | What Indicates a Problem |
|---|---|---|
| Utilization | Percentage of resource capacity in use | Sustained > 80% — approaching limits |
| Saturation | Work that is queued or waiting | Any non-zero value — resource is overloaded |
| Errors | Count of error events | Any errors — hardware or software fault |
When to use USE: For every physical or virtual resource — CPU, memory, disk, network interfaces, Ceph OSDs.
RCIIS infrastructure mapping:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | `rate(node_cpu_seconds_total{mode!="idle"}[5m])` | `node_load15` > CPU count | Machine check exceptions |
| Memory | `1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)` | `node_vmstat_pswpin` (swap in) | ECC errors |
| Disk | `rate(node_disk_io_time_seconds_total[5m])` | `node_disk_io_time_weighted_seconds_total` | `node_disk_io_now` errors, SMART alerts |
| Network | `rate(node_network_transmit_bytes_total[5m])` | `node_network_transmit_drop_total` | `node_network_transmit_errs_total` |
| Ceph OSD | `ceph_osd_utilization` | `ceph_osd_pgs` vs max | `ceph_osd_down` |
RED vs USE
Use RED for services that handle requests (APIs, web apps). Use USE for resources those services run on (nodes, disks, network). Together they cover both the application and infrastructure layers.
Availability & "Nines"¶
Availability is expressed as a percentage of uptime. Each additional "nine" represents a 10× reduction in allowed downtime:
| Availability | Common Name | Downtime / Year | Downtime / Month | Downtime / Week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | Three and a half nines | 4.38 hours | 21.9 minutes | 5.0 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 6.05 seconds |
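The table's figures can be reproduced with one multiplication per row (assuming, as the table does, a 365.25-day year):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 s

def yearly_downtime_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR / 60

for a in [0.99, 0.999, 0.9995, 0.9999, 0.99999]:
    print(f"{a:.3%}: {yearly_downtime_minutes(a):9.2f} minutes/year")
# Five nines leaves only ~5.26 minutes of downtime per year.
```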
Practical considerations for RCIIS:
- 99.9% (three nines) is a realistic SLO for a multi-region customs platform with planned maintenance windows. It allows ~43 minutes of downtime per month.
- 99.99% (four nines) requires zero-downtime deployments, automated failover, and minimal planned maintenance. Achievable with Cilium + HA deployments + geo-load balancing but demanding operationally.
- 99.999% (five nines) requires active-active multi-region with automatic traffic shifting. The Cloudflare geo-load balancing layer helps achieve this for the external-facing API, but internal services are harder.
The error budget calculation from earlier maps directly onto this table: a 99.9% SLO leaves a 0.1% error budget, which is the ~43.8 minutes of downtime per month shown above.
Incident Metrics: MTTR, MTTF, MTBF¶
These metrics quantify the reliability and recoverability of the platform:
| Metric | Full Name | Definition | Formula |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Average time from incident detection to service restoration | sum(recovery times) / count(incidents) |
| MTTF | Mean Time to Failure | Average operating time before a failure occurs | sum(uptime periods) / count(failures) |
| MTBF | Mean Time Between Failures | Average time between consecutive failures | MTBF = MTTF + MTTR |
┌──── MTTF ──────┐┌─ MTTR ─┐┌──── MTTF ──────┐┌─ MTTR ─┐
│ ││ ││ ││ │
─────────┤ Running ├┤ Down ├┤ Running ├┤ Down ├────
│ ││ ││ ││ │
└────────────────┘└────────┘└────────────────┘└────────┘
├─────────── MTBF ─────────┤
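Given a hypothetical incident log, all three metrics fall out of simple averages (illustrative data and variable names):

```python
# Hypothetical incident log: (failure_time, recovery_time) pairs in hours
incidents = [(100.0, 101.5), (250.0, 250.5), (400.0, 403.0)]

# MTTR: average time from failure to recovery
mttr = sum(rec - fail for fail, rec in incidents) / len(incidents)

# MTTF: average running time before each failure
uptimes = [incidents[0][0]] + [
    incidents[i][0] - incidents[i - 1][1] for i in range(1, len(incidents))
]
mttf = sum(uptimes) / len(uptimes)

# MTBF is the sum of the two, matching the diagram above
print(f"MTTR = {mttr:.2f} h, MTTF = {mttf:.2f} h, MTBF = {mttf + mttr:.2f} h")
```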
Reducing MTTR (the most actionable metric):
| Action | Impact on MTTR | RCIIS Implementation |
|---|---|---|
| Faster detection | Reduces time-to-detect | Prometheus alerting with burn rate rules |
| Clear runbooks | Reduces time-to-diagnose | Alert annotations linking to runbook URLs |
| Automated remediation | Reduces time-to-fix | Kyverno auto-remediation, FluxCD drift correction |
| Practiced response | Reduces coordination overhead | Regular game days, documented incident response |
See Incident Response for the RCIIS incident management process.
Prometheus & PromQL Basics¶
Metric Types¶
Prometheus collects four types of metrics:
| Type | Description | Example | Operations |
|---|---|---|---|
| Counter | Monotonically increasing value (resets to 0 on restart) | `http_requests_total`, `node_cpu_seconds_total` | `rate()`, `increase()` |
| Gauge | Value that goes up and down | `node_memory_MemAvailable_bytes`, `kube_pod_status_ready` | Direct value, `avg_over_time()` |
| Histogram | Distribution of values in configurable buckets | `http_request_duration_seconds_bucket` | `histogram_quantile()`, `rate()` on `_bucket` |
| Summary | Pre-calculated percentiles (less common) | `go_gc_duration_seconds` | Direct quantile values |
Essential PromQL Patterns¶
Rate of change (for counters — always use rate(), never raw counter values):
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
# CPU usage as a percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Aggregation:
# Total requests per second across all pods
sum(rate(http_requests_total[5m]))
# Average memory usage by namespace
avg by (namespace) (container_memory_working_set_bytes)
# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
Percentiles (from histograms):
# P99 request latency
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
Alerting expressions:
# Alert if error rate exceeds 1% for 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01
# Alert if disk is >85% full
(node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
Recording Rules¶
Recording rules pre-compute expensive PromQL queries and store the result as a new time series. This improves dashboard load time and alert evaluation performance:
groups:
- name: rciis-sli
interval: 30s
rules:
# Pre-compute API availability SLI
- record: rciis:api_availability:ratio_rate5m
expr: |
1 - (
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
# Pre-compute P99 latency
- record: rciis:api_latency_p99:seconds
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
Use the recorded metric name in dashboards and alerts instead of recomputing the full expression each time.
Loki & LogQL Basics¶
Loki is the log aggregation system deployed alongside Prometheus. LogQL queries follow a similar syntax to PromQL but operate on log streams.
Log Stream Selection¶
# All logs from the keycloak namespace
{namespace="keycloak"}
# Logs from a specific pod
{namespace="flux-system", pod=~"source-controller.*"}
# All error-level logs across all namespaces
{level="error"}
Filtering¶
# Lines containing "connection refused"
{namespace="keycloak"} |= "connection refused"
# Lines NOT containing "health"
{namespace="flux-system"} != "health"
# Regex match
{namespace="rciis"} |~ "status=(4|5)\\d\\d"
Log-Based Metrics¶
# Error log lines per second (useful for alerting)
sum(rate({namespace="keycloak", level="error"}[5m]))
# Count of unique error messages in the last hour
sum by (message) (count_over_time({namespace="rciis"} |= "ERROR" [1h]))
When to use logs vs metrics
Use metrics (Prometheus) for numerical measurements: request rates, latencies, resource usage. Use logs (Loki) for event context: error messages, stack traces, audit trails, request details. Alert on metrics first; use logs for diagnosis.
Cardinality¶
Cardinality is the number of unique time series in Prometheus. Each unique combination of metric name and label values creates a separate time series.
Why High Cardinality Is Dangerous¶
http_requests_total{method="GET", status="200", path="/api/v1/health"} → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users"} → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users/123"} → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users/456"} → 1 series
...
http_requests_total{method="GET", status="200", path="/api/v1/users/999999"} → 1 series
If the path label includes user IDs, each unique user creates a new time series. With 100,000 users, that is 100,000 series per method/status combination — millions of series total.
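The explosion is simple multiplication — a sketch with hypothetical per-label cardinalities:

```python
# Hypothetical distinct-value counts per label on http_requests_total
label_cardinality = {"method": 5, "status": 10, "path": 100_000}

# Worst-case series count is the product of all label cardinalities
series = 1
for label, distinct_values in label_cardinality.items():
    series *= distinct_values

print(f"{series:,} potential time series for one metric")  # 5,000,000
```

Templating the path down to, say, 50 route patterns shrinks the same product to 2,500 series.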
Symptoms of cardinality explosion:
- Prometheus memory usage spikes
- Slow query performance in Grafana
- TSDB compaction errors in Prometheus logs
- Series churn warnings ("too many active series")
How to Prevent It¶
| Rule | Good Label | Bad Label |
|---|---|---|
| Bounded values only | `status="200"`, `method="GET"` | `user_id="12345"`, `request_id="abc-def"` |
| Use path templates | `path="/api/v1/users/:id"` | `path="/api/v1/users/12345"` |
| Keep label count low | 5-7 labels per metric | 15+ labels per metric |
| Drop high-cardinality labels | Use `metric_relabel_configs` | Ingest everything and hope |
Prometheus relabel config to drop a high-cardinality label:
metric_relabel_configs:
  # labeldrop removes any label whose name matches the regex;
  # the series itself is kept
  - action: labeldrop
    regex: "request_id"
Check current cardinality with the query `count({__name__=~".+"})`, or track the `prometheus_tsdb_head_series` metric over time.
Alerting Best Practices¶
Alert on Symptoms, Not Causes¶
| Symptom-Based (Good) | Cause-Based (Bad) |
|---|---|
| "API error rate > 1% for 5 minutes" | "Pod restarted" |
| "P99 latency > 2s for 10 minutes" | "CPU usage > 80%" |
| "Error budget burn rate > 6x" | "Disk usage > 70%" |
Cause-based alerts generate noise — a pod restart might be normal (rolling update), and high CPU might be expected (batch job). Symptom-based alerts fire only when users are actually affected.
Severity Levels¶
| Severity | Response Time | Notification | Example |
|---|---|---|---|
| P1 — Critical | Immediate (page) | PagerDuty / phone | Error budget burning 14× — service down |
| P2 — High | 30 minutes | Slack alert channel | Error budget burning 6× — degraded |
| P3 — Warning | Business hours | Slack / ticket | Error budget burning 1× — trending bad |
| P4 — Info | Next sprint | Dashboard only | Certificate expiring in 30 days |
Runbooks¶
Every alert should link to a runbook — a documented procedure for diagnosing and resolving the alert:
annotations:
summary: "High API error rate"
description: "API error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
runbook_url: "https://docs.rciis.eac.int/runbooks/api-error-rate"
A runbook should contain:
- What this alert means — plain language
- Impact — who is affected and how
- Diagnostic steps — specific commands to run
- Resolution steps — ordered actions to take
- Escalation — when and who to escalate to
Toil & Automation¶
Toil is work that is:
- Manual — a human runs a command or clicks a button
- Repetitive — done more than once or twice
- Automatable — could be handled by a script or controller
- Reactive — triggered by an event rather than planned
- Without enduring value — does not permanently improve the system
Examples of toil in Kubernetes operations:
| Toil | Automation |
|---|---|
| Manually restarting crashed pods | Kubernetes liveness probes + automatic restart |
| Manually scaling replicas during traffic spikes | Horizontal Pod Autoscaler (HPA) |
| Manually rotating certificates | cert-manager automatic renewal |
| Manually approving Renovate PRs for patch versions | Auto-merge policy for patch updates |
| Manually checking for CVEs in images | Trivy Operator continuous scanning |
| Manually applying Kyverno policy exceptions | PolicyException CRs in Git (GitOps) |
The SRE principle is: spend no more than 50% of time on toil. If toil exceeds this, invest in automation. The RCIIS platform's GitOps approach, FluxCD drift correction, Kyverno auto-remediation, and cert-manager renewal are all examples of toil reduction.
Reliability Engineering Practices¶
Day-0, Day-1, Day-2 Operations¶
These terms describe the lifecycle phases of a platform:
| Phase | When | Activities |
|---|---|---|
| Day-0 | Before deployment | Architecture design, capacity planning, network design, security requirements — Phases 1-2 of this documentation |
| Day-1 | Initial deployment | Infrastructure build, Talos install, platform service deployment, validation — Phases 3-8 of this documentation |
| Day-2 | Ongoing operations | Upgrades, scaling, backup/recovery, incident response, certificate rotation — Phase 9 of this documentation |
Day-2 is where teams spend the most time and where reliability practices (SLOs, alerting, runbooks, automation) have the most impact.
Chaos Engineering¶
Chaos engineering is the practice of deliberately injecting failures into the system to verify that it handles them gracefully. The goal is to find weaknesses before they cause real incidents.
Examples for RCIIS:
| Experiment | What It Tests | Expected Outcome |
|---|---|---|
| Kill a random worker node | Pod rescheduling, Ceph rebalancing | Workloads migrate, storage remains available |
| Block network to Keycloak | Service degradation handling | Cached tokens still work, new logins show clear error |
| Fill a Ceph OSD disk to 85% | Near-full warnings, OSD auto-reweight | Alerts fire, Ceph rebalances data away |
| Inject 500ms latency on ingress | P99 SLO breach detection | Burn rate alert fires within expected window |
Note
Chaos experiments should be run in non-production environments first and only in production with explicit approval and during business hours with the team on standby.
Game Days¶
A game day is a planned exercise where the team practices responding to a simulated incident:
- Define a scenario (e.g., "a control plane node becomes unreachable")
- Inject the failure (cordon + drain, or network partition)
- The on-call team responds using normal incident procedures
- Run a post-mortem reviewing what worked and what did not
Game days build muscle memory for real incidents and expose gaps in runbooks, alerting, and communication.
Post-Mortems¶
After every significant incident, write a blameless post-mortem that documents:
- Timeline — when the incident started, was detected, escalated, and resolved
- Impact — which services, how many users, for how long
- Root cause — the underlying technical cause (not "human error")
- Contributing factors — what made detection or resolution slower
- Action items — specific, assigned tasks to prevent recurrence
The post-mortem should be stored in a shared location and reviewed as a team.
Further Reading¶
| Resource | Description |
|---|---|
| Google SRE Book | The foundational text on Site Reliability Engineering — free to read online |
| Google SRE Workbook | Practical companion to the SRE book with worked examples |
| Brendan Gregg — USE Method | The original USE method page with per-resource checklists |
| Tom Wilkie — RED Method | Grafana blog post explaining the RED method |
| Prometheus Documentation | Official Prometheus docs including PromQL reference |
| Grafana Loki — LogQL | LogQL query language reference |
| OpenSLO Specification | Open standard for defining SLOs as code |