
SRE & Observability Concepts

This page explains the Site Reliability Engineering (SRE) and observability concepts used throughout this documentation. It is intended as a learning resource for platform operators who are deploying and managing the RCIIS infrastructure.


Service Levels: SLI, SLO, SLA

These three terms form a hierarchy that defines, measures, and contractually guarantees service reliability.

Service Level Indicator (SLI)

An SLI is a quantitative measurement of a specific aspect of the service. It answers: "How is the service performing right now?"

SLIs are expressed as ratios — good events divided by total events — yielding a percentage:

SLI = (good events / total events) × 100%
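As a quick sanity check, the ratio can be computed directly (a minimal sketch; the event counts are illustrative):

```python
def sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage of good events."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally treated as meeting the SLI
    return 100.0 * good_events / total_events

# 9,994 non-5xx responses out of 10,000 requests
print(sli(9_994, 10_000))  # → 99.94
```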

RCIIS examples:

| SLI | Measurement | Good Event | Source |
|---|---|---|---|
| API availability | HTTP requests to the RCIIS ESB gateway | Response status is not 5xx | APISIX / ingress-nginx metrics |
| Request latency | HTTP requests to the Kubernetes API | Response time < 500ms | kube-apiserver metrics |
| Data freshness | Kafka consumer lag | Lag < 1000 messages | Strimzi / Kafka exporter metrics |
| Database availability | PostgreSQL health checks | Connection succeeds and query returns | CloudNativePG metrics |

Service Level Objective (SLO)

An SLO is a target value for an SLI over a rolling time window. It answers: "How reliable should the service be?"

SLO: 99.9% of API requests return a non-5xx response over a 30-day window
      ^^^^                                              ^^^^^^^^^^^^^^^^^
      target                                            time window

SLOs are internal commitments — the team agrees to maintain this level of reliability. They are deliberately set below 100% because perfect reliability is neither achievable nor cost-effective.

RCIIS examples:

| SLI | SLO Target | Window | Meaning |
|---|---|---|---|
| API availability | 99.9% | 30 days | At most 43 minutes of downtime per month |
| Request latency (P99) | < 500ms | 30 days | 99% of requests complete in under 500ms |
| Kafka consumer lag | < 1000 messages | 1 hour | Data propagation stays near real-time |

Service Level Agreement (SLA)

An SLA is a contractual commitment — a legal document between a service provider and its customers that defines consequences (refunds, penalties, escalation) if the SLO is not met.

SLI → SLO → SLA  (measurement → internal target → contractual guarantee)

SLI: "We measure API availability"
SLO: "We target 99.9% availability"
SLA: "We guarantee 99.5% availability; if breached, partner states may escalate"

SLAs are always set looser than SLOs to provide a safety margin. If the SLO is 99.9%, the SLA might guarantee 99.5%.

Practical guidance

Start by defining SLIs (what to measure), then set SLOs (internal targets). Only create SLAs when contractual obligations with partner states require them. Most teams operate effectively with SLIs and SLOs alone.


Error Budgets & Burn Rate

Error Budget

The error budget is the complement of the SLO — the amount of unreliability you can tolerate before breaching the objective.

Error budget = 100% - SLO target

Example: SLO = 99.9% → Error budget = 0.1%
Over 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes of allowed downtime
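The arithmetic above generalises to any SLO and window; a small illustrative helper (not part of the platform tooling):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    budget_fraction = 1.0 - slo_percent / 100.0
    return budget_fraction * window_days * 24 * 60

print(error_budget_minutes(99.9))   # → ~43.2 minutes over 30 days
print(error_budget_minutes(99.99))  # → ~4.3 minutes over 30 days
```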

The error budget is spent by:

  • Planned maintenance windows (Talos upgrades, Ceph rebalancing)
  • Unplanned incidents (pod crashes, network partitions, misconfigurations)
  • Deployments that cause brief errors (rolling updates, canary failures)

When the budget is exhausted, the team should:

  1. Freeze non-critical changes
  2. Focus engineering effort on reliability improvements
  3. Increase testing and validation before deploying
  4. Review recent incidents for systemic causes

Burn Rate

Burn rate measures how fast the error budget is being consumed, relative to the rate that would consume it exactly by the end of the window:

Burn rate = (observed error rate / allowed error rate)

Burn rate = 1.0 → Budget consumed evenly over the window (on track)
Burn rate = 2.0 → Budget consumed 2× faster than planned (will exhaust in 15 days instead of 30)
Burn rate = 10.0 → Budget consumed 10× faster (critical — will exhaust in 3 days)
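The same arithmetic can be sketched in a few lines (the 0.2% observed error rate is an illustrative figure):

```python
def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_percent / 100.0
    return observed_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the window's entire error budget is spent at this rate."""
    return window_days / rate

# 0.2% of requests failing against a 99.9% SLO
rate_now = burn_rate(0.002, 99.9)
print(rate_now, days_to_exhaustion(rate_now))  # ≈ 2.0, budget gone in ~15 days
```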

Alerting on burn rate (recommended over raw error rate):

| Burn Rate | Severity | Response | Alert Window |
|---|---|---|---|
| > 14.4 | Critical (P1) | Page on-call immediately | 5-minute rate over 1-hour window |
| > 6.0 | High (P2) | Respond within 30 minutes | 30-minute rate over 6-hour window |
| > 1.0 | Warning (P3) | Investigate during business hours | 6-hour rate over 3-day window |

These thresholds come from the Google SRE Workbook's multi-window, multi-burn-rate alerting approach.

PromQL example — burn rate alert for API availability:

# 5-minute error rate / allowed error rate
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) / (1 - 0.999)   # 0.999 = SLO target
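To turn this expression into the multi-window alerts from the table above, it is evaluated at two ranges and combined. A sketch of a PrometheusRule (the metric names and runbook URL reuse values from this page; the exact rule shape and thresholds follow the SRE Workbook pattern and should be adapted to RCIIS):

```yaml
groups:
  - name: rciis-slo-burn
    rules:
      # Critical burn: both the short and long window must exceed 14.4x,
      # so a brief spike alone does not page on-call.
      - alert: APIErrorBudgetBurnCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) / (1 - 0.999) > 14.4
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error budget burning at more than 14.4x"
          runbook_url: "https://docs.rciis.eac.int/runbooks/api-error-rate"
```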

Latency Percentiles (P50, P95, P99, P999)

Why Percentiles, Not Averages

Averages hide the experience of your worst-affected users. Consider two scenarios with the same 200ms average:

Scenario A: All requests take 200ms            → Average: 200ms
Scenario B: 99 requests at 100ms, 1 at 10,100ms (~10s) → Average: 200ms

Both have a 200ms average, but Scenario B has users experiencing 10-second responses. Percentiles reveal this.
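A few lines of Python make the gap concrete (purely illustrative; the index-based percentile here is a simple nearest-rank style calculation):

```python
import statistics

# Scenario B from above: 99 fast requests and one ~10-second outlier
latencies_ms = [100] * 99 + [10_100]

mean = statistics.mean(latencies_ms)

# The request slower than 99% of all others, i.e. what the tail reveals
idx = round(0.99 * len(latencies_ms))
tail = sorted(latencies_ms)[idx]

print(mean, tail)  # 200.0 10100
```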

What Percentiles Mean

A percentile answers: "What is the maximum latency experienced by X% of requests?"

| Percentile | Meaning | Tells You |
|---|---|---|
| P50 (median) | 50% of requests are faster than this value | Typical user experience |
| P90 | 90% of requests are faster | Experience for most users |
| P95 | 95% of requests are faster | Experience excluding edge cases |
| P99 | 99% of requests are faster | Worst 1-in-100 experience — the "tail latency" |
| P999 (P99.9) | 99.9% of requests are faster | Extreme tail — often dominated by GC pauses, cold caches, retries |

Visualising the Distribution

Number of
requests
  ██│
  ██│
  ██│██
  ██│██
  ██│██ ██
  ██│██ ██
  ██│██ ██ ██
  ██│██ ██ ██ ██
  ██│██ ██ ██ ██ ██
  ██│██ ██ ██ ██ ██ ██ ░░ ░░                     ░░
──┴─┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──── Latency (ms)
  50 100 150 200 250 300 350 400 450 500     2000
       ▲                 ▲           ▲          ▲
      P50               P90        P99        P999

Most requests cluster around P50. The "long tail" to the right catches the slow outliers that P99 and P999 measure.

Why P99 Matters More Than P50

If your service handles 10,000 requests per minute:

  • P50 = 100ms means 5,000 requests are faster than 100ms — the "happy path"
  • P99 = 2,000ms means 100 requests per minute take over 2 seconds — that is 100 frustrated users every minute
  • P999 = 10,000ms means 10 requests per minute take over 10 seconds — likely timeouts and retries

For the RCIIS platform, customs declaration processing that hits the P99 tail causes visible delays at border checkpoints. SLOs should target P99, not P50.

PromQL — Calculating Percentiles

Prometheus stores latency data in histograms with configurable bucket boundaries. Use histogram_quantile() to compute percentiles:

# P50 latency for HTTP requests over the last 5 minutes
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P99 latency broken down by service
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Histogram bucket boundaries

The accuracy of histogram_quantile depends on the bucket boundaries configured in the application. If the buckets are [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds, Prometheus interpolates within the bucket that contains the percentile. Finer buckets give better accuracy but consume more storage (higher cardinality).
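The interpolation can be sketched in pure Python (a simplified model of histogram_quantile; real Prometheus histograms expose cumulative counters per bucket, and the bucket counts below are invented):

```python
import bisect

def histogram_quantile(q, upper_bounds, cumulative_counts):
    """Approximate a quantile from cumulative bucket counts, linearly
    interpolating inside the bucket that contains the target rank."""
    total = cumulative_counts[-1]
    rank = q * total
    # first bucket whose cumulative count reaches the rank
    i = bisect.bisect_left(cumulative_counts, rank)
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    prev = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - prev
    if in_bucket == 0:
        return upper_bounds[i]
    return lower + (upper_bounds[i] - lower) * (rank - prev) / in_bucket

bounds = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]     # seconds
counts = [100, 400, 700, 900, 960, 990, 998, 999, 1000]  # cumulative observations
print(histogram_quantile(0.99, bounds, counts))  # ≈ 1.0
```

With these coarse buckets the P99 lands exactly at a bucket edge; if the true P99 were anywhere between 0.5s and 1s, Prometheus would report a value interpolated inside that bucket, which is why finer buckets improve accuracy.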


The Four Golden Signals

Defined in the Google SRE book, these are the four most important metrics to monitor for any service:

| Signal | What It Measures | RCIIS Example | Key Metric |
|---|---|---|---|
| Latency | Time to serve a request | APISIX gateway response time | histogram_quantile(0.99, ...) |
| Traffic | Demand on the system | Requests per second to the ESB API | sum(rate(http_requests_total[5m])) |
| Errors | Rate of failed requests | 5xx responses from the API gateway | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Saturation | How "full" the system is | CPU, memory, disk usage on worker nodes | node_cpu_seconds_total, node_memory_MemAvailable_bytes |

Latency should be measured separately for successful and failed requests — a fast error (e.g., a 500 returned in 2ms) should not artificially lower the latency metric for successful requests.

Saturation is the hardest to measure well. Look for:

  • CPU throttling (container_cpu_cfs_throttled_seconds_total)
  • Memory pressure (container_memory_working_set_bytes approaching limits)
  • Disk I/O wait (node_disk_io_time_seconds_total)
  • Ceph OSD utilisation (ceph_osd_pgs, ceph_osd_utilization)
  • Kafka consumer lag (messages produced minus consumed)

RED Method

The RED method is a simplified framework for monitoring request-driven services (APIs, web servers, gateways). It focuses on three metrics:

| Metric | Definition | PromQL Pattern |
|---|---|---|
| Rate | Requests per second | sum(rate(http_requests_total[5m])) |
| Errors | Failed requests per second | sum(rate(http_requests_total{status=~"5.."}[5m])) |
| Duration | Distribution of request latency | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) |

When to use RED: For any component that serves requests — APISIX gateway, Keycloak authentication endpoints, FluxCD controllers, Kubernetes API server.

RCIIS component mapping:

| Component | Rate | Errors | Duration |
|---|---|---|---|
| APISIX Gateway | Customs API requests/sec | 5xx + 4xx responses | Response time P99 |
| Keycloak | Auth token requests/sec | Failed logins | Token issuance latency |
| FluxCD | Reconciliation operations/min | Failed reconciliations | Reconciliation duration |
| Kubernetes API | API requests/sec | 5xx responses | API call latency |

USE Method

The USE method is a framework for monitoring infrastructure resources (CPU, memory, disk, network). Developed by Brendan Gregg, it measures:

| Metric | Definition | What Indicates a Problem |
|---|---|---|
| Utilization | Percentage of resource capacity in use | Sustained > 80% — approaching limits |
| Saturation | Work that is queued or waiting | Any non-zero value — resource is overloaded |
| Errors | Count of error events | Any errors — hardware or software fault |

When to use USE: For every physical or virtual resource — CPU, memory, disk, network interfaces, Ceph OSDs.

RCIIS infrastructure mapping:

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | rate(node_cpu_seconds_total{mode!="idle"}[5m]) | node_load15 > CPU count | Machine check exceptions |
| Memory | 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) | node_vmstat_pswpin (swap in) | ECC errors |
| Disk | rate(node_disk_io_time_seconds_total[5m]) | node_disk_io_time_weighted_seconds_total | Kernel I/O errors, SMART alerts |
| Network | rate(node_network_transmit_bytes_total[5m]) | node_network_transmit_drop_total | node_network_transmit_errs_total |
| Ceph OSD | ceph_osd_utilization | ceph_osd_pgs vs max | ceph_osd_down |

RED vs USE

Use RED for services that handle requests (APIs, web apps). Use USE for resources those services run on (nodes, disks, network). Together they cover both the application and infrastructure layers.


Availability & "Nines"

Availability is expressed as a percentage of uptime. Each additional "nine" represents a 10× improvement in reliability:

| Availability | Common Name | Downtime / Year | Downtime / Month | Downtime / Week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | Three nines | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | Three and a half nines | 4.38 hours | 21.9 minutes | 5.0 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.3 seconds | 6.05 seconds |
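The yearly column can be reproduced with one formula (illustrative; uses a 365-day year, matching the figures above):

```python
def downtime_per_year_hours(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability."""
    return (1.0 - availability_percent / 100.0) * 365 * 24

for nines in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year_hours(nines):.2f} h/year")
```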

Practical considerations for RCIIS:

  • 99.9% (three nines) is a realistic SLO for a multi-region customs platform with planned maintenance windows. It allows ~43 minutes of downtime per month.
  • 99.99% (four nines) requires zero-downtime deployments, automated failover, and minimal planned maintenance. Achievable with Cilium + HA deployments + geo-load balancing but demanding operationally.
  • 99.999% (five nines) requires active-active multi-region with automatic traffic shifting. The Cloudflare geo-load balancing layer helps achieve this for the external-facing API, but internal services are harder.

The error budget calculation from earlier ties directly to availability:

99.9% SLO → 0.1% error budget → 43.8 minutes/month of allowed downtime

Incident Metrics: MTTR, MTTF, MTBF

These metrics quantify the reliability and recoverability of the platform:

| Metric | Full Name | Definition | Formula |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Average time from incident detection to service restoration | sum(recovery times) / count(incidents) |
| MTTF | Mean Time to Failure | Average operating time before a failure occurs (for non-repairable items) | sum(uptime periods) / count(failures) |
| MTBF | Mean Time Between Failures | Average time between the start of one failure and the start of the next | MTBF = MTTF + MTTR |

         ┌──── MTTF ──────┐┌─ MTTR ─┐┌──── MTTF ──────┐┌─ MTTR ─┐
         │                ││        ││                ││        │
─────────┤    Running     ├┤  Down  ├┤    Running     ├┤  Down  ├────
         │                ││        ││                ││        │
         └────────────────┘└────────┘└────────────────┘└────────┘
         ├─────────── MTBF ─────────┤
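Given a list of outage intervals on a shared timeline, MTTR and MTBF fall out directly (a sketch; the timestamps are invented):

```python
def incident_metrics(incidents):
    """incidents: list of (start_hour, end_hour) outages, ordered by start.
    Returns (MTTR, MTBF) in hours; MTBF needs at least two incidents."""
    mttr = sum(end - start for start, end in incidents) / len(incidents)
    # MTBF: gap between the starts of consecutive failures
    gaps = [b[0] - a[0] for a, b in zip(incidents, incidents[1:])]
    mtbf = sum(gaps) / len(gaps)
    return mttr, mtbf

# Two 30-minute outages, 24 hours apart (hours since an arbitrary epoch)
mttr, mtbf = incident_metrics([(10.0, 10.5), (34.0, 34.5)])
print(mttr, mtbf)  # 0.5 24.0  (so MTTF = MTBF - MTTR = 23.5)
```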

Reducing MTTR (the most actionable metric):

| Action | Impact on MTTR | RCIIS Implementation |
|---|---|---|
| Faster detection | Reduces time-to-detect | Prometheus alerting with burn rate rules |
| Clear runbooks | Reduces time-to-diagnose | Alert annotations linking to runbook URLs |
| Automated remediation | Reduces time-to-fix | Kyverno auto-remediation, FluxCD drift correction |
| Practiced response | Reduces coordination overhead | Regular game days, documented incident response |

See Incident Response for the RCIIS incident management process.


Prometheus & PromQL Basics

Metric Types

Prometheus collects four types of metrics:

| Type | Description | Example | Operations |
|---|---|---|---|
| Counter | Monotonically increasing value (resets to 0 on restart) | http_requests_total, node_cpu_seconds_total | rate(), increase() |
| Gauge | Value that goes up and down | node_memory_MemAvailable_bytes, kube_pod_status_ready | Direct value, avg_over_time() |
| Histogram | Distribution of values in configurable buckets | http_request_duration_seconds_bucket | histogram_quantile(), rate() on _bucket |
| Summary | Pre-calculated percentiles (less common) | go_gc_duration_seconds | Direct quantile values |

Essential PromQL Patterns

Rate of change (for counters — always use rate(), never raw counter values):

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# CPU usage as a percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Aggregation:

# Total requests per second across all pods
sum(rate(http_requests_total[5m]))

# Average memory usage by namespace
avg by (namespace) (container_memory_working_set_bytes)

# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

Percentiles (from histograms):

# P99 request latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Alerting expressions:

# Alert if error rate exceeds 1% for 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01

# Alert if disk is >85% full
(node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15

Recording Rules

Recording rules pre-compute expensive PromQL queries and store the result as a new time series. This improves dashboard load time and alert evaluation performance:

prometheus-recording-rules.yaml
groups:
  - name: rciis-sli
    interval: 30s
    rules:
      # Pre-compute API availability SLI
      - record: rciis:api_availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )

      # Pre-compute P99 latency
      - record: rciis:api_latency_p99:seconds
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

Use the recorded metric name in dashboards and alerts instead of recomputing the full expression each time.


Loki & LogQL Basics

Loki is the log aggregation system deployed alongside Prometheus. LogQL queries follow a similar syntax to PromQL but operate on log streams.

Log Stream Selection

# All logs from the keycloak namespace
{namespace="keycloak"}

# Logs from a specific pod
{namespace="flux-system", pod=~"source-controller.*"}

# All error-level logs across all namespaces
{level="error"}

Filtering

# Lines containing "connection refused"
{namespace="keycloak"} |= "connection refused"

# Lines NOT containing "health"
{namespace="flux-system"} != "health"

# Regex match
{namespace="rciis"} |~ "status=(4|5)\\d\\d"

Log-Based Metrics

# Error log lines per second (useful for alerting)
sum(rate({namespace="keycloak", level="error"}[5m]))

# Error lines per pod over the last hour (grouping by a field inside the
# log line, e.g. the message, would first require a parser stage such as logfmt)
sum by (pod) (count_over_time({namespace="rciis"} |= "ERROR" [1h]))

When to use logs vs metrics

Use metrics (Prometheus) for numerical measurements: request rates, latencies, resource usage. Use logs (Loki) for event context: error messages, stack traces, audit trails, request details. Alert on metrics first; use logs for diagnosis.


Cardinality

Cardinality is the number of unique time series in Prometheus. Each unique combination of metric name and label values creates a separate time series.

Why High Cardinality Is Dangerous

http_requests_total{method="GET", status="200", path="/api/v1/health"}  → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users"}   → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users/123"} → 1 series
http_requests_total{method="GET", status="200", path="/api/v1/users/456"} → 1 series
...
http_requests_total{method="GET", status="200", path="/api/v1/users/999999"} → 1 series

If the path label includes user IDs, each unique user creates a new time series. With 100,000 users, that is 100,000 series per method/status combination — millions of series total.
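One common mitigation is to template paths before they become label values. A hedged sketch (the regexes and the ":id" convention are illustrative, not an RCIIS standard):

```python
import re

# Collapse numeric IDs and UUIDs into a placeholder before the path is used
# as a metric label, keeping the label's value set bounded.
NUMERIC_ID = re.compile(r"/\d+(?=/|$)")
UUID = re.compile(
    r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"
)

def normalize_path(path: str) -> str:
    path = UUID.sub("/:id", path)
    return NUMERIC_ID.sub("/:id", path)

print(normalize_path("/api/v1/users/123"))             # /api/v1/users/:id
print(normalize_path("/api/v1/users/123/orders/456"))  # /api/v1/users/:id/orders/:id
```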

Symptoms of cardinality explosion:

  • Prometheus memory usage spikes
  • Slow query performance in Grafana
  • TSDB compaction errors in Prometheus logs
  • Series churn warnings: "too many active series"

How to Prevent It

| Rule | Good Label | Bad Label |
|---|---|---|
| Bounded values only | status="200", method="GET" | user_id="12345", request_id="abc-def" |
| Use path templates | path="/api/v1/users/:id" | path="/api/v1/users/12345" |
| Keep label count low | 5-7 labels per metric | 15+ labels per metric |
| Drop high-cardinality labels | Use metric_relabel_configs | Ingest everything and hope |

Prometheus relabel config to drop a high-cardinality label:

metric_relabel_configs:
  # labeldrop matches label names against the regex and removes the label
  # from every scraped series (it cannot be scoped to one metric directly)
  - action: labeldrop
    regex: "request_id"

Check current cardinality:

# Top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))

Alerting Best Practices

Alert on Symptoms, Not Causes

| Symptom-Based (Good) | Cause-Based (Bad) |
|---|---|
| "API error rate > 1% for 5 minutes" | "Pod restarted" |
| "P99 latency > 2s for 10 minutes" | "CPU usage > 80%" |
| "Error budget burn rate > 6x" | "Disk usage > 70%" |

Cause-based alerts generate noise — a pod restart might be normal (rolling update), and high CPU might be expected (batch job). Symptom-based alerts fire only when users are actually affected.

Severity Levels

| Severity | Response Time | Notification | Example |
|---|---|---|---|
| P1 — Critical | Immediate (page) | PagerDuty / phone | Error budget burning 14× — service down |
| P2 — High | 30 minutes | Slack alert channel | Error budget burning 6× — degraded |
| P3 — Warning | Business hours | Slack / ticket | Error budget burning 1× — trending bad |
| P4 — Info | Next sprint | Dashboard only | Certificate expiring in 30 days |

Runbooks

Every alert should link to a runbook — a documented procedure for diagnosing and resolving the alert:

PrometheusRule annotation
annotations:
  summary: "High API error rate"
  description: "API error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
  runbook_url: "https://docs.rciis.eac.int/runbooks/api-error-rate"

A runbook should contain:

  1. What this alert means — plain language
  2. Impact — who is affected and how
  3. Diagnostic steps — specific commands to run
  4. Resolution steps — ordered actions to take
  5. Escalation — when and who to escalate to

Toil & Automation

Toil is work that is:

  • Manual — a human runs a command or clicks a button
  • Repetitive — done more than once or twice
  • Automatable — could be handled by a script or controller
  • Reactive — triggered by an event rather than planned
  • Without enduring value — does not permanently improve the system

Examples of toil in Kubernetes operations:

| Toil | Automation |
|---|---|
| Manually restarting crashed pods | Kubernetes liveness probes + automatic restart |
| Manually scaling replicas during traffic spikes | Horizontal Pod Autoscaler (HPA) |
| Manually rotating certificates | cert-manager automatic renewal |
| Manually approving Renovate PRs for patch versions | Auto-merge policy for patch updates |
| Manually checking for CVEs in images | Trivy Operator continuous scanning |
| Manually applying Kyverno policy exceptions | PolicyException CRs in Git (GitOps) |

The SRE principle is: spend no more than 50% of time on toil. If toil exceeds this, invest in automation. The RCIIS platform's GitOps approach, FluxCD drift correction, Kyverno auto-remediation, and cert-manager renewal are all examples of toil reduction.


Reliability Engineering Practices

Day-0, Day-1, Day-2 Operations

These terms describe the lifecycle phases of a platform:

| Phase | When | Activities |
|---|---|---|
| Day-0 | Before deployment | Architecture design, capacity planning, network design, security requirements — Phases 1-2 of this documentation |
| Day-1 | Initial deployment | Infrastructure build, Talos install, platform service deployment, validation — Phases 3-8 of this documentation |
| Day-2 | Ongoing operations | Upgrades, scaling, backup/recovery, incident response, certificate rotation — Phase 9 of this documentation |

Day-2 is where teams spend the most time and where reliability practices (SLOs, alerting, runbooks, automation) have the most impact.

Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into the system to verify that it handles them gracefully. The goal is to find weaknesses before they cause real incidents.

Examples for RCIIS:

| Experiment | What It Tests | Expected Outcome |
|---|---|---|
| Kill a random worker node | Pod rescheduling, Ceph rebalancing | Workloads migrate, storage remains available |
| Block network to Keycloak | Service degradation handling | Cached tokens still work, new logins show clear error |
| Fill a Ceph OSD disk to 85% | Near-full warnings, OSD auto-reweight | Alerts fire, Ceph rebalances data away |
| Inject 500ms latency on ingress | P99 SLO breach detection | Burn rate alert fires within expected window |

Note

Chaos experiments should be run in non-production environments first and only in production with explicit approval and during business hours with the team on standby.

Game Days

A game day is a planned exercise where the team practices responding to a simulated incident:

  1. Define a scenario (e.g., "a control plane node becomes unreachable")
  2. Inject the failure (cordon + drain, or network partition)
  3. The on-call team responds using normal incident procedures
  4. Run a post-mortem reviewing what worked and what did not

Game days build muscle memory for real incidents and expose gaps in runbooks, alerting, and communication.

Post-Mortems

After every significant incident, write a blameless post-mortem that documents:

  1. Timeline — when the incident started, was detected, escalated, and resolved
  2. Impact — which services, how many users, for how long
  3. Root cause — the underlying technical cause (not "human error")
  4. Contributing factors — what made detection or resolution slower
  5. Action items — specific, assigned tasks to prevent recurrence

The post-mortem should be stored in a shared location and reviewed as a team.


Further Reading

| Resource | Description |
|---|---|
| Google SRE Book | The foundational text on Site Reliability Engineering — free to read online |
| Google SRE Workbook | Practical companion to the SRE book with worked examples |
| Brendan Gregg — USE Method | The original USE method page with per-resource checklists |
| Tom Wilkie — RED Method | Grafana blog post explaining the RED method |
| Prometheus Documentation | Official Prometheus docs including PromQL reference |
| Grafana Loki — LogQL | LogQL query language reference |
| OpenSLO Specification | Open standard for defining SLOs as code |
OpenSLO Specification Open standard for defining SLOs as code