# 5.3.1 Observability
The observability stack provides metrics collection and alerting (Prometheus), log aggregation (Loki), log shipping (Fluent Bit), dashboards (Grafana), endpoint probing (Blackbox Exporter), network device monitoring (SNMP Exporter), and resource right-sizing recommendations (Goldilocks).
How to use this page
Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.
All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.
- **Using the existing `rciis-devops` repository:** All files already exist. Skip the `mkdir` and `git add`/`git commit` commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
- **Building a new repository from scratch:** Follow the `mkdir`, file creation, and `git` commands in order.
- **No Git access:** Expand the "Alternative: Helm CLI" block under each Install section.
## Prometheus (kube-prometheus-stack)
The kube-prometheus-stack Helm chart deploys the Prometheus Operator, Prometheus server, Alertmanager, Grafana, node-exporter, and kube-state-metrics as a single release. It provides the complete metrics collection, alerting, and visualization pipeline. On AWS, Grafana integrates with Keycloak for OAuth authentication. On Bare Metal, PostgreSQL backs Grafana's user database.
### Install
The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches (shown in the Configuration section).
Create the base directory and file:
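A minimal sketch of the directory setup, run from the repository root (the path matches the file locations named below):

```shell
# Create the base directory that holds the shared HelmRelease files
mkdir -p flux/infra/base
```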
| Field | Value | Explanation |
|---|---|---|
| `chart` | `kube-prometheus-stack` | The Helm chart name from the Prometheus Community registry |
| `version` | `80.14.4` | Pinned chart version — update this to upgrade Prometheus and components |
| `sourceRef.name` | `prometheus-community` | References a HelmRepository CR pointing to https://prometheus-community.github.io/helm-charts |
| `targetNamespace` | `monitoring` | Prometheus, Grafana, and related components run in the monitoring namespace |
| `crds: CreateReplace` | — | Automatically installs and updates Prometheus CRDs (PrometheusRule, ServiceMonitor, etc.) |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/prometheus.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: prometheus
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: kube-prometheus-stack
version: "80.14.4"
sourceRef:
kind: HelmRepository
name: prometheus-community
namespace: flux-system
releaseName: prometheus
install:
crds: CreateReplace
remediation:
retries: 3
createNamespace: true
upgrade:
crds: CreateReplace
remediation:
retries: 3
values:
cleanPrometheusOperatorObjectNames: true
fullnameOverride: "prometheus"
crds:
enabled: true
upgradeJob:
enabled: true
forceConflicts: true
prometheusOperator:
createCustomResource: true
enabled: true
tls:
enabled: false
admissionWebhooks:
certManager:
enabled: true
enabled: true
serviceMonitor:
selfMonitor: true
prometheus:
thanosService:
enabled: false
prometheusSpec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/name: prometheus
replicas: 2
retention: 30d
enableRemoteWriteReceiver: true
enableFeatures:
- remote-write-receiver
replicaExternalLabelName: "__replica__"
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
alertmanager:
alertmanagerSpec:
replicas: 2
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/name: alertmanager
kubelet:
enabled: true
serviceMonitor:
cAdvisor: true
grafana:
fullnameOverride: "grafana"
enabled: true
replicas: 1
admin:
existingSecret: "grafana-admin-creds"
userKey: admin-user
passwordKey: admin-password
initChownData:
enabled: false
sidecar:
dashboards:
enabled: true
label: grafana_dashboard
folderAnnotation: grafana_folder
provider:
allowUiUpdates: true
foldersFromFilesStructure: true
datasources:
enabled: true
defaultDatasourceEnabled: true
persistence:
enabled: false
additionalDataSources:
- name: Loki
type: loki
isDefault: false
access: proxy
url: http://loki-gateway.monitoring.svc.cluster.local:80
editable: true
kube-state-metrics:
fullnameOverride: "kube-state-metrics"
prometheus-node-exporter:
fullnameOverride: "node-exporter"
thanosRuler:
enabled: false
Alternative: Helm CLI
If you do not have Git access, install Prometheus directly:
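A sketch of the equivalent Helm CLI install, using the chart name, version, and repository URL from the table above. Export the `values` section of the HelmRelease into a local file first; `values.yaml` here is an illustrative filename:

```shell
# Register the chart repository and install with pinned version
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --version 80.14.4 \
  --namespace monitoring \
  --create-namespace \
  -f values.yaml
```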
### Configuration
The environment patch overrides the base HelmRelease with cluster-specific settings, including storage class, resource scaling, and (on AWS) Keycloak OAuth for Grafana.
Create the environment overlay directory:
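A sketch of the overlay layout; the `aws` and `baremetal` directory names are assumptions — align them with your repository's actual overlay paths:

```shell
# Per-environment overlay directories (names are illustrative)
mkdir -p flux/infra/aws flux/infra/baremetal
```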
#### Environment Patch
The patch file sets storage class, replica counts, external labels, and Grafana OAuth configuration. These differ fundamentally between AWS and Bare Metal.
Save the following as the patch file for your environment:
On AWS, Prometheus uses gp3 EBS volumes for persistent storage and Grafana authenticates via Keycloak OAuth. A single Prometheus and Alertmanager replica runs for cost optimization.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: prometheus
spec:
values:
prometheus:
prometheusSpec:
replicas: 1
topologySpreadConstraints: []
resources:
requests:
cpu: 50m
memory: 256Mi
limits:
cpu: 1000m
memory: 2Gi
externalLabels:
cluster: "rciis-aws"
env: "aws"
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
alertmanager:
alertmanagerSpec:
replicas: 1
storage:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 5Gi
kube-state-metrics:
prometheus:
monitor:
enabled: true
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
kubelet:
serviceMonitor:
cAdvisorRelabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
kubeApiServer:
serviceMonitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
prometheus-node-exporter:
prometheus:
monitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
grafana:
defaultDashboardsTimezone: Africa/Johannesburg
grafana.ini:
server:
root_url: https://grafana.rciis.africa
auth.generic_oauth:
enabled: true
name: Keycloak
allow_sign_up: true
client_id: grafana
scopes: openid email profile roles
auth_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/auth
token_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/token
api_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/userinfo
role_attribute_path: "contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
env:
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: grafana-postgres-rw.monitoring.svc.cluster.local:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_SSL_MODE: disable
envValueFrom:
GF_DATABASE_PASSWORD:
secretKeyRef:
name: grafana-pg-owner
key: password
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET:
secretKeyRef:
name: grafana-keycloak-client-secret
key: clientSecret
extraInitContainers:
- name: wait-for-postgres
image: busybox:1.36
command:
- sh
- -c
- |
echo "Waiting for PostgreSQL to be ready..."
until nc -z grafana-postgres-rw.monitoring.svc.cluster.local 5432; do
echo "PostgreSQL not ready, waiting..."
sleep 5
done
echo "PostgreSQL is ready!"
ingress:
enabled: false
defaultRules:
additionalRuleLabels:
cluster: rciis-aws
env: aws
| Setting | Value | Why |
|---|---|---|
| `storageClassName` | `gp3` | AWS EBS gp3 volumes provide good price/performance for time-series data |
| `replicas` | `1` | A single replica saves costs on AWS; for HA, raise the replica count and restore the topology spread constraints |
| `grafana.ini.auth.generic_oauth` | Keycloak config | Grafana users authenticate via Keycloak instead of local admin credentials |
| `GF_DATABASE_*` | PostgreSQL | Grafana data (dashboards, users) is stored in PostgreSQL instead of SQLite |
| `root_url` | `https://grafana.rciis.africa` | Sets the Grafana external URL for OAuth redirects |
On Bare Metal, Prometheus uses Ceph RBD volumes for persistent storage. PostgreSQL backs Grafana's database for data persistence. No OAuth is configured in the patch — add it separately if needed.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: prometheus
spec:
values:
prometheus:
prometheusSpec:
externalLabels:
cluster: "rciis-kenya"
env: "baremetal"
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: ceph-rbd-single
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
alertmanager:
alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
storageClassName: ceph-rbd-single
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 5Gi
kube-state-metrics:
prometheus:
monitor:
enabled: true
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
kubelet:
serviceMonitor:
cAdvisorRelabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
kubeApiServer:
serviceMonitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
prometheus-node-exporter:
prometheus:
monitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
grafana:
defaultDashboardsTimezone: Africa/Johannesburg
env:
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: grafana-postgres-rw.monitoring.svc.cluster.local:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_SSL_MODE: disable
envValueFrom:
GF_DATABASE_PASSWORD:
secretKeyRef:
name: grafana-pg-owner
key: password
extraInitContainers:
- name: wait-for-postgres
image: busybox:1.36
command:
- sh
- -c
- |
echo "Waiting for PostgreSQL to be ready..."
until nc -z grafana-postgres-rw.monitoring.svc.cluster.local 5432; do
echo "PostgreSQL not ready, waiting..."
sleep 5
done
echo "PostgreSQL is ready!"
ingress:
enabled: false
defaultRules:
additionalRuleLabels:
cluster: rciis-kenya
env: baremetal
| Setting | Value | Why |
|---|---|---|
| `storageClassName` | `ceph-rbd-single` | Ceph RBD provides persistent storage on Bare Metal with replication |
| `externalLabels` | `env: baremetal` | Tags all metrics as originating from Bare Metal for multi-cluster queries |
| `GF_DATABASE_*` | PostgreSQL | Grafana uses PostgreSQL for persistent user and dashboard storage |
Key patch differences:
| Aspect | AWS | Bare Metal |
|---|---|---|
| Storage | gp3 (EBS) | ceph-rbd-single |
| Replicas | 1 (cost-optimized) | 2 (HA-ready) |
| Grafana Auth | Keycloak OAuth | Local admin or separate OAuth |
| Database | PostgreSQL (external) | PostgreSQL (Ceph-backed) |
| Timezone | Africa/Johannesburg | Africa/Johannesburg |
### Commit and Deploy
Once all files are in place, commit and push to trigger Flux deployment:
Flux will detect the new commit and begin deploying Prometheus. To trigger an immediate sync instead of waiting for the next poll interval:
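A sketch of the commit-and-sync commands; the commit message is illustrative, and `flux-system` is the conventional bootstrap GitRepository name (adjust if yours differs):

```shell
git add flux/infra/
git commit -m "Add kube-prometheus-stack HelmRelease"  # message is illustrative
git push

# Optional: trigger an immediate Flux sync instead of waiting for the poll interval
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-prometheus -n flux-system
```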
### Verify
After Prometheus is deployed, confirm it is working:
# Check Prometheus pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Check Alertmanager
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Check Grafana
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
# Port-forward to Prometheus (prometheus-operated is the operator-managed service and exists regardless of fullnameOverride)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets to see scrape targets
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Open http://localhost:3000
### Flux Operations
This component is managed by Flux as HelmRelease prometheus and Kustomization infra-prometheus.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
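The four operations above can be sketched as follows (filtering helm-controller logs with `grep` is one illustrative way to isolate this release):

```shell
# 1. Check Ready state of the HelmRelease and Kustomization
flux get helmreleases prometheus -n flux-system
flux get kustomizations infra-prometheus -n flux-system

# 2. Immediate sync: pull the latest Git revision and re-apply manifests
flux reconcile kustomization infra-prometheus -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease prometheus -n flux-system

# 4. Recent helm-controller logs for diagnosing failed syncs or upgrades
kubectl logs -n flux-system deploy/helm-controller --tail=100 | grep -i prometheus
```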
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease prometheus -n flux-system
flux resume helmrelease prometheus -n flux-system
flux reconcile kustomization infra-prometheus -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
## Loki
Loki is the log aggregation system that receives logs from Fluent Bit, stores them in Ceph S3, and makes them queryable through Grafana. It runs in SimpleScalable deployment mode with separate read, write, and backend components for horizontal scalability.
### Install
The base HelmRelease tells Flux which chart to install. This file is shared across all environments.
Create the base directory and file:
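If the base directory does not exist yet (it is shared with the other components on this page):

```shell
mkdir -p flux/infra/base
```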
| Field | Value | Explanation |
|---|---|---|
| `chart` | `loki` | The Helm chart name from the Grafana registry |
| `version` | `6.49.0` | Pinned chart version for Loki |
| `sourceRef.name` | `grafana` | References a HelmRepository CR pointing to https://grafana.github.io/helm-charts |
| `targetNamespace` | `monitoring` | Loki runs in the monitoring namespace alongside Prometheus and Grafana |
| `crds: CreateReplace` | — | Automatically installs Loki CRDs |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/loki.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: loki
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: loki
version: "6.49.0"
sourceRef:
kind: HelmRepository
name: grafana
namespace: flux-system
releaseName: loki
install:
crds: CreateReplace
remediation:
retries: 3
createNamespace: true
upgrade:
crds: CreateReplace
remediation:
retries: 3
Alternative: Helm CLI
If you do not have Git access, install Loki directly:
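A sketch of the equivalent Helm CLI install, using the chart name, version, and repository URL from the table above. Save the environment values file from the Configuration section locally first; `loki-values.yaml` is an illustrative filename:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
  --version 6.49.0 \
  --namespace monitoring \
  --create-namespace \
  -f loki-values.yaml
```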
### Configuration
Create the environment-specific directory:
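A sketch, assuming the same per-environment overlay layout used for the other components (directory names are illustrative — match your repository):

```shell
mkdir -p flux/infra/aws flux/infra/baremetal
```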
Loki uses environment-specific values but no patches. Save the values file for your environment and deployment size:
# Loki — HA configuration
# SimpleScalable mode, replicated writes, S3 via Ceph RGW
deploymentMode: SimpleScalable
loki:
auth_enabled: false
storage:
type: s3
bucketNames:
chunks: loki-chunks
ruler: loki-ruler
admin: loki-admin
s3:
endpoint: rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
region: rciis-kenya
accessKeyId: ${AWS_ACCESS_KEY}
secretAccessKey: ${AWS_SECRET_KEY}
s3ForcePathStyle: true
insecure: true
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
limits_config:
retention_period: 30d
max_query_length: 721h
max_label_names_per_series: 30
max_streams_per_user: 50000
max_global_streams_per_user: 100000
ingestion_rate_mb: 16
ingestion_burst_size_mb: 32
per_stream_rate_limit: 3MB
per_stream_rate_limit_burst: 15MB
reject_old_samples: true
reject_old_samples_max_age: 168h
allow_structured_metadata: true
volume_enabled: true
compactor:
retention_enabled: true
delete_request_store: s3
ruler:
enable_api: true
storage_config:
type: s3
s3_storage_config:
region: rciis-kenya
bucketnames: loki-ruler
pattern_ingester:
enabled: true
commonConfig:
replication_factor: 2
backend:
replicas: 2
extraArgs:
- -config.expand-env=true
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 300m
memory: 512Mi
read:
replicas: 2
extraArgs:
- -config.expand-env=true
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/component: read
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 300m
memory: 512Mi
write:
replicas: 2
extraArgs:
- -config.expand-env=true
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/component: write
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
gateway:
replicas: 2
enabled: true
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
minio:
enabled: false
singleBinary:
replicas: 0
ingester:
replicas: 0
querier:
replicas: 0
queryFrontend:
replicas: 0
queryScheduler:
replicas: 0
distributor:
replicas: 0
compactor:
replicas: 0
indexGateway:
replicas: 0
bloomCompactor:
replicas: 0
bloomGateway:
replicas: 0
global:
extraEnvFrom:
- secretRef:
name: loki-s3
dnsService: "kube-dns"
chunksCache:
enabled: false
resultsCache:
enabled: false
lokiCanary:
enabled: false
test:
enabled: false
monitoring:
serviceMonitor:
enabled: true
labels:
release: prometheus
selfMonitoring:
enabled: false
grafanaAgent:
installOperator: false
# Loki — Non-HA configuration
# SimpleScalable mode, single replicas, no replication
deploymentMode: SimpleScalable
loki:
auth_enabled: false
storage:
type: s3
bucketNames:
chunks: loki-chunks
ruler: loki-ruler
admin: loki-admin
s3:
endpoint: rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
region: rciis-kenya
accessKeyId: ${AWS_ACCESS_KEY}
secretAccessKey: ${AWS_SECRET_KEY}
s3ForcePathStyle: true
insecure: true
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
limits_config:
retention_period: 14d
ingestion_rate_mb: 8
ingestion_burst_size_mb: 16
reject_old_samples: true
reject_old_samples_max_age: 168h
allow_structured_metadata: true
compactor:
retention_enabled: true
delete_request_store: s3
pattern_ingester:
enabled: false
commonConfig:
replication_factor: 1
backend:
replicas: 1
extraArgs:
- -config.expand-env=true
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 5Gi
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
read:
replicas: 1
extraArgs:
- -config.expand-env=true
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
write:
replicas: 1
extraArgs:
- -config.expand-env=true
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 5Gi
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 300m
memory: 512Mi
gateway:
replicas: 1
enabled: true
resources:
requests:
cpu: 25m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
minio:
enabled: false
singleBinary:
replicas: 0
ingester:
replicas: 0
querier:
replicas: 0
queryFrontend:
replicas: 0
queryScheduler:
replicas: 0
distributor:
replicas: 0
compactor:
replicas: 0
indexGateway:
replicas: 0
bloomCompactor:
replicas: 0
bloomGateway:
replicas: 0
global:
extraEnvFrom:
- secretRef:
name: loki-s3
dnsService: "kube-dns"
chunksCache:
enabled: false
resultsCache:
enabled: false
lokiCanary:
enabled: false
test:
enabled: false
monitoring:
serviceMonitor:
enabled: false
selfMonitoring:
enabled: false
grafanaAgent:
installOperator: false
### Commit and Deploy
Once all files are in place, commit and push to trigger Flux deployment:
Flux will detect the new commit and begin deploying Loki. To trigger an immediate sync:
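A sketch of the commit-and-sync commands (the commit message and `flux-system` source name are illustrative):

```shell
git add flux/infra/
git commit -m "Add Loki HelmRelease and values"  # message is illustrative
git push

# Optional: trigger an immediate Flux sync
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-loki -n flux-system
```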
### Verify
After Loki is deployed, confirm it is working:
# Check Loki components
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
# Test log ingestion via Grafana
# Navigate to Grafana → Explore → Select Loki datasource
# Query: {namespace="monitoring"}
### Flux Operations
This component is managed by Flux as HelmRelease loki and Kustomization infra-loki.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
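The four operations above can be sketched as:

```shell
# 1. Check Ready state
flux get helmreleases loki -n flux-system
flux get kustomizations infra-loki -n flux-system

# 2. Immediate sync from Git
flux reconcile kustomization infra-loki -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease loki -n flux-system

# 4. Recent helm-controller logs for diagnosis
kubectl logs -n flux-system deploy/helm-controller --tail=100 | grep -i loki
```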
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease loki -n flux-system
flux resume helmrelease loki -n flux-system
flux reconcile kustomization infra-loki -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
## Fluent Bit
Fluent Bit is a lightweight log processor and forwarder deployed as a DaemonSet on every node. It collects container logs and ships them to Loki for aggregation and querying through Grafana.
### Install
Create the base directory and file:
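If the base directory does not exist yet:

```shell
mkdir -p flux/infra/base
```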
| Field | Value | Explanation |
|---|---|---|
| `chart` | `fluent-bit` | The Helm chart name from the Fluent registry |
| `version` | `0.54.1` | Pinned chart version for Fluent Bit |
| `sourceRef.name` | `fluent` | References a HelmRepository CR pointing to https://fluent.github.io/helm-charts |
| `targetNamespace` | `monitoring` | Fluent Bit runs in the monitoring namespace as a DaemonSet |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/fluent-bit.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: fluent-bit
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: fluent-bit
version: "0.54.1"
sourceRef:
kind: HelmRepository
name: fluent
namespace: flux-system
releaseName: fluent-bit
install:
remediation:
retries: 3
createNamespace: true
upgrade:
remediation:
retries: 3
Alternative: Helm CLI
If you do not have Git access, install Fluent Bit directly:
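A sketch of the equivalent Helm CLI install, using the chart name, version, and repository URL from the table above; `fluent-bit-values.yaml` is an illustrative filename for the values file from the Configuration section:

```shell
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit \
  --version 0.54.1 \
  --namespace monitoring \
  --create-namespace \
  -f fluent-bit-values.yaml
```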
### Configuration
Create the environment-specific directory:
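A sketch, assuming the same per-environment overlay layout (directory names are illustrative):

```shell
mkdir -p flux/infra/aws flux/infra/baremetal
```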
Fluent Bit uses environment-specific values but no patches. Save the values file for your deployment size:
# Fluent Bit — HA configuration
# DaemonSet on every node, full pipeline with ServiceMonitor
kind: DaemonSet
replicaCount: 1
image:
repository: cr.fluentbit.io/fluent/fluent-bit
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 50m
memory: 64Mi
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
podSecurityContext:
runAsNonRoot: false
containerSecurityContext:
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
add:
- DAC_READ_SEARCH
seccompProfile:
type: RuntimeDefault
serviceMonitor:
enabled: true
namespace: monitoring
interval: 30s
scrapeTimeout: 10s
additionalLabels:
release: prometheus
config:
service: |
[SERVICE]
Daemon Off
Flush 1
Log_Level info
Parsers_File /fluent-bit/etc/parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
Mem_Buf_Limit 50MB
Skip_Long_Lines On
[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Read_From_Tail On
filters: |
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name modify
Match kube.*
Add cluster rciis-kenya
Add environment baremetal
outputs: |
[OUTPUT]
Name loki
Match kube.*
Host loki-gateway.monitoring.svc.cluster.local
Port 80
Labels job=fluent-bit, cluster=rciis-kenya
auto_kubernetes_labels off
label_keys $kubernetes['namespace_name'],$kubernetes['container_name'],$kubernetes['labels']['app'],$kubernetes['labels']['ceph_daemon_type'],$kubernetes['labels']['app.kubernetes.io/name']
remove_keys kubernetes,stream
line_format json
Retry_Limit 5
[OUTPUT]
Name loki
Match host.*
Host loki-gateway.monitoring.svc.cluster.local
Port 80
Labels job=fluent-bit, cluster=rciis-kenya, component=kubelet
line_format json
Retry_Limit 5
parsers: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: etcmachineid
mountPath: /etc/machine-id
readOnly: true
daemonSetVolumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: etcmachineid
hostPath:
path: /etc/machine-id
type: File
# Fluent Bit — Non-HA configuration
# DaemonSet, reduced buffer, no ServiceMonitor
kind: DaemonSet
replicaCount: 1
image:
repository: cr.fluentbit.io/fluent/fluent-bit
resources:
limits:
cpu: 100m
memory: 128Mi
requests:
cpu: 25m
memory: 32Mi
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
serviceMonitor:
enabled: false
config:
service: |
[SERVICE]
Daemon Off
Flush 5
Log_Level warn
Parsers_File /fluent-bit/etc/parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
Mem_Buf_Limit 10MB
Skip_Long_Lines On
filters: |
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
outputs: |
[OUTPUT]
Name loki
Match kube.*
Host loki-gateway.monitoring.svc.cluster.local
Port 80
Labels job=fluent-bit, cluster=rciis-kenya
auto_kubernetes_labels off
label_keys $kubernetes['namespace_name'],$kubernetes['container_name']
remove_keys kubernetes,stream
line_format json
parsers: |
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
volumeMounts:
- name: varlog
mountPath: /var/log
daemonSetVolumes:
- name: varlog
hostPath:
path: /var/log
### Commit and Deploy
Once all files are in place, commit and push to trigger Flux deployment:
Flux will detect the new commit and begin deploying Fluent Bit. To trigger an immediate sync:
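A sketch of the commit-and-sync commands (the commit message and `flux-system` source name are illustrative):

```shell
git add flux/infra/
git commit -m "Add Fluent Bit HelmRelease and values"  # message is illustrative
git push

# Optional: trigger an immediate Flux sync
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-fluent-bit -n flux-system
```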
### Verify
After Fluent Bit is deployed, confirm it is working:
# Check Fluent Bit pods (one per node)
kubectl get pods -n monitoring -l app.kubernetes.io/name=fluent-bit
# Check logs are flowing
kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=20
# Verify in Grafana → Explore → Loki
# Query: {job="fluent-bit"}
### Flux Operations
This component is managed by Flux as HelmRelease fluent-bit and Kustomization infra-fluent-bit.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
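The four operations above can be sketched as:

```shell
# 1. Check Ready state
flux get helmreleases fluent-bit -n flux-system
flux get kustomizations infra-fluent-bit -n flux-system

# 2. Immediate sync from Git
flux reconcile kustomization infra-fluent-bit -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease fluent-bit -n flux-system

# 4. Recent helm-controller logs for diagnosis
kubectl logs -n flux-system deploy/helm-controller --tail=100 | grep -i fluent-bit
```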
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease fluent-bit -n flux-system
flux resume helmrelease fluent-bit -n flux-system
flux reconcile kustomization infra-fluent-bit -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
## Blackbox Exporter
The Prometheus Blackbox Exporter probes endpoints over HTTP, TCP, and ICMP to monitor external service availability and response times. On AWS, it monitors application health endpoints (Keycloak, Kafka UI, Grafana, Nucleus Web) and Kafka broker connectivity.
### Install
Create the base directory and file:
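If the base directory does not exist yet:

```shell
mkdir -p flux/infra/base
```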
| Field | Value | Explanation |
|---|---|---|
| `chart` | `prometheus-blackbox-exporter` | The Helm chart name from the Prometheus Community registry |
| `version` | `11.8.0` | Pinned chart version for Blackbox Exporter |
| `sourceRef.name` | `prometheus-community` | References the Prometheus Community Helm repository |
| `targetNamespace` | `monitoring` | Blackbox Exporter runs in the monitoring namespace |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/blackbox-exporter.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: blackbox-exporter
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: prometheus-blackbox-exporter
version: "11.8.0"
sourceRef:
kind: HelmRepository
name: prometheus-community
namespace: flux-system
releaseName: blackbox-exporter
install:
remediation:
retries: 3
createNamespace: true
upgrade:
remediation:
retries: 3
values:
fullnameOverride: blackbox-exporter
replicas: 1
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
serviceMonitor:
enabled: true
defaults:
labels:
release: prometheus
interval: 30s
scrapeTimeout: 10s
config:
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
follow_redirects: true
preferred_ip_protocol: "ip4"
http_2xx_3xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200, 301, 302]
follow_redirects: false
preferred_ip_protocol: "ip4"
tcp_connect:
prober: tcp
timeout: 5s
Alternative: Helm CLI
If you do not have Git access, install Blackbox Exporter directly:
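A rough Helm CLI equivalent, assuming the `prometheus-community` repository is not yet configured locally, pinned to the same chart version as the HelmRelease:

```shell
# Add the Prometheus Community chart repository (skip if already present)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the same chart and version the HelmRelease pins
helm upgrade --install blackbox-exporter prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring --create-namespace \
  --version 11.8.0
```

Pass a values file with `-f` to reproduce the settings shown above.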
Configuration¶
Create the environment-specific directory:
Environment Patch¶
The AWS patch defines probe targets for application health checks and service connectivity. Bare Metal uses the base configuration without additional targets.
Save the following as the patch file for your environment:
On AWS, Blackbox Exporter probes application health endpoints and Kafka broker connectivity to monitor service availability.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: blackbox-exporter
spec:
values:
serviceMonitor:
enabled: true
defaults:
labels:
release: prometheus
interval: 30s
scrapeTimeout: 10s
targets:
- name: keycloak
url: http://kc-service.keycloak.svc.cluster.local:9000/health/ready
module: http_2xx
additionalMetricsRelabels:
service: keycloak
- name: kafka-ui
url: http://kafka-ui.rciis-prod.svc.cluster.local:80/actuator/health
module: http_2xx
additionalMetricsRelabels:
service: kafka-ui
- name: grafana
url: http://grafana.monitoring.svc.cluster.local:80/api/health
module: http_2xx
additionalMetricsRelabels:
service: grafana
- name: nucleus-web
url: http://web.rciis-prod.svc.cluster.local:8080/
module: http_2xx_3xx
additionalMetricsRelabels:
service: nucleus-web
- name: kafka-bootstrap
url: kafka-rciis-prod-kafka-bootstrap.rciis-prod.svc.cluster.local:9092
module: tcp_connect
additionalMetricsRelabels:
service: kafka
| Setting | Value | Why |
|---|---|---|
| `targets[].url` | Service endpoints | Blackbox probes each endpoint to monitor availability |
| `module: http_2xx` | Expects 200-299 | Standard HTTP success status codes |
| `module: http_2xx_3xx` | Allows redirects | Used for web UIs that may redirect |
| `module: tcp_connect` | TCP connectivity | Checks that the Kafka broker port is accepting connections |
| `additionalMetricsRelabels` | Service name | Tags metrics with the service being probed |
On Bare Metal, use the base configuration without additional targets. Customize targets based on your application deployments.
No patch file is needed — apply only the base HelmRelease.
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
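A minimal commit sequence, assuming only the base file changed (add your environment patch file as well if you created one; the commit message is illustrative):

```shell
git add flux/infra/base/blackbox-exporter.yaml
git commit -m "Add Blackbox Exporter HelmRelease"
git push
```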
Flux will detect the new commit and begin deploying Blackbox Exporter. To trigger an immediate sync:
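For example, using `infra-blackbox-exporter`, the Kustomization name for this component:

```shell
# Pull the latest Git revision and apply it immediately
flux reconcile kustomization infra-blackbox-exporter -n flux-system --with-source
```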
Verify¶
After Blackbox Exporter is deployed, confirm it is working:
# Check exporter pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-blackbox-exporter
# Test a probe manually
kubectl port-forward -n monitoring svc/blackbox-exporter 9115:9115
curl "http://localhost:9115/probe?target=https://grafana.rciis.africa&module=http_2xx"
# View probe metrics in Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090 and search for 'probe_' metrics
Flux Operations¶
This component is managed by Flux as HelmRelease blackbox-exporter and Kustomization
infra-blackbox-exporter.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
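The four operations above can be sketched as follows, using the release and Kustomization names for this component:

```shell
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmrelease blackbox-exporter -n flux-system
flux get kustomization infra-blackbox-exporter -n flux-system

# 2. Trigger an immediate sync from Git
flux reconcile kustomization infra-blackbox-exporter -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease blackbox-exporter -n flux-system

# 4. View recent controller logs for this release
flux logs --kind=HelmRelease --name=blackbox-exporter -n flux-system
```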
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease blackbox-exporter -n flux-system
flux resume helmrelease blackbox-exporter -n flux-system
flux reconcile kustomization infra-blackbox-exporter -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
SNMP Exporter¶
The Prometheus SNMP Exporter collects metrics from network devices (switches, routers, firewalls) via SNMP polling. It uses configurable module definitions for different device profiles.
Install¶
Create the base directory and file:
| Field | Value | Explanation |
|---|---|---|
| `chart` | `prometheus-snmp-exporter` | The Helm chart name from the Prometheus Community registry |
| `version` | `9.11.0` | Pinned chart version for SNMP Exporter |
| `sourceRef.name` | `prometheus-community` | References the Prometheus Community Helm repository |
| `targetNamespace` | `monitoring` | SNMP Exporter runs in the `monitoring` namespace |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/snmp-exporter.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: snmp-exporter
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: prometheus-snmp-exporter
version: "9.11.0"
sourceRef:
kind: HelmRepository
name: prometheus-community
namespace: flux-system
releaseName: snmp-exporter
install:
remediation:
retries: 3
createNamespace: true
upgrade:
remediation:
retries: 3
values:
fullnameOverride: snmp-exporter
replicas: 1
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
podSecurityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
serviceMonitor:
enabled: true
namespace: monitoring
labels:
release: prometheus
config: {}
Alternative: Helm CLI
If you do not have Git access, install SNMP Exporter directly:
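A rough Helm CLI equivalent, pinned to the same chart version as the HelmRelease (skip the `helm repo add` step if the repository is already configured):

```shell
# Add the Prometheus Community chart repository (skip if already present)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the same chart and version the HelmRelease pins
helm upgrade --install snmp-exporter prometheus-community/prometheus-snmp-exporter \
  --namespace monitoring --create-namespace \
  --version 9.11.0
```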
Configuration¶
Create the environment-specific directory:
SNMP Exporter uses environment-specific values but no patches. The module configuration (OID mappings, community strings, device profiles) is deployed as a separate ConfigMap and should be customised for your network equipment.
# SNMP Exporter — HA configuration
# 2 replicas for continuous network device monitoring
replicas: 2
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
podSecurityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
serviceMonitor:
enabled: true
namespace: monitoring
labels:
release: prometheus
config: {}
SNMP module definitions
The SNMP exporter's module configuration (OID mappings, community strings, device profiles) is deployed as a separate ConfigMap from the extra/ directory. This ConfigMap is device-specific and should be customised for the network equipment in each deployment.
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
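A minimal commit sequence, assuming only the base file changed (add your environment values file as well; the commit message is illustrative):

```shell
git add flux/infra/base/snmp-exporter.yaml
git commit -m "Add SNMP Exporter HelmRelease"
git push
```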
Flux will detect the new commit and begin deploying SNMP Exporter. To trigger an immediate sync:
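For example, using `infra-snmp-exporter`, the Kustomization name for this component:

```shell
# Pull the latest Git revision and apply it immediately
flux reconcile kustomization infra-snmp-exporter -n flux-system --with-source
```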
Verify¶
After SNMP Exporter is deployed, confirm it is working:
# Check exporter pod
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-snmp-exporter
# Verify ServiceMonitor was created
kubectl get servicemonitor -n monitoring -l release=prometheus
Flux Operations¶
This component is managed by Flux as HelmRelease snmp-exporter and Kustomization
infra-snmp-exporter.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
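The four operations above can be sketched as follows, using the release and Kustomization names for this component:

```shell
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmrelease snmp-exporter -n flux-system
flux get kustomization infra-snmp-exporter -n flux-system

# 2. Trigger an immediate sync from Git
flux reconcile kustomization infra-snmp-exporter -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease snmp-exporter -n flux-system

# 4. View recent controller logs for this release
flux logs --kind=HelmRelease --name=snmp-exporter -n flux-system
```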
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease snmp-exporter -n flux-system
flux resume helmrelease snmp-exporter -n flux-system
flux reconcile kustomization infra-snmp-exporter -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
Goldilocks¶
Goldilocks uses the Kubernetes Vertical Pod Autoscaler (VPA) in recommendation-only mode to suggest optimal resource requests and limits for workloads. It provides a dashboard showing current vs recommended resource allocations.
Install¶
Create the base directory and file:
| Field | Value | Explanation |
|---|---|---|
| `chart` | `goldilocks` | The Helm chart name from the Fairwinds Stable registry |
| `version` | `10.2.0` | Pinned chart version for Goldilocks |
| `sourceRef.name` | `fairwinds-stable` | References a HelmRepository CR pointing to https://charts.fairwinds.com/stable |
| `targetNamespace` | `goldilocks` | Goldilocks runs in its own dedicated namespace |
| `crds: CreateReplace` | — | Automatically installs Goldilocks CRDs |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/goldilocks.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: goldilocks
namespace: flux-system
spec:
targetNamespace: goldilocks
interval: 30m
chart:
spec:
chart: goldilocks
version: "10.2.0"
sourceRef:
kind: HelmRepository
name: fairwinds-stable
namespace: flux-system
releaseName: goldilocks
install:
crds: CreateReplace
remediation:
retries: 3
createNamespace: true
upgrade:
crds: CreateReplace
remediation:
retries: 3
values:
vpa:
enabled: true
updater:
enabled: false
dashboard:
enabled: true
replicaCount: 1
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
service:
type: ClusterIP
port: 80
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
controller:
enabled: true
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
Alternative: Helm CLI
If you do not have Git access, install Goldilocks directly:
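A rough Helm CLI equivalent, pinned to the same chart version as the HelmRelease (the repository URL matches the `fairwinds-stable` HelmRepository referenced above):

```shell
# Add the Fairwinds Stable chart repository (skip if already present)
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update

# Install the same chart and version the HelmRelease pins
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace \
  --version 10.2.0
```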
Configuration¶
Create the environment-specific directory:
Goldilocks uses environment-specific values but no patches. Save the values file for your deployment size:
# Goldilocks — HA configuration
# VPA recommender mode, 2-replica dashboard
vpa:
enabled: true
updater:
enabled: false
dashboard:
enabled: true
replicaCount: 2
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
service:
type: ClusterIP
port: 80
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
controller:
enabled: true
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
# Goldilocks — Non-HA configuration
# Single dashboard replica
vpa:
enabled: true
updater:
enabled: false
dashboard:
enabled: true
replicaCount: 1
resources:
requests:
cpu: 25m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
service:
type: ClusterIP
port: 80
controller:
enabled: true
resources:
requests:
cpu: 25m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
Enabling Goldilocks per namespace
Goldilocks only monitors namespaces with the label goldilocks.fairwinds.com/enabled=true.
To enable recommendations for a namespace:
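For example, labelling the `rciis-prod` namespace used elsewhere on this page (substitute your own application namespace):

```shell
# Opt a namespace in to Goldilocks recommendations
kubectl label namespace rciis-prod goldilocks.fairwinds.com/enabled=true

# Confirm the label is set
kubectl get namespace rciis-prod --show-labels
```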
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
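A minimal commit sequence, assuming only the base file changed (add your environment values file as well; the commit message is illustrative):

```shell
git add flux/infra/base/goldilocks.yaml
git commit -m "Add Goldilocks HelmRelease"
git push
```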
Flux will detect the new commit and begin deploying Goldilocks. To trigger an immediate sync:
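For example, using `infra-goldilocks`, the Kustomization name for this component:

```shell
# Pull the latest Git revision and apply it immediately
flux reconcile kustomization infra-goldilocks -n flux-system --with-source
```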
Verify¶
After Goldilocks is deployed, confirm it is working:
# Check Goldilocks pods
kubectl get pods -n goldilocks
# Access the dashboard (port-forward)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080
# Verify VPA recommender is working
kubectl describe vpa -n goldilocks
Flux Operations¶
This component is managed by Flux as HelmRelease goldilocks and Kustomization
infra-goldilocks.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
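The four operations above can be sketched as follows, using the release and Kustomization names for this component:

```shell
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmrelease goldilocks -n flux-system
flux get kustomization infra-goldilocks -n flux-system

# 2. Trigger an immediate sync from Git
flux reconcile kustomization infra-goldilocks -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease goldilocks -n flux-system

# 4. View recent controller logs for this release
flux logs --kind=HelmRelease --name=goldilocks -n flux-system
```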
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease goldilocks -n flux-system
flux resume helmrelease goldilocks -n flux-system
flux reconcile kustomization infra-goldilocks -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
Next Steps¶
With observability infrastructure in place, you're ready to deploy and configure your data services. Proceed to 5.3.2 Data Services to set up PostgreSQL, Redis, S3 storage, and their monitoring and backup strategies.