
5.3.1 Observability

The observability stack provides metrics collection and alerting (Prometheus), log aggregation (Loki), log shipping (Fluent Bit), dashboards (Grafana), endpoint probing (Blackbox Exporter), network device monitoring (SNMP Exporter), and resource right-sizing recommendations (Goldilocks).

How to use this page

Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.

All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.

  • Using the existing rciis-devops repository: All files already exist. Skip the mkdir and git add/git commit commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
  • Building a new repository from scratch: Follow the mkdir, file creation, and git commands in order.
  • No Git access: Expand the "Alternative: Helm CLI" block under each Install section.

Prometheus (kube-prometheus-stack)

The kube-prometheus-stack Helm chart deploys the Prometheus Operator, Prometheus server, Alertmanager, Grafana, node-exporter, and kube-state-metrics as a single release. It provides the complete metrics collection, alerting, and visualization pipeline. On AWS, Grafana integrates with Keycloak for OAuth authentication. On Bare Metal, PostgreSQL backs Grafana's user database.

Install

The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches (shown in the Configuration section).

Create the base directory and file:

mkdir -p flux/infra/base
Field Value Explanation
chart kube-prometheus-stack The Helm chart name from the Prometheus Community registry
version 80.14.4 Pinned chart version — update this to upgrade Prometheus and components
sourceRef.name prometheus-community References a HelmRepository CR pointing to https://prometheus-community.github.io/helm-charts
targetNamespace monitoring Prometheus, Grafana, and related components run in the monitoring namespace
crds: CreateReplace Automatically installs and updates Prometheus CRDs (PrometheusRule, ServiceMonitor, etc.)
remediation.retries 3 Flux retries up to 3 times if the install or upgrade fails

Save the following as flux/infra/base/prometheus.yaml:

flux/infra/base/prometheus.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prometheus
  namespace: flux-system
spec:
  targetNamespace: monitoring
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "80.14.4"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  releaseName: prometheus
  install:
    crds: CreateReplace
    remediation:
      retries: 3
    createNamespace: true
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    cleanPrometheusOperatorObjectNames: true
    fullnameOverride: "prometheus"
    crds:
      enabled: true
      upgradeJob:
        enabled: true
        forceConflicts: true
    prometheusOperator:
      createCustomResource: true
      enabled: true
      tls:
        enabled: false
      admissionWebhooks:
        certManager:
          enabled: true
        enabled: true
      serviceMonitor:
        selfMonitor: true
    prometheus:
      thanosService:
        enabled: false
      prometheusSpec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: DoNotSchedule
            nodeTaintsPolicy: Honor
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: prometheus
        replicas: 2
        retention: 30d
        enableRemoteWriteReceiver: true
        enableFeatures:
          - remote-write-receiver
        replicaExternalLabelName: "__replica__"
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 2Gi
    alertmanager:
      alertmanagerSpec:
        replicas: 2
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: DoNotSchedule
            nodeTaintsPolicy: Honor
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: alertmanager
    kubelet:
      enabled: true
      serviceMonitor:
        cAdvisor: true
    grafana:
      fullnameOverride: "grafana"
      enabled: true
      replicas: 1
      admin:
        existingSecret: "grafana-admin-creds"
        userKey: admin-user
        passwordKey: admin-password
      initChownData:
        enabled: false
      sidecar:
        dashboards:
          enabled: true
          label: grafana_dashboard
          folderAnnotation: grafana_folder
          provider:
            allowUiUpdates: true
            foldersFromFilesStructure: true
        datasources:
          enabled: true
          defaultDatasourceEnabled: true
      persistence:
        enabled: false
      additionalDataSources:
        - name: Loki
          type: loki
          isDefault: false
          access: proxy
          url: http://loki-gateway.monitoring.svc.cluster.local:80
          editable: true
    kube-state-metrics:
      fullnameOverride: "kube-state-metrics"
    prometheus-node-exporter:
      fullnameOverride: "node-exporter"
    thanosRuler:
      enabled: false
Alternative: Helm CLI

If you do not have Git access, install Prometheus directly:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 80.14.4 \
  -f values.yaml

Configuration

The environment patch overrides the base HelmRelease with cluster-specific settings, including storage class, resource scaling, and (on AWS) Keycloak OAuth for Grafana.

Create the environment overlay directory:

mkdir -p flux/infra/aws/prometheus
mkdir -p flux/infra/baremetal/prometheus

Environment Patch

The patch file sets storage class, replica counts, external labels, and Grafana OAuth configuration. These differ fundamentally between AWS and Bare Metal.
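Flux applies the environment patch on top of the base HelmRelease through the overlay's kustomization.yaml. A minimal sketch of that wiring (the exact resource paths and patch targeting in the repository may differ):

```yaml
# flux/infra/aws/prometheus/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/prometheus.yaml   # the shared base HelmRelease
patches:
  - path: patch.yaml             # environment-specific overrides below
    target:
      kind: HelmRelease
      name: prometheus
```

Kustomize performs a strategic merge, so the patch only needs to list the values it overrides; everything else comes from the base.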

Save the following as the patch file for your environment:

On AWS, Prometheus uses gp3 EBS volumes for persistent storage and Grafana authenticates via Keycloak OAuth. A single Prometheus and Alertmanager replica keeps costs down.

flux/infra/aws/prometheus/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prometheus
spec:
  values:
    prometheus:
      prometheusSpec:
        replicas: 1
        topologySpreadConstraints: []
        resources:
          requests:
            cpu: 50m
            memory: 256Mi
          limits:
            cpu: 1000m
            memory: 2Gi
        externalLabels:
          cluster: "rciis-aws"
          env: "aws"
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi
    alertmanager:
      alertmanagerSpec:
        replicas: 1
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 5Gi
    kube-state-metrics:
      prometheus:
        monitor:
          enabled: true
          relabelings:
            - action: replace
              targetLabel: cluster
              replacement: rciis-aws
    kubelet:
      serviceMonitor:
        cAdvisorRelabelings:
          - action: replace
            targetLabel: cluster
            replacement: rciis-aws
    kubeApiServer:
      serviceMonitor:
        relabelings:
          - action: replace
            targetLabel: cluster
            replacement: rciis-aws
    prometheus-node-exporter:
      prometheus:
        monitor:
          relabelings:
            - action: replace
              targetLabel: cluster
              replacement: rciis-aws
    grafana:
      defaultDashboardsTimezone: Africa/Johannesburg
      grafana.ini:
        server:
          root_url: https://grafana.rciis.africa
        auth.generic_oauth:
          enabled: true
          name: Keycloak
          allow_sign_up: true
          client_id: grafana
          scopes: openid email profile roles
          auth_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/auth
          token_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/token
          api_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/userinfo
          role_attribute_path: "contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
      env:
        GF_DATABASE_TYPE: postgres
        GF_DATABASE_HOST: grafana-postgres-rw.monitoring.svc.cluster.local:5432
        GF_DATABASE_NAME: grafana
        GF_DATABASE_USER: grafana
        GF_DATABASE_SSL_MODE: disable
      envValueFrom:
        GF_DATABASE_PASSWORD:
          secretKeyRef:
            name: grafana-pg-owner
            key: password
        GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET:
          secretKeyRef:
            name: grafana-keycloak-client-secret
            key: clientSecret
      extraInitContainers:
        - name: wait-for-postgres
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              echo "Waiting for PostgreSQL to be ready..."
              until nc -z grafana-postgres-rw.monitoring.svc.cluster.local 5432; do
                echo "PostgreSQL not ready, waiting..."
                sleep 5
              done
              echo "PostgreSQL is ready!"
      ingress:
        enabled: false
    defaultRules:
      additionalRuleLabels:
        cluster: rciis-aws
        env: aws
Setting Value Why
storageClassName gp3 AWS EBS gp3 volumes provide good price/performance for time-series data
replicas 1 A single replica saves costs on AWS; raise replicas and restore the topology spread constraints for HA.
grafana.ini.auth.generic_oauth Keycloak config Grafana users authenticate via Keycloak instead of local admin credentials
GF_DATABASE_* PostgreSQL Grafana data (dashboards, users) is stored in PostgreSQL instead of SQLite
root_url https://grafana.rciis.africa Sets the Grafana external URL for OAuth redirects
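
role_attribute_path is a JMESPath expression that Grafana evaluates against the OIDC token claims: the first matching branch wins and Viewer is the fallback. The precedence can be illustrated with a small Python equivalent (a sketch of the mapping logic, not Grafana's implementation):

```python
def grafana_role(claims: dict) -> str:
    """Mirror the role_attribute_path precedence: admin > editor > Viewer."""
    roles = claims.get("realm_access", {}).get("roles", [])
    if "admin" in roles:
        return "Admin"
    if "editor" in roles:
        return "Editor"
    return "Viewer"

print(grafana_role({"realm_access": {"roles": ["admin", "editor"]}}))  # Admin
print(grafana_role({"realm_access": {"roles": ["editor"]}}))           # Editor
print(grafana_role({}))                                                # Viewer
```

A user holding both Keycloak roles gets Admin, because the admin branch is evaluated first; users with no mapped role can still sign in (allow_sign_up: true) but only as Viewer.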

On Bare Metal, Prometheus uses Ceph RBD volumes for persistent storage. PostgreSQL backs Grafana's database for data persistence. No OAuth is configured in the patch — add it separately if needed.

flux/infra/baremetal/prometheus/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prometheus
spec:
  values:
    prometheus:
      prometheusSpec:
        externalLabels:
          cluster: "rciis-kenya"
          env: "baremetal"
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-rbd-single
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-rbd-single
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 5Gi
    kube-state-metrics:
      prometheus:
        monitor:
          enabled: true
          relabelings:
            - action: replace
              targetLabel: cluster
              replacement: rciis-kenya
    kubelet:
      serviceMonitor:
        cAdvisorRelabelings:
          - action: replace
            targetLabel: cluster
            replacement: rciis-kenya
    kubeApiServer:
      serviceMonitor:
        relabelings:
          - action: replace
            targetLabel: cluster
            replacement: rciis-kenya
    prometheus-node-exporter:
      prometheus:
        monitor:
          relabelings:
            - action: replace
              targetLabel: cluster
              replacement: rciis-kenya
    grafana:
      defaultDashboardsTimezone: Africa/Johannesburg
      env:
        GF_DATABASE_TYPE: postgres
        GF_DATABASE_HOST: grafana-postgres-rw.monitoring.svc.cluster.local:5432
        GF_DATABASE_NAME: grafana
        GF_DATABASE_USER: grafana
        GF_DATABASE_SSL_MODE: disable
      envValueFrom:
        GF_DATABASE_PASSWORD:
          secretKeyRef:
            name: grafana-pg-owner
            key: password
      extraInitContainers:
        - name: wait-for-postgres
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              echo "Waiting for PostgreSQL to be ready..."
              until nc -z grafana-postgres-rw.monitoring.svc.cluster.local 5432; do
                echo "PostgreSQL not ready, waiting..."
                sleep 5
              done
              echo "PostgreSQL is ready!"
      ingress:
        enabled: false
    defaultRules:
      additionalRuleLabels:
        cluster: rciis-kenya
        env: baremetal
Setting Value Why
storageClassName ceph-rbd-single Ceph RBD provides persistent storage on Bare Metal with replication
externalLabels env: baremetal Tags all metrics as originating from Bare Metal for multi-cluster queries
GF_DATABASE_* PostgreSQL Grafana uses PostgreSQL for persistent user and dashboard storage


Key patch differences:

Aspect AWS Bare Metal
Storage gp3 (EBS) ceph-rbd-single
Replicas 1 (cost-optimized) 2 (HA-ready)
Grafana Auth Keycloak OAuth Local admin or separate OAuth
Database PostgreSQL (external) PostgreSQL (Ceph-backed)
Timezone Africa/Johannesburg Africa/Johannesburg
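
The 50Gi volume request can be sanity-checked against Prometheus's rule-of-thumb sizing formula (retention time × ingested samples per second × roughly one to two bytes per sample on disk). A sketch with assumed ingest figures:

```python
def prometheus_disk_gib(retention_days: float, samples_per_sec: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb TSDB sizing: retention * ingest rate * bytes per sample."""
    seconds = retention_days * 24 * 3600
    return seconds * samples_per_sec * bytes_per_sample / 2**30

# Assumed workload: ~50k active series on a 30s scrape interval.
samples_per_sec = 50_000 / 30
print(f"{prometheus_disk_gib(30, samples_per_sec):.1f} GiB")  # ~8 GiB for 30d
```

Under these assumptions 30 days of retention needs on the order of 8 GiB, so the 50Gi request leaves headroom for series churn, WAL overhead, and compaction.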

Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/prometheus.yaml \
        flux/infra/aws/prometheus/
git commit -m "feat(prometheus): add kube-prometheus-stack for AWS environment"
git push
git add flux/infra/base/prometheus.yaml \
        flux/infra/baremetal/prometheus/
git commit -m "feat(prometheus): add kube-prometheus-stack for bare metal environment"
git push

Flux will detect the new commit and begin deploying Prometheus. To trigger an immediate sync instead of waiting for the next poll interval:

flux reconcile kustomization infra-prometheus -n flux-system --with-source

Verify

After Prometheus is deployed, confirm it is working:

# Check Prometheus pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

# Check Alertmanager
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager

# Check Grafana
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets to see scrape targets

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Open http://localhost:3000

Flux Operations

This component is managed by Flux as HelmRelease prometheus and Kustomization infra-prometheus.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease prometheus -n flux-system
flux get kustomization infra-prometheus -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-prometheus -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease prometheus -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=prometheus -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease prometheus -n flux-system
flux resume helmrelease prometheus -n flux-system
flux reconcile kustomization infra-prometheus -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


Loki

Loki is the log aggregation system that receives logs from Fluent Bit, stores them in Ceph S3, and makes them queryable through Grafana. It runs in SimpleScalable deployment mode with separate read, write, and backend components for horizontal scalability.

Install

The base HelmRelease tells Flux which chart to install. This file is shared across all environments.

Create the base directory and file:

mkdir -p flux/infra/base
Field Value Explanation
chart loki The Helm chart name from the Grafana registry
version 6.49.0 Pinned chart version for Loki
sourceRef.name grafana References a HelmRepository CR pointing to https://grafana.github.io/helm-charts
targetNamespace monitoring Loki runs in the monitoring namespace alongside Prometheus and Grafana
crds: CreateReplace Automatically installs Loki CRDs
remediation.retries 3 Flux retries up to 3 times if the install or upgrade fails

Save the following as flux/infra/base/loki.yaml:

flux/infra/base/loki.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki
  namespace: flux-system
spec:
  targetNamespace: monitoring
  interval: 30m
  chart:
    spec:
      chart: loki
      version: "6.49.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  releaseName: loki
  install:
    crds: CreateReplace
    remediation:
      retries: 3
    createNamespace: true
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI

If you do not have Git access, install Loki directly:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki \
  --namespace monitoring \
  --create-namespace \
  --version 6.49.0 \
  -f values.yaml

Configuration

Create the environment-specific directory:

mkdir -p flux/infra/aws/loki
mkdir -p flux/infra/baremetal/loki

Loki uses environment-specific values but no patches. Save the values file for your environment and deployment size:

values.yaml (HA)
# Loki — HA configuration
# SimpleScalable mode, replicated writes, S3 via Ceph RGW

deploymentMode: SimpleScalable

loki:
  auth_enabled: false

  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
      admin: loki-admin
    s3:
      endpoint: rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
      region: rciis-kenya
      accessKeyId: ${AWS_ACCESS_KEY}
      secretAccessKey: ${AWS_SECRET_KEY}
      s3ForcePathStyle: true
      insecure: true

  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  limits_config:
    retention_period: 30d
    max_query_length: 721h
    max_label_names_per_series: 30
    max_streams_per_user: 50000
    max_global_streams_per_user: 100000
    ingestion_rate_mb: 16
    ingestion_burst_size_mb: 32
    per_stream_rate_limit: 3MB
    per_stream_rate_limit_burst: 15MB
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    allow_structured_metadata: true
    volume_enabled: true

  compactor:
    retention_enabled: true
    delete_request_store: s3

  ruler:
    enable_api: true
    storage_config:
      type: s3
      s3_storage_config:
        region: rciis-kenya
        bucketnames: loki-ruler

  pattern_ingester:
    enabled: true

  commonConfig:
    replication_factor: 2

backend:
  replicas: 2
  extraArgs:
    - -config.expand-env=true
  persistence:
    enabled: true
    storageClass: ceph-rbd-single
    size: 10Gi
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 300m
      memory: 512Mi

read:
  replicas: 2
  extraArgs:
    - -config.expand-env=true
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      nodeTaintsPolicy: Honor
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: read
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 300m
      memory: 512Mi

write:
  replicas: 2
  extraArgs:
    - -config.expand-env=true
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      nodeTaintsPolicy: Honor
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: write
  persistence:
    enabled: true
    storageClass: ceph-rbd-single
    size: 10Gi
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 1Gi

gateway:
  replicas: 2
  enabled: true
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi

minio:
  enabled: false

singleBinary:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0

global:
  extraEnvFrom:
    - secretRef:
        name: loki-s3
  dnsService: "kube-dns"

chunksCache:
  enabled: false

resultsCache:
  enabled: false

lokiCanary:
  enabled: false

test:
  enabled: false

monitoring:
  serviceMonitor:
    enabled: true
    labels:
      release: prometheus
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false

values.yaml (non-HA)
# Loki — Non-HA configuration
# SimpleScalable mode, single replicas, no replication

deploymentMode: SimpleScalable

loki:
  auth_enabled: false

  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
      admin: loki-admin
    s3:
      endpoint: rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
      region: rciis-kenya
      accessKeyId: ${AWS_ACCESS_KEY}
      secretAccessKey: ${AWS_SECRET_KEY}
      s3ForcePathStyle: true
      insecure: true

  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  limits_config:
    retention_period: 14d
    ingestion_rate_mb: 8
    ingestion_burst_size_mb: 16
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    allow_structured_metadata: true

  compactor:
    retention_enabled: true
    delete_request_store: s3

  pattern_ingester:
    enabled: false

  commonConfig:
    replication_factor: 1

backend:
  replicas: 1
  extraArgs:
    - -config.expand-env=true
  persistence:
    enabled: true
    storageClass: ceph-rbd-single
    size: 5Gi
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi

read:
  replicas: 1
  extraArgs:
    - -config.expand-env=true
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi

write:
  replicas: 1
  extraArgs:
    - -config.expand-env=true
  persistence:
    enabled: true
    storageClass: ceph-rbd-single
    size: 5Gi
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 300m
      memory: 512Mi

gateway:
  replicas: 1
  enabled: true
  resources:
    requests:
      cpu: 25m
      memory: 32Mi
    limits:
      cpu: 50m
      memory: 64Mi

minio:
  enabled: false

singleBinary:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0

global:
  extraEnvFrom:
    - secretRef:
        name: loki-s3
  dnsService: "kube-dns"

chunksCache:
  enabled: false

resultsCache:
  enabled: false

lokiCanary:
  enabled: false

test:
  enabled: false

monitoring:
  serviceMonitor:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
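
The ingestion_rate_mb / ingestion_burst_size_mb pair behaves like a token bucket: sustained throughput is capped at the rate, while short spikes up to the burst size are absorbed. A sketch of those semantics (illustrative only, not Loki's actual limiter code):

```python
class TokenBucket:
    """Sustained rate of `rate_mb` MB/s; spikes up to `burst_mb` MB allowed."""
    def __init__(self, rate_mb: float, burst_mb: float):
        self.rate = rate_mb
        self.capacity = burst_mb
        self.tokens = burst_mb  # bucket starts full

    def allow(self, size_mb: float, elapsed_s: float) -> bool:
        # Refill for the time elapsed since the last request, then spend.
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed_s)
        if size_mb <= self.tokens:
            self.tokens -= size_mb
            return True
        return False  # over limit: Loki rejects the push

bucket = TokenBucket(rate_mb=8, burst_mb=16)  # non-HA values above
print(bucket.allow(16, 0))    # full 16 MB burst accepted -> True
print(bucket.allow(1, 0))     # bucket drained, no time passed -> False
print(bucket.allow(8, 1.0))   # 8 MB refilled after 1s -> True
```

If Fluent Bit regularly hits this limit (429 responses in its logs), raise the rate/burst pair or reduce log volume at the source.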

Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/loki.yaml \
        flux/infra/aws/loki/
git commit -m "feat(loki): add Loki log aggregation for AWS"
git push
git add flux/infra/base/loki.yaml \
        flux/infra/baremetal/loki/
git commit -m "feat(loki): add Loki log aggregation for bare metal"
git push

Flux will detect the new commit and begin deploying Loki. To trigger an immediate sync:

flux reconcile kustomization infra-loki -n flux-system --with-source

Verify

After Loki is deployed, confirm it is working:

# Check Loki components
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki

# Test log ingestion via Grafana
# Navigate to Grafana → Explore → Select Loki datasource
# Query: {namespace="monitoring"}
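
You can also push a test line straight to Loki's push API (POST /loki/api/v1/push) and then look for it in Grafana. A sketch that builds the documented JSON payload (the port-forward target below is an assumption based on the gateway service configured above):

```python
import json
import time

def build_push_payload(labels: dict, line: str) -> dict:
    """Loki push API body: streams -> label set + [timestamp_ns, line] pairs."""
    ts_ns = str(time.time_ns())  # Loki expects nanosecond unix timestamps as strings
    return {"streams": [{"stream": labels, "values": [[ts_ns, line]]}]}

payload = build_push_payload(
    {"job": "smoke-test", "cluster": "rciis-kenya"},
    "hello from the observability runbook",
)
print(json.dumps(payload, indent=2))
```

After `kubectl port-forward -n monitoring svc/loki-gateway 3100:80`, POST the payload with `curl -H 'Content-Type: application/json' -d @payload.json http://localhost:3100/loki/api/v1/push`, then query `{job="smoke-test"}` in Grafana Explore.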

Flux Operations

This component is managed by Flux as HelmRelease loki and Kustomization infra-loki.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease loki -n flux-system
flux get kustomization infra-loki -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-loki -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease loki -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=loki -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease loki -n flux-system
flux resume helmrelease loki -n flux-system
flux reconcile kustomization infra-loki -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


Fluent Bit

Fluent Bit is a lightweight log processor and forwarder deployed as a DaemonSet on every node. It collects container logs and ships them to Loki for aggregation and querying through Grafana.
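
The tail input in the configuration below reads /var/log/containers/*.log, and the kubernetes filter enriches each record by parsing pod, namespace, and container out of the file name, which kubelet writes as <pod>_<namespace>_<container>-<container-id>.log. A sketch of that parse (illustrative, not Fluent Bit's code):

```python
import re

# Kubelet symlink naming: <pod>_<namespace>_<container>-<container-id>.log
# (the trailing ID is commonly a 64-char hex containerd/Docker ID)
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-(?P<cid>[0-9a-f]{64})\.log$"
)

def parse_log_filename(name: str) -> dict:
    m = LOG_NAME.match(name)
    if not m:
        raise ValueError(f"unexpected log file name: {name}")
    return {k: m.group(k) for k in ("pod", "namespace", "container")}

print(parse_log_filename("grafana-7c9d_monitoring_grafana-" + "a" * 64 + ".log"))
# {'pod': 'grafana-7c9d', 'namespace': 'monitoring', 'container': 'grafana'}
```

This is why the `Tag kube.*` convention works: the tag carries the file path, and the filter can resolve it to Kubernetes metadata (labels, annotations) via the API server.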

Install

Create the base directory and file:

mkdir -p flux/infra/base
Field Value Explanation
chart fluent-bit The Helm chart name from the Fluent registry
version 0.54.1 Pinned chart version for Fluent Bit
sourceRef.name fluent References a HelmRepository CR pointing to https://fluent.github.io/helm-charts
targetNamespace monitoring Fluent Bit runs in the monitoring namespace as a DaemonSet
remediation.retries 3 Flux retries up to 3 times if the install or upgrade fails

Save the following as flux/infra/base/fluent-bit.yaml:

flux/infra/base/fluent-bit.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: fluent-bit
  namespace: flux-system
spec:
  targetNamespace: monitoring
  interval: 30m
  chart:
    spec:
      chart: fluent-bit
      version: "0.54.1"
      sourceRef:
        kind: HelmRepository
        name: fluent
        namespace: flux-system
  releaseName: fluent-bit
  install:
    remediation:
      retries: 3
    createNamespace: true
  upgrade:
    remediation:
      retries: 3
Alternative: Helm CLI

If you do not have Git access, install Fluent Bit directly:

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit \
  --namespace monitoring \
  --create-namespace \
  --version 0.54.1 \
  -f values.yaml

Configuration

Create the environment-specific directory:

mkdir -p flux/infra/aws/fluent-bit
mkdir -p flux/infra/baremetal/fluent-bit

Fluent Bit uses environment-specific values but no patches. Save the values file for your deployment size:

values.yaml
# Fluent Bit — HA configuration
# DaemonSet on every node, full pipeline with ServiceMonitor

kind: DaemonSet

replicaCount: 1

image:
  repository: cr.fluentbit.io/fluent/fluent-bit

resources:
  limits:
    cpu: 200m
    memory: 256Mi
  requests:
    cpu: 50m
    memory: 64Mi

tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule

podSecurityContext:
  runAsNonRoot: false

containerSecurityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - DAC_READ_SEARCH
  seccompProfile:
    type: RuntimeDefault

serviceMonitor:
  enabled: true
  namespace: monitoring
  interval: 30s
  scrapeTimeout: 10s
  additionalLabels:
    release: prometheus

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush 1
        Log_Level info
        Parsers_File /fluent-bit/etc/parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        Health_Check On

  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Mem_Buf_Limit 50MB
        Skip_Long_Lines On

    [INPUT]
        Name systemd
        Tag host.*
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Read_From_Tail On

  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

    [FILTER]
        Name modify
        Match kube.*
        Add cluster rciis-kenya
        Add environment baremetal

  outputs: |
    [OUTPUT]
        Name loki
        Match kube.*
        Host loki-gateway.monitoring.svc.cluster.local
        Port 80
        Labels job=fluent-bit, cluster=rciis-kenya
        auto_kubernetes_labels off
        label_keys $kubernetes['namespace_name'],$kubernetes['container_name'],$kubernetes['labels']['app'],$kubernetes['labels']['ceph_daemon_type'],$kubernetes['labels']['app.kubernetes.io/name']
        remove_keys kubernetes,stream
        line_format json
        Retry_Limit 5

    [OUTPUT]
        Name loki
        Match host.*
        Host loki-gateway.monitoring.svc.cluster.local
        Port 80
        Labels job=fluent-bit, cluster=rciis-kenya, component=kubelet
        line_format json
        Retry_Limit 5

  parsers: |
    [PARSER]
        Name docker
        Format json
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep On

    [PARSER]
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

volumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
    readOnly: true
  - name: etcmachineid
    mountPath: /etc/machine-id
    readOnly: true

daemonSetVolumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
  - name: etcmachineid
    hostPath:
      path: /etc/machine-id
      type: File
values.yaml
# Fluent Bit — Non-HA configuration
# DaemonSet, reduced buffer, no ServiceMonitor

kind: DaemonSet

replicaCount: 1

image:
  repository: cr.fluentbit.io/fluent/fluent-bit

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 25m
    memory: 32Mi

tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule

serviceMonitor:
  enabled: false

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush 5
        Log_Level warn
        Parsers_File /fluent-bit/etc/parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        Health_Check On

  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Mem_Buf_Limit 10MB
        Skip_Long_Lines On

  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

  outputs: |
    [OUTPUT]
        Name loki
        Match kube.*
        Host loki-gateway.monitoring.svc.cluster.local
        Port 80
        Labels job=fluent-bit, cluster=rciis-kenya
        auto_kubernetes_labels off
        label_keys $kubernetes['namespace_name'],$kubernetes['container_name']
        remove_keys kubernetes,stream
        line_format json

  parsers: |
    [PARSER]
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

volumeMounts:
  - name: varlog
    mountPath: /var/log

daemonSetVolumes:
  - name: varlog
    hostPath:
      path: /var/log
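
The cri parser used in both configurations can be sanity-checked outside the cluster. The sed pattern below mirrors the parser's regex (named capture groups replaced by numbered ones); the log line is a made-up sample in containerd's CRI format:

```shell
# Sample CRI log line: <time> <stream> <logtag> <message>
line='2024-05-01T12:00:00.123456789+00:00 stdout F hello world'
echo "$line" | sed -E 's/^([^ ]+) (stdout|stderr) ([^ ]*) (.*)$/time=\1 stream=\2 logtag=\3 message=\4/'
# → time=2024-05-01T12:00:00.123456789+00:00 stream=stdout logtag=F message=hello world
```

The logtag field (F for a full line, P for a partial one) is what the multiline.parser cri setting uses to reassemble long log lines.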

Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/fluent-bit.yaml \
        flux/infra/aws/fluent-bit/
git commit -m "feat(fluent-bit): add log forwarding for AWS"
git push
git add flux/infra/base/fluent-bit.yaml \
        flux/infra/baremetal/fluent-bit/
git commit -m "feat(fluent-bit): add log forwarding for bare metal"
git push

Flux will detect the new commit and begin deploying Fluent Bit. To trigger an immediate sync:

flux reconcile kustomization infra-fluent-bit -n flux-system --with-source

Verify

After Fluent Bit is deployed, confirm it is working:

# Check Fluent Bit pods (one per node)
kubectl get pods -n monitoring -l app.kubernetes.io/name=fluent-bit

# Check logs are flowing
kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=20

# Verify in Grafana → Explore → Loki
# Query: {job="fluent-bit"}

Flux Operations

This component is managed by Flux as HelmRelease fluent-bit and Kustomization infra-fluent-bit.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease fluent-bit -n flux-system
flux get kustomization infra-fluent-bit -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-fluent-bit -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease fluent-bit -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=fluent-bit -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease fluent-bit -n flux-system
flux resume helmrelease fluent-bit -n flux-system
flux reconcile kustomization infra-fluent-bit -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


Blackbox Exporter

The Prometheus Blackbox Exporter probes endpoints over HTTP, TCP, and ICMP to monitor external service availability and response times. On AWS, it monitors application health endpoints (Keycloak, Kafka UI, Grafana, Nucleus Web) and Kafka broker connectivity.

Install

Create the base directory and file:

mkdir -p flux/infra/base
| Field | Value | Explanation |
| --- | --- | --- |
| chart | prometheus-blackbox-exporter | The Helm chart name from the Prometheus Community registry |
| version | 11.8.0 | Pinned chart version for Blackbox Exporter |
| sourceRef.name | prometheus-community | References the Prometheus Community Helm repository |
| targetNamespace | monitoring | Blackbox Exporter runs in the monitoring namespace |
| remediation.retries | 3 | Flux retries up to 3 times if the install or upgrade fails |

Save the following as flux/infra/base/blackbox-exporter.yaml:

flux/infra/base/blackbox-exporter.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: blackbox-exporter
  namespace: flux-system
spec:
  targetNamespace: monitoring
  interval: 30m
  chart:
    spec:
      chart: prometheus-blackbox-exporter
      version: "11.8.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  releaseName: blackbox-exporter
  install:
    remediation:
      retries: 3
    createNamespace: true
  upgrade:
    remediation:
      retries: 3
  values:
    fullnameOverride: blackbox-exporter
    replicas: 1
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 128Mi
    serviceMonitor:
      enabled: true
      defaults:
        labels:
          release: prometheus
        interval: 30s
        scrapeTimeout: 10s
    config:
      modules:
        http_2xx:
          prober: http
          timeout: 5s
          http:
            valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
            follow_redirects: true
            preferred_ip_protocol: "ip4"
        http_2xx_3xx:
          prober: http
          timeout: 5s
          http:
            valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
            valid_status_codes: [200, 301, 302]
            follow_redirects: false
            preferred_ip_protocol: "ip4"
        tcp_connect:
          prober: tcp
          timeout: 5s
Alternative: Helm CLI

If you do not have Git access, install Blackbox Exporter directly:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install blackbox-exporter prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring \
  --create-namespace \
  --version 11.8.0 \
  -f values.yaml

Configuration

Create the environment-specific directories:

mkdir -p flux/infra/aws/blackbox-exporter
mkdir -p flux/infra/baremetal/blackbox-exporter

Environment Patch

The AWS patch defines probe targets for application health checks and service connectivity. Bare Metal uses the base configuration without additional targets.

Save the following as the patch file for your environment:

On AWS, Blackbox Exporter probes application health endpoints and Kafka broker connectivity to monitor service availability.

flux/infra/aws/blackbox-exporter/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: blackbox-exporter
spec:
  values:
    serviceMonitor:
      enabled: true
      defaults:
        labels:
          release: prometheus
        interval: 30s
        scrapeTimeout: 10s
      targets:
        - name: keycloak
          url: http://kc-service.keycloak.svc.cluster.local:9000/health/ready
          module: http_2xx
          additionalMetricsRelabels:
            service: keycloak
        - name: kafka-ui
          url: http://kafka-ui.rciis-prod.svc.cluster.local:80/actuator/health
          module: http_2xx
          additionalMetricsRelabels:
            service: kafka-ui
        - name: grafana
          url: http://grafana.monitoring.svc.cluster.local:80/api/health
          module: http_2xx
          additionalMetricsRelabels:
            service: grafana
        - name: nucleus-web
          url: http://web.rciis-prod.svc.cluster.local:8080/
          module: http_2xx_3xx
          additionalMetricsRelabels:
            service: nucleus-web
        - name: kafka-bootstrap
          url: kafka-rciis-prod-kafka-bootstrap.rciis-prod.svc.cluster.local:9092
          module: tcp_connect
          additionalMetricsRelabels:
            service: kafka
| Setting | Value | Why |
| --- | --- | --- |
| targets[].url | Service endpoints | Blackbox probes each endpoint to monitor availability |
| module: http_2xx | Expects 200-299 | Standard HTTP success status codes |
| module: http_2xx_3xx | Allows redirects | Used for a web UI that may redirect |
| module: tcp_connect | TCP connectivity | Checks the Kafka broker port is accepting connections |
| additionalMetricsRelabels | Service name | Tags metrics with the service being probed |
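
Probe metrics are only useful if something alerts on them. The following is a minimal PrometheusRule sketch (the rule and alert names are illustrative, not part of the repository); probe_success is the standard metric the exporter emits for every probe, and the release: prometheus label is assumed to match the Prometheus rule selector used elsewhere on this page:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-probe-alerts        # illustrative name
  namespace: monitoring
  labels:
    release: prometheus              # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: blackbox
      rules:
        - alert: ProbeFailed
          expr: probe_success == 0   # 0 = the last probe of this target failed
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Probe for {{ $labels.instance }} has been failing for 5 minutes"
```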

On Bare Metal, use the base configuration without additional targets. Customize targets based on your application deployments.

No patch file is needed — apply only the base HelmRelease.


Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/blackbox-exporter.yaml \
        flux/infra/aws/blackbox-exporter/
git commit -m "feat(blackbox-exporter): add endpoint monitoring for AWS"
git push
git add flux/infra/base/blackbox-exporter.yaml
git commit -m "feat(blackbox-exporter): add endpoint monitoring"
git push

Flux will detect the new commit and begin deploying Blackbox Exporter. To trigger an immediate sync:

flux reconcile kustomization infra-blackbox-exporter -n flux-system --with-source

Verify

After Blackbox Exporter is deployed, confirm it is working:

# Check exporter pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-blackbox-exporter

# Test a probe manually
kubectl port-forward -n monitoring svc/blackbox-exporter 9115:9115
curl "http://localhost:9115/probe?target=https://grafana.rciis.africa&module=http_2xx"

# View probe metrics in Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090 and search for 'probe_' metrics

Flux Operations

This component is managed by Flux as HelmRelease blackbox-exporter and Kustomization infra-blackbox-exporter.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease blackbox-exporter -n flux-system
flux get kustomization infra-blackbox-exporter -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-blackbox-exporter -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease blackbox-exporter -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=blackbox-exporter -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease blackbox-exporter -n flux-system
flux resume helmrelease blackbox-exporter -n flux-system
flux reconcile kustomization infra-blackbox-exporter -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


SNMP Exporter

The Prometheus SNMP Exporter collects metrics from network devices (switches, routers, firewalls) via SNMP polling. It uses configurable module definitions for different device profiles.

Install

Create the base directory and file:

mkdir -p flux/infra/base
| Field | Value | Explanation |
| --- | --- | --- |
| chart | prometheus-snmp-exporter | The Helm chart name from the Prometheus Community registry |
| version | 9.11.0 | Pinned chart version for SNMP Exporter |
| sourceRef.name | prometheus-community | References the Prometheus Community Helm repository |
| targetNamespace | monitoring | SNMP Exporter runs in the monitoring namespace |
| remediation.retries | 3 | Flux retries up to 3 times if the install or upgrade fails |

Save the following as flux/infra/base/snmp-exporter.yaml:

flux/infra/base/snmp-exporter.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: snmp-exporter
  namespace: flux-system
spec:
  targetNamespace: monitoring
  interval: 30m
  chart:
    spec:
      chart: prometheus-snmp-exporter
      version: "9.11.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  releaseName: snmp-exporter
  install:
    remediation:
      retries: 3
    createNamespace: true
  upgrade:
    remediation:
      retries: 3
  values:
    fullnameOverride: snmp-exporter
    replicas: 1
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 128Mi
    podSecurityContext:
      runAsNonRoot: true
      seccompProfile:
        type: RuntimeDefault
    serviceMonitor:
      enabled: true
      namespace: monitoring
      labels:
        release: prometheus
    config: {}
Alternative: Helm CLI

If you do not have Git access, install SNMP Exporter directly:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install snmp-exporter prometheus-community/prometheus-snmp-exporter \
  --namespace monitoring \
  --create-namespace \
  --version 9.11.0 \
  -f values.yaml

Configuration

Create the environment-specific directories:

mkdir -p flux/infra/aws/snmp-exporter
mkdir -p flux/infra/baremetal/snmp-exporter

SNMP Exporter uses environment-specific values but no patches. Save the values file for your deployment size (the module configuration itself is covered under SNMP module definitions below):

values.yaml
# SNMP Exporter — HA configuration
# 2 replicas for continuous network device monitoring

replicas: 2

resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 128Mi

podSecurityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault

serviceMonitor:
  enabled: true
  namespace: monitoring
  labels:
    release: prometheus

config: {}
values.yaml
# SNMP Exporter — Non-HA configuration
# Single replica

replicas: 1

resources:
  requests:
    cpu: 25m
    memory: 32Mi
  limits:
    cpu: 50m
    memory: 64Mi

serviceMonitor:
  enabled: false

config: {}

SNMP module definitions

The SNMP exporter's module configuration (OID mappings, community strings, device profiles) is deployed as a separate ConfigMap from the extra/ directory. This ConfigMap is device-specific and should be customised for the network equipment in each deployment.
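
As an illustration, a minimal module definition might look like the following. This is a sketch only: the auth name, community string, and OID subtrees are placeholders to adapt to your devices.

```yaml
# Hypothetical snmp.yml fragment for the extra/ ConfigMap
# (recent snmp_exporter layout: credentials under auths, OID walks under modules)
auths:
  public_v2:
    community: public          # placeholder community string
    version: 2
modules:
  if_mib:
    walk:
      - 1.3.6.1.2.1.2.2        # ifTable (interface counters)
      - 1.3.6.1.2.1.31.1.1     # ifXTable (64-bit counters, ifAlias)
```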

Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/snmp-exporter.yaml \
        flux/infra/aws/snmp-exporter/
git commit -m "feat(snmp-exporter): add network device monitoring for AWS"
git push
git add flux/infra/base/snmp-exporter.yaml \
        flux/infra/baremetal/snmp-exporter/
git commit -m "feat(snmp-exporter): add network device monitoring"
git push

Flux will detect the new commit and begin deploying SNMP Exporter. To trigger an immediate sync:

flux reconcile kustomization infra-snmp-exporter -n flux-system --with-source

Verify

After SNMP Exporter is deployed, confirm it is working:

# Check exporter pod
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-snmp-exporter

# Verify ServiceMonitor was created
kubectl get servicemonitor -n monitoring -l release=prometheus
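
With the exporter running, Prometheus still needs a scrape job that routes each device through it: the device address goes in as a target parameter, and the exporter's own service becomes the scrape target. A hedged sketch of an additionalScrapeConfigs entry for kube-prometheus-stack (the device IP, module, and auth names are placeholders):

```yaml
- job_name: snmp-devices
  metrics_path: /snmp
  params:
    module: [if_mib]           # module name from the snmp.yml ConfigMap
    auth: [public_v2]          # auth entry, if your snmp_exporter version uses auths
  static_configs:
    - targets:
        - 192.0.2.1            # placeholder switch IP
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target   # device IP becomes the ?target= parameter
    - source_labels: [__param_target]
      target_label: instance         # keep the device IP as the instance label
    - target_label: __address__
      replacement: snmp-exporter.monitoring.svc.cluster.local:9116
```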

Flux Operations

This component is managed by Flux as HelmRelease snmp-exporter and Kustomization infra-snmp-exporter.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease snmp-exporter -n flux-system
flux get kustomization infra-snmp-exporter -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-snmp-exporter -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease snmp-exporter -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=snmp-exporter -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease snmp-exporter -n flux-system
flux resume helmrelease snmp-exporter -n flux-system
flux reconcile kustomization infra-snmp-exporter -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


Goldilocks

Goldilocks uses the Kubernetes Vertical Pod Autoscaler (VPA) in recommendation-only mode to suggest optimal resource requests and limits for workloads. It provides a dashboard showing current vs recommended resource allocations.

Install

Create the base directory and file:

mkdir -p flux/infra/base
| Field | Value | Explanation |
| --- | --- | --- |
| chart | goldilocks | The Helm chart name from the Fairwinds Stable registry |
| version | 10.2.0 | Pinned chart version for Goldilocks |
| sourceRef.name | fairwinds-stable | References a HelmRepository CR pointing to https://charts.fairwinds.com/stable |
| targetNamespace | goldilocks | Goldilocks runs in its own dedicated namespace |
| crds | CreateReplace | Automatically installs the CRDs shipped with the chart |
| remediation.retries | 3 | Flux retries up to 3 times if the install or upgrade fails |

Save the following as flux/infra/base/goldilocks.yaml:

flux/infra/base/goldilocks.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: goldilocks
  namespace: flux-system
spec:
  targetNamespace: goldilocks
  interval: 30m
  chart:
    spec:
      chart: goldilocks
      version: "10.2.0"
      sourceRef:
        kind: HelmRepository
        name: fairwinds-stable
        namespace: flux-system
  releaseName: goldilocks
  install:
    crds: CreateReplace
    remediation:
      retries: 3
    createNamespace: true
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    vpa:
      enabled: true
      updater:
        enabled: false
    dashboard:
      enabled: true
      replicaCount: 1
      resources:
        requests:
          cpu: 50m
          memory: 64Mi
        limits:
          cpu: 100m
          memory: 128Mi
      service:
        type: ClusterIP
        port: 80
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
    controller:
      enabled: true
      resources:
        requests:
          cpu: 50m
          memory: 64Mi
        limits:
          cpu: 100m
          memory: 128Mi
Alternative: Helm CLI

If you do not have Git access, install Goldilocks directly:

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace \
  --version 10.2.0 \
  -f values.yaml

Configuration

Create the environment-specific directories:

mkdir -p flux/infra/aws/goldilocks
mkdir -p flux/infra/baremetal/goldilocks

Goldilocks uses environment-specific values but no patches. Save the values file for your deployment size:

values.yaml
# Goldilocks — HA configuration
# VPA recommender mode, 2-replica dashboard

vpa:
  enabled: true
  updater:
    enabled: false

dashboard:
  enabled: true
  replicaCount: 2

  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi

  service:
    type: ClusterIP
    port: 80

  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

controller:
  enabled: true
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi
values.yaml
# Goldilocks — Non-HA configuration
# Single dashboard replica

vpa:
  enabled: true
  updater:
    enabled: false

dashboard:
  enabled: true
  replicaCount: 1

  resources:
    requests:
      cpu: 25m
      memory: 32Mi
    limits:
      cpu: 50m
      memory: 64Mi

  service:
    type: ClusterIP
    port: 80

controller:
  enabled: true
  resources:
    requests:
      cpu: 25m
      memory: 32Mi
    limits:
      cpu: 50m
      memory: 64Mi

Enabling Goldilocks per namespace

Goldilocks only monitors namespaces with the label goldilocks.fairwinds.com/enabled=true. To enable recommendations for a namespace:

kubectl label namespace <namespace> goldilocks.fairwinds.com/enabled=true
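
If namespaces are managed declaratively in Git, the same label can be set in the Namespace manifest instead (the namespace name here is a placeholder):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: rciis-prod             # placeholder: any namespace to get recommendations for
  labels:
    goldilocks.fairwinds.com/enabled: "true"
```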

Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/goldilocks.yaml \
        flux/infra/aws/goldilocks/
git commit -m "feat(goldilocks): add resource recommendation dashboard for AWS"
git push
git add flux/infra/base/goldilocks.yaml \
        flux/infra/baremetal/goldilocks/
git commit -m "feat(goldilocks): add resource recommendation dashboard"
git push

Flux will detect the new commit and begin deploying Goldilocks. To trigger an immediate sync:

flux reconcile kustomization infra-goldilocks -n flux-system --with-source

Verify

After Goldilocks is deployed, confirm it is working:

# Check Goldilocks pods
kubectl get pods -n goldilocks

# Access the dashboard (port-forward)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080

# Verify VPA recommender is working
kubectl describe vpa -n goldilocks

Flux Operations

This component is managed by Flux as HelmRelease goldilocks and Kustomization infra-goldilocks.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease goldilocks -n flux-system
flux get kustomization infra-goldilocks -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-goldilocks -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease goldilocks -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=goldilocks -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease goldilocks -n flux-system
flux resume helmrelease goldilocks -n flux-system
flux reconcile kustomization infra-goldilocks -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


Next Steps

With observability infrastructure in place, you're ready to deploy and configure your data services. Proceed to 5.3.2 Data Services to set up PostgreSQL, Redis, S3 storage, and their monitoring and backup strategies.