# 5.3.1 Observability
The observability stack provides metrics collection and alerting (Prometheus), log aggregation (Loki), log shipping (Fluent Bit), dashboards (Grafana), endpoint probing (Blackbox Exporter), network device monitoring (SNMP Exporter), and resource right-sizing recommendations (Goldilocks).
How to use this page
Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.
All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.
- **Using the existing `rciis-devops` repository:** All files already exist. Skip the `mkdir` and `git add`/`git commit` commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
- **Building a new repository from scratch:** Follow the `mkdir`, file creation, and `git` commands in order.
- **No Git access:** Expand the "Alternative: Helm CLI" block under each Install section.
## Prometheus (kube-prometheus-stack)
The kube-prometheus-stack Helm chart deploys the Prometheus Operator, Prometheus server, Alertmanager, Grafana, node-exporter, and kube-state-metrics as a single release. It provides the complete metrics collection, alerting, and visualization pipeline. On AWS, Grafana integrates with Keycloak for OAuth authentication. On Bare Metal, PostgreSQL backs Grafana's user database.
### Install
The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches (shown in the Configuration section).
Create the base directory and file:
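A minimal sketch of the directory setup, run from the repository root (the path matches the file locations named below):

```shell
# Create the base directory that holds the shared HelmRelease files
mkdir -p flux/infra/base
```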
| Field | Value | Explanation |
|---|---|---|
| `chart` | `kube-prometheus-stack` | The Helm chart name from the Prometheus Community registry |
| `version` | `80.14.4` | Pinned chart version — update this to upgrade Prometheus and components |
| `sourceRef.name` | `prometheus-community` | References a HelmRepository CR pointing to https://prometheus-community.github.io/helm-charts |
| `targetNamespace` | `monitoring` | Prometheus, Grafana, and related components run in the monitoring namespace |
| `crds: CreateReplace` | — | Automatically installs and updates Prometheus CRDs (PrometheusRule, ServiceMonitor, etc.) |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/prometheus.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: prometheus
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: kube-prometheus-stack
version: "80.14.4"
sourceRef:
kind: HelmRepository
name: prometheus-community
namespace: flux-system
releaseName: prometheus
install:
crds: CreateReplace
remediation:
retries: 3
createNamespace: true
upgrade:
crds: CreateReplace
remediation:
retries: 3
values:
cleanPrometheusOperatorObjectNames: true
fullnameOverride: "prometheus"
crds:
enabled: true
upgradeJob:
enabled: true
forceConflicts: true
prometheusOperator:
createCustomResource: true
enabled: true
tls:
enabled: false
admissionWebhooks:
certManager:
enabled: true
enabled: true
serviceMonitor:
selfMonitor: true
prometheus:
thanosService:
enabled: false
prometheusSpec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/name: prometheus
replicas: 2
retention: 30d
enableRemoteWriteReceiver: true
enableFeatures:
- remote-write-receiver
replicaExternalLabelName: "__replica__"
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
alertmanager:
alertmanagerSpec:
replicas: 2
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/name: alertmanager
kubelet:
enabled: true
serviceMonitor:
cAdvisor: true
grafana:
fullnameOverride: "grafana"
enabled: true
replicas: 1
admin:
existingSecret: "grafana-admin-creds"
userKey: admin-user
passwordKey: admin-password
initChownData:
enabled: false
sidecar:
dashboards:
enabled: true
label: grafana_dashboard
folderAnnotation: grafana_folder
provider:
allowUiUpdates: true
foldersFromFilesStructure: true
datasources:
enabled: true
defaultDatasourceEnabled: true
persistence:
enabled: false
additionalDataSources:
- name: Loki
type: loki
isDefault: false
access: proxy
url: http://loki-gateway.monitoring.svc.cluster.local:80
editable: true
kube-state-metrics:
fullnameOverride: "kube-state-metrics"
prometheus-node-exporter:
fullnameOverride: "node-exporter"
thanosRuler:
enabled: false
Alternative: Helm CLI
If you do not have Git access, install Prometheus directly:
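A sketch of the equivalent Helm CLI install, using the chart name, version, and repository URL from the table above. Export the `values` section of the HelmRelease into a local file first; `values.yaml` here is an illustrative filename:

```shell
# Register the chart repository and install with pinned version
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --version 80.14.4 \
  --namespace monitoring \
  --create-namespace \
  -f values.yaml
```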
### Configuration
The environment patch overrides the base HelmRelease with cluster-specific settings, including storage class, resource scaling, and (on AWS) Keycloak OAuth for Grafana.
Create the environment overlay directory:
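A sketch of the overlay layout; the `aws` and `baremetal` directory names are assumptions — align them with your repository's actual overlay paths:

```shell
# Per-environment overlay directories (names are illustrative)
mkdir -p flux/infra/aws flux/infra/baremetal
```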
#### Environment Patch
The patch file sets storage class, replica counts, external labels, and Grafana OAuth configuration. These differ fundamentally between AWS and Bare Metal.
Save the following as the patch file for your environment:
On AWS, Prometheus uses gp3 EBS volumes for persistent storage and Grafana authenticates via Keycloak OAuth. A single Prometheus and Alertmanager replica runs for cost optimization.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: prometheus
spec:
values:
prometheus:
prometheusSpec:
replicas: 1
topologySpreadConstraints: []
resources:
requests:
cpu: 50m
memory: 256Mi
limits:
cpu: 1000m
memory: 2Gi
externalLabels:
cluster: "rciis-aws"
env: "aws"
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
alertmanager:
alertmanagerSpec:
replicas: 1
storage:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 5Gi
kube-state-metrics:
prometheus:
monitor:
enabled: true
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
kubelet:
serviceMonitor:
cAdvisorRelabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
kubeApiServer:
serviceMonitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
prometheus-node-exporter:
prometheus:
monitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-aws
grafana:
defaultDashboardsTimezone: Africa/Johannesburg
grafana.ini:
server:
root_url: https://grafana.rciis.africa
auth.generic_oauth:
enabled: true
name: Keycloak
allow_sign_up: true
client_id: grafana
scopes: openid email profile roles
auth_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/auth
token_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/token
api_url: https://auth.rciis.africa/realms/rciis/protocol/openid-connect/userinfo
role_attribute_path: "contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
env:
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: grafana-postgres-rw.monitoring.svc.cluster.local:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_SSL_MODE: disable
envValueFrom:
GF_DATABASE_PASSWORD:
secretKeyRef:
name: grafana-pg-owner
key: password
GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET:
secretKeyRef:
name: grafana-keycloak-client-secret
key: clientSecret
extraInitContainers:
- name: wait-for-postgres
image: busybox:1.36
command:
- sh
- -c
- |
echo "Waiting for PostgreSQL to be ready..."
until nc -z grafana-postgres-rw.monitoring.svc.cluster.local 5432; do
echo "PostgreSQL not ready, waiting..."
sleep 5
done
echo "PostgreSQL is ready!"
ingress:
enabled: false
defaultRules:
additionalRuleLabels:
cluster: rciis-aws
env: aws
| Setting | Value | Why |
|---|---|---|
| `storageClassName` | `gp3` | AWS EBS gp3 volumes provide good price/performance for time-series data |
| `replicas` | `1` | A single replica saves costs on AWS; for HA, raise the replica count and restore the topology spread constraints |
| `grafana.ini.auth.generic_oauth` | Keycloak config | Grafana users authenticate via Keycloak instead of local admin credentials |
| `GF_DATABASE_*` | PostgreSQL | Grafana data (dashboards, users) is stored in PostgreSQL instead of SQLite |
| `root_url` | `https://grafana.rciis.africa` | Sets the Grafana external URL for OAuth redirects |
On Bare Metal, Prometheus uses Ceph RBD volumes for persistent storage. PostgreSQL backs Grafana's database for data persistence. No OAuth is configured in the patch — add it separately if needed.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: prometheus
spec:
values:
prometheus:
prometheusSpec:
externalLabels:
cluster: "rciis-kenya"
env: "baremetal"
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: ceph-rbd-single
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
alertmanager:
alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
storageClassName: ceph-rbd-single
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 5Gi
kube-state-metrics:
prometheus:
monitor:
enabled: true
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
kubelet:
serviceMonitor:
cAdvisorRelabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
kubeApiServer:
serviceMonitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
prometheus-node-exporter:
prometheus:
monitor:
relabelings:
- action: replace
targetLabel: cluster
replacement: rciis-kenya
grafana:
defaultDashboardsTimezone: Africa/Johannesburg
env:
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: grafana-postgres-rw.monitoring.svc.cluster.local:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_SSL_MODE: disable
envValueFrom:
GF_DATABASE_PASSWORD:
secretKeyRef:
name: grafana-pg-owner
key: password
extraInitContainers:
- name: wait-for-postgres
image: busybox:1.36
command:
- sh
- -c
- |
echo "Waiting for PostgreSQL to be ready..."
until nc -z grafana-postgres-rw.monitoring.svc.cluster.local 5432; do
echo "PostgreSQL not ready, waiting..."
sleep 5
done
echo "PostgreSQL is ready!"
ingress:
enabled: false
defaultRules:
additionalRuleLabels:
cluster: rciis-kenya
env: baremetal
| Setting | Value | Why |
|---|---|---|
| `storageClassName` | `ceph-rbd-single` | Ceph RBD provides persistent storage on Bare Metal with replication |
| `externalLabels` | `env: baremetal` | Tags all metrics as originating from Bare Metal for multi-cluster queries |
| `GF_DATABASE_*` | PostgreSQL | Grafana uses PostgreSQL for persistent user and dashboard storage |
Key patch differences:
| Aspect | AWS | Bare Metal |
|---|---|---|
| Storage | gp3 (EBS) | ceph-rbd-single |
| Replicas | 1 (cost-optimized) | 2 (HA-ready) |
| Grafana Auth | Keycloak OAuth | Local admin or separate OAuth |
| Database | PostgreSQL (external) | PostgreSQL (Ceph-backed) |
| Timezone | Africa/Johannesburg | Africa/Johannesburg |
### Commit and Deploy
Once all files are in place, commit and push to trigger Flux deployment:
Flux will detect the new commit and begin deploying Prometheus. To trigger an immediate sync instead of waiting for the next poll interval:
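A sketch of the commit-and-sync commands; the commit message is illustrative, and `flux-system` is the conventional bootstrap GitRepository name (adjust if yours differs):

```shell
git add flux/infra/
git commit -m "Add kube-prometheus-stack HelmRelease"  # message is illustrative
git push

# Optional: trigger an immediate Flux sync instead of waiting for the poll interval
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-prometheus -n flux-system
```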
### Verify
After Prometheus is deployed, confirm it is working:
# Check Prometheus pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Check Alertmanager
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Check Grafana
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
# Port-forward to Prometheus (prometheus-operated is the operator-managed service and exists regardless of fullnameOverride)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Open http://localhost:9090/targets to see scrape targets
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/grafana 3000:80
# Open http://localhost:3000
### Flux Operations
This component is managed by Flux as HelmRelease prometheus and Kustomization infra-prometheus.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
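The four operations above can be sketched as follows (filtering helm-controller logs with `grep` is one illustrative way to isolate this release):

```shell
# 1. Check Ready state of the HelmRelease and Kustomization
flux get helmreleases prometheus -n flux-system
flux get kustomizations infra-prometheus -n flux-system

# 2. Immediate sync: pull the latest Git revision and re-apply manifests
flux reconcile kustomization infra-prometheus -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease prometheus -n flux-system

# 4. Recent helm-controller logs for diagnosing failed syncs or upgrades
kubectl logs -n flux-system deploy/helm-controller --tail=100 | grep -i prometheus
```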
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease prometheus -n flux-system
flux resume helmrelease prometheus -n flux-system
flux reconcile kustomization infra-prometheus -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
## Loki
Loki is the log aggregation system that receives logs from Fluent Bit, stores them in Ceph S3, and makes them queryable through Grafana. It runs in SimpleScalable deployment mode with separate read, write, and backend components for horizontal scalability.
### Install
The base HelmRelease tells Flux which chart to install. This file is shared across all environments.
Create the base directory and file:
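If the base directory does not exist yet (it is shared with the other components on this page):

```shell
mkdir -p flux/infra/base
```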
| Field | Value | Explanation |
|---|---|---|
| `chart` | `loki` | The Helm chart name from the Grafana registry |
| `version` | `6.49.0` | Pinned chart version for Loki |
| `sourceRef.name` | `grafana` | References a HelmRepository CR pointing to https://grafana.github.io/helm-charts |
| `targetNamespace` | `monitoring` | Loki runs in the monitoring namespace alongside Prometheus and Grafana |
| `crds: CreateReplace` | — | Automatically installs Loki CRDs |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/loki.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: loki
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: loki
version: "6.49.0"
sourceRef:
kind: HelmRepository
name: grafana
namespace: flux-system
releaseName: loki
install:
crds: CreateReplace
remediation:
retries: 3
createNamespace: true
upgrade:
crds: CreateReplace
remediation:
retries: 3
Alternative: Helm CLI
If you do not have Git access, install Loki directly:
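A sketch of the equivalent Helm CLI install, using the chart name, version, and repository URL from the table above. Save the environment values file from the Configuration section locally first; `loki-values.yaml` is an illustrative filename:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
  --version 6.49.0 \
  --namespace monitoring \
  --create-namespace \
  -f loki-values.yaml
```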
### Configuration
Create the environment-specific directory:
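A sketch, assuming the same per-environment overlay layout used for the other components (directory names are illustrative — match your repository):

```shell
mkdir -p flux/infra/aws flux/infra/baremetal
```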
Loki uses environment-specific values but no patches. Save the values file for your environment and deployment size:
# Loki — HA configuration
# SimpleScalable mode, replicated writes, S3 via Ceph RGW
deploymentMode: SimpleScalable
loki:
auth_enabled: false
storage:
type: s3
bucketNames:
chunks: loki-chunks
ruler: loki-ruler
admin: loki-admin
s3:
endpoint: rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
region: rciis-kenya
accessKeyId: ${AWS_ACCESS_KEY}
secretAccessKey: ${AWS_SECRET_KEY}
s3ForcePathStyle: true
insecure: true
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
limits_config:
retention_period: 30d
max_query_length: 721h
max_label_names_per_series: 30
max_streams_per_user: 50000
max_global_streams_per_user: 100000
ingestion_rate_mb: 16
ingestion_burst_size_mb: 32
per_stream_rate_limit: 3MB
per_stream_rate_limit_burst: 15MB
reject_old_samples: true
reject_old_samples_max_age: 168h
allow_structured_metadata: true
volume_enabled: true
compactor:
retention_enabled: true
delete_request_store: s3
ruler:
enable_api: true
storage_config:
type: s3
s3_storage_config:
region: rciis-kenya
bucketnames: loki-ruler
pattern_ingester:
enabled: true
commonConfig:
replication_factor: 2
backend:
replicas: 2
extraArgs:
- -config.expand-env=true
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 300m
memory: 512Mi
read:
replicas: 2
extraArgs:
- -config.expand-env=true
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/component: read
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 300m
memory: 512Mi
write:
replicas: 2
extraArgs:
- -config.expand-env=true
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
nodeTaintsPolicy: Honor
labelSelector:
matchLabels:
app.kubernetes.io/component: write
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
gateway:
replicas: 2
enabled: true
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
minio:
enabled: false
singleBinary:
replicas: 0
ingester:
replicas: 0
querier:
replicas: 0
queryFrontend:
replicas: 0
queryScheduler:
replicas: 0
distributor:
replicas: 0
compactor:
replicas: 0
indexGateway:
replicas: 0
bloomCompactor:
replicas: 0
bloomGateway:
replicas: 0
global:
extraEnvFrom:
- secretRef:
name: loki-s3
dnsService: "kube-dns"
chunksCache:
enabled: false
resultsCache:
enabled: false
lokiCanary:
enabled: false
test:
enabled: false
monitoring:
serviceMonitor:
enabled: true
labels:
release: prometheus
selfMonitoring:
enabled: false
grafanaAgent:
installOperator: false
# Loki — Non-HA configuration
# SimpleScalable mode, single replicas, no replication
deploymentMode: SimpleScalable
loki:
auth_enabled: false
storage:
type: s3
bucketNames:
chunks: loki-chunks
ruler: loki-ruler
admin: loki-admin
s3:
endpoint: rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
region: rciis-kenya
accessKeyId: ${AWS_ACCESS_KEY}
secretAccessKey: ${AWS_SECRET_KEY}
s3ForcePathStyle: true
insecure: true
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
limits_config:
retention_period: 14d
ingestion_rate_mb: 8
ingestion_burst_size_mb: 16
reject_old_samples: true
reject_old_samples_max_age: 168h
allow_structured_metadata: true
compactor:
retention_enabled: true
delete_request_store: s3
pattern_ingester:
enabled: false
commonConfig:
replication_factor: 1
backend:
replicas: 1
extraArgs:
- -config.expand-env=true
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 5Gi
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
read:
replicas: 1
extraArgs:
- -config.expand-env=true
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
write:
replicas: 1
extraArgs:
- -config.expand-env=true
persistence:
enabled: true
storageClass: ceph-rbd-single
size: 5Gi
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 300m
memory: 512Mi
gateway:
replicas: 1
enabled: true
resources:
requests:
cpu: 25m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
minio:
enabled: false
singleBinary:
replicas: 0
ingester:
replicas: 0
querier:
replicas: 0
queryFrontend:
replicas: 0
queryScheduler:
replicas: 0
distributor:
replicas: 0
compactor:
replicas: 0
indexGateway:
replicas: 0
bloomCompactor:
replicas: 0
bloomGateway:
replicas: 0
global:
extraEnvFrom:
- secretRef:
name: loki-s3
dnsService: "kube-dns"
chunksCache:
enabled: false
resultsCache:
enabled: false
lokiCanary:
enabled: false
test:
enabled: false
monitoring:
serviceMonitor:
enabled: false
selfMonitoring:
enabled: false
grafanaAgent:
installOperator: false
### Commit and Deploy
Once all files are in place, commit and push to trigger Flux deployment:
Flux will detect the new commit and begin deploying Loki. To trigger an immediate sync:
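A sketch of the commit-and-sync commands (the commit message and `flux-system` source name are illustrative):

```shell
git add flux/infra/
git commit -m "Add Loki HelmRelease and values"  # message is illustrative
git push

# Optional: trigger an immediate Flux sync
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-loki -n flux-system
```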
### Verify
After Loki is deployed, confirm it is working:
# Check Loki components
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
# Test log ingestion via Grafana
# Navigate to Grafana → Explore → Select Loki datasource
# Query: {namespace="monitoring"}
### Flux Operations
This component is managed by Flux as HelmRelease loki and Kustomization infra-loki.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
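The four operations above can be sketched as:

```shell
# 1. Check Ready state
flux get helmreleases loki -n flux-system
flux get kustomizations infra-loki -n flux-system

# 2. Immediate sync from Git
flux reconcile kustomization infra-loki -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease loki -n flux-system

# 4. Recent helm-controller logs for diagnosis
kubectl logs -n flux-system deploy/helm-controller --tail=100 | grep -i loki
```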
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease loki -n flux-system
flux resume helmrelease loki -n flux-system
flux reconcile kustomization infra-loki -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
## Fluent Bit
Fluent Bit is a lightweight log processor and forwarder deployed as a DaemonSet on every node. It collects container logs and ships them to Loki for aggregation and querying through Grafana.
### Install
Create the base directory and file:
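If the base directory does not exist yet:

```shell
mkdir -p flux/infra/base
```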
| Field | Value | Explanation |
|---|---|---|
| `chart` | `fluent-bit` | The Helm chart name from the Fluent registry |
| `version` | `0.54.1` | Pinned chart version for Fluent Bit |
| `sourceRef.name` | `fluent` | References a HelmRepository CR pointing to https://fluent.github.io/helm-charts |
| `targetNamespace` | `monitoring` | Fluent Bit runs in the monitoring namespace as a DaemonSet |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/fluent-bit.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: fluent-bit
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: fluent-bit
version: "0.54.1"
sourceRef:
kind: HelmRepository
name: fluent
namespace: flux-system
releaseName: fluent-bit
install:
remediation:
retries: 3
createNamespace: true
upgrade:
remediation:
retries: 3
Alternative: Helm CLI
If you do not have Git access, install Fluent Bit directly:
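A sketch of the equivalent Helm CLI install, using the chart name, version, and repository URL from the table above; `fluent-bit-values.yaml` is an illustrative filename for the values file from the Configuration section:

```shell
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit \
  --version 0.54.1 \
  --namespace monitoring \
  --create-namespace \
  -f fluent-bit-values.yaml
```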
### Configuration
Create the environment-specific directory:
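A sketch, assuming the same per-environment overlay layout (directory names are illustrative):

```shell
mkdir -p flux/infra/aws flux/infra/baremetal
```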
Fluent Bit uses environment-specific values but no patches. Save the values file for your deployment size:
# Fluent Bit — HA configuration
# DaemonSet on every node, full pipeline with ServiceMonitor
kind: DaemonSet
replicaCount: 1
image:
repository: cr.fluentbit.io/fluent/fluent-bit
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 50m
memory: 64Mi
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
podSecurityContext:
runAsNonRoot: false
containerSecurityContext:
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
add:
- DAC_READ_SEARCH
seccompProfile:
type: RuntimeDefault
serviceMonitor:
enabled: true
namespace: monitoring
interval: 30s
scrapeTimeout: 10s
additionalLabels:
release: prometheus
config:
service: |
[SERVICE]
Daemon Off
Flush 1
Log_Level info
Parsers_File /fluent-bit/etc/parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
Mem_Buf_Limit 50MB
Skip_Long_Lines On
[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Read_From_Tail On
filters: |
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name modify
Match kube.*
Add cluster rciis-kenya
Add environment baremetal
outputs: |
[OUTPUT]
Name loki
Match kube.*
Host loki-gateway.monitoring.svc.cluster.local
Port 80
Labels job=fluent-bit, cluster=rciis-kenya
auto_kubernetes_labels off
label_keys $kubernetes['namespace_name'],$kubernetes['container_name'],$kubernetes['labels']['app'],$kubernetes['labels']['ceph_daemon_type'],$kubernetes['labels']['app.kubernetes.io/name']
remove_keys kubernetes,stream
line_format json
Retry_Limit 5
[OUTPUT]
Name loki
Match host.*
Host loki-gateway.monitoring.svc.cluster.local
Port 80
Labels job=fluent-bit, cluster=rciis-kenya, component=kubelet
line_format json
Retry_Limit 5
parsers: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: etcmachineid
mountPath: /etc/machine-id
readOnly: true
daemonSetVolumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: etcmachineid
hostPath:
path: /etc/machine-id
type: File
# Fluent Bit — Non-HA configuration
# DaemonSet, reduced buffer, no ServiceMonitor
kind: DaemonSet
replicaCount: 1
image:
repository: cr.fluentbit.io/fluent/fluent-bit
resources:
limits:
cpu: 100m
memory: 128Mi
requests:
cpu: 25m
memory: 32Mi
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
serviceMonitor:
enabled: false
config:
service: |
[SERVICE]
Daemon Off
Flush 5
Log_Level warn
Parsers_File /fluent-bit/etc/parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
Mem_Buf_Limit 10MB
Skip_Long_Lines On
filters: |
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
outputs: |
[OUTPUT]
Name loki
Match kube.*
Host loki-gateway.monitoring.svc.cluster.local
Port 80
Labels job=fluent-bit, cluster=rciis-kenya
auto_kubernetes_labels off
label_keys $kubernetes['namespace_name'],$kubernetes['container_name']
remove_keys kubernetes,stream
line_format json
parsers: |
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
volumeMounts:
- name: varlog
mountPath: /var/log
daemonSetVolumes:
- name: varlog
hostPath:
path: /var/log
### Commit and Deploy
Once all files are in place, commit and push to trigger Flux deployment:
Flux will detect the new commit and begin deploying Fluent Bit. To trigger an immediate sync:
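A sketch of the commit-and-sync commands (the commit message and `flux-system` source name are illustrative):

```shell
git add flux/infra/
git commit -m "Add Fluent Bit HelmRelease and values"  # message is illustrative
git push

# Optional: trigger an immediate Flux sync
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-fluent-bit -n flux-system
```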
### Verify
After Fluent Bit is deployed, confirm it is working:
# Check Fluent Bit pods (one per node)
kubectl get pods -n monitoring -l app.kubernetes.io/name=fluent-bit
# Check logs are flowing
kubectl logs -n monitoring -l app.kubernetes.io/name=fluent-bit --tail=20
# Verify in Grafana → Explore → Loki
# Query: {job="fluent-bit"}
### Flux Operations
This component is managed by Flux as HelmRelease fluent-bit and Kustomization infra-fluent-bit.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
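The four operations above can be sketched as:

```shell
# 1. Check Ready state
flux get helmreleases fluent-bit -n flux-system
flux get kustomizations infra-fluent-bit -n flux-system

# 2. Immediate sync from Git
flux reconcile kustomization infra-fluent-bit -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease fluent-bit -n flux-system

# 4. Recent helm-controller logs for diagnosis
kubectl logs -n flux-system deploy/helm-controller --tail=100 | grep -i fluent-bit
```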
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease fluent-bit -n flux-system
flux resume helmrelease fluent-bit -n flux-system
flux reconcile kustomization infra-fluent-bit -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
## Blackbox Exporter
The Prometheus Blackbox Exporter probes endpoints over HTTP, TCP, and ICMP to monitor external service availability and response times. On AWS, it monitors application health endpoints (Keycloak, Kafka UI, Grafana, Nucleus Web) and Kafka broker connectivity.
### Install
Create the base directory and file:
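If the base directory does not exist yet:

```shell
mkdir -p flux/infra/base
```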
| Field | Value | Explanation |
|---|---|---|
| `chart` | `prometheus-blackbox-exporter` | The Helm chart name from the Prometheus Community registry |
| `version` | `11.8.0` | Pinned chart version for Blackbox Exporter |
| `sourceRef.name` | `prometheus-community` | References the Prometheus Community Helm repository |
| `targetNamespace` | `monitoring` | Blackbox Exporter runs in the monitoring namespace |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/blackbox-exporter.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: blackbox-exporter
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: prometheus-blackbox-exporter
version: "11.8.0"
sourceRef:
kind: HelmRepository
name: prometheus-community
namespace: flux-system
releaseName: blackbox-exporter
install:
remediation:
retries: 3
createNamespace: true
upgrade:
remediation:
retries: 3
values:
fullnameOverride: blackbox-exporter
replicas: 1
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
serviceMonitor:
enabled: true
defaults:
labels:
release: prometheus
interval: 30s
scrapeTimeout: 10s
config:
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
follow_redirects: true
preferred_ip_protocol: "ip4"
http_2xx_3xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200, 301, 302]
follow_redirects: false
preferred_ip_protocol: "ip4"
tcp_connect:
prober: tcp
timeout: 5s
Alternative: Helm CLI
If you do not have Git access, install Blackbox Exporter directly:
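A rough Helm CLI equivalent, assuming the `prometheus-community` repository is not yet configured locally, pinned to the same chart version as the HelmRelease:

```shell
# Add the Prometheus Community chart repository (skip if already present)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the same chart and version the HelmRelease pins
helm upgrade --install blackbox-exporter prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring --create-namespace \
  --version 11.8.0
```

Pass a values file with `-f` to reproduce the settings shown above.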
Configuration¶
Create the environment-specific directory:
Environment Patch¶
The AWS patch defines probe targets for application health checks and service connectivity. Bare Metal uses the base configuration without additional targets.
Save the following as the patch file for your environment:
On AWS, Blackbox Exporter probes application health endpoints and Kafka broker connectivity to monitor service availability.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: blackbox-exporter
spec:
values:
serviceMonitor:
enabled: true
defaults:
labels:
release: prometheus
interval: 30s
scrapeTimeout: 10s
targets:
- name: keycloak
url: http://kc-service.keycloak.svc.cluster.local:9000/health/ready
module: http_2xx
additionalMetricsRelabels:
service: keycloak
- name: kafka-ui
url: http://kafka-ui.rciis-prod.svc.cluster.local:80/actuator/health
module: http_2xx
additionalMetricsRelabels:
service: kafka-ui
- name: grafana
url: http://grafana.monitoring.svc.cluster.local:80/api/health
module: http_2xx
additionalMetricsRelabels:
service: grafana
- name: nucleus-web
url: http://web.rciis-prod.svc.cluster.local:8080/
module: http_2xx_3xx
additionalMetricsRelabels:
service: nucleus-web
- name: kafka-bootstrap
url: kafka-rciis-prod-kafka-bootstrap.rciis-prod.svc.cluster.local:9092
module: tcp_connect
additionalMetricsRelabels:
service: kafka
| Setting | Value | Why |
|---|---|---|
| `targets[].url` | Service endpoints | Blackbox probes each endpoint to monitor availability |
| `module: http_2xx` | Expects 200-299 | Standard HTTP success status codes |
| `module: http_2xx_3xx` | Allows redirects | Used for web UIs that may redirect |
| `module: tcp_connect` | TCP connectivity | Checks that the Kafka broker port is accepting connections |
| `additionalMetricsRelabels` | Service name | Tags metrics with the service being probed |
On Bare Metal, use the base configuration without additional targets. Customize targets based on your application deployments.
No patch file is needed — apply only the base HelmRelease.
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
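A minimal commit sequence, assuming only the base file changed (add your environment patch file as well if you created one; the commit message is illustrative):

```shell
git add flux/infra/base/blackbox-exporter.yaml
git commit -m "Add Blackbox Exporter HelmRelease"
git push
```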
Flux will detect the new commit and begin deploying Blackbox Exporter. To trigger an immediate sync:
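For example, using `infra-blackbox-exporter`, the Kustomization name for this component:

```shell
# Pull the latest Git revision and apply it immediately
flux reconcile kustomization infra-blackbox-exporter -n flux-system --with-source
```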
Verify¶
After Blackbox Exporter is deployed, confirm it is working:
# Check exporter pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-blackbox-exporter
# Test a probe manually
kubectl port-forward -n monitoring svc/blackbox-exporter 9115:9115
curl "http://localhost:9115/probe?target=https://grafana.rciis.africa&module=http_2xx"
# View probe metrics in Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090 and search for 'probe_' metrics
Flux Operations¶
This component is managed by Flux as HelmRelease blackbox-exporter and Kustomization
infra-blackbox-exporter.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
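The four operations above can be sketched as follows, using the release and Kustomization names for this component:

```shell
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmrelease blackbox-exporter -n flux-system
flux get kustomization infra-blackbox-exporter -n flux-system

# 2. Trigger an immediate sync from Git
flux reconcile kustomization infra-blackbox-exporter -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease blackbox-exporter -n flux-system

# 4. View recent controller logs for this release
flux logs --kind=HelmRelease --name=blackbox-exporter -n flux-system
```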
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease blackbox-exporter -n flux-system
flux resume helmrelease blackbox-exporter -n flux-system
flux reconcile kustomization infra-blackbox-exporter -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
SNMP Exporter¶
The Prometheus SNMP Exporter collects metrics from network devices (switches, routers, firewalls) via SNMP polling. It uses configurable module definitions for different device profiles.
Install¶
Create the base directory and file:
| Field | Value | Explanation |
|---|---|---|
| `chart` | `prometheus-snmp-exporter` | The Helm chart name from the Prometheus Community registry |
| `version` | `9.11.0` | Pinned chart version for SNMP Exporter |
| `sourceRef.name` | `prometheus-community` | References the Prometheus Community Helm repository |
| `targetNamespace` | `monitoring` | SNMP Exporter runs in the `monitoring` namespace |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/snmp-exporter.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: snmp-exporter
namespace: flux-system
spec:
targetNamespace: monitoring
interval: 30m
chart:
spec:
chart: prometheus-snmp-exporter
version: "9.11.0"
sourceRef:
kind: HelmRepository
name: prometheus-community
namespace: flux-system
releaseName: snmp-exporter
install:
remediation:
retries: 3
createNamespace: true
upgrade:
remediation:
retries: 3
values:
fullnameOverride: snmp-exporter
replicas: 1
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
podSecurityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
serviceMonitor:
enabled: true
namespace: monitoring
labels:
release: prometheus
config: {}
Alternative: Helm CLI
If you do not have Git access, install SNMP Exporter directly:
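A rough Helm CLI equivalent, pinned to the same chart version as the HelmRelease (skip the `helm repo add` step if the repository is already configured):

```shell
# Add the Prometheus Community chart repository (skip if already present)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the same chart and version the HelmRelease pins
helm upgrade --install snmp-exporter prometheus-community/prometheus-snmp-exporter \
  --namespace monitoring --create-namespace \
  --version 9.11.0
```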
Configuration¶
Create the environment-specific directory:
SNMP Exporter uses environment-specific values but no patches. The module configuration (OID mappings, community strings, device profiles) is deployed as a separate ConfigMap and should be customised for your network equipment.
# SNMP Exporter — HA configuration
# 2 replicas for continuous network device monitoring
replicas: 2
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
podSecurityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
serviceMonitor:
enabled: true
namespace: monitoring
labels:
release: prometheus
config: {}
SNMP module definitions
The SNMP exporter's module configuration (OID mappings, community strings, device profiles) is deployed as a separate ConfigMap from the extra/ directory. This ConfigMap is device-specific and should be customised for the network equipment in each deployment.
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
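A minimal commit sequence, assuming only the base file changed (add your environment values file as well; the commit message is illustrative):

```shell
git add flux/infra/base/snmp-exporter.yaml
git commit -m "Add SNMP Exporter HelmRelease"
git push
```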
Flux will detect the new commit and begin deploying SNMP Exporter. To trigger an immediate sync:
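For example, using `infra-snmp-exporter`, the Kustomization name for this component:

```shell
# Pull the latest Git revision and apply it immediately
flux reconcile kustomization infra-snmp-exporter -n flux-system --with-source
```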
Verify¶
After SNMP Exporter is deployed, confirm it is working:
# Check exporter pod
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-snmp-exporter
# Verify ServiceMonitor was created
kubectl get servicemonitor -n monitoring -l release=prometheus
Flux Operations¶
This component is managed by Flux as HelmRelease snmp-exporter and Kustomization
infra-snmp-exporter.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
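The four operations above can be sketched as follows, using the release and Kustomization names for this component:

```shell
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmrelease snmp-exporter -n flux-system
flux get kustomization infra-snmp-exporter -n flux-system

# 2. Trigger an immediate sync from Git
flux reconcile kustomization infra-snmp-exporter -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease snmp-exporter -n flux-system

# 4. View recent controller logs for this release
flux logs --kind=HelmRelease --name=snmp-exporter -n flux-system
```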
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease snmp-exporter -n flux-system
flux resume helmrelease snmp-exporter -n flux-system
flux reconcile kustomization infra-snmp-exporter -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
Goldilocks¶
Goldilocks uses the Kubernetes Vertical Pod Autoscaler (VPA) in recommendation-only mode to suggest optimal resource requests and limits for workloads. It provides a dashboard showing current vs recommended resource allocations.
Install¶
Create the base directory and file:
| Field | Value | Explanation |
|---|---|---|
| `chart` | `goldilocks` | The Helm chart name from the Fairwinds Stable registry |
| `version` | `10.2.0` | Pinned chart version for Goldilocks |
| `sourceRef.name` | `fairwinds-stable` | References a HelmRepository CR pointing to https://charts.fairwinds.com/stable |
| `targetNamespace` | `goldilocks` | Goldilocks runs in its own dedicated namespace |
| `crds: CreateReplace` | — | Automatically installs Goldilocks CRDs |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/goldilocks.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: goldilocks
namespace: flux-system
spec:
targetNamespace: goldilocks
interval: 30m
chart:
spec:
chart: goldilocks
version: "10.2.0"
sourceRef:
kind: HelmRepository
name: fairwinds-stable
namespace: flux-system
releaseName: goldilocks
install:
crds: CreateReplace
remediation:
retries: 3
createNamespace: true
upgrade:
crds: CreateReplace
remediation:
retries: 3
values:
vpa:
enabled: true
updater:
enabled: false
dashboard:
enabled: true
replicaCount: 1
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
service:
type: ClusterIP
port: 80
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
controller:
enabled: true
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
Alternative: Helm CLI
If you do not have Git access, install Goldilocks directly:
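A rough Helm CLI equivalent, pinned to the same chart version as the HelmRelease (the repository URL matches the `fairwinds-stable` HelmRepository referenced above):

```shell
# Add the Fairwinds Stable chart repository (skip if already present)
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo update

# Install the same chart and version the HelmRelease pins
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace \
  --version 10.2.0
```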
Configuration¶
Create the environment-specific directory:
Goldilocks uses environment-specific values but no patches. Save the values file for your deployment size:
# Goldilocks — HA configuration
# VPA recommender mode, 2-replica dashboard
vpa:
enabled: true
updater:
enabled: false
dashboard:
enabled: true
replicaCount: 2
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
service:
type: ClusterIP
port: 80
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
controller:
enabled: true
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
# Goldilocks — Non-HA configuration
# Single dashboard replica
vpa:
enabled: true
updater:
enabled: false
dashboard:
enabled: true
replicaCount: 1
resources:
requests:
cpu: 25m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
service:
type: ClusterIP
port: 80
controller:
enabled: true
resources:
requests:
cpu: 25m
memory: 32Mi
limits:
cpu: 50m
memory: 64Mi
Enabling Goldilocks per namespace
Goldilocks only monitors namespaces with the label goldilocks.fairwinds.com/enabled=true.
To enable recommendations for a namespace:
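For example, labelling the `rciis-prod` namespace used elsewhere on this page (substitute your own application namespace):

```shell
# Opt a namespace in to Goldilocks recommendations
kubectl label namespace rciis-prod goldilocks.fairwinds.com/enabled=true

# Confirm the label is set
kubectl get namespace rciis-prod --show-labels
```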
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
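A minimal commit sequence, assuming only the base file changed (add your environment values file as well; the commit message is illustrative):

```shell
git add flux/infra/base/goldilocks.yaml
git commit -m "Add Goldilocks HelmRelease"
git push
```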
Flux will detect the new commit and begin deploying Goldilocks. To trigger an immediate sync:
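For example, using `infra-goldilocks`, the Kustomization name for this component:

```shell
# Pull the latest Git revision and apply it immediately
flux reconcile kustomization infra-goldilocks -n flux-system --with-source
```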
Verify¶
After Goldilocks is deployed, confirm it is working:
# Check Goldilocks pods
kubectl get pods -n goldilocks
# Access the dashboard (port-forward)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Open http://localhost:8080
# Verify VPA recommender is working
kubectl describe vpa -n goldilocks
Flux Operations¶
This component is managed by Flux as HelmRelease goldilocks and Kustomization
infra-goldilocks.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
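The four operations above can be sketched as follows, using the release and Kustomization names for this component:

```shell
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmrelease goldilocks -n flux-system
flux get kustomization infra-goldilocks -n flux-system

# 2. Trigger an immediate sync from Git
flux reconcile kustomization infra-goldilocks -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease goldilocks -n flux-system

# 4. View recent controller logs for this release
flux logs --kind=HelmRelease --name=goldilocks -n flux-system
```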
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease goldilocks -n flux-system
flux resume helmrelease goldilocks -n flux-system
flux reconcile kustomization infra-goldilocks -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
Next Steps¶
With observability infrastructure in place, you're ready to deploy and configure your data services. Proceed to 5.3.2 Data Services to set up PostgreSQL, Redis, S3 storage, and their monitoring and backup strategies.