
5.1.4 Storage

The storage layer provides persistent block storage, optional S3-compatible object storage, and CSI volume snapshot support for backup integration.

How to use this page

Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.

All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.

  • Using the existing rciis-devops repository: All files already exist. Skip the mkdir and git add/git commit commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
  • Building a new repository from scratch: Follow the mkdir, file creation, and git commands in order.
  • No Git access: Expand the "Alternative: Helm CLI" block under each Install section.

Storage Architecture

The storage backend differs fundamentally between deployment environments:

| Concern | AWS | Bare Metal |
| --- | --- | --- |
| CSI Driver | AWS EBS CSI Driver | Rook-Ceph (RBD) |
| Block Storage | EBS gp3 volumes | Ceph RBD pools |
| Object Storage | AWS S3 | Ceph RGW (S3-compatible) |
| Default StorageClass | gp3 | ceph-rbd-single (single-replica) or ceph-rbd (3x replicated) |
| Snapshot Support | EBS Snapshots via CSI | Ceph RBD Snapshots via CSI |
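
Because each environment marks exactly one StorageClass as the default, application manifests can stay environment-agnostic by omitting storageClassName entirely. A minimal sketch (the claim name app-data is hypothetical):

```yaml
# Hypothetical PVC that binds to whichever StorageClass is marked
# default: gp3 on AWS, ceph-rbd or ceph-rbd-single on bare metal.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data          # hypothetical name
spec:
  accessModes: [ReadWriteOnce]
  # no storageClassName: the cluster's default StorageClass is used
  resources:
    requests:
      storage: 10Gi
```

The same manifest provisions an EBS gp3 volume on AWS and a Ceph RBD image on bare metal without modification.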



Block Storage

AWS EBS CSI Driver

The AWS EBS CSI Driver provides native Kubernetes integration with AWS Elastic Block Store. It enables dynamic provisioning of EBS volumes. On Talos Linux, the controller requires explicit IAM credentials via a Secret because Talos pods cannot reach the EC2 Instance Metadata Service (IMDS).

Install

Create the base directory and save the HelmRelease:

mkdir -p flux/infra/base
| Field | Value | Explanation |
| --- | --- | --- |
| chart | aws-ebs-csi-driver | Helm chart from the AWS EBS CSI driver project |
| version | 2.56.1 | Pinned chart version |
| sourceRef.name | aws-ebs-csi-driver | HelmRepository CR pointing to https://kubernetes-sigs.github.io/aws-ebs-csi-driver |
| targetNamespace | kube-system | CSI drivers run in kube-system |

Save the following as flux/infra/base/aws-ebs-csi-driver.yaml:

flux/infra/base/aws-ebs-csi-driver.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: aws-ebs-csi-driver
  namespace: flux-system
spec:
  targetNamespace: kube-system
  interval: 30m
  chart:
    spec:
      chart: aws-ebs-csi-driver
      version: "2.56.1"
      sourceRef:
        kind: HelmRepository
        name: aws-ebs-csi-driver
        namespace: flux-system
  releaseName: aws-ebs-csi-driver
  install:
    createNamespace: true
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
  values:
    controller:
      replicaCount: 1
    node:
      tolerateAllTaints: true
Alternative: Helm CLI
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --version 2.56.1 \
  -f values.yaml

Configuration

Create the environment overlay directory:

mkdir -p flux/infra/aws/aws-ebs-csi-driver

Environment Patch

The patch provides IAM credentials and resource limits. Save as flux/infra/aws/aws-ebs-csi-driver/patch.yaml:

flux/infra/aws/aws-ebs-csi-driver/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: aws-ebs-csi-driver
spec:
  values:
    # IAM credentials via secret — Talos pods can't reach IMDS
    awsAccessSecret:
      name: aws-secret
      keyId: key_id
      accessKey: access_key
    controller:
      replicaCount: 1
      resources:
        requests:
          cpu: 10m
          memory: 40Mi
        limits:
          cpu: 100m
          memory: 128Mi
    node:
      tolerateAllTaints: true
| Setting | Value | Why |
| --- | --- | --- |
| awsAccessSecret | aws-secret | Talos cannot use IMDS — IAM credentials must be injected via a Kubernetes Secret |
| node.tolerateAllTaints | true | Ensures the CSI node plugin runs on all nodes, including tainted ones |
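
The patch assumes a Secret named aws-secret already exists in kube-system with the keys key_id and access_key. A sketch of that Secret is below; the credential values are placeholders, and in practice it should be created out of band or via a secrets manager rather than committed to Git:

```yaml
# Sketch of the Secret the patch references. Values are placeholders;
# do not commit real IAM credentials to the repository.
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
  namespace: kube-system
type: Opaque
stringData:
  key_id: REPLACE_WITH_ACCESS_KEY_ID        # IAM access key ID (placeholder)
  access_key: REPLACE_WITH_SECRET_ACCESS_KEY # IAM secret access key (placeholder)
```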

StorageClass

The gp3 StorageClass is the default for all PersistentVolumeClaims on AWS. Save as flux/infra/aws/aws-ebs-csi-driver/storageclass.yaml:

flux/infra/aws/aws-ebs-csi-driver/storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp3
  fsType: ext4
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
| Setting | Value | Why |
| --- | --- | --- |
| type: gp3 | EBS volume type | General purpose SSD — 3,000 IOPS and 125 MB/s baseline included |
| volumeBindingMode | WaitForFirstConsumer | The volume is created in the same AZ as the pod that claims it |
| allowVolumeExpansion | true | PVCs can be resized without recreating the volume |
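
For workloads that need more than the gp3 baseline, the EBS CSI driver also accepts explicit iops and throughput parameters (as strings). A sketch of an additional, non-default class; the name gp3-fast and the values are illustrative:

```yaml
# Hypothetical second StorageClass for IO-heavy workloads.
# iops/throughput are gp3 provisioning parameters understood by
# ebs.csi.aws.com; the values shown are examples, not recommendations.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-fast            # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
```

PVCs opt in by setting storageClassName: gp3-fast; everything else continues to use the gp3 default.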

Commit and Deploy

git add flux/infra/base/aws-ebs-csi-driver.yaml \
        flux/infra/aws/aws-ebs-csi-driver/
git commit -m "feat(storage): add AWS EBS CSI Driver"
git push
flux reconcile kustomization infra-aws-ebs-csi-driver -n flux-system --with-source

Verify

kubectl get pods -n kube-system | grep ebs-csi
kubectl get storageclass gp3
# Test dynamic provisioning
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-ebs-pvc
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc test-ebs-pvc
# Clean up
kubectl delete pvc test-ebs-pvc

Flux Operations

This component is managed by Flux as HelmRelease aws-ebs-csi-driver and Kustomization infra-aws-ebs-csi-driver.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease aws-ebs-csi-driver -n flux-system
flux get kustomization infra-aws-ebs-csi-driver -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-aws-ebs-csi-driver -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease aws-ebs-csi-driver -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=aws-ebs-csi-driver -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease aws-ebs-csi-driver -n flux-system
flux resume helmrelease aws-ebs-csi-driver -n flux-system
flux reconcile kustomization infra-aws-ebs-csi-driver -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.

Rook-Ceph Operator

Rook orchestrates Ceph storage clusters on Kubernetes. The operator is installed first, then the cluster configuration follows. Rook manages MONs, MGRs, OSDs, and the RGW (S3-compatible) gateway.

Install

mkdir -p flux/infra/base
| Field | Value | Explanation |
| --- | --- | --- |
| chart | rook-ceph | Rook operator Helm chart |
| version | v1.17.9 | Pinned chart version |
| sourceRef.name | rook-release | HelmRepository CR pointing to https://charts.rook.io/release |
| targetNamespace | rook-ceph | Rook components run in their own namespace |

Save the following as flux/infra/base/rook-ceph.yaml:

flux/infra/base/rook-ceph.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rook-ceph
  namespace: flux-system
spec:
  targetNamespace: rook-ceph
  interval: 30m
  chart:
    spec:
      chart: rook-ceph
      version: "v1.17.9"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  releaseName: rook-ceph
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
helm repo add rook-release https://charts.rook.io/release
helm repo update
helm upgrade --install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph \
  --create-namespace \
  --version v1.17.9 \
  -f values.yaml

Configuration

mkdir -p flux/infra/baremetal/rook-ceph

Save the following values file for the operator. Choose HA or Non-HA:

flux/infra/baremetal/rook-ceph/values.yaml
# Rook-Ceph Operator — HA configuration

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi

csi:
  enableCSIHostNetwork: true
  enableRbdDriver: true
  enableCephfsDriver: false
  enableCSISnapshotter: true

monitoring:
  enabled: true

enableDiscoveryDaemon: true
logLevel: INFO
flux/infra/baremetal/rook-ceph/values.yaml
# Rook-Ceph Operator — Non-HA configuration

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi

csi:
  enableCSIHostNetwork: true
  enableRbdDriver: true
  enableCephfsDriver: false
  enableCSISnapshotter: true

monitoring:
  enabled: false

enableDiscoveryDaemon: true
logLevel: INFO

Commit and Deploy

git add flux/infra/base/rook-ceph.yaml \
        flux/infra/baremetal/rook-ceph/
git commit -m "feat(storage): add Rook-Ceph operator for bare metal"
git push
flux reconcile kustomization infra-rook-ceph -n flux-system --with-source

Flux Operations

This component is managed by Flux as HelmRelease rook-ceph and Kustomization infra-rook-ceph.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease rook-ceph -n flux-system
flux get kustomization infra-rook-ceph -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-rook-ceph -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease rook-ceph -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=rook-ceph -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease rook-ceph -n flux-system
flux resume helmrelease rook-ceph -n flux-system
flux reconcile kustomization infra-rook-ceph -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.


Rook-Ceph Cluster

The cluster chart creates the actual Ceph cluster — MONs, MGRs, OSDs, the RGW object store, block pools, storage classes, and the VolumeSnapshotClass for Velero.

Deploy after the operator

The Rook-Ceph Cluster HelmRelease depends on the operator being healthy. Flux enforces this via dependsOn in the infrastructure ResourceSet.
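
The repository expresses this ordering through a ResourceSet, but the equivalent dependency with plain Flux Kustomizations looks like the sketch below. The path and sourceRef values are assumptions for illustration, not the repository's actual layout:

```yaml
# Illustrative only: the cluster Kustomization waits for the operator
# Kustomization to become Ready before applying.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-rook-ceph-cluster
  namespace: flux-system
spec:
  dependsOn:
    - name: infra-rook-ceph   # operator must be Ready first
  interval: 30m
  path: ./flux/infra/baremetal/rook-ceph-cluster   # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumed source name
```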

Install

Save the following as flux/infra/base/rook-ceph-cluster.yaml:

flux/infra/base/rook-ceph-cluster.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rook-ceph-cluster
  namespace: flux-system
spec:
  targetNamespace: rook-ceph
  interval: 30m
  chart:
    spec:
      chart: rook-ceph-cluster
      version: "v1.17.9"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  releaseName: rook-ceph-cluster
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
helm upgrade --install rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph \
  --version v1.17.9 \
  -f values.yaml

Configuration

mkdir -p flux/infra/baremetal/rook-ceph-cluster

The cluster values define the Ceph topology, storage devices, block pools, storage classes, and optional S3 object store. Save as flux/infra/baremetal/rook-ceph-cluster/values.yaml:

flux/infra/baremetal/rook-ceph-cluster/values.yaml
# Rook-Ceph Cluster — HA configuration
# 3 MON, 2 MGR, 3 worker nodes, replicated pools, S3 object store

operatorNamespace: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3

  dataDirHostPath: /var/lib/rook

  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
      - name: rook
        enabled: true

  dashboard:
    enabled: true
    ssl: false

  storage:
    useAllNodes: false
    useAllDevices: true
    deviceFilter: "^sd[b-z]$"
    config:
      osdsPerDevice: "1"
    nodes:
      - name: "rciis-kenya-wn-01"
      - name: "rciis-kenya-wn-02"
      - name: "rciis-kenya-wn-03"

  resources:
    mgr:
      requests: { cpu: 250m, memory: 512Mi }
      limits:   { cpu: 1000m, memory: 1Gi }
    mon:
      requests: { cpu: 250m, memory: 512Mi }
      limits:   { cpu: 1000m, memory: 1Gi }
    osd:
      requests: { cpu: 250m, memory: 1Gi }
      limits:   { cpu: 2000m, memory: 2Gi }

# Block Pools and StorageClasses
cephBlockPools:
  # Single-replica — for apps WITH built-in replication (PostgreSQL, Kafka, etcd)
  - name: single-pool
    spec:
      failureDomain: osd
      replicated:
        size: 1
        requireSafeReplicaSize: false
    storageClass:
      enabled: true
      name: ceph-rbd-single
      isDefault: false
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

  # 3x replicated — for apps WITHOUT built-in replication
  - name: default-pool
    spec:
      failureDomain: host
      replicated:
        size: 3
    storageClass:
      enabled: true
      name: ceph-rbd
      isDefault: true
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

# VolumeSnapshotClass for Velero CSI backups
cephBlockPoolsVolumeSnapshotClass:
  enabled: true
  name: ceph-rbd-snapshot
  isDefault: true
  deletionPolicy: Delete
  labels:
    velero.io/csi-volumesnapshot-class: "true"

# S3-compatible Object Store (Ceph RGW)
cephObjectStores:
  - name: ceph-objectstore
    spec:
      allowUsersInNamespaces: [velero, monitoring]
      metadataPool:
        failureDomain: host
        replicated: { size: 3 }
      dataPool:
        failureDomain: host
        replicated: { size: 3 }
      preservePoolsOnDelete: true
      gateway:
        port: 80
        instances: 2
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits:   { cpu: 1000m, memory: 1Gi }
    storageClass:
      enabled: true
      name: ceph-bucket
      reclaimPolicy: Delete
      volumeBindingMode: Immediate

cephFileSystems: []

monitoring:
  enabled: true
  createPrometheusRules: true

toolbox:
  enabled: true

Key settings:

| Setting | Value | Why |
| --- | --- | --- |
| mon.count: 3 | 3 monitors | Ceph quorum requires an odd number ≥ 3 for HA |
| mgr.count: 2 | 2 managers | Active/standby for dashboard and module HA |
| deviceFilter: "^sd[b-z]$" | Disk regex | Matches secondary disks, excludes the boot disk (sda) |
| single-pool | 1x replicated | For apps with built-in replication (PostgreSQL, Kafka) — avoids double replication |
| default-pool | 3x replicated | For apps without replication — maximum durability |
| cephObjectStores | RGW gateway | S3-compatible endpoint for Velero backups and application object storage |
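
Applications choose a pool by naming the matching StorageClass. A sketch for a workload that replicates its own data, such as a PostgreSQL instance (the claim name is hypothetical):

```yaml
# Hypothetical PVC for an application with built-in replication.
# ceph-rbd-single (1x replicated) avoids paying for replication twice;
# apps without their own replication should use ceph-rbd instead.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data       # hypothetical name
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: ceph-rbd-single
  resources:
    requests:
      storage: 20Gi
```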
flux/infra/baremetal/rook-ceph-cluster/values.yaml
# Rook-Ceph Cluster — Non-HA configuration
# 1 MON, 1 MGR, single-replica pools, no object store

operatorNamespace: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3

  dataDirHostPath: /var/lib/rook

  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
    modules:
      - name: pg_autoscaler
        enabled: true

  dashboard:
    enabled: true
    ssl: false

  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter: "^sd[b-z]$"
    config:
      osdsPerDevice: "1"

  resources:
    mgr:
      requests: { cpu: 100m, memory: 256Mi }
      limits:   { cpu: 500m, memory: 512Mi }
    mon:
      requests: { cpu: 100m, memory: 256Mi }
      limits:   { cpu: 500m, memory: 512Mi }
    osd:
      requests: { cpu: 100m, memory: 512Mi }
      limits:   { cpu: 1000m, memory: 1Gi }

cephBlockPools:
  - name: single-pool
    spec:
      failureDomain: osd
      replicated:
        size: 1
        requireSafeReplicaSize: false
    storageClass:
      enabled: true
      name: ceph-rbd-single
      isDefault: true
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

cephBlockPoolsVolumeSnapshotClass:
  enabled: true
  name: ceph-rbd-snapshot
  isDefault: true
  deletionPolicy: Delete
  labels:
    velero.io/csi-volumesnapshot-class: "true"

cephObjectStores: []
cephFileSystems: []

monitoring:
  enabled: false

toolbox:
  enabled: true

No data redundancy

Single-replica pools and a single MON. An OSD or node failure will cause data loss. Use only for development or testing.
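
Both the HA and Non-HA values create the VolumeSnapshotClass ceph-rbd-snapshot, which Velero's CSI integration uses. The same class can be exercised manually; a sketch, where app-data is a placeholder PVC name:

```yaml
# Sketch: take a CSI snapshot of an existing RBD-backed PVC through
# the ceph-rbd-snapshot class. The PVC name is a placeholder.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snap        # hypothetical name
spec:
  volumeSnapshotClassName: ceph-rbd-snapshot
  source:
    persistentVolumeClaimName: app-data   # placeholder PVC
```

Once readyToUse is true in the snapshot status, the data path from PVC to Ceph snapshot is confirmed working for backups.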

Commit and Deploy

git add flux/infra/base/rook-ceph.yaml \
        flux/infra/base/rook-ceph-cluster.yaml \
        flux/infra/baremetal/rook-ceph/ \
        flux/infra/baremetal/rook-ceph-cluster/
git commit -m "feat(storage): add Rook-Ceph for bare metal environment"
git push
flux reconcile kustomization infra-rook-ceph -n flux-system --with-source
# Wait for operator to be ready, then:
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system --with-source

Verify

# Check Ceph cluster health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
# Verify OSDs are up
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
# Check storage classes
kubectl get sc
# Verify S3 object store (HA only)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- radosgw-admin user list
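
For an end-to-end check of the object store (HA only), Rook can provision a bucket through the ceph-bucket StorageClass via an ObjectBucketClaim; Rook then writes the access keys into a Secret and the endpoint details into a ConfigMap, both named after the claim. The claim name below is a placeholder:

```yaml
# Sketch: verify RGW provisioning end to end with an ObjectBucketClaim.
# After it binds, inspect the generated Secret/ConfigMap named test-bucket.
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: test-bucket          # hypothetical name
spec:
  generateBucketName: test-bucket
  storageClassName: ceph-bucket
```

Delete the claim afterwards to clean up the bucket and its credentials.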

Flux Operations

This component is managed by Flux as HelmRelease rook-ceph-cluster and Kustomization infra-rook-ceph-cluster.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease rook-ceph-cluster -n flux-system
flux get kustomization infra-rook-ceph-cluster -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-rook-ceph-cluster -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease rook-ceph-cluster -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=rook-ceph-cluster -n flux-system

Recovering a stalled HelmRelease

The Rook-Ceph Cluster HelmRelease often takes longer than the default Helm timeout (5 minutes) because OSD provisioning is slow. If the HelmRelease shows Stalled with RetriesExceeded, suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease rook-ceph-cluster -n flux-system
flux resume helmrelease rook-ceph-cluster -n flux-system
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system

Only run this after confirming the underlying issue (e.g. OSD provisioning timeout) has been resolved.

Rook-Ceph Operator

Rook orchestrates Ceph storage clusters on Kubernetes. The operator is installed first, then the cluster configuration follows. Rook manages MONs, MGRs, OSDs, and the RGW (S3-compatible) gateway.

Install

mkdir -p flux/infra/base
Field Value Explanation
chart rook-ceph Rook operator Helm chart
version v1.17.9 Pinned chart version
sourceRef.name rook-release HelmRepository CR pointing to https://charts.rook.io/release
targetNamespace rook-ceph Rook components run in their own namespace

Save the following as flux/infra/base/rook-ceph.yaml:

flux/infra/base/rook-ceph.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rook-ceph
  namespace: flux-system
spec:
  targetNamespace: rook-ceph
  interval: 30m
  chart:
    spec:
      chart: rook-ceph
      version: "v1.17.9"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  releaseName: rook-ceph
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
helm repo add rook-release https://charts.rook.io/release
helm repo update
helm upgrade --install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph \
  --create-namespace \
  --version v1.17.9 \
  -f values.yaml

Configuration

mkdir -p flux/infra/baremetal/rook-ceph

Save the following values file for the operator. Choose HA or Non-HA:

flux/infra/baremetal/rook-ceph/values.yaml
# Rook-Ceph Operator — HA configuration

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi

csi:
  enableCSIHostNetwork: true
  enableRbdDriver: true
  enableCephfsDriver: false
  enableCSISnapshotter: true

monitoring:
  enabled: true

enableDiscoveryDaemon: true
logLevel: INFO
flux/infra/baremetal/rook-ceph/values.yaml
# Rook-Ceph Operator — Non-HA configuration

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi

csi:
  enableCSIHostNetwork: true
  enableRbdDriver: true
  enableCephfsDriver: false
  enableCSISnapshotter: true

monitoring:
  enabled: false

enableDiscoveryDaemon: true
logLevel: INFO

Commit and Deploy

git add flux/infra/base/rook-ceph.yaml \
        flux/infra/baremetal/rook-ceph/
git commit -m "feat(storage): add Rook-Ceph operator for bare metal"
git push
flux reconcile kustomization infra-rook-ceph -n flux-system --with-source

Flux Operations

This component is managed by Flux as HelmRelease rook-ceph and Kustomization infra-rook-ceph.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease rook-ceph -n flux-system
flux get kustomization infra-rook-ceph -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-rook-ceph -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease rook-ceph -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=rook-ceph -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease rook-ceph -n flux-system
flux resume helmrelease rook-ceph -n flux-system
flux reconcile kustomization infra-rook-ceph -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.


Rook-Ceph Cluster

The cluster chart creates the actual Ceph cluster — MONs, MGRs, OSDs, the RGW object store, block pools, storage classes, and the VolumeSnapshotClass for Velero.

Deploy after the operator

The Rook-Ceph Cluster HelmRelease depends on the operator being healthy. Flux enforces this via dependsOn in the infrastructure ResourceSet.

Install

Save the following as flux/infra/base/rook-ceph-cluster.yaml:

flux/infra/base/rook-ceph-cluster.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rook-ceph-cluster
  namespace: flux-system
spec:
  targetNamespace: rook-ceph
  interval: 30m
  chart:
    spec:
      chart: rook-ceph-cluster
      version: "v1.17.9"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  releaseName: rook-ceph-cluster
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
helm upgrade --install rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph \
  --version v1.17.9 \
  -f values.yaml

Configuration

mkdir -p flux/infra/baremetal/rook-ceph-cluster

The cluster values define the Ceph topology, storage devices, block pools, storage classes, and optional S3 object store. Save as flux/infra/baremetal/rook-ceph-cluster/values.yaml:

flux/infra/baremetal/rook-ceph-cluster/values.yaml
# Rook-Ceph Cluster — HA configuration
# 3 MON, 2 MGR, 3 worker nodes, replicated pools, S3 object store

operatorNamespace: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3

  dataDirHostPath: /var/lib/rook

  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
      - name: rook
        enabled: true

  dashboard:
    enabled: true
    ssl: false

  storage:
    useAllNodes: false
    useAllDevices: true
    deviceFilter: "^sd[b-z]$"
    config:
      osdsPerDevice: "1"
    nodes:
      - name: "rciis-kenya-wn-01"
      - name: "rciis-kenya-wn-02"
      - name: "rciis-kenya-wn-03"

  resources:
    mgr:
      requests: { cpu: 250m, memory: 512Mi }
      limits:   { cpu: 1000m, memory: 1Gi }
    mon:
      requests: { cpu: 250m, memory: 512Mi }
      limits:   { cpu: 1000m, memory: 1Gi }
    osd:
      requests: { cpu: 250m, memory: 1Gi }
      limits:   { cpu: 2000m, memory: 2Gi }

# Block Pools and StorageClasses
cephBlockPools:
  # Single-replica — for apps WITH built-in replication (PostgreSQL, Kafka, etcd)
  - name: single-pool
    spec:
      failureDomain: osd
      replicated:
        size: 1
        requireSafeReplicaSize: false
    storageClass:
      enabled: true
      name: ceph-rbd-single
      isDefault: false
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

  # 3x replicated — for apps WITHOUT built-in replication
  - name: default-pool
    spec:
      failureDomain: host
      replicated:
        size: 3
    storageClass:
      enabled: true
      name: ceph-rbd
      isDefault: true
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

# VolumeSnapshotClass for Velero CSI backups
cephBlockPoolsVolumeSnapshotClass:
  enabled: true
  name: ceph-rbd-snapshot
  isDefault: true
  deletionPolicy: Delete
  labels:
    velero.io/csi-volumesnapshot-class: "true"

# S3-compatible Object Store (Ceph RGW)
cephObjectStores:
  - name: ceph-objectstore
    spec:
      allowUsersInNamespaces: [velero, monitoring]
      metadataPool:
        failureDomain: host
        replicated: { size: 3 }
      dataPool:
        failureDomain: host
        replicated: { size: 3 }
      preservePoolsOnDelete: true
      gateway:
        port: 80
        instances: 2
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits:   { cpu: 1000m, memory: 1Gi }
    storageClass:
      enabled: true
      name: ceph-bucket
      reclaimPolicy: Delete
      volumeBindingMode: Immediate

cephFileSystems: []

monitoring:
  enabled: true
  createPrometheusRules: true

toolbox:
  enabled: true

Key settings:

Setting Value Why
mon.count: 3 3 monitors Ceph quorum requires an odd number ≥ 3 for HA
mgr.count: 2 2 managers Active/standby for dashboard and module HA
deviceFilter: "^sd[b-z]$" Disk regex Matches secondary disks, excludes boot disk (sda)
single-pool 1x replicated For apps with built-in replication (PostgreSQL, Kafka) — avoids double replication
default-pool 3x replicated For apps without replication — maximum durability
cephObjectStores RGW gateway S3-compatible endpoint for Velero backups and application object storage
flux/infra/baremetal/rook-ceph-cluster/values.yaml
# Rook-Ceph Cluster — Non-HA configuration
# 1 MON, 1 MGR, single-replica pools, no object store

operatorNamespace: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3

  dataDirHostPath: /var/lib/rook

  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
    modules:
      - name: pg_autoscaler
        enabled: true

  dashboard:
    enabled: true
    ssl: false

  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter: "^sd[b-z]$"
    config:
      osdsPerDevice: "1"

  resources:
    mgr:
      requests: { cpu: 100m, memory: 256Mi }
      limits:   { cpu: 500m, memory: 512Mi }
    mon:
      requests: { cpu: 100m, memory: 256Mi }
      limits:   { cpu: 500m, memory: 512Mi }
    osd:
      requests: { cpu: 100m, memory: 512Mi }
      limits:   { cpu: 1000m, memory: 1Gi }

cephBlockPools:
  - name: single-pool
    spec:
      failureDomain: osd
      replicated:
        size: 1
        requireSafeReplicaSize: false
    storageClass:
      enabled: true
      name: ceph-rbd-single
      isDefault: true
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

cephBlockPoolsVolumeSnapshotClass:
  enabled: true
  name: ceph-rbd-snapshot
  isDefault: true
  deletionPolicy: Delete
  labels:
    velero.io/csi-volumesnapshot-class: "true"

cephObjectStores: []
cephFileSystems: []

monitoring:
  enabled: false

toolbox:
  enabled: true

No data redundancy

This configuration uses single-replica pools and a single MON. Any OSD or node failure will cause data loss. Use it only for development or testing.

Commit and Deploy

git add flux/infra/base/rook-ceph.yaml \
        flux/infra/base/rook-ceph-cluster.yaml \
        flux/infra/baremetal/rook-ceph/ \
        flux/infra/baremetal/rook-ceph-cluster/
git commit -m "feat(storage): add Rook-Ceph for bare metal environment"
git push
flux reconcile kustomization infra-rook-ceph -n flux-system --with-source
# Wait for operator to be ready, then:
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system --with-source

Verify

# Check Ceph cluster health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
# Verify OSDs are up
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
# Check storage classes
kubectl get sc
# Verify S3 object store (HA only)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- radosgw-admin user list
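With the HA object store healthy, applications can request S3 buckets declaratively through Rook's ObjectBucketClaim CRD, which provisions a bucket on the RGW and writes the endpoint and credentials into a ConfigMap and Secret of the same name. A minimal sketch, assuming a hypothetical claim for Velero backups (the name and namespace are illustrative):

```yaml
# Hypothetical ObjectBucketClaim — Rook provisions the bucket via RGW
# and creates a ConfigMap (endpoint, bucket name) and Secret
# (access/secret keys) named after the claim.
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: velero-backups       # illustrative name
  namespace: velero          # illustrative namespace
spec:
  generateBucketName: velero-backups
  storageClassName: ceph-bucket
```

This relies on the ceph-bucket StorageClass enabled in the HA values above and only applies to the HA configuration, since the non-HA values set cephObjectStores to an empty list.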

Flux Operations

This component is managed by Flux as HelmRelease rook-ceph-cluster and Kustomization infra-rook-ceph-cluster.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease rook-ceph-cluster -n flux-system
flux get kustomization infra-rook-ceph-cluster -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-rook-ceph-cluster -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease rook-ceph-cluster -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=rook-ceph-cluster -n flux-system

Recovering a stalled HelmRelease

The Rook-Ceph Cluster HelmRelease often takes longer than the default Helm timeout (5 minutes) because OSD provisioning is slow. If the HelmRelease shows Stalled with RetriesExceeded, suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease rook-ceph-cluster -n flux-system
flux resume helmrelease rook-ceph-cluster -n flux-system
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system

Only run this after confirming the underlying issue (e.g. OSD provisioning timeout) has been resolved.


Snapshot Controller

The CSI Snapshot Controller enables VolumeSnapshot support in Kubernetes, required for Velero CSI-based backups. This component is shared across both AWS and Bare Metal environments.

Install

mkdir -p flux/infra/base
| Field | Value | Explanation |
| --- | --- | --- |
| chart | snapshot-controller | CSI snapshot controller chart |
| version | 5.0.2 | Pinned chart version |
| sourceRef.name | piraeus | HelmRepository CR pointing to https://piraeus.io/helm-charts |
| targetNamespace | snapshot-controller | Deployed in its own namespace |

Save the following as flux/infra/base/snapshot-controller.yaml:

flux/infra/base/snapshot-controller.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: snapshot-controller
  namespace: flux-system
spec:
  targetNamespace: snapshot-controller
  interval: 30m
  chart:
    spec:
      chart: snapshot-controller
      version: "5.0.2"
      sourceRef:
        kind: HelmRepository
        name: piraeus
        namespace: flux-system
  releaseName: snapshot-controller
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
helm repo add piraeus https://piraeus.io/helm-charts
helm repo update
helm upgrade --install snapshot-controller piraeus/snapshot-controller \
  --namespace snapshot-controller \
  --create-namespace \
  --version 5.0.2

Commit and Deploy

git add flux/infra/base/snapshot-controller.yaml
git commit -m "feat(storage): add CSI Snapshot Controller"
git push
flux reconcile kustomization infra-snapshot-controller -n flux-system --with-source

Verify

kubectl get pods -n snapshot-controller
kubectl get volumesnapshotclass
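Once the controller is running, snapshot support can be exercised with a VolumeSnapshot that references the ceph-rbd-snapshot class defined in the Rook-Ceph values above. A minimal sketch, assuming a hypothetical PVC named postgres-data (the claim name and namespace are illustrative):

```yaml
# Hypothetical VolumeSnapshot — the snapshot controller watches this
# object and drives the CSI driver to create a Ceph RBD snapshot of
# the referenced PVC. Velero uses the same mechanism for backups.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap   # illustrative name
  namespace: databases       # illustrative namespace
spec:
  volumeSnapshotClassName: ceph-rbd-snapshot
  source:
    persistentVolumeClaimName: postgres-data
```

Check progress with kubectl get volumesnapshot -n databases; READYTOUSE turns true once the CSI driver has cut the snapshot.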

Flux Operations

This component is managed by Flux as HelmRelease snapshot-controller and Kustomization infra-snapshot-controller.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease snapshot-controller -n flux-system
flux get kustomization infra-snapshot-controller -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-snapshot-controller -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease snapshot-controller -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=snapshot-controller -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease snapshot-controller -n flux-system
flux resume helmrelease snapshot-controller -n flux-system
flux reconcile kustomization infra-snapshot-controller -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.


Next Steps

Storage is now configured. Proceed to 5.2 Security to set up the policy engine, vulnerability scanning, and runtime threat detection.