5.1.4 Storage¶
The storage layer provides persistent block storage, optional S3-compatible object storage, and CSI volume snapshot support for backup integration.
How to use this page
Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.
All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.
- Using the existing `rciis-devops` repository: All files already exist. Skip the `mkdir` and `git add`/`git commit` commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
- Building a new repository from scratch: Follow the `mkdir`, file creation, and `git` commands in order.
- No Git access: Expand the "Alternative: Helm CLI" block under each Install section.
Storage Architecture¶
The storage backend differs fundamentally between deployment environments:
| Concern | AWS | Bare Metal |
|---|---|---|
| CSI Driver | AWS EBS CSI Driver | Rook-Ceph (RBD) |
| Block Storage | EBS gp3 volumes | Ceph RBD pools |
| Object Storage | AWS S3 | Ceph RGW (S3-compatible) |
| Default StorageClass | gp3 | ceph-rbd-single (single-replica) or ceph-rbd (3x replicated) |
| Snapshot Support | EBS Snapshots via CSI | Ceph RBD Snapshots via CSI |
Select your environment in the tabs below — the choice syncs across the entire page.
Block Storage¶
AWS EBS CSI Driver¶
The AWS EBS CSI Driver provides native Kubernetes integration with AWS Elastic Block Store. It enables dynamic provisioning of EBS volumes. On Talos Linux, the controller requires explicit IAM credentials via a Secret because Talos pods cannot reach the EC2 Instance Metadata Service (IMDS).
Install¶
Create the base directory and save the HelmRelease:
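The directory layout follows the file paths used throughout this page, so the base directory can be created with:

```shell
mkdir -p flux/infra/base
```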
| Field | Value | Explanation |
|---|---|---|
| chart | aws-ebs-csi-driver | Helm chart from the AWS EBS CSI driver project |
| version | 2.56.1 | Pinned chart version |
| sourceRef.name | aws-ebs-csi-driver | HelmRepository CR pointing to https://kubernetes-sigs.github.io/aws-ebs-csi-driver |
| targetNamespace | kube-system | CSI drivers run in kube-system |
Save the following as flux/infra/base/aws-ebs-csi-driver.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: aws-ebs-csi-driver
  namespace: flux-system
spec:
  targetNamespace: kube-system
  interval: 30m
  chart:
    spec:
      chart: aws-ebs-csi-driver
      version: "2.56.1"
      sourceRef:
        kind: HelmRepository
        name: aws-ebs-csi-driver
        namespace: flux-system
  releaseName: aws-ebs-csi-driver
  install:
    createNamespace: true
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
  values:
    controller:
      replicaCount: 1
    node:
      tolerateAllTaints: true
Alternative: Helm CLI
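Without Git access, an equivalent imperative install can be sketched with the Helm CLI, using the chart source and values from the table above (the repo alias is a local choice):

```shell
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --version 2.56.1 \
  --set controller.replicaCount=1 \
  --set node.tolerateAllTaints=true
```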
Configuration¶
Create the environment overlay directory:
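Matching the overlay file paths used in this section:

```shell
mkdir -p flux/infra/aws/aws-ebs-csi-driver
```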
Environment Patch¶
The patch provides IAM credentials and resource limits. Save as flux/infra/aws/aws-ebs-csi-driver/patch.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: aws-ebs-csi-driver
spec:
  values:
    # IAM credentials via secret — Talos pods can't reach IMDS
    awsAccessSecret:
      name: aws-secret
      keyId: key_id
      accessKey: access_key
    controller:
      replicaCount: 1
      resources:
        requests:
          cpu: 10m
          memory: 40Mi
        limits:
          cpu: 100m
          memory: 128Mi
    node:
      tolerateAllTaints: true
| Setting | Value | Why |
|---|---|---|
| awsAccessSecret | aws-secret | Talos cannot use IMDS — IAM credentials must be injected via a Kubernetes Secret |
| node.tolerateAllTaints | true | Ensures the CSI node plugin runs on all nodes, including tainted ones |
StorageClass¶
The gp3 StorageClass is the default for all PersistentVolumeClaims on AWS.
Save as flux/infra/aws/aws-ebs-csi-driver/storageclass.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp3
  fsType: ext4
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
| Setting | Value | Why |
|---|---|---|
| type: gp3 | EBS volume type | General purpose SSD — 3,000 IOPS and 125 MB/s baseline included |
| volumeBindingMode | WaitForFirstConsumer | Volume is created in the same AZ as the pod that claims it |
| allowVolumeExpansion | true | PVCs can be resized without recreating the volume |
Commit and Deploy¶
git add flux/infra/base/aws-ebs-csi-driver.yaml \
flux/infra/aws/aws-ebs-csi-driver/
git commit -m "feat(storage): add AWS EBS CSI Driver"
git push
Verify¶
# Test dynamic provisioning
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-ebs-pvc
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3
  resources:
    requests:
      storage: 1Gi
EOF

# The PVC stays Pending until a pod claims it (volumeBindingMode: WaitForFirstConsumer)
kubectl get pvc test-ebs-pvc

# Clean up
kubectl delete pvc test-ebs-pvc
Flux Operations¶
This component is managed by Flux as HelmRelease aws-ebs-csi-driver and Kustomization infra-aws-ebs-csi-driver.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
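The four operations above can be sketched with the Flux CLI (command forms assumed from standard Flux usage, not reproduced from this page's tabs):

```shell
# Check Ready state
flux get helmrelease aws-ebs-csi-driver -n flux-system
flux get kustomization infra-aws-ebs-csi-driver -n flux-system

# Immediate sync: pull the latest Git revision and re-apply
flux reconcile kustomization infra-aws-ebs-csi-driver -n flux-system --with-source

# Re-run the Helm install/upgrade for this release
flux reconcile helmrelease aws-ebs-csi-driver -n flux-system

# Recent controller logs for this release
flux logs --kind HelmRelease --name aws-ebs-csi-driver -n flux-system
```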
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease aws-ebs-csi-driver -n flux-system
flux resume helmrelease aws-ebs-csi-driver -n flux-system
flux reconcile kustomization infra-aws-ebs-csi-driver -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.
Rook-Ceph Operator¶
Rook orchestrates Ceph storage clusters on Kubernetes. The operator is installed first, then the cluster configuration follows. Rook manages MONs, MGRs, OSDs, and the RGW (S3-compatible) gateway.
Install¶
| Field | Value | Explanation |
|---|---|---|
| chart | rook-ceph | Rook operator Helm chart |
| version | v1.17.9 | Pinned chart version |
| sourceRef.name | rook-release | HelmRepository CR pointing to https://charts.rook.io/release |
| targetNamespace | rook-ceph | Rook components run in their own namespace |
Save the following as flux/infra/base/rook-ceph.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rook-ceph
  namespace: flux-system
spec:
  targetNamespace: rook-ceph
  interval: 30m
  chart:
    spec:
      chart: rook-ceph
      version: "v1.17.9"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  releaseName: rook-ceph
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
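Without Git access, an equivalent imperative install can be sketched with the Helm CLI, using the chart details from the table above (the repo alias and the assumption of a local values.yaml are illustrative):

```shell
helm repo add rook-release https://charts.rook.io/release
helm repo update
helm install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph --create-namespace \
  --version v1.17.9 \
  -f values.yaml
```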
Configuration¶
Save the following values file for the operator. Choose HA or Non-HA:
# Rook-Ceph Operator — HA configuration
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
csi:
  enableCSIHostNetwork: true
  enableRbdDriver: true
  enableCephfsDriver: false
  enableCSISnapshotter: true
monitoring:
  enabled: true
enableDiscoveryDaemon: true
logLevel: INFO
# Rook-Ceph Operator — Non-HA configuration
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
csi:
  enableCSIHostNetwork: true
  enableRbdDriver: true
  enableCephfsDriver: false
  enableCSISnapshotter: true
monitoring:
  enabled: false
enableDiscoveryDaemon: true
logLevel: INFO
Commit and Deploy¶
git add flux/infra/base/rook-ceph.yaml \
flux/infra/baremetal/rook-ceph/
git commit -m "feat(storage): add Rook-Ceph operator for bare metal"
git push
Flux Operations¶
This component is managed by Flux as HelmRelease rook-ceph and Kustomization infra-rook-ceph.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
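The four operations above can be sketched with the Flux CLI (command forms assumed from standard Flux usage, not reproduced from this page's tabs):

```shell
# Check Ready state
flux get helmrelease rook-ceph -n flux-system
flux get kustomization infra-rook-ceph -n flux-system

# Immediate sync: pull the latest Git revision and re-apply
flux reconcile kustomization infra-rook-ceph -n flux-system --with-source

# Re-run the Helm install/upgrade for this release
flux reconcile helmrelease rook-ceph -n flux-system

# Recent controller logs for this release
flux logs --kind HelmRelease --name rook-ceph -n flux-system
```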
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease rook-ceph -n flux-system
flux resume helmrelease rook-ceph -n flux-system
flux reconcile kustomization infra-rook-ceph -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.
Rook-Ceph Cluster¶
The cluster chart creates the actual Ceph cluster — MONs, MGRs, OSDs, the RGW object store, block pools, storage classes, and the VolumeSnapshotClass for Velero.
Deploy after the operator
The Rook-Ceph Cluster HelmRelease depends on the operator being healthy.
Flux enforces this via dependsOn in the infrastructure ResourceSet.
Install¶
Save the following as flux/infra/base/rook-ceph-cluster.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: rook-ceph-cluster
  namespace: flux-system
spec:
  targetNamespace: rook-ceph
  interval: 30m
  chart:
    spec:
      chart: rook-ceph-cluster
      version: "v1.17.9"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  releaseName: rook-ceph-cluster
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
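Without Git access, an equivalent imperative install can be sketched with the Helm CLI, passing the cluster values file described below (repo alias and local values.yaml path are illustrative):

```shell
helm repo add rook-release https://charts.rook.io/release
helm install rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph \
  --version v1.17.9 \
  -f values.yaml
```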
Configuration¶
The cluster values define the Ceph topology, storage devices, block pools,
storage classes, and optional S3 object store. Save as
flux/infra/baremetal/rook-ceph-cluster/values.yaml:
# Rook-Ceph Cluster — HA configuration
# 3 MON, 2 MGR, 3 worker nodes, replicated pools, S3 object store
operatorNamespace: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
      - name: rook
        enabled: true
  dashboard:
    enabled: true
    ssl: false
  storage:
    useAllNodes: false
    useAllDevices: true
    deviceFilter: "^sd[b-z]$"
    config:
      osdsPerDevice: "1"
    nodes:
      - name: "rciis-kenya-wn-01"
      - name: "rciis-kenya-wn-02"
      - name: "rciis-kenya-wn-03"
  resources:
    mgr:
      requests: { cpu: 250m, memory: 512Mi }
      limits: { cpu: 1000m, memory: 1Gi }
    mon:
      requests: { cpu: 250m, memory: 512Mi }
      limits: { cpu: 1000m, memory: 1Gi }
    osd:
      requests: { cpu: 250m, memory: 1Gi }
      limits: { cpu: 2000m, memory: 2Gi }

# Block Pools and StorageClasses
cephBlockPools:
  # Single-replica — for apps WITH built-in replication (PostgreSQL, Kafka, etcd)
  - name: single-pool
    spec:
      failureDomain: osd
      replicated:
        size: 1
        requireSafeReplicaSize: false
    storageClass:
      enabled: true
      name: ceph-rbd-single
      isDefault: false
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4
  # 3x replicated — for apps WITHOUT built-in replication
  - name: default-pool
    spec:
      failureDomain: host
      replicated:
        size: 3
    storageClass:
      enabled: true
      name: ceph-rbd
      isDefault: true
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

# VolumeSnapshotClass for Velero CSI backups
cephBlockPoolsVolumeSnapshotClass:
  enabled: true
  name: ceph-rbd-snapshot
  isDefault: true
  deletionPolicy: Delete
  labels:
    velero.io/csi-volumesnapshot-class: "true"

# S3-compatible Object Store (Ceph RGW)
cephObjectStores:
  - name: ceph-objectstore
    spec:
      allowUsersInNamespaces: [velero, monitoring]
      metadataPool:
        failureDomain: host
        replicated: { size: 3 }
      dataPool:
        failureDomain: host
        replicated: { size: 3 }
      preservePoolsOnDelete: true
      gateway:
        port: 80
        instances: 2
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits: { cpu: 1000m, memory: 1Gi }
    storageClass:
      enabled: true
      name: ceph-bucket
      reclaimPolicy: Delete
      volumeBindingMode: Immediate

cephFileSystems: []

monitoring:
  enabled: true
  createPrometheusRules: true

toolbox:
  enabled: true
Key settings:
| Setting | Value | Why |
|---|---|---|
| mon.count: 3 | 3 monitors | Ceph quorum requires an odd number ≥ 3 for HA |
| mgr.count: 2 | 2 managers | Active/standby for dashboard and module HA |
| deviceFilter: "^sd[b-z]$" | Disk regex | Matches secondary disks, excludes the boot disk (sda) |
| single-pool | 1x replicated | For apps with built-in replication (PostgreSQL, Kafka) — avoids double replication |
| default-pool | 3x replicated | For apps without replication — maximum durability |
| cephObjectStores | RGW gateway | S3-compatible endpoint for Velero backups and application object storage |
# Rook-Ceph Cluster — Non-HA configuration
# 1 MON, 1 MGR, single-replica pools, no object store
operatorNamespace: rook-ceph

cephClusterSpec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3
  dataDirHostPath: /var/lib/rook
  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: false
  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter: "^sd[b-z]$"
    config:
      osdsPerDevice: "1"
  resources:
    mgr:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }
    mon:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }
    osd:
      requests: { cpu: 100m, memory: 512Mi }
      limits: { cpu: 1000m, memory: 1Gi }

cephBlockPools:
  - name: single-pool
    spec:
      failureDomain: osd
      replicated:
        size: 1
        requireSafeReplicaSize: false
    storageClass:
      enabled: true
      name: ceph-rbd-single
      isDefault: true
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      volumeBindingMode: Immediate
      parameters:
        imageFormat: "2"
        imageFeatures: layering
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

cephBlockPoolsVolumeSnapshotClass:
  enabled: true
  name: ceph-rbd-snapshot
  isDefault: true
  deletionPolicy: Delete
  labels:
    velero.io/csi-volumesnapshot-class: "true"

cephObjectStores: []
cephFileSystems: []

monitoring:
  enabled: false

toolbox:
  enabled: true
No data redundancy
Single-replica pools and a single MON. An OSD or node failure will cause data loss. Use only for development or testing.
Commit and Deploy¶
git add flux/infra/base/rook-ceph.yaml \
flux/infra/base/rook-ceph-cluster.yaml \
flux/infra/baremetal/rook-ceph/ \
flux/infra/baremetal/rook-ceph-cluster/
git commit -m "feat(storage): add Rook-Ceph for bare metal environment"
git push
flux reconcile kustomization infra-rook-ceph -n flux-system --with-source
# Wait for operator to be ready, then:
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system --with-source
Verify¶
# Verify S3 object store (HA only)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- radosgw-admin user list
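Overall cluster health and the resulting StorageClasses can also be checked via the toolbox (enabled in the values above):

```shell
# Overall health — expect HEALTH_OK once all OSDs are up
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree

# StorageClasses created by the chart
kubectl get storageclass
```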
Flux Operations¶
This component is managed by Flux as HelmRelease rook-ceph-cluster and Kustomization infra-rook-ceph-cluster.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
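The four operations above can be sketched with the Flux CLI (command forms assumed from standard Flux usage, not reproduced from this page's tabs):

```shell
# Check Ready state
flux get helmrelease rook-ceph-cluster -n flux-system
flux get kustomization infra-rook-ceph-cluster -n flux-system

# Immediate sync: pull the latest Git revision and re-apply
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system --with-source

# Re-run the Helm install/upgrade for this release
flux reconcile helmrelease rook-ceph-cluster -n flux-system

# Recent controller logs for this release
flux logs --kind HelmRelease --name rook-ceph-cluster -n flux-system
```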
Recovering a stalled HelmRelease
The Rook-Ceph Cluster HelmRelease often takes longer than the default Helm
timeout (5 minutes) because OSD provisioning is slow. If the HelmRelease shows
Stalled with RetriesExceeded, suspend and resume to clear the failure counter,
then reconcile:
flux suspend helmrelease rook-ceph-cluster -n flux-system
flux resume helmrelease rook-ceph-cluster -n flux-system
flux reconcile kustomization infra-rook-ceph-cluster -n flux-system
Only run this after confirming the underlying issue (e.g. OSD provisioning timeout) has been resolved.
Snapshot Controller¶
The CSI Snapshot Controller enables VolumeSnapshot support in Kubernetes, required
for Velero CSI-based backups. This component is shared across both AWS and Bare Metal
environments.
Install¶
| Field | Value | Explanation |
|---|---|---|
| chart | snapshot-controller | CSI snapshot controller chart |
| version | 5.0.2 | Pinned chart version |
| sourceRef.name | piraeus | HelmRepository CR pointing to https://piraeus.io/helm-charts |
| targetNamespace | snapshot-controller | Deployed in its own namespace |
Save the following as flux/infra/base/snapshot-controller.yaml:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: snapshot-controller
  namespace: flux-system
spec:
  targetNamespace: snapshot-controller
  interval: 30m
  chart:
    spec:
      chart: snapshot-controller
      version: "5.0.2"
      sourceRef:
        kind: HelmRepository
        name: piraeus
        namespace: flux-system
  releaseName: snapshot-controller
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
Alternative: Helm CLI
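Without Git access, an equivalent imperative install can be sketched with the Helm CLI, using the chart details from the table above (the repo alias is a local choice):

```shell
helm repo add piraeus https://piraeus.io/helm-charts
helm repo update
helm install snapshot-controller piraeus/snapshot-controller \
  --namespace snapshot-controller --create-namespace \
  --version 5.0.2
```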
Commit and Deploy¶
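Following the pattern used for the other components (the commit message is illustrative):

```shell
git add flux/infra/base/snapshot-controller.yaml
git commit -m "feat(storage): add CSI snapshot controller"
git push
```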
Verify¶
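A minimal check, assuming the controller Deployment lands in the snapshot-controller namespace as configured above:

```shell
# Controller pods should be Running
kubectl -n snapshot-controller get pods

# Snapshot CRDs should be installed
kubectl get crd volumesnapshots.snapshot.storage.k8s.io \
  volumesnapshotclasses.snapshot.storage.k8s.io \
  volumesnapshotcontents.snapshot.storage.k8s.io
```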
Flux Operations¶
This component is managed by Flux as HelmRelease snapshot-controller and Kustomization infra-snapshot-controller.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
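The four operations above can be sketched with the Flux CLI (command forms assumed from standard Flux usage, not reproduced from this page's tabs):

```shell
# Check Ready state
flux get helmrelease snapshot-controller -n flux-system
flux get kustomization infra-snapshot-controller -n flux-system

# Immediate sync: pull the latest Git revision and re-apply
flux reconcile kustomization infra-snapshot-controller -n flux-system --with-source

# Re-run the Helm install/upgrade for this release
flux reconcile helmrelease snapshot-controller -n flux-system

# Recent controller logs for this release
flux logs --kind HelmRelease --name snapshot-controller -n flux-system
```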
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
flux suspend helmrelease snapshot-controller -n flux-system
flux resume helmrelease snapshot-controller -n flux-system
flux reconcile kustomization infra-snapshot-controller -n flux-system
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved.
Next Steps¶
Storage is now configured. Proceed to 5.2 Security to set up the policy engine, vulnerability scanning, and runtime threat detection.