
5.3.3 Backup & Scheduling

The backup strategy combines two complementary approaches: Velero for Kubernetes resource and PVC-level backups using CSI volume snapshots, and CloudNativePG continuous backup for PostgreSQL point-in-time recovery (PITR) using WAL archiving to S3-compatible storage. The Descheduler continuously rebalances pod placement to maintain even resource distribution across nodes.

How to use this page

Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.

All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.

  • Using the existing rciis-devops repository: All files already exist. Skip the mkdir and git add/git commit commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
  • Building a new repository from scratch: Follow the mkdir, file creation, and git commands in order.
  • No Git access: Expand the "Alternative: Helm CLI" block under each Install section.

Velero

Velero backs up Kubernetes resources and persistent volumes. It uses CSI snapshots for volume backups and stores backup metadata in a cloud object store. On AWS, this is AWS S3. On Bare Metal, it is the in-cluster Ceph Object Store (S3-compatible via RGW). This enables both scheduled backups and on-demand disaster recovery.

Install

The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches (shown in the Configuration section).

Create the base directory and file:

mkdir -p flux/infra/base
Field Value Explanation
chart velero The Helm chart name from the VMware Tanzu registry
version 11.3.2 Pinned chart version — update this to upgrade Velero
sourceRef.name vmware-tanzu References a HelmRepository CR pointing to the VMware Tanzu Helm repository
targetNamespace velero Velero is installed in its own namespace
crds CreateReplace Automatically installs and updates Velero CRDs
remediation.retries 3 Flux retries up to 3 times if the install or upgrade fails

Save the following as flux/infra/base/velero.yaml:

flux/infra/base/velero.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
  namespace: flux-system
spec:
  targetNamespace: velero
  interval: 30m
  chart:
    spec:
      chart: velero
      version: "11.3.2"
      sourceRef:
        kind: HelmRepository
        name: vmware-tanzu
        namespace: flux-system
  releaseName: velero
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
    initContainers:
      - name: velero-plugin-for-aws
        image: velero/velero-plugin-for-aws:v1.13.0
        volumeMounts:
          - mountPath: /target
            name: plugins
    configuration:
      features: EnableCSI
      volumeSnapshotLocation: []
    credentials:
      useSecret: true
      existingSecret: velero-s3-credentials
    deployNodeAgent: false
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        additionalLabels:
          release: prometheus
    schedules: {}
    kubectl:
      image:
        repository: public.ecr.aws/bitnami/kubectl
Alternative: Helm CLI

If you do not have Git access, install Velero directly:

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm upgrade --install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --version 11.3.2 \
  -f values.yaml

Configuration

The environment patch overrides the base HelmRelease with cluster-specific settings. The values file controls where backups are stored and how Velero behaves. Select your environment below.

Create the environment overlay directory:

mkdir -p flux/infra/aws/velero
mkdir -p flux/infra/baremetal/velero

Environment Patch

The patch file sets the backup storage location. This differs fundamentally between AWS and Bare Metal.

Save the following as the patch file for your environment:

On AWS, Velero stores backup metadata directly in AWS S3. The AWS plugin uses native S3 endpoints — no s3Url or s3ForcePathStyle is needed.

flux/infra/aws/velero/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
spec:
  values:
    resources:
      requests:
        cpu: 25m
        memory: 64Mi
      limits:
        cpu: 250m
        memory: 256Mi
    configuration:
      backupStorageLocation:
        - name: default
          provider: aws
          bucket: rciis-aws-velero-backups
          config:
            region: af-south-1
Setting Value Why
bucket rciis-aws-velero-backups AWS S3 bucket for backup storage
region af-south-1 AWS region where the bucket is located
Resource limits (reduced) CPU 25m, RAM 64Mi AWS deployments require fewer resources than HA bare metal

On Bare Metal, Velero stores backup metadata in the in-cluster Ceph Object Store (RGW). The patch configures S3 compatibility settings for Ceph RGW.

flux/infra/baremetal/velero/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
spec:
  values:
    configuration:
      backupStorageLocation:
        - name: default
          provider: aws
          bucket: velero-backups
          config:
            region: rciis-kenya
            s3ForcePathStyle: true
            s3Url: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
Setting Value Why
bucket velero-backups Ceph RGW bucket for backup storage
region rciis-kenya Region identifier for Ceph RGW (arbitrary)
s3ForcePathStyle true Uses path-style S3 URLs (required for Ceph RGW)
s3Url http://rook-ceph-rgw-... Ceph RGW endpoint within the cluster

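If you are building the repository from scratch, the overlay directory also needs a kustomization.yaml that pulls in the base HelmRelease and applies the patch. A minimal sketch for the Bare Metal overlay (file layout assumed — adapt the paths to your repository):

```yaml
# flux/infra/baremetal/velero/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/velero.yaml
patches:
  - path: patch.yaml
    target:
      kind: HelmRelease
      name: velero
```

The AWS overlay is identical apart from its patch file.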

Helm Values

The values file controls Velero's backup schedules and feature flags. Save the following as the values file for your environment:

flux/infra/aws/velero/values.yaml
# Velero — AWS HA configuration
# Automated backup schedules, CSI snapshots, S3 backend

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  seccompProfile:
    type: RuntimeDefault

containerSecurityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus

# Automated backup schedules
schedules:
  # Daily namespace backup — retains 30 days
  daily-namespaces:
    disabled: false
    schedule: "0 2 * * *"   # 02:00 UTC daily
    useOwnerReferencesInBackup: false
    template:
      ttl: "720h"           # 30 days
      storageLocation: default
      includedNamespaces:
        - rciis-aws
        - monitoring
        - strimzi-operator
        - cnpg-system
      snapshotMoveData: false

  # Weekly full-cluster backup — retains 90 days
  weekly-full:
    disabled: false
    schedule: "0 3 * * 0"   # 03:00 UTC Sunday
    useOwnerReferencesInBackup: false
    template:
      ttl: "2160h"          # 90 days
      storageLocation: default
      includeClusterResources: true
      snapshotMoveData: false
flux/infra/aws/velero/values.yaml
# Velero — AWS Non-HA configuration
# No automated schedules, reduced resources, on-demand backups only

metrics:
  enabled: true
  serviceMonitor:
    enabled: false

# No automated schedules — create on-demand backups as needed
schedules: {}
flux/infra/baremetal/velero/values.yaml
# Velero — Bare Metal HA configuration
# Automated backup schedules, CSI snapshots, Ceph RGW backend

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  seccompProfile:
    type: RuntimeDefault

containerSecurityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus

# Automated backup schedules
schedules:
  # Daily namespace backup — retains 30 days
  daily-namespaces:
    disabled: false
    schedule: "0 2 * * *"   # 02:00 UTC daily
    useOwnerReferencesInBackup: false
    template:
      ttl: "720h"           # 30 days
      storageLocation: default
      includedNamespaces:
        - rciis-kenya
        - monitoring
        - strimzi-operator
        - cnpg-system
      snapshotMoveData: false

  # Weekly full-cluster backup — retains 90 days
  weekly-full:
    disabled: false
    schedule: "0 3 * * 0"   # 03:00 UTC Sunday
    useOwnerReferencesInBackup: false
    template:
      ttl: "2160h"          # 90 days
      storageLocation: default
      includeClusterResources: true
      snapshotMoveData: false
flux/infra/baremetal/velero/values.yaml
# Velero — Bare Metal Non-HA configuration
# No automated schedules, reduced resources, on-demand backups only

metrics:
  enabled: true
  serviceMonitor:
    enabled: false

# No automated schedules — create on-demand backups as needed
schedules: {}
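With schedules: {} in the Non-HA values, backups are created on demand. Besides the velero CLI, an on-demand backup can also be declared as a Backup CR — a sketch (name and namespace list are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: adhoc-backup          # illustrative name
  namespace: velero
spec:
  includedNamespaces:
    - rciis-kenya             # adjust to your application namespace
  storageLocation: default
  ttl: 720h0m0s               # keep for 30 days
```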

Key settings (all environments):

Setting HA Non-HA Why
schedules.* Daily + weekly Empty {} Automated schedules provide continuous protection vs on-demand only
metrics.serviceMonitor.enabled true false HA exports Velero metrics to Prometheus for monitoring
podSecurityContext Strict (65534:65534) Inherited from base HA enforces non-root execution for security
EnableCSI Enabled in base Enabled in base CSI snapshots required for PVC-level backups

Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/velero.yaml \
        flux/infra/aws/velero/
git commit -m "feat(velero): add Velero backup for AWS environment"
git push
git add flux/infra/base/velero.yaml \
        flux/infra/baremetal/velero/
git commit -m "feat(velero): add Velero backup for bare metal environment"
git push

Flux will detect the new commit and begin deploying Velero. To trigger an immediate sync instead of waiting for the next poll interval:

flux reconcile kustomization infra-velero -n flux-system --with-source

Extra Manifests - Ceph S3 User

Bare Metal only

This manifest is only required when using Ceph RGW as the backup storage backend. AWS deployments use IAM credentials instead.

Velero needs an S3 user in Ceph to access the backup bucket:

flux/infra/baremetal/velero/velero-s3-user.yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: velero
  namespace: velero
spec:
  store: ceph-objectstore
  clusterNamespace: rook-ceph
  displayName: "Velero Backup User"
  capabilities:
    user: "*"
    bucket: "*"
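
The base HelmRelease references existingSecret: velero-s3-credentials, which the chart does not create. The Velero chart expects this secret to carry an AWS-style credentials file under the cloud key. A sketch, using the access keys Rook generates for the user above (key values are placeholders — copy them from the Rook-generated object-user secret, and encrypt this file with SOPS before committing):

```yaml
# velero-s3-credentials — placeholder keys, encrypt with SOPS
apiVersion: v1
kind: Secret
metadata:
  name: velero-s3-credentials
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id = <ACCESS_KEY>
    aws_secret_access_key = <SECRET_KEY>
```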

CSI-only backup strategy

With deployNodeAgent: false, only PVCs backed by CSI-compatible storage classes (Ceph RBD) are snapshotted. Ensure all critical workloads use ceph-rbd or ceph-rbd-single storage classes.
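
For CSI snapshots to work, Velero also needs a VolumeSnapshotClass labelled for its use. A sketch for Ceph RBD (driver and secret names are assumptions — match them to your Rook deployment):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ceph-rbd-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"   # tells Velero to use this class
driver: rook-ceph.rbd.csi.ceph.com               # assumed Rook CSI driver name
deletionPolicy: Retain                           # keep snapshots backing Velero backups
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
```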


Verify

# Check Velero is running
kubectl get pods -n velero

# Verify backup storage location
velero backup-location get

# Create a test backup
velero backup create test-backup --include-namespaces default --wait

# Check backup status
velero backup describe test-backup

# Clean up test backup
velero backup delete test-backup --confirm

Flux Operations

This component is managed by Flux as HelmRelease velero and Kustomization infra-velero.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease velero -n flux-system
flux get kustomization infra-velero -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-velero -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease velero -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=velero -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease velero -n flux-system
flux resume helmrelease velero -n flux-system
flux reconcile kustomization infra-velero -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.

Next: Continue to CloudNativePG Backups below.


CloudNativePG Backups

CloudNativePG provides continuous backup at the PostgreSQL level using Barman. This is independent of Velero — while Velero backs up Kubernetes resources and PVCs as CSI snapshots, CNPG archives the PostgreSQL Write-Ahead Log (WAL) stream and performs periodic base backups directly to S3-compatible storage.

This enables point-in-time recovery (PITR) for all PostgreSQL databases managed by the CNPG operator (Grafana, Keycloak, application databases).

Operator vs Cluster backups

The CNPG operator (installed in Data Services) does not configure backups itself. Backups are configured per Cluster CR in each application namespace. The examples below show the backup stanza to add to any CNPG Cluster.

Cluster Backup Configuration

Add the backup stanza to any CNPG Cluster CR to enable continuous WAL archiving and base backups. The S3 destination depends on your deployment model:

cluster-with-backup.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single

  backup:
    barmanObjectStore:
      destinationPath: s3://rciis-cnpg-backups/example-db
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"

The cnpg-s3-credentials Secret contains AWS IAM credentials:

cnpg-s3-credentials (SOPS-encrypted)
apiVersion: v1
kind: Secret
metadata:
  name: cnpg-s3-credentials
  namespace: app-namespace
type: Opaque
stringData:
  ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
  ACCESS_SECRET_KEY: "<AWS_SECRET_ACCESS_KEY>"

IAM Roles for Service Accounts (IRSA)

On EKS, prefer IRSA over static credentials. Set backup.barmanObjectStore.s3Credentials.inheritFromIAMRole: true and annotate the CNPG ServiceAccount with the IAM role ARN.
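
A sketch of the IRSA variant (the role ARN and annotation value are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
  namespace: app-namespace
spec:
  instances: 3
  serviceAccountTemplate:
    metadata:
      annotations:
        # placeholder — IAM role with access to the backup bucket
        eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/cnpg-backup
  backup:
    barmanObjectStore:
      destinationPath: s3://rciis-cnpg-backups/example-db
      s3Credentials:
        inheritFromIAMRole: true   # no static keys needed
    retentionPolicy: "30d"
```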

cluster-with-backup.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single

  backup:
    barmanObjectStore:
      destinationPath: s3://cnpg-backups/example-db
      endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"

The cnpg-s3-credentials Secret contains the Ceph RGW user credentials:

cnpg-s3-credentials (SOPS-encrypted)
apiVersion: v1
kind: Secret
metadata:
  name: cnpg-s3-credentials
  namespace: app-namespace
type: Opaque
stringData:
  ACCESS_KEY_ID: "<CEPH_RGW_ACCESS_KEY>"
  ACCESS_SECRET_KEY: "<CEPH_RGW_SECRET_KEY>"

Scheduled Base Backups

WAL archiving is continuous, but periodic base backups are needed for efficient recovery. Create a ScheduledBackup CR for each PostgreSQL cluster:

scheduled-backup.yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: example-db-daily
  namespace: app-namespace
spec:
  schedule: "0 2 * * *"           # 02:00 UTC daily
  backupOwnerReference: self
  cluster:
    name: example-db
  method: barmanObjectStore

Backup retention

The retentionPolicy: "30d" in the Cluster CR controls how long base backups and WAL files are retained. The ScheduledBackup creates new base backups on schedule — old base backups and WAL segments beyond the retention window are automatically pruned by Barman.
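
Besides the schedule, a one-off base backup (for example before a risky migration) can be requested with a Backup CR:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: example-db-manual    # illustrative name
  namespace: app-namespace
spec:
  method: barmanObjectStore
  cluster:
    name: example-db
```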

Ceph S3 User for CNPG

Bare Metal only

This manifest is only required when using Ceph RGW as the backup storage backend. AWS deployments use IAM credentials instead.

cnpg-s3-user.yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: cnpg-backup
  namespace: rook-ceph
spec:
  store: ceph-objectstore
  clusterNamespace: rook-ceph
  displayName: "CNPG Backup User"
  capabilities:
    user: "*"
    bucket: "*"

Recovery

To recover a PostgreSQL cluster to a specific point in time, create a new Cluster CR that bootstraps from the backup:

recovery-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-recovered
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single

  bootstrap:
    recovery:
      source: example-db-backup
      recoveryTarget:
        targetTime: "2026-02-15T12:00:00Z"

  externalClusters:
    - name: example-db-backup
      barmanObjectStore:
        destinationPath: s3://rciis-cnpg-backups/example-db
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
recovery-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-recovered
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single

  bootstrap:
    recovery:
      source: example-db-backup
      recoveryTarget:
        targetTime: "2026-02-15T12:00:00Z"

  externalClusters:
    - name: example-db-backup
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/example-db
        endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY

Verify

# Check backup status on a CNPG cluster
kubectl get cluster example-db -n app-namespace \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# List backups
kubectl get backups -n app-namespace

# Check WAL archiving — first recoverable point
kubectl get cluster example-db -n app-namespace \
  -o jsonpath='{.status.firstRecoverabilityPoint}'

# Verify scheduled backups
kubectl get scheduledbackups -n app-namespace

Descheduler

The Kubernetes Descheduler evicts pods that violate scheduling constraints or contribute to resource imbalance. It works alongside the default scheduler — the descheduler evicts, and the scheduler re-places pods on better-suited nodes.

Install

The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches.

Create the base directory and file:

mkdir -p flux/infra/base
Field Value Explanation
chart descheduler The Helm chart name from the Descheduler registry
version 0.34.0 Pinned chart version — update this to upgrade Descheduler
sourceRef.name descheduler References a HelmRepository CR pointing to the Descheduler Helm repository
targetNamespace kube-system Descheduler runs in the system namespace
crds CreateReplace Automatically installs and updates Descheduler CRDs
remediation.retries 3 Flux retries up to 3 times if the install or upgrade fails

Save the following as flux/infra/base/descheduler.yaml:

flux/infra/base/descheduler.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
  namespace: flux-system
spec:
  targetNamespace: kube-system
  interval: 30m
  chart:
    spec:
      chart: descheduler
      version: "0.34.0"
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: flux-system
  releaseName: descheduler
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    replicas: 3
    leaderElection:
      enabled: true
    kind: Deployment
    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                evictLocalStoragePods: false
                evictSystemCriticalPods: false
                nodeFit: true
            - name: LowNodeUtilization
              args:
                useDeviationThresholds: true
                thresholds:
                  cpu: 10
                  memory: 10
                  pods: 10
                targetThresholds:
                  cpu: 20
                  memory: 20
                  pods: 20
            - name: RemovePodsViolatingTopologySpreadConstraint
              args:
                constraints:
                  - DoNotSchedule
          plugins:
            balance:
              enabled:
                - LowNodeUtilization
                - RemovePodsViolatingTopologySpreadConstraint
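The base policy enables only balance plugins. To also evict crash-looping pods, the profile can be extended with the RemovePodsHavingTooManyRestarts plugin — a sketch of the extra stanzas to merge into the profile above (the threshold is illustrative):

```yaml
# fragments to merge into deschedulerPolicy.profiles[0]
pluginConfig:
  - name: RemovePodsHavingTooManyRestarts
    args:
      podRestartThreshold: 100        # illustrative threshold
      includingInitContainers: true
plugins:
  deschedule:
    enabled:
      - RemovePodsHavingTooManyRestarts
```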
Alternative: Helm CLI

If you do not have Git access, install Descheduler directly:

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler
helm repo update
helm upgrade --install descheduler descheduler/descheduler \
  --namespace kube-system \
  --version 0.34.0 \
  -f values.yaml

Configuration

The environment patch overrides the base HelmRelease with cluster-specific resource settings. Only AWS has a patch — Bare Metal uses the base configuration as-is.

Create the environment overlay directory:

mkdir -p flux/infra/aws/descheduler

No overlay needed — Bare Metal uses the base configuration. Skip this step.


Environment Patch

The patch file adjusts resource limits for your deployment model.

AWS deployments typically run on smaller instances, so Descheduler uses reduced resource limits.

flux/infra/aws/descheduler/patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
spec:
  values:
    replicas: 1
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 200m
        memory: 128Mi
Setting Value Why
replicas 1 AWS deployments run a single replica; no leader election needed
Resource limits (reduced) CPU 50m, RAM 64Mi AWS instances are smaller than HA bare metal

Bare Metal uses the base configuration with 3 replicas and leader election:

Setting Value Why
replicas 3 HA deployment with leader election for redundancy
Resource limits Chart defaults The base values are sufficient; no environment patch is required
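
Because the descheduler removes pods through the Kubernetes eviction API, PodDisruptionBudgets are honoured. Workloads that must not drop below quorum should carry a PDB — a sketch (name, labels, and counts are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-db-pdb        # illustrative name
  namespace: app-namespace
spec:
  minAvailable: 2             # never evict below 2 of 3 instances
  selector:
    matchLabels:
      cnpg.io/cluster: example-db   # CNPG's per-cluster pod label (assumed)
```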


Commit and Deploy

Once all files are in place, commit and push to trigger Flux deployment:

git add flux/infra/base/descheduler.yaml \
        flux/infra/aws/descheduler/
git commit -m "feat(descheduler): add Descheduler for AWS environment"
git push
git add flux/infra/base/descheduler.yaml
git commit -m "feat(descheduler): add Descheduler for bare metal environment"
git push

Flux will detect the new commit and begin deploying Descheduler. To trigger an immediate sync instead of waiting for the next poll interval:

flux reconcile kustomization infra-descheduler -n flux-system --with-source

Verify

# Check Descheduler is running (Deployment mode)
kubectl get pods -n kube-system -l app.kubernetes.io/name=descheduler

# Check logs for eviction activity
kubectl logs -n kube-system -l app.kubernetes.io/name=descheduler --tail=50

Flux Operations

This component is managed by Flux as HelmRelease descheduler and Kustomization infra-descheduler.

Check whether the HelmRelease and Kustomization are in a Ready state:

flux get helmrelease descheduler -n flux-system
flux get kustomization infra-descheduler -n flux-system

Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:

flux reconcile kustomization infra-descheduler -n flux-system --with-source

Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:

flux reconcile helmrelease descheduler -n flux-system

View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:

flux logs --kind=HelmRelease --name=descheduler -n flux-system

Recovering a stalled HelmRelease

If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:

flux suspend helmrelease descheduler -n flux-system
flux resume helmrelease descheduler -n flux-system
flux reconcile kustomization infra-descheduler -n flux-system

Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.


Next Steps

Backup and scheduling infrastructure is now configured. Proceed to 5.3.4 Identity & Access Management to set up Kubernetes RBAC, role-based access control, and authentication policies for cluster security.