5.3.3 Backup & Scheduling¶
The backup strategy combines two complementary approaches: Velero for Kubernetes resource and PVC-level backups using CSI volume snapshots, and CloudNativePG continuous backup for PostgreSQL point-in-time recovery (PITR) using WAL archiving to S3-compatible storage. The Descheduler periodically evicts pods that violate placement constraints or sit on over-utilized nodes, letting the scheduler re-place them for a more even resource distribution.
How to use this page
Each component has an Install section showing the Flux HelmRelease, a Configuration section with Helm values, and a Verify section to confirm it is working.
All code blocks are labelled with their file path in the repository. Select your target environment (AWS or Bare Metal) in any tab group — the choice syncs across the entire page.
- **Using the existing `rciis-devops` repository:** All files already exist. Skip the `mkdir` and `git add`/`git commit` commands — they are for users building a new repository. Simply review the files, edit values for your environment, and push.
- **Building a new repository from scratch:** Follow the `mkdir`, file creation, and `git` commands in order.
- **No Git access:** Expand the "Alternative: Helm CLI" block under each Install section.
Velero¶
Velero backs up Kubernetes resources and persistent volumes. It uses CSI snapshots for volume backups and stores backup metadata in a cloud object store. On AWS, this is AWS S3. On Bare Metal, it is the in-cluster Ceph Object Store (S3-compatible via RGW). This enables both scheduled backups and on-demand disaster recovery.
Install¶
The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches (shown in the Configuration section).
Create the base directory and file:
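As a sketch (assuming the `flux/infra/base` layout used throughout this page), the directory can be created with:

```bash
# Create the shared base directory for infrastructure HelmReleases
mkdir -p flux/infra/base
```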
| Field | Value | Explanation |
|---|---|---|
| `chart` | `velero` | The Helm chart name from the VMware Tanzu registry |
| `version` | `11.3.2` | Pinned chart version — update this to upgrade Velero |
| `sourceRef.name` | `vmware-tanzu` | References a HelmRepository CR pointing to the VMware Tanzu Helm repository |
| `targetNamespace` | `velero` | Velero is installed in its own namespace |
| `crds: CreateReplace` | — | Automatically installs and updates Velero CRDs |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/velero.yaml:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
  namespace: flux-system
spec:
  targetNamespace: velero
  interval: 30m
  chart:
    spec:
      chart: velero
      version: "11.3.2"
      sourceRef:
        kind: HelmRepository
        name: vmware-tanzu
        namespace: flux-system
  releaseName: velero
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
    initContainers:
      - name: velero-plugin-for-aws
        image: velero/velero-plugin-for-aws:v1.13.0
        volumeMounts:
          - mountPath: /target
            name: plugins
    configuration:
      features: EnableCSI
      volumeSnapshotLocation: []
    credentials:
      useSecret: true
      existingSecret: velero-s3-credentials
    deployNodeAgent: false
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        additionalLabels:
          release: prometheus
    schedules: {}
    kubectl:
      image:
        repository: public.ecr.aws/bitnami/kubectl
```
Alternative: Helm CLI
If you do not have Git access, install Velero directly:
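A sketch of an equivalent direct install. The chart repository URL is the public VMware Tanzu Helm repo; `velero-values.yaml` is a hypothetical local copy of the values file from the Configuration section:

```bash
# Add the VMware Tanzu chart repository and install the pinned version
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm install velero vmware-tanzu/velero \
  --namespace velero --create-namespace \
  --version 11.3.2 \
  --values velero-values.yaml
```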
Configuration¶
The environment patch overrides the base HelmRelease with cluster-specific settings. The values file controls where backups are stored and how Velero behaves. Select your environment below.
Create the environment overlay directory:
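For example (the overlay paths here are assumptions; match them to your repository's environment directories):

```bash
# Hypothetical overlay directories, one per environment
mkdir -p flux/infra/aws flux/infra/baremetal
```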
Environment Patch¶
The patch file sets the backup storage location. This differs fundamentally between AWS and Bare Metal.
Save the following as the patch file for your environment:
On AWS, Velero stores backup metadata directly in AWS S3. The AWS plugin uses native S3 endpoints — no `s3Url` or `s3ForcePathStyle` is needed.
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
spec:
  values:
    resources:
      requests:
        cpu: 25m
        memory: 64Mi
      limits:
        cpu: 250m
        memory: 256Mi
    configuration:
      backupStorageLocation:
        - name: default
          provider: aws
          bucket: rciis-aws-velero-backups
          config:
            region: af-south-1
```
| Setting | Value | Why |
|---|---|---|
| `bucket` | `rciis-aws-velero-backups` | AWS S3 bucket for backup storage |
| `region` | `af-south-1` | AWS region where the bucket is located |
| Resource limits (reduced) | CPU 25m, RAM 64Mi | AWS deployments use fewer resources than HA bare metal |
On Bare Metal, Velero stores backup metadata in the in-cluster Ceph Object Store (RGW). The patch configures S3 compatibility settings for Ceph RGW.
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
spec:
  values:
    configuration:
      backupStorageLocation:
        - name: default
          provider: aws
          bucket: velero-backups
          config:
            region: rciis-kenya
            s3ForcePathStyle: true
            s3Url: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
```
| Setting | Value | Why |
|---|---|---|
| `bucket` | `velero-backups` | Ceph RGW bucket for backup storage |
| `region` | `rciis-kenya` | Region identifier for Ceph RGW (arbitrary) |
| `s3ForcePathStyle` | `true` | Uses path-style S3 URLs (required for Ceph RGW) |
| `s3Url` | `http://rook-ceph-rgw-...` | Ceph RGW endpoint within the cluster |
Helm Values¶
The values file controls Velero's backup schedules and feature flags. Save the following as the values file for your environment:
```yaml
# Velero — AWS HA configuration
# Automated backup schedules, CSI snapshots, S3 backend
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus
# Automated backup schedules
schedules:
  # Daily namespace backup — retains 30 days
  daily-namespaces:
    disabled: false
    schedule: "0 2 * * *" # 02:00 UTC daily
    useOwnerReferencesInBackup: false
    template:
      ttl: "720h" # 30 days
      storageLocation: default
      includedNamespaces:
        - rciis-aws
        - monitoring
        - strimzi-operator
        - cnpg-system
      snapshotMoveData: false
  # Weekly full-cluster backup — retains 90 days
  weekly-full:
    disabled: false
    schedule: "0 3 * * 0" # 03:00 UTC Sunday
    useOwnerReferencesInBackup: false
    template:
      ttl: "2160h" # 90 days
      storageLocation: default
      includeClusterResources: true
      snapshotMoveData: false
```
```yaml
# Velero — Bare Metal HA configuration
# Automated backup schedules, CSI snapshots, Ceph RGW backend
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus
# Automated backup schedules
schedules:
  # Daily namespace backup — retains 30 days
  daily-namespaces:
    disabled: false
    schedule: "0 2 * * *" # 02:00 UTC daily
    useOwnerReferencesInBackup: false
    template:
      ttl: "720h" # 30 days
      storageLocation: default
      includedNamespaces:
        - rciis-kenya
        - monitoring
        - strimzi-operator
        - cnpg-system
      snapshotMoveData: false
  # Weekly full-cluster backup — retains 90 days
  weekly-full:
    disabled: false
    schedule: "0 3 * * 0" # 03:00 UTC Sunday
    useOwnerReferencesInBackup: false
    template:
      ttl: "2160h" # 90 days
      storageLocation: default
      includeClusterResources: true
      snapshotMoveData: false
```
Key settings (all environments):
| Setting | HA | Non-HA | Why |
|---|---|---|---|
| `schedules.*` | Daily + weekly | Empty `{}` | Automated schedules provide continuous protection vs on-demand only |
| `metrics.serviceMonitor.enabled` | `true` | `false` | HA exports Velero metrics to Prometheus for monitoring |
| `podSecurityContext` | Strict (`65534:65534`) | Inherited from base | HA enforces non-root execution for security |
| `EnableCSI` | Enabled in base | Enabled in base | CSI snapshots required for PVC-level backups |
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
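For example (the commit message is illustrative; paths assume the layout above):

```bash
git add flux/infra/
git commit -m "Add Velero backup configuration"
git push
```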
Flux will detect the new commit and begin deploying Velero. To trigger an immediate sync instead of waiting for the next poll interval:
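A manual sync can be sketched with the standard Flux CLI, using the `infra-velero` Kustomization named in the Flux Operations section:

```bash
# Fetch the latest Git revision, then reconcile the Velero Kustomization
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-velero -n flux-system
```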
Extra Manifests - Ceph S3 User¶
Bare Metal only
This manifest is only required when using Ceph RGW as the backup storage backend. AWS deployments use IAM credentials instead.
Velero needs an S3 user in Ceph to access the backup bucket:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: velero
  namespace: velero
spec:
  store: ceph-objectstore
  clusterNamespace: rook-ceph
  displayName: "Velero Backup User"
  capabilities:
    user: "*"
    bucket: "*"
```
CSI-only backup strategy

With `deployNodeAgent: false`, only PVCs backed by CSI-compatible storage classes (Ceph RBD) are snapshotted. Ensure all critical workloads use the `ceph-rbd` or `ceph-rbd-single` storage classes.
Verify¶
```bash
# Check Velero is running
kubectl get pods -n velero

# Verify backup storage location
velero backup-location get

# Create a test backup
velero backup create test-backup --include-namespaces default --wait

# Check backup status
velero backup describe test-backup

# Clean up test backup
velero backup delete test-backup --confirm
```
Flux Operations¶
This component is managed by Flux as HelmRelease velero and Kustomization infra-velero.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
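The four operations above can be sketched with the standard Flux CLI:

```bash
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmreleases velero -n flux-system
flux get kustomizations infra-velero -n flux-system

# 2. Immediate sync: re-fetch the Git source and re-apply manifests
flux reconcile kustomization infra-velero -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease velero -n flux-system

# 4. Recent controller logs for this release
flux logs --kind=HelmRelease --name=velero -n flux-system
```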
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
```bash
flux suspend helmrelease velero -n flux-system
flux resume helmrelease velero -n flux-system
flux reconcile kustomization infra-velero -n flux-system
```
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
Next: Continue to CloudNativePG Backups below.
CloudNativePG Backups¶
CloudNativePG provides continuous backup at the PostgreSQL level using Barman. This is independent of Velero — while Velero backs up Kubernetes resources and PVCs as CSI snapshots, CNPG archives the PostgreSQL Write-Ahead Log (WAL) stream and performs periodic base backups directly to S3-compatible storage.
This enables point-in-time recovery (PITR) for all PostgreSQL databases managed by the CNPG operator (Grafana, Keycloak, application databases).
Operator vs Cluster backups
The CNPG operator (installed in Data Services) does not configure backups itself. Backups are configured per Cluster CR in each application namespace. The examples below show the backup stanza to add to any CNPG Cluster.
Cluster Backup Configuration¶
Add the backup stanza to any CNPG Cluster CR to enable continuous WAL
archiving and base backups. The S3 destination depends on your deployment model:
**AWS**

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single
  backup:
    barmanObjectStore:
      destinationPath: s3://rciis-cnpg-backups/example-db
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"
```
The cnpg-s3-credentials Secret contains AWS IAM credentials:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cnpg-s3-credentials
  namespace: app-namespace
type: Opaque
stringData:
  ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
  ACCESS_SECRET_KEY: "<AWS_SECRET_ACCESS_KEY>"
```
IAM Roles for Service Accounts (IRSA)
On EKS, prefer IRSA over static credentials. Set `backup.barmanObjectStore.s3Credentials.inheritFromIAMRole: true` and annotate the CNPG ServiceAccount with the IAM role ARN.
**Bare Metal (Ceph RGW)**

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single
  backup:
    barmanObjectStore:
      destinationPath: s3://cnpg-backups/example-db
      endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"
```
The cnpg-s3-credentials Secret contains the Ceph RGW user credentials:
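A sketch of the Secret, mirroring the AWS variant. Rook generates the actual keys when the `CephObjectStoreUser` shown later in this section is created (by convention it stores them in a secret named `rook-ceph-object-user-ceph-objectstore-cnpg-backup` in the `rook-ceph` namespace, from which they can be copied); the placeholder values here are illustrative:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cnpg-s3-credentials
  namespace: app-namespace
type: Opaque
stringData:
  ACCESS_KEY_ID: "<RGW_ACCESS_KEY>"
  ACCESS_SECRET_KEY: "<RGW_SECRET_KEY>"
```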
Scheduled Base Backups¶
WAL archiving is continuous, but periodic base backups are needed for efficient
recovery. Create a ScheduledBackup CR for each PostgreSQL cluster:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: example-db-daily
  namespace: app-namespace
spec:
  schedule: "0 2 * * *" # 02:00 UTC daily
  backupOwnerReference: self
  cluster:
    name: example-db
  method: barmanObjectStore
```
Backup retention
The retentionPolicy: "30d" in the Cluster CR controls how long base
backups and WAL files are retained. The ScheduledBackup creates new base
backups on schedule — old base backups and WAL segments beyond the retention
window are automatically pruned by Barman.
Ceph S3 User for CNPG¶
Bare Metal only
This manifest is only required when using Ceph RGW as the backup storage backend. AWS deployments use IAM credentials instead.
```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: cnpg-backup
  namespace: rook-ceph
spec:
  store: ceph-objectstore
  clusterNamespace: rook-ceph
  displayName: "CNPG Backup User"
  capabilities:
    user: "*"
    bucket: "*"
```
Recovery¶
To recover a PostgreSQL cluster to a specific point in time, create a new
Cluster CR that bootstraps from the backup:
**AWS**

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-recovered
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single
  bootstrap:
    recovery:
      source: example-db-backup
      recoveryTarget:
        targetTime: "2026-02-15T12:00:00Z"
  externalClusters:
    - name: example-db-backup
      barmanObjectStore:
        destinationPath: s3://rciis-cnpg-backups/example-db
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
```
**Bare Metal (Ceph RGW)**

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db-recovered
  namespace: app-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClassName: ceph-rbd-single
  bootstrap:
    recovery:
      source: example-db-backup
      recoveryTarget:
        targetTime: "2026-02-15T12:00:00Z"
  externalClusters:
    - name: example-db-backup
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/example-db
        endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
```
Verify¶
```bash
# Check backup status on a CNPG cluster
kubectl get cluster example-db -n app-namespace \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# List backups
kubectl get backups -n app-namespace

# Check WAL archiving — first recoverable point
kubectl get cluster example-db -n app-namespace \
  -o jsonpath='{.status.firstRecoverabilityPoint}'

# Verify scheduled backups
kubectl get scheduledbackups -n app-namespace
```
Descheduler¶
The Kubernetes Descheduler evicts pods that violate scheduling constraints or contribute to resource imbalance. It works alongside the default scheduler — the descheduler evicts, and the scheduler re-places pods on better-suited nodes.
Install¶
The base HelmRelease tells Flux which chart to install. This file is shared across all environments — environment-specific settings are applied via patches.
Create the base directory and file:
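As a sketch (the same shared base directory is assumed as for Velero):

```bash
# Create the shared base directory for infrastructure HelmReleases
mkdir -p flux/infra/base
```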
| Field | Value | Explanation |
|---|---|---|
| `chart` | `descheduler` | The Helm chart name from the Descheduler registry |
| `version` | `0.34.0` | Pinned chart version — update this to upgrade Descheduler |
| `sourceRef.name` | `descheduler` | References a HelmRepository CR pointing to the Descheduler Helm repository |
| `targetNamespace` | `kube-system` | Descheduler runs in the system namespace |
| `crds: CreateReplace` | — | Automatically installs and updates Descheduler CRDs |
| `remediation.retries` | `3` | Flux retries up to 3 times if the install or upgrade fails |
Save the following as flux/infra/base/descheduler.yaml:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
  namespace: flux-system
spec:
  targetNamespace: kube-system
  interval: 30m
  chart:
    spec:
      chart: descheduler
      version: "0.34.0"
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: flux-system
  releaseName: descheduler
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    replicas: 3
    leaderElection:
      enabled: true
    kind: Deployment
    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                evictLocalStoragePods: false
                evictSystemCriticalPods: false
                nodeFit: true
            - name: LowNodeUtilization
              args:
                useDeviationThresholds: true
                thresholds:
                  cpu: 10
                  memory: 10
                  pods: 10
                targetThresholds:
                  cpu: 20
                  memory: 20
                  pods: 20
            - name: RemovePodsViolatingTopologySpreadConstraint
              args:
                constraints:
                  - DoNotSchedule
          plugins:
            balance:
              enabled:
                - LowNodeUtilization
                - RemovePodsViolatingTopologySpreadConstraint
```
Alternative: Helm CLI
If you do not have Git access, install Descheduler directly:
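A sketch of an equivalent direct install. The chart repository URL is the public kubernetes-sigs Descheduler Helm repo; `descheduler-values.yaml` is a hypothetical local copy of the values above:

```bash
# Add the Descheduler chart repository and install the pinned version
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update
helm install descheduler descheduler/descheduler \
  --namespace kube-system \
  --version 0.34.0 \
  --values descheduler-values.yaml
```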
Configuration¶
The environment patch overrides the base HelmRelease with cluster-specific resource settings. Only AWS has a patch — Bare Metal uses the base configuration as-is.
Create the environment overlay directory:
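For example (the overlay path is an assumption; only AWS needs one, per the note above):

```bash
# Hypothetical AWS overlay directory
mkdir -p flux/infra/aws
```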
Environment Patch¶
The patch file adjusts resource limits for your deployment model.
AWS deployments typically run on smaller instances, so Descheduler uses reduced resource limits.
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: descheduler
spec:
  values:
    replicas: 1
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 200m
        memory: 128Mi
```
| Setting | Value | Why |
|---|---|---|
| `replicas` | `1` | AWS deployments run a single replica; no leader election needed |
| Resource limits (reduced) | CPU 50m, RAM 64Mi | AWS instances are smaller than HA bare metal |
Bare Metal uses the base configuration with 3 replicas and leader election:
| Setting | Value | Why |
|---|---|---|
| `replicas` | `3` | HA deployment with leader election for redundancy |
| Resource limits | CPU 50m (base), RAM 64Mi (base) | Allows higher thresholds for rebalancing |
Commit and Deploy¶
Once all files are in place, commit and push to trigger Flux deployment:
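For example (the commit message is illustrative; paths assume the layout above):

```bash
git add flux/infra/
git commit -m "Add Descheduler configuration"
git push
```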
Flux will detect the new commit and begin deploying Descheduler. To trigger an immediate sync instead of waiting for the next poll interval:
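A manual sync can be sketched with the standard Flux CLI, using the `infra-descheduler` Kustomization named in the Flux Operations section:

```bash
# Fetch the latest Git revision, then reconcile the Descheduler Kustomization
flux reconcile source git flux-system -n flux-system
flux reconcile kustomization infra-descheduler -n flux-system
```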
Verify¶
```bash
# Check Descheduler is running (Deployment mode)
kubectl get pods -n kube-system -l app.kubernetes.io/name=descheduler

# Check logs for eviction activity
kubectl logs -n kube-system -l app.kubernetes.io/name=descheduler --tail=50
```
Flux Operations¶
This component is managed by Flux as HelmRelease descheduler and Kustomization infra-descheduler.
Check whether the HelmRelease and Kustomization are in a Ready state:
Trigger an immediate sync — pulls the latest Git revision and re-applies the manifests. Use after pushing config changes or to verify a fix:
Trigger a Helm upgrade — re-runs the Helm install/upgrade for this release without waiting for the next interval. Use when the HelmRelease values have changed:
View recent Flux controller logs for this release — useful for diagnosing why a sync or upgrade failed:
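The four operations above can be sketched with the standard Flux CLI:

```bash
# 1. Check Ready status of the HelmRelease and Kustomization
flux get helmreleases descheduler -n flux-system
flux get kustomizations infra-descheduler -n flux-system

# 2. Immediate sync: re-fetch the Git source and re-apply manifests
flux reconcile kustomization infra-descheduler -n flux-system --with-source

# 3. Re-run the Helm install/upgrade for this release
flux reconcile helmrelease descheduler -n flux-system

# 4. Recent controller logs for this release
flux logs --kind=HelmRelease --name=descheduler -n flux-system
```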
Recovering a stalled HelmRelease
If the HelmRelease shows Stalled with RetriesExceeded, Flux will not retry automatically. Suspend and resume to clear the failure counter, then reconcile:
```bash
flux suspend helmrelease descheduler -n flux-system
flux resume helmrelease descheduler -n flux-system
flux reconcile kustomization infra-descheduler -n flux-system
```
Only run this after confirming the underlying issue (e.g. pod crash, timeout) has been resolved. See Maintenance — Recovering Stalled Resources for details.
Next Steps¶
Backup and scheduling infrastructure is now configured. Proceed to 5.3.4 Identity & Access Management to set up Kubernetes RBAC and authentication policies for cluster security.