9.4 Backup & Recovery Procedures¶
This page provides the complete operational guide for configuring, verifying, and restoring backups across the RCIIS platform. Three backup mechanisms protect different data tiers:
| Backup System | Scope | Method | RPO | RTO |
|---|---|---|---|---|
| CloudNativePG / Barman | PostgreSQL databases | Continuous WAL archiving + scheduled base backups to S3 | Near-zero (last archived WAL) | Minutes (PITR) |
| SQL Server BACKUP TO URL | RCIIS application database (MSSQL) | Native S3 backup via sqlcmd | Last scheduled backup | Minutes (full restore) |
| Velero | Stateful non-database workloads (Kafka, etcd) | CSI volume snapshots + S3 metadata | Last scheduled snapshot | Minutes (PVC restore) |
Relationship to Phase 5
Phase 5 (Backup & Scheduling) covers installing the backup tools (Velero, CNPG operator, Descheduler). This page covers the Day-2 operational procedures — enabling backups on each database, configuring schedules, verifying health, and executing restores.
Backup Architecture
===================
┌─────────────────────────────────────────────────────────────┐
│ S3-Compatible Storage │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ cnpg-backups │ │ mssql-backups│ │velero-backups│ │
│ │ (WAL + base)│ │ (.bak files)│ │ (snapshots) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ Ceph RGW (on-prem) / AWS S3 (cloud) │
└─────────┼─────────────────┼─────────────────┼───────────────┘
│ │ │
┌─────┴─────┐ ┌────┴────┐ ┌─────┴─────┐
│ CNPG │ │ MSSQL │ │ Velero │
│ Barman │ │ sqlcmd │ │ CSI │
│ archiver │ │ BACKUP │ │ snapshots │
└─────┬─────┘ └────┬────┘ └─────┬─────┘
│ │ │
┌───────┴───────┐ ┌───┴───┐ ┌──────┴──────┐
│ PostgreSQL │ │ RCIIS │ │ Kafka PVCs │
│ Clusters │ │ DB │ │ etcd PVCs │
│ (esb, ss, │ │(MSSQL)│ │ other PVCs │
│ grafana, kc) │ └───────┘ └─────────────┘
└───────────────┘
9.4.1 PostgreSQL Backups (CloudNativePG / Barman)¶
CloudNativePG uses Barman to continuously archive the PostgreSQL Write-Ahead Log (WAL) and perform periodic base backups directly to S3-compatible object storage. This enables point-in-time recovery (PITR) for every PostgreSQL database managed by the CNPG operator.
RCIIS PostgreSQL Database Inventory¶
| Cluster Name | Namespace | Database | Purpose | Owner |
|---|---|---|---|---|
| esb-postgres | rciis-prod | failed_message_offset | ESB Kafka offset tracking | esb |
| ss-postgres | rciis-prod | ss | SignServer digital signatures | ss |
| grafana-postgres | monitoring | grafana | Grafana dashboards & users | grafana |
| keycloak-pg | keycloak | keycloak | Keycloak identity & access | keycloak |
S3 Credentials¶
Every CNPG cluster needs a Secret containing S3 access credentials in its namespace. The credential source depends on your deployment model:
AWS S3:

Create a Secret with AWS IAM credentials in each namespace that hosts a CNPG cluster:
apiVersion: v1
kind: Secret
metadata:
name: cnpg-s3-credentials
namespace: rciis-prod # Repeat per namespace
type: Opaque
stringData:
ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
ACCESS_SECRET_KEY: "<AWS_SECRET_ACCESS_KEY>"
IAM Roles for Service Accounts (IRSA)
On EKS, prefer IRSA over static credentials. Set
backup.barmanObjectStore.s3Credentials.inheritFromIAMRole: true
on each Cluster CR and annotate the CNPG ServiceAccount with the
IAM role ARN. The IAM policy needs s3:PutObject, s3:GetObject,
s3:DeleteObject, and s3:ListBucket on the backup bucket.
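As a sketch, the IAM policy attached to the IRSA role might look like the following. The bucket name `rciis-cnpg-backups` matches the destination paths used in the AWS examples on this page; adjust to your own bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::rciis-cnpg-backups",
        "arn:aws:s3:::rciis-cnpg-backups/*"
      ]
    }
  ]
}
```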
Bare Metal (Ceph RGW):

Credentials come from the CephObjectStoreUser created in
Phase 5 — Backup & Scheduling.
After the cnpg-backup CephObjectStoreUser is created, Rook-Ceph generates
a Secret containing the RGW access key pair.
Extract the credentials and create a Secret in each namespace:
# Get the Ceph RGW credentials
ACCESS_KEY=$(kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
-n rook-ceph -o jsonpath='{.data.AccessKey}' | base64 -d)
SECRET_KEY=$(kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
-n rook-ceph -o jsonpath='{.data.SecretKey}' | base64 -d)
# Create the Secret in each CNPG namespace
for NS in rciis-prod monitoring keycloak; do
kubectl create secret generic cnpg-s3-credentials \
--namespace "$NS" \
--from-literal=ACCESS_KEY_ID="$ACCESS_KEY" \
--from-literal=ACCESS_SECRET_KEY="$SECRET_KEY" \
--dry-run=client -o yaml | kubectl apply -f -
done
SOPS encryption
For GitOps, SOPS-encrypt the Secret manifests rather than creating them imperatively. The commands above are for initial setup and verification only.
Enable Backups on CNPG Clusters¶
Add the backup stanza to each CNPG Cluster CR. The existing Cluster CRs
in the repository have the backup section commented out — uncomment and update
the S3 destination and credentials.
esb-postgres¶

AWS S3:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: esb-postgres
annotations:
cnpg.io/reload-secrets: "true"
spec:
description: "PostgreSQL for ESB"
instances: 1
imageCatalogRef:
apiGroup: postgresql.cnpg.io
kind: ImageCatalog
name: postgresql-17
major: 17
backup:
target: prefer-standby
barmanObjectStore:
destinationPath: s3://rciis-cnpg-backups/esb-postgres
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
compression: gzip
maxParallel: 2
data:
compression: gzip
retentionPolicy: "30d"
postgresql:
pg_hba:
- host failed_message_offset esb 0.0.0.0/0 scram-sha-256
managed:
roles:
- name: esb
ensure: present
comment: ESB User
login: true
superuser: false
inRoles:
- pg_monitor
- pg_signal_backend
passwordSecret:
name: cnpg-esb-owner
bootstrap:
initdb:
database: failed_message_offset
owner: esb
secret:
name: cnpg-esb-owner
superuserSecret:
name: cnpg-esb-superuser
storage:
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
Bare Metal (Ceph RGW):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: esb-postgres
annotations:
cnpg.io/reload-secrets: "true"
spec:
description: "PostgreSQL for ESB"
instances: 1
imageCatalogRef:
apiGroup: postgresql.cnpg.io
kind: ImageCatalog
name: postgresql-17
major: 17
backup:
target: prefer-standby
barmanObjectStore:
destinationPath: s3://cnpg-backups/esb-postgres
endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
compression: gzip
maxParallel: 2
data:
compression: gzip
retentionPolicy: "30d"
postgresql:
pg_hba:
- host failed_message_offset esb 0.0.0.0/0 scram-sha-256
managed:
roles:
- name: esb
ensure: present
comment: ESB User
login: true
superuser: false
inRoles:
- pg_monitor
- pg_signal_backend
passwordSecret:
name: cnpg-esb-owner
bootstrap:
initdb:
database: failed_message_offset
owner: esb
secret:
name: cnpg-esb-owner
superuserSecret:
name: cnpg-esb-superuser
storage:
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
ss-postgres¶
The SignServer PostgreSQL cluster follows the same pattern. Update
apps/rciis/signserver/proxmox/extra/pg-instance.yaml:
AWS S3:

backup:
target: prefer-standby
barmanObjectStore:
destinationPath: s3://rciis-cnpg-backups/ss-postgres
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
compression: gzip
maxParallel: 2
data:
compression: gzip
retentionPolicy: "30d"
Bare Metal (Ceph RGW):

backup:
target: prefer-standby
barmanObjectStore:
destinationPath: s3://cnpg-backups/ss-postgres
endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
compression: gzip
maxParallel: 2
data:
compression: gzip
retentionPolicy: "30d"
grafana-postgres¶
The Grafana PostgreSQL cluster lives in the monitoring namespace. Update
apps/infra/prometheus/proxmox/extra/pg-instance.yaml:
AWS S3:

backup:
target: prefer-standby
barmanObjectStore:
destinationPath: s3://rciis-cnpg-backups/grafana-postgres
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
compression: gzip
maxParallel: 2
data:
compression: gzip
retentionPolicy: "30d"
Bare Metal (Ceph RGW):

backup:
target: prefer-standby
barmanObjectStore:
destinationPath: s3://cnpg-backups/grafana-postgres
endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
compression: gzip
maxParallel: 2
data:
compression: gzip
retentionPolicy: "30d"
keycloak-pg¶
The Keycloak PostgreSQL cluster lives in the keycloak namespace. See
Identity Management for the full
Cluster CR — add the same backup stanza pattern with
destinationPath set to s3://cnpg-backups/keycloak-pg (Bare Metal) or
s3://rciis-cnpg-backups/keycloak-pg (AWS).
Scheduled Base Backups¶
WAL archiving starts automatically once the backup stanza is added to a
Cluster CR. However, periodic base backups are essential for efficient
recovery — without them, PITR would need to replay the entire WAL history.
Create a ScheduledBackup CR for each PostgreSQL cluster:
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: esb-postgres-daily
namespace: rciis-prod
spec:
schedule: "0 0 2 * * *" # 02:00 UTC daily
backupOwnerReference: self
cluster:
name: esb-postgres
method: barmanObjectStore
immediate: true # Take a backup on creation
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: ss-postgres-daily
namespace: rciis-prod
spec:
schedule: "0 0 2 * * *"
backupOwnerReference: self
cluster:
name: ss-postgres
method: barmanObjectStore
immediate: true
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: grafana-postgres-daily
namespace: monitoring
spec:
schedule: "0 0 2 * * *"
backupOwnerReference: self
cluster:
name: grafana-postgres
method: barmanObjectStore
immediate: true
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: keycloak-pg-daily
namespace: keycloak
spec:
schedule: "0 0 2 * * *"
backupOwnerReference: self
cluster:
name: keycloak-pg
method: barmanObjectStore
immediate: true
CNPG cron format
CNPG uses a 6-field cron format (seconds, minutes, hours, day-of-month,
month, day-of-week). The 0 0 2 * * * schedule fires at 02:00:00 UTC daily.
Standard 5-field cron (0 2 * * *) is NOT valid for ScheduledBackup CRs.
Retention policy
retentionPolicy: "30d" on the Cluster CR controls how long base backups
and WAL files are retained. Barman automatically prunes backups and WAL
segments older than the retention window. Adjust per database based on
compliance requirements.
Verify PostgreSQL Backup Health¶
After applying the backup configuration and ScheduledBackup CRs, verify that WAL archiving and base backups are working:
# Check WAL archiving status — first recoverable point
kubectl get cluster esb-postgres -n rciis-prod \
-o jsonpath='{.status.firstRecoverabilityPoint}'
# Check last successful backup timestamp
kubectl get cluster esb-postgres -n rciis-prod \
-o jsonpath='{.status.lastSuccessfulBackup}'
# Check backup conditions
kubectl get cluster esb-postgres -n rciis-prod \
-o jsonpath='{.status.conditions}' | python3 -m json.tool
# List all backups for a cluster
kubectl get backups -n rciis-prod -l cnpg.io/cluster=esb-postgres
# Verify scheduled backups are registered
kubectl get scheduledbackups -n rciis-prod
# Full cluster status (requires cnpg kubectl plugin)
kubectl cnpg status esb-postgres -n rciis-prod
Repeat for each database cluster (ss-postgres, grafana-postgres,
keycloak-pg) in their respective namespaces.
First backup delay
The first base backup may take several minutes depending on database size.
WAL archiving begins immediately after the backup stanza is applied, but
the firstRecoverabilityPoint will not appear until the first base backup
completes successfully.
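If you need an ad-hoc base backup beyond the one the ScheduledBackup takes on creation (via `immediate: true`), a one-off Backup resource can be applied; the name below is illustrative:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: esb-postgres-manual    # illustrative name
  namespace: rciis-prod
spec:
  method: barmanObjectStore
  cluster:
    name: esb-postgres
```

Watch its progress with `kubectl get backup esb-postgres-manual -n rciis-prod -w` until the phase reaches `completed`.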
Restore a PostgreSQL Database (PITR)¶
To restore a PostgreSQL database to a specific point in time, create a new Cluster CR that bootstraps from the backup source. CNPG does not support in-place restore — you always create a new cluster and then cut over.
Step 1: Identify the Recovery Target¶
# Find the available recovery window
kubectl get cluster esb-postgres -n rciis-prod \
-o jsonpath='First: {.status.firstRecoverabilityPoint} — Last WAL: now'
# List available base backups
kubectl get backups -n rciis-prod -l cnpg.io/cluster=esb-postgres \
-o custom-columns='NAME:.metadata.name,STARTED:.status.startedAt,COMPLETED:.status.stoppedAt,STATUS:.status.phase'
Step 2: Create a Recovery Cluster¶
AWS S3:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: esb-postgres-recovered
namespace: rciis-prod
spec:
instances: 1
imageCatalogRef:
apiGroup: postgresql.cnpg.io
kind: ImageCatalog
name: postgresql-17
major: 17
bootstrap:
recovery:
source: esb-postgres-backup
recoveryTarget:
targetTime: "2026-02-17T12:00:00Z" # Adjust to your target
externalClusters:
- name: esb-postgres-backup
barmanObjectStore:
destinationPath: s3://rciis-cnpg-backups/esb-postgres
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
maxParallel: 4
storage:
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
Bare Metal (Ceph RGW):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: esb-postgres-recovered
namespace: rciis-prod
spec:
instances: 1
imageCatalogRef:
apiGroup: postgresql.cnpg.io
kind: ImageCatalog
name: postgresql-17
major: 17
bootstrap:
recovery:
source: esb-postgres-backup
recoveryTarget:
targetTime: "2026-02-17T12:00:00Z" # Adjust to your target
externalClusters:
- name: esb-postgres-backup
barmanObjectStore:
destinationPath: s3://cnpg-backups/esb-postgres
endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
s3Credentials:
accessKeyId:
name: cnpg-s3-credentials
key: ACCESS_KEY_ID
secretAccessKey:
name: cnpg-s3-credentials
key: ACCESS_SECRET_KEY
wal:
maxParallel: 4
storage:
storageClass: ceph-rbd-single
size: 10Gi
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
Step 3: Apply and Monitor Recovery¶
# Apply the recovery cluster
kubectl apply -f esb-postgres-recovery.yaml
# Monitor recovery progress
kubectl get cluster esb-postgres-recovered -n rciis-prod -w
# Watch pods — recovery creates a new primary from the backup
kubectl get pods -n rciis-prod -l cnpg.io/cluster=esb-postgres-recovered -w
# Once the cluster reaches "Cluster in healthy state", verify data
kubectl cnpg psql esb-postgres-recovered -n rciis-prod -- \
-c "SELECT count(*) FROM failed_message_offset;"
Step 4: Cut Over to the Recovered Cluster¶
Once the recovered cluster is verified, update the application to point to the new cluster's service:
# The recovered cluster exposes services under its own name
# Old: esb-postgres-rw.rciis-prod.svc.cluster.local
# New: esb-postgres-recovered-rw.rciis-prod.svc.cluster.local
# Option A: Rename the recovered cluster (delete old, rename new)
kubectl delete cluster esb-postgres -n rciis-prod
# Then update the recovered cluster's name in the CR and re-apply
# Option B: Update application connection strings to point to the new service
# This depends on how the ESB integration references the database
Production cut-over
Always verify the recovered data before deleting the original cluster. The old cluster should be kept until the recovery is confirmed successful. Consider renaming via DNS (a CNAME or Service update) rather than deleting and recreating.
9.4.2 SQL Server Backups (BACKUP TO URL)¶
SQL Server 2022 supports native backup to S3-compatible object storage using
the BACKUP DATABASE ... TO URL syntax. The RCIIS platform uses this for the
main application database (RCIIS), deployed as a StatefulSet via the rciis
Helm chart.
RCIIS SQL Server Inventory¶
| StatefulSet | Namespace | Database | Image | Storage |
|---|---|---|---|---|
| mssql | rciis-prod | RCIIS | mcr.microsoft.com/mssql/server:2022-CU20-GDR2-ubuntu-22.04 | 20Gi ceph-rbd-single |
How It Works¶
SQL Server 2022 on Linux natively supports S3 as a backup destination:
1. Create an S3 credential in SQL Server — maps an S3 access key pair to a named credential object
2. Execute `BACKUP DATABASE ... TO URL` — writes the `.bak` file directly to the S3 bucket
3. Restore with `RESTORE DATABASE ... FROM URL` — reads the `.bak` file back from S3
The RCIIS Helm chart includes a backup-job.yaml template that automates
this as a Kubernetes Job.
S3 Credential Setup¶
Before backups can run, SQL Server needs an S3 credential. The credential is created inside SQL Server (not Kubernetes) and maps an access key pair to an S3 endpoint.
Create the SQL Server Credential¶
Connect to the SQL Server instance and create the credential:
# Port-forward to the MSSQL pod
kubectl port-forward statefulset/mssql -n rciis-prod 1433:1433 &
# Connect with sqlcmd (from mssql-tools18)
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C
AWS S3:

-- Drop existing credential if re-creating
IF EXISTS (SELECT * FROM sys.credentials WHERE name = 's3_backup_credential')
DROP CREDENTIAL [s3_backup_credential];
GO
CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>';
GO
-- Verify the credential was created
SELECT name, credential_identity, create_date
FROM sys.credentials
WHERE name = 's3_backup_credential';
GO
Bare Metal (Ceph RGW):

-- Drop existing credential if re-creating
IF EXISTS (SELECT * FROM sys.credentials WHERE name = 's3_backup_credential')
DROP CREDENTIAL [s3_backup_credential];
GO
CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<CEPH_RGW_ACCESS_KEY>:<CEPH_RGW_SECRET_KEY>';
GO
-- Verify the credential was created
SELECT name, credential_identity, create_date
FROM sys.credentials
WHERE name = 's3_backup_credential';
GO
Obtaining Ceph RGW credentials
Use the same CephObjectStoreUser credentials as CNPG, or create a
dedicated user for MSSQL backups:
# Get credentials from the CephObjectStoreUser secret
kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
-n rook-ceph -o jsonpath='{.data.AccessKey}' | base64 -d
kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
-n rook-ceph -o jsonpath='{.data.SecretKey}' | base64 -d
Credential key restrictions
SQL Server uses the AccessKeyID:SecretKeyID format with a colon (:)
separator. Neither the access key nor the secret key may contain a colon
character. If your S3 credentials contain colons, regenerate them.
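Since the secret string is built by joining the two keys with a colon, a pre-flight check is cheap before creating the credential. A minimal sketch with placeholder keys (substitute your real pair):

```shell
# Build the SQL Server credential secret string, refusing any key that
# contains the ':' separator. Placeholder values — not real credentials.
ACCESS_KEY="AKIAEXAMPLEKEY"
SECRET_KEY="exampleSecretValue123"
for k in "$ACCESS_KEY" "$SECRET_KEY"; do
  case "$k" in
    *:*) echo "ERROR: key contains ':' — regenerate it" >&2; exit 1 ;;
  esac
done
CRED_SECRET="${ACCESS_KEY}:${SECRET_KEY}"
echo "$CRED_SECRET"
```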
Store S3 Credentials as a Kubernetes Secret¶
The backup job reads S3 credentials from a Kubernetes Secret and passes them
to the sqlcmd session. Create the Secret in the application namespace:
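A minimal sketch of that Secret — the name and key fields match the `credentials` block referenced in the Helm values on this page; the values are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-bucket-credentials
  namespace: rciis-prod
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<S3_ACCESS_KEY>"
  AWS_SECRET_ACCESS_KEY: "<S3_SECRET_KEY>"
```

As with the CNPG credentials, SOPS-encrypt this manifest for GitOps rather than applying it imperatively.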
Configure the Backup Job¶
The rciis Helm chart includes a backup job template
(charts/rciis/templates/backup-job.yaml) that runs as a Flux pre-reconciliation job.
Enable it in the environment values file:
jobs:
db-backup:
enabled: true
repository:
image: harbor.devops.africa/nucleus/mssql-tools
tag: latest
imagePullSecrets:
- container-registry
databaseName: RCIIS
s3:
url: "s3://s3.rciis.africa"
bucket: rciis-prod
path: "backups/mssql"
credentialName: s3_backup_credential
credentials:
secretName: "ceph-bucket-credentials"
accessKey: "AWS_ACCESS_KEY_ID"
secretKey: "AWS_SECRET_ACCESS_KEY"
compressionOption: COMPRESSION
copyOnly: false
retentionDays: 30
env:
- name: HOST
value: mssql
- name: MSSQL_SA_USER
value: sa
- name: MSSQL_SA_PASSWORD
valueFrom:
secretKeyRef:
name: mssql-admin
key: MSSQL_SA_PASSWORD
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
The backup job executes the following SQL when triggered:
BACKUP DATABASE [RCIIS]
TO URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_020000.bak'
WITH CREDENTIAL = 's3_backup_credential', COMPRESSION, STATS = 10
Sync hook behaviour
The backup job runs as a Flux pre-reconciliation hook. It triggers before every Flux reconciliation, ensuring a backup exists before any database migration runs. For scheduled backups independent of reconciliations, use a CronJob (see below).
Scheduled Backups (CronJob)¶
For regular backups independent of Flux reconciliations, create a Kubernetes CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
name: mssql-daily-backup
namespace: rciis-prod
spec:
schedule: "0 3 * * *" # 03:00 UTC daily
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 3
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: OnFailure
imagePullSecrets:
- name: container-registry
containers:
- name: mssql-backup
image: harbor.devops.africa/nucleus/mssql-tools:latest
command: ["/bin/bash", "-c"]
args:
- |
set -e
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_URL="s3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_${TIMESTAMP}.bak"
echo "Starting MSSQL backup to ${BACKUP_URL}"
# Wait for SQL Server
until /opt/mssql-tools18/bin/sqlcmd -b -S mssql -U sa \
-P "${MSSQL_SA_PASSWORD}" -C -Q "SELECT 1" > /dev/null 2>&1; do
echo "Waiting for SQL Server..."
sleep 5
done
# Re-create S3 credential (idempotent)
/opt/mssql-tools18/bin/sqlcmd -b -S mssql -U sa \
-P "${MSSQL_SA_PASSWORD}" -C -Q "
IF EXISTS (SELECT * FROM sys.credentials WHERE name = 's3_backup_credential')
DROP CREDENTIAL [s3_backup_credential];
CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '${S3_ACCESS_KEY}:${S3_SECRET_KEY}';
"
# Execute backup
/opt/mssql-tools18/bin/sqlcmd -b -S mssql -U sa \
-P "${MSSQL_SA_PASSWORD}" -C -Q "
BACKUP DATABASE [RCIIS]
TO URL = N'${BACKUP_URL}'
WITH CREDENTIAL = 's3_backup_credential',
COMPRESSION, STATS = 10;
"
echo "Backup completed: ${BACKUP_URL}"
env:
- name: MSSQL_SA_PASSWORD
valueFrom:
secretKeyRef:
name: mssql-admin
key: MSSQL_SA_PASSWORD
- name: S3_ACCESS_KEY
valueFrom:
secretKeyRef:
name: ceph-bucket-credentials
key: AWS_ACCESS_KEY_ID
- name: S3_SECRET_KEY
valueFrom:
secretKeyRef:
name: ceph-bucket-credentials
key: AWS_SECRET_ACCESS_KEY
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
Verify SQL Server Backups¶
# Check recent CronJob runs
kubectl get jobs -n rciis-prod --sort-by=.metadata.creationTimestamp | grep mssql-daily-backup
# View the logs of the most recent backup job (job names carry a timestamp suffix)
kubectl logs -n rciis-prod "$(kubectl get jobs -n rciis-prod \
  --sort-by=.metadata.creationTimestamp -o name | grep mssql-daily-backup | tail -n 1)"
# Verify backup exists in S3 (using the rook-ceph toolbox or AWS CLI)
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- \
s3cmd ls s3://rciis-prod/backups/mssql/ \
--host=rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local \
--host-bucket="" --no-ssl
# Inspect backup header from SQL Server
kubectl exec -n rciis-prod statefulset/mssql -- \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C -Q "
RESTORE HEADERONLY
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential';
"
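Beyond inspecting the header, `RESTORE VERIFYONLY` checks that the backup set is complete and readable without actually restoring it (the file name shown is from the example above):

```sql
RESTORE VERIFYONLY
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential';
GO
```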
Restore a SQL Server Database¶
Restore to the Same Instance¶
# Port-forward to MSSQL (or exec into the pod)
kubectl exec -it -n rciis-prod statefulset/mssql -- \
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C
-- Inspect the backup contents
RESTORE FILELISTONLY
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential';
GO
-- Restore with REPLACE (overwrites existing database)
RESTORE DATABASE [RCIIS]
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential',
REPLACE,
STATS = 10;
GO
-- Verify the restore
SELECT name, state_desc, recovery_model_desc
FROM sys.databases
WHERE name = 'RCIIS';
GO
Restore to a New Instance¶
To restore to a different SQL Server instance (e.g., a new StatefulSet for disaster recovery):
-- Ensure the S3 credential exists on the target instance
CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<ACCESS_KEY>:<SECRET_KEY>';
GO
-- Restore with MOVE to place files in the correct location
RESTORE DATABASE [RCIIS]
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential',
MOVE N'RCIIS' TO N'/var/opt/mssql/data/RCIIS.mdf',
MOVE N'RCIIS_Log' TO N'/var/opt/mssql/data/RCIIS_Log.ldf',
REPLACE,
STATS = 10;
GO
Application downtime
Restoring a SQL Server database with REPLACE drops the existing database
and restores from the backup. All active connections are terminated.
Schedule restores during a maintenance window and notify dependent services
(Nucleus API, ESB) beforehand.
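After any restore, a consistency check is a cheap sanity step before reopening the database to the application:

```sql
-- Logical and allocation consistency check; reports errors if the
-- restored database is damaged
DBCC CHECKDB (N'RCIIS') WITH NO_INFOMSGS;
GO
```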
9.4.3 Velero Backups (Stateful Workloads)¶
Velero backs up Kubernetes resources and persistent volumes via CSI snapshots. It is specifically used for stateful workloads that are NOT databases — databases have their own native backup mechanisms (CNPG/Barman for PostgreSQL, BACKUP TO URL for SQL Server).
What Velero Backs Up¶
| Workload | Namespace | Storage | Why Velero |
|---|---|---|---|
| Kafka (Strimzi) | rciis-prod | JBOD 10Gi per broker, ceph-rbd-single | Topic data, consumer offsets, KRaft metadata |
| etcd (APISIX) | rciis-prod | 8Gi per replica, ceph-rbd-single | API gateway configuration |
| Kubernetes resources | Multiple | N/A (metadata only) | CRDs, ConfigMaps, Secrets, RBAC |
What Velero does NOT back up
Velero CSI snapshots capture PVC data at a point in time but are not suitable for databases that require transactional consistency. PostgreSQL and SQL Server use their own WAL-based and log-based backup mechanisms respectively. Velero's role is to protect the Kubernetes resource definitions and the persistent data of non-database stateful workloads.
Configure Backup Schedules¶
The Velero Helm values control backup schedules. Update the values file to enable automated backups:
schedules:
# Daily namespace backup — retains 30 days
daily-namespaces:
disabled: false
schedule: "0 2 * * *" # 02:00 UTC daily
useOwnerReferencesInBackup: false
template:
ttl: "720h" # 30 days
storageLocation: default
includedNamespaces:
- rciis-prod
- monitoring
- strimzi-operator
- cnpg-system
snapshotMoveData: false
# Weekly full-cluster backup — retains 90 days
weekly-full:
disabled: false
schedule: "0 3 * * 0" # 03:00 UTC Sunday
useOwnerReferencesInBackup: false
template:
ttl: "2160h" # 90 days
storageLocation: default
includeClusterResources: true
snapshotMoveData: false
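The `ttl` values are plain hour counts (`720h` = 30 days, `2160h` = 90 days). When adjusting retention it is easy to mis-multiply; a throwaway helper (illustrative only, not part of the chart) keeps the arithmetic honest:

```shell
# Convert a retention period in days to the hour-based ttl string Velero expects.
ttl_for_days() {
  echo "$(( $1 * 24 ))h"
}

ttl_for_days 30   # 720h  (daily-namespaces)
ttl_for_days 90   # 2160h (weekly-full)
```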
After updating the values, reconcile via Flux or apply with Helm:
# Flux reconciliation
flux reconcile kustomization rciis-velero
# Or Helm CLI
helm upgrade velero vmware-tanzu/velero \
--namespace velero \
--reuse-values \
-f apps/infra/velero/proxmox/values.yaml
On-Demand Backups¶
Create ad-hoc backups before maintenance, upgrades, or risky changes:
# Backup specific namespaces
velero backup create pre-upgrade-rciis \
--include-namespaces rciis-prod \
--wait
# Backup with volume snapshots (Kafka/etcd PVCs)
velero backup create kafka-snapshot \
--include-namespaces rciis-prod \
--selector strimzi.io/cluster=kafka-rciis-prod \
--wait
# Full cluster backup
velero backup create full-cluster-$(date +%Y%m%d) \
--wait
# Check backup status
velero backup describe pre-upgrade-rciis --details
Verify Velero Backups¶
# List all backups
velero backup get
# Check backup storage location health
velero backup-location get
# Describe a specific backup (check for errors/warnings)
velero backup describe daily-namespaces-20260217020000 --details
# View backup logs
velero backup logs daily-namespaces-20260217020000
# Verify CSI snapshots were created
kubectl get volumesnapshots -n rciis-prod
Restore from Velero¶
Restore Specific Namespaces¶
# Restore a single namespace from the latest daily backup
velero restore create --from-backup daily-namespaces-20260217020000 \
--include-namespaces rciis-prod \
--wait
# Check restore status
velero restore describe <restore-name> --details
Restore Specific Resources¶
# Restore only Kafka-related resources
velero restore create kafka-restore \
--from-backup daily-namespaces-20260217020000 \
--include-namespaces rciis-prod \
--selector strimzi.io/cluster=kafka-rciis-prod \
--wait
# Restore only etcd PVCs
velero restore create etcd-restore \
--from-backup daily-namespaces-20260217020000 \
--include-namespaces rciis-prod \
--include-resources persistentvolumeclaims \
--selector app=etcd \
--wait
Full Cluster Restore (Disaster Recovery)¶
# List available full-cluster backups
velero backup get --selector velero.io/schedule-name=weekly-full
# Restore the entire cluster (excludes velero namespace itself)
velero restore create full-dr-restore \
--from-backup weekly-full-20260216030000 \
--exclude-namespaces velero \
--wait
# Monitor restore progress
velero restore describe full-dr-restore --details
Restore order matters
When restoring a full cluster, CRDs and namespaces are restored first, then workloads. If operator CRDs (CNPG, Strimzi) are restored but the operators themselves are not yet running, the reconciliation loop will not start until the operator pods are restored. Velero handles this automatically through its resource ordering, but verify operator health after restore completes.
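A quick post-restore sanity check is to wait for every operator Deployment to report Available before expecting CNPG or Strimzi reconciliation to resume. A sketch (the namespace list mirrors the operator namespaces used elsewhere in this guide; adjust for your cluster):

```shell
# Wait for all Deployments in each given operator namespace to become
# Available; returns non-zero if any namespace fails to settle in time.
wait_for_operators() {
  local ns
  for ns in "$@"; do
    kubectl wait deploy --all -n "$ns" \
      --for=condition=Available --timeout=300s || return 1
  done
}

# wait_for_operators cnpg-system strimzi-operator velero
```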
Backup Monitoring & Alerting¶
PrometheusRule for Backup Health¶
Create alerting rules to detect backup failures across all three systems:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backup-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: backup.rules
rules:
# Velero backup failure
- alert: VeleroBackupFailed
expr: |
increase(velero_backup_failure_total[24h]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Velero backup failed"
description: "A Velero backup has failed in the last 24 hours."
runbook: "Check velero backup logs: velero backup logs <backup-name>"
# Velero backup not running
- alert: VeleroBackupNotRunning
expr: |
time() - velero_backup_last_successful_timestamp{schedule!=""} > 90000
for: 10m
labels:
severity: warning
annotations:
summary: "Velero scheduled backup overdue"
description: >-
Velero scheduled backup {{ $labels.schedule }} has not
completed successfully in the last 25 hours.
# CNPG WAL archiving failure
- alert: CNPGWALArchivingFailed
expr: |
cnpg_pg_stat_archiver_failed_count > 0
for: 5m
labels:
severity: critical
annotations:
summary: "CNPG WAL archiving failed for {{ $labels.cluster }}"
description: >-
PostgreSQL cluster {{ $labels.cluster }} in namespace
{{ $labels.namespace }} has WAL archiving failures.
runbook: >-
Check CNPG cluster status:
kubectl cnpg status {{ $labels.cluster }} -n {{ $labels.namespace }}
# CNPG backup not recent
- alert: CNPGBackupStale
expr: |
(time() - cnpg_pg_stat_archiver_last_archived_time) > 3600
for: 10m
labels:
severity: warning
annotations:
summary: "CNPG WAL archiving stale for {{ $labels.cluster }}"
description: >-
No WAL has been archived for PostgreSQL cluster
{{ $labels.cluster }} in the last hour.
Verification Checklist¶
Run this checklist periodically (weekly recommended) to confirm all backup systems are healthy:
#!/bin/bash
echo "=== RCIIS Backup Health Check ==="
echo ""
echo "--- PostgreSQL (CNPG) Backups ---"
for CLUSTER in esb-postgres ss-postgres; do
echo "Cluster: $CLUSTER (rciis-prod)"
kubectl get cluster "$CLUSTER" -n rciis-prod \
-o jsonpath=' First recoverable: {.status.firstRecoverabilityPoint}
Last backup: {.status.lastSuccessfulBackup}
'
done
echo "Cluster: grafana-postgres (monitoring)"
kubectl get cluster grafana-postgres -n monitoring \
-o jsonpath=' First recoverable: {.status.firstRecoverabilityPoint}
Last backup: {.status.lastSuccessfulBackup}
'
echo ""
echo "--- SQL Server Backups ---"
kubectl get cronjob mssql-daily-backup -n rciis-prod \
-o jsonpath=' Schedule: {.spec.schedule}
Last run: {.status.lastScheduleTime}
Last success: {.status.lastSuccessfulTime}
'
echo ""
echo ""
echo "--- Velero Backups ---"
velero backup get --output json 2>/dev/null | \
python3 -c "
import json, sys
data = json.load(sys.stdin)
for item in (data.get('items') or [])[-5:]:
name = item['metadata']['name']
phase = item['status'].get('phase', 'Unknown')
print(f' {name}: {phase}')
" 2>/dev/null || echo " (velero CLI not available)"
echo ""
echo "--- Backup Storage ---"
velero backup-location get 2>/dev/null || echo " (velero CLI not available)"
echo ""
echo "=== Health Check Complete ==="
Backup Matrix Summary¶
| Component | Backup Method | Schedule | Retention | S3 Bucket | Recovery Method |
|---|---|---|---|---|---|
| `esb-postgres` | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | `cnpg-backups/esb-postgres` | PITR via new Cluster CR |
| `ss-postgres` | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | `cnpg-backups/ss-postgres` | PITR via new Cluster CR |
| `grafana-postgres` | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | `cnpg-backups/grafana-postgres` | PITR via new Cluster CR |
| `keycloak-pg` | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | `cnpg-backups/keycloak-pg` | PITR via new Cluster CR |
| RCIIS (MSSQL) | `BACKUP TO URL` (S3) | Daily 03:00 UTC + on sync | 30 days | `rciis-prod/backups/mssql` | `RESTORE FROM URL` |
| Kafka PVCs | Velero CSI snapshot | Daily 02:00 UTC | 30 days | `velero-backups` | Velero restore |
| etcd PVCs | Velero CSI snapshot | Daily 02:00 UTC | 30 days | `velero-backups` | Velero restore |
| Cluster resources | Velero metadata | Weekly 03:00 Sun | 90 days | `velero-backups` | Velero full restore |
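The matrix above implies an expected maximum age for each backup (roughly 24 hours for the daily jobs). A small helper for spot checks, sketched under the assumption of GNU `date` (standard in most Linux containers), turns the RFC 3339 timestamps reported by `kubectl` into an age that can be compared against that expectation:

```shell
# Age, in whole hours, of an RFC 3339 timestamp such as 2026-02-17T02:00:00Z
# (the format kubectl jsonpath returns for status timestamps).
backup_age_hours() {
  local ts_epoch now_epoch
  ts_epoch=$(date -u -d "$1" +%s)   # GNU date; BSD/macOS needs date -j -f instead
  now_epoch=$(date -u +%s)
  echo $(( (now_epoch - ts_epoch) / 3600 ))
}

# Example: flag esb-postgres if its last base backup is older than ~26 hours.
# last=$(kubectl get cluster esb-postgres -n rciis-prod \
#   -o jsonpath='{.status.lastSuccessfulBackup}')
# [ "$(backup_age_hours "$last")" -lt 26 ] || echo "STALE: esb-postgres"
```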