9.4 Backup & Recovery Procedures

This page provides the complete operational guide for configuring, verifying, and restoring backups across the RCIIS platform. Three backup mechanisms protect different data tiers:

| Backup System | Scope | Method | RPO | RTO |
|---|---|---|---|---|
| CloudNativePG / Barman | PostgreSQL databases | Continuous WAL archiving + scheduled base backups to S3 | Near-zero (last archived WAL) | Minutes (PITR) |
| SQL Server BACKUP TO URL | RCIIS application database (MSSQL) | Native S3 backup via sqlcmd | Last scheduled backup | Minutes (full restore) |
| Velero | Stateful non-database workloads (Kafka, etcd) | CSI volume snapshots + S3 metadata | Last scheduled snapshot | Minutes (PVC restore) |

Relationship to Phase 5

Phase 5 (Backup & Scheduling) covers installing the backup tools (Velero, CNPG operator, Descheduler). This page covers the Day-2 operational procedures — enabling backups on each database, configuring schedules, verifying health, and executing restores.

Backup Architecture
===================

┌─────────────────────────────────────────────────────────────┐
│                    S3-Compatible Storage                     │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ cnpg-backups │  │ mssql-backups│  │velero-backups│      │
│  │  (WAL + base)│  │  (.bak files)│  │  (snapshots) │      │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         │                 │                 │               │
│         │    Ceph RGW (on-prem) / AWS S3 (cloud)           │
└─────────┼─────────────────┼─────────────────┼───────────────┘
          │                 │                 │
    ┌─────┴─────┐     ┌────┴────┐     ┌─────┴─────┐
    │   CNPG    │     │  MSSQL  │     │  Velero   │
    │  Barman   │     │ sqlcmd  │     │   CSI     │
    │  archiver │     │ BACKUP  │     │ snapshots │
    └─────┬─────┘     └────┬────┘     └─────┬─────┘
          │                │                 │
  ┌───────┴───────┐   ┌───┴───┐     ┌──────┴──────┐
  │  PostgreSQL   │   │ RCIIS │     │ Kafka PVCs  │
  │   Clusters    │   │  DB   │     │ etcd  PVCs  │
  │ (esb, ss,     │   │(MSSQL)│     │ other PVCs  │
  │  grafana, kc) │   └───────┘     └─────────────┘
  └───────────────┘

9.4.1 PostgreSQL Backups (CloudNativePG / Barman)

CloudNativePG uses Barman to continuously archive the PostgreSQL Write-Ahead Log (WAL) and perform periodic base backups directly to S3-compatible object storage. This enables point-in-time recovery (PITR) for every PostgreSQL database managed by the CNPG operator.

RCIIS PostgreSQL Database Inventory

| Cluster Name | Namespace | Database | Purpose | Owner |
|---|---|---|---|---|
| esb-postgres | rciis-prod | failed_message_offset | ESB Kafka offset tracking | esb |
| ss-postgres | rciis-prod | ss | SignServer digital signatures | ss |
| grafana-postgres | monitoring | grafana | Grafana dashboards & users | grafana |
| keycloak-pg | keycloak | keycloak | Keycloak identity & access | keycloak |

S3 Credentials

Every CNPG cluster needs a Secret containing S3 access credentials in its namespace. The credential source depends on your deployment model:

AWS: create a Secret with AWS IAM credentials in each namespace that hosts a CNPG cluster:

cnpg-s3-credentials.yaml (SOPS-encrypted)
apiVersion: v1
kind: Secret
metadata:
  name: cnpg-s3-credentials
  namespace: rciis-prod          # Repeat per namespace
type: Opaque
stringData:
  ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
  ACCESS_SECRET_KEY: "<AWS_SECRET_ACCESS_KEY>"

IAM Roles for Service Accounts (IRSA)

On EKS, prefer IRSA over static credentials. Set backup.barmanObjectStore.s3Credentials.inheritFromIAMRole: true on each Cluster CR and annotate the CNPG ServiceAccount with the IAM role ARN. The IAM policy needs s3:PutObject, s3:GetObject, s3:DeleteObject, and s3:ListBucket on the backup bucket.
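For reference, a minimal IAM policy granting those four actions could look like the following (the bucket name matches the rciis-cnpg-backups destination used in the AWS examples; the role's trust policy is environment-specific and omitted):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CNPGBackupObjects",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::rciis-cnpg-backups/*"
    },
    {
      "Sid": "CNPGBackupList",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::rciis-cnpg-backups"
    }
  ]
}
```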

Ceph RGW (Bare Metal): credentials come from the CephObjectStoreUser created in Phase 5 (Backup & Scheduling). After the cnpg-backup CephObjectStoreUser is created, Rook-Ceph generates a Secret containing the RGW access key pair.

Extract the credentials and create a Secret in each namespace:

# Get the Ceph RGW credentials
ACCESS_KEY=$(kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
  -n rook-ceph -o jsonpath='{.data.AccessKey}' | base64 -d)
SECRET_KEY=$(kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
  -n rook-ceph -o jsonpath='{.data.SecretKey}' | base64 -d)

# Create the Secret in each CNPG namespace
for NS in rciis-prod monitoring keycloak; do
  kubectl create secret generic cnpg-s3-credentials \
    --namespace "$NS" \
    --from-literal=ACCESS_KEY_ID="$ACCESS_KEY" \
    --from-literal=ACCESS_SECRET_KEY="$SECRET_KEY" \
    --dry-run=client -o yaml | kubectl apply -f -
done

SOPS encryption

For GitOps, SOPS-encrypt the Secret manifests rather than creating them imperatively. The commands above are for initial setup and verification only.


Enable Backups on CNPG Clusters

Add the backup stanza to each CNPG Cluster CR. The existing Cluster CRs in the repository have the backup section commented out — uncomment and update the S3 destination and credentials.

esb-postgres

apps/rciis/nucleus/proxmox/extra/pg-instance.yaml (AWS)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: esb-postgres
  annotations:
    cnpg.io/reload-secrets: "true"
spec:
  description: "PostgreSQL for ESB"
  instances: 1
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ImageCatalog
    name: postgresql-17
    major: 17

  backup:
    target: prefer-standby
    barmanObjectStore:
      destinationPath: s3://rciis-cnpg-backups/esb-postgres
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"

  postgresql:
    pg_hba:
      - host failed_message_offset esb 0.0.0.0/0 scram-sha-256
  managed:
    roles:
      - name: esb
        ensure: present
        comment: ESB User
        login: true
        superuser: false
        inRoles:
          - pg_monitor
          - pg_signal_backend
        passwordSecret:
          name: cnpg-esb-owner
  bootstrap:
    initdb:
      database: failed_message_offset
      owner: esb
      secret:
        name: cnpg-esb-owner
  superuserSecret:
    name: cnpg-esb-superuser
  storage:
    storageClass: ceph-rbd-single
    size: 10Gi
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"
apps/rciis/nucleus/proxmox/extra/pg-instance.yaml (Bare Metal)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: esb-postgres
  annotations:
    cnpg.io/reload-secrets: "true"
spec:
  description: "PostgreSQL for ESB"
  instances: 1
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ImageCatalog
    name: postgresql-17
    major: 17

  backup:
    target: prefer-standby
    barmanObjectStore:
      destinationPath: s3://cnpg-backups/esb-postgres
      endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"

  postgresql:
    pg_hba:
      - host failed_message_offset esb 0.0.0.0/0 scram-sha-256
  managed:
    roles:
      - name: esb
        ensure: present
        comment: ESB User
        login: true
        superuser: false
        inRoles:
          - pg_monitor
          - pg_signal_backend
        passwordSecret:
          name: cnpg-esb-owner
  bootstrap:
    initdb:
      database: failed_message_offset
      owner: esb
      secret:
        name: cnpg-esb-owner
  superuserSecret:
    name: cnpg-esb-superuser
  storage:
    storageClass: ceph-rbd-single
    size: 10Gi
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"

ss-postgres

The SignServer PostgreSQL cluster follows the same pattern. Update apps/rciis/signserver/proxmox/extra/pg-instance.yaml:

backup stanza for ss-postgres (AWS)
backup:
  target: prefer-standby
  barmanObjectStore:
    destinationPath: s3://rciis-cnpg-backups/ss-postgres
    s3Credentials:
      accessKeyId:
        name: cnpg-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: cnpg-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
      maxParallel: 2
    data:
      compression: gzip
  retentionPolicy: "30d"
backup stanza for ss-postgres (Bare Metal)
backup:
  target: prefer-standby
  barmanObjectStore:
    destinationPath: s3://cnpg-backups/ss-postgres
    endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
    s3Credentials:
      accessKeyId:
        name: cnpg-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: cnpg-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
      maxParallel: 2
    data:
      compression: gzip
  retentionPolicy: "30d"

grafana-postgres

The Grafana PostgreSQL cluster lives in the monitoring namespace. Update apps/infra/prometheus/proxmox/extra/pg-instance.yaml:

backup stanza for grafana-postgres (AWS)
backup:
  target: prefer-standby
  barmanObjectStore:
    destinationPath: s3://rciis-cnpg-backups/grafana-postgres
    s3Credentials:
      accessKeyId:
        name: cnpg-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: cnpg-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
      maxParallel: 2
    data:
      compression: gzip
  retentionPolicy: "30d"
backup stanza for grafana-postgres (Bare Metal)
backup:
  target: prefer-standby
  barmanObjectStore:
    destinationPath: s3://cnpg-backups/grafana-postgres
    endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
    s3Credentials:
      accessKeyId:
        name: cnpg-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: cnpg-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
      maxParallel: 2
    data:
      compression: gzip
  retentionPolicy: "30d"

keycloak-pg

The Keycloak PostgreSQL cluster lives in the keycloak namespace. See Identity Management for the full Cluster CR — add the same backup stanza pattern with destinationPath set to s3://cnpg-backups/keycloak-pg (Bare Metal) or s3://rciis-cnpg-backups/keycloak-pg (AWS).
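For completeness, the Bare Metal stanza for keycloak-pg follows the same shape as the other clusters (on AWS, swap the destinationPath for s3://rciis-cnpg-backups/keycloak-pg and drop endpointURL):

```yaml
backup:
  target: prefer-standby
  barmanObjectStore:
    destinationPath: s3://cnpg-backups/keycloak-pg
    endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
    s3Credentials:
      accessKeyId:
        name: cnpg-s3-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: cnpg-s3-credentials
        key: ACCESS_SECRET_KEY
    wal:
      compression: gzip
      maxParallel: 2
    data:
      compression: gzip
  retentionPolicy: "30d"
```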

Scheduled Base Backups

WAL archiving starts automatically once the backup stanza is added to a Cluster CR. However, periodic base backups are essential for efficient recovery — without them, PITR would need to replay the entire WAL history.

Create a ScheduledBackup CR for each PostgreSQL cluster:

scheduled-backups.yaml
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: esb-postgres-daily
  namespace: rciis-prod
spec:
  schedule: "0 0 2 * * *"            # 02:00 UTC daily
  backupOwnerReference: self
  cluster:
    name: esb-postgres
  method: barmanObjectStore
  immediate: true                     # Take a backup on creation
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ss-postgres-daily
  namespace: rciis-prod
spec:
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: ss-postgres
  method: barmanObjectStore
  immediate: true
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: grafana-postgres-daily
  namespace: monitoring
spec:
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: grafana-postgres
  method: barmanObjectStore
  immediate: true
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: keycloak-pg-daily
  namespace: keycloak
spec:
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: keycloak-pg
  method: barmanObjectStore
  immediate: true

CNPG cron format

CNPG uses a 6-field cron format (seconds, minutes, hours, day-of-month, month, day-of-week). The 0 0 2 * * * schedule fires at 02:00:00 UTC daily. Standard 5-field cron (0 2 * * *) is NOT valid for ScheduledBackup CRs.

Retention policy

retentionPolicy: "30d" on the Cluster CR controls how long base backups and WAL files are retained. Barman automatically prunes backups and WAL segments older than the retention window. Adjust per database based on compliance requirements.
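Alongside the schedule, a one-off base backup can be requested at any time (for example, before a risky migration) by creating a Backup CR directly:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: esb-postgres-manual
  namespace: rciis-prod
spec:
  cluster:
    name: esb-postgres
  method: barmanObjectStore
```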

Verify PostgreSQL Backup Health

After applying the backup configuration and ScheduledBackup CRs, verify that WAL archiving and base backups are working:

# Check WAL archiving status — first recoverable point
kubectl get cluster esb-postgres -n rciis-prod \
  -o jsonpath='{.status.firstRecoverabilityPoint}'

# Check last successful backup timestamp
kubectl get cluster esb-postgres -n rciis-prod \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# Check backup conditions
kubectl get cluster esb-postgres -n rciis-prod \
  -o jsonpath='{.status.conditions}' | python3 -m json.tool

# List all backups for a cluster
kubectl get backups -n rciis-prod -l cnpg.io/cluster=esb-postgres

# Verify scheduled backups are registered
kubectl get scheduledbackups -n rciis-prod

# Full cluster status (requires cnpg kubectl plugin)
kubectl cnpg status esb-postgres -n rciis-prod

Repeat for each database cluster (ss-postgres, grafana-postgres, keycloak-pg) in their respective namespaces.

First backup delay

The first base backup may take several minutes depending on database size. WAL archiving begins immediately after the backup stanza is applied, but the firstRecoverabilityPoint will not appear until the first base backup completes successfully.

Restore a PostgreSQL Database (PITR)

To restore a PostgreSQL database to a specific point in time, create a new Cluster CR that bootstraps from the backup source. CNPG does not support in-place restore — you always create a new cluster and then cut over.

Step 1: Identify the Recovery Target

# Find the available recovery window
kubectl get cluster esb-postgres -n rciis-prod \
  -o jsonpath='First: {.status.firstRecoverabilityPoint} — Last WAL: now'

# List available base backups
kubectl get backups -n rciis-prod -l cnpg.io/cluster=esb-postgres \
  -o custom-columns='NAME:.metadata.name,STARTED:.status.startedAt,COMPLETED:.status.stoppedAt,STATUS:.status.phase'

Step 2: Create a Recovery Cluster

esb-postgres-recovery.yaml (AWS)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: esb-postgres-recovered
  namespace: rciis-prod
spec:
  instances: 1
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ImageCatalog
    name: postgresql-17
    major: 17

  bootstrap:
    recovery:
      source: esb-postgres-backup
      recoveryTarget:
        targetTime: "2026-02-17T12:00:00Z"    # Adjust to your target

  externalClusters:
    - name: esb-postgres-backup
      barmanObjectStore:
        destinationPath: s3://rciis-cnpg-backups/esb-postgres
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
        wal:
          maxParallel: 4

  storage:
    storageClass: ceph-rbd-single
    size: 10Gi
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"
esb-postgres-recovery.yaml (Bare Metal)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: esb-postgres-recovered
  namespace: rciis-prod
spec:
  instances: 1
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ImageCatalog
    name: postgresql-17
    major: 17

  bootstrap:
    recovery:
      source: esb-postgres-backup
      recoveryTarget:
        targetTime: "2026-02-17T12:00:00Z"    # Adjust to your target

  externalClusters:
    - name: esb-postgres-backup
      barmanObjectStore:
        destinationPath: s3://cnpg-backups/esb-postgres
        endpointURL: http://rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local:80
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: ACCESS_SECRET_KEY
        wal:
          maxParallel: 4

  storage:
    storageClass: ceph-rbd-single
    size: 10Gi
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"

Step 3: Apply and Monitor Recovery

# Apply the recovery cluster
kubectl apply -f esb-postgres-recovery.yaml

# Monitor recovery progress
kubectl get cluster esb-postgres-recovered -n rciis-prod -w

# Watch pods — recovery creates a new primary from the backup
kubectl get pods -n rciis-prod -l cnpg.io/cluster=esb-postgres-recovered -w

# Once the cluster reaches "Cluster in healthy state", verify data
kubectl cnpg psql esb-postgres-recovered -n rciis-prod -- \
  -c "SELECT count(*) FROM failed_message_offset;"

Step 4: Cut Over to the Recovered Cluster

Once the recovered cluster is verified, update the application to point to the new cluster's service:

# The recovered cluster exposes services under its own name
# Old: esb-postgres-rw.rciis-prod.svc.cluster.local
# New: esb-postgres-recovered-rw.rciis-prod.svc.cluster.local

# Option A: Rename the recovered cluster (delete old, rename new)
kubectl delete cluster esb-postgres -n rciis-prod
# Then update the recovered cluster's name in the CR and re-apply

# Option B: Update application connection strings to point to the new service
# This depends on how the ESB integration references the database

Production cut-over

Always verify the recovered data before deleting the original cluster. The old cluster should be kept until the recovery is confirmed successful. Consider renaming via DNS (a CNAME or Service update) rather than deleting and recreating.


9.4.2 SQL Server Backups (BACKUP TO URL)

SQL Server 2022 supports native backup to S3-compatible object storage using the BACKUP DATABASE ... TO URL syntax. The RCIIS platform uses this for the main application database (RCIIS), deployed as a StatefulSet via the rciis Helm chart.

RCIIS SQL Server Inventory

| StatefulSet | Namespace | Database | Image | Storage |
|---|---|---|---|---|
| mssql | rciis-prod | RCIIS | mcr.microsoft.com/mssql/server:2022-CU20-GDR2-ubuntu-22.04 | 20Gi (ceph-rbd-single) |

How It Works

SQL Server 2022 on Linux natively supports S3 as a backup destination:

  1. Create an S3 credential in SQL Server — maps an S3 access key pair to a named credential object
  2. Execute BACKUP DATABASE ... TO URL — writes the .bak file directly to the S3 bucket
  3. Restore with RESTORE DATABASE ... FROM URL — reads the .bak file back from S3
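In T-SQL, the three steps above reduce to the following (file name is illustrative; this page's examples use a named credential throughout):

```sql
-- 1. Map the S3 access key pair to a named credential
CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<ACCESS_KEY>:<SECRET_KEY>';

-- 2. Write the .bak file directly to the bucket
BACKUP DATABASE [RCIIS]
TO URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS.bak'
WITH CREDENTIAL = 's3_backup_credential', COMPRESSION, STATS = 10;

-- 3. Read it back from S3 (REPLACE overwrites the existing database)
RESTORE DATABASE [RCIIS]
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS.bak'
WITH CREDENTIAL = 's3_backup_credential', REPLACE;
```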

The RCIIS Helm chart includes a backup-job.yaml template that automates this as a Kubernetes Job.

S3 Credential Setup

Before backups can run, SQL Server needs an S3 credential. The credential is created inside SQL Server (not Kubernetes) and maps an access key pair to an S3 endpoint.

Create the SQL Server Credential

Connect to the SQL Server instance and create the credential:

# Port-forward to the MSSQL pod
kubectl port-forward statefulset/mssql -n rciis-prod 1433:1433 &

# Connect with sqlcmd (from mssql-tools18)
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C
Create S3 credential (AWS)
-- Drop existing credential if re-creating
IF EXISTS (SELECT * FROM sys.credentials WHERE name = 's3_backup_credential')
  DROP CREDENTIAL [s3_backup_credential];
GO

CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>';
GO

-- Verify the credential was created
SELECT name, credential_identity, create_date
FROM sys.credentials
WHERE name = 's3_backup_credential';
GO
Create S3 credential (Ceph RGW)
-- Drop existing credential if re-creating
IF EXISTS (SELECT * FROM sys.credentials WHERE name = 's3_backup_credential')
  DROP CREDENTIAL [s3_backup_credential];
GO

CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<CEPH_RGW_ACCESS_KEY>:<CEPH_RGW_SECRET_KEY>';
GO

-- Verify the credential was created
SELECT name, credential_identity, create_date
FROM sys.credentials
WHERE name = 's3_backup_credential';
GO

Obtaining Ceph RGW credentials

Use the same CephObjectStoreUser credentials as CNPG, or create a dedicated user for MSSQL backups:

# Get credentials from the CephObjectStoreUser secret
kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
  -n rook-ceph -o jsonpath='{.data.AccessKey}' | base64 -d
kubectl get secret rook-ceph-object-user-ceph-objectstore-cnpg-backup \
  -n rook-ceph -o jsonpath='{.data.SecretKey}' | base64 -d

Credential key restrictions

The credential SECRET uses the format <AccessKeyID>:<SecretAccessKey>, with a colon (:) as the separator. Neither the access key nor the secret key may therefore contain a colon character; if your S3 credentials contain colons, regenerate them.
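A quick pre-flight check before composing the SECRET string can catch this early (the key values below are illustrative):

```shell
# Refuse key pairs containing ':' before building the credential SECRET
ACCESS_KEY='AKIAEXAMPLEKEY'
SECRET_KEY='exampleSecret123'

case "${ACCESS_KEY}${SECRET_KEY}" in
  *:*) echo "ERROR: S3 keys must not contain ':'" >&2; exit 1 ;;
esac

# Safe to compose: AccessKeyID:SecretAccessKey
CRED_SECRET="${ACCESS_KEY}:${SECRET_KEY}"
echo "${CRED_SECRET}"
```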

Store S3 Credentials as a Kubernetes Secret

The backup job reads S3 credentials from a Kubernetes Secret and passes them to the sqlcmd session. Create the Secret in the application namespace:

mssql-s3-credentials.yaml (SOPS-encrypted, AWS)
apiVersion: v1
kind: Secret
metadata:
  name: ceph-bucket-credentials
  namespace: rciis-prod
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<AWS_ACCESS_KEY_ID>"
  AWS_SECRET_ACCESS_KEY: "<AWS_SECRET_ACCESS_KEY>"
mssql-s3-credentials.yaml (SOPS-encrypted, Ceph RGW)
apiVersion: v1
kind: Secret
metadata:
  name: ceph-bucket-credentials
  namespace: rciis-prod
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<CEPH_RGW_ACCESS_KEY>"
  AWS_SECRET_ACCESS_KEY: "<CEPH_RGW_SECRET_KEY>"

Configure the Backup Job

The rciis Helm chart includes a backup job template (charts/rciis/templates/backup-job.yaml) that runs as a Flux pre-reconciliation job. Enable it in the environment values file:

apps/rciis/nucleus/proxmox/values.yaml
jobs:
  db-backup:
    enabled: true
    repository:
      image: harbor.devops.africa/nucleus/mssql-tools
      tag: latest
    imagePullSecrets:
      - container-registry
    databaseName: RCIIS
    s3:
      url: "s3://s3.rciis.africa"
      bucket: rciis-prod
      path: "backups/mssql"
      credentialName: s3_backup_credential
      credentials:
        secretName: "ceph-bucket-credentials"
        accessKey: "AWS_ACCESS_KEY_ID"
        secretKey: "AWS_SECRET_ACCESS_KEY"
    compressionOption: COMPRESSION
    copyOnly: false
    retentionDays: 30
    env:
      - name: HOST
        value: mssql
      - name: MSSQL_SA_USER
        value: sa
      - name: MSSQL_SA_PASSWORD
        valueFrom:
          secretKeyRef:
            name: mssql-admin
            key: MSSQL_SA_PASSWORD
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
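The retentionDays value implies pruning .bak objects older than the window. Conceptually, the cutoff is computed from the timestamp embedded in each key, as sketched below (the reference date and key names are illustrative; a real pruning job would list the bucket and delete expired keys with an S3 client):

```shell
RETENTION_DAYS=30
REF_DATE="2026-03-01"                                              # illustrative "today"
CUTOFF=$(date -u -d "${REF_DATE} -${RETENTION_DAYS} days" +%Y%m%d) # GNU date

for KEY in RCIIS_20260101_020000.bak RCIIS_20260217_020000.bak; do
  STAMP=${KEY#RCIIS_}; STAMP=${STAMP%%_*}   # extract the YYYYMMDD portion
  if [ "${STAMP}" -lt "${CUTOFF}" ]; then
    echo "expired: ${KEY}"
  else
    echo "keep: ${KEY}"
  fi
done
```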

The backup job executes the following SQL when triggered:

BACKUP DATABASE [RCIIS]
TO URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_020000.bak'
WITH CREDENTIAL = 's3_backup_credential', COMPRESSION, STATS = 10

Sync hook behaviour

The backup job runs as a Flux pre-reconciliation hook. It triggers before every Flux reconciliation, ensuring a backup exists before any database migration runs. For scheduled backups independent of reconciliations, use a CronJob (see below).

Scheduled Backups (CronJob)

For regular backups independent of Flux reconciliations, create a Kubernetes CronJob:

mssql-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mssql-daily-backup
  namespace: rciis-prod
spec:
  schedule: "0 3 * * *"              # 03:00 UTC daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 3
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: container-registry
          containers:
            - name: mssql-backup
              image: harbor.devops.africa/nucleus/mssql-tools:latest
              command: ["/bin/bash", "-c"]
              args:
                - |
                  set -e
                  TIMESTAMP=$(date +%Y%m%d_%H%M%S)
                  BACKUP_URL="s3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_${TIMESTAMP}.bak"

                  echo "Starting MSSQL backup to ${BACKUP_URL}"

                  # Wait for SQL Server
                  until /opt/mssql-tools18/bin/sqlcmd -b -S mssql -U sa \
                    -P "${MSSQL_SA_PASSWORD}" -C -Q "SELECT 1" > /dev/null 2>&1; do
                    echo "Waiting for SQL Server..."
                    sleep 5
                  done

                  # Re-create S3 credential (idempotent)
                  /opt/mssql-tools18/bin/sqlcmd -b -S mssql -U sa \
                    -P "${MSSQL_SA_PASSWORD}" -C -Q "
                    IF EXISTS (SELECT * FROM sys.credentials WHERE name = 's3_backup_credential')
                      DROP CREDENTIAL [s3_backup_credential];
                    CREATE CREDENTIAL [s3_backup_credential]
                    WITH IDENTITY = 'S3 Access Key',
                    SECRET = '${S3_ACCESS_KEY}:${S3_SECRET_KEY}';
                  "

                  # Execute backup
                  /opt/mssql-tools18/bin/sqlcmd -b -S mssql -U sa \
                    -P "${MSSQL_SA_PASSWORD}" -C -Q "
                    BACKUP DATABASE [RCIIS]
                    TO URL = N'${BACKUP_URL}'
                    WITH CREDENTIAL = 's3_backup_credential',
                    COMPRESSION, STATS = 10;
                  "

                  echo "Backup completed: ${BACKUP_URL}"
              env:
                - name: MSSQL_SA_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mssql-admin
                      key: MSSQL_SA_PASSWORD
                - name: S3_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: ceph-bucket-credentials
                      key: AWS_ACCESS_KEY_ID
                - name: S3_SECRET_KEY
                  valueFrom:
                    secretKeyRef:
                      name: ceph-bucket-credentials
                      key: AWS_SECRET_ACCESS_KEY
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 512Mi

Verify SQL Server Backups

# Check recent CronJob runs (jobs are named mssql-daily-backup-<suffix>)
kubectl get jobs -n rciis-prod --sort-by=.metadata.creationTimestamp | grep mssql-daily-backup

# View the logs of the most recent backup job
kubectl logs -n rciis-prod job/$(kubectl get jobs -n rciis-prod \
  --sort-by=.metadata.creationTimestamp -o name | grep mssql-daily-backup | tail -1 | cut -d/ -f2)

# Verify backup exists in S3 (using the rook-ceph toolbox or AWS CLI)
kubectl exec -n rook-ceph deploy/rook-ceph-tools -- \
  s3cmd ls s3://rciis-prod/backups/mssql/ \
  --host=rook-ceph-rgw-ceph-objectstore.rook-ceph.svc.cluster.local \
  --host-bucket="" --no-ssl

# Inspect the backup header from inside the SQL Server pod
# (exec a shell first so MSSQL_SA_PASSWORD resolves inside the container,
#  not on your workstation)
kubectl exec -it -n rciis-prod statefulset/mssql -- /bin/bash
/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C -Q "
RESTORE HEADERONLY
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential';
"
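
Because each backup name embeds its timestamp (RCIIS_YYYYmmdd_HHMMSS.bak), staleness can be checked from a bucket listing alone. A small sketch of such a check — the helper and key names are hypothetical, not part of the repo:

```python
from datetime import datetime, timedelta

def latest_backup_age(keys, now=None):
    """Return (newest_key, age) for keys shaped like .../RCIIS_YYYYmmdd_HHMMSS.bak."""
    now = now or datetime.utcnow()
    stamps = {}
    for key in keys:
        # Parse the timestamp portion of the filename
        name = key.rsplit("/", 1)[-1]
        ts = name.removeprefix("RCIIS_").removesuffix(".bak")
        stamps[key] = datetime.strptime(ts, "%Y%m%d_%H%M%S")
    newest = max(stamps, key=stamps.get)
    return newest, now - stamps[newest]

keys = [
    "backups/mssql/RCIIS_20260216_030000.bak",
    "backups/mssql/RCIIS_20260217_030000.bak",
]
newest, age = latest_backup_age(keys, now=datetime(2026, 2, 17, 9, 0, 0))
print(newest)                      # backups/mssql/RCIIS_20260217_030000.bak
print(age <= timedelta(hours=24))  # True — newest backup is 6 hours old
```

Fed with the s3cmd listing above, this turns "is the daily backup current?" into a one-line check suitable for the weekly health script.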

Restore a SQL Server Database

Restore to the Same Instance

# Exec into the MSSQL pod and open a sqlcmd session
# (MSSQL_SA_PASSWORD is set in the pod's environment)
kubectl exec -it -n rciis-prod statefulset/mssql -- /bin/bash -c \
  '/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$MSSQL_SA_PASSWORD" -C'
-- Inspect the backup contents
RESTORE FILELISTONLY
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential';
GO

-- Restore with REPLACE (overwrites existing database)
RESTORE DATABASE [RCIIS]
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential',
REPLACE,
STATS = 10;
GO

-- Verify the restore
SELECT name, state_desc, recovery_model_desc
FROM sys.databases
WHERE name = 'RCIIS';
GO

Restore to a New Instance

To restore to a different SQL Server instance (e.g., a new StatefulSet for disaster recovery):

-- Ensure the S3 credential exists on the target instance
CREATE CREDENTIAL [s3_backup_credential]
WITH IDENTITY = 'S3 Access Key',
SECRET = '<ACCESS_KEY>:<SECRET_KEY>';
GO

-- Restore with MOVE to place files in the correct location
RESTORE DATABASE [RCIIS]
FROM URL = N's3://s3.rciis.africa/rciis-prod/backups/mssql/RCIIS_20260217_030000.bak'
WITH CREDENTIAL = 's3_backup_credential',
MOVE N'RCIIS' TO N'/var/opt/mssql/data/RCIIS.mdf',
MOVE N'RCIIS_Log' TO N'/var/opt/mssql/data/RCIIS_Log.ldf',
REPLACE,
STATS = 10;
GO

Application downtime

Restoring a SQL Server database with REPLACE drops the existing database and restores from the backup. All active connections are terminated. Schedule restores during a maintenance window and notify dependent services (Nucleus API, ESB) beforehand.
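
To terminate connections deterministically rather than letting REPLACE sever them mid-transaction, the database can be forced into single-user mode first. This is standard T-SQL shown as an optional pre-step, not something the backup job does for you:

```sql
-- Kick out active connections and roll back their open transactions
-- (run from the master database context, not from RCIIS itself)
ALTER DATABASE [RCIIS] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO

-- ... run the RESTORE DATABASE ... WITH REPLACE statement here ...

-- Return to multi-user once the restore completes
ALTER DATABASE [RCIIS] SET MULTI_USER;
GO
```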


9.4.3 Velero Backups (Stateful Workloads)

Velero backs up Kubernetes resources and persistent volumes via CSI snapshots. It is specifically used for stateful workloads that are NOT databases — databases have their own native backup mechanisms (CNPG/Barman for PostgreSQL, BACKUP TO URL for SQL Server).

What Velero Backs Up

| Workload | Namespace | Storage | Why Velero |
|---|---|---|---|
| Kafka (Strimzi) | rciis-prod | JBOD 10Gi per broker, ceph-rbd-single | Topic data, consumer offsets, KRaft metadata |
| etcd (APISIX) | rciis-prod | 8Gi per replica, ceph-rbd-single | API gateway configuration |
| Kubernetes resources | Multiple | N/A (metadata only) | CRDs, ConfigMaps, Secrets, RBAC |

What Velero does NOT back up

Velero CSI snapshots capture PVC data at a point in time but are not suitable for databases that require transactional consistency. PostgreSQL and SQL Server use their own WAL-based and log-based backup mechanisms respectively. Velero's role is to protect the Kubernetes resource definitions and the persistent data of non-database stateful workloads.
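
This split can be made explicit by excluding database PVCs from Velero's snapshots with the velero.io/exclude-from-backup label — a documented Velero exclusion mechanism. The PVC name below is illustrative; whether to label CNPG/MSSQL volumes this way is a policy choice, not something the repo mandates:

```yaml
# Illustrative PVC metadata — any resource carrying this label
# is skipped by Velero backups.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mssql-data-mssql-0        # illustrative name
  namespace: rciis-prod
  labels:
    velero.io/exclude-from-backup: "true"
```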

Configure Backup Schedules

The Velero Helm values control backup schedules. Two profiles are shown below: the HA profile defines daily and weekly schedules, while the non-HA profile leaves the schedules map empty and relies on on-demand backups. Update the values file for your environment:

apps/infra/velero/proxmox/values.yaml
schedules:
  # Daily namespace backup — retains 30 days
  daily-namespaces:
    disabled: false
    schedule: "0 2 * * *"           # 02:00 UTC daily
    useOwnerReferencesInBackup: false
    template:
      ttl: "720h"                   # 30 days
      storageLocation: default
      includedNamespaces:
        - rciis-prod
        - monitoring
        - strimzi-operator
        - cnpg-system
      snapshotMoveData: false

  # Weekly full-cluster backup — retains 90 days
  weekly-full:
    disabled: false
    schedule: "0 3 * * 0"           # 03:00 UTC Sunday
    useOwnerReferencesInBackup: false
    template:
      ttl: "2160h"                  # 90 days
      storageLocation: default
      includeClusterResources: true
      snapshotMoveData: false
apps/infra/velero/proxmox/values.yaml
# Non-HA: No automated schedules — create on-demand backups as needed
schedules: {}

After updating the values, reconcile via Flux or apply with Helm:

# Flux reconciliation
flux reconcile kustomization rciis-velero

# Or Helm CLI
helm upgrade velero vmware-tanzu/velero \
  --namespace velero \
  --reuse-values \
  -f apps/infra/velero/proxmox/values.yaml

On-Demand Backups

Create ad-hoc backups before maintenance, upgrades, or risky changes:

# Backup specific namespaces
velero backup create pre-upgrade-rciis \
  --include-namespaces rciis-prod \
  --wait

# Backup with volume snapshots (Kafka/etcd PVCs)
velero backup create kafka-snapshot \
  --include-namespaces rciis-prod \
  --selector strimzi.io/cluster=kafka-rciis-prod \
  --wait

# Full cluster backup
velero backup create full-cluster-$(date +%Y%m%d) \
  --wait

# Check backup status
velero backup describe pre-upgrade-rciis --details

Verify Velero Backups

# List all backups
velero backup get

# Check backup storage location health
velero backup-location get

# Describe a specific backup (check for errors/warnings)
velero backup describe daily-namespaces-20260217020000 --details

# View backup logs
velero backup logs daily-namespaces-20260217020000

# Verify CSI snapshots were created
kubectl get volumesnapshots -n rciis-prod

Restore from Velero

Restore Specific Namespaces

# Restore a single namespace from the latest daily backup
velero restore create --from-backup daily-namespaces-20260217020000 \
  --include-namespaces rciis-prod \
  --wait

# Check restore status
velero restore describe <restore-name> --details

Restore Specific Resources

# Restore only Kafka-related resources
velero restore create kafka-restore \
  --from-backup daily-namespaces-20260217020000 \
  --include-namespaces rciis-prod \
  --selector strimzi.io/cluster=kafka-rciis-prod \
  --wait

# Restore only etcd PVCs
velero restore create etcd-restore \
  --from-backup daily-namespaces-20260217020000 \
  --include-namespaces rciis-prod \
  --include-resources persistentvolumeclaims \
  --selector app=etcd \
  --wait

Full Cluster Restore (Disaster Recovery)

# List available full-cluster backups
velero backup get --selector velero.io/schedule-name=weekly-full

# Restore the entire cluster (excludes velero namespace itself)
velero restore create full-dr-restore \
  --from-backup weekly-full-20260215030000 \
  --exclude-namespaces velero \
  --wait

# Monitor restore progress
velero restore describe full-dr-restore --details

Restore order matters

When restoring a full cluster, CRDs and namespaces are restored first, then workloads. If operator CRDs (CNPG, Strimzi) are restored but the operators themselves are not yet running, the reconciliation loop will not start until the operator pods are restored. Velero handles this automatically through its resource ordering, but verify operator health after restore completes.


Backup Monitoring & Alerting

PrometheusRule for Backup Health

Create alerting rules to detect backup failures across all three systems:

backup-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: backup.rules
      rules:
        # Velero backup failure
        - alert: VeleroBackupFailed
          expr: |
            increase(velero_backup_failure_total[24h]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Velero backup failed"
            description: "A Velero backup has failed in the last 24 hours."
            runbook: "Check velero backup logs: velero backup logs <backup-name>"

        # Velero backup not running
        - alert: VeleroBackupNotRunning
          expr: |
            time() - velero_backup_last_successful_timestamp{schedule!=""} > 90000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Velero scheduled backup overdue"
            description: >-
              Velero scheduled backup {{ $labels.schedule }} has not
              completed successfully in the last 25 hours.

        # CNPG WAL archiving failure
        - alert: CNPGWALArchivingFailed
          expr: |
            cnpg_pg_stat_archiver_failed_count > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "CNPG WAL archiving failed for {{ $labels.cluster }}"
            description: >-
              PostgreSQL cluster {{ $labels.cluster }} in namespace
              {{ $labels.namespace }} has WAL archiving failures.
            runbook: >-
              Check CNPG cluster status:
              kubectl cnpg status {{ $labels.cluster }} -n {{ $labels.namespace }}

        # CNPG backup not recent
        - alert: CNPGBackupStale
          expr: |
            (time() - cnpg_pg_stat_archiver_last_archived_time) > 3600
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "CNPG WAL archiving stale for {{ $labels.cluster }}"
            description: >-
              No WAL has been archived for PostgreSQL cluster
              {{ $labels.cluster }} in the last hour.

Verification Checklist

Run this checklist periodically (weekly recommended) to confirm all backup systems are healthy:

#!/bin/bash
echo "=== RCIIS Backup Health Check ==="
echo ""

echo "--- PostgreSQL (CNPG) Backups ---"
for CLUSTER in esb-postgres ss-postgres; do
  echo "Cluster: $CLUSTER (rciis-prod)"
  kubectl get cluster "$CLUSTER" -n rciis-prod \
    -o jsonpath='  First recoverable: {.status.firstRecoverabilityPoint}
  Last backup:       {.status.lastSuccessfulBackup}
'
done
echo "Cluster: grafana-postgres (monitoring)"
kubectl get cluster grafana-postgres -n monitoring \
  -o jsonpath='  First recoverable: {.status.firstRecoverabilityPoint}
  Last backup:       {.status.lastSuccessfulBackup}
'
echo ""

echo "--- SQL Server Backups ---"
kubectl get cronjob mssql-daily-backup -n rciis-prod \
  -o jsonpath='  Schedule:     {.spec.schedule}
  Last run:     {.status.lastScheduleTime}
  Last success: {.status.lastSuccessfulTime}
'
echo ""

echo "--- Velero Backups ---"
velero backup get --output json 2>/dev/null | \
  python3 -c "
import json, sys
data = json.load(sys.stdin)
for item in (data.get('items') or [])[-5:]:
    name = item['metadata']['name']
    phase = item['status'].get('phase', 'Unknown')
    print(f'  {name}: {phase}')
" 2>/dev/null || echo "  (velero CLI not available)"
echo ""

echo "--- Backup Storage ---"
velero backup-location get 2>/dev/null || echo "  (velero CLI not available)"
echo ""
echo "=== Health Check Complete ==="

Backup Matrix Summary

| Component | Backup Method | Schedule | Retention | S3 Bucket | Recovery Method |
|---|---|---|---|---|---|
| esb-postgres | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | cnpg-backups/esb-postgres | PITR via new Cluster CR |
| ss-postgres | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | cnpg-backups/ss-postgres | PITR via new Cluster CR |
| grafana-postgres | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | cnpg-backups/grafana-postgres | PITR via new Cluster CR |
| keycloak-pg | CNPG/Barman WAL + base | Daily 02:00 UTC | 30 days | cnpg-backups/keycloak-pg | PITR via new Cluster CR |
| RCIIS (MSSQL) | BACKUP TO URL (S3) | Daily 03:00 UTC + on sync | 30 days | rciis-prod/backups/mssql | RESTORE FROM URL |
| Kafka PVCs | Velero CSI snapshot | Daily 02:00 UTC | 30 days | velero-backups | Velero restore |
| etcd PVCs | Velero CSI snapshot | Daily 02:00 UTC | 30 days | velero-backups | Velero restore |
| Cluster resources | Velero metadata | Weekly 03:00 Sun | 90 days | velero-backups | Velero full restore |