9.3 Maintenance Windows

This page covers routine maintenance procedures for the RCIIS platform, including security tool updates, Kubernetes component maintenance, and scheduled upkeep.

Maintenance Schedule

| Task | Frequency | Window Required | Impact |
| --- | --- | --- | --- |
| Trivy vulnerability DB update | Daily (automatic) | None | None — background update |
| Falco rule updates | Monthly or on new threat advisory | None | Falco pod restart (seconds) |
| Tracee policy updates | As needed | None | Tracee pod restart (seconds) |
| Kyverno policy changes | As needed | None (Audit) / Low (Enforce) | Brief admission delay during rollout |
| Keycloak upgrades | Quarterly | 15–30 minutes | Authentication unavailable during rollout |
| HSM firmware updates | As released by vendor | 30–60 minutes per HSM | None if HA (update one HSM at a time) |
| Helm chart version updates | As released (Renovate Bot) | None | Rolling pod restarts |
| Talos OS upgrades | As released | See Talos Upgrades | Rolling node reboots |
| Kubernetes version upgrades | Quarterly | See Talos Upgrades | Rolling node reboots |

Security Tool Maintenance

Trivy Operator

Vulnerability Database Updates

The Trivy Operator automatically downloads the latest vulnerability database from ghcr.io/aquasecurity/trivy-db. No manual intervention is needed.

Verify the database is current:

kubectl -n trivy-system logs -l app.kubernetes.io/name=trivy-operator --tail=20 | grep -i "db"

Air-Gapped Environments

If the cluster cannot reach ghcr.io, mirror the Trivy DB to Harbor:

# On a machine with internet access
oras pull ghcr.io/aquasecurity/trivy-db:2
oras push harbor.devops.africa/rciis/trivy-db:2 db.tar.gz:application/vnd.aquasec.trivy.db.layer.v1.tar+gzip

Update the Helm values to point to the mirrored DB:

values-trivy-operator.yaml (update)
trivy:
  dbRepository: harbor.devops.africa/rciis/trivy-db

Upgrading the Trivy Operator

helm repo update
helm upgrade trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  -f values-trivy-operator.yaml

Review the Trivy Operator changelog before upgrading — check for breaking changes in CRD schemas or Helm values.


Falco

Rule Updates

Falco rules are updated by upgrading the Helm chart or updating the customRules section in the values file.

Update built-in rules (via Helm upgrade):

helm repo update
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  -f values-falco.yaml

Add or update custom RCIIS rules:

  1. Edit the customRules section in values-falco.yaml
  2. Test new rules in a non-production cluster first
  3. Apply via Helm upgrade
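
A customRules entry in values-falco.yaml might look like the following. This is an illustrative sketch only: the rule name, condition, and output fields are placeholders to adapt before use.

```yaml
customRules:
  rciis-rules.yaml: |-
    - rule: Unexpected Write Below Etc (RCIIS)
      desc: Illustrative custom rule; tune the condition before production use
      condition: >
        evt.type in (open, openat, openat2) and evt.is_open_write=true
        and fd.name startswith /etc and container.id != host
      output: File below /etc opened for writing (file=%fd.name container=%container.name image=%container.image.repository)
      priority: WARNING
      tags: [rciis, filesystem]
```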

Test a new rule before deploying:

# Dry-run — validate rule syntax without deploying
falco -c /etc/falco/falco.yaml -r /path/to/new-rules.yaml --dry-run

Tuning False Positives

If Falco generates alerts for known-good behaviour:

  1. Identify the rule name from the alert
  2. Add an exception to the rule in customRules:
- rule: Shell Spawned in Container
  append: true
  condition: and not (container.image.repository = "harbor.devops.africa/rciis/debug-tools")
  3. Test the exception, then deploy via Helm upgrade

Tracee

Policy Updates

Update Tracee policies by applying new Policy CRDs:

kubectl apply -f tracee-policy-updated.yaml -n tracee

Tracee picks up policy changes dynamically — no pod restart required for CRD-based policies.
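
For reference, a minimal Policy CR follows Tracee's tracee.aquasec.com/v1beta1 schema. The policy name and event below are illustrative; confirm event names against the Tracee version in use.

```yaml
apiVersion: tracee.aquasec.com/v1beta1
kind: Policy
metadata:
  name: rciis-signatures        # illustrative name
spec:
  scope:
    - global                    # apply cluster-wide
  rules:
    - event: anti_debugging     # example event; verify against your Tracee release
```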

Upgrading Tracee

helm repo update
helm upgrade tracee aqua/tracee \
  --namespace tracee \
  -f values-tracee.yaml

Kyverno

Policy Lifecycle

Adding a new policy:

  1. Write the policy YAML (ClusterPolicy or Policy)
  2. Deploy in Audit mode first: validationFailureAction: Audit
  3. Monitor PolicyReport resources for violations over 1–2 weeks
  4. Fix violations or create PolicyException resources
  5. Switch to Enforce mode: validationFailureAction: Enforce
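
Step 2 above can be sketched as a minimal validation policy deployed in Audit mode. The policy name, label key, and message are illustrative placeholders.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label         # illustrative name
spec:
  validationFailureAction: Audit   # switch to Enforce after the bake-in period
  background: true
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must carry a team label."
        pattern:
          metadata:
            labels:
              team: "?*"
```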

Modifying an existing policy:

  1. Update the policy YAML
  2. If the change is more restrictive: switch to Audit mode, monitor, then re-enable Enforce
  3. If the change is less restrictive: apply directly in Enforce mode

Upgrading Kyverno:

helm repo update

# Check for CRD changes — Kyverno CRDs are not managed by Helm in this setup
# (prefer pinning these URLs to the target release tag rather than main)
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/main/config/crds/kyverno.io_clusterpolicies.yaml
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/main/config/crds/kyverno.io_policyexceptions.yaml

helm upgrade kyverno kyverno/kyverno \
  --namespace kyverno \
  -f values-kyverno.yaml

Warning

Always review Kyverno release notes before upgrading. Major version upgrades may change policy API versions or webhook behaviour. Test upgrades in a non-production cluster first.


Flux GitOps — Recovering Stalled Resources

HelmReleases and Kustomizations can become permanently stuck after transient failures (e.g., slow pod startup exceeding the Helm timeout). Once retries are exhausted, Flux will not retry automatically — the resource enters a Stalled state.

Symptom

kubectl get helmrelease -A | grep False
# kafka-ui   False   Helm install failed for release rciis-prod/kafka-ui: timeout waiting for...

The underlying pods may be perfectly healthy:

kubectl get pods -n rciis-prod | grep kafka-ui
# kafka-ui-6cbbc6b6b4-gxxbm   1/1   Running   0   159m

Resolution — Suspend and Resume

Suspending and resuming the HelmRelease clears the exhausted retry counter and forces a fresh reconciliation:

# Step 1: Clear the stalled HelmRelease
flux suspend helmrelease <name> -n flux-system
flux resume helmrelease <name> -n flux-system

# Step 2: Reconcile the parent Kustomization (if also stalled)
flux reconcile kustomization <kustomization-name> -n flux-system

Example — recovering kafka-ui:

flux suspend helmrelease kafka-ui -n flux-system
flux resume helmrelease kafka-ui -n flux-system
# ✔ HelmRelease kafka-ui reconciliation completed

flux reconcile kustomization rciis-kafka-ui -n flux-system
# ✔ applied revision master@sha1:...

Verification

kubectl get helmrelease <name> -n flux-system
# READY: True   STATUS: Helm upgrade succeeded...

kubectl get kustomization <name> -n flux-system
# READY: True   STATUS: Applied revision...

Tip

If multiple HelmReleases are stalled after a cluster rebuild, recover them in dependency order. Check spec.dependsOn in each HelmRelease to determine the correct sequence.
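
Part of that recovery can be scripted. The helper below extracts the names of not-Ready HelmReleases from `kubectl get helmrelease -A` output, using the same coarse "False" match as the symptom check earlier; ordering by spec.dependsOn still has to be done by hand.

```shell
# Sketch: list HelmReleases whose READY column reads False.
# Skips the header line; column 2 is the release name in `-A` output.
stalled_releases() {
  awk 'NR > 1 && /False/ { print $2 }'
}

# Then recover each one (reorder manually per spec.dependsOn first):
# kubectl get helmrelease -A | stalled_releases | while read -r hr; do
#   flux suspend helmrelease "$hr" -n flux-system
#   flux resume  helmrelease "$hr" -n flux-system
# done
```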

Note

If the HelmRelease keeps failing after resume, the issue is not transient — check the pod logs and events for the underlying cause before retrying.


Keycloak

Keycloak is managed by the Keycloak Operator. Upgrades are performed by updating the operator manifests and/or the Keycloak CR image version.

Operator Upgrades

The Keycloak Operator is installed from pinned upstream manifests via FluxCD. To upgrade:

  1. Check the release notes for the target version at keycloak.org

  2. Backup the database:

    kubectl cnpg backup keycloak-pg -n keycloak
    
  3. Export the realm (as a safety backup):

    KC_TOKEN=$(curl -s -X POST \
      "https://auth.rciis.eac.int/realms/master/protocol/openid-connect/token" \
      -d "client_id=admin-cli" \
      -d "username=admin" \
      -d "password=<admin-password>" \
      -d "grant_type=password" | jq -r .access_token)
    
    curl -s "https://auth.rciis.eac.int/admin/realms/rciis" \
      -H "Authorization: Bearer $KC_TOKEN" | jq . > rciis-realm-export.json
    
  4. Update the operator version — bump the pinned version in the FluxCD Kustomization manifests for both the CRDs and the operator deployment. Applied manually, the equivalent commands are:

    # Update CRDs first
    KC_VERSION=26.1.0   # New target version
    
    kubectl apply -f \
      https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
    kubectl apply -f \
      https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
    
    # Then update the operator deployment
    kubectl -n keycloak apply -f \
      https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/kubernetes.yml
    
  5. Update the Keycloak CR image to match the new operator version:

    kubectl -n keycloak patch keycloak rciis-keycloak --type merge \
      -p '{"spec":{"image":"quay.io/keycloak/keycloak:26.1.0"}}'
    

    The operator performs a rolling update of the Keycloak StatefulSet. It handles database migrations automatically.

  6. Verify: Test OIDC login for Weave GitOps and Kubernetes API

Warning

Always upgrade CRDs before the operator deployment. CRD changes are not backwards-compatible — an older operator cannot reconcile CRs with a newer schema. Test upgrades in a non-production environment first.

Realm Configuration Changes

The KeycloakRealmImport CR handles initial realm creation only. For ongoing changes:

  1. Make changes in the Keycloak admin console or via the Admin REST API
  2. Export the updated realm configuration (see export command above)
  3. Update the KeycloakRealmImport CR in Git to match the current state
  4. Commit — this provides an audit trail and disaster recovery capability

The KeycloakRealmImport CR in Git should always reflect the current realm state so that the realm can be fully recreated from Git in a DR scenario. See Identity & Access Management for the full change workflow.
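
For orientation, a KeycloakRealmImport CR has this general shape. This is a minimal sketch: the CR and realm names are illustrative, and in practice spec.realm holds the full realm representation from the export.

```yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: KeycloakRealmImport
metadata:
  name: rciis-realm               # illustrative name
  namespace: keycloak
spec:
  keycloakCRName: rciis-keycloak  # must match the Keycloak CR name
  realm:
    realm: rciis
    enabled: true
    # ... full realm representation from the export goes here
```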


HSM

Firmware Updates

CloudHSM firmware updates are managed by AWS. Monitor the AWS CloudHSM release notes for updates.

AWS applies pending firmware updates when an HSM next restarts, so timing is controlled by when each HSM is restarted:

# List the HSMs in the cluster (restart one at a time to pick up pending firmware)
aws cloudhsmv2 describe-clusters \
  --query 'Clusters[0].Hsms[].HsmId' --output text

Firmware updates for on-premises HSMs must be applied manually:

  1. Schedule a maintenance window
  2. If running HA (2+ HSMs): update one HSM at a time
  3. Download the firmware update from the vendor portal
  4. Apply the update using the vendor's administration tool
  5. Verify the HSM is functional after the update: run a test signing operation
  6. Repeat for the second HSM

Warning

Some firmware updates may require the HSM to be re-initialised, which destroys all keys. Always verify with the vendor documentation and ensure key backups are current before updating.


HSM Audit Log Review

Review HSM access logs weekly for:

  • Unauthorised login attempts
  • Unexpected key operations (sign, decrypt, export)
  • Administrative actions (key creation, deletion, policy changes)
# CloudHSM audit logs are in CloudWatch
aws logs filter-log-events \
  --log-group-name /aws/cloudhsm/cluster-xxxxxxxxxxxx \
  --start-time $(date -d '7 days ago' +%s000) \
  --filter-pattern "MGMT_KEY"
# Vendor-specific — example for Thales Luna
lunash:> audit log show

AWS Cluster Rebuild — Post-Bootstrap Restart Sequence

When the AWS cluster is destroyed and rebuilt via Terraform, the bootstrap process (helmfile + Flux) installs components in a specific order. However, certain components require manual restarts because they initialise before their dependencies are ready.

Required Restarts (in order)

| Step | Component | Command | Reason |
| --- | --- | --- | --- |
| 1 | Cilium Operator | kubectl rollout restart deployment cilium-operator -n kube-system | Starts before Gateway API CRDs and Gateway resources are applied by Flux. Without a restart, the operator never picks up the aws-gateway Gateway resource (status stays "Waiting for controller"). |
| 2 | Cilium Envoy DaemonSet | kubectl rollout restart daemonset cilium-envoy -n kube-system | Envoy pods start before the Cilium operator reconciles the gateway. Without a restart, Envoy has 0 listeners and the gateway NLB targets fail health checks (connection refused on ports 80/443). |
| 3 | Cilium Agent DaemonSet | kubectl rollout restart daemonset cilium -n kube-system | Agents need to reload proxy redirects after the operator reconciles the CiliumEnvoyConfig for the gateway. Without a restart, cilium status shows "0 redirects active". |

Pre-Requisites Before Restarts

Before restarting Cilium components, ensure:

  1. Prometheus Operator CRDs are installed — cert-manager and cloudnative-pg HelmReleases include ServiceMonitor/PodMonitor resources. Prometheus depends on cert-manager, creating a circular dependency. Install CRDs manually:

    for crd in servicemonitors podmonitors prometheusrules alertmanagerconfigs \
               alertmanagers prometheuses thanosrulers probes scrapeconfigs; do
      kubectl apply --server-side -f \
        "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_${crd}.yaml"
    done
    
  2. AWS Load Balancer Controller VPC ID — The vpcId in apps/infra/aws-load-balancer-controller/aws/values.yaml must match the new VPC. Each terraform apply creates a new VPC with a new ID.
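
A small helper can keep that vpcId value in sync after a rebuild. This is a sketch: the `vpc_id` Terraform output name in the usage example is an assumption, so adjust it to the actual module.

```shell
# Sketch: rewrite the vpcId line in the ALB controller values file in place,
# preserving the existing indentation.
update_vpc_id() {
  local file=$1 vpc=$2
  sed -i "s/^\( *vpcId:\).*/\1 ${vpc}/" "$file"
}

# Usage (the `vpc_id` output name is an assumption):
# update_vpc_id apps/infra/aws-load-balancer-controller/aws/values.yaml \
#   "$(terraform output -raw vpc_id)"
```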

Verification

After restarts, verify the gateway is working:

# Gateway should show PROGRAMMED: True with an NLB address
kubectl get gateway -n kube-system

# Service should have an EXTERNAL-IP
kubectl get svc cilium-gateway-aws-gateway -n kube-system

# NLB targets should be healthy
aws elbv2 describe-target-health \
  --target-group-arn $(aws elbv2 describe-target-groups \
    --query 'TargetGroups[?contains(TargetGroupName, `ciliumga`)].TargetGroupArn' \
    --output text --region af-south-1) \
  --region af-south-1

# Health endpoint should return 200
curl -sk https://health.rciis.africa/health \
  --resolve "health.rciis.africa:443:$(dig +short <gateway-nlb-hostname>)"

Cleanup on Destroy

When running terraform destroy, Kubernetes-managed AWS resources (NLBs, security groups, target groups) are not in Terraform state and must be cleaned up manually. Otherwise the VPC deletion will hang:

VPC_ID="<vpc-id>"

# Delete orphaned NLBs
for arn in $(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?VpcId==\`${VPC_ID}\`].LoadBalancerArn" \
  --output text --region af-south-1); do
  aws elbv2 delete-load-balancer --load-balancer-arn "$arn" --region af-south-1
done

# Delete orphaned target groups
for arn in $(aws elbv2 describe-target-groups \
  --query "TargetGroups[?VpcId==\`${VPC_ID}\`].TargetGroupArn" \
  --output text --region af-south-1); do
  aws elbv2 delete-target-group --target-group-arn "$arn" --region af-south-1
done

# Wait ~60s for ENIs to release, then delete orphaned security groups
for sg in $(aws ec2 describe-security-groups \
  --filters "Name=vpc-id,Values=${VPC_ID}" \
  --query 'SecurityGroups[?GroupName!=`default`].GroupId' \
  --output text --region af-south-1); do
  aws ec2 delete-security-group --group-id "$sg" --region af-south-1
done

# Retry terraform destroy
terraform destroy -var-file=../envs/aws.tfvars -auto-approve