9.3 Maintenance Windows¶
This page covers routine maintenance procedures for the RCIIS platform, including security tool updates, Kubernetes component maintenance, and scheduled upkeep.
Maintenance Schedule¶
| Task | Frequency | Window Required | Impact |
|---|---|---|---|
| Trivy vulnerability DB update | Daily (automatic) | None | None — background update |
| Falco rule updates | Monthly or on new threat advisory | None | Falco pod restart (seconds) |
| Tracee policy updates | As needed | None | Tracee pod restart (seconds) |
| Kyverno policy changes | As needed | None (Audit) / Low (Enforce) | Brief admission delay during rollout |
| Keycloak upgrades | Quarterly | 15–30 minutes | Authentication unavailable during rollout |
| HSM firmware updates | As released by vendor | 30–60 minutes per HSM | None if HA (update one HSM at a time) |
| Helm chart version updates | As released (Renovate Bot) | None | Rolling pod restarts |
| Talos OS upgrades | As released | See Talos Upgrades | Rolling node reboots |
| Kubernetes version upgrades | Quarterly | See Talos Upgrades | Rolling node reboots |
Security Tool Maintenance¶
Trivy Operator¶
Vulnerability Database Updates¶
The Trivy Operator automatically downloads the latest vulnerability database from ghcr.io/aquasecurity/trivy-db. No manual intervention is needed.
Verify the database is current:
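For example (a sketch — the deployment name and namespace follow the chart defaults and may differ in your install):

```shell
# Check the operator logs for recent vulnerability DB activity
# (deployment and namespace names assume chart defaults)
kubectl -n trivy-system logs deployment/trivy-operator | grep -i "db" | tail -5

# Fresh VulnerabilityReports imply the DB is being updated — check their age
kubectl get vulnerabilityreports -A --sort-by=.metadata.creationTimestamp | tail -5
```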
Air-Gapped Environments¶
If the cluster cannot reach ghcr.io, mirror the Trivy DB to Harbor:
# On a machine with internet access
oras pull ghcr.io/aquasecurity/trivy-db:2
oras push harbor.devops.africa/rciis/trivy-db:2 db.tar.gz:application/vnd.aquasec.trivy.db.layer.v1.tar+gzip
Update the Helm values to point to the mirrored DB:
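One way to do this (a sketch — `trivy.dbRepository` is the relevant key in recent trivy-operator chart versions; verify against your chart's values file before applying):

```shell
# Point the scanner at the mirrored DB in Harbor instead of ghcr.io
helm upgrade trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  -f values-trivy-operator.yaml \
  --set trivy.dbRepository=harbor.devops.africa/rciis/trivy-db
```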
Upgrading the Trivy Operator¶
helm repo update
helm upgrade trivy-operator aqua/trivy-operator \
--namespace trivy-system \
-f values-trivy-operator.yaml
Review the Trivy Operator changelog before upgrading — check for breaking changes in CRD schemas or Helm values.
Falco¶
Rule Updates¶
Falco rules are updated by upgrading the Helm chart or updating the customRules section in the values file.
Update built-in rules (via Helm upgrade):
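A minimal sketch, assuming the standard falcosecurity chart repo and a `values-falco.yaml` alongside the other values files (the `falco` namespace is an assumption):

```shell
# Pull the latest chart, which bundles the current built-in ruleset
helm repo update
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  -f values-falco.yaml
```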
Add or update custom RCIIS rules:
- Edit the `customRules` section in `values-falco.yaml`
- Test new rules in a non-production cluster first
- Apply via Helm upgrade
Test a new rule before deploying:
# Dry-run — validate rule syntax without deploying
falco -c /etc/falco/falco.yaml -r /path/to/new-rules.yaml --dry-run
Tuning False Positives¶
If Falco generates alerts for known-good behaviour:
- Identify the rule name from the alert
- Add an exception to the rule in `customRules`:

      - rule: Shell Spawned in Container
        append: true
        condition: and not (container.image.repository = "harbor.devops.africa/rciis/debug-tools")

- Test the exception, then deploy via Helm upgrade
Tracee¶
Policy Updates¶
Update Tracee policies by applying new Policy CRDs:
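For example (a sketch — the policy name and rule are illustrative, and the `spec` fields follow the `tracee.aquasec.com/v1beta1` schema; check the Tracee docs for the exact schema your version supports):

```shell
# Apply an updated Policy CRD; Tracee reloads it without a restart
kubectl apply -f - <<'EOF'
apiVersion: tracee.aquasec.com/v1beta1
kind: Policy
metadata:
  name: rciis-signatures
spec:
  scope:
    - global
  rules:
    - event: anti_debugging
EOF
```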
Tracee picks up policy changes dynamically — no pod restart required for CRD-based policies.
Upgrading Tracee¶
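The upgrade follows the same pattern as the other Aqua charts (a sketch — the release name, `aqua` repo alias, `tracee-system` namespace, and `values-tracee.yaml` file are assumptions):

```shell
helm repo update
helm upgrade tracee aqua/tracee \
  --namespace tracee-system \
  -f values-tracee.yaml
```

As with the Trivy Operator, review the changelog for CRD or values changes before upgrading.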
Kyverno¶
Policy Lifecycle¶
Adding a new policy:
- Write the policy YAML (ClusterPolicy or Policy)
- Deploy in `Audit` mode first: `validationFailureAction: Audit`
- Monitor `PolicyReport` resources for violations over 1–2 weeks
- Fix violations or create `PolicyException` resources
- Switch to `Enforce` mode: `validationFailureAction: Enforce`
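As an illustration of deploying in Audit mode first, a minimal ClusterPolicy (the require-labels rule is a generic example, not an actual RCIIS policy):

```shell
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Audit   # switch to Enforce after the bake-in period
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must carry a team label."
        pattern:
          metadata:
            labels:
              team: "?*"
EOF
```

Violations then appear in `PolicyReport` resources without blocking admission.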
Modifying an existing policy:
- Update the policy YAML
- If the change is more restrictive: switch to `Audit` mode, monitor, then re-enable `Enforce`
- If the change is less restrictive: apply directly in `Enforce` mode
Upgrading Kyverno:
helm repo update
# Check for CRD changes — Kyverno CRDs are not managed by Helm
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/main/config/crds/kyverno.io_clusterpolicies.yaml
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/main/config/crds/kyverno.io_policyexceptions.yaml
helm upgrade kyverno kyverno/kyverno \
--namespace kyverno \
-f values-kyverno.yaml
Warning
Always review Kyverno release notes before upgrading. Major version upgrades may change policy API versions or webhook behaviour. Test upgrades in a non-production cluster first.
Flux GitOps — Recovering Stalled Resources¶
HelmReleases and Kustomizations can become permanently stuck after transient failures (e.g., slow pod startup exceeding the Helm timeout). Once retries are exhausted, Flux will not retry automatically — the resource enters a Stalled state.
Symptom¶
kubectl get helmrelease -A | grep False
# kafka-ui False Helm install failed for release rciis-prod/kafka-ui: timeout waiting for...
The underlying pods may be perfectly healthy:
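For example (the namespace and label selector are illustrative):

```shell
# Pods can be Running and Ready even though the HelmRelease reports a failure
kubectl get pods -n rciis-prod -l app.kubernetes.io/name=kafka-ui
```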
Resolution — Suspend and Resume¶
Suspending and resuming the HelmRelease clears the exhausted retry counter and forces a fresh reconciliation:
# Step 1: Clear the stalled HelmRelease
flux suspend helmrelease <name> -n flux-system
flux resume helmrelease <name> -n flux-system
# Step 2: Reconcile the parent Kustomization (if also stalled)
flux reconcile kustomization <kustomization-name> -n flux-system
Example — recovering kafka-ui:
flux suspend helmrelease kafka-ui -n flux-system
flux resume helmrelease kafka-ui -n flux-system
# ✔ HelmRelease kafka-ui reconciliation completed
flux reconcile kustomization rciis-kafka-ui -n flux-system
# ✔ applied revision master@sha1:...
Verification¶
kubectl get helmrelease <name> -n flux-system
# READY: True STATUS: Helm upgrade succeeded...
kubectl get kustomization <name> -n flux-system
# READY: True STATUS: Applied revision...
Tip
If multiple HelmReleases are stalled after a cluster rebuild, recover them in dependency order. Check spec.dependsOn in each HelmRelease to determine the correct sequence.
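One way to see the dependency graph at a glance (a sketch):

```shell
# List each HelmRelease alongside the releases it depends on
kubectl get helmrelease -n flux-system \
  -o custom-columns='NAME:.metadata.name,DEPENDS_ON:.spec.dependsOn[*].name'
```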
Note
If the HelmRelease keeps failing after resume, the issue is not transient — check the pod logs and events for the underlying cause before retrying.
Keycloak¶
Keycloak is managed by the Keycloak Operator. Upgrades are performed by updating the operator manifests and/or the Keycloak CR image version.
Operator Upgrades¶
The Keycloak Operator is installed from pinned upstream manifests via FluxCD. To upgrade:
- Check the release notes for the target version at keycloak.org
- Back up the database
- Export the realm (as a safety backup):

      KC_TOKEN=$(curl -s -X POST \
        "https://auth.rciis.eac.int/realms/master/protocol/openid-connect/token" \
        -d "client_id=admin-cli" \
        -d "username=admin" \
        -d "password=<admin-password>" \
        -d "grant_type=password" | jq -r .access_token)
      curl -s "https://auth.rciis.eac.int/admin/realms/rciis" \
        -H "Authorization: Bearer $KC_TOKEN" | jq . > rciis-realm-export.json

- Update the operator version — change the `targetRevision` in the FluxCD Kustomization manifests for both the CRDs and the operator deployment:

      # Update CRDs first
      KC_VERSION=26.1.0  # New target version
      kubectl apply -f \
        https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
      kubectl apply -f \
        https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
      # Then update the operator deployment
      kubectl -n keycloak apply -f \
        https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/kubernetes.yml

- Update the Keycloak CR image to match the new operator version:

      kubectl -n keycloak patch keycloak rciis-keycloak --type merge \
        -p '{"spec":{"image":"quay.io/keycloak/keycloak:26.1.0"}}'

  The operator performs a rolling update of the Keycloak StatefulSet. It handles database migrations automatically.

- Verify: test OIDC login for Weave GitOps and the Kubernetes API
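The database backup step above elides the actual command; it depends on how the Keycloak database is hosted. If it runs on CloudNativePG (used elsewhere on this platform), an on-demand backup could look like the following — the cluster name `keycloak-db` is hypothetical:

```shell
# Request an on-demand CNPG backup before the upgrade
kubectl apply -f - <<'EOF'
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: keycloak-pre-upgrade
  namespace: keycloak
spec:
  cluster:
    name: keycloak-db   # hypothetical — use your actual CNPG cluster name
EOF
# Confirm the backup completed before proceeding
kubectl -n keycloak get backup keycloak-pre-upgrade
```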
Warning
Always upgrade CRDs before the operator deployment. CRD changes are not backwards-compatible — an older operator cannot reconcile CRs with a newer schema. Test upgrades in a non-production environment first.
Realm Configuration Changes¶
The KeycloakRealmImport CR handles initial realm creation only. For ongoing changes:
- Make changes in the Keycloak admin console or via the Admin REST API
- Export the updated realm configuration (see export command above)
- Update the `KeycloakRealmImport` CR in Git to match the current state
- Commit — this provides an audit trail and disaster recovery capability
The KeycloakRealmImport CR in Git should always reflect the current realm state so that the realm can be fully recreated from Git in a DR scenario. See Identity & Access Management for the full change workflow.
HSM¶
Firmware Updates¶
CloudHSM firmware updates are managed by AWS. Monitor the AWS CloudHSM release notes for updates.
AWS applies firmware updates during the next reboot. To control timing:
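CloudHSM has no explicit reboot API; a common way to control when the new firmware lands is to replace HSMs one at a time, since a replacement boots with the latest firmware (a sketch — cluster/HSM IDs and the availability zone are placeholders):

```shell
# Replace one HSM; the new instance starts on current firmware
aws cloudhsmv2 delete-hsm --cluster-id cluster-abc123 --hsm-id hsm-abc123
aws cloudhsmv2 create-hsm --cluster-id cluster-abc123 --availability-zone af-south-1a

# Wait for the replacement to reach ACTIVE before touching the next HSM
aws cloudhsmv2 describe-clusters --filters clusterIds=cluster-abc123 \
  --query 'Clusters[0].Hsms[*].State'
```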
Firmware updates for on-premises HSMs must be applied manually:
- Schedule a maintenance window
- If running HA (2+ HSMs): update one HSM at a time
- Download the firmware update from the vendor portal
- Apply the update using the vendor's administration tool
- Verify the HSM is functional after the update: run a test signing operation
- Repeat for the second HSM
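The test signing operation in the steps above could use OpenSC's `pkcs11-tool` against the vendor's PKCS#11 module (the module path and key label are placeholders):

```shell
# Sign a small payload with an existing HSM-backed key to confirm the
# HSM is functional after the firmware update
echo -n "healthcheck" > /tmp/payload
pkcs11-tool --module /usr/lib/vendor/libpkcs11.so \
  --login --pin "$HSM_PIN" \
  --sign --mechanism SHA256-RSA-PKCS \
  --label test-signing-key \
  --input-file /tmp/payload --output-file /tmp/payload.sig
```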
Warning
Some firmware updates may require the HSM to be re-initialised, which destroys all keys. Always verify with the vendor documentation and ensure key backups are current before updating.
HSM Audit Log Review¶
Review HSM access logs weekly for:
- Unauthorised login attempts
- Unexpected key operations (sign, decrypt, export)
- Administrative actions (key creation, deletion, policy changes)
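For CloudHSM, audit logs are delivered to CloudWatch Logs; a sketch of a weekly review pull (the log group name follows the `/aws/cloudhsm/<cluster-id>` convention, and the cluster ID is a placeholder):

```shell
# Pull the last 7 days of HSM audit events for review
aws logs filter-log-events \
  --log-group-name /aws/cloudhsm/cluster-abc123 \
  --start-time $(( ($(date +%s) - 7*24*3600) * 1000 )) \
  --region af-south-1 | jq -r '.events[].message'
```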
AWS Cluster Rebuild — Post-Bootstrap Restart Sequence¶
When the AWS cluster is destroyed and rebuilt via Terraform, the bootstrap process (helmfile + Flux) installs components in a specific order. However, certain components require manual restarts because they initialise before their dependencies are ready.
Required Restarts (in order)¶
| Step | Component | Command | Reason |
|---|---|---|---|
| 1 | Cilium Operator | `kubectl rollout restart deployment cilium-operator -n kube-system` | Starts before Gateway API CRDs and Gateway resources are applied by Flux. Without a restart, the operator never picks up the `aws-gateway` Gateway resource (status stays "Waiting for controller"). |
| 2 | Cilium Envoy DaemonSet | `kubectl rollout restart daemonset cilium-envoy -n kube-system` | Envoy pods start before the Cilium operator reconciles the gateway. Without a restart, Envoy has 0 listeners and the gateway NLB targets fail health checks (connection refused on ports 80/443). |
| 3 | Cilium Agent DaemonSet | `kubectl rollout restart daemonset cilium -n kube-system` | Agents need to reload proxy redirects after the operator reconciles the CiliumEnvoyConfig for the gateway. Without a restart, `cilium status` shows "0 redirects active". |
Pre-Requisites Before Restarts¶
Before restarting Cilium components, ensure:
- Prometheus Operator CRDs are installed — cert-manager and cloudnative-pg HelmReleases include ServiceMonitor/PodMonitor resources. Prometheus depends on cert-manager, creating a circular dependency. Install the CRDs manually:

      for crd in servicemonitors podmonitors prometheusrules alertmanagerconfigs \
        alertmanagers prometheuses thanosrulers probes scrapeconfigs; do
        kubectl apply --server-side -f \
          "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_${crd}.yaml"
      done

- AWS Load Balancer Controller VPC ID — the `vpcId` in `apps/infra/aws-load-balancer-controller/aws/values.yaml` must match the new VPC. Each `terraform apply` creates a new VPC with a new ID.
Verification¶
After restarts, verify the gateway is working:
# Gateway should show PROGRAMMED: True with an NLB address
kubectl get gateway -n kube-system
# Service should have an EXTERNAL-IP
kubectl get svc cilium-gateway-aws-gateway -n kube-system
# NLB targets should be healthy
aws elbv2 describe-target-health \
--target-group-arn $(aws elbv2 describe-target-groups \
--query 'TargetGroups[?contains(TargetGroupName, `ciliumga`)].TargetGroupArn' \
--output text --region af-south-1) \
--region af-south-1
# Health endpoint should return 200
curl -sk https://health.rciis.africa/health \
--resolve "health.rciis.africa:443:$(dig +short <gateway-nlb-hostname>)"
Cleanup on Destroy¶
When running terraform destroy, Kubernetes-managed AWS resources (NLBs, security groups, target groups) are not in Terraform state and must be cleaned up manually. Otherwise the VPC deletion will hang:
VPC_ID="<vpc-id>"
# Delete orphaned NLBs
for arn in $(aws elbv2 describe-load-balancers \
--query "LoadBalancers[?VpcId==\`${VPC_ID}\`].LoadBalancerArn" \
--output text --region af-south-1); do
aws elbv2 delete-load-balancer --load-balancer-arn "$arn" --region af-south-1
done
# Delete orphaned target groups
for arn in $(aws elbv2 describe-target-groups \
--query "TargetGroups[?VpcId==\`${VPC_ID}\`].TargetGroupArn" \
--output text --region af-south-1); do
aws elbv2 delete-target-group --target-group-arn "$arn" --region af-south-1
done
# Wait ~60s for ENIs to release, then delete orphaned security groups
for sg in $(aws ec2 describe-security-groups \
--filters "Name=vpc-id,Values=${VPC_ID}" \
--query 'SecurityGroups[?GroupName!=`default`].GroupId' \
--output text --region af-south-1); do
aws ec2 delete-security-group --group-id "$sg" --region af-south-1
done
# Retry terraform destroy
terraform destroy -var-file=../envs/aws.tfvars -auto-approve