9.3 Maintenance Windows

This page covers routine maintenance procedures for the RCIIS platform, including security tool updates, Kubernetes component maintenance, and scheduled upkeep.

Maintenance Schedule

| Task | Frequency | Window Required | Impact |
| --- | --- | --- | --- |
| Trivy vulnerability DB update | Daily (automatic) | None | None — background update |
| Falco rule updates | Monthly or on new threat advisory | None | Falco pod restart (seconds) |
| Tracee policy updates | As needed | None | Tracee pod restart (seconds) |
| Kyverno policy changes | As needed | None (Audit) / Low (Enforce) | Brief admission delay during rollout |
| Keycloak upgrades | Quarterly | 15–30 minutes | Authentication unavailable during rollout |
| HSM firmware updates | As released by vendor | 30–60 minutes per HSM | None if HA (update one HSM at a time) |
| Helm chart version updates | As released (Renovate Bot) | None | Rolling pod restarts |
| Talos OS upgrades | As released | See Talos Upgrades | Rolling node reboots |
| Kubernetes version upgrades | Quarterly | See Talos Upgrades | Rolling node reboots |

Security Tool Maintenance

Trivy Operator

Vulnerability Database Updates

The Trivy Operator automatically downloads the latest vulnerability database from ghcr.io/aquasecurity/trivy-db. No manual intervention is needed.

Verify the database is current:

kubectl -n trivy-system logs -l app.kubernetes.io/name=trivy-operator --tail=20 | grep -i "db"

Air-Gapped Environments

If the cluster cannot reach ghcr.io, mirror the Trivy DB to Harbor:

# On a machine with internet access
oras pull ghcr.io/aquasecurity/trivy-db:2
oras push harbor.devops.africa/rciis/trivy-db:2 db.tar.gz:application/vnd.aquasec.trivy.db.layer.v1.tar+gzip

Update the Helm values to point to the mirrored DB:

values-trivy-operator.yaml (update)
trivy:
  dbRepository: harbor.devops.africa/rciis/trivy-db

Upgrading the Trivy Operator

helm repo update
helm upgrade trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  -f values-trivy-operator.yaml

Review the Trivy Operator changelog before upgrading — check for breaking changes in CRD schemas or Helm values.


Falco

Rule Updates

Falco rules are updated by upgrading the Helm chart or updating the customRules section in the values file.

Update built-in rules (via Helm upgrade):

helm repo update
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  -f values-falco.yaml

Add or update custom RCIIS rules:

  1. Edit the customRules section in values-falco.yaml
  2. Test new rules in a non-production cluster first
  3. Apply via Helm upgrade
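
A customRules entry in values-falco.yaml might look like the following. This is an illustrative sketch only: the rule name, condition, and output fields are placeholders to adapt before use.

```yaml
customRules:
  rciis-rules.yaml: |-
    - rule: Unexpected Write Below Etc (RCIIS)
      desc: Illustrative custom rule; tune the condition before production use
      condition: >
        evt.type in (open, openat, openat2) and evt.is_open_write=true
        and fd.name startswith /etc and container.id != host
      output: File below /etc opened for writing (file=%fd.name container=%container.name image=%container.image.repository)
      priority: WARNING
      tags: [rciis, filesystem]
```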

Test a new rule before deploying:

# Dry-run — validate rule syntax without deploying
falco -c /etc/falco/falco.yaml -r /path/to/new-rules.yaml --dry-run

Tuning False Positives

If Falco generates alerts for known-good behaviour:

  1. Identify the rule name from the alert
  2. Add an exception to the rule in customRules:
- rule: Shell Spawned in Container
  append: true
  condition: and not (container.image.repository = "harbor.devops.africa/rciis/debug-tools")
  3. Test the exception, then deploy via Helm upgrade

Tracee

Policy Updates

Update Tracee policies by applying new Policy CRDs:

kubectl apply -f tracee-policy-updated.yaml -n tracee

Tracee picks up policy changes dynamically — no pod restart required for CRD-based policies.
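
For reference, a minimal Policy CR follows Tracee's tracee.aquasec.com/v1beta1 schema. The policy name and event below are illustrative; confirm event names against the Tracee version in use.

```yaml
apiVersion: tracee.aquasec.com/v1beta1
kind: Policy
metadata:
  name: rciis-signatures        # illustrative name
spec:
  scope:
    - global                    # apply cluster-wide
  rules:
    - event: anti_debugging     # example event; verify against your Tracee release
```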

Upgrading Tracee

helm repo update
helm upgrade tracee aqua/tracee \
  --namespace tracee \
  -f values-tracee.yaml

Kyverno

Policy Lifecycle

Adding a new policy:

  1. Write the policy YAML (ClusterPolicy or Policy)
  2. Deploy in Audit mode first: validationFailureAction: Audit
  3. Monitor PolicyReport resources for violations over 1–2 weeks
  4. Fix violations or create PolicyException resources
  5. Switch to Enforce mode: validationFailureAction: Enforce
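
Step 2 above can be sketched as a minimal validation policy deployed in Audit mode. The policy name, label key, and message are illustrative placeholders.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label         # illustrative name
spec:
  validationFailureAction: Audit   # switch to Enforce after the bake-in period
  background: true
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must carry a team label."
        pattern:
          metadata:
            labels:
              team: "?*"
```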

Modifying an existing policy:

  1. Update the policy YAML
  2. If the change is more restrictive: switch to Audit mode, monitor, then re-enable Enforce
  3. If the change is less restrictive: apply directly in Enforce mode

Upgrading Kyverno:

helm repo update

# Check for CRD changes — Kyverno CRDs are not managed by Helm in this setup
# (prefer pinning these URLs to the target release tag rather than main)
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/main/config/crds/kyverno.io_clusterpolicies.yaml
kubectl apply -f https://raw.githubusercontent.com/kyverno/kyverno/main/config/crds/kyverno.io_policyexceptions.yaml

helm upgrade kyverno kyverno/kyverno \
  --namespace kyverno \
  -f values-kyverno.yaml

Warning

Always review Kyverno release notes before upgrading. Major version upgrades may change policy API versions or webhook behaviour. Test upgrades in a non-production cluster first.


Flux GitOps — Recovering Stalled Resources

HelmReleases and Kustomizations can become permanently stuck after transient failures (e.g., slow pod startup exceeding the Helm timeout). Once retries are exhausted, Flux will not retry automatically — the resource enters a Stalled state.

Symptom

kubectl get helmrelease -A | grep False
# kafka-ui   False   Helm install failed for release rciis-prod/kafka-ui: timeout waiting for...

The underlying pods may be perfectly healthy:

kubectl get pods -n rciis-prod | grep kafka-ui
# kafka-ui-6cbbc6b6b4-gxxbm   1/1   Running   0   159m

Resolution — Suspend and Resume

Suspending and resuming the HelmRelease clears the exhausted retry counter and forces a fresh reconciliation:

# Step 1: Clear the stalled HelmRelease
flux suspend helmrelease <name> -n flux-system
flux resume helmrelease <name> -n flux-system

# Step 2: Reconcile the parent Kustomization (if also stalled)
flux reconcile kustomization <kustomization-name> -n flux-system

Example — recovering kafka-ui:

flux suspend helmrelease kafka-ui -n flux-system
flux resume helmrelease kafka-ui -n flux-system
# ✔ HelmRelease kafka-ui reconciliation completed

flux reconcile kustomization rciis-kafka-ui -n flux-system
# ✔ applied revision master@sha1:...

Verification

kubectl get helmrelease <name> -n flux-system
# READY: True   STATUS: Helm upgrade succeeded...

kubectl get kustomization <name> -n flux-system
# READY: True   STATUS: Applied revision...

Tip

If multiple HelmReleases are stalled after a cluster rebuild, recover them in dependency order. Check spec.dependsOn in each HelmRelease to determine the correct sequence.
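
Part of that recovery can be scripted. The helper below extracts the names of not-Ready HelmReleases from `kubectl get helmrelease -A` output, using the same coarse "False" match as the symptom check earlier; ordering by spec.dependsOn still has to be done by hand.

```shell
# Sketch: list HelmReleases whose READY column reads False.
# Skips the header line; column 2 is the release name in `-A` output.
stalled_releases() {
  awk 'NR > 1 && /False/ { print $2 }'
}

# Then recover each one (reorder manually per spec.dependsOn first):
# kubectl get helmrelease -A | stalled_releases | while read -r hr; do
#   flux suspend helmrelease "$hr" -n flux-system
#   flux resume  helmrelease "$hr" -n flux-system
# done
```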

Note

If the HelmRelease keeps failing after resume, the issue is not transient — check the pod logs and events for the underlying cause before retrying.


Keycloak

Keycloak is managed by the Keycloak Operator. Upgrades are performed by updating the operator manifests and/or the Keycloak CR image version.

Operator Upgrades

The Keycloak Operator is installed from pinned upstream manifests via FluxCD. To upgrade:

  1. Check the release notes for the target version at keycloak.org

  2. Backup the database:

    kubectl cnpg backup keycloak-pg -n keycloak
    
  3. Export the realm (as a safety backup):

    KC_TOKEN=$(curl -s -X POST \
      "https://auth.rciis.eac.int/realms/master/protocol/openid-connect/token" \
      -d "client_id=admin-cli" \
      -d "username=admin" \
      -d "password=<admin-password>" \
      -d "grant_type=password" | jq -r .access_token)
    
    curl -s "https://auth.rciis.eac.int/admin/realms/rciis" \
      -H "Authorization: Bearer $KC_TOKEN" | jq . > rciis-realm-export.json
    
  4. Update the operator version — bump the pinned version in the FluxCD Kustomization manifests for both the CRDs and the operator deployment. Applied manually, the equivalent commands are:

    # Update CRDs first
    KC_VERSION=26.1.0   # New target version
    
    kubectl apply -f \
      https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
    kubectl apply -f \
      https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
    
    # Then update the operator deployment
    kubectl -n keycloak apply -f \
      https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/${KC_VERSION}/kubernetes/kubernetes.yml
    
  5. Update the Keycloak CR image to match the new operator version:

    kubectl -n keycloak patch keycloak rciis-keycloak --type merge \
      -p '{"spec":{"image":"quay.io/keycloak/keycloak:26.1.0"}}'
    

    The operator performs a rolling update of the Keycloak StatefulSet. It handles database migrations automatically.

  6. Verify: Test OIDC login for Weave GitOps and Kubernetes API

Warning

Always upgrade CRDs before the operator deployment. CRD changes are not backwards-compatible — an older operator cannot reconcile CRs with a newer schema. Test upgrades in a non-production environment first.

Realm Configuration Changes

The KeycloakRealmImport CR handles initial realm creation only. For ongoing changes:

  1. Make changes in the Keycloak admin console or via the Admin REST API
  2. Export the updated realm configuration (see export command above)
  3. Update the KeycloakRealmImport CR in Git to match the current state
  4. Commit — this provides an audit trail and disaster recovery capability

The KeycloakRealmImport CR in Git should always reflect the current realm state so that the realm can be fully recreated from Git in a DR scenario. See Identity & Access Management for the full change workflow.
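
For orientation, a KeycloakRealmImport CR has this general shape. This is a minimal sketch: the CR and realm names are illustrative, and in practice spec.realm holds the full realm representation from the export.

```yaml
apiVersion: k8s.keycloak.org/v2alpha1
kind: KeycloakRealmImport
metadata:
  name: rciis-realm               # illustrative name
  namespace: keycloak
spec:
  keycloakCRName: rciis-keycloak  # must match the Keycloak CR name
  realm:
    realm: rciis
    enabled: true
    # ... full realm representation from the export goes here
```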


HSM

Firmware Updates

CloudHSM firmware updates are managed by AWS. Monitor the AWS CloudHSM release notes for updates.

AWS applies pending firmware updates when an HSM next restarts, so timing is controlled by when each HSM is restarted:

# List the HSMs in the cluster (restart one at a time to pick up pending firmware)
aws cloudhsmv2 describe-clusters \
  --query 'Clusters[0].Hsms[].HsmId' --output text

Firmware updates for on-premises HSMs must be applied manually:

  1. Schedule a maintenance window
  2. If running HA (2+ HSMs): update one HSM at a time
  3. Download the firmware update from the vendor portal
  4. Apply the update using the vendor's administration tool
  5. Verify the HSM is functional after the update: run a test signing operation
  6. Repeat for the second HSM

Warning

Some firmware updates may require the HSM to be re-initialised, which destroys all keys. Always verify with the vendor documentation and ensure key backups are current before updating.


HSM Audit Log Review

Review HSM access logs weekly for:

  • Unauthorised login attempts
  • Unexpected key operations (sign, decrypt, export)
  • Administrative actions (key creation, deletion, policy changes)
# CloudHSM audit logs are in CloudWatch
aws logs filter-log-events \
  --log-group-name /aws/cloudhsm/cluster-xxxxxxxxxxxx \
  --start-time $(date -d '7 days ago' +%s000) \
  --filter-pattern "MGMT_KEY"
# Vendor-specific — example for Thales Luna
lunash:> audit log show

AWS Cluster Rebuild — Post-Bootstrap Restart Sequence

When the AWS cluster is destroyed and rebuilt via Terraform, the bootstrap process (helmfile + Flux) installs components in a specific order. However, certain components require manual restarts because they initialise before their dependencies are ready.

Required Restarts (in order)

| Step | Component | Command | Reason |
| --- | --- | --- | --- |
| 1 | Cilium Operator | kubectl rollout restart deployment cilium-operator -n kube-system | Starts before Gateway API CRDs and Gateway resources are applied by Flux. Without a restart, the operator never picks up the aws-gateway Gateway resource (status stays "Waiting for controller"). |
| 2 | Cilium Envoy DaemonSet | kubectl rollout restart daemonset cilium-envoy -n kube-system | Envoy pods start before the Cilium operator reconciles the gateway. Without a restart, Envoy has 0 listeners and the gateway NLB targets fail health checks (connection refused on ports 80/443). |
| 3 | Cilium Agent DaemonSet | kubectl rollout restart daemonset cilium -n kube-system | Agents need to reload proxy redirects after the operator reconciles the CiliumEnvoyConfig for the gateway. Without a restart, cilium status shows "0 redirects active". |

Pre-Requisites Before Restarts

Before restarting Cilium components, ensure:

  1. Prometheus Operator CRDs are installed — cert-manager and cloudnative-pg HelmReleases include ServiceMonitor/PodMonitor resources. Prometheus depends on cert-manager, creating a circular dependency. Install CRDs manually:

    for crd in servicemonitors podmonitors prometheusrules alertmanagerconfigs \
               alertmanagers prometheuses thanosrulers probes scrapeconfigs; do
      kubectl apply --server-side -f \
        "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_${crd}.yaml"
    done
    
  2. AWS Load Balancer Controller VPC ID — The vpcId in apps/infra/aws-load-balancer-controller/aws/values.yaml must match the new VPC. Each terraform apply creates a new VPC with a new ID.
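
A small helper can keep that vpcId value in sync after a rebuild. This is a sketch: the `vpc_id` Terraform output name in the usage example is an assumption, so adjust it to the actual module.

```shell
# Sketch: rewrite the vpcId line in the ALB controller values file in place,
# preserving the existing indentation.
update_vpc_id() {
  local file=$1 vpc=$2
  sed -i "s/^\( *vpcId:\).*/\1 ${vpc}/" "$file"
}

# Usage (the `vpc_id` output name is an assumption):
# update_vpc_id apps/infra/aws-load-balancer-controller/aws/values.yaml \
#   "$(terraform output -raw vpc_id)"
```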

Verification

After restarts, verify the gateway is working:

# Gateway should show PROGRAMMED: True with an NLB address
kubectl get gateway -n kube-system

# Service should have an EXTERNAL-IP
kubectl get svc cilium-gateway-aws-gateway -n kube-system

# NLB targets should be healthy
aws elbv2 describe-target-health \
  --target-group-arn $(aws elbv2 describe-target-groups \
    --query 'TargetGroups[?contains(TargetGroupName, `ciliumga`)].TargetGroupArn' \
    --output text --region af-south-1) \
  --region af-south-1

# Health endpoint should return 200
curl -sk https://health.rciis.africa/health \
  --resolve "health.rciis.africa:443:$(dig +short <gateway-nlb-hostname>)"

Cleanup on Destroy

When running terraform destroy, Kubernetes-managed AWS resources (NLBs, security groups, target groups) are not in Terraform state and must be cleaned up manually. Otherwise the VPC deletion will hang:

VPC_ID="<vpc-id>"

# Delete orphaned NLBs
for arn in $(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?VpcId==\`${VPC_ID}\`].LoadBalancerArn" \
  --output text --region af-south-1); do
  aws elbv2 delete-load-balancer --load-balancer-arn "$arn" --region af-south-1
done

# Delete orphaned target groups
for arn in $(aws elbv2 describe-target-groups \
  --query "TargetGroups[?VpcId==\`${VPC_ID}\`].TargetGroupArn" \
  --output text --region af-south-1); do
  aws elbv2 delete-target-group --target-group-arn "$arn" --region af-south-1
done

# Wait ~60s for ENIs to release, then delete orphaned security groups
for sg in $(aws ec2 describe-security-groups \
  --filters "Name=vpc-id,Values=${VPC_ID}" \
  --query 'SecurityGroups[?GroupName!=`default`].GroupId' \
  --output text --region af-south-1); do
  aws ec2 delete-security-group --group-id "$sg" --region af-south-1
done

# Retry terraform destroy
terraform destroy -var-file=../envs/aws.tfvars -auto-approve