9.7 Incident Response¶
This page defines procedures for responding to security incidents detected by the RCIIS runtime security stack (Falco, Tracee) and other monitoring tools.
Alert Severity Classification¶
| Severity | Falco Priority | Examples | Response SLA |
|---|---|---|---|
| P1 — Critical | CRITICAL / EMERGENCY | Container escape, HSM compromise, data exfiltration alert | Immediate (within 15 minutes) |
| P2 — High | ERROR | Unexpected outbound connection from RCIIS namespace, privilege escalation attempt | Within 1 hour |
| P3 — Medium | WARNING | Shell spawned in container, sensitive file read, failed OIDC login spike | Within 4 hours |
| P4 — Low | NOTICE / INFORMATIONAL | Policy violation in audit mode, minor config drift | Next business day |
Alert Triage Workflow¶
Alert fires (Falco / Tracee / Trivy / Keycloak)
|
v
+------------------+
| 1. Acknowledge | Assign an owner, update incident channel
+------------------+
|
v
+------------------+
| 2. Classify | Determine severity (P1–P4), confirm not a false positive
+------------------+
|
v
+------------------+
| 3. Contain | Isolate affected pod/node/namespace if P1/P2
+------------------+
|
v
+------------------+
| 4. Investigate | Gather evidence from Falco, Tracee, Kubernetes audit logs
+------------------+
|
v
+------------------+
| 5. Remediate | Fix root cause, patch, rotate credentials
+------------------+
|
v
+------------------+
| 6. Recover | Restore normal operations, remove isolation
+------------------+
|
v
+------------------+
| 7. Post-mortem | Document findings, update rules/policies, share lessons
+------------------+
Containment Procedures¶
Isolate a Pod¶
Apply a Kubernetes NetworkPolicy (enforced by Cilium) to block all traffic to/from the compromised pod:
# Emergency network isolation for a specific pod
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      <label-key>: <label-value>  # Labels of the compromised pod
  policyTypes:
    - Ingress
    - Egress
EOF
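Before relying on the isolation, it is worth confirming the policy actually selects the compromised pod, since a label typo leaves the pod fully connected. A quick check, using the same namespace and label placeholders as above:

```shell
# Confirm the policy exists and shows the expected pod selector
kubectl -n <namespace> describe networkpolicy emergency-isolate

# Confirm the compromised pod actually carries the selected label
kubectl -n <namespace> get pods -l <label-key>=<label-value> -o wide
```

With `policyTypes` listing both Ingress and Egress and no rules defined, the policy denies all traffic for matched pods; note this only takes effect if the CNI enforces NetworkPolicy.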
Isolate a Node¶
If a node is suspected of compromise (container escape, kernel-level threat):
# Cordon the node — prevent new pods from scheduling
kubectl cordon <node-name>
# Drain the node — evict workloads (except DaemonSets)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# For severe compromise — reset the node entirely via Talos
talosctl -n <node-ip> reset --graceful=false
Warning
Node reset destroys all data on the node. Only use this for confirmed compromises where the node's integrity cannot be trusted.
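Before deciding on a reset, confirm the containment took effect. A sketch (node name is a placeholder):

```shell
# The node should report SchedulingDisabled after the cordon
kubectl get node <node-name>

# Confirm no workload pods remain on the node (DaemonSets are expected to stay)
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
```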
Lock a Keycloak User¶
If a user account is suspected of compromise:
# Disable the user in Keycloak (via API)
KEYCLOAK_TOKEN=$(curl -s -X POST \
  "https://auth.rciis.eac.int/realms/master/protocol/openid-connect/token" \
  -d "client_id=admin-cli" \
  -d "username=admin" \
  -d "password=<admin-password>" \
  -d "grant_type=password" | jq -r .access_token)

curl -s -X PUT \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
Also terminate all active sessions for the user:
curl -s -X POST \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/logout" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN"
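To confirm both actions took effect, re-read the user and their session list (reusing `$KEYCLOAK_TOKEN` and the same `<user-id>` placeholder):

```shell
# "enabled" should now be false
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '{username, enabled}'

# The active session count should now be 0
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/sessions" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | jq 'length'
```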
Falco Alert Playbooks¶
Shell Spawned in Container¶
Alert: Shell spawned in container (user=... container=... image=...)
| Step | Action |
|---|---|
| 1 | Confirm the pod and image — is this a legitimate debugging session? |
| 2 | Check who initiated the kubectl exec — review Kubernetes audit logs |
| 3 | If unauthorised: isolate the pod, capture Tracee forensic data, investigate the user's access |
| 4 | If legitimate: document the exception, consider adding a policy exception for the specific workload |
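For step 2, the relevant Kubernetes audit events carry `objectRef.subresource == "exec"`. A sketch of the query, assuming the audit log is exported as JSON lines to a file (the filename is a placeholder for however your aggregation stack exposes it):

```shell
# Who exec'ed into the pod, when, and from where
# (field names follow the Kubernetes audit event schema)
jq -r 'select(.objectRef.subresource == "exec"
              and .objectRef.name == "<pod-name>")
       | "\(.requestReceivedTimestamp) \(.user.username) \(.sourceIPs[0])"' \
  <audit-log-export>.json
```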
Unexpected Outbound Connection from RCIIS¶
Alert: Unexpected outbound connection from RCIIS pod (dest=...)
| Step | Action |
|---|---|
| 1 | Identify the destination IP/hostname — is it a known service? |
| 2 | Check if the connection is data exfiltration — review Tracee network captures |
| 3 | If suspicious: isolate the pod immediately (P1), capture memory dump if possible |
| 4 | Review the container image for tampering — compare hash against Harbor registry |
| 5 | If the image is compromised: quarantine the image in Harbor, deploy a known-good version |
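For step 4, compare the digest the node is actually running with the digest Harbor holds for the same tag (all names are placeholders; `skopeo`, if available, avoids pulling the image):

```shell
# Digest of the image actually running in the pod
kubectl -n <namespace> get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[*].imageID}{"\n"}'

# Digest the registry advertises for the same tag
skopeo inspect --format '{{.Digest}}' docker://<harbor-host>/<project>/<image>:<tag>
```

A mismatch suggests either tampering or a mutated tag; either way, treat the running image as untrusted until explained.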
Sensitive File Read in Container¶
Alert: Sensitive file read in container (file=/etc/shadow ...)
| Step | Action |
|---|---|
| 1 | Identify which process read the file and why |
| 2 | If it's a runtime dependency (e.g., NSS library reading /etc/passwd): add to Falco exception list |
| 3 | If unexpected: investigate the container image for malware, check if the image was tampered with |
Critical CVE Discovered in Running Workload¶
Alert: CriticalVulnerabilityFound (Trivy Operator PrometheusRule)
| Step | Action |
|---|---|
| 1 | Identify the affected image and workload: kubectl get vulnerabilityreports -A -o json \| jq '.items[] \| select(.report.summary.criticalCount > 0)' |
| 2 | Check if a fixed image version is available — review the CVE's fixedVersion field |
| 3 | If a fix is available: update the image tag in Git, let FluxCD deploy the patched version |
| 4 | If no fix is available: assess exploitability — is the vulnerable package reachable? Does the workload have network access? |
| 5 | If exploitable and no fix: isolate the workload with a restrictive NetworkPolicy, or scale to zero if non-critical |
| 6 | Document the CVE, decision, and remediation timeline in the vulnerability tracker |
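Steps 1 and 2 can be combined into one query; field names below follow the Trivy Operator `VulnerabilityReport` schema:

```shell
# Critical CVEs with their affected package and fixed version, per report
kubectl get vulnerabilityreports -A -o json | \
  jq -r '.items[]
         | select(.report.summary.criticalCount > 0)
         | .report.vulnerabilities[]
         | select(.severity == "CRITICAL")
         | "\(.vulnerabilityID)  \(.resource)  fixed-in: \(.fixedVersion)"'
```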
Kyverno Webhook Failure — All Resource Creation Blocked¶
Alert: KyvernoWebhookDown (Kyverno PrometheusRule)
| Step | Action |
|---|---|
| 1 | Confirm the alert: kubectl -n kyverno get pods — are admission controller pods running? |
| 2 | Check pod logs: kubectl -n kyverno logs deployment/kyverno-admission-controller --previous |
| 3 | If pods are OOMKilled: increase memory limits in Helm values and redeploy |
| 4 | If pods are Pending: check node resources with kubectl describe pod |
| 5 | If the cluster is completely blocked (cannot deploy anything): emergency-disable the webhook |
Emergency webhook removal:
# Remove the validating webhook — resources bypass all policies
kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg
# Remove the mutating webhook
kubectl delete mutatingwebhookconfigurations kyverno-resource-mutating-webhook-cfg
# Fix the underlying issue, then restart Kyverno to recreate webhooks
kubectl -n kyverno rollout restart deployment kyverno-admission-controller
Warning
While the webhook is removed, all Kyverno policies are bypassed. Unauthorised images, privileged pods, and non-compliant resources can be deployed freely. Restore the webhook as quickly as possible.
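Once Kyverno is healthy again, confirm the webhooks were re-registered and enforcement is back:

```shell
# Webhook configurations should be recreated by the admission controller
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep kyverno

# Smoke test: a server-side dry run goes through admission, so a resource that
# violates policy should be rejected again (placeholder image; adjust to a
# policy you know is in Enforce mode)
kubectl run policy-smoke-test --image=<unauthorized-image> --dry-run=server
```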
Falco/Tracee eBPF Probe Failure on Node¶
Alert: FalcoPodDown (Falco PrometheusRule) or Tracee DaemonSet not fully running
| Step | Action |
|---|---|
| 1 | Identify which node(s) are affected: kubectl -n falco get pods -o wide \| grep -v Running |
| 2 | Check pod logs for eBPF errors: kubectl -n falco logs <pod-name> |
| 3 | Verify the node kernel supports BTF: talosctl -n <node-ip> read /sys/kernel/btf/vmlinux \| head -c 4 |
| 4 | If the node was recently upgraded: verify the Talos image includes the expected kernel version |
| 5 | If a single node is affected: cordon it and investigate; other nodes remain monitored |
| 6 | If all nodes are affected: check if a Talos upgrade changed the kernel; review Falco/Tracee compatibility with the new kernel version |
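The BTF check in step 3 can be run across the fleet in one pass (node IPs are placeholders). The first bytes of `/sys/kernel/btf/vmlinux` should be the BTF magic `eb 9f`:

```shell
# Check the BTF magic on every node
for ip in <node-1-ip> <node-2-ip> <node-3-ip>; do
  echo "== $ip =="
  talosctl -n "$ip" read /sys/kernel/btf/vmlinux | head -c 4 | od -An -tx1
done
```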
Tracee Forensic Investigation¶
When an incident requires deeper investigation, use Tracee's captured data:
Retrieve Events for a Specific Pod¶
# Filter Tracee events by pod name
kubectl -n tracee exec -it ds/tracee -- \
  tracee --filter container.name=<pod-name> --output json | \
  jq 'select(.eventName == "security_socket_connect" or .eventName == "process_execute")'
Extract Captured Artifacts¶
If Tracee is configured with capture enabled:
# List captured file writes
kubectl -n tracee exec -it ds/tracee -- ls /tmp/tracee/captures/
# Copy a captured artifact for offline analysis
kubectl -n tracee cp tracee-xxxxx:/tmp/tracee/captures/<artifact> ./evidence/
Correlate Falco and Tracee Events¶
- Note the timestamp from the Falco alert
- Query Tracee events in the same time window for the same pod/container
- Tracee provides additional context: full command-line arguments, file paths, network packet headers
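The steps above can be sketched with jq, assuming the Tracee events were exported to a JSON-lines file (`tracee-events.json` is a placeholder, and exact field names vary between Tracee versions):

```shell
# Events from the implicated container, printed as timestamp / event / process
jq -r 'select(.containerName == "<container-name>")
       | [.timestamp, .eventName, .processName] | @tsv' \
  tracee-events.json
```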
Kyverno Violation Handling¶
Audit Mode Violations¶
Kyverno policies in Audit mode generate PolicyReport violations without blocking resources:
# List all violations
kubectl get policyreports -A -o json | \
  jq -r '.items[].results[] | select(.result == "fail") |
         "\(.policy): \(.message)"'
Triage procedure:
- Group violations by policy — identify systemic issues vs one-off problems
- For legitimate workloads that need exceptions: create a PolicyException resource
- For genuine violations: file a remediation ticket, track to resolution
- Once all violations are resolved or excepted: switch the policy to Enforce mode
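A minimal PolicyException sketch (`kyverno.io/v2`; all names are placeholders, and note that exception support must be enabled in Kyverno's configuration):

```shell
cat <<EOF | kubectl apply -f -
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: <exception-name>
  namespace: <namespace>
spec:
  exceptions:
    - policyName: <policy-name>
      ruleNames:
        - <rule-name>
  match:
    any:
      - resources:
          kinds:
            - Deployment
          names:
            - <workload-name>
EOF
```

Keep exceptions as narrow as possible: one named workload, one policy, one rule.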
Keycloak Security Events¶
Monitor Failed Logins¶
Keycloak logs authentication events. Watch for brute force patterns:
# Check Keycloak event logs via API
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/events?type=LOGIN_ERROR&max=50" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '.[] | {time, userId, ipAddress, error}'
Response to brute force:
- If a single IP is generating many failed logins: block the IP at the Cloudflare WAF level
- If a single user account is targeted: temporarily lock the account, notify the user via a separate channel
- Enable Keycloak's built-in brute force detection: Realm Settings > Security Defenses > Brute Force Detection
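Brute force detection can also be toggled through the admin API as a sketch (field names follow the Keycloak realm representation, reusing the token obtained earlier; the admin console path above is the safer route):

```shell
# Enable built-in brute force detection: lock after 5 consecutive failures
curl -s -X PUT "https://auth.rciis.eac.int/admin/realms/rciis" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"bruteForceProtected": true, "failureFactor": 5}'
```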
Suspicious Session Activity¶
# List active sessions for a user
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/sessions" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '.[] | {ipAddress, start, lastAccess, clients}'
If a session originates from an unexpected location, terminate it and disable the account pending investigation.
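When only one session is suspect, it can be terminated individually by its id (taken from the session listing above), leaving the user's other sessions intact:

```shell
# Terminate a single suspicious session
curl -s -X DELETE \
  "https://auth.rciis.eac.int/admin/realms/rciis/sessions/<session-id>" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN"
```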
Post-Incident Actions¶
After resolving any security incident:
- [ ] Update Falco rules: Add new detection rules based on the attack pattern observed
- [ ] Update Kyverno policies: Add validation rules to prevent recurrence
- [ ] Rotate credentials: Rotate any credentials that may have been exposed
- [ ] Update Tracee policies: Add new forensic capture rules for the attack pattern
- [ ] Update Network Policies: Tighten network restrictions if lateral movement was involved
- [ ] Post-mortem document: Record timeline, root cause, impact, remediation, and lessons learned
- [ ] Share findings: Brief the team and update this runbook with new playbooks