
9.7 Incident Response

This page defines procedures for responding to security incidents detected by the RCIIS runtime security stack (Falco, Tracee) and other monitoring tools.

Alert Severity Classification

| Severity | Falco Priority | Examples | Response SLA |
|----------|----------------|----------|--------------|
| P1 — Critical | CRITICAL / EMERGENCY | Container escape, HSM compromise, data exfiltration alert | Immediate (within 15 minutes) |
| P2 — High | ERROR | Unexpected outbound connection from RCIIS namespace, privilege escalation attempt | Within 1 hour |
| P3 — Medium | WARNING | Shell spawned in container, sensitive file read, failed OIDC login spike | Within 4 hours |
| P4 — Low | NOTICE / INFORMATIONAL | Policy violation in audit mode, minor config drift | Next business day |
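
The mapping above can be encoded in an alert-routing hook so that pages carry the right severity automatically. A minimal sketch (the `falco_severity` helper is hypothetical, not part of the stack):

```shell
# Map a Falco priority to the incident severity defined in the table above.
# Hypothetical helper for an alert-routing hook; adapt to your alerting pipeline.
falco_severity() {
  case "$1" in
    CRITICAL|EMERGENCY)   echo "P1" ;;  # respond within 15 minutes
    ERROR)                echo "P2" ;;  # respond within 1 hour
    WARNING)              echo "P3" ;;  # respond within 4 hours
    NOTICE|INFORMATIONAL) echo "P4" ;;  # next business day
    *)                    echo "P3" ;;  # unknown priority: triage as medium
  esac
}
```

For example, `falco_severity ERROR` prints `P2`; unknown priorities deliberately fall through to P3 so they are never silently dropped.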

Alert Triage Workflow

Alert fires (Falco / Tracee / Trivy / Keycloak)
        |
        v
+------------------+
| 1. Acknowledge   |  Assign an owner, update incident channel
+------------------+
        |
        v
+------------------+
| 2. Classify      |  Determine severity (P1–P4), confirm not a false positive
+------------------+
        |
        v
+------------------+
| 3. Contain       |  Isolate affected pod/node/namespace if P1/P2
+------------------+
        |
        v
+------------------+
| 4. Investigate   |  Gather evidence from Falco, Tracee, Kubernetes audit logs
+------------------+
        |
        v
+------------------+
| 5. Remediate     |  Fix root cause, patch, rotate credentials
+------------------+
        |
        v
+------------------+
| 6. Recover       |  Restore normal operations, remove isolation
+------------------+
        |
        v
+------------------+
| 7. Post-mortem   |  Document findings, update rules/policies, share lessons
+------------------+

Containment Procedures

Isolate a Pod

Apply a Cilium or Kubernetes NetworkPolicy to block all traffic to/from the compromised pod:

# Emergency network isolation for a specific pod
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      <label-key>: <label-value>   # Labels of the compromised pod
  policyTypes:
    - Ingress
    - Egress
EOF
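
During a real incident you rarely want to edit placeholders by hand. A small generator function keeps the manifest above ready to fire; a sketch (the function name and argument order are our own, not an existing tool):

```shell
# Emit an emergency deny-all NetworkPolicy for the given namespace and pod label.
# Usage: isolate_manifest <namespace> <label-key> <label-value>
isolate_manifest() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: $1
spec:
  podSelector:
    matchLabels:
      $2: $3
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Apply during containment, e.g.:
#   isolate_manifest prod app suspicious-pod | kubectl apply -f -
# Remove after recovery with:
#   kubectl delete networkpolicy emergency-isolate -n prod
```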

Isolate a Node

If a node is suspected of compromise (container escape, kernel-level threat):

# Cordon the node — prevent new pods from scheduling
kubectl cordon <node-name>

# Drain the node — evict workloads (except DaemonSets)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# For severe compromise — reset the node entirely via Talos
talosctl -n <node-ip> reset --graceful=false

Warning

Node reset destroys all data on the node. Only use this for confirmed compromises where the node's integrity cannot be trusted.

Lock a Keycloak User

If a user account is suspected of compromise:

# Disable the user in Keycloak (via API)
KEYCLOAK_TOKEN=$(curl -s -X POST \
  "https://auth.rciis.eac.int/realms/master/protocol/openid-connect/token" \
  -d "client_id=admin-cli" \
  -d "username=admin" \
  -d "password=<admin-password>" \
  -d "grant_type=password" | jq -r .access_token)

curl -s -X PUT \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

Also terminate all active sessions for the user:

curl -s -X POST \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/logout" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN"
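
Under time pressure, the disable and logout calls are easy to fumble separately. A sketch that wraps both into one step (the `kc_user_url` and `lock_user` helpers are ours; it assumes `$KEYCLOAK_TOKEN` was obtained as shown above):

```shell
# Base URL for the rciis realm's admin API.
KC="https://auth.rciis.eac.int/admin/realms/rciis"

# Build a user admin-API URL, with an optional sub-path (hypothetical helper).
kc_user_url() {
  echo "$KC/users/$1${2-}"
}

# Disable an account and terminate all its sessions in one step (sketch).
lock_user() {
  curl -s -X PUT "$(kc_user_url "$1")" \
    -H "Authorization: Bearer $KEYCLOAK_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"enabled": false}'
  curl -s -X POST "$(kc_user_url "$1" /logout)" \
    -H "Authorization: Bearer $KEYCLOAK_TOKEN"
}
```

Disabling first, then logging out, ensures the user cannot re-authenticate between the two calls.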

Falco Alert Playbooks

Shell Spawned in Container

Alert: Shell spawned in container (user=... container=... image=...)

| Step | Action |
|------|--------|
| 1 | Confirm the pod and image — is this a legitimate debugging session? |
| 2 | Check who initiated the kubectl exec — review Kubernetes audit logs |
| 3 | If unauthorised: isolate the pod, capture Tracee forensic data, investigate the user's access |
| 4 | If legitimate: document the exception, consider adding a policy exception for the specific workload |

Unexpected Outbound Connection from RCIIS

Alert: Unexpected outbound connection from RCIIS pod (dest=...)

| Step | Action |
|------|--------|
| 1 | Identify the destination IP/hostname — is it a known service? |
| 2 | Check if the connection is data exfiltration — review Tracee network captures |
| 3 | If suspicious: isolate the pod immediately (P1), capture memory dump if possible |
| 4 | Review the container image for tampering — compare hash against Harbor registry |
| 5 | If the image is compromised: quarantine the image in Harbor, deploy a known-good version |

Sensitive File Read in Container

Alert: Sensitive file read in container (file=/etc/shadow ...)

| Step | Action |
|------|--------|
| 1 | Identify which process read the file and why |
| 2 | If it's a runtime dependency (e.g., NSS library reading /etc/passwd): add to Falco exception list |
| 3 | If unexpected: investigate the container image for malware, check if the image was tampered with |

Critical CVE Discovered in Running Workload

Alert: CriticalVulnerabilityFound (Trivy Operator PrometheusRule)

| Step | Action |
|------|--------|
| 1 | Identify the affected image and workload: kubectl get vulnerabilityreports -A -o json \| jq '.items[] \| select(.report.summary.criticalCount > 0)' |
| 2 | Check if a fixed image version is available — review the CVE's fixedVersion field |
| 3 | If a fix is available: update the image tag in Git, let FluxCD deploy the patched version |
| 4 | If no fix is available: assess exploitability — is the vulnerable package reachable? Does the workload have network access? |
| 5 | If exploitable and no fix: isolate the workload with a restrictive NetworkPolicy, or scale to zero if non-critical |
| 6 | Document the CVE, decision, and remediation timeline in the vulnerability tracker |
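
The decision in steps 2–5 reduces to two inputs: whether a fixed version exists, and whether the vulnerability is exploitable in context. A sketch of that decision as a function (the `cve_action` helper is hypothetical):

```shell
# Decide the next action for a critical CVE, following steps 2-5 above.
# Usage: cve_action <fixed-version-or-empty-string> <exploitable:yes|no>
cve_action() {
  if [ -n "$1" ]; then
    # A fix exists: always prefer patching via GitOps.
    echo "patch: update the image tag in Git to a build with $1"
  elif [ "$2" = "yes" ]; then
    # Exploitable with no fix: contain the workload.
    echo "isolate: apply a restrictive NetworkPolicy or scale to zero"
  else
    # Not currently exploitable: record it and watch for a fix.
    echo "track: document in the vulnerability tracker and monitor"
  fi
}
```

Whatever branch is taken, step 6 (documenting the decision) still applies.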

Kyverno Webhook Failure — All Resource Creation Blocked

Alert: KyvernoWebhookDown (Kyverno PrometheusRule)

| Step | Action |
|------|--------|
| 1 | Confirm the alert: kubectl -n kyverno get pods — are admission controller pods running? |
| 2 | Check pod logs: kubectl -n kyverno logs deployment/kyverno-admission-controller --previous |
| 3 | If pods are OOMKilled: increase memory limits in Helm values and redeploy |
| 4 | If pods are Pending: check node resources with kubectl describe pod |
| 5 | If the cluster is completely blocked (cannot deploy anything): emergency-disable the webhook |

Emergency webhook removal:

# Remove the validating webhook — resources bypass all policies
kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg

# Remove the mutating webhook
kubectl delete mutatingwebhookconfigurations kyverno-resource-mutating-webhook-cfg

# Fix the underlying issue, then restart Kyverno to recreate webhooks
kubectl -n kyverno rollout restart deployment kyverno-admission-controller

Warning

While the webhook is removed, all Kyverno policies are bypassed. Unauthorised images, privileged pods, and non-compliant resources can be deployed freely. Restore the webhook as quickly as possible.

Falco/Tracee eBPF Probe Failure on Node

Alert: FalcoPodDown (Falco PrometheusRule) or Tracee DaemonSet not fully running

| Step | Action |
|------|--------|
| 1 | Identify which node(s) are affected: kubectl -n falco get pods -o wide \| grep -v Running |
| 2 | Check pod logs for eBPF errors: kubectl -n falco logs <pod-name> |
| 3 | Verify the node kernel supports BTF: talosctl -n <node-ip> read /sys/kernel/btf/vmlinux \| head -c 4 |
| 4 | If the node was recently upgraded: verify the Talos image includes the expected kernel version |
| 5 | If a single node is affected: cordon it and investigate; other nodes remain monitored |
| 6 | If all nodes are affected: check if a Talos upgrade changed the kernel; review Falco/Tracee compatibility with the new kernel version |
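
Step 3 works because a valid BTF blob starts with the magic 0xeb9f (stored little-endian, so the first bytes on disk are 9f eb). A sketch that performs the same check on a copy of /sys/kernel/btf/vmlinux (the `has_btf_magic` helper name is ours):

```shell
# Return success if the file starts with the BTF magic (0xeb9f little-endian,
# i.e. bytes 9f eb on disk) - the same check as step 3, run locally.
has_btf_magic() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' ')" = "9feb" ]
}

# Example: fetch the blob from the node first, then verify it:
#   talosctl -n <node-ip> read /sys/kernel/btf/vmlinux > vmlinux.btf
#   has_btf_magic vmlinux.btf && echo "BTF present"
```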

Tracee Forensic Investigation

When an incident requires deeper investigation, use Tracee's captured data:

Retrieve Events for a Specific Pod

# Filter Tracee events by pod name
kubectl -n tracee exec -it ds/tracee -- \
  tracee --filter container.name=<pod-name> --output json | \
  jq 'select(.eventName == "security_socket_connect" or .eventName == "process_execute")'

Extract Captured Artifacts

If Tracee is configured with capture enabled:

# List captured file writes
kubectl -n tracee exec -it ds/tracee -- ls /tmp/tracee/captures/

# Copy a captured artifact for offline analysis
kubectl -n tracee cp tracee-xxxxx:/tmp/tracee/captures/<artifact> ./evidence/

Correlate Falco and Tracee Events

  1. Note the timestamp from the Falco alert
  2. Query Tracee events in the same time window for the same pod/container
  3. Tracee provides additional context: full command-line arguments, file paths, network packet headers
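
Step 2 needs a concrete query window around the alert. A trivial sketch for computing one from an epoch timestamp (the `alert_window` helper is hypothetical; widen the half-width for slow-moving attacks):

```shell
# Print the start and end (epoch seconds) of a query window centred on a
# Falco alert timestamp. Usage: alert_window <epoch-seconds> <half-width-secs>
alert_window() {
  echo "$(($1 - $2)) $(($1 + $2))"
}

# Example: a 2-minute window around an alert:
#   alert_window 1700000000 60
```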

Kyverno Violation Handling

Audit Mode Violations

Kyverno policies in Audit mode generate PolicyReport violations without blocking resources:

# List all violations
kubectl get policyreports -A -o json | \
  jq -r '.items[].results[] | select(.result == "fail") |
    "\(.policy): \(.message)"'

Triage procedure:

  1. Group violations by policy — identify systemic issues vs one-off problems
  2. For legitimate workloads that need exceptions: create a PolicyException resource
  3. For genuine violations: file a remediation ticket, track to resolution
  4. Once all violations are resolved or excepted: switch the policy to Enforce mode
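
Step 1's grouping can be done directly on the `policy: message` lines produced by the jq filter above. A sketch using only coreutils (the `group_violations` helper name is ours):

```shell
# Count violations per policy from "policy: message" lines on stdin,
# most frequent first - systemic issues float to the top.
group_violations() {
  cut -d: -f1 | sort | uniq -c | sort -rn
}
```

A policy with dozens of hits across namespaces is usually a systemic gap (missing defaults in a Helm chart), while single-digit counts tend to be one-off workload problems.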

Keycloak Security Events

Monitor Failed Logins

Keycloak logs authentication events. Watch for brute force patterns:

# Check Keycloak event logs via API
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/events?type=LOGIN_ERROR&max=50" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '.[] | {time, userId, ipAddress, error}'

Response to brute force:

  1. If a single IP is generating many failed logins: block the IP at the Cloudflare WAF level
  2. If a single user account is targeted: temporarily lock the account, notify the user via a separate channel
  3. Enable Keycloak's built-in brute force detection: Realm Settings > Security Defenses > Brute Force Detection
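
To decide between responses 1 and 2, count failures per source IP. A sketch that assumes the event list has been flattened to one IP per line (e.g. with `jq -r '.[].ipAddress'` on the API output above); the `failed_login_ips` helper and its threshold argument are ours:

```shell
# Print source IPs with more than <threshold> failed logins, busiest first.
# Usage: ... | failed_login_ips <threshold>
failed_login_ips() {
  sort | uniq -c | sort -rn | awk -v t="$1" '$1 > t { print $1, $2 }'
}
```

A single IP dominating the output points to response 1 (WAF block); many IPs hammering one account points to response 2 (lock the account).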

Suspicious Session Activity

# List active sessions for a user
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/sessions" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '.[] | {ipAddress, start, lastAccess, clients}'

If a session originates from an unexpected location, terminate it and disable the account pending investigation.

Post-Incident Actions

After resolving any security incident:

  • [ ] Update Falco rules: Add new detection rules based on the attack pattern observed
  • [ ] Update Kyverno policies: Add validation rules to prevent recurrence
  • [ ] Rotate credentials: Rotate any credentials that may have been exposed
  • [ ] Update Tracee policies: Add new forensic capture rules for the attack pattern
  • [ ] Update Network Policies: Tighten network restrictions if lateral movement was involved
  • [ ] Post-mortem document: Record timeline, root cause, impact, remediation, and lessons learned
  • [ ] Share findings: Brief the team and update this runbook with new playbooks