
9.7 Incident Response

This page defines procedures for responding to security incidents detected by the RCIIS runtime security stack (Falco, Tracee) and other monitoring tools.

Alert Severity Classification

| Severity | Falco Priority | Examples | Response SLA |
|----------|----------------|----------|--------------|
| P1 — Critical | CRITICAL / EMERGENCY | Container escape, HSM compromise, data exfiltration alert | Immediate (within 15 minutes) |
| P2 — High | ERROR | Unexpected outbound connection from RCIIS namespace, privilege escalation attempt | Within 1 hour |
| P3 — Medium | WARNING | Shell spawned in container, sensitive file read, failed OIDC login spike | Within 4 hours |
| P4 — Low | NOTICE / INFORMATIONAL | Policy violation in audit mode, minor config drift | Next business day |
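
The mapping above can be encoded in an alert-routing hook so that pages carry the right severity automatically. A minimal sketch (the `falco_severity` helper is hypothetical, not part of the stack):

```shell
# Map a Falco priority to the incident severity defined in the table above.
# Hypothetical helper for an alert-routing hook; adapt to your alerting pipeline.
falco_severity() {
  case "$1" in
    CRITICAL|EMERGENCY)   echo "P1" ;;  # respond within 15 minutes
    ERROR)                echo "P2" ;;  # respond within 1 hour
    WARNING)              echo "P3" ;;  # respond within 4 hours
    NOTICE|INFORMATIONAL) echo "P4" ;;  # next business day
    *)                    echo "P3" ;;  # unknown priority: triage as medium
  esac
}
```

For example, `falco_severity ERROR` prints `P2`; unknown priorities deliberately fall through to P3 so they are never silently dropped.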

Alert Triage Workflow

Alert fires (Falco / Tracee / Trivy / Keycloak)
        |
        v
+------------------+
| 1. Acknowledge   |  Assign an owner, update incident channel
+------------------+
        |
        v
+------------------+
| 2. Classify      |  Determine severity (P1–P4), confirm not a false positive
+------------------+
        |
        v
+------------------+
| 3. Contain       |  Isolate affected pod/node/namespace if P1/P2
+------------------+
        |
        v
+------------------+
| 4. Investigate   |  Gather evidence from Falco, Tracee, Kubernetes audit logs
+------------------+
        |
        v
+------------------+
| 5. Remediate     |  Fix root cause, patch, rotate credentials
+------------------+
        |
        v
+------------------+
| 6. Recover       |  Restore normal operations, remove isolation
+------------------+
        |
        v
+------------------+
| 7. Post-mortem   |  Document findings, update rules/policies, share lessons
+------------------+

Containment Procedures

Isolate a Pod

Apply a Cilium or Kubernetes NetworkPolicy to block all traffic to/from the compromised pod:

# Emergency network isolation for a specific pod
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      <label-key>: <label-value>   # Labels of the compromised pod
  policyTypes:
    - Ingress
    - Egress
EOF
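
During a real incident you rarely want to edit placeholders by hand. A small generator function keeps the manifest above ready to fire; a sketch (the function name and argument order are our own, not an existing tool):

```shell
# Emit an emergency deny-all NetworkPolicy for the given namespace and pod label.
# Usage: isolate_manifest <namespace> <label-key> <label-value>
isolate_manifest() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: $1
spec:
  podSelector:
    matchLabels:
      $2: $3
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Apply during containment, e.g.:
#   isolate_manifest prod app suspicious-pod | kubectl apply -f -
# Remove after recovery with:
#   kubectl delete networkpolicy emergency-isolate -n prod
```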

Isolate a Node

If a node is suspected of compromise (container escape, kernel-level threat):

# Cordon the node — prevent new pods from scheduling
kubectl cordon <node-name>

# Drain the node — evict workloads (except DaemonSets)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# For severe compromise — reset the node entirely via Talos
talosctl -n <node-ip> reset --graceful=false

Warning

Node reset destroys all data on the node. Only use this for confirmed compromises where the node's integrity cannot be trusted.

Lock a Keycloak User

If a user account is suspected of compromise:

# Disable the user in Keycloak (via API)
KEYCLOAK_TOKEN=$(curl -s -X POST \
  "https://auth.rciis.eac.int/realms/master/protocol/openid-connect/token" \
  -d "client_id=admin-cli" \
  -d "username=admin" \
  -d "password=<admin-password>" \
  -d "grant_type=password" | jq -r .access_token)

curl -s -X PUT \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

Also terminate all active sessions for the user:

curl -s -X POST \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/logout" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN"
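
Under time pressure, the disable and logout calls are easy to fumble separately. A sketch that wraps both into one step (the `kc_user_url` and `lock_user` helpers are ours; it assumes `$KEYCLOAK_TOKEN` was obtained as shown above):

```shell
# Base URL for the rciis realm's admin API.
KC="https://auth.rciis.eac.int/admin/realms/rciis"

# Build a user admin-API URL, with an optional sub-path (hypothetical helper).
kc_user_url() {
  echo "$KC/users/$1${2-}"
}

# Disable an account and terminate all its sessions in one step (sketch).
lock_user() {
  curl -s -X PUT "$(kc_user_url "$1")" \
    -H "Authorization: Bearer $KEYCLOAK_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"enabled": false}'
  curl -s -X POST "$(kc_user_url "$1" /logout)" \
    -H "Authorization: Bearer $KEYCLOAK_TOKEN"
}
```

Disabling first, then logging out, ensures the user cannot re-authenticate between the two calls.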

Falco Alert Playbooks

Shell Spawned in Container

Alert: Shell spawned in container (user=... container=... image=...)

| Step | Action |
|------|--------|
| 1 | Confirm the pod and image — is this a legitimate debugging session? |
| 2 | Check who initiated the kubectl exec — review Kubernetes audit logs |
| 3 | If unauthorised: isolate the pod, capture Tracee forensic data, investigate the user's access |
| 4 | If legitimate: document the exception, consider adding a policy exception for the specific workload |

Unexpected Outbound Connection from RCIIS

Alert: Unexpected outbound connection from RCIIS pod (dest=...)

| Step | Action |
|------|--------|
| 1 | Identify the destination IP/hostname — is it a known service? |
| 2 | Check if the connection is data exfiltration — review Tracee network captures |
| 3 | If suspicious: isolate the pod immediately (P1), capture memory dump if possible |
| 4 | Review the container image for tampering — compare hash against Harbor registry |
| 5 | If the image is compromised: quarantine the image in Harbor, deploy a known-good version |

Sensitive File Read in Container

Alert: Sensitive file read in container (file=/etc/shadow ...)

| Step | Action |
|------|--------|
| 1 | Identify which process read the file and why |
| 2 | If it's a runtime dependency (e.g., NSS library reading /etc/passwd): add to Falco exception list |
| 3 | If unexpected: investigate the container image for malware, check if the image was tampered with |

Critical CVE Discovered in Running Workload

Alert: CriticalVulnerabilityFound (Trivy Operator PrometheusRule)

| Step | Action |
|------|--------|
| 1 | Identify the affected image and workload: kubectl get vulnerabilityreports -A -o json \| jq '.items[] \| select(.report.summary.criticalCount > 0)' |
| 2 | Check if a fixed image version is available — review the CVE's fixedVersion field |
| 3 | If a fix is available: update the image tag in Git, let FluxCD deploy the patched version |
| 4 | If no fix is available: assess exploitability — is the vulnerable package reachable? Does the workload have network access? |
| 5 | If exploitable and no fix: isolate the workload with a restrictive NetworkPolicy, or scale to zero if non-critical |
| 6 | Document the CVE, decision, and remediation timeline in the vulnerability tracker |
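
The decision in steps 2–5 reduces to two inputs: whether a fixed version exists, and whether the vulnerability is exploitable in context. A sketch of that decision as a function (the `cve_action` helper is hypothetical):

```shell
# Decide the next action for a critical CVE, following steps 2-5 above.
# Usage: cve_action <fixed-version-or-empty-string> <exploitable:yes|no>
cve_action() {
  if [ -n "$1" ]; then
    # A fix exists: always prefer patching via GitOps.
    echo "patch: update the image tag in Git to a build with $1"
  elif [ "$2" = "yes" ]; then
    # Exploitable with no fix: contain the workload.
    echo "isolate: apply a restrictive NetworkPolicy or scale to zero"
  else
    # Not currently exploitable: record it and watch for a fix.
    echo "track: document in the vulnerability tracker and monitor"
  fi
}
```

Whatever branch is taken, step 6 (documenting the decision) still applies.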

Kyverno Webhook Failure — All Resource Creation Blocked

Alert: KyvernoWebhookDown (Kyverno PrometheusRule)

| Step | Action |
|------|--------|
| 1 | Confirm the alert: kubectl -n kyverno get pods — are admission controller pods running? |
| 2 | Check pod logs: kubectl -n kyverno logs deployment/kyverno-admission-controller --previous |
| 3 | If pods are OOMKilled: increase memory limits in Helm values and redeploy |
| 4 | If pods are Pending: check node resources with kubectl describe pod |
| 5 | If the cluster is completely blocked (cannot deploy anything): emergency-disable the webhook |

Emergency webhook removal:

# Remove the validating webhook — resources bypass all policies
kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg

# Remove the mutating webhook
kubectl delete mutatingwebhookconfigurations kyverno-resource-mutating-webhook-cfg

# Fix the underlying issue, then restart Kyverno to recreate webhooks
kubectl -n kyverno rollout restart deployment kyverno-admission-controller

Warning

While the webhook is removed, all Kyverno policies are bypassed. Unauthorised images, privileged pods, and non-compliant resources can be deployed freely. Restore the webhook as quickly as possible.

Falco/Tracee eBPF Probe Failure on Node

Alert: FalcoPodDown (Falco PrometheusRule) or Tracee DaemonSet not fully running

| Step | Action |
|------|--------|
| 1 | Identify which node(s) are affected: kubectl -n falco get pods -o wide \| grep -v Running |
| 2 | Check pod logs for eBPF errors: kubectl -n falco logs <pod-name> |
| 3 | Verify the node kernel supports BTF: talosctl -n <node-ip> read /sys/kernel/btf/vmlinux \| head -c 4 |
| 4 | If the node was recently upgraded: verify the Talos image includes the expected kernel version |
| 5 | If a single node is affected: cordon it and investigate; other nodes remain monitored |
| 6 | If all nodes are affected: check if a Talos upgrade changed the kernel; review Falco/Tracee compatibility with the new kernel version |
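
Step 3 works because a valid BTF blob starts with the magic 0xeb9f (stored little-endian, so the first bytes on disk are 9f eb). A sketch that performs the same check on a copy of /sys/kernel/btf/vmlinux (the `has_btf_magic` helper name is ours):

```shell
# Return success if the file starts with the BTF magic (0xeb9f little-endian,
# i.e. bytes 9f eb on disk) - the same check as step 3, run locally.
has_btf_magic() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' ')" = "9feb" ]
}

# Example: fetch the blob from the node first, then verify it:
#   talosctl -n <node-ip> read /sys/kernel/btf/vmlinux > vmlinux.btf
#   has_btf_magic vmlinux.btf && echo "BTF present"
```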

Tracee Forensic Investigation

When an incident requires deeper investigation, use Tracee's captured data:

Retrieve Events for a Specific Pod

# Filter Tracee events by pod name
kubectl -n tracee exec -it ds/tracee -- \
  tracee --filter container.name=<pod-name> --output json | \
  jq 'select(.eventName == "security_socket_connect" or .eventName == "process_execute")'

Extract Captured Artifacts

If Tracee is configured with capture enabled:

# List captured file writes
kubectl -n tracee exec -it ds/tracee -- ls /tmp/tracee/captures/

# Copy a captured artifact for offline analysis
kubectl -n tracee cp tracee-xxxxx:/tmp/tracee/captures/<artifact> ./evidence/

Correlate Falco and Tracee Events

  1. Note the timestamp from the Falco alert
  2. Query Tracee events in the same time window for the same pod/container
  3. Tracee provides additional context: full command-line arguments, file paths, network packet headers
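
Step 2 needs a concrete query window around the alert. A trivial sketch for computing one from an epoch timestamp (the `alert_window` helper is hypothetical; widen the half-width for slow-moving attacks):

```shell
# Print the start and end (epoch seconds) of a query window centred on a
# Falco alert timestamp. Usage: alert_window <epoch-seconds> <half-width-secs>
alert_window() {
  echo "$(($1 - $2)) $(($1 + $2))"
}

# Example: a 2-minute window around an alert:
#   alert_window 1700000000 60
```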

Kyverno Violation Handling

Audit Mode Violations

Kyverno policies in Audit mode generate PolicyReport violations without blocking resources:

# List all violations
kubectl get policyreports -A -o json | \
  jq -r '.items[].results[] | select(.result == "fail") |
    "\(.policy): \(.message)"'

Triage procedure:

  1. Group violations by policy — identify systemic issues vs one-off problems
  2. For legitimate workloads that need exceptions: create a PolicyException resource
  3. For genuine violations: file a remediation ticket, track to resolution
  4. Once all violations are resolved or excepted: switch the policy to Enforce mode
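
Step 1's grouping can be done directly on the `policy: message` lines produced by the jq filter above. A sketch using only coreutils (the `group_violations` helper name is ours):

```shell
# Count violations per policy from "policy: message" lines on stdin,
# most frequent first - systemic issues float to the top.
group_violations() {
  cut -d: -f1 | sort | uniq -c | sort -rn
}
```

A policy with dozens of hits across namespaces is usually a systemic gap (missing defaults in a Helm chart), while single-digit counts tend to be one-off workload problems.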

Keycloak Security Events

Monitor Failed Logins

Keycloak logs authentication events. Watch for brute force patterns:

# Check Keycloak event logs via API
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/events?type=LOGIN_ERROR&max=50" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '.[] | {time, userId, ipAddress, error}'

Response to brute force:

  1. If a single IP is generating many failed logins: block the IP at the Cloudflare WAF level
  2. If a single user account is targeted: temporarily lock the account, notify the user via a separate channel
  3. Enable Keycloak's built-in brute force detection: Realm Settings > Security Defenses > Brute Force Detection
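
To decide between responses 1 and 2, count failures per source IP. A sketch that assumes the event list has been flattened to one IP per line (e.g. with `jq -r '.[].ipAddress'` on the API output above); the `failed_login_ips` helper and its threshold argument are ours:

```shell
# Print source IPs with more than <threshold> failed logins, busiest first.
# Usage: ... | failed_login_ips <threshold>
failed_login_ips() {
  sort | uniq -c | sort -rn | awk -v t="$1" '$1 > t { print $1, $2 }'
}
```

A single IP dominating the output points to response 1 (WAF block); many IPs hammering one account points to response 2 (lock the account).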

Suspicious Session Activity

# List active sessions for a user
curl -s \
  "https://auth.rciis.eac.int/admin/realms/rciis/users/<user-id>/sessions" \
  -H "Authorization: Bearer $KEYCLOAK_TOKEN" | \
  jq '.[] | {ipAddress, start, lastAccess, clients}'

If a session originates from an unexpected location, terminate it and disable the account pending investigation.

Post-Incident Actions

After resolving any security incident:

  • [ ] Update Falco rules: Add new detection rules based on the attack pattern observed
  • [ ] Update Kyverno policies: Add validation rules to prevent recurrence
  • [ ] Rotate credentials: Rotate any credentials that may have been exposed
  • [ ] Update Tracee policies: Add new forensic capture rules for the attack pattern
  • [ ] Update Network Policies: Tighten network restrictions if lateral movement was involved
  • [ ] Post-mortem document: Record timeline, root cause, impact, remediation, and lessons learned
  • [ ] Share findings: Brief the team and update this runbook with new playbooks