# 5.2.5 Security Troubleshooting
This page covers common issues with the security stack and their solutions. For incident response procedures, see Incident Response.
## Kyverno

### Webhook Failures Blocking All Resource Creation
**Symptom:** `kubectl apply` returns errors like:

```text
Error from server (InternalError): Internal error occurred: failed calling webhook
"validate.kyverno.svc-fail": failed to call webhook: context deadline exceeded
```
**Cause:** Kyverno admission controller pods are down or unresponsive, and the webhook `failurePolicy` is set to `Fail` (the default).

**Resolution:**
```bash
# 1. Check Kyverno pod status
kubectl -n kyverno get pods

# 2. If pods are CrashLooping, check logs
kubectl -n kyverno logs deployment/kyverno-admission-controller --previous

# 3. If pods are Pending (resource pressure), check node resources
kubectl describe pods -n kyverno -l app.kubernetes.io/component=admission-controller
```
Emergency override — disable the webhook temporarily:
```bash
# This allows all resources to bypass Kyverno policies
kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg
kubectl delete mutatingwebhookconfigurations kyverno-resource-mutating-webhook-cfg
```
!!! warning
    Disabling webhooks removes ALL policy enforcement. Re-enable by restarting the admission controller:
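A minimal re-enable sketch, assuming the default deployment name from the Kyverno Helm chart (Kyverno re-registers its webhook configurations on startup):

```shell
# Restart the admission controller; on startup it recreates the
# validating/mutating webhook configurations it owns
kubectl -n kyverno rollout restart deployment/kyverno-admission-controller

# Confirm the webhook configurations have been recreated
kubectl get validatingwebhookconfigurations | grep kyverno
```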
### Large CRD Handling with Flux
**Symptom:** The Flux `HelmRelease` shows `DryRunFailed` or `ReconciliationFailed` for Kyverno, with errors referencing oversized resources.

**Cause:** Kyverno CRDs are large. Flux uses server-side apply by default, which avoids the annotation size limit that affects client-side apply tools. However, if CRDs are not configured for automatic management, Flux may fail to apply or upgrade them.

**Resolution:** Ensure `spec.install.crds: CreateReplace` and `spec.upgrade.crds: CreateReplace` are set in the Kyverno HelmRelease (these are already present in the base HelmRelease):
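In HelmRelease terms, the relevant fields look like this (a sketch; the `apiVersion` may differ with your Flux version):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kyverno
  namespace: kyverno
spec:
  install:
    crds: CreateReplace   # create and replace CRDs on install
  upgrade:
    crds: CreateReplace   # replace CRDs on upgrade instead of patching
```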
If a specific CRD is still too large for the API server to accept, annotate the Kustomization to force replacement:
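One way to do this is Flux's per-object force annotation, sketched below on a Kyverno CRD; verify the annotation key and accepted value against your kustomize-controller version's documentation:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusterpolicies.kyverno.io
  annotations:
    # Tells Flux to delete and recreate this object if a patch fails
    kustomize.toolkit.fluxcd.io/force: enabled
```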
!!! warning
    The force annotation causes Flux to delete and recreate resources that cannot be patched. Only use this when the CRD is genuinely too large to update in-place.
### Policy Syntax Errors
**Symptom:** A `ClusterPolicy` fails to apply or behaves unexpectedly.

**Debug steps:**

```bash
# Validate policy syntax before applying
kyverno apply policy.yaml --resource test-resource.yaml

# Check Kyverno controller logs for policy compilation errors
kubectl -n kyverno logs deployment/kyverno-admission-controller | grep -i "error\|fail"

# Check events for policy-related issues
kubectl get events -n kyverno --sort-by='.lastTimestamp'
```
**Common mistakes:**

| Issue | Example | Fix |
|---|---|---|
| Invalid regex in pattern | `image: "harbor.*/rciis/*"` | Use `image: "harbor.devops.africa/*"` — patterns use glob, not regex |
| Missing anchor for optional fields | `securityContext: privileged: false` | Use `=(securityContext): =(privileged): false` for optional fields |
| Rule matches too broadly | Matching all `Pod` kinds | Add namespace exclusions or use `exclude` blocks |
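To illustrate the anchor syntax from the table, here is a minimal, hypothetical ClusterPolicy that only validates `privileged` when a `securityContext` block is actually present:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged   # hypothetical policy name, for illustration
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # =() conditional anchors: the check applies only if the
              # field exists, so Pods without a securityContext still pass
              - =(securityContext):
                  =(privileged): false
```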
### Debugging Admission Decisions

```bash
# Check which policies evaluated a specific resource
kubectl get policyreports -n <namespace> -o json | \
  jq '.items[].results[] | select(.resources[].name == "<resource-name>")'

# View admission controller decision logs (verbose)
kubectl -n kyverno logs deployment/kyverno-admission-controller -c kyverno | \
  grep "admission request"
```
## Trivy Operator

### Scan Jobs Stuck in Pending
**Symptom:** `VulnerabilityReports` are not generated. Scan jobs remain in `Pending` state.

**Cause:** Scan jobs exceed available node resources (CPU/memory requests too high for the cluster).

**Resolution:**

```bash
# Check pending scan jobs
kubectl get jobs -n trivy-system -l app.kubernetes.io/managed-by=trivy-operator

# Describe a stuck job for scheduling details
kubectl describe job <job-name> -n trivy-system
# Look for: "Insufficient cpu" or "Insufficient memory"
```
Reduce scan job resource requests:
```yaml
scanJob:
  resources:
    requests:
      cpu: 50m        # Lower from default
      memory: 128Mi
    limits:
      cpu: 250m
      memory: 256Mi
```
### Private Registry Auth Failures

**Symptom:** Scan jobs fail with `UNAUTHORIZED` or `401` errors.

```bash
# Check scan job logs
kubectl -n trivy-system logs job/<scan-job-name>
# Expected error: "failed to get authorization token" or "UNAUTHORIZED"
```
**Resolution:**

```bash
# Verify the registry secret exists
kubectl -n trivy-system get secret harbor-creds

# Verify the secret is correctly referenced in Helm values
kubectl -n trivy-system get deployment trivy-operator -o yaml | grep -A5 privateRegistry

# Recreate the secret if needed
kubectl -n trivy-system delete secret harbor-creds
kubectl -n trivy-system create secret docker-registry harbor-creds \
  --docker-server=harbor.devops.africa \
  --docker-username="${HARBOR_USERNAME}" \
  --docker-password="${HARBOR_PASSWORD}"
```
### Vulnerability DB Download Failures

**Symptom:** Scans fail with `failed to download vulnerability DB` errors.

**Cause:** The cluster cannot reach `ghcr.io` to download the Trivy vulnerability database.

**Resolution for air-gapped environments:**

```bash
# Pre-download the DB and host internally
oras pull ghcr.io/aquasecurity/trivy-db:2

# Push to internal Harbor registry
oras push harbor.devops.africa/trivy/trivy-db:2 db.tar.gz:application/vnd.aquasec.trivy.db.layer.v1.tar+gzip
```
Update Helm values to point to the internal mirror:
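Assuming the standard trivy-operator Helm chart, the mirror is configured via the `trivy.dbRepository` value (a sketch):

```yaml
trivy:
  # Point DB downloads at the internal Harbor mirror instead of ghcr.io
  dbRepository: harbor.devops.africa/trivy/trivy-db
```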
### Reports Not Generated for Specific Workloads

**Symptom:** Some workloads have no `VulnerabilityReport`.

```bash
# Check if the workload is excluded by label
kubectl get deployment <name> -n <ns> -o jsonpath='{.metadata.labels}'

# Check operator logs for skip reasons
kubectl -n trivy-system logs deployment/trivy-operator | grep "<workload-name>"
```
**Common causes:**

- Workload has the skip label (`trivy-operator.aquasecurity.github.io/skip-scan: "true"`)
- The image is from a registry that requires auth but no credentials are configured
- The scan job concurrent limit is reached — increase `operator.scanJobsConcurrentLimit`
## Falco

### eBPF Probe Load Failure on Talos

**Symptom:** Falco pods are in `CrashLoopBackOff` with errors about eBPF probe loading.
**Resolution:**

1. Verify the driver is set to `modern_ebpf` (not `module`):
2. If using `ebpf` (legacy), switch to `modern_ebpf`, which requires no kernel headers:
3. Verify the Talos kernel supports BTF:
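For steps 1 and 2, the Falco chart's driver selection looks like this (a values-file sketch):

```yaml
driver:
  kind: modern_ebpf   # CO-RE eBPF; no kernel headers or module build needed
```

For step 3, BTF support can be confirmed by checking that `/sys/kernel/btf/vmlinux` exists on the node (the Tracee section below shows a privileged-pod check that works on Talos).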
### High CPU from Noisy Rules

**Symptom:** Falco pods consume excessive CPU, and node performance degrades.

**Cause:** Broad rules matching high-frequency syscalls generate excessive processing.

**Resolution:**

```bash
# Check which rules fire most frequently
kubectl -n falco logs ds/falco --tail=1000 | \
  grep -oP 'Rule: \K[^)]+' | sort | uniq -c | sort -rn | head -10
```
Tune the noisy rule by adding exceptions:
```yaml
customRules:
  rciis-overrides.yaml: |-
    - rule: <noisy-rule-name>
      append: true
      condition: and not (proc.name in (expected-process-1, expected-process-2))
```
### Falcosidekick Not Forwarding Alerts

**Symptom:** Falco detects events but they do not appear in Slack/webhook destinations.

```bash
# Check Falcosidekick logs
kubectl -n falco logs deployment/falco-falcosidekick

# Check Falcosidekick metrics — are outputs succeeding?
kubectl -n falco port-forward svc/falco-falcosidekick 2801:2801 &
curl -s http://localhost:2801/metrics | grep falcosidekick_outputs
kill %1
```
Common causes:
| Cause | Fix |
|---|---|
| Webhook URL incorrect | Verify the URL in the secret or Helm values |
| Network policy blocking egress | Create a NetworkPolicy allowing egress from the `falco` namespace |
| Secret not mounted | Verify `existingSecret` name matches the deployed Secret |
| Minimum priority too high | Lower `minimumpriority` in the Falcosidekick config |
## Tracee

### Tracee Pods Not Ready on Talos Linux (0/1 Running)

**Symptom:** Tracee DaemonSet pods run but never become Ready. The startup probe fails with `connection refused` on port 3366, and pods restart repeatedly.

```bash
kubectl -n tracee get ds tracee
# READY: 0   AVAILABLE: 0

kubectl -n tracee logs ds/tracee
# {"level":"warn","msg":"Event canceled because of missing kernel symbol dependency","missing symbols":["_stext","_etext"]}
# {"level":"warn","msg":"Event canceled because of missing kernel symbol dependency","missing symbols":["sys_call_table"]}
# {"level":"info","msg":"enabled cap.SYS_PTRACE"}
# (then silence — health endpoint never starts)
```
**Root Cause:** Talos Linux has several differences from standard distributions that prevent Tracee from initialising:
| Issue | Talos Default | Impact |
|---|---|---|
| debugfs not mounted | `/sys/kernel/debug` empty | eBPF kprobes fail to attach |
| tracefs not mounted | `/sys/kernel/tracing` empty | Tracepoint events unavailable |
| bpffs not mounted | `/sys/fs/bpf` empty | Cannot pin BPF maps/programs |
| `kptr_restrict=2` | Kernel pointers hidden | Symbol addresses show as `0000000000000000` |
| `CONFIG_KALLSYMS_ALL` not set | Only exported symbols in `/proc/kallsyms` | `sys_call_table` not available (warning only) |
| Containerd socket path | `/run/containerd/containerd.sock` | Helm chart expects `/var/run/`, which doesn't exist on Talos |
| Health server config | `healthz: true` in config file | Ignored in Tracee v0.24.1 — must use `--server healthz` CLI flag |
**Resolution:** The HelmRelease uses `postRenderers` to apply Talos-specific patches. The key fixes are:

- A wrapper script mounts debugfs/tracefs/bpffs before starting Tracee
- CLI flags `--server healthz --server http-address=:3366 --server metrics` ensure the health endpoint binds
- A volume patch fixes the containerd socket hostPath from `/var/run/` to `/run/`
- `kptr_restrict` is set to `1` (allows privileged processes to read kernel pointers)
See `flux/infra/aws/tracee/helmrelease.yaml` for the full implementation.

**Diagnosing on a new Talos cluster:**
```bash
# Verify BTF is available (required for CO-RE eBPF)
kubectl run --rm -it btf-check --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"c","image":"busybox",
  "command":["ls","-la","/sys/kernel/btf/vmlinux"],
  "securityContext":{"privileged":true}}],"tolerations":[{"operator":"Exists"}]}}'

# Check kptr_restrict value
kubectl run --rm -it kptr-check --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"c","image":"busybox",
  "command":["cat","/proc/sys/kernel/kptr_restrict"],
  "securityContext":{"privileged":true}}],"tolerations":[{"operator":"Exists"}]}}'
```
### eBPF Program Attachment Failures

**Symptom:** Tracee pods fail to start with eBPF-related errors.

**Resolution:**

1. Verify `hostPID: true` is set in the values — Tracee needs host PID namespace access
2. Check that the node kernel supports BTF (same check as Falco above)
3. Ensure the Tracee container has the required Linux capabilities:
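A hypothetical values-file sketch of the capability set; the exact list Tracee needs varies by kernel version, and older kernels may additionally require `SYS_ADMIN`:

```yaml
securityContext:
  capabilities:
    add:
      - BPF           # load eBPF programs (kernel >= 5.8)
      - PERFMON       # attach to perf events
      - SYS_RESOURCE  # raise rlimits for eBPF maps
      - SYS_PTRACE    # read other processes' state
```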
### Missing Events for Specific Syscalls

**Symptom:** Expected events (e.g., `security_socket_connect`) are not captured.

```bash
# Verify the event filter includes the syscall
kubectl -n tracee get ds tracee -o yaml | grep -A20 "filter"
```
Resolution: Add the missing event to the filter list in the Helm values:
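The exact key depends on the chart version (newer Tracee versions select events via policies); a hypothetical sketch of adding an event to the config block:

```yaml
config:
  events:
    - security_socket_connect   # hypothetical key layout; check your chart's values schema
```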
### Capture Storage Filling Disk

**Symptom:** Tracee nodes report disk pressure. `/tmp/tracee/captures` grows unbounded.

**Resolution:**

```bash
# Check capture directory size on a node
kubectl -n tracee exec -it ds/tracee -- du -sh /tmp/tracee/captures/

# Clear old captures
kubectl -n tracee exec -it ds/tracee -- find /tmp/tracee/captures/ -mtime +7 -delete
```
For a permanent fix, disable captures or limit the capture directory:
```yaml
config:
  capture:
    write: false    # Disable file write capture
    exec: false     # Disable exec capture
    network: false
```
Or mount an emptyDir with a size limit:
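A hypothetical DaemonSet patch sketch using a size-limited `emptyDir` for the capture directory (volume names are illustrative):

```yaml
volumes:
  - name: tracee-captures
    emptyDir:
      sizeLimit: 1Gi          # kubelet evicts the pod if the limit is exceeded
volumeMounts:
  - name: tracee-captures
    mountPath: /tmp/tracee/captures
```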