# 5.2.5 Security Troubleshooting
This page covers common issues with the security stack and their solutions. For incident response procedures, see Incident Response.
## Kyverno

### Webhook Failures Blocking All Resource Creation
**Symptom:** `kubectl apply` returns errors like:

```text
Error from server (InternalError): Internal error occurred: failed calling webhook
"validate.kyverno.svc-fail": failed to call webhook: context deadline exceeded
```
**Cause:** Kyverno admission controller pods are down or unresponsive, and the webhook `failurePolicy` is set to `Fail` (the default).

**Resolution:**
```bash
# 1. Check Kyverno pod status
kubectl -n kyverno get pods

# 2. If pods are CrashLooping, check logs
kubectl -n kyverno logs deployment/kyverno-admission-controller --previous

# 3. If pods are Pending (resource pressure), check node resources
kubectl describe pods -n kyverno -l app.kubernetes.io/component=admission-controller
```
Emergency override — disable the webhook temporarily:
```bash
# This allows all resources to bypass Kyverno policies
kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg
kubectl delete mutatingwebhookconfigurations kyverno-resource-mutating-webhook-cfg
```
!!! warning
    Disabling webhooks removes ALL policy enforcement. Re-enable by restarting the admission controller:
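A minimal re-enable sketch, assuming the default deployment name from the Kyverno Helm chart (Kyverno re-registers its webhook configurations on startup):

```shell
# Restart the admission controller; on startup it recreates the
# validating/mutating webhook configurations it owns
kubectl -n kyverno rollout restart deployment/kyverno-admission-controller

# Confirm the webhook configurations have been recreated
kubectl get validatingwebhookconfigurations | grep kyverno
```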
### Large CRD Handling with Flux
**Symptom:** The Flux `HelmRelease` shows `DryRunFailed` or `ReconciliationFailed` for Kyverno, with errors referencing oversized resources.

**Cause:** Kyverno CRDs are large. Flux uses server-side apply by default, which avoids the annotation size limit that affects client-side apply tools. However, if CRDs are not configured for automatic management, Flux may fail to apply or upgrade them.

**Resolution:** Ensure `spec.install.crds: CreateReplace` and `spec.upgrade.crds: CreateReplace` are set in the Kyverno HelmRelease (these are already present in the base HelmRelease):
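In HelmRelease terms, the relevant fields look like this (a sketch; the `apiVersion` may differ with your Flux version):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kyverno
  namespace: kyverno
spec:
  install:
    crds: CreateReplace   # create and replace CRDs on install
  upgrade:
    crds: CreateReplace   # replace CRDs on upgrade instead of patching
```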
If a specific CRD is still too large for the API server to accept, annotate the Kustomization to force replacement:
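One way to do this is Flux's per-object force annotation, sketched below on a Kyverno CRD; verify the annotation key and accepted value against your kustomize-controller version's documentation:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusterpolicies.kyverno.io
  annotations:
    # Tells Flux to delete and recreate this object if a patch fails
    kustomize.toolkit.fluxcd.io/force: enabled
```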
!!! warning
    The force annotation causes Flux to delete and recreate resources that cannot be patched. Only use this when the CRD is genuinely too large to update in-place.
### Policy Syntax Errors
**Symptom:** A `ClusterPolicy` fails to apply or behaves unexpectedly.

**Debug steps:**

```bash
# Validate policy syntax before applying
kyverno apply policy.yaml --resource test-resource.yaml

# Check Kyverno controller logs for policy compilation errors
kubectl -n kyverno logs deployment/kyverno-admission-controller | grep -i "error\|fail"

# Check events for policy-related issues
kubectl get events -n kyverno --sort-by='.lastTimestamp'
```
**Common mistakes:**

| Issue | Example | Fix |
|---|---|---|
| Invalid regex in pattern | `image: "harbor.*/rciis/*"` | Use `image: "harbor.devops.africa/*"` — patterns use glob, not regex |
| Missing anchor for optional fields | `securityContext: privileged: false` | Use `=(securityContext): =(privileged): false` for optional fields |
| Rule matches too broadly | Matching all `Pod` kinds | Add namespace exclusions or use `exclude` blocks |
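To illustrate the anchor syntax from the table, here is a minimal, hypothetical ClusterPolicy that only validates `privileged` when a `securityContext` block is actually present:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged   # hypothetical policy name, for illustration
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # =() conditional anchors: the check applies only if the
              # field exists, so Pods without a securityContext still pass
              - =(securityContext):
                  =(privileged): false
```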
### Debugging Admission Decisions

```bash
# Check which policies evaluated a specific resource
kubectl get policyreports -n <namespace> -o json | \
  jq '.items[].results[] | select(.resources[].name == "<resource-name>")'

# View admission controller decision logs (verbose)
kubectl -n kyverno logs deployment/kyverno-admission-controller -c kyverno | \
  grep "admission request"
```
## Trivy Operator

### Scan Jobs Stuck in Pending
**Symptom:** `VulnerabilityReports` are not generated. Scan jobs remain in `Pending` state.

**Cause:** Scan jobs exceed available node resources (CPU/memory requests too high for the cluster).

**Resolution:**

```bash
# Check pending scan jobs
kubectl get jobs -n trivy-system -l app.kubernetes.io/managed-by=trivy-operator

# Describe a stuck job for scheduling details
kubectl describe job <job-name> -n trivy-system
# Look for: "Insufficient cpu" or "Insufficient memory"
```
Reduce scan job resource requests:
```yaml
scanJob:
  resources:
    requests:
      cpu: 50m        # Lower from default
      memory: 128Mi
    limits:
      cpu: 250m
      memory: 256Mi
```
### Private Registry Auth Failures

**Symptom:** Scan jobs fail with `UNAUTHORIZED` or `401` errors.

```bash
# Check scan job logs
kubectl -n trivy-system logs job/<scan-job-name>
# Expected error: "failed to get authorization token" or "UNAUTHORIZED"
```
**Resolution:**

```bash
# Verify the registry secret exists
kubectl -n trivy-system get secret harbor-creds

# Verify the secret is correctly referenced in Helm values
kubectl -n trivy-system get deployment trivy-operator -o yaml | grep -A5 privateRegistry

# Recreate the secret if needed
kubectl -n trivy-system delete secret harbor-creds
kubectl -n trivy-system create secret docker-registry harbor-creds \
  --docker-server=harbor.devops.africa \
  --docker-username="${HARBOR_USERNAME}" \
  --docker-password="${HARBOR_PASSWORD}"
```
### Vulnerability DB Download Failures

**Symptom:** Scans fail with `failed to download vulnerability DB` errors.

**Cause:** The cluster cannot reach `ghcr.io` to download the Trivy vulnerability database.

**Resolution for air-gapped environments:**

```bash
# Pre-download the DB and host internally
oras pull ghcr.io/aquasecurity/trivy-db:2

# Push to internal Harbor registry
oras push harbor.devops.africa/trivy/trivy-db:2 db.tar.gz:application/vnd.aquasec.trivy.db.layer.v1.tar+gzip
```
Update Helm values to point to the internal mirror:
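Assuming the standard trivy-operator Helm chart, the mirror is configured via the `trivy.dbRepository` value (a sketch):

```yaml
trivy:
  # Point DB downloads at the internal Harbor mirror instead of ghcr.io
  dbRepository: harbor.devops.africa/trivy/trivy-db
```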
### Reports Not Generated for Specific Workloads

**Symptom:** Some workloads have no `VulnerabilityReport`.

```bash
# Check if the workload is excluded by label
kubectl get deployment <name> -n <ns> -o jsonpath='{.metadata.labels}'

# Check operator logs for skip reasons
kubectl -n trivy-system logs deployment/trivy-operator | grep "<workload-name>"
```
**Common causes:**

- Workload has the skip label (`trivy-operator.aquasecurity.github.io/skip-scan: "true"`)
- The image is from a registry that requires auth but no credentials are configured
- The scan job concurrent limit is reached — increase `operator.scanJobsConcurrentLimit`
## Falco

### eBPF Probe Load Failure on Talos

**Symptom:** Falco pods are in `CrashLoopBackOff` with errors about eBPF probe loading.
**Resolution:**

1. Verify the driver is set to `modern_ebpf` (not `module`):
2. If using `ebpf` (legacy), switch to `modern_ebpf`, which requires no kernel headers:
3. Verify the Talos kernel supports BTF:
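For steps 1 and 2, the Falco chart's driver selection looks like this (a values-file sketch):

```yaml
driver:
  kind: modern_ebpf   # CO-RE eBPF; no kernel headers or module build needed
```

For step 3, BTF support can be confirmed by checking that `/sys/kernel/btf/vmlinux` exists on the node (the Tracee section below shows a privileged-pod check that works on Talos).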
### High CPU from Noisy Rules

**Symptom:** Falco pods consume excessive CPU, and node performance degrades.

**Cause:** Broad rules matching high-frequency syscalls generate excessive processing.

**Resolution:**

```bash
# Check which rules fire most frequently
kubectl -n falco logs ds/falco --tail=1000 | \
  grep -oP 'Rule: \K[^)]+' | sort | uniq -c | sort -rn | head -10
```
Tune the noisy rule by adding exceptions:
```yaml
customRules:
  rciis-overrides.yaml: |-
    - rule: <noisy-rule-name>
      append: true
      condition: and not (proc.name in (expected-process-1, expected-process-2))
```
### Falcosidekick Not Forwarding Alerts

**Symptom:** Falco detects events but they do not appear in Slack/webhook destinations.

```bash
# Check Falcosidekick logs
kubectl -n falco logs deployment/falco-falcosidekick

# Check Falcosidekick metrics — are outputs succeeding?
kubectl -n falco port-forward svc/falco-falcosidekick 2801:2801 &
curl -s http://localhost:2801/metrics | grep falcosidekick_outputs
kill %1
```
Common causes:
| Cause | Fix |
|---|---|
| Webhook URL incorrect | Verify the URL in the secret or Helm values |
| Network policy blocking egress | Create a NetworkPolicy allowing egress from the `falco` namespace |
| Secret not mounted | Verify `existingSecret` name matches the deployed Secret |
| Minimum priority too high | Lower `minimumpriority` in the Falcosidekick config |
## Tracee

### Tracee Pods Not Ready on Talos Linux (0/1 Running)

**Symptom:** Tracee DaemonSet pods run but never become Ready. The startup probe fails with `connection refused` on port 3366, and pods restart repeatedly.

```bash
kubectl -n tracee get ds tracee
# READY: 0   AVAILABLE: 0

kubectl -n tracee logs ds/tracee
# {"level":"warn","msg":"Event canceled because of missing kernel symbol dependency","missing symbols":["_stext","_etext"]}
# {"level":"warn","msg":"Event canceled because of missing kernel symbol dependency","missing symbols":["sys_call_table"]}
# {"level":"info","msg":"enabled cap.SYS_PTRACE"}
# (then silence — health endpoint never starts)
```
**Root Cause:** Talos Linux has several differences from standard distributions that prevent Tracee from initialising:
| Issue | Talos Default | Impact |
|---|---|---|
| debugfs not mounted | `/sys/kernel/debug` empty | eBPF kprobes fail to attach |
| tracefs not mounted | `/sys/kernel/tracing` empty | Tracepoint events unavailable |
| bpffs not mounted | `/sys/fs/bpf` empty | Cannot pin BPF maps/programs |
| `kptr_restrict=2` | Kernel pointers hidden | Symbol addresses show as `0000000000000000` |
| `CONFIG_KALLSYMS_ALL` not set | Only exported symbols in `/proc/kallsyms` | `sys_call_table` not available (warning only) |
| Containerd socket path | `/run/containerd/containerd.sock` | Helm chart expects `/var/run/`, which doesn't exist on Talos |
| Health server config | `healthz: true` in config file | Ignored in Tracee v0.24.1 — must use `--server healthz` CLI flag |
**Resolution:** The HelmRelease uses `postRenderers` to apply Talos-specific patches. The key fixes are:

- A wrapper script mounts debugfs/tracefs/bpffs before starting Tracee
- CLI flags `--server healthz --server http-address=:3366 --server metrics` ensure the health endpoint binds
- A volume patch fixes the containerd socket hostPath from `/var/run/` to `/run/`
- `kptr_restrict` is set to `1` (allows privileged processes to read kernel pointers)
See `flux/infra/aws/tracee/helmrelease.yaml` for the full implementation.

**Diagnosing on a new Talos cluster:**
```bash
# Verify BTF is available (required for CO-RE eBPF)
kubectl run --rm -it btf-check --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"c","image":"busybox",
  "command":["ls","-la","/sys/kernel/btf/vmlinux"],
  "securityContext":{"privileged":true}}],"tolerations":[{"operator":"Exists"}]}}'

# Check kptr_restrict value
kubectl run --rm -it kptr-check --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"c","image":"busybox",
  "command":["cat","/proc/sys/kernel/kptr_restrict"],
  "securityContext":{"privileged":true}}],"tolerations":[{"operator":"Exists"}]}}'
```
### eBPF Program Attachment Failures

**Symptom:** Tracee pods fail to start with eBPF-related errors.

**Resolution:**

1. Verify `hostPID: true` is set in the values — Tracee needs host PID namespace access
2. Check that the node kernel supports BTF (same check as Falco above)
3. Ensure the Tracee container has the required Linux capabilities:
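A hypothetical values-file sketch of the capability set; the exact list Tracee needs varies by kernel version, and older kernels may additionally require `SYS_ADMIN`:

```yaml
securityContext:
  capabilities:
    add:
      - BPF           # load eBPF programs (kernel >= 5.8)
      - PERFMON       # attach to perf events
      - SYS_RESOURCE  # raise rlimits for eBPF maps
      - SYS_PTRACE    # read other processes' state
```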
### Missing Events for Specific Syscalls

**Symptom:** Expected events (e.g., `security_socket_connect`) are not captured.

```bash
# Verify the event filter includes the syscall
kubectl -n tracee get ds tracee -o yaml | grep -A20 "filter"
```
Resolution: Add the missing event to the filter list in the Helm values:
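The exact key depends on the chart version (newer Tracee versions select events via policies); a hypothetical sketch of adding an event to the config block:

```yaml
config:
  events:
    - security_socket_connect   # hypothetical key layout; check your chart's values schema
```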
### Capture Storage Filling Disk

**Symptom:** Tracee nodes report disk pressure. `/tmp/tracee/captures` grows unbounded.

**Resolution:**

```bash
# Check capture directory size on a node
kubectl -n tracee exec -it ds/tracee -- du -sh /tmp/tracee/captures/

# Clear old captures
kubectl -n tracee exec -it ds/tracee -- find /tmp/tracee/captures/ -mtime +7 -delete
```
For a permanent fix, disable captures or limit the capture directory:
```yaml
config:
  capture:
    write: false    # Disable file write capture
    exec: false     # Disable exec capture
    network: false
```
Or mount an emptyDir with a size limit:
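A hypothetical DaemonSet patch sketch using a size-limited `emptyDir` for the capture directory (volume names are illustrative):

```yaml
volumes:
  - name: tracee-captures
    emptyDir:
      sizeLimit: 1Gi          # kubelet evicts the pod if the limit is exceeded
volumeMounts:
  - name: tracee-captures
    mountPath: /tmp/tracee/captures
```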