
5.2.5 Security Troubleshooting

This page covers common issues with the security stack and their solutions. For incident response procedures, see Incident Response.


Kyverno

Webhook Failures Blocking All Resource Creation

Symptom: kubectl apply returns errors like:

Error from server (InternalError): Internal error occurred: failed calling webhook
"validate.kyverno.svc-fail": failed to call webhook: context deadline exceeded

Cause: Kyverno admission controller pods are down or unresponsive, and the webhook failurePolicy is set to Fail (the default).

Resolution:

# 1. Check Kyverno pod status
kubectl -n kyverno get pods

# 2. If pods are CrashLooping, check logs
kubectl -n kyverno logs deployment/kyverno-admission-controller --previous

# 3. If pods are Pending (resource pressure), check node resources
kubectl describe pods -n kyverno -l app.kubernetes.io/component=admission-controller

Emergency override — disable the webhook temporarily:

# This allows all resources to bypass Kyverno policies
kubectl delete validatingwebhookconfigurations kyverno-resource-validating-webhook-cfg
kubectl delete mutatingwebhookconfigurations kyverno-resource-mutating-webhook-cfg

Warning

Disabling webhooks removes ALL policy enforcement. Re-enable by restarting the admission controller:

kubectl -n kyverno rollout restart deployment kyverno-admission-controller
# Kyverno will recreate the webhook configurations on startup
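A less drastic alternative to deleting the webhook configurations is to make individual policies fail open. A minimal sketch, assuming a hypothetical require-registry policy (spec.failurePolicy is a standard ClusterPolicy field; Ignore admits resources when the webhook cannot be reached):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-registry       # hypothetical policy name
spec:
  failurePolicy: Ignore        # fail open: resources are admitted if Kyverno is down
  validationFailureAction: Enforce
  rules: []                    # rules elided for brevity
```

Fail-open trades enforcement for availability, so reserve it for policies that must never block cluster recovery.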

Large CRD Handling with Flux

Symptom: Flux HelmRelease shows DryRunFailed or ReconciliationFailed for Kyverno, with errors referencing oversized resources.

Cause: Kyverno CRDs are large. Flux uses server-side apply by default, which avoids the annotation size limit issue that affects client-side apply tools. However, if CRDs are not configured for automatic management, Flux may fail to apply or upgrade them.

Resolution: Ensure spec.install.crds: CreateReplace and spec.upgrade.crds: CreateReplace are set in the Kyverno HelmRelease (these are already present in the base HelmRelease):

install:
  crds: CreateReplace
upgrade:
  crds: CreateReplace

If a specific CRD is still too large for the API server to accept, annotate the Kustomization to force replacement:

kubectl annotate kustomization kyverno \
  kustomize.toolkit.fluxcd.io/force=enabled \
  -n flux-system

Warning

The force annotation causes Flux to delete and recreate resources that cannot be patched. Only use this when the CRD is genuinely too large to update in-place.

Policy Syntax Errors

Symptom: ClusterPolicy fails to apply or behaves unexpectedly.

Debug steps:

# Validate policy syntax before applying
kyverno apply policy.yaml --resource test-resource.yaml

# Check Kyverno controller logs for policy compilation errors
kubectl -n kyverno logs deployment/kyverno-admission-controller | grep -i "error\|fail"

# Check events for policy-related issues
kubectl get events -n kyverno --sort-by='.lastTimestamp'

Common mistakes:

| Issue | Example | Fix |
| --- | --- | --- |
| Invalid regex in pattern | image: "harbor.*/rciis/*" | Use image: "harbor.devops.africa/*" — patterns use glob, not regex |
| Missing anchor for optional fields | securityContext: privileged: false | Use =(securityContext): =(privileged): false for optional fields |
| Rule matches too broadly | Matching all Pod kinds | Add namespace exclusions or use exclude blocks |
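The glob-vs-regex distinction can be sanity-checked locally, since Kyverno-style image patterns behave like shell globs. A quick illustration using the shell's own glob matching (image names are hypothetical):

```shell
# case patterns use glob semantics: * matches any run of characters,
# including slashes — the same behaviour as Kyverno image patterns.
matches() {
  case "$1" in
    $2) echo "match" ;;
    *)  echo "no match" ;;
  esac
}

matches "harbor.devops.africa/rciis/app:1.0" "harbor.devops.africa/*"   # match
matches "docker.io/library/nginx:latest"     "harbor.devops.africa/*"   # no match
```

A regex-style pattern such as harbor.*/rciis/* would only match by accident here, because . and * are treated literally and as glob wildcards respectively.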

Debugging Admission Decisions

# Check which policies evaluated a specific resource
kubectl get policyreports -n <namespace> -o json | \
  jq '.items[].results[] | select(.resources[].name == "<resource-name>")'

# View admission controller decision logs (verbose)
kubectl -n kyverno logs deployment/kyverno-admission-controller -c kyverno | \
  grep "admission request"
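If jq filters like the one above return nothing, it helps to dry-run them against a captured report. A self-contained example with a fabricated PolicyReport document (all names are hypothetical):

```shell
# Fabricated PolicyReport list, mimicking the shape of
# `kubectl get policyreports -o json`
cat > /tmp/policyreports.json <<'EOF'
{"items": [{"results": [
  {"policy": "require-registry", "result": "fail",
   "resources": [{"kind": "Pod", "name": "web-6d4cf56db6-abcde"}]},
  {"policy": "require-limits", "result": "pass",
   "resources": [{"kind": "Pod", "name": "api-0"}]}
]}]}
EOF

# Same select() shape as above: keep only results mentioning the resource
jq -r '.items[].results[]
       | select(.resources[].name == "web-6d4cf56db6-abcde")
       | "\(.policy): \(.result)"' /tmp/policyreports.json
# prints: require-registry: fail
```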

Trivy Operator

Scan Jobs Stuck in Pending

Symptom: VulnerabilityReports are not generated. Scan jobs remain in Pending state.

Cause: Scan jobs exceed available node resources (CPU/memory requests too high for the cluster).

Resolution:

# Check pending scan jobs
kubectl get jobs -n trivy-system -l app.kubernetes.io/managed-by=trivy-operator

# Describe a stuck job for scheduling details
kubectl describe job <job-name> -n trivy-system
# Look for: "Insufficient cpu" or "Insufficient memory"

Reduce scan job resource requests:

values-trivy-operator.yaml
scanJob:
  resources:
    requests:
      cpu: 50m      # Lower from default
      memory: 128Mi
    limits:
      cpu: 250m
      memory: 256Mi

Private Registry Auth Failures

Symptom: Scan jobs fail with UNAUTHORIZED or 401 errors.

# Check scan job logs
kubectl -n trivy-system logs job/<scan-job-name>
# Expected error: "failed to get authorization token" or "UNAUTHORIZED"

Resolution:

# Verify the registry secret exists
kubectl -n trivy-system get secret harbor-creds

# Verify the secret is correctly referenced in Helm values
kubectl -n trivy-system get deployment trivy-operator -o yaml | grep -A5 privateRegistry

# Recreate the secret if needed
kubectl -n trivy-system delete secret harbor-creds
kubectl -n trivy-system create secret docker-registry harbor-creds \
  --docker-server=harbor.devops.africa \
  --docker-username="${HARBOR_USERNAME}" \
  --docker-password="${HARBOR_PASSWORD}"
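If auth still fails after recreating the secret, check that the stored .dockerconfigjson has the shape the scan jobs expect. A self-contained sketch of that structure (placeholder credentials, not real ones):

```shell
# kubectl create secret docker-registry produces a payload of the form
# {"auths": {"<server>": {"auth": base64("<user>:<pass>")}}}
HARBOR_USERNAME="robot-trivy"    # placeholder
HARBOR_PASSWORD="example"        # placeholder
auth=$(printf '%s:%s' "$HARBOR_USERNAME" "$HARBOR_PASSWORD" | base64)
printf '{"auths":{"harbor.devops.africa":{"auth":"%s"}}}\n' "$auth"
```

Compare this against what is actually stored in the cluster with kubectl -n trivy-system get secret harbor-creds -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d — a mismatch in the server key (for example a missing port) is a common cause of 401s.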

Vulnerability DB Download Failures

Symptom: Scans fail with failed to download vulnerability DB errors.

Cause: The cluster cannot reach ghcr.io to download the Trivy vulnerability database.

Resolution for air-gapped environments:

# Pre-download the DB and host internally
oras pull ghcr.io/aquasecurity/trivy-db:2
# Push to internal Harbor registry
oras push harbor.devops.africa/trivy/trivy-db:2 db.tar.gz:application/vnd.aquasec.trivy.db.layer.v1.tar+gzip

Update Helm values to point to the internal mirror:

values-trivy-operator.yaml
trivy:
  dbRepository: harbor.devops.africa/trivy/trivy-db

Reports Not Generated for Specific Workloads

Symptom: Some workloads have no VulnerabilityReport.

# Check if the workload is excluded by label
kubectl get deployment <name> -n <ns> -o jsonpath='{.metadata.labels}'

# Check operator logs for skip reasons
kubectl -n trivy-system logs deployment/trivy-operator | grep "<workload-name>"

Common causes:

  • Workload has the skip label (trivy-operator.aquasecurity.github.io/skip-scan: "true")
  • The image is from a registry that requires auth but no credentials are configured
  • The scan job concurrent limit is reached — increase operator.scanJobsConcurrentLimit
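For the concurrency case, the limit is a Helm value; a sketch of the override (the default in recent trivy-operator charts is 10, but verify against your chart version):

```yaml
operator:
  scanJobsConcurrentLimit: 20   # raise if many workloads queue behind the limit
```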

Falco

eBPF Probe Load Failure on Talos

Symptom: Falco pods are CrashLoopBackOff with errors about eBPF probe loading.

kubectl -n falco logs ds/falco | grep -i "error\|probe\|driver"

Resolution:

  1. Verify the driver is set to modern_ebpf (not module):

    kubectl -n falco get ds falco -o yaml | grep -A2 "driver"
    
  2. If using ebpf (legacy), switch to modern_ebpf, which uses CO-RE and needs neither kernel headers nor a pre-built probe:

    values-falco.yaml
    driver:
      kind: modern_ebpf
    
  3. Verify the Talos kernel supports BTF:

    talosctl -n <node-ip> read /sys/kernel/btf/vmlinux | head -c 4
    # If this returns data, BTF is available
    

High CPU from Noisy Rules

Symptom: Falco pods consume excessive CPU, node performance degrades.

Cause: Broad rules matching high-frequency syscalls generate excessive processing.

Resolution:

# Check which rules fire most frequently
kubectl -n falco logs ds/falco --tail=1000 | \
  grep -oP 'Rule: \K[^)]+' | sort | uniq -c | sort -rn | head -10

Tune the noisy rule by adding exceptions:

custom-rules-override.yaml
customRules:
  rciis-overrides.yaml: |-
    - rule: <noisy-rule-name>
      append: true
      condition: and not (proc.name in (expected-process-1, expected-process-2))
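Filled in for a concrete case, assuming the noisy rule is Falco's stock Read sensitive file untrusted rule and the matching processes are monitoring agents (both are assumptions; substitute your own rule and process names):

```yaml
customRules:
  rciis-overrides.yaml: |-
    - rule: Read sensitive file untrusted
      append: true
      condition: and not (proc.name in (prometheus, node-exporter))
```

After changing custom rules, watch the Falco logs for rule-loading errors before assuming the exception took effect.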

Falcosidekick Not Forwarding Alerts

Symptom: Falco detects events but they do not appear in Slack/webhook destinations.

# Check Falcosidekick logs
kubectl -n falco logs deployment/falco-falcosidekick

# Check Falcosidekick metrics — are outputs succeeding?
kubectl -n falco port-forward svc/falco-falcosidekick 2801:2801 &
curl -s http://localhost:2801/metrics | grep falcosidekick_outputs
kill %1

Common causes:

| Cause | Fix |
| --- | --- |
| Webhook URL incorrect | Verify the URL in the secret or Helm values |
| Network policy blocking egress | Create a NetworkPolicy allowing egress from the falco namespace |
| Secret not mounted | Verify existingSecret name matches the deployed Secret |
| Minimum priority too high | Lower minimumpriority in the Falcosidekick config |
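The minimum-priority floor lives per output in the Falcosidekick config; a hedged sketch for a Slack output (key names follow the falcosidekick chart conventions; verify against your values layout):

```yaml
falcosidekick:
  config:
    slack:
      webhookurl: https://hooks.slack.com/services/...   # placeholder; normally from the secret
      minimumpriority: "notice"   # forward notice and above; "" means no floor
```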

Tracee

Tracee Pods Not Ready on Talos Linux (0/1 Running)

Symptom: Tracee DaemonSet pods run but never become Ready. The startup probe fails with connection refused on port 3366, and pods restart repeatedly.

kubectl -n tracee get ds tracee
# READY: 0   AVAILABLE: 0

kubectl -n tracee logs ds/tracee
# {"level":"warn","msg":"Event canceled because of missing kernel symbol dependency","missing symbols":["_stext","_etext"]}
# {"level":"warn","msg":"Event canceled because of missing kernel symbol dependency","missing symbols":["sys_call_table"]}
# {"level":"info","msg":"enabled cap.SYS_PTRACE"}
# (then silence — health endpoint never starts)

Root Cause: Talos Linux has several differences from standard distributions that prevent Tracee from initialising:

| Issue | Talos Default | Impact |
| --- | --- | --- |
| debugfs not mounted | /sys/kernel/debug empty | eBPF kprobes fail to attach |
| tracefs not mounted | /sys/kernel/tracing empty | Tracepoint events unavailable |
| bpffs not mounted | /sys/fs/bpf empty | Cannot pin BPF maps/programs |
| kptr_restrict=2 | Kernel pointers hidden | Symbol addresses show as 0000000000000000 |
| CONFIG_KALLSYMS_ALL not set | Only exported symbols in /proc/kallsyms | sys_call_table not available (warning only) |
| Containerd socket path | /run/containerd/containerd.sock | Helm chart expects /var/run/, which doesn't exist on Talos |
| Health server config | healthz: true in config file | Ignored in Tracee v0.24.1 — must use the --server healthz CLI flag |
Resolution: The HelmRelease uses postRenderers to apply Talos-specific patches. The key fixes are:

  1. Wrapper script mounts debugfs/tracefs/bpffs before starting Tracee
  2. CLI flags --server healthz --server http-address=:3366 --server metrics ensure the health endpoint binds
  3. Volume patch fixes the containerd socket hostPath from /var/run/ to /run/
  4. kptr_restrict is set to 1 (allows privileged processes to read kernel pointers)

See flux/infra/aws/tracee/helmrelease.yaml for the full implementation.
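The kptr_restrict fix can also be made persistent at the Talos machine-config level rather than from a privileged container; a sketch, assuming the standard machine.sysctls mechanism (apply with talosctl patch machineconfig):

```yaml
machine:
  sysctls:
    kernel.kptr_restrict: "1"   # allow privileged readers to see kernel pointer addresses
```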

Diagnosing on a new Talos cluster:

# Verify BTF is available (required for CO-RE eBPF)
kubectl run --rm -it btf-check --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"c","image":"busybox",
  "command":["ls","-la","/sys/kernel/btf/vmlinux"],
  "securityContext":{"privileged":true}}],"tolerations":[{"operator":"Exists"}]}}' 

# Check kptr_restrict value
kubectl run --rm -it kptr-check --image=busybox --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"c","image":"busybox",
  "command":["cat","/proc/sys/kernel/kptr_restrict"],
  "securityContext":{"privileged":true}}],"tolerations":[{"operator":"Exists"}]}}'

eBPF Program Attachment Failures

Symptom: Tracee pods fail to start with eBPF-related errors.

kubectl -n tracee logs ds/tracee | grep -i "error\|bpf\|attach"

Resolution:

  1. Verify hostPID: true is set in the values — Tracee needs host PID namespace access
  2. Check that the node kernel supports BTF (same check as Falco above)
  3. Ensure the Tracee container has the required Linux capabilities:

    securityContext:
      privileged: true  # Required for eBPF program loading
    

Missing Events for Specific Syscalls

Symptom: Expected events (e.g., security_socket_connect) are not captured.

# Verify the event filter includes the syscall
kubectl -n tracee get ds tracee -o yaml | grep -A20 "filter"

Resolution: Add the missing event to the filter list in the Helm values:

values-tracee.yaml
config:
  filter:
    event:
      - security_socket_connect  # Ensure this is listed

Capture Storage Filling Disk

Symptom: Tracee nodes report disk pressure. /tmp/tracee/captures grows unbounded.

Resolution:

# Check capture directory size on a node
kubectl -n tracee exec -it ds/tracee -- du -sh /tmp/tracee/captures/

# Clear old captures
kubectl -n tracee exec -it ds/tracee -- find /tmp/tracee/captures/ -mtime +7 -delete

For a permanent fix, disable captures or limit the capture directory:

values-tracee.yaml
config:
  capture:
    write: false  # Disable file write capture
    exec: false   # Disable exec capture
    network: false

Or mount an emptyDir with a size limit:

volumes:
  - name: tracee-captures
    emptyDir:
      sizeLimit: 5Gi

General Debugging

# View Tracee health status
kubectl -n tracee logs ds/tracee --tail=50

# Check if eBPF programs are loaded
kubectl -n tracee exec -it ds/tracee -- tracee --list