9.6 Security Operations

This page defines the Day-2 operational procedures for maintaining the RCIIS security stack. For initial deployment, see Secure the Cluster. For incident response, see Incident Response.


Daily Operations

Check Grafana Security Dashboard

Review the following panels each morning:

| Panel | Source | Action Threshold |
|---|---|---|
| New CRITICAL/HIGH vulnerabilities | Trivy Operator metrics | Any new CRITICAL → triage immediately |
| Falco alert volume | Falcosidekick metrics | Spike > 2× baseline → investigate |
| Kyverno violation count | Kyverno metrics | New Enforce-mode violations → investigate blocked deployments |
| Failed login attempts | Keycloak event logs | > 10 failed logins for a single user → check for brute force |
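
The "spike > 2× baseline" threshold above can be sketched as a small shell helper. This is an illustration only: the function name `is_spike` and the sample numbers are hypothetical, and how the baseline is obtained (for example, an average over previous days) is an assumption this page does not specify.

```shell
# is_spike CURRENT BASELINE [FACTOR]
# Hypothetical helper: exit status 0 when CURRENT is strictly greater than
# FACTOR x BASELINE (FACTOR defaults to 2, matching the table above).
# All arguments are expected to be non-negative integers.
is_spike() {
  local current="$1" baseline="$2" factor="${3:-2}"
  [ "$current" -gt $((factor * baseline)) ]
}

# Example with illustrative numbers (not real measurements):
#   if is_spike 450 200; then echo "Falco alert volume spike: investigate"; fi
```

A count exactly equal to 2× the baseline is not treated as a spike here, because the table says "greater than".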

Review Falco Alert Volume

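# Quick check: events in the last hour, by priority
# (--tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod by default)
kubectl -n falco logs -l app.kubernetes.io/name=falcosidekick --since=1h --tail=-1 | \
  grep -oP '"priority":"[^"]*"' | sort | uniq -c | sort -rn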

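If the volume is abnormally high, check for a false-positive storm:

# Identify the five noisiest rules
# Note: `logs ds/falco` reads from a single pod of the DaemonSet; repeat per pod for a cluster-wide view.
# The pattern expects "(Rule: <name>)" in each output line; adjust it if your Falco output format differs.
kubectl -n falco logs ds/falco --since=1h --tail=5000 | \
  grep -oP 'Rule: \K[^)]+' | sort | uniq -c | sort -rn | head -5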

Review Kyverno Audit Violations

# Count audit-mode violations by policy
# (.results[]? skips reports with no results; cluster-scoped ClusterPolicyReports are not included)
kubectl get policyreports -A -o json | \
  jq -r '[.items[].results[]? | select(.result == "fail")] | group_by(.policy) | .[] |
    "\(.[0].policy): \(length) violations"'

Weekly Operations

Vulnerability Triage

Review all new VulnerabilityReports and assign remediation:

# List workloads with CRITICAL vulnerabilities
kubectl get vulnerabilityreports -A -o json | \
  jq -r '.items[] | select(.report.summary.criticalCount > 0) |
    "\(.metadata.namespace)/\(.metadata.labels["trivy-operator.resource.name"]) — CRITICAL: \(.report.summary.criticalCount), HIGH: \(.report.summary.highCount)"'

# Detailed CVE list for a specific workload
kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
  jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL" or .severity == "HIGH") |
    "\(.vulnerabilityID) | \(.severity) | \(.installedVersion) → \(.fixedVersion) | \(.resource)"'

Triage categories:

| Category | Action | Timeline |
|---|---|---|
| Fix available, CRITICAL | Update image, redeploy | Within 7 days |
| Fix available, HIGH | Schedule update | Within 30 days |
| No fix available, CRITICAL | Assess exploitability, consider workaround | Document risk acceptance |
| No fix available, HIGH | Monitor for fix release | Review monthly |
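
The triage table can also be expressed as a lookup, which may help when annotating an exported findings list. The sketch below is hypothetical: `triage_action` and its `yes`/`no` flag are illustrative names, and deriving the flag from whether `fixedVersion` is non-empty in the report is an assumption.

```shell
# triage_action SEVERITY FIX_AVAILABLE
# Hypothetical helper: prints the action and timeline from the triage table.
# SEVERITY is CRITICAL or HIGH; FIX_AVAILABLE is "yes" or "no".
# Returns 1 for combinations the table does not cover.
triage_action() {
  local severity="$1" fix_available="$2"
  case "${severity}:${fix_available}" in
    CRITICAL:yes) echo "Update image, redeploy (within 7 days)" ;;
    HIGH:yes)     echo "Schedule update (within 30 days)" ;;
    CRITICAL:no)  echo "Assess exploitability, consider workaround (document risk acceptance)" ;;
    HIGH:no)      echo "Monitor for fix release (review monthly)" ;;
    *)            echo "not covered by the triage table: ${severity}/${fix_available}" >&2
                  return 1 ;;
  esac
}

# Example:
#   triage_action CRITICAL yes
```

MEDIUM and LOW findings are deliberately rejected, since the weekly triage table only defines CRITICAL and HIGH categories.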

Falco Rule Tuning

Review false positive rates and adjust rules:

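# Check Falcosidekick output failure rate
kubectl -n falco port-forward svc/falco-falcosidekick 2801:2801 &
sleep 2  # give the port-forward time to establish before querying
# Metric names differ between Falcosidekick versions, so match on the common stem
# and compare the error-status series against the total.
curl -s http://localhost:2801/metrics | grep "falcosidekick_outputs"
kill %1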

For each high-volume rule, decide:

  1. True positive, expected behaviour → Add an exception condition to the rule
  2. True positive, unexpected → Investigate and remediate
  3. False positive → Tune the rule condition or add a process/container exclusion

Kyverno PolicyException Audit

# List all active policy exceptions
kubectl get policyexceptions -A

# Review each exception — is it still needed?
kubectl get policyexceptions -A -o json | \
  jq -r '.items[] |
    "\(.metadata.namespace)/\(.metadata.name) — exempts: \(.spec.exceptions[].policyName)"'

Remove exceptions for workloads that have been remediated. Every active exception should have a documented rationale in its annotations.


Vulnerability Management Lifecycle

Workflow

Discovery (Trivy scan)
    |
    v
Classification (CRITICAL / HIGH / MEDIUM / LOW)
    |
    v
Assignment (team / individual owner)
    |
    v
Remediation (image update, config change, policy exception)
    |
    v
Verification (rescan confirms fix)
    |
    v
Closure (report updated, ticket closed)

Severity SLAs

| Severity | Remediation SLA | Escalation |
|---|---|---|
| CRITICAL | 7 calendar days | Auto-escalate to platform lead after 3 days |
| HIGH | 30 calendar days | Review in weekly triage if unresolved after 14 days |
| MEDIUM | 90 calendar days | Best-effort, track in backlog |
| LOW | Best-effort | No SLA, fix opportunistically |
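
To make the SLA and escalation thresholds concrete, the following sketch classifies a finding by its age in days. It is a hypothetical helper, not part of the RCIIS tooling: the name `sla_status`, the status strings, and the reading of "after N days" as "strictly more than N days" are all assumptions.

```shell
# sla_status SEVERITY AGE_DAYS
# Hypothetical helper: prints "on track", "escalate", "breached" or "no SLA"
# according to the Severity SLAs table. AGE_DAYS is a non-negative integer
# counting calendar days since the finding was discovered.
sla_status() {
  local severity="$1" age="$2" sla escalate
  case "$severity" in
    CRITICAL) sla=7;  escalate=3 ;;
    HIGH)     sla=30; escalate=14 ;;
    MEDIUM)   sla=90; escalate=90 ;;  # the table defines no separate escalation step
    LOW)      echo "no SLA"; return 0 ;;
    *)        echo "unknown severity: $severity" >&2; return 2 ;;
  esac
  if [ "$age" -gt "$sla" ]; then
    echo "breached"
  elif [ "$age" -gt "$escalate" ]; then
    echo "escalate"
  else
    echo "on track"
  fi
}

# Example:
#   sla_status CRITICAL 4
```

A CRITICAL finding that is 4 days old has passed the 3-day escalation point but not the 7-day SLA, so it is reported as `escalate`.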

Commands for Each Stage

Discovery:

# Trigger a full cluster rescan
kubectl delete vulnerabilityreports -A --all
# Wait 5-10 minutes for scans to complete

Classification:

# Export all findings as CSV for triage spreadsheet
kubectl get vulnerabilityreports -A -o json | \
  jq -r '.items[] | .report.vulnerabilities[] |
    [.vulnerabilityID, .severity, .resource, .installedVersion, .fixedVersion, .title] | @csv'

Verification:

# After patching, verify the CVE is resolved
kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
  jq '.report.vulnerabilities[] | select(.vulnerabilityID == "CVE-XXXX-XXXXX")'
# Expected: No output (CVE no longer present)

Policy-as-Code Workflow

PR-Based Policy Changes

All Kyverno policy changes must follow this workflow:

1. Create branch with policy change
    |
    v
2. Test with kyverno-cli (local validation)
    |
    v
3. CI runs kyverno test (automated)
    |
    v
4. Deploy to non-production in Audit mode
    |
    v
5. Monitor PolicyReports for 1 week
    |
    v
6. Review — no false positives?
    |  No  ──> Fix rule, return to step 2
    |  Yes
    v
7. Switch to Enforce mode (separate PR)
    |
    v
8. Merge with required reviewer approval

Required Approvals

| Change Type | Reviewers Required | Justification |
|---|---|---|
| New policy (Audit mode) | 1 platform engineer | Low risk — audit only |
| Switch policy to Enforce | 2 platform engineers + security lead | High risk — blocks deployments |
| New PolicyException | 1 platform engineer + security lead | Weakens security posture |
| Delete or weaken a policy | 2 platform engineers + security lead | Reduces protection |
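
The approval matrix lends itself to an automated check, for example in a CI step. The sketch below is hypothetical: the function name `approvals_ok` and the change-type keywords are invented for illustration, and how approval counts are obtained from the code-review system is left out.

```shell
# approvals_ok CHANGE_TYPE ENGINEER_APPROVALS SECURITY_LEAD_APPROVED
# Hypothetical helper: exit status 0 when the approvals satisfy the matrix,
# 1 when they do not, 2 for an unknown change type.
# CHANGE_TYPE: new-audit-policy | switch-to-enforce | new-exception | delete-or-weaken
# ENGINEER_APPROVALS: integer count; SECURITY_LEAD_APPROVED: "yes" or "no"
approvals_ok() {
  local change="$1" engineers="$2" security_lead="$3"
  local need_engineers need_lead
  case "$change" in
    new-audit-policy)  need_engineers=1; need_lead=no ;;
    switch-to-enforce) need_engineers=2; need_lead=yes ;;
    new-exception)     need_engineers=1; need_lead=yes ;;
    delete-or-weaken)  need_engineers=2; need_lead=yes ;;
    *) echo "unknown change type: $change" >&2; return 2 ;;
  esac
  [ "$engineers" -ge "$need_engineers" ] || return 1
  [ "$need_lead" = no ] || [ "$security_lead" = yes ]
}

# Example:
#   approvals_ok switch-to-enforce 2 yes && echo "approvals satisfied"
```

Extra approvals beyond the minimum are accepted; only a shortfall fails the check.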

Testing Before Deployment

# Local validation
kyverno apply policies/new-policy.yaml --resource test/sample-deployment.yaml

# Run full test suite
kyverno test test/

# Dry-run against live cluster (read-only)
kyverno apply policies/new-policy.yaml --cluster

Security Posture Dashboard

Key Grafana Panels

Configure a dedicated Security Posture dashboard in Grafana with the following panels:

| Panel | Metric Source | PromQL / Query |
|---|---|---|
| Policy Compliance % | Kyverno | `sum(kyverno_policy_results_total{rule_result="pass"}) / sum(kyverno_policy_results_total) * 100` |
| Vulnerability Trend (7d) | Trivy Operator | `sum(trivy_image_vulnerabilities{severity="Critical"})`, graphed over a 7-day time range |
| Runtime Alert Rate | Falcosidekick | `sum(rate(falcosidekick_outputs_total[5m]))` |
| Failed Auth Attempts | Keycloak | Loki query: `count_over_time({namespace="keycloak"} \|= "LOGIN_ERROR" [1h])` |
| Active PolicyExceptions | Kyverno | Manual count or custom exporter |
| Scan Coverage | Trivy Operator | `count(trivy_image_vulnerabilities) / count(kube_pod_info)` (approximate) |

Data Sources

| Source | Namespace | Metrics Endpoint |
|---|---|---|
| Kyverno admission controller | kyverno | :8000/metrics |
| Trivy Operator | trivy-system | ServiceMonitor |
| Falcosidekick | falco | :2801/metrics |
| Keycloak | keycloak | Event logs via Loki |