9.6 Security Operations

This page defines the Day-2 operational procedures for maintaining the RCIIS security stack. For initial deployment, see Secure the Cluster. For incident response, see Incident Response.

Daily Operations

Check Grafana Security Dashboard

Review the following panels each morning:

Panel	Source	Action Threshold
New CRITICAL/HIGH vulnerabilities	Trivy Operator metrics	Any new CRITICAL → triage immediately
Falco alert volume	Falcosidekick metrics	Spike > 2× baseline → investigate
Kyverno violation count	Kyverno metrics	New Enforce-mode violations → investigate blocked deployments
Failed login attempts	Keycloak event logs	> 10 failed logins for a single user → check for brute force

Review Falco Alert Volume

# Quick check — events in the last hour by priority
kubectl -n falco logs -l app.kubernetes.io/name=falcosidekick --since=1h | \
  grep -oP '"priority":"[^"]*"' | sort | uniq -c | sort -rn

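If the volume is abnormally high, check for false-positive storms:

# Identify the five noisiest rules
# Note: ds/falco reads a single pod; repeat per node or select pods with -l to cover the cluster.
# The pattern assumes the rule name is logged as "Rule: <name>)"; adjust it to your Falco output format.
kubectl -n falco logs ds/falco --since=1h --tail=5000 | \
  grep -oP 'Rule: \K[^)]+' | sort | uniq -c | sort -rn | head -5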
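The "Spike > 2× baseline" threshold from the dashboard table can also be checked from the command line. The following is a minimal sketch, not part of the RCIIS tooling: the function names `is_spike` and `count_events` are hypothetical, and the baseline figure has to come from your own dashboard history.

```shell
#!/usr/bin/env bash

# is_spike CURRENT BASELINE [FACTOR]
# Succeeds (exit 0) when CURRENT is strictly greater than FACTOR x BASELINE.
# FACTOR defaults to 2, matching the "Spike > 2x baseline" threshold.
is_spike() {
  local current="$1" baseline="$2" factor="${3:-2}"
  [ "$current" -gt $(( baseline * factor )) ]
}

# count_events
# Counts the lines on stdin that carry a "priority" field.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the exit status clean.
count_events() {
  grep -c '"priority":"' || true
}

# Example usage (the baseline of 120 events per hour is an assumed value):
#   current=$(kubectl -n falco logs -l app.kubernetes.io/name=falcosidekick --since=1h | count_events)
#   is_spike "$current" 120 && echo "Falco alert volume above 2x baseline: investigate"
```

Keeping the comparison in its own function makes it easy to reuse the same threshold for other counters on the dashboard.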

Review Kyverno Audit Violations

# Count audit-mode violations by policy
# ("[]?" skips reports that have no results yet instead of aborting with an error)
kubectl get policyreports -A -o json | \
  jq -r '[.items[].results[]? | select(.result == "fail")] | group_by(.policy) | .[] |
    "\(.[0].policy): \(length) violations"'

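Once a policy stands out in the counts, the next question is usually which resources are failing it. The sketch below assumes the PolicyReport JSON layout used by recent Kyverno releases, where the affected resource is either listed under each result's `resources` field or identified by the report's `scope`. The function name `failing_resources` is hypothetical.

```shell
#!/usr/bin/env bash

# failing_resources POLICY
# Reads `kubectl get policyreports -A -o json` output on stdin and prints one
# "Kind/namespace/name" line per resource that fails POLICY.
# Handles both layouts: results carrying their own `resources` list, and
# per-resource reports that identify the resource in `.scope`.
failing_resources() {
  jq -r --arg policy "$1" '
    .items[] as $report
    | $report.results[]?
    | select(.result == "fail" and .policy == $policy)
    | (.resources[]? // $report.scope // empty)
    | "\(.kind)/\(.namespace // "-")/\(.name)"
  ' | sort -u
}

# Example usage (the policy name is a placeholder):
#   kubectl get policyreports -A -o json | failing_resources <policy-name>
```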
Weekly Operations

Vulnerability Triage

Review all new VulnerabilityReports and assign remediation:

# List workloads with CRITICAL vulnerabilities
kubectl get vulnerabilityreports -A -o json | \
  jq -r '.items[] | select(.report.summary.criticalCount > 0) |
    "\(.metadata.namespace)/\(.metadata.labels["trivy-operator.resource.name"]) — CRITICAL: \(.report.summary.criticalCount), HIGH: \(.report.summary.highCount)"'

# Detailed CVE list for a specific workload
kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
  jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL" or .severity == "HIGH") |
    "\(.vulnerabilityID) | \(.severity) | \(.installedVersion) → \(.fixedVersion) | \(.resource)"'

Triage categories:

Category	Action	Timeline
Fix available, CRITICAL	Update image, redeploy	Within 7 days
Fix available, HIGH	Schedule update	Within 30 days
No fix available, CRITICAL	Assess exploitability, consider workaround	Document risk acceptance
No fix available, HIGH	Monitor for fix release	Review monthly
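The triage table can be encoded as a small lookup so that exported findings are labelled consistently. This is a sketch only: the function names `triage` and `has_fix` are hypothetical, and it assumes that an empty `fixedVersion` in the report means no fix is available.

```shell
#!/usr/bin/env bash

# has_fix FIXED_VERSION
# Prints "yes" when a fixed version is reported, "no" when the field is empty.
has_fix() {
  if [ -n "$1" ]; then echo yes; else echo no; fi
}

# triage SEVERITY FIX_AVAILABLE
# Prints "action | timeline" for a finding, following the triage table.
# SEVERITY is CRITICAL or HIGH; FIX_AVAILABLE is "yes" or "no".
triage() {
  case "$1:$2" in
    CRITICAL:yes) echo "Update image, redeploy | Within 7 days" ;;
    HIGH:yes)     echo "Schedule update | Within 30 days" ;;
    CRITICAL:no)  echo "Assess exploitability, consider workaround | Document risk acceptance" ;;
    HIGH:no)      echo "Monitor for fix release | Review monthly" ;;
    *)            echo "Not covered by the triage table: $1 (fix: $2)" >&2; return 1 ;;
  esac
}

# Example usage:
#   triage CRITICAL "$(has_fix "$fixed_version")"
```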

Falco Rule Tuning

Review false-positive rates and adjust rules. First confirm that alerts are actually reaching their outputs:

# Check Falcosidekick output success and error counters
kubectl -n falco port-forward svc/falco-falcosidekick 2801:2801 &
PF_PID=$!
sleep 2  # give the port-forward time to establish before querying
curl -s http://localhost:2801/metrics | grep -E '^falcosidekick_outputs'
kill "$PF_PID"

For each high-volume rule, decide:

True positive, expected behaviour → Add an exception condition to the rule
True positive, unexpected → Investigate and remediate
False positive → Tune the rule condition or add a process/container exclusion
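For the first option in the list above, an exception can be appended to an existing rule without copying the rule itself. The sketch below writes such an override to a file. It assumes a recent Falco release that supports the `override` key (older releases used `append: true` instead), and the exception name, image repository, and process name are hypothetical values chosen for illustration.

```shell
#!/usr/bin/env bash

# write_falco_exception FILE
# Writes a rule override that appends one exception to an existing Falco rule.
write_falco_exception() {
  cat > "$1" <<'EOF'
# Hypothetical example: allow bash in one known debug image
- rule: Terminal shell in container
  exceptions:
    - name: known_debug_image
      fields: [container.image.repository, proc.name]
      comps: [=, =]
      values:
        - [registry.example.com/debug-tools, bash]
  override:
    exceptions: append
EOF
}

# Example usage:
#   write_falco_exception custom-rules.yaml
```

How the resulting file is loaded depends on how Falco is deployed, so treat the file name as a placeholder and review the override in a pull request before rollout.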

Kyverno PolicyException Audit

# List all active policy exceptions
kubectl get policyexceptions -A

# Review each exception: is it still needed?
# (policy names are joined so that each exception prints exactly one line)
kubectl get policyexceptions -A -o json | \
  jq -r '.items[] |
    "\(.metadata.namespace)/\(.metadata.name) - exempts: \([.spec.exceptions[].policyName] | join(", "))"'

Remove exceptions for workloads that have been remediated. Every active exception should have a documented rationale in its annotations.
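The rationale requirement can be checked mechanically. The sketch below lists exceptions whose rationale annotation is missing or empty. The annotation key `security.example.com/rationale` and the function name are assumptions; the source does not name a key, so substitute whatever convention your team uses.

```shell
#!/usr/bin/env bash

# Hypothetical annotation key; override via the environment to match your convention.
RATIONALE_KEY="${RATIONALE_KEY:-security.example.com/rationale}"

# exceptions_without_rationale
# Reads `kubectl get policyexceptions -A -o json` output on stdin and prints
# "namespace/name" for every exception whose rationale annotation is missing or empty.
exceptions_without_rationale() {
  jq -r --arg key "$RATIONALE_KEY" '
    .items[]
    | select((.metadata.annotations[$key] // "") == "")
    | "\(.metadata.namespace)/\(.metadata.name)"
  '
}

# Example usage:
#   kubectl get policyexceptions -A -o json | exceptions_without_rationale
```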

Vulnerability Management Lifecycle

Workflow

Discovery (Trivy scan)
    |
    v
Classification (CRITICAL / HIGH / MEDIUM / LOW)
    |
    v
Assignment (team / individual owner)
    |
    v
Remediation (image update, config change, policy exception)
    |
    v
Verification (rescan confirms fix)
    |
    v
Closure (report updated, ticket closed)

Severity SLAs

Severity	Remediation SLA	Escalation
CRITICAL	7 calendar days	Auto-escalate to platform lead after 3 days
HIGH	30 calendar days	Review in weekly triage if unresolved after 14 days
MEDIUM	90 calendar days	Best-effort, track in backlog
LOW	Best-effort	No SLA, fix opportunistically
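To attach a deadline to each ticket, the SLA table can be turned into a date calculation. This is a sketch with hypothetical function names. It assumes GNU `date` (for the relative `-d` syntax) and a discovery date in `YYYY-MM-DD` form.

```shell
#!/usr/bin/env bash

# sla_days SEVERITY
# Prints the remediation SLA in calendar days (empty for best-effort severities).
sla_days() {
  case "$1" in
    CRITICAL) echo 7 ;;
    HIGH)     echo 30 ;;
    MEDIUM)   echo 90 ;;
    LOW)      echo "" ;;
    *)        return 1 ;;
  esac
}

# escalation_days SEVERITY
# Prints the number of days after discovery at which the table's escalation step applies.
escalation_days() {
  case "$1" in
    CRITICAL) echo 3 ;;
    HIGH)     echo 14 ;;
    *)        echo "" ;;
  esac
}

# due_date SEVERITY DISCOVERED
# Prints the SLA deadline as YYYY-MM-DD, or "best-effort" when no SLA applies.
due_date() {
  local days
  days="$(sla_days "$1")" || return 1
  if [ -z "$days" ]; then
    echo "best-effort"
  else
    date -u -d "$2 +${days} days" +%F
  fi
}

# Example usage:
#   due_date CRITICAL 2024-03-01
```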

Commands for Each Stage

Discovery:

# Trigger a full cluster rescan
kubectl delete vulnerabilityreports -A --all
# Wait 5-10 minutes for scans to complete
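Rather than waiting a fixed time, the rescan can be polled until the number of reports stops changing. The sketch below is a heuristic with hypothetical function names: a stable count suggests that scanning has settled but does not prove every workload was rescanned. The default interval and limits are assumed values.

```shell
#!/usr/bin/env bash

# report_count
# Prints the number of VulnerabilityReports currently in the cluster.
report_count() {
  kubectl get vulnerabilityreports -A --no-headers 2>/dev/null | wc -l
}

# wait_for_rescan [INTERVAL_SECONDS] [STABLE_POLLS] [MAX_POLLS]
# Polls report_count until it is non-zero and unchanged for STABLE_POLLS
# consecutive polls, then prints the final count. Fails after MAX_POLLS polls.
wait_for_rescan() {
  local interval="${1:-30}" stable_needed="${2:-3}" max_polls="${3:-40}"
  local prev=-1 stable=0 polls=0 count
  while [ "$polls" -lt "$max_polls" ]; do
    count="$(report_count)"
    polls=$((polls + 1))
    if [ "$count" -gt 0 ] && [ "$count" -eq "$prev" ]; then
      stable=$((stable + 1))
    else
      stable=0
    fi
    if [ "$stable" -ge "$stable_needed" ]; then
      echo "$count"
      return 0
    fi
    prev="$count"
    sleep "$interval"
  done
  echo "report count did not stabilise after ${max_polls} polls" >&2
  return 1
}

# Example usage:
#   kubectl delete vulnerabilityreports -A --all && wait_for_rescan
```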

Classification:

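# Export all findings as CSV (with a header row and the source namespace) for the triage spreadsheet
kubectl get vulnerabilityreports -A -o json | \
  jq -r '["namespace", "vulnerabilityID", "severity", "resource", "installedVersion", "fixedVersion", "title"],
    (.items[] | .metadata.namespace as $ns | .report.vulnerabilities[] |
      [$ns, .vulnerabilityID, .severity, .resource, .installedVersion, .fixedVersion, .title]) | @csv'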

Verification:

# After patching, verify the CVE is resolved
kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
  jq '.report.vulnerabilities[] | select(.vulnerabilityID == "CVE-XXXX-XXXXX")'
# Expected: No output (CVE no longer present)
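For use in a script or pipeline, the same verification can return an exit status instead of relying on a visual check for empty output. A minimal sketch, with the hypothetical function name `cve_resolved`:

```shell
#!/usr/bin/env bash

# cve_resolved CVE_ID
# Reads one VulnerabilityReport (JSON) on stdin.
# Exits 0 if CVE_ID is absent from the report, 1 if it is still listed.
cve_resolved() {
  jq -e --arg cve "$1" \
    '[.report.vulnerabilities[]? | select(.vulnerabilityID == $cve)] | length == 0' \
    > /dev/null
}

# Example usage:
#   kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
#     cve_resolved "CVE-XXXX-XXXXX" && echo "resolved" || echo "still present"
```

The check only reflects the report it is given, so run it after the rescan of the patched workload has produced a fresh report.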

Policy-as-Code Workflow

PR-Based Policy Changes

All Kyverno policy changes must follow this workflow:

1. Create branch with policy change
    |
    v
2. Test with kyverno-cli (local validation)
    |
    v
3. CI runs kyverno test (automated)
    |
    v
4. Deploy to non-production in Audit mode
    |
    v
5. Monitor PolicyReports for 1 week
    |
    v
6. Review — no false positives?
    |  No  ──> Fix rule, return to step 2
    |  Yes
    v
7. Switch to Enforce mode (separate PR)
    |
    v
8. Merge with required reviewer approval
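Steps 4 and 7 of the workflow above imply that a pull request introducing a policy should not already set Enforce mode. A CI job could guard this with a simple text check. The sketch below is an assumption about how such a gate might look, not the project's actual CI: the function names are hypothetical, it matches the spec-level `validationFailureAction` field and the rule-level `failureAction` field, and it does not inspect `validationFailureActionOverrides`.

```shell
#!/usr/bin/env bash

# is_enforcing FILE
# Succeeds when the policy manifest sets an Enforce failure action.
is_enforcing() {
  grep -Eiq '^[[:space:]]*(validationFailureAction|failureAction):[[:space:]]*"?enforce"?[[:space:]]*$' "$1"
}

# check_audit_only FILE...
# Reports each file that is already enforcing and fails if any was found.
check_audit_only() {
  local rc=0 f
  for f in "$@"; do
    if is_enforcing "$f"; then
      echo "$f: Enforce mode must be introduced in a separate PR" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example usage in CI, passing the policy files added by the PR:
#   check_audit_only policies/new-policy.yaml
```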

Required Approvals

Change Type	Reviewers Required	Justification
New policy (Audit mode)	1 platform engineer	Low risk — audit only
Switch policy to Enforce	2 platform engineers + security lead	High risk — blocks deployments
New PolicyException	1 platform engineer + security lead	Weakens security posture
Delete or weaken a policy	2 platform engineers + security lead	Reduces protection

Testing Before Deployment

# Local validation
kyverno apply policies/new-policy.yaml --resource test/sample-deployment.yaml

# Run full test suite
kyverno test test/

# Dry-run against live cluster (read-only)
kyverno apply policies/new-policy.yaml --cluster
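`kyverno test` looks for a test manifest (by default named `kyverno-test.yaml`) in the directory it is given. The sketch below writes a minimal manifest for the files used in the commands above. The policy name `new-policy`, the rule name `check-labels`, and the resource name `sample-deployment` are hypothetical and must match your actual policy and sample resource; the `cli.kyverno.io/v1alpha1` schema is assumed to be supported by your Kyverno CLI version.

```shell
#!/usr/bin/env bash

# write_kyverno_test DIR
# Writes DIR/kyverno-test.yaml declaring the expected result for one resource.
write_kyverno_test() {
  mkdir -p "$1"
  cat > "$1/kyverno-test.yaml" <<'EOF'
apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
  name: new-policy-test
policies:
  - ../policies/new-policy.yaml
resources:
  - sample-deployment.yaml
results:
  - policy: new-policy          # hypothetical policy name
    rule: check-labels          # hypothetical rule name
    kind: Deployment
    resources:
      - sample-deployment       # hypothetical resource name
    result: pass
EOF
}

# Example usage:
#   write_kyverno_test test && kyverno test test/
```

Declaring at least one expected `fail` result alongside the `pass` case helps confirm that the rule actually rejects what it is meant to reject.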

Security Posture Dashboard

Key Grafana Panels

Configure a dedicated Security Posture dashboard in Grafana with the following panels:

Panel	Metric Source	PromQL / Query
Policy Compliance %	Kyverno	`sum(kyverno_policy_results_total{rule_result="pass"}) / sum(kyverno_policy_results_total) * 100`
Vulnerability Trend (7d)	Trivy Operator	`sum(trivy_image_vulnerabilities{severity="Critical"})` over time
Runtime Alert Rate	Falcosidekick	`sum(rate(falcosidekick_outputs_total[5m]))`
Failed Auth Attempts	Keycloak	Loki query: `count_over_time({namespace="keycloak"} |= "LOGIN_ERROR" [1h])`
Active PolicyExceptions	Kyverno	Manual count or custom exporter
Scan Coverage	Trivy Operator	`count(trivy_image_vulnerabilities) / count(kube_pod_info)` (approximate)

Data Sources

Source	Namespace	Metrics Endpoint
Kyverno admission controller	`kyverno`	`:8000/metrics`
Trivy Operator	`trivy-system`	ServiceMonitor
Falcosidekick	`falco`	`:2801/metrics`
Keycloak	`keycloak`	Event logs via Loki