9.6 Security Operations¶
This page defines the Day-2 operational procedures for maintaining the RCIIS security stack. For initial deployment, see Secure the Cluster. For incident response, see Incident Response.
Daily Operations¶
Check Grafana Security Dashboard¶
Review the following panels each morning:
| Panel | Source | Action Threshold |
|---|---|---|
| New CRITICAL/HIGH vulnerabilities | Trivy Operator metrics | Any new CRITICAL → triage immediately |
| Falco alert volume | Falcosidekick metrics | Spike > 2× baseline → investigate |
| Kyverno violation count | Kyverno metrics | New Enforce-mode violations → investigate blocked deployments |
| Failed login attempts | Keycloak event logs | > 10 failed logins for a single user → check for brute force |
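The Falco threshold in the table can be written as a small check. This is a minimal sketch: it assumes you already have the alert count for the current window and a baseline figure (this page does not say how the baseline is derived). The function name `is_spike` and the sample numbers are hypothetical.

```shell
# is_spike CURRENT BASELINE
# Succeeds (exit 0) when CURRENT is strictly greater than 2x BASELINE,
# mirroring the "Spike > 2x baseline" threshold in the table above.
is_spike() {
  current=$1
  baseline=$2
  [ "$current" -gt $((baseline * 2)) ]
}

# Hypothetical values: 130 alerts in the last hour against a baseline of 60.
if is_spike 130 60; then
  echo "investigate"
else
  echo "normal"
fi
```

Integer arithmetic is enough here because alert counts are whole numbers. A count equal to exactly twice the baseline is treated as normal.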
Review Falco Alert Volume¶
# Quick check — events in the last hour by priority
kubectl -n falco logs -l app.kubernetes.io/name=falcosidekick --since=1h | \
grep -oP '"priority":"[^"]*"' | sort | uniq -c | sort -rn
If the volume is abnormally high, check for false positive storms:
# Identify the noisiest rule
kubectl -n falco logs ds/falco --since=1h --tail=5000 | \
grep -oP 'Rule: \K[^)]+' | sort | uniq -c | sort -rn | head -5
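To judge whether one rule is responsible for a storm, it can help to see each rule's share of the total rather than raw counts. The sketch below assumes input in the `COUNT RULE NAME` shape that `uniq -c` produces. The helper name `rule_share` is hypothetical.

```shell
# rule_share: reads "COUNT RULE NAME" lines on stdin (the shape produced by
# `uniq -c`) and prints each rule's percentage of all events, highest first.
rule_share() {
  awk '{
         count = $1
         $1 = ""                # drop the count, leaving the rule name
         sub(/^ +/, "")         # trim the separator left behind
         rules[$0] += count
         total += count
       }
       END {
         if (total == 0) exit
         for (r in rules) printf "%.1f%% %s\n", 100 * rules[r] / total, r
       }' | sort -rn
}
```

One way to use it is to replace the trailing `sort -rn | head -5` in the command above with `rule_share`, so the percentages cover every rule seen in the window.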
Review Kyverno Audit Violations¶
# Count audit-mode violations by policy
kubectl get policyreports -A -o json | \
jq -r '[.items[].results[]? | select(.result == "fail")] | group_by(.policy) | .[] |
"\(.[0].policy): \(length) violations"'
Weekly Operations¶
Vulnerability Triage¶
Review all new VulnerabilityReports and assign remediation:
# List workloads with CRITICAL vulnerabilities
kubectl get vulnerabilityreports -A -o json | \
jq -r '.items[] | select(.report.summary.criticalCount > 0) |
"\(.metadata.namespace)/\(.metadata.labels["trivy-operator.resource.name"]) — CRITICAL: \(.report.summary.criticalCount), HIGH: \(.report.summary.highCount)"'
# Detailed CVE list for a specific workload
kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL" or .severity == "HIGH") |
"\(.vulnerabilityID) | \(.severity) | \(.installedVersion) → \(.fixedVersion) | \(.resource)"'
Triage categories:
| Category | Action | Timeline |
|---|---|---|
| Fix available, CRITICAL | Update image, redeploy | Within 7 days |
| Fix available, HIGH | Schedule update | Within 30 days |
| No fix available, CRITICAL | Assess exploitability, consider workaround | Document risk acceptance |
| No fix available, HIGH | Monitor for fix release | Review monthly |
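The triage table maps directly onto a lookup, which can be handy when annotating an exported findings list. This is a sketch only: the function name `triage_action` and the `yes`/`no` encoding of fix availability are assumptions, not part of any tool.

```shell
# triage_action SEVERITY FIX_AVAILABLE
# SEVERITY is CRITICAL or HIGH; FIX_AVAILABLE is "yes" or "no" (hypothetical
# encoding). Prints the action and timeline from the triage table above.
triage_action() {
  case "$1:$2" in
    CRITICAL:yes) echo "Update image, redeploy (within 7 days)" ;;
    HIGH:yes)     echo "Schedule update (within 30 days)" ;;
    CRITICAL:no)  echo "Assess exploitability, consider workaround (document risk acceptance)" ;;
    HIGH:no)      echo "Monitor for fix release (review monthly)" ;;
    *)            echo "Not covered by the triage table"; return 1 ;;
  esac
}

triage_action CRITICAL yes
```

In the reports queried above, fix availability could be inferred from whether `.fixedVersion` is non-empty for a given vulnerability.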
Falco Rule Tuning¶
Review false positive rates and adjust rules:
# Check Falcosidekick output failure rate
kubectl -n falco port-forward svc/falco-falcosidekick 2801:2801 &
sleep 2  # give the port-forward time to establish before querying
curl -s http://localhost:2801/metrics | grep -E "falcosidekick_outputs_total|falcosidekick_outputs_errors"
kill %1
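The raw counters are easier to act on as a single failure percentage. The sketch below assumes the metrics arrive in Prometheus text format and use the two metric names matched by the `grep` above. Exact names and labels can differ between Falcosidekick versions, so verify them against your own `/metrics` output. The helper name `output_error_rate` is hypothetical.

```shell
# output_error_rate: reads Prometheus text-format metrics on stdin and prints
# errors as a percentage of total outputs, summed across all label sets.
# Metric names are taken from the grep above and may need adjusting.
output_error_rate() {
  awk '/^falcosidekick_outputs_total/  { total  += $NF }
       /^falcosidekick_outputs_errors/ { errors += $NF }
       END {
         if (total == 0) { print "no outputs recorded"; exit }
         printf "%.2f%%\n", 100 * errors / total
       }'
}
```

With the port-forward running, piping `curl -s http://localhost:2801/metrics` into `output_error_rate` gives one number to compare week over week.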
For each high-volume rule, decide:
- True positive, expected behaviour → Add an exception condition to the rule
- True positive, unexpected → Investigate and remediate
- False positive → Tune the rule condition or add a process/container exclusion
Kyverno PolicyException Audit¶
# List all active policy exceptions
kubectl get policyexceptions -A
# Review each exception — is it still needed?
kubectl get policyexceptions -A -o json | \
jq -r '.items[] |
"\(.metadata.namespace)/\(.metadata.name) — exempts: \(.spec.exceptions[].policyName)"'
Remove exceptions for workloads that have been remediated. Every active exception should have a documented rationale in its annotations.
Vulnerability Management Lifecycle¶
Workflow¶
Discovery (Trivy scan)
|
v
Classification (CRITICAL / HIGH / MEDIUM / LOW)
|
v
Assignment (team / individual owner)
|
v
Remediation (image update, config change, policy exception)
|
v
Verification (rescan confirms fix)
|
v
Closure (report updated, ticket closed)
Severity SLAs¶
| Severity | Remediation SLA | Escalation |
|---|---|---|
| CRITICAL | 7 calendar days | Auto-escalate to platform lead after 3 days |
| HIGH | 30 calendar days | Review in weekly triage if unresolved after 14 days |
| MEDIUM | 90 calendar days | Best-effort, track in backlog |
| LOW | Best-effort | No SLA, fix opportunistically |
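To turn the SLA table into concrete due dates for tickets, the day counts can be added to the discovery date. This is a minimal sketch that assumes GNU `date` (the `-d` option is not available in BSD/macOS `date`). The function names `sla_days` and `sla_due_date` are hypothetical.

```shell
# sla_days SEVERITY
# Prints the remediation SLA in calendar days from the table above.
# Fails for LOW (no SLA) and for unknown severities.
sla_days() {
  case "$1" in
    CRITICAL) echo 7 ;;
    HIGH)     echo 30 ;;
    MEDIUM)   echo 90 ;;
    *)        return 1 ;;
  esac
}

# sla_due_date SEVERITY DISCOVERED
# DISCOVERED is a YYYY-MM-DD date. Prints the due date in UTC, or "no SLA".
sla_due_date() {
  days=$(sla_days "$1") || { echo "no SLA"; return 0; }
  start=$(date -u -d "$2" +%s) || return 1      # GNU date assumed
  date -u -d "@$((start + days * 86400))" +%F
}

sla_due_date CRITICAL 2024-01-01
```

Working in UTC epoch seconds avoids daylight-saving shifts when adding whole days. The same approach could compute the 3-day and 14-day escalation points.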
Commands for Each Stage¶
Discovery:
# Trigger a full cluster rescan
kubectl delete vulnerabilityreports -A --all
# Wait 5-10 minutes for scans to complete
Classification:
# Export all findings as CSV for triage spreadsheet
kubectl get vulnerabilityreports -A -o json | \
jq -r '.items[] | .report.vulnerabilities[] |
[.vulnerabilityID, .severity, .resource, .installedVersion, .fixedVersion, .title] | @csv'
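Before opening the export in a spreadsheet, a per-severity tally gives a quick sense of the triage workload. The sketch below assumes the column order produced by the command above, with severity as the second field. The helper name `severity_counts` and the sample rows are hypothetical.

```shell
# severity_counts: reads the CSV export on stdin and prints "COUNT SEVERITY"
# lines, largest count first. Only the second column is inspected, so commas
# inside later fields such as the title do not affect the result.
severity_counts() {
  awk -F',' '{ gsub(/"/, "", $2); n[$2]++ }
             END { for (s in n) print n[s], s }' | sort -rn
}

# Hypothetical rows in the same shape as the export (placeholder CVE IDs).
printf '%s\n' \
  '"CVE-0000-0001","CRITICAL","openssl","3.0.1","3.0.2","Example, with comma"' \
  '"CVE-0000-0002","CRITICAL","zlib","1.2.11","1.2.13","Example"' \
  '"CVE-0000-0003","HIGH","curl","7.80.0","","Example"' | severity_counts
```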
Verification:
# After patching, verify the CVE is resolved
kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | \
jq '.report.vulnerabilities[] | select(.vulnerabilityID == "CVE-XXXX-XXXXX")'
# Expected: No output (CVE no longer present)
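For scripted verification (for example, as a gate before closing a ticket), the "no output" expectation can be turned into an exit status. This sketch reads the report JSON on stdin and uses `grep` rather than `jq`, so it only tests whether the ID appears as a `vulnerabilityID` value. The function name `cve_resolved` is hypothetical.

```shell
# cve_resolved CVE_ID
# Reads a VulnerabilityReport as JSON on stdin. Succeeds (exit 0) when CVE_ID
# does not appear as a "vulnerabilityID" value, fails when it is still listed.
cve_resolved() {
  ! grep -Eq "\"vulnerabilityID\"[[:space:]]*:[[:space:]]*\"$1\""
}
```

A possible invocation is `kubectl get vulnerabilityreport <report-name> -n <namespace> -o json | cve_resolved CVE-XXXX-XXXXX && echo "resolved"`. The check is only meaningful once the rescan of the patched image has completed.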
Policy-as-Code Workflow¶
PR-Based Policy Changes¶
All Kyverno policy changes must follow this workflow:
1. Create branch with policy change
|
v
2. Test with kyverno-cli (local validation)
|
v
3. CI runs kyverno test (automated)
|
v
4. Deploy to non-production in Audit mode
|
v
5. Monitor PolicyReports for 1 week
|
v
6. Review — no false positives?
| No ──> Fix rule, return to step 2
| Yes
v
7. Switch to Enforce mode (separate PR)
|
v
8. Merge with required reviewer approval
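The gate between steps 5 and 7 can be stated as a simple predicate. This is a sketch only: it assumes "1 week" means at least 7 days in Audit mode and that any remaining false positive blocks promotion. The function name `ready_for_enforce` is hypothetical.

```shell
# ready_for_enforce DAYS_IN_AUDIT FALSE_POSITIVES
# Succeeds when the policy has been observed in Audit mode for at least
# 7 days and no false positives remain (steps 5 and 6 of the workflow above).
ready_for_enforce() {
  [ "$1" -ge 7 ] && [ "$2" -eq 0 ]
}

# Hypothetical values: 8 days in Audit mode, no false positives found.
if ready_for_enforce 8 0; then
  echo "open the Enforce-mode PR"
else
  echo "keep in Audit mode"
fi
```

The check does not replace the reviewer approval required for the Enforce-mode PR. It only captures the precondition for opening it.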
Required Approvals¶
| Change Type | Reviewers Required | Justification |
|---|---|---|
| New policy (Audit mode) | 1 platform engineer | Low risk — audit only |
| Switch policy to Enforce | 2 platform engineers + security lead | High risk — blocks deployments |
| New PolicyException | 1 platform engineer + security lead | Weakens security posture |
| Delete or weaken a policy | 2 platform engineers + security lead | Reduces protection |
Testing Before Deployment¶
# Local validation
kyverno apply policies/new-policy.yaml --resource test/sample-deployment.yaml
# Run full test suite
kyverno test test/
# Dry-run against live cluster (read-only)
kyverno apply policies/new-policy.yaml --cluster
Security Posture Dashboard¶
Key Grafana Panels¶
Configure a dedicated Security Posture dashboard in Grafana with the following panels:
| Panel | Metric Source | PromQL / Query |
|---|---|---|
| Policy Compliance % | Kyverno | sum(kyverno_policy_results_total{rule_result="pass"}) / sum(kyverno_policy_results_total) * 100 |
| Vulnerability Trend (7d) | Trivy Operator | sum(trivy_image_vulnerabilities{severity="Critical"}), graphed over a 7-day range |
| Runtime Alert Rate | Falcosidekick | sum(rate(falcosidekick_outputs_total[5m])) |
| Failed Auth Attempts | Keycloak | Loki query: count_over_time({namespace="keycloak"} \|= "LOGIN_ERROR" [1h]) |
| Active PolicyExceptions | Kyverno | Manual count or custom exporter |
| Scan Coverage | Trivy Operator | count(trivy_image_vulnerabilities) / count(kube_pod_info) (approximate) |
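As a sanity check on the Policy Compliance panel, the same ratio can be computed by hand from two counter values. This is a worked-arithmetic sketch. The function name `compliance_pct` and the sample numbers are hypothetical.

```shell
# compliance_pct PASS TOTAL
# Prints PASS / TOTAL * 100 to one decimal place, the same ratio as the
# Policy Compliance % query above. Prints "n/a" when TOTAL is zero.
compliance_pct() {
  awk -v pass="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * pass / total
  }'
}

# Hypothetical values: 950 passing results out of 1000 evaluated.
compliance_pct 950 1000
```

The zero guard matters on a fresh cluster, where the denominator can be empty and the PromQL expression returns no data.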
Data Sources¶
| Source | Namespace | Metrics Endpoint |
|---|---|---|
| Kyverno admission controller | kyverno | :8000/metrics |
| Trivy Operator | trivy-system | ServiceMonitor |
| Falcosidekick | falco | :2801/metrics |
| Keycloak | keycloak | Event logs via Loki |