5.0 Install Platform Services — Overview¶
After bootstrapping the Talos Linux cluster (Phase 4), the next step is to deploy the infrastructure platform layer. These components provide networking, storage, observability, GitOps delivery, data services, and backup capabilities that the RCIIS application stack depends on.
How Deployment Works¶
All infrastructure is deployed from this Git repository using Flux, a GitOps tool that continuously reconciles the cluster state with the configuration files committed here.
There are three ways to follow this guide depending on how you received this codebase:
You have been given Git access to the rciis-devops repository. All the files
described in this guide already exist in the repository. Flux is installed in
the cluster and watches for changes on the master branch.
You do not need to create files or run mkdir / git add / git commit commands.
Those commands are provided for reference and for users building a new repository
from scratch (see the next tab).
To deploy or update a component:
- Review and edit the relevant file in the repository (HelmRelease, values, or manifests)
- Commit and push to `master`
- Flux detects the change and applies it automatically (within 5 minutes)
- Optionally trigger an immediate sync with the `flux` CLI
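For example, with the `flux` CLI (a sketch; `infra-cilium` stands in for whichever Kustomization you changed):

```shell
# Fetch the latest commit now instead of waiting for the polling interval,
# then re-apply the component's Kustomization.
flux reconcile source git rciis-devops -n flux-system
flux reconcile kustomization infra-cilium -n flux-system
```

Both commands require the `flux` binary and kubeconfig access to the cluster.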
You are building a new GitOps repository from scratch (e.g., for a new
deployment site or a forked environment). The mkdir, git add, and git commit
commands shown throughout this guide are intended for you.
Follow each section top-to-bottom:
- Create the directory with `mkdir -p`
- Save each YAML block into the file path shown in the code block title
- Run `git add` and `git commit` as shown in the "Commit and Deploy" sections
- Push to your repository — Flux will deploy automatically
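Taken together, the loop for one component might look like this (a sketch; the cert-manager overlay path and file contents are illustrative, not verbatim from this guide's later sections):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"   # stand-in for your new repository
git init -q

# 1. Create the overlay directory
mkdir -p flux/infra/aws/cert-manager

# 2. Save the YAML block under the path shown in the code block title
cat > flux/infra/aws/cert-manager/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/cert-manager.yaml
patches:
  - path: patch.yaml
EOF

# 3. Commit as shown in the "Commit and Deploy" sections
git add flux/
git -c user.email=you@example.com -c user.name=you \
    commit -q -m "Add cert-manager overlay"

# 4. Push to your repository -- Flux deploys automatically
# git push origin master   # requires a configured remote
```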
Tip
You can also fork the rciis-devops repository and modify the files
in place rather than creating everything from scratch.
You received this codebase as a zip archive or read-only copy without Git access.
In this case, ignore the Flux HelmRelease files and use the "Alternative: Helm CLI"
blocks shown beneath each Install section. These run helm install / helm upgrade
directly from your workstation.
Warning
Without Flux, there is no automated drift detection or self-healing. You are responsible for re-running Helm commands whenever configuration changes.
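A representative invocation (cert-manager shown; the chart repository is the upstream Jetstack one, and the values path mirrors this repository's layout — each Install section prints the exact block to run):

```shell
# Install or upgrade a component directly, bypassing Flux.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --values flux/infra/aws/cert-manager/values.yaml

# Re-run the same command after every configuration change;
# nothing reconciles it for you.
```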
Repository Structure¶
The deployment files are organised as follows:
rciis-devops/
├── flux/
│   ├── clusters/
│   │   └── aws/
│   │       └── infrastructure.yaml     # Defines ALL components and their deploy order
│   │
│   └── infra/
│       ├── base/                       # Base HelmRelease for each component
│       │   ├── cilium.yaml             # (shared across all environments)
│       │   ├── cert-manager.yaml
│       │   ├── prometheus.yaml
│       │   └── ...
│       │
│       └── aws/                        # Environment-specific overrides
│           ├── cilium/
│           │   ├── kustomization.yaml  # References the base + applies patches
│           │   ├── patch.yaml          # Environment-specific value overrides
│           │   └── values.yaml         # Full Helm values for this environment
│           ├── cert-manager/
│           └── ...
│
└── apps/                               # Application workload deployments
    └── rciis/
        └── ...
How the pieces fit together:
- `flux/clusters/aws/infrastructure.yaml` is the entry point. It lists every component, the path to its configuration, the deploy order (`dependsOn`), and whether it needs secret decryption.
- Each component has a base HelmRelease in `flux/infra/base/` that defines the Helm chart, version, and repository. This file is shared across all environments.
- Each component has an environment overlay in `flux/infra/aws/` (or `proxmox/`, etc.) that customises the base for that specific cluster — adding values, patches, or extra manifests.
- Flux reads these files from Git and applies them. If a file changes, Flux updates the cluster to match.
What Each File Type Does¶
| File | Purpose | When to edit |
|---|---|---|
| `flux/infra/base/<name>.yaml` | Defines which Helm chart to install, its version, and install/upgrade behaviour | When upgrading chart versions |
| `flux/infra/aws/<name>/values.yaml` | Configures how the chart behaves (replicas, resources, feature flags) | When changing component configuration |
| `flux/infra/aws/<name>/patch.yaml` | Overrides specific fields from the base HelmRelease for this environment | When an environment needs different chart settings |
| `flux/infra/aws/<name>/kustomization.yaml` | Ties the base and patches together using Kustomize | Rarely — only when adding new overlay files |
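As a sketch of how the overlay files relate (a hypothetical Cilium overlay; the patched field is illustrative, not copied from the repository):

```yaml
# flux/infra/aws/cilium/kustomization.yaml -- ties base and patch together
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/cilium.yaml     # the shared base HelmRelease
patches:
  - path: patch.yaml           # environment-specific overrides
---
# flux/infra/aws/cilium/patch.yaml -- strategic-merge patch on the base
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cilium
spec:
  values:
    hubble:
      enabled: true            # example override for this environment only
```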
Tool Inventory¶
The platform deploys a common set of tools across all environments, plus environment-specific components for storage and cloud integration. Select your deployment model to see the full inventory:
AWS:

| # | Tool | Purpose | Namespace | Page |
|---|---|---|---|---|
| 1 | Cilium | CNI, kube-proxy replacement, Gateway API, Hubble | `kube-system` | Networking |
| 2 | CoreDNS | Custom internal DNS | `kube-system` | Networking |
| 3 | cert-manager | X.509 certificate automation | `cert-manager` | Certificates |
| 4 | Flux | GitOps continuous deployment | `flux-system` | GitOps |
| 5 | Argo Rollouts | Progressive delivery (canary/blue-green) | `argo-rollouts` | GitOps |
| 6 | AWS EBS CSI Driver | EBS volume provisioning and attachment | `kube-system` | Storage |
| 7 | AWS Cloud Controller Manager | Node lifecycle, load balancer integration | `kube-system` | Storage |
| 8 | AWS Load Balancer Controller | NLB/ALB provisioning from Service/Ingress | `kube-system` | Storage |
| 9 | Snapshot Controller | CSI VolumeSnapshot support | `snapshot-controller` | Storage |
| 10 | Prometheus (kube-prometheus-stack) | Metrics, alerting, Grafana dashboards | `monitoring` | Observability |
| 11 | Loki | Log aggregation and querying | `monitoring` | Observability |
| 12 | Fluent Bit | Log collection (DaemonSet) | `monitoring` | Observability |
| 13 | Blackbox Exporter | HTTP/TCP/ICMP endpoint probing | `monitoring` | Observability |
| 14 | SNMP Exporter | Network device monitoring | `monitoring` | Observability |
| 15 | Goldilocks | Resource right-sizing recommendations | `goldilocks` | Observability |
| 16 | CloudNativePG | PostgreSQL operator | `cnpg-system` | Data Services |
| 17 | Strimzi | Apache Kafka operator | `strimzi-operator` | Data Services |
| 18 | Velero | Kubernetes backup and disaster recovery | `velero` | Backup |
| 19 | Descheduler | Pod rebalancing across nodes | `kube-system` | Backup |
| 20 | Kyverno | Admission control and policy engine | `kyverno` | Policy Engine |
| 21 | Trivy Operator | Continuous vulnerability scanning | `trivy-system` | Vulnerability Scanning |
| 22 | Falco | Runtime syscall threat detection | `falco` | Runtime Security |
| 23 | Tracee | eBPF-based runtime forensics | `tracee` | Runtime Security |
| 24 | Keycloak | Centralised identity provider (OIDC/SAML) | `keycloak` | Identity Management |
| 25 | Crossplane | Kubernetes-native control plane framework (Keycloak provider) | `crossplane-system` | GitOps |
AWS-specific components
The AWS EBS CSI Driver, AWS Cloud Controller Manager, and AWS Load Balancer Controller replace Rook-Ceph and Cilium L2 announcements used on Bare Metal and Proxmox. These integrate Kubernetes with native AWS services (EBS volumes, NLBs, node lifecycle).
Bare Metal:

| # | Tool | Purpose | Namespace | Page |
|---|---|---|---|---|
| 1 | Cilium | CNI, kube-proxy replacement, L2 LB announcements, Gateway API, Hubble | `kube-system` | Networking |
| 2 | CoreDNS | Custom internal DNS | `kube-system` | Networking |
| 3 | cert-manager | X.509 certificate automation | `cert-manager` | Certificates |
| 4 | Flux | GitOps continuous deployment | `flux-system` | GitOps |
| 5 | Argo Rollouts | Progressive delivery (canary/blue-green) | `argo-rollouts` | GitOps |
| 6 | Rook-Ceph Operator | Ceph storage orchestrator | `rook-ceph` | Storage |
| 7 | Rook-Ceph Cluster | Ceph cluster (block, object, S3-compatible storage) | `rook-ceph` | Storage |
| 8 | Snapshot Controller | CSI VolumeSnapshot support | `snapshot-controller` | Storage |
| 9 | Prometheus (kube-prometheus-stack) | Metrics, alerting, Grafana dashboards | `monitoring` | Observability |
| 10 | Loki | Log aggregation and querying | `monitoring` | Observability |
| 11 | Fluent Bit | Log collection (DaemonSet) | `monitoring` | Observability |
| 12 | Blackbox Exporter | HTTP/TCP/ICMP endpoint probing | `monitoring` | Observability |
| 13 | SNMP Exporter | Network device monitoring | `monitoring` | Observability |
| 14 | Goldilocks | Resource right-sizing recommendations | `goldilocks` | Observability |
| 15 | CloudNativePG | PostgreSQL operator | `cnpg-system` | Data Services |
| 16 | Strimzi | Apache Kafka operator | `strimzi-operator` | Data Services |
| 17 | Velero | Kubernetes backup and disaster recovery | `velero` | Backup |
| 18 | Descheduler | Pod rebalancing across nodes | `kube-system` | Backup |
| 19 | Kyverno | Admission control and policy engine | `kyverno` | Policy Engine |
| 20 | Trivy Operator | Continuous vulnerability scanning | `trivy-system` | Vulnerability Scanning |
| 21 | Falco | Runtime syscall threat detection | `falco` | Runtime Security |
| 22 | Tracee | eBPF-based runtime forensics | `tracee` | Runtime Security |
| 23 | Keycloak | Centralised identity provider (OIDC/SAML) | `keycloak` | Identity Management |
| 24 | Crossplane | Kubernetes-native control plane framework (Keycloak provider) | `crossplane-system` | GitOps |
Bare Metal-specific components
Rook-Ceph provides block storage (RBD), object storage (RGW), and S3-compatible endpoints — replacing AWS EBS and S3. Cilium L2 announcements replace AWS NLBs for LoadBalancer Service IPs.
Proxmox-specific components
Identical to Bare Metal. Rook-Ceph provides storage, Cilium L2 announcements provide LoadBalancer IPs. The underlying Proxmox storage pools (ZFS, Ceph, LVM) are consumed by Rook-Ceph as raw block devices.
Deployment Order¶
Flux deploys components in a dependency-ordered sequence. Each component only starts after
its dependencies are healthy. The order is defined in the cluster's infrastructure.yaml
using dependsOn chains. The wave structure is the same across all environments — only the
specific tools within each wave differ based on the storage and cloud integration model.
AWS:

Wave 1 ──── Cilium ───── cert-manager ───── Descheduler
│
Wave 2 ──── CoreDNS
│
Wave 3 ──── Argo Rollouts ──── AWS EBS CSI Driver ──── AWS Cloud Controller Manager
│ AWS Load Balancer Controller ──── Snapshot Controller ──── Crossplane
│
Wave 4 ──── Kyverno ──── Trivy Operator ──── Falco ──── Tracee
│
Wave 5 ──── CloudNativePG ── Strimzi ── Prometheus ── Loki ── Fluent Bit
│
Wave 6 ──── Blackbox Exporter ── SNMP Exporter ── Velero ── Goldilocks ── Keycloak
Bare Metal / Proxmox:

Wave 1 ──── Cilium ───── cert-manager ───── Descheduler
│
Wave 2 ──── CoreDNS
│
Wave 3 ──── Argo Rollouts ──── Rook-Ceph Operator ──── Snapshot Controller ──── Crossplane
│
Wave 4 ──── Kyverno ──── Trivy Operator ──── Falco ──── Tracee
│
Wave 5 ──── Rook-Ceph Cluster ── CloudNativePG ── Strimzi ── Prometheus ── Loki ── Fluent Bit
│
Wave 6 ──── Blackbox Exporter ── SNMP Exporter ── Velero ── Goldilocks ── Keycloak
Why this order:
- Wave 1 establishes networking (Cilium CNI) and certificate infrastructure — everything else depends on these.
- Wave 2 adds custom DNS — required for internal service name resolution.
- Wave 3 deploys the storage layer and Crossplane. On AWS, this is the EBS CSI Driver, Cloud Controller Manager, and Load Balancer Controller. On Bare Metal / Proxmox, this is the Rook-Ceph Operator. Crossplane enables declarative management of external resources (Keycloak).
- Wave 4 deploys the security stack — admission control (Kyverno), vulnerability scanning (Trivy), and runtime detection (Falco, Tracee). These must be active before application workloads arrive so that every subsequent deployment is policy-enforced and monitored from the start.
- Wave 5 creates the data layer. On Bare Metal / Proxmox, this includes the Rook-Ceph Cluster (block pools, object store, storage classes). On AWS, storage is already available from Wave 3 (EBS). All environments deploy CloudNativePG, Strimzi, and the full observability stack — now under Kyverno admission control and Falco runtime detection.
- Wave 6 adds supplementary monitoring, backups, resource optimisation, and identity management. Keycloak deploys last because it requires CloudNativePG (Wave 5) for its PostgreSQL database.
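The waves above are built from `dependsOn` chains in `infrastructure.yaml`. A minimal sketch, assuming Kustomization names of the form `infra-<component>` (the names and interval are illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-coredns              # a Wave 2 component
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  dependsOn:
    - name: infra-cilium           # blocks until the Wave 1 Kustomization is healthy
  path: "./flux/infra/aws/coredns"
  sourceRef:
    kind: GitRepository
    name: rciis-devops
```

Flux holds `infra-coredns` back until every entry in `dependsOn` reports Ready, which is what produces the wave ordering.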
Namespace Strategy¶
AWS:

| Namespace | Components |
|---|---|
| `kube-system` | Cilium, CoreDNS, Descheduler, AWS EBS CSI Driver, AWS Cloud Controller Manager, AWS Load Balancer Controller |
| `cert-manager` | cert-manager controller, webhook, cainjector |
| `flux-system` | Flux controllers (helm-controller, kustomize-controller, source-controller, notification-controller) |
| `argo-rollouts` | Argo Rollouts controller, dashboard |
| `snapshot-controller` | CSI snapshot controller |
| `kyverno` | Kyverno admission controller, policy reports |
| `trivy-system` | Trivy Operator scanner |
| `falco` | Falco runtime detection, Falcosidekick |
| `tracee` | Tracee eBPF runtime detection |
| `keycloak` | Keycloak identity provider, PostgreSQL instance |
| `monitoring` | Prometheus, Alertmanager, Grafana, Loki, Fluent Bit, exporters |
| `cnpg-system` | CloudNativePG operator |
| `strimzi-operator` | Strimzi Kafka operator |
| `velero` | Velero server, AWS S3 plugin |
| `goldilocks` | Goldilocks controller, dashboard |
| `crossplane-system` | Crossplane core, RBAC manager, Keycloak provider |
Bare Metal / Proxmox:

| Namespace | Components |
|---|---|
| `kube-system` | Cilium, CoreDNS, Descheduler |
| `cert-manager` | cert-manager controller, webhook, cainjector |
| `flux-system` | Flux controllers (helm-controller, kustomize-controller, source-controller, notification-controller) |
| `argo-rollouts` | Argo Rollouts controller, dashboard |
| `rook-ceph` | Rook operator, Ceph MONs, MGRs, OSDs, RGW |
| `snapshot-controller` | CSI snapshot controller |
| `kyverno` | Kyverno admission controller, policy reports |
| `trivy-system` | Trivy Operator scanner |
| `falco` | Falco runtime detection, Falcosidekick |
| `tracee` | Tracee eBPF runtime detection |
| `keycloak` | Keycloak identity provider, PostgreSQL instance |
| `monitoring` | Prometheus, Alertmanager, Grafana, Loki, Fluent Bit, exporters |
| `cnpg-system` | CloudNativePG operator |
| `strimzi-operator` | Strimzi Kafka operator |
| `velero` | Velero server, AWS S3 plugin |
| `goldilocks` | Goldilocks controller, dashboard |
| `crossplane-system` | Crossplane core, RBAC manager, Keycloak provider |
HA vs Non-HA Deployment¶
Every tool in this section provides both an HA and Non-HA configuration variant, presented as tabbed code blocks.
| Variant | When to Use |
|---|---|
| HA | Production clusters with 3+ nodes. Multi-replica, topology-spread, full security hardening. |
| Non-HA | Development, single-node, or resource-constrained clusters. Single replica, reduced resources. |
Non-HA is not production-ready
The Non-HA configurations trade redundancy for simplicity. A single replica failure causes service downtime. Use HA for any environment where availability matters.
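As an illustration of what typically differs between the two variants (a hypothetical values fragment for a generic chart, not copied from any component in this guide):

```yaml
# HA variant: survive a node failure
replicaCount: 3
podDisruptionBudget:
  minAvailable: 1
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule

# Non-HA variant: single replica, reduced resources
# replicaCount: 1
# resources:
#   requests: { cpu: 50m, memory: 64Mi }
```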
Secret Management¶
Infrastructure secrets (registry credentials, OAuth tokens, S3 keys, TLS certificates, database passwords) are managed via SOPS with Age encryption, using Flux's built-in SOPS decryption support — no external plugins required.
How It Works¶
Flux's kustomize-controller has native SOPS support. When a Kustomization resource
includes a decryption stanza, the controller automatically decrypts any
SOPS-encrypted files before applying them to the cluster:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-cert-manager
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: rciis-devops
  path: "./flux/infra/aws/cert-manager"
  decryption:
    provider: sops
    secretRef:
      name: sops-age-aws   # Kubernetes Secret containing the Age private key
The Age private key is stored as a Kubernetes Secret in the `flux-system` namespace.
Only Kustomizations that reference `decryptionSecret: sops-age-aws` in the ResourceSet
will have SOPS decryption enabled — components without secrets use an empty string
(`decryptionSecret: ""`) and skip decryption entirely.
Secret Lifecycle¶
The developer encrypts the secret with SOPS and the Age public key; the encrypted YAML is committed to Git (safe in plain sight); Flux pulls it from Git and decrypts it at apply time:

sops -e secret.yaml ──► secret.enc.yaml ──► Git ──► kustomize-controller ──► K8s Secret
Encrypted secrets live in the Flux overlay directories (e.g., flux/infra/aws/cert-manager/)
alongside their Kustomization files. See Credential Management
for the full SOPS workflow including key generation and rotation.
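The encryption step is usually driven by a `.sops.yaml` creation rule at the repository root, so `sops -e` picks the right recipient automatically. A sketch (the path regex and key placeholder are illustrative):

```yaml
# .sops.yaml -- tells sops which files to encrypt and with which Age key
creation_rules:
  - path_regex: flux/infra/.*\.enc\.yaml$
    encrypted_regex: ^(data|stringData)$   # encrypt only Secret payload fields
    age: "<your Age public key>"           # age1... recipient from age-keygen
```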
Prerequisites¶
Before deploying infrastructure, ensure:
- [x] Talos Linux cluster is bootstrapped and healthy (Phase 4)
- [x] `kubectl` access is configured with admin credentials
- [x] Flux is bootstrapped in the cluster (GitOps)
- [x] SOPS Age key is generated and distributed (Credential Management)
- [x] DNS zone is configured for the cluster domain
- [x] Cloudflare API token is available for cert-manager DNS-01 challenges
Next Steps¶
Once all prerequisites are met, proceed to install platform services in order:
- 5.1.1 Networking — Cilium CNI, ingress, and DNS
- 5.1.2 Certificates — cert-manager and ClusterIssuers
- 5.1.3 GitOps & Delivery — Flux configuration and Argo Rollouts
- 5.1.4 Storage — Rook-Ceph and snapshot controller
- 5.2 Security — Kyverno, Trivy, Falco, Tracee
- 5.3.1 Observability — Prometheus, Grafana, Loki, exporters
- 5.3.2 Data Services — CloudNativePG and Strimzi operators
- 5.3.3 Backup & Scheduling — Velero and Descheduler
- 5.3.4 Identity & Access Management — Keycloak
- 5.3.5 Key Management — HSM integration