
5.0 Install Platform Services — Overview

After bootstrapping the Talos Linux cluster (Phase 4), the next step is to deploy the infrastructure platform layer. These components provide networking, storage, observability, GitOps delivery, data services, and backup capabilities that the RCIIS application stack depends on.


How Deployment Works

All infrastructure is deployed from this Git repository using Flux, a GitOps tool that continuously reconciles the cluster state with the configuration files committed here.

There are three ways to follow this guide, depending on how you received this codebase: you have Git access to the existing repository, you are building a new repository from scratch, or you have a read-only copy without Git access.

You have been given Git access to the rciis-devops repository. All the files described in this guide already exist in the repository. Flux is installed in the cluster and watches for changes on the master branch.

You do not need to create files or run mkdir / git add / git commit commands. Those commands are provided for reference and for users building a new repository from scratch (see the next scenario below).

To deploy or update a component:

  1. Review and edit the relevant file in the repository (HelmRelease, values, or manifests)
  2. Commit and push to master
  3. Flux detects the change and applies it automatically (within 5 minutes)
  4. Optionally trigger an immediate sync:

    flux reconcile kustomization <name> -n flux-system --with-source
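
Putting the four steps together, a typical update cycle looks like this (the component and file paths are illustrative; substitute the Kustomization name used in infrastructure.yaml):

```shell
# 1. Edit the component's configuration (path shown as an example)
vim flux/infra/aws/cert-manager/values.yaml

# 2. Commit and push to the branch Flux watches
git add flux/infra/aws/cert-manager/values.yaml
git commit -m "cert-manager: adjust values"
git push origin master

# 3. Optionally force an immediate sync instead of waiting for the interval
flux reconcile kustomization infra-cert-manager -n flux-system --with-source

# 4. Confirm the HelmRelease reconciled successfully
flux get helmreleases -A
```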
    

You are building a new GitOps repository from scratch (e.g., for a new deployment site or a forked environment). The mkdir, git add, and git commit commands shown throughout this guide are intended for you.

Follow each section top-to-bottom:

  1. Create the directory with mkdir -p
  2. Save each YAML block into the file path shown in the code block title
  3. Run git add and git commit as shown in the "Commit and Deploy" sections
  4. Push to your repository — Flux will deploy automatically
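
As a concrete sketch of that per-section workflow for a single component (the component name is illustrative):

```shell
# 1. Create the overlay directory for the component
mkdir -p flux/infra/aws/cert-manager

# 2. Save the YAML blocks from the guide into that directory
#    (kustomization.yaml, patch.yaml, values.yaml, ...)

# 3-4. Commit and push; Flux deploys on the next sync
git add flux/infra/aws/cert-manager
git commit -m "Add cert-manager overlay for aws"
git push origin master
```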

Tip

You can also fork the rciis-devops repository and modify the files in place rather than creating everything from scratch.

You received this codebase as a zip archive or read-only copy without Git access. In this case, ignore the Flux HelmRelease files and use the "Alternative: Helm CLI" blocks shown beneath each Install section. These run helm install / helm upgrade directly from your workstation.

Warning

Without Flux, there is no automated drift detection or self-healing. You are responsible for re-running Helm commands whenever configuration changes.
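
For example, a Helm CLI install of cert-manager might look like the following (the chart repository and values path are shown for illustration; use the values from the matching Install section of this guide):

```shell
# Add the chart repository and refresh the local index
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install or upgrade in one idempotent step
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --values flux/infra/aws/cert-manager/values.yaml
```

Re-run the same `helm upgrade --install` command after every configuration change — this is the manual equivalent of a Flux reconciliation.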

Repository Structure

The deployment files are organised as follows:

rciis-devops/
├── flux/
│   ├── clusters/
│   │   └── aws/
│   │       └── infrastructure.yaml      # Defines ALL components and their deploy order
│   │
│   └── infra/
│       ├── base/                         # Base HelmRelease for each component
│       │   ├── cilium.yaml               #   (shared across all environments)
│       │   ├── cert-manager.yaml
│       │   ├── prometheus.yaml
│       │   └── ...
│       │
│       └── aws/                          # Environment-specific overrides
│           ├── cilium/
│           │   ├── kustomization.yaml    #   References the base + applies patches
│           │   ├── patch.yaml            #   Environment-specific value overrides
│           │   └── values.yaml           #   Full Helm values for this environment
│           ├── cert-manager/
│           └── ...
└── apps/                                 # Application workload deployments
    └── rciis/
        └── ...

How the pieces fit together:

  1. flux/clusters/aws/infrastructure.yaml is the entry point. It lists every component, the path to its configuration, the deploy order (dependsOn), and whether it needs secret decryption.

  2. Each component has a base HelmRelease in flux/infra/base/ that defines the Helm chart, version, and repository. This file is shared across all environments.

  3. Each component has an environment overlay in flux/infra/aws/ (or proxmox/, etc.) that customises the base for that specific cluster — adding values, patches, or extra manifests.

  4. Flux reads these files from Git and applies them. If a file changes, Flux updates the cluster to match.
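
A base HelmRelease (step 2) typically has this shape — chart name, pinned version, and a sourceRef pointing at a HelmRepository (names and version below are illustrative):

```yaml
# flux/infra/base/cert-manager.yaml — illustrative sketch
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 30m
  chart:
    spec:
      chart: cert-manager
      version: "v1.x.x"          # pinned here; bump this line to upgrade
      sourceRef:
        kind: HelmRepository
        name: jetstack
        namespace: flux-system
  install:
    remediation:
      retries: 3
```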

What Each File Type Does

| File | Purpose | When to edit |
|---|---|---|
| flux/infra/base/&lt;name&gt;.yaml | Defines which Helm chart to install, its version, and install/upgrade behaviour | When upgrading chart versions |
| flux/infra/aws/&lt;name&gt;/values.yaml | Configures how the chart behaves (replicas, resources, feature flags) | When changing component configuration |
| flux/infra/aws/&lt;name&gt;/patch.yaml | Overrides specific fields from the base HelmRelease for this environment | When an environment needs different chart settings |
| flux/infra/aws/&lt;name&gt;/kustomization.yaml | Ties the base and patches together using Kustomize | Rarely — only when adding new overlay files |
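
Tying these files together, an environment overlay's kustomization.yaml generally references the base and layers the patch on top (an illustrative sketch; the exact wiring of values.yaml varies per component):

```yaml
# flux/infra/aws/cert-manager/kustomization.yaml — illustrative sketch
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/cert-manager.yaml    # the shared base HelmRelease
patches:
  - path: patch.yaml                # environment-specific overrides
    target:
      kind: HelmRelease
      name: cert-manager
```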

Tool Inventory

The platform deploys a common set of tools across all environments, plus environment-specific components for storage and cloud integration. Select your deployment model to see the full inventory:

AWS:

| # | Tool | Purpose | Namespace | Page |
|---|---|---|---|---|
| 1 | Cilium | CNI, kube-proxy replacement, Gateway API, Hubble | kube-system | Networking |
| 2 | CoreDNS | Custom internal DNS | kube-system | Networking |
| 3 | cert-manager | X.509 certificate automation | cert-manager | Certificates |
| 4 | Flux | GitOps continuous deployment | flux-system | GitOps |
| 5 | Argo Rollouts | Progressive delivery (canary/blue-green) | argo-rollouts | GitOps |
| 6 | AWS EBS CSI Driver | EBS volume provisioning and attachment | kube-system | Storage |
| 7 | AWS Cloud Controller Manager | Node lifecycle, load balancer integration | kube-system | Storage |
| 8 | AWS Load Balancer Controller | NLB/ALB provisioning from Service/Ingress | kube-system | Storage |
| 9 | Snapshot Controller | CSI VolumeSnapshot support | snapshot-controller | Storage |
| 10 | Prometheus (kube-prometheus-stack) | Metrics, alerting, Grafana dashboards | monitoring | Observability |
| 11 | Loki | Log aggregation and querying | monitoring | Observability |
| 12 | Fluent Bit | Log collection (DaemonSet) | monitoring | Observability |
| 13 | Blackbox Exporter | HTTP/TCP/ICMP endpoint probing | monitoring | Observability |
| 14 | SNMP Exporter | Network device monitoring | monitoring | Observability |
| 15 | Goldilocks | Resource right-sizing recommendations | goldilocks | Observability |
| 16 | CloudNativePG | PostgreSQL operator | cnpg-system | Data Services |
| 17 | Strimzi | Apache Kafka operator | strimzi-operator | Data Services |
| 18 | Velero | Kubernetes backup and disaster recovery | velero | Backup |
| 19 | Descheduler | Pod rebalancing across nodes | kube-system | Backup |
| 20 | Kyverno | Admission control and policy engine | kyverno | Policy Engine |
| 21 | Trivy Operator | Continuous vulnerability scanning | trivy-system | Vulnerability Scanning |
| 22 | Falco | Runtime syscall threat detection | falco | Runtime Security |
| 23 | Tracee | eBPF-based runtime forensics | tracee | Runtime Security |
| 24 | Keycloak | Centralised identity provider (OIDC/SAML) | keycloak | Identity Management |
| 25 | Crossplane | Kubernetes-native control plane framework (Keycloak provider) | crossplane-system | GitOps |

AWS-specific components

The AWS EBS CSI Driver, AWS Cloud Controller Manager, and AWS Load Balancer Controller replace Rook-Ceph and Cilium L2 announcements used on Bare Metal and Proxmox. These integrate Kubernetes with native AWS services (EBS volumes, NLBs, node lifecycle).

Bare Metal:

| # | Tool | Purpose | Namespace | Page |
|---|---|---|---|---|
| 1 | Cilium | CNI, kube-proxy replacement, L2 LB announcements, Gateway API, Hubble | kube-system | Networking |
| 2 | CoreDNS | Custom internal DNS | kube-system | Networking |
| 3 | cert-manager | X.509 certificate automation | cert-manager | Certificates |
| 4 | Flux | GitOps continuous deployment | flux-system | GitOps |
| 5 | Argo Rollouts | Progressive delivery (canary/blue-green) | argo-rollouts | GitOps |
| 6 | Rook-Ceph Operator | Ceph storage orchestrator | rook-ceph | Storage |
| 7 | Rook-Ceph Cluster | Ceph cluster (block, object, S3-compatible storage) | rook-ceph | Storage |
| 8 | Snapshot Controller | CSI VolumeSnapshot support | snapshot-controller | Storage |
| 9 | Prometheus (kube-prometheus-stack) | Metrics, alerting, Grafana dashboards | monitoring | Observability |
| 10 | Loki | Log aggregation and querying | monitoring | Observability |
| 11 | Fluent Bit | Log collection (DaemonSet) | monitoring | Observability |
| 12 | Blackbox Exporter | HTTP/TCP/ICMP endpoint probing | monitoring | Observability |
| 13 | SNMP Exporter | Network device monitoring | monitoring | Observability |
| 14 | Goldilocks | Resource right-sizing recommendations | goldilocks | Observability |
| 15 | CloudNativePG | PostgreSQL operator | cnpg-system | Data Services |
| 16 | Strimzi | Apache Kafka operator | strimzi-operator | Data Services |
| 17 | Velero | Kubernetes backup and disaster recovery | velero | Backup |
| 18 | Descheduler | Pod rebalancing across nodes | kube-system | Backup |
| 19 | Kyverno | Admission control and policy engine | kyverno | Policy Engine |
| 20 | Trivy Operator | Continuous vulnerability scanning | trivy-system | Vulnerability Scanning |
| 21 | Falco | Runtime syscall threat detection | falco | Runtime Security |
| 22 | Tracee | eBPF-based runtime forensics | tracee | Runtime Security |
| 23 | Keycloak | Centralised identity provider (OIDC/SAML) | keycloak | Identity Management |
| 24 | Crossplane | Kubernetes-native control plane framework (Keycloak provider) | crossplane-system | GitOps |

Bare Metal-specific components

Rook-Ceph provides block storage (RBD), object storage (RGW), and S3-compatible endpoints — replacing AWS EBS and S3. Cilium L2 announcements replace AWS NLBs for LoadBalancer Service IPs.


Proxmox-specific components

Identical to Bare Metal. Rook-Ceph provides storage, Cilium L2 announcements provide LoadBalancer IPs. The underlying Proxmox storage pools (ZFS, Ceph, LVM) are consumed by Rook-Ceph as raw block devices.


Deployment Order

Flux deploys components in a dependency-ordered sequence. Each component only starts after its dependencies are healthy. The order is defined in the cluster's infrastructure.yaml using dependsOn chains. The wave structure is the same across all environments — only the specific tools within each wave differ based on the storage and cloud integration model.

AWS:

Wave 1 ──── Cilium ───── cert-manager ───── Descheduler
Wave 2 ──── CoreDNS
Wave 3 ──── Argo Rollouts ──── AWS EBS CSI Driver ──── AWS Cloud Controller Manager
               │                  AWS Load Balancer Controller ──── Snapshot Controller ──── Crossplane
Wave 4 ──── Kyverno ──── Trivy Operator ──── Falco ──── Tracee
Wave 5 ──── CloudNativePG ── Strimzi ── Prometheus ── Loki ── Fluent Bit
Wave 6 ──── Blackbox Exporter ── SNMP Exporter ── Velero ── Goldilocks ── Keycloak

Bare Metal:

Wave 1 ──── Cilium ───── cert-manager ───── Descheduler
Wave 2 ──── CoreDNS
Wave 3 ──── Argo Rollouts ──── Rook-Ceph Operator ──── Snapshot Controller ──── Crossplane
Wave 4 ──── Kyverno ──── Trivy Operator ──── Falco ──── Tracee
Wave 5 ──── Rook-Ceph Cluster ── CloudNativePG ── Strimzi ── Prometheus ── Loki ── Fluent Bit
Wave 6 ──── Blackbox Exporter ── SNMP Exporter ── Velero ── Goldilocks ── Keycloak

Proxmox: same waves as Bare Metal above.

Why this order:

  • Wave 1 establishes networking (Cilium CNI) and certificate infrastructure — everything else depends on these.
  • Wave 2 adds custom DNS — required for internal service name resolution.
  • Wave 3 deploys the storage layer and Crossplane. On AWS, this is the EBS CSI Driver, Cloud Controller Manager, and Load Balancer Controller. On Bare Metal / Proxmox, this is the Rook-Ceph Operator. Crossplane enables declarative management of external resources (Keycloak).
  • Wave 4 deploys the security stack — admission control (Kyverno), vulnerability scanning (Trivy), and runtime detection (Falco, Tracee). These must be active before application workloads arrive so that every subsequent deployment is policy-enforced and monitored from the start.
  • Wave 5 creates the data layer. On Bare Metal / Proxmox, this includes the Rook-Ceph Cluster (block pools, object store, storage classes). On AWS, storage is already available from Wave 3 (EBS). All environments deploy CloudNativePG, Strimzi, and the full observability stack — now under Kyverno admission control and Falco runtime detection.
  • Wave 6 adds supplementary monitoring, backups, resource optimisation, and identity management. Keycloak deploys last because it requires CloudNativePG (Wave 5) for its PostgreSQL database.
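
In infrastructure.yaml, these waves are expressed as dependsOn chains between Flux Kustomizations — a Wave 2 component simply depends on a Wave 1 one. An illustrative sketch, following the infra-&lt;name&gt; naming pattern shown elsewhere in this guide:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-coredns
  namespace: flux-system
spec:
  dependsOn:
    - name: infra-cilium      # Wave 1 must be Ready before CoreDNS applies
  sourceRef:
    kind: GitRepository
    name: rciis-devops
  path: "./flux/infra/aws/coredns"
```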

Namespace Strategy

AWS:

| Namespace | Components |
|---|---|
| kube-system | Cilium, CoreDNS, Descheduler, AWS EBS CSI Driver, AWS Cloud Controller Manager, AWS Load Balancer Controller |
| cert-manager | cert-manager controller, webhook, cainjector |
| flux-system | Flux controllers (helm-controller, kustomize-controller, source-controller, notification-controller) |
| argo-rollouts | Argo Rollouts controller, dashboard |
| snapshot-controller | CSI snapshot controller |
| kyverno | Kyverno admission controller, policy reports |
| trivy-system | Trivy Operator scanner |
| falco | Falco runtime detection, Falcosidekick |
| tracee | Tracee eBPF runtime detection |
| keycloak | Keycloak identity provider, PostgreSQL instance |
| monitoring | Prometheus, Alertmanager, Grafana, Loki, Fluent Bit, exporters |
| cnpg-system | CloudNativePG operator |
| strimzi-operator | Strimzi Kafka operator |
| velero | Velero server, AWS S3 plugin |
| goldilocks | Goldilocks controller, dashboard |
| crossplane-system | Crossplane core, RBAC manager, Keycloak provider |

Bare Metal:

| Namespace | Components |
|---|---|
| kube-system | Cilium, CoreDNS, Descheduler |
| cert-manager | cert-manager controller, webhook, cainjector |
| flux-system | Flux controllers (helm-controller, kustomize-controller, source-controller, notification-controller) |
| argo-rollouts | Argo Rollouts controller, dashboard |
| rook-ceph | Rook operator, Ceph MONs, MGRs, OSDs, RGW |
| snapshot-controller | CSI snapshot controller |
| kyverno | Kyverno admission controller, policy reports |
| trivy-system | Trivy Operator scanner |
| falco | Falco runtime detection, Falcosidekick |
| tracee | Tracee eBPF runtime detection |
| keycloak | Keycloak identity provider, PostgreSQL instance |
| monitoring | Prometheus, Alertmanager, Grafana, Loki, Fluent Bit, exporters |
| cnpg-system | CloudNativePG operator |
| strimzi-operator | Strimzi Kafka operator |
| velero | Velero server, AWS S3 plugin |
| goldilocks | Goldilocks controller, dashboard |
| crossplane-system | Crossplane core, RBAC manager, Keycloak provider |

Proxmox: same namespaces as Bare Metal above.

HA vs Non-HA Deployment

Every tool in this section provides both an HA and Non-HA configuration variant, presented as tabbed code blocks.

| Variant | When to Use |
|---|---|
| HA | Production clusters with 3+ nodes. Multi-replica, topology-spread, full security hardening. |
| Non-HA | Development, single-node, or resource-constrained clusters. Single replica, reduced resources. |

Non-HA is not production-ready

The Non-HA configurations trade redundancy for simplicity. A single replica failure causes service downtime. Use HA for any environment where availability matters.
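
In practice the two variants differ mainly in replica counts, disruption budgets, and spread constraints — for example (field names are chart-dependent and shown for illustration only):

```yaml
# values.yaml — HA variant (illustrative): survives a node failure
replicaCount: 3
podDisruptionBudget:
  enabled: true
  minAvailable: 2
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule

# Non-HA variant: a single replica and no spread constraints —
# any pod or node failure takes the service down
# replicaCount: 1
```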


Secret Management

Infrastructure secrets (registry credentials, OAuth tokens, S3 keys, TLS certificates, database passwords) are managed via SOPS with Age encryption, using Flux's built-in SOPS decryption support — no external plugins required.

How It Works

Flux's kustomize-controller has native SOPS support. When a Kustomization resource includes a decryption stanza, the controller automatically decrypts any SOPS-encrypted files before applying them to the cluster:

flux/clusters/aws/infrastructure.yaml (ResourceSet template excerpt)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-cert-manager
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: rciis-devops
  path: "./flux/infra/aws/cert-manager"
  decryption:
    provider: sops
    secretRef:
      name: sops-age-aws    # Kubernetes Secret containing the Age private key

The Age private key is stored as a Kubernetes Secret in the flux-system namespace. Only Kustomizations that reference decryptionSecret: sops-age-aws in the ResourceSet will have SOPS decryption enabled — components without secrets use an empty string (decryptionSecret: "") and skip decryption entirely.

Secret Lifecycle

Developer encrypts secret     Encrypted YAML committed      Flux pulls from Git
with SOPS + Age public key    to Git (safe in plain sight)  and decrypts at apply time
         │                              │                              │
    sops -e secret.yaml ──► secret.enc.yaml ──► Git ──► kustomize-controller ──► K8s Secret

Encrypted secrets live in the Flux overlay directories (e.g., flux/infra/aws/cert-manager/) alongside their Kustomization files. See Credential Management for the full SOPS workflow including key generation and rotation.
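
The encryption step in the lifecycle above is a single SOPS invocation with the Age public key as the recipient (the key shown is a placeholder):

```shell
# Encrypt with the site's Age public key (placeholder shown);
# commit only the resulting .enc.yaml — never the plaintext
sops --encrypt --age age1exampleplaceholderpublickey \
  secret.yaml > secret.enc.yaml

# In-cluster decryption is automatic via the kustomize-controller.
# To inspect a secret locally, point SOPS at the Age private key:
export SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt
sops --decrypt secret.enc.yaml
```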


Prerequisites

Before deploying infrastructure, ensure:

  • [x] Talos Linux cluster is bootstrapped and healthy (Phase 4)
  • [x] kubectl access is configured with admin credentials
  • [x] Flux is bootstrapped in the cluster (GitOps)
  • [x] SOPS Age key is generated and distributed (Credential Management)
  • [x] DNS zone is configured for the cluster domain
  • [x] Cloudflare API token is available for cert-manager DNS-01 challenges

Next Steps

Once all prerequisites are met, proceed to install platform services in order:

  1. 5.1.1 Networking — Cilium CNI, ingress, and DNS
  2. 5.1.2 Certificates — cert-manager and ClusterIssuers
  3. 5.1.3 GitOps & Delivery — Flux configuration and Argo Rollouts
  4. 5.1.4 Storage — Rook-Ceph and snapshot controller
  5. 5.2 Security — Kyverno, Trivy, Falco, Tracee
  6. 5.3.1 Observability — Prometheus, Grafana, Loki, exporters
  7. 5.3.2 Data Services — CloudNativePG and Strimzi operators
  8. 5.3.3 Backup & Scheduling — Velero and Descheduler
  9. 5.3.4 Identity & Access Management — Keycloak
  10. 5.3.5 Key Management — HSM integration