3.1 Provision Compute¶
Compute provisioning on AWS has two stages: building the Talos AMI, then deploying EC2 instances via Terraform.
The Terraform project lives at terraform/cluster/aws/ and composes reusable modules from terraform/modules/aws/.
Prerequisites: Build the Talos AMI¶
There is no official Talos AMI in af-south-1. You must build and register a custom AMI before running Terraform.
Step 1: Generate the Talos Disk Image¶
Option A — Vanilla image (no extensions):
Download the official Talos disk image for AWS:
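A sketch of the download, assuming the vanilla (no-extension) Image Factory schematic ID used elsewhere in this guide and Talos v1.12.3:

```shell
# Vanilla (no-extension) Image Factory schematic ID
SCHEMATIC_ID="376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba"
TALOS_VERSION="v1.12.3"
URL="https://factory.talos.dev/image/${SCHEMATIC_ID}/${TALOS_VERSION}/aws-amd64.raw.xz"

# Download and decompress the raw disk image
curl -LO "$URL"
xz -d aws-amd64.raw.xz
```

This produces aws-amd64.raw, which is what gets uploaded to S3 in Step 2.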
Option B — Custom image with extensions:
Edit schematic.yaml to enable the extensions you need, then submit to Image Factory:
# Submit schematic and get the schematic ID
$SCHEMATIC_ID = (Invoke-RestMethod -Method Post -InFile schematic.yaml -Uri "https://factory.talos.dev/schematics").id
# Download the custom disk image
Invoke-WebRequest -Uri "https://factory.talos.dev/image/$SCHEMATIC_ID/v1.12.3/aws-amd64.raw.xz" -OutFile "aws-amd64.raw.xz"
7z x aws-amd64.raw.xz
Available extensions are listed at github.com/siderolabs/extensions.
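A minimal schematic.yaml sketch; the extension names below are illustrative only, so pick the ones your workload actually needs from that list:

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools       # example extension
      - siderolabs/util-linux-tools  # example extension
```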
Step 2: Upload to S3¶
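The raw image must be staged in an S3 bucket in the target region before VM Import can read it. A sketch, assuming a hypothetical bucket name talos-images-af-south-1:

```shell
BUCKET="talos-images-af-south-1"   # hypothetical bucket name
REGION="af-south-1"

# Create the bucket (skip if you already have one in this region)
aws s3 mb "s3://${BUCKET}" --region "$REGION"

# Upload the decompressed raw image
aws s3 cp aws-amd64.raw "s3://${BUCKET}/aws-amd64.raw" --region "$REGION"
```

Note that the AWS VM Import/Export service role (vmimport) must have read access to this bucket, otherwise the import in Step 3 fails with a permissions error.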
Step 3: Import as EBS Snapshot¶
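The later steps reference an $IMPORT_TASK variable; a sketch of how it would be set, assuming the bucket and key from Step 2:

```shell
# Where the raw image lives in S3 (bucket/key from Step 2)
DISK_CONTAINER="Format=RAW,UserBucket={S3Bucket=talos-images-af-south-1,S3Key=aws-amd64.raw}"

# Start the import and capture the task ID for the wait/describe commands below
IMPORT_TASK=$(aws ec2 import-snapshot \
  --region af-south-1 \
  --description "Talos v1.12.3 disk image" \
  --disk-container "$DISK_CONTAINER" \
  --query 'ImportTaskId' --output text)
echo "Import task: $IMPORT_TASK"
```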
Check import progress (optional)
If you want to monitor progress while the import is running (e.g. in a separate terminal), use describe-import-snapshot-tasks. Unlike wait, this returns immediately with the current status and percentage:
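A sketch of that status check, assuming $IMPORT_TASK was captured when the import was started:

```shell
REGION="af-south-1"

# Returns immediately with the current status and percent complete
aws ec2 describe-import-snapshot-tasks \
  --region "$REGION" \
  --import-task-ids "$IMPORT_TASK" \
  --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.[Status,Progress,StatusMessage]' \
  --output table
```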
Step 4: Wait for Import and Get Snapshot ID¶
The wait snapshot-imported command blocks until the import finishes — it polls automatically and returns when the snapshot is ready:
aws ec2 wait snapshot-imported \
--region af-south-1 \
--import-task-ids $IMPORT_TASK
SNAPSHOT_ID=$(aws ec2 describe-import-snapshot-tasks \
--region af-south-1 \
--import-task-ids $IMPORT_TASK \
--query 'ImportSnapshotTasks[0].SnapshotTaskDetail.SnapshotId' \
--output text)
echo "Snapshot: $SNAPSHOT_ID"
aws ec2 wait snapshot-imported `
--region af-south-1 `
--import-task-ids $IMPORT_TASK
$SNAPSHOT_ID = aws ec2 describe-import-snapshot-tasks `
--region af-south-1 `
--import-task-ids $IMPORT_TASK `
--query 'ImportSnapshotTasks[0].SnapshotTaskDetail.SnapshotId' `
--output text
Write-Output "Snapshot: $SNAPSHOT_ID"
Step 5: Register the AMI¶
VolumeSize must match the snapshot size
The VolumeSize in --block-device-mappings must be equal to or larger than the imported snapshot. The Talos v1.12.3 snapshot is 11 GiB. Using a smaller value (e.g. 4) will fail with InvalidParameterValue.
AMI=$(aws ec2 register-image \
--region af-south-1 \
--name "talos-v1.12.3-rciis" \
--root-device-name /dev/xvda \
--block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=${SNAPSHOT_ID},VolumeSize=11,VolumeType=gp3,DeleteOnTermination=true}" \
--virtualization-type hvm \
--architecture x86_64 \
--ena-support \
--query 'ImageId' --output text)
echo "AMI: $AMI"
$AMI = aws ec2 register-image `
--region af-south-1 `
--name "talos-v1.12.3-rciis" `
--root-device-name /dev/xvda `
--block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=$SNAPSHOT_ID,VolumeSize=11,VolumeType=gp3,DeleteOnTermination=true}" `
--virtualization-type hvm `
--architecture x86_64 `
--ena-support `
--query 'ImageId' --output text
Write-Output "AMI: $AMI"
Deploy EC2 Instances with Terraform¶
Project Structure¶
terraform/
├── cluster/
│ ├── aws/
│ │ ├── main.tf # Root module — composes all submodules
│ │ ├── variables.tf # All input variables with defaults
│ │ └── outputs.tf # Cluster outputs (IPs, endpoints, ARNs)
│ └── envs/
│ └── aws.tfvars # Environment-specific variable overrides
└── modules/
└── aws/
├── compute/ # EC2 instances + volume attachments
├── network/ # VPC, subnets, IGW, NAT, security groups
├── loadbalancer/ # NLB + target groups + listeners
├── ebs/ # Encrypted gp3 data volumes
├── iam/ # IAM role, instance profile, CCM + LBC policies
└── sqs/ # Event queues with dead-letter queue
Step 1: Set the AMI ID in aws.tfvars¶
Open terraform/cluster/envs/aws.tfvars and set talos_ami to the AMI you registered above:
talos_ami = "ami-07545b2983cd8fb22" # Replace with your AMI ID
talos_version = "v1.12.3"
kubernetes_version = "1.34.1"
Note
When talos_ami is set, Terraform skips the automatic AMI lookup (which only works in regions with official Sidero Labs AMIs). This is required for af-south-1.
Step 2: Review and Customise Variables¶
Key variables in terraform/cluster/envs/aws.tfvars:
# Cluster identity
cluster_name = "rciis-aws"
environment = "rciis"
# AWS region and AZs
region = "af-south-1"
aws_profile = "cbt"
availability_zones = ["af-south-1a"]
# Compute sizing
control_plane_count = 1
worker_count = 1
control_plane_instance_type = "t3.medium"
worker_instance_type = "t3.xlarge"
# Storage
root_volume_size = 50
control_plane_volume_size = 0 # No separate CP data volume
worker_volume_size = 100
# Security — restrict API access to allowed IPs
allowed_admin_cidrs = [
"196.45.28.20/32",
]
Step 3: Deploy¶
cd terraform/cluster/aws
# Initialise providers (hashicorp/aws ~> 6.34, siderolabs/talos ~> 0.10.1)
terraform init
# Preview the deployment
terraform plan -var-file=../envs/aws.tfvars
# Deploy (creates VPC, EC2, NLB, EBS, IAM, SQS)
terraform apply -var-file=../envs/aws.tfvars
Terraform will display a plan summary and prompt for confirmation before creating resources.
What Gets Created¶
The terraform/modules/aws/compute module provisions:
Control Plane Instances (default: 3):
| Property | Value |
|---|---|
| Instance type | t3.large (configurable via control_plane_instance_type) |
| Subnet | Private (distributed across AZs round-robin) |
| Security group | Control plane SG |
| IAM profile | talos-node-role-profile |
| Root volume | gp3, encrypted, delete_on_termination = true |
| EBS optimised | Yes |
| Source/dest check | Disabled (required for Cilium) |
| IMDSv2 | Required (http_tokens = "required") |
| Data volume | Attached at /dev/xvdf (configurable size, 0 = none) |
Worker Instances (default: 5):
| Property | Value |
|---|---|
| Instance type | t3.xlarge (configurable via worker_instance_type) |
| Subnet | Private (distributed across AZs round-robin) |
| Security group | Worker SG |
| IAM profile | talos-node-role-profile |
| Root volume | gp3, encrypted, delete_on_termination = true |
| EBS optimised | Yes |
| Source/dest check | Disabled (required for Cilium) |
| IMDSv2 | Required (http_tokens = "required") |
| Data volume | Attached at /dev/xvdf (default 200 GB) |
SQS Queues (terraform/modules/aws/sqs):
| Property | Value |
|---|---|
| Events queue | <env>-events (Standard) |
| Dead letter queue | <env>-events-dlq (Standard) |
| Encryption | SSE enabled (SQS-managed) |
| Message retention | 4 days (events), 14 days (DLQ) |
| Long polling | 20s (receive_wait_time_seconds) |
| Redrive policy | 3 max receives then DLQ |
IAM (terraform/modules/aws/iam):
| Resource | Purpose |
|---|---|
| talos-node-role | EC2 instance role (assume role for ec2.amazonaws.com) |
| talos-node-role-profile | Instance profile attached to all nodes |
| CCM policy | AWS Cloud Controller Manager permissions (optional, enable_ccm) |
| LB Controller policy | AWS Load Balancer Controller permissions (optional, enable_lb_controller) |
| SSM policy | Systems Manager for debugging (optional, enable_ssm) |
Resource Tags¶
All instances are tagged with:
| Tag | Value |
|---|---|
| ManagedBy | terraform |
| Environment | Environment name |
| Project | rciis |
| Cluster | Cluster name |
| NodeType | control-plane or worker |
| Schedule | always-on (CP) or weekday-business-hours (workers) |
| KubernetesCluster | Cluster name (required by AWS CCM for resource discovery) |
Terraform Outputs¶
After terraform apply, retrieve key information:
# NLB DNS name (used as Kubernetes API endpoint)
terraform output nlb_dns_name
# Kubernetes API endpoint URL
terraform output kubernetes_api_endpoint
# Talos API endpoints
terraform output talos_api_endpoint
terraform output talos_worker_api_endpoint
# Node private IPs (for talconfig.yaml)
terraform output control_plane_private_ips
terraform output worker_private_ips
# VPC and AMI info
terraform output vpc_id
terraform output talos_ami_id
# Full deployment summary
terraform output deployment_summary
Override at Deploy Time¶
Override any variable without editing aws.tfvars:
terraform apply -var-file=../envs/aws.tfvars \
-var="control_plane_count=1" \
-var="worker_count=2" \
-var="control_plane_instance_type=t3a.small" \
-var="worker_instance_type=t3a.medium"
Teardown¶
Delete Kubernetes-created AWS resources first
If the AWS Load Balancer Controller or AWS Cloud Controller Manager created any resources inside the VPC (NLBs, ALBs, target groups, security groups), you must delete them before running terraform destroy. These resources are not managed by Terraform, and AWS will block VPC, IGW, and public subnet deletion with DependencyViolation errors.
Check for orphaned resources:
# List load balancers in the VPC
aws elbv2 describe-load-balancers --region af-south-1 \
--query "LoadBalancers[?VpcId=='$(terraform output -raw vpc_id)'].{ARN:LoadBalancerArn,Name:LoadBalancerName,Type:Type}" \
--output table
# List target groups in the VPC
aws elbv2 describe-target-groups --region af-south-1 \
--query "TargetGroups[?VpcId=='$(terraform output -raw vpc_id)'].{ARN:TargetGroupArn,Name:TargetGroupName}" \
--output table
# List security groups in the VPC (exclude the default SG)
aws ec2 describe-security-groups --region af-south-1 \
--filters "Name=vpc-id,Values=$(terraform output -raw vpc_id)" \
--query "SecurityGroups[?GroupName!='default'].{ID:GroupId,Name:GroupName,Desc:Description}" \
--output table
Delete orphaned load balancers and target groups:
# Delete each K8s-created load balancer (get ARNs from the table above)
aws elbv2 delete-load-balancer --region af-south-1 \
--load-balancer-arn <LB_ARN>
# Wait for the LB to finish draining
aws elbv2 wait load-balancers-deleted --region af-south-1 \
--load-balancer-arns <LB_ARN>
# Delete each orphaned target group
aws elbv2 delete-target-group --region af-south-1 \
--target-group-arn <TG_ARN>
# Delete each orphaned security group
aws ec2 delete-security-group --region af-south-1 \
--group-id <SG_ID>
Once all Kubernetes-created resources are cleaned up, destroy the Terraform-managed infrastructure:
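A sketch of the destroy, mirroring the apply invocation used at deploy time:

```shell
TFVARS="../envs/aws.tfvars"   # relative to terraform/cluster/aws

cd terraform/cluster/aws

# Destroy all Terraform-managed resources (VPC, EC2, NLB, EBS, IAM, SQS)
terraform destroy -var-file="$TFVARS"
```

Terraform prompts for confirmation before deleting anything.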
Bare metal compute provisioning involves booting physical servers with Talos Linux via PXE network boot, ISO image, or USB boot media. There is no VM template or AMI; Talos is installed directly onto the server's disk.
Step 0: BIOS & IPMI Configuration¶
Before booting Talos, configure each server's BIOS and IPMI/BMC settings.
Configure IPMI/BMC¶
- Connect to each server's IPMI interface (iDRAC, iLO, or similar) via the IPMI network
- Assign static IP addresses from the server inventory:

| Server | IPMI IP |
|---|---|
| rciis-cp-01 | 192.168.10.31 |
| rciis-cp-02 | 192.168.10.32 |
| rciis-cp-03 | 192.168.10.33 |
| rciis-wn-01 | 192.168.10.34 |
| rciis-wn-02 | 192.168.10.35 |
| rciis-wn-03 | 192.168.10.36 |

- Set IPMI admin credentials and record them securely
- Verify the IPMI web console is accessible from the management workstation
Configure BIOS Settings¶
Enter each server's BIOS setup and apply the following:
| Setting | Value | Reason |
|---|---|---|
| Virtualisation (VT-x / VT-d) | Enabled | Required for Kubernetes workloads |
| Boot mode | UEFI | Talos requires UEFI boot |
| Secure Boot | Disabled | Talos signs its own boot chain |
| Boot order | PXE first (or USB/ISO) | Matches your chosen boot method |
| Wake-on-LAN | Enabled (optional) | Allows remote power-on via IPMI |
Secure Boot
Talos Linux does not use the platform's Secure Boot chain. Leaving it enabled will prevent Talos from booting.
Boot Methods¶
Choose a boot method based on your environment:
| Method | Best For | Requirements |
|---|---|---|
| PXE boot | Deploying many servers | DHCP + TFTP infrastructure, IPMI access |
| ISO boot | Small deployments, testing | ISO image, physical/IPMI console access |
| USB boot | Air-gapped environments | USB drive, physical access |
Step 1: Download Talos Boot Media¶
Download from the Talos Image Factory:
Vanilla image (no extensions):
# ISO
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/metal-amd64.iso
# PXE kernel + initramfs
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/kernel-amd64
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/initramfs-amd64.xz
Custom image with extensions (e.g., ZFS, iSCSI):
# Submit schematic and extract the schematic ID (requires jq)
SCHEMATIC_ID=$(curl -X POST --data-binary @schematic.yaml \
  https://factory.talos.dev/schematics | jq -r '.id')
# Download custom ISO
curl -LO "https://factory.talos.dev/image/${SCHEMATIC_ID}/v1.12.0/metal-amd64.iso"
Step 2: Boot via PXE¶
If using PXE, configure your DHCP/TFTP server to serve the Talos kernel and initramfs. Example dnsmasq configuration:
# /etc/dnsmasq.d/talos-pxe.conf
enable-tftp
tftp-root=/srv/tftp
# BIOS boot
dhcp-boot=pxelinux.0
# UEFI boot
dhcp-match=set:efi-x86_64,option:client-arch,7
dhcp-boot=tag:efi-x86_64,ipxe.efi
Place the Talos kernel and initramfs in /srv/tftp/ and create a PXE boot menu entry that boots into the Talos installer.
Alternatively, use iPXE with a script:
#!ipxe
kernel https://factory.talos.dev/image/<schematic-id>/v1.12.0/kernel-amd64 talos.platform=metal
initrd https://factory.talos.dev/image/<schematic-id>/v1.12.0/initramfs-amd64.xz
boot
Step 3: Boot via ISO or USB¶
For ISO-based installs:
- Write the ISO to a USB drive:
- Or mount via IPMI virtual media (iDRAC, iLO, or similar)
- Boot the server from the ISO/USB
- Talos boots into maintenance mode, waiting for a machine configuration
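The USB write in the list above can be sketched with dd; the device name /dev/sdX is a placeholder, so confirm yours with lsblk before writing:

```shell
ISO="metal-amd64.iso"
USB_DEV="/dev/sdX"   # placeholder -- dd will overwrite this device completely

# Write the ISO and flush to the device before removal
sudo dd if="$ISO" of="$USB_DEV" bs=4M status=progress conv=fsync
```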
Step 4: Verify Connectivity¶
After all servers are booted into Talos maintenance mode, verify network connectivity from your management workstation.
Ping all node IPs:
for ip in 192.168.30.{31..36}; do
ping -c 1 -W 2 "$ip" && echo "$ip OK" || echo "$ip UNREACHABLE"
done
Verify IPMI access:
for ip in 192.168.10.{31..36}; do
curl -sk --connect-timeout 3 "https://$ip" -o /dev/null && echo "$ip IPMI OK" || echo "$ip IPMI UNREACHABLE"
done
Check NIC link status (from the Talos maintenance shell via talosctl):
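A sketch, assuming the first control plane node from the inventory; nodes in maintenance mode have no client certificates yet, hence --insecure:

```shell
NODE="192.168.30.31"   # first control plane node from the inventory

# List network interfaces and their operational state
talosctl -n "$NODE" get links --insecure
```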
Verify default gateway reachability:
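A sketch, using the gateway address from the checklist and the first node from the inventory:

```shell
GATEWAY="192.168.30.1"

# From the management workstation
ping -c 3 "$GATEWAY"

# From a node in maintenance mode, confirm a default route exists
talosctl -n 192.168.30.31 get routes --insecure
```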
Connectivity checklist:
- [ ] All node IPs respond to ping from management workstation
- [ ] All IPMI web consoles are accessible
- [ ] NIC link status shows `UP` for all connected interfaces
- [ ] Nodes can reach the default gateway (192.168.30.1)
- [ ] DNS resolution works (if DNS is configured at this stage)
Tip
If a node is unreachable, check the physical cabling and switch port configuration in Set Up Network Fabric before proceeding.
What Gets Created¶
After booting, each server runs Talos in maintenance mode:
- Talos is loaded into RAM from the boot media
- No configuration is applied yet — the node waits on port 50000 for `talosctl apply-config`
- The machine configuration (applied in 4.2 Boot & Install) determines the install disk, network settings, and role
Hardware specifications:
| Role | Qty | CPU | RAM | OS Disk | Data Disk | NICs |
|---|---|---|---|---|---|---|
| Control plane | 3 | 8 cores | 32 GB | 2 x 512 GB SAS SSD (RAID 1) | — | 2 x 25 GbE + 1 x MGMT LAN (OOB) + 1 x IPMI |
| Worker | 3 | 16 cores | 128 GB | 2 x 512 GB SAS SSD (RAID 1) | 9 TB SAS HDD + SSD cache | 2 x 25 GbE + 1 x MGMT LAN (OOB) + 1 x IPMI |
Network interfaces
Each node has four network connections:
- 2 x 25 GbE — Production traffic (bonded or active/passive)
- 1 x MGMT LAN — Out-of-band management network
- 1 x IPMI — Baseboard management controller (iDRAC, iLO, etc.)
Server Inventory¶
Document your servers before proceeding:
| Hostname | Role | IP Address | MGMT IP | IPMI IP | Install Disk |
|---|---|---|---|---|---|
| rciis-cp-01 | control-plane | 192.168.30.31 | 192.168.20.31 | 192.168.10.31 | /dev/sda |
| rciis-cp-02 | control-plane | 192.168.30.32 | 192.168.20.32 | 192.168.10.32 | /dev/sda |
| rciis-cp-03 | control-plane | 192.168.30.33 | 192.168.20.33 | 192.168.10.33 | /dev/sda |
| rciis-wn-01 | worker | 192.168.30.34 | 192.168.20.34 | 192.168.10.34 | /dev/sda |
| rciis-wn-02 | worker | 192.168.30.35 | 192.168.20.35 | 192.168.10.35 | /dev/sda |
| rciis-wn-03 | worker | 192.168.30.36 | 192.168.20.36 | 192.168.10.36 | /dev/sda |
Proxmox compute provisioning uses Terraform to clone VMs from a pre-built Talos template. The Terraform project at terraform/cluster/ handles VM creation, network configuration via cloud-init, and Talos bootstrapping in a single terraform apply.
Prerequisites: Create the Talos VM Template¶
Before Terraform can clone VMs, you need a Talos VM template on Proxmox. This is a one-time setup.
Step 1: Download the Talos Image¶
# On the Proxmox node
cd /tmp
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/nocloud-amd64.raw.xz
xz -d nocloud-amd64.raw.xz
Note
Use nocloud-amd64.raw.xz (not the AWS or metal image). The nocloud platform supports cloud-init, which is how Proxmox passes network configuration to the VM.
Step 2: Create the Template VM¶
# Create a VM (use the template ID from your tfvars, e.g., 9000)
qm create 9000 --name "talos-v1.12.0-template" \
--memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 \
--bios ovmf
# Import the disk image
qm importdisk 9000 /tmp/nocloud-amd64.raw local-lvm
# Attach the imported disk
qm set 9000 --scsi0 local-lvm:vm-9000-disk-0 \
--scsihw virtio-scsi-pci --boot order=scsi0
# Add a cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit
# Enable QEMU guest agent
qm set 9000 --agent enabled=1
# Convert to template
qm template 9000
Warning
The template_vm_id in your .tfvars file must match the ID used here (e.g., 9000). Terraform clones this template for every node.
Deploy VMs with Terraform¶
With the template ready, deploy the full cluster:
cd terraform/cluster
# Initialise providers
terraform init
# Set sensitive variables
export TF_VAR_proxmox_api_token="root@pam!IaC=<your-secret>"
# Preview the deployment
terraform plan -var-file=envs/proxmox.tfvars
# Deploy (creates VMs, applies Talos config, bootstraps cluster)
terraform apply -var-file=envs/proxmox.tfvars
What Gets Created¶
Terraform performs the following in order:
- Clones VMs from the template (one per CP + one per worker)
- Configures networking via cloud-init (static IPs, gateway, DNS)
- Generates Talos machine secrets (certificates, tokens)
- Applies Talos machine configuration to each node via `talos_machine_configuration_apply`
- Bootstraps the cluster via `talos_machine_bootstrap` on the first CP node
- Retrieves kubeconfig and talosconfig for cluster access
Control Plane VMs (Proxmox environment — 3 nodes):
| Property | Value |
|---|---|
| VM IDs | 601, 602, 603 |
| Names | rciis-proxmox-cp-01, -02, -03 |
| CPU | 2 cores (host passthrough) |
| RAM | 2048 MB (ballooning disabled) |
| OS disk | 50 GB on local-lvm (scsi0) |
| Network | vmbr0, static IPs 192.168.30.31-33/24 |
| VIP | 192.168.30.30 (kube-vip on eth0) |
| Tags | talos, kubernetes, control-plane, rciis, proxmox |
Worker VMs (Proxmox environment — 3 nodes):
| Property | Value |
|---|---|
| VM IDs | 604, 605, 606 |
| Names | rciis-proxmox-wn-01, -02, -03 |
| CPU | 4 cores (host passthrough) |
| RAM | 4096 MB (ballooning disabled) |
| OS disk | 50 GB on local-lvm (scsi0) |
| Data disk | 50 GB on local-lvm (scsi1) |
| Network | vmbr0, static IPs 192.168.30.34-36/24 |
| Tags | talos, kubernetes, worker, rciis, proxmox |
Teardown¶
Warning
This destroys all VMs and their disks. Machine secrets in the Terraform state will also be lost. Back up terraform output -raw machine_secrets before destroying.
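A sketch of a safe teardown sequence; the backup filename is an arbitrary choice:

```shell
BACKUP="machine-secrets.backup.yaml"   # arbitrary backup filename

cd terraform/cluster

# Save the machine secrets held in Terraform state before they are destroyed
terraform output -raw machine_secrets > "$BACKUP"

# Tear down all VMs and their disks
terraform destroy -var-file=envs/proxmox.tfvars
```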