3.1 Provision Compute

Compute provisioning on AWS has two stages: building the Talos AMI, then deploying EC2 instances via Terraform.

The Terraform project lives at terraform/cluster/aws/ and composes reusable modules from terraform/modules/aws/.

Prerequisites: Build the Talos AMI

There is no official Talos AMI in af-south-1. You must build and register a custom AMI before running Terraform.

Step 1: Generate the Talos Disk Image

Option A — Vanilla image (no extensions):

Download the official Talos disk image for AWS:

Bash:

curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.3/aws-amd64.raw.xz
xz -d aws-amd64.raw.xz

PowerShell:

Invoke-WebRequest -Uri "https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.3/aws-amd64.raw.xz" -OutFile "aws-amd64.raw.xz"
7z x aws-amd64.raw.xz

Option B — Custom image with extensions:

Edit schematic.yaml to enable the extensions you need, then submit to Image Factory:

Bash:

# Submit schematic and get the schematic ID
# (the endpoint returns JSON such as {"id":"<64-char hash>"}, so extract the field)
SCHEMATIC_ID=$(curl -X POST --data-binary @schematic.yaml \
  https://factory.talos.dev/schematics | jq -r '.id')

# Download the custom disk image
curl -LO "https://factory.talos.dev/image/${SCHEMATIC_ID}/v1.12.3/aws-amd64.raw.xz"
xz -d aws-amd64.raw.xz

PowerShell:

# Submit schematic and get the schematic ID (Invoke-RestMethod parses the JSON response)
$SCHEMATIC_ID = (Invoke-RestMethod -Method Post -InFile schematic.yaml -Uri "https://factory.talos.dev/schematics").id

# Download the custom disk image
Invoke-WebRequest -Uri "https://factory.talos.dev/image/$SCHEMATIC_ID/v1.12.3/aws-amd64.raw.xz" -OutFile "aws-amd64.raw.xz"
7z x aws-amd64.raw.xz

Available extensions are listed at github.com/siderolabs/extensions.
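For reference, a minimal schematic.yaml follows the Image Factory's `customization.systemExtensions` shape. The extension names below are illustrative examples, not requirements; pick whichever extensions your workloads actually need:

```yaml
# schematic.yaml — sketch only; swap in the extensions you require
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
```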

Step 2: Upload to S3

Bash:

aws s3 cp aws-amd64.raw \
  s3://rciis-talos-images-af-south-1/v1.12.3/aws-amd64.raw \
  --region af-south-1

PowerShell:

aws s3 cp aws-amd64.raw `
  s3://rciis-talos-images-af-south-1/v1.12.3/aws-amd64.raw `
  --region af-south-1

Step 3: Import as EBS Snapshot

Bash:

IMPORT_TASK=$(aws ec2 import-snapshot \
  --region af-south-1 \
  --description "Talos v1.12.3" \
  --disk-container "Format=raw,UserBucket={S3Bucket=rciis-talos-images-af-south-1,S3Key=v1.12.3/aws-amd64.raw}" \
  --query 'ImportTaskId' --output text)

echo "Import task: $IMPORT_TASK"

PowerShell:

$IMPORT_TASK = aws ec2 import-snapshot `
  --region af-south-1 `
  --description "Talos v1.12.3" `
  --disk-container "Format=raw,UserBucket={S3Bucket=rciis-talos-images-af-south-1,S3Key=v1.12.3/aws-amd64.raw}" `
  --query 'ImportTaskId' --output text

Write-Output "Import task: $IMPORT_TASK"

Check import progress (optional)

If you want to monitor progress while the import is running (e.g. in a separate terminal), use describe-import-snapshot-tasks. Unlike wait, this returns immediately with the current status and percentage:

Bash:

aws ec2 describe-import-snapshot-tasks \
  --import-task-ids $IMPORT_TASK \
  --region af-south-1 \
  --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.{Status:Status,Progress:Progress,SnapshotId:SnapshotId}' \
  --output table

PowerShell:

aws ec2 describe-import-snapshot-tasks `
  --import-task-ids $IMPORT_TASK `
  --region af-south-1 `
  --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.{Status:Status,Progress:Progress,SnapshotId:SnapshotId}' `
  --output table

Step 4: Wait for Import and Get Snapshot ID

The wait snapshot-imported command blocks until the import finishes — it polls automatically and returns when the snapshot is ready. Note that the waiter gives up after roughly 10 minutes of polling; if a large import takes longer, simply re-run the command.

Bash:

aws ec2 wait snapshot-imported \
  --region af-south-1 \
  --import-task-ids $IMPORT_TASK

SNAPSHOT_ID=$(aws ec2 describe-import-snapshot-tasks \
  --region af-south-1 \
  --import-task-ids $IMPORT_TASK \
  --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.SnapshotId' \
  --output text)

echo "Snapshot: $SNAPSHOT_ID"

PowerShell:

aws ec2 wait snapshot-imported `
  --region af-south-1 `
  --import-task-ids $IMPORT_TASK

$SNAPSHOT_ID = aws ec2 describe-import-snapshot-tasks `
  --region af-south-1 `
  --import-task-ids $IMPORT_TASK `
  --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.SnapshotId' `
  --output text

Write-Output "Snapshot: $SNAPSHOT_ID"

Step 5: Register the AMI

VolumeSize must match the snapshot size

The VolumeSize in --block-device-mappings must be equal to or larger than the imported snapshot. The Talos v1.12.3 snapshot is 11 GiB. Using a smaller value (e.g. 4) will fail with InvalidParameterValue.
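As a quick sanity check before registering, you can derive the minimum valid VolumeSize from the raw image itself by rounding its size up to whole GiB. This is a sketch: the truncate call merely simulates an 11 GiB image so the arithmetic is reproducible; in practice you would run the stat line against your real aws-amd64.raw.

```shell
# Simulate the decompressed image (stand-in for the real aws-amd64.raw)
truncate -s 11G aws-amd64.raw

# Round the byte count up to whole GiB — the minimum valid VolumeSize
bytes=$(stat -c %s aws-amd64.raw)
gib=$(( (bytes + 1073741823) / 1073741824 ))
echo "minimum VolumeSize: ${gib} GiB"
```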

Bash:

AMI=$(aws ec2 register-image \
  --region af-south-1 \
  --name "talos-v1.12.3-rciis" \
  --root-device-name /dev/xvda \
  --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=${SNAPSHOT_ID},VolumeSize=11,VolumeType=gp3,DeleteOnTermination=true}" \
  --virtualization-type hvm \
  --architecture x86_64 \
  --ena-support \
  --query 'ImageId' --output text)

echo "AMI: $AMI"

PowerShell:

$AMI = aws ec2 register-image `
  --region af-south-1 `
  --name "talos-v1.12.3-rciis" `
  --root-device-name /dev/xvda `
  --block-device-mappings "DeviceName=/dev/xvda,Ebs={SnapshotId=$SNAPSHOT_ID,VolumeSize=11,VolumeType=gp3,DeleteOnTermination=true}" `
  --virtualization-type hvm `
  --architecture x86_64 `
  --ena-support `
  --query 'ImageId' --output text

Write-Output "AMI: $AMI"

Deploy EC2 Instances with Terraform

Project Structure

terraform/
├── cluster/
│   ├── aws/
│   │   ├── main.tf          # Root module — composes all submodules
│   │   ├── variables.tf     # All input variables with defaults
│   │   └── outputs.tf       # Cluster outputs (IPs, endpoints, ARNs)
│   └── envs/
│       └── aws.tfvars       # Environment-specific variable overrides
└── modules/
    └── aws/
        ├── compute/         # EC2 instances + volume attachments
        ├── network/         # VPC, subnets, IGW, NAT, security groups
        ├── loadbalancer/    # NLB + target groups + listeners
        ├── ebs/             # Encrypted gp3 data volumes
        ├── iam/             # IAM role, instance profile, CCM + LBC policies
        └── sqs/             # Event queues with dead-letter queue

Step 1: Set the AMI ID in aws.tfvars

Open terraform/cluster/envs/aws.tfvars and set talos_ami to the AMI you registered above:

terraform/cluster/envs/aws.tfvars
talos_ami          = "ami-07545b2983cd8fb22"  # Replace with your AMI ID
talos_version      = "v1.12.3"
kubernetes_version = "1.34.1"

Note

When talos_ami is set, Terraform skips the automatic AMI lookup (which only works in regions with official Sidero Labs AMIs). This is required for af-south-1.

Step 2: Review and Customise Variables

Key variables in terraform/cluster/envs/aws.tfvars:

terraform/cluster/envs/aws.tfvars
# Cluster identity
cluster_name = "rciis-aws"
environment  = "rciis"

# AWS region and AZs
region             = "af-south-1"
aws_profile        = "cbt"
availability_zones = ["af-south-1a"]

# Compute sizing
control_plane_count         = 1
worker_count                = 1
control_plane_instance_type = "t3.medium"
worker_instance_type        = "t3.xlarge"

# Storage
root_volume_size          = 50
control_plane_volume_size = 0    # No separate CP data volume
worker_volume_size        = 100

# Security — restrict API access to allowed IPs
allowed_admin_cidrs = [
  "196.45.28.20/32",
]
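A typo in allowed_admin_cidrs can lock you out of the Talos and Kubernetes APIs, so it is worth validating the entries before applying. A pure-bash sketch (the list contents are, of course, your own):

```shell
# Verify each entry looks like an IPv4 /32 CIDR before running terraform apply
cidrs=("196.45.28.20/32")
for c in "${cidrs[@]}"; do
  if [[ $c =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}/32$ ]]; then
    echo "$c ok"
  else
    echo "$c malformed"
  fi
done
```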

Step 3: Deploy

Deploy the cluster infrastructure
cd terraform/cluster/aws

# Initialise providers (hashicorp/aws ~> 6.34, siderolabs/talos ~> 0.10.1)
terraform init

# Preview the deployment
terraform plan -var-file=../envs/aws.tfvars

# Deploy (creates VPC, EC2, NLB, EBS, IAM, SQS)
terraform apply -var-file=../envs/aws.tfvars

Terraform will display a plan summary and prompt for confirmation before creating resources.

What Gets Created

The terraform/modules/aws/compute module provisions:

Control Plane Instances (default: 3):

Property Value
Instance type t3.large (configurable via control_plane_instance_type)
Subnet Private (distributed across AZs round-robin)
Security group Control plane SG
IAM profile talos-node-role-profile
Root volume gp3, encrypted, delete_on_termination = true
EBS optimised Yes
Source/dest check Disabled (required for Cilium)
IMDSv2 Required (http_tokens = "required")
Data volume Attached at /dev/xvdf (configurable size, 0 = none)

Worker Instances (default: 5):

Property Value
Instance type t3.xlarge (configurable via worker_instance_type)
Subnet Private (distributed across AZs round-robin)
Security group Worker SG
IAM profile talos-node-role-profile
Root volume gp3, encrypted, delete_on_termination = true
EBS optimised Yes
Source/dest check Disabled (required for Cilium)
IMDSv2 Required (http_tokens = "required")
Data volume Attached at /dev/xvdf (default 200 GB)

SQS Queues (terraform/modules/aws/sqs):

Property Value
Events queue <env>-events (Standard)
Dead letter queue <env>-events-dlq (Standard)
Encryption SSE enabled (SQS-managed)
Message retention 4 days (events), 14 days (DLQ)
Long polling 20s (receive_wait_time_seconds)
Redrive policy 3 max receives then DLQ

IAM (terraform/modules/aws/iam):

Resource Purpose
talos-node-role EC2 instance role (assume role for ec2.amazonaws.com)
talos-node-role-profile Instance profile attached to all nodes
CCM policy AWS Cloud Controller Manager permissions (optional, enable_ccm)
LB Controller policy AWS Load Balancer Controller permissions (optional, enable_lb_controller)
SSM policy Systems Manager for debugging (optional, enable_ssm)

Resource Tags

All instances are tagged with:

Tag Value
ManagedBy terraform
Environment Environment name
Project rciis
Cluster Cluster name
NodeType control-plane or worker
Schedule always-on (CP) or weekday-business-hours (workers)
KubernetesCluster Cluster name (required by AWS CCM for resource discovery)

Terraform Outputs

After terraform apply, retrieve key information:

# NLB DNS name (used as Kubernetes API endpoint)
terraform output nlb_dns_name

# Kubernetes API endpoint URL
terraform output kubernetes_api_endpoint

# Talos API endpoints
terraform output talos_api_endpoint
terraform output talos_worker_api_endpoint

# Node private IPs (for talconfig.yaml)
terraform output control_plane_private_ips
terraform output worker_private_ips

# VPC and AMI info
terraform output vpc_id
terraform output talos_ami_id

# Full deployment summary
terraform output deployment_summary

Override at Deploy Time

Override any variable without editing aws.tfvars:

terraform apply -var-file=../envs/aws.tfvars \
  -var="control_plane_count=1" \
  -var="worker_count=2" \
  -var="control_plane_instance_type=t3a.small" \
  -var="worker_instance_type=t3a.medium"

Teardown

Delete Kubernetes-created AWS resources first

If the AWS Load Balancer Controller or AWS Cloud Controller Manager created any resources inside the VPC (NLBs, ALBs, target groups, security groups), you must delete them before running terraform destroy. These resources are not managed by Terraform, and AWS will block VPC, IGW, and public subnet deletion with DependencyViolation errors.

Check for orphaned resources:

# List load balancers in the VPC
aws elbv2 describe-load-balancers --region af-south-1 \
  --query "LoadBalancers[?VpcId=='$(terraform output -raw vpc_id)'].{ARN:LoadBalancerArn,Name:LoadBalancerName,Type:Type}" \
  --output table

# List target groups in the VPC
aws elbv2 describe-target-groups --region af-south-1 \
  --query "TargetGroups[?VpcId=='$(terraform output -raw vpc_id)'].{ARN:TargetGroupArn,Name:TargetGroupName}" \
  --output table

# List security groups in the VPC (exclude the default SG)
aws ec2 describe-security-groups --region af-south-1 \
  --filters "Name=vpc-id,Values=$(terraform output -raw vpc_id)" \
  --query "SecurityGroups[?GroupName!='default'].{ID:GroupId,Name:GroupName,Desc:Description}" \
  --output table

Delete orphaned load balancers and target groups:

# Delete each K8s-created load balancer (get ARNs from the table above)
aws elbv2 delete-load-balancer --region af-south-1 \
  --load-balancer-arn <LB_ARN>

# Wait for the LB to finish draining
aws elbv2 wait load-balancers-deleted --region af-south-1 \
  --load-balancer-arns <LB_ARN>

# Delete each orphaned target group
aws elbv2 delete-target-group --region af-south-1 \
  --target-group-arn <TG_ARN>

# Delete each orphaned security group
aws ec2 delete-security-group --region af-south-1 \
  --group-id <SG_ID>

Once all Kubernetes-created resources are cleaned up, destroy the Terraform-managed infrastructure:

cd terraform/cluster/aws
terraform destroy -var-file=../envs/aws.tfvars

Bare Metal

Bare metal compute provisioning boots physical servers with Talos Linux via PXE network boot, an ISO image, or USB boot media. There is no VM template or AMI; Talos is installed directly onto each server's disk.

Step 0: BIOS & IPMI Configuration

Before booting Talos, configure each server's BIOS and IPMI/BMC settings.

Configure IPMI/BMC

  1. Connect to each server's IPMI interface (iDRAC, iLO, or similar) via the IPMI network
  2. Assign static IP addresses from the server inventory:

    Server IPMI IP
    rciis-cp-01 192.168.10.31
    rciis-cp-02 192.168.10.32
    rciis-cp-03 192.168.10.33
    rciis-wn-01 192.168.10.34
    rciis-wn-02 192.168.10.35
    rciis-wn-03 192.168.10.36
  3. Set IPMI admin credentials and record them securely

  4. Verify IPMI web console is accessible from the management workstation

Configure BIOS Settings

Enter each server's BIOS setup and apply the following:

Setting Value Reason
Virtualisation (VT-x / VT-d) Enabled Required for Kubernetes workloads
Boot mode UEFI Talos requires UEFI boot
Secure Boot Disabled Talos signs its own boot chain
Boot order PXE first (or USB/ISO) Matches your chosen boot method
Wake-on-LAN Enabled (optional) Allows remote power-on via IPMI

Secure Boot

Standard Talos boot media is not signed for the platform's Secure Boot chain, so leaving Secure Boot enabled will prevent Talos from booting.

Boot Methods

Choose a boot method based on your environment:

Method Best For Requirements
PXE boot Deploying many servers DHCP + TFTP infrastructure, IPMI access
ISO boot Small deployments, testing ISO image, physical/IPMI console access
USB boot Air-gapped environments USB drive, physical access

Step 1: Download Talos Boot Media

Download from the Talos Image Factory:

Vanilla image (no extensions):

# ISO
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/metal-amd64.iso

# PXE kernel + initramfs
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/kernel-amd64
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/initramfs-amd64.xz

Custom image with extensions (e.g., ZFS, iSCSI):

# Submit schematic to get ID
SCHEMATIC_ID=$(curl -X POST --data-binary @schematic.yaml \
  https://factory.talos.dev/schematics | jq -r '.id')

# Download custom ISO
curl -LO "https://factory.talos.dev/image/${SCHEMATIC_ID}/v1.12.0/metal-amd64.iso"

Step 2: Boot via PXE

If using PXE, configure your DHCP/TFTP server to serve the Talos kernel and initramfs. Example dnsmasq configuration:

# /etc/dnsmasq.d/talos-pxe.conf
enable-tftp
tftp-root=/srv/tftp

# BIOS boot
dhcp-boot=pxelinux.0

# UEFI boot
dhcp-match=set:efi-x86_64,option:client-arch,7
dhcp-boot=tag:efi-x86_64,ipxe.efi

Place the Talos kernel and initramfs in /srv/tftp/ and create a PXE boot menu entry that boots into the Talos installer.
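For example, a minimal pxelinux menu entry might look like the following. The file names and paths here are assumptions based on the downloads above; adjust them to match your TFTP layout:

```
# /srv/tftp/pxelinux.cfg/default — illustrative sketch
DEFAULT talos
LABEL talos
  KERNEL kernel-amd64
  APPEND initrd=initramfs-amd64.xz talos.platform=metal console=tty0
```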

Alternatively, use iPXE with a script:

#!ipxe
kernel https://factory.talos.dev/image/<schematic-id>/v1.12.0/kernel-amd64 talos.platform=metal
initrd https://factory.talos.dev/image/<schematic-id>/v1.12.0/initramfs-amd64.xz
boot

Step 3: Boot via ISO or USB

For ISO-based installs:

  1. Write the ISO to a USB drive:
    sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress
    
  2. Or mount via IPMI virtual media (iDRAC, iLO, or similar)
  3. Boot the server from the ISO/USB
  4. Talos boots into maintenance mode, waiting for a machine configuration

Step 4: Verify Connectivity

After all servers are booted into Talos maintenance mode, verify network connectivity from your management workstation.

Ping all node IPs:

for ip in 192.168.30.{31..36}; do
  ping -c 1 -W 2 "$ip" && echo "$ip OK" || echo "$ip UNREACHABLE"
done

Verify IPMI access:

for ip in 192.168.10.{31..36}; do
  curl -sk --connect-timeout 3 "https://$ip" -o /dev/null && echo "$ip IPMI OK" || echo "$ip IPMI UNREACHABLE"
done

Check NIC link status (from the Talos maintenance shell via talosctl):

# For each node
talosctl -n 192.168.30.31 get links --insecure

Verify default gateway reachability:

talosctl -n 192.168.30.31 get routes --insecure

Connectivity checklist:

  • [ ] All node IPs respond to ping from management workstation
  • [ ] All IPMI web consoles are accessible
  • [ ] NIC link status shows UP for all connected interfaces
  • [ ] Nodes can reach the default gateway (192.168.30.1)
  • [ ] DNS resolution works (if DNS is configured at this stage)

Tip

If a node is unreachable, check the physical cabling and switch port configuration in Set Up Network Fabric before proceeding.

What Gets Created

After booting, each server runs Talos in maintenance mode:

  • Talos is loaded into RAM from the boot media
  • No configuration is applied yet — the node waits on port 50000 for talosctl apply-config
  • The machine configuration (applied in 4.2 Boot & Install) determines the install disk, network settings, and role
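Since each node listens on TCP port 50000 while in maintenance mode, you can also probe that port directly before reaching for talosctl. A bash-only sketch using the /dev/tcp pseudo-device (unreachable nodes simply report closed):

```shell
# Probe the Talos API port on each node; ~2 s timeout per host
for ip in 192.168.30.{31..36}; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$ip/50000" 2>/dev/null; then
    echo "$ip:50000 open"
  else
    echo "$ip:50000 closed"
  fi
done
```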

Hardware specifications:

Role Qty CPU RAM OS Disk Data Disk NICs
Control plane 3 8 cores 32 GB 2 x 512 GB SAS SSD (RAID 1) — 2 x 25 GbE + 1 x MGMT LAN (OOB) + 1 x IPMI
Worker 3 16 cores 128 GB 2 x 512 GB SAS SSD (RAID 1) 9 TB SAS HDD + SSD cache 2 x 25 GbE + 1 x MGMT LAN (OOB) + 1 x IPMI

Network interfaces

Each node has four network connections:

  • 2 x 25 GbE — Production traffic (bonded or active/passive)
  • 1 x MGMT LAN — Out-of-band management network
  • 1 x IPMI — Baseboard management controller (iDRAC, iLO, etc.)

Server Inventory

Document your servers before proceeding:

Hostname Role IP Address MGMT IP IPMI IP Install Disk
rciis-cp-01 control-plane 192.168.30.31 192.168.20.31 192.168.10.31 /dev/sda
rciis-cp-02 control-plane 192.168.30.32 192.168.20.32 192.168.10.32 /dev/sda
rciis-cp-03 control-plane 192.168.30.33 192.168.20.33 192.168.10.33 /dev/sda
rciis-wn-01 worker 192.168.30.34 192.168.20.34 192.168.10.34 /dev/sda
rciis-wn-02 worker 192.168.30.35 192.168.20.35 192.168.10.35 /dev/sda
rciis-wn-03 worker 192.168.30.36 192.168.20.36 192.168.10.36 /dev/sda

Proxmox

Proxmox compute provisioning uses Terraform to clone VMs from a pre-built Talos template. The Terraform project at terraform/cluster/ handles VM creation, network configuration via cloud-init, and Talos bootstrapping in a single terraform apply.

Prerequisites: Create the Talos VM Template

Before Terraform can clone VMs, you need a Talos VM template on Proxmox. This is a one-time setup.

Step 1: Download the Talos Image

# On the Proxmox node
cd /tmp
curl -LO https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.12.0/nocloud-amd64.raw.xz
xz -d nocloud-amd64.raw.xz

Note

Use nocloud-amd64.raw.xz (not the AWS or metal image). The nocloud platform supports cloud-init, which is how Proxmox passes network configuration to the VM.

Step 2: Create the Template VM

# Create a VM (use the template ID from your tfvars, e.g., 9000)
qm create 9000 --name "talos-v1.12.0-template" \
  --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --bios ovmf

# Import the disk image
qm importdisk 9000 /tmp/nocloud-amd64.raw local-lvm

# Attach the imported disk
qm set 9000 --scsi0 local-lvm:vm-9000-disk-0 \
  --scsihw virtio-scsi-pci --boot order=scsi0

# Add a cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Enable QEMU guest agent
qm set 9000 --agent enabled=1

# Convert to template
qm template 9000

Warning

The template_vm_id in your .tfvars file must match the ID used here (e.g., 9000). Terraform clones this template for every node.

Deploy VMs with Terraform

With the template ready, deploy the full cluster:

cd terraform/cluster

# Initialise providers
terraform init

# Set sensitive variables
export TF_VAR_proxmox_api_token="root@pam!IaC=<your-secret>"

# Preview the deployment
terraform plan -var-file=envs/proxmox.tfvars

# Deploy (creates VMs, applies Talos config, bootstraps cluster)
terraform apply -var-file=envs/proxmox.tfvars

What Gets Created

Terraform performs the following in order:

  1. Clones VMs from the template (one per CP + one per worker)
  2. Configures networking via cloud-init (static IPs, gateway, DNS)
  3. Generates Talos machine secrets (certificates, tokens)
  4. Applies Talos machine configuration to each node via talos_machine_configuration_apply
  5. Bootstraps the cluster via talos_machine_bootstrap on the first CP node
  6. Retrieves kubeconfig and talosconfig for cluster access

Control Plane VMs (Proxmox environment — 3 nodes):

Property Value
VM IDs 601, 602, 603
Names rciis-proxmox-cp-01, -02, -03
CPU 2 cores (host passthrough)
RAM 2048 MB (ballooning disabled)
OS disk 50 GB on local-lvm (scsi0)
Network vmbr0, static IPs 192.168.30.31-33/24
VIP 192.168.30.30 (kube-vip on eth0)
Tags talos, kubernetes, control-plane, rciis, proxmox

Worker VMs (Proxmox environment — 3 nodes):

Property Value
VM IDs 604, 605, 606
Names rciis-proxmox-wn-01, -02, -03
CPU 4 cores (host passthrough)
RAM 4096 MB (ballooning disabled)
OS disk 50 GB on local-lvm (scsi0)
Data disk 50 GB on local-lvm (scsi1)
Network vmbr0, static IPs 192.168.30.34-36/24
Tags talos, kubernetes, worker, rciis, proxmox

Teardown

terraform destroy -var-file=envs/proxmox.tfvars

Warning

This destroys all VMs and their disks. Machine secrets in the Terraform state will also be lost. Back up terraform output -raw machine_secrets before destroying.