3.2 Set Up Network Fabric¶
The network fabric provides connectivity between all Kubernetes nodes, outbound internet access for image pulls, and a stable endpoint for the Kubernetes API. The requirements are:
- All nodes must be able to reach each other on the required ports (see Firewall Rules)
- Control plane nodes need a stable endpoint (VIP or load balancer) for the Kubernetes API
- All nodes need outbound internet access (directly or via NAT/proxy)
- DNS resolution must work for both internal and external domains
The network module (terraform/modules/aws/network) creates the full VPC networking layer for the cluster. You do not run this module independently -- it is composed into terraform/cluster/aws/main.tf and deployed as part of the full terraform apply. This section explains what the module creates, how it works, and where to customise it.
Architecture¶
Internet
│
┌───┴───┐
│ IGW │
└───┬───┘
│
┌───┴──────────────────────────────────┐
│ Public Subnets (one per AZ) │
│ - NAT Gateways │
│ - Network Load Balancer │
│ - map_public_ip_on_launch = true │
└───┬──────────────────────────────────┘
│ (NAT)
┌───┴──────────────────────────────────┐
│ Private Subnets (one per AZ) │
│ - Control Plane EC2 instances │
│ - Worker EC2 instances │
│ - map_public_ip_on_launch = false │
└──────────────────────────────────────┘
Step 1: Configure Network Variables¶
Open terraform/cluster/envs/aws.tfvars and set the network parameters.
VPC CIDR¶
The vpc_cidr defines the overall address space for the VPC. All subnets must fall within this range:
variable "vpc_cidr" {
description = "VPC CIDR block"
type = string
default = "10.0.0.0/16"
}
Change this if 10.0.0.0/16 conflicts with your existing networks. The demo environment uses 10.2.0.0/16, as shown in the single-AZ example below.
Availability Zones and Subnets¶
You must define one public and one private subnet per availability zone. The three lists must have the same length.
Single AZ (demo / cost saving):
availability_zones = ["af-south-1a"]
public_subnet_cidrs = ["10.2.1.0/24"]
private_subnet_cidrs = ["10.2.11.0/24"]
Multi-AZ (production / HA):
availability_zones = ["af-south-1a", "af-south-1b", "af-south-1c"]
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnet_cidrs = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
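Before running Terraform, it can be worth sanity-checking a planned layout. The sketch below (using Python's standard ipaddress module, with values mirroring the multi-AZ example above) verifies that the three lists align and that every subnet nests inside the VPC CIDR without overlaps:

```python
import ipaddress

# Values from the multi-AZ example above; substitute your own plan.
vpc_cidr = "10.0.0.0/16"
availability_zones = ["af-south-1a", "af-south-1b", "af-south-1c"]
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnet_cidrs = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]

vpc = ipaddress.ip_network(vpc_cidr)

# One public + one private subnet per AZ: the three lists must align.
assert len(availability_zones) == len(public_subnet_cidrs) == len(private_subnet_cidrs)

# Every subnet must fall within the VPC CIDR, and no two subnets may overlap.
subnets = [ipaddress.ip_network(c) for c in public_subnet_cidrs + private_subnet_cidrs]
for s in subnets:
    assert s.subnet_of(vpc), f"{s} is outside {vpc}"
for i, a in enumerate(subnets):
    for b in subnets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"
print("subnet layout OK")
```

If any assertion fires, fix the tfvars before terraform plan -- Terraform would otherwise fail later, at apply time, with a less obvious AWS API error.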
NAT Gateway Strategy¶
NAT gateways give private subnet nodes outbound internet access. Choose a strategy:
enable_nat_gateway = true
single_nat_gateway = true # false = one per AZ (HA)
| Setting | Behaviour | Cost |
|---|---|---|
| single_nat_gateway = true | One NAT gateway shared across all AZs | ~$41.61/mo |
| single_nat_gateway = false | One NAT gateway per AZ (HA -- survives AZ failure) | ~$41.61/mo per AZ |
Step 2: Understand the Module¶
The module is at terraform/modules/aws/network/. Here is what it creates.
VPC¶
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(var.tags, {
Name = "${var.environment}-talos-vpc"
})
}
- cidr_block -- reads from var.vpc_cidr
- enable_dns_hostnames / enable_dns_support -- required for internal DNS resolution within the VPC
- The VPC is tagged with kubernetes.io/cluster/<name> = "owned" when cluster_name is set, enabling AWS CCM and LB Controller resource discovery
Internet Gateway¶
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}
In Terraform the IGW is directly attached to the VPC via vpc_id.
Public Subnets¶
One public subnet is created per availability zone. These host the NAT gateways and NLB:
resource "aws_subnet" "public" {
count = length(var.public_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = var.public_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index % length(var.availability_zones)]
map_public_ip_on_launch = true
tags = merge(var.tags, {
Name = "${var.environment}-talos-public-${var.availability_zones[count.index]}"
Type = "public"
"kubernetes.io/role/elb" = "1"
})
}
Key points:
- The count meta-argument iterates over var.public_subnet_cidrs, creating one subnet per entry
- count.index selects the matching AZ from var.availability_zones using modulo
- map_public_ip_on_launch = true -- instances in public subnets get public IPs
- The kubernetes.io/role/elb tag tells the AWS LB Controller which subnets to use for public load balancers
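The count.index-to-AZ mapping can be made concrete with a short sketch (illustrative Python, mirroring the HCL expression `var.availability_zones[count.index % length(var.availability_zones)]`):

```python
# Reproduce Terraform's modulo-based AZ selection for the public subnets.
availability_zones = ["af-south-1a", "af-south-1b", "af-south-1c"]
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

placement = {
    cidr: availability_zones[i % len(availability_zones)]
    for i, cidr in enumerate(public_subnet_cidrs)
}
print(placement)
```

When the lists have equal length the modulo is a no-op; it only matters if you supply more subnet CIDRs than AZs, in which case placement wraps around to the first AZ again.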
Private Subnets¶
Private subnets host the Kubernetes nodes. They follow the same pattern but with map_public_ip_on_launch = false:
resource "aws_subnet" "private" {
count = length(var.private_subnet_cidrs)
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index % length(var.availability_zones)]
tags = merge(var.tags, {
Name = "${var.environment}-talos-private-${var.availability_zones[count.index]}"
Type = "private"
"kubernetes.io/role/internal-elb" = "1"
})
}
- The kubernetes.io/role/internal-elb tag tells the AWS LB Controller which subnets to use for internal load balancers
Elastic IPs and NAT Gateways¶
The number of Elastic IPs and NAT gateways depends on the single_nat_gateway setting:
resource "aws_eip" "nat" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.availability_zones)) : 0
domain = "vpc"
depends_on = [aws_internet_gateway.main]
}
resource "aws_nat_gateway" "main" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.availability_zones)) : 0
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
depends_on = [aws_internet_gateway.main]
}
Each NAT gateway is placed in its corresponding public subnet and receives a dedicated Elastic IP.
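The nested conditional in the count expression is easier to read unrolled. A small Python sketch of the same logic:

```python
def nat_gateway_count(enable_nat_gateway: bool, single_nat_gateway: bool, azs: list) -> int:
    """Mirrors: var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.availability_zones)) : 0"""
    if not enable_nat_gateway:
        return 0
    return 1 if single_nat_gateway else len(azs)

azs = ["af-south-1a", "af-south-1b", "af-south-1c"]
print(nat_gateway_count(True, True, azs))   # 1 -- one shared NAT gateway
print(nat_gateway_count(True, False, azs))  # 3 -- one NAT gateway per AZ
print(nat_gateway_count(False, True, azs))  # 0 -- no NAT at all
```

The same expression drives both aws_eip.nat and aws_nat_gateway.main, so Elastic IPs and NAT gateways are always created in matched pairs.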
Route Tables¶
Two types of route tables are created:
Public route table -- routes internet traffic through the IGW:
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
}
Private route table(s) -- routes internet traffic through the NAT gateway. When single_nat_gateway = true, one route table is shared; otherwise, one per AZ:
resource "aws_route_table" "private" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.availability_zones)) : 1
vpc_id = aws_vpc.main.id
dynamic "route" {
for_each = var.enable_nat_gateway ? [1] : []
content {
cidr_block = "0.0.0.0/0"
nat_gateway_id = var.single_nat_gateway ? aws_nat_gateway.main[0].id : aws_nat_gateway.main[count.index].id
}
}
}
Security Groups¶
The network module also creates security groups in security_groups.tf. Two security groups are created -- one for control plane nodes and one for workers -- with rules for:
| Rule | Port(s) | Source | Notes |
|---|---|---|---|
| Kubernetes API | 6443 | VPC CIDR | Control plane SG |
| Talos API | 50000 | VPC CIDR | Both SGs |
| etcd peer | 2379-2380 | CP SG self | Control plane SG |
| Kubelet API | 10250 | CP SG, self, VPC | Both SGs |
| Cilium GENEVE | 6081/udp | CP + Worker SGs | Conditional on cni_type = "cilium" |
| Cilium health | 4240 | CP + Worker SGs | Conditional on cni_type = "cilium" |
| NodePort range | 30000-32767 | Worker SG, VPC | Conditional on enable_nodeport |
| ICMP | all | VPC CIDR | Both SGs |
| All egress | all | 0.0.0.0/0 | Both SGs |
The root module (terraform/cluster/aws/main.tf) adds additional security group rules for:
- Admin access: K8s API (6443) and Talos API (50000) from allowed_admin_cidrs
- Cilium ClusterMesh: ports 2379, 4240, 4244, 8472/udp, ICMP from clustermesh_peer_cidrs
Step 3: Module Outputs¶
The network module exports references that other modules consume. Terraform resolves these automatically via module references in main.tf:
| Output | Consumed By |
|---|---|
| vpc_id | compute, loadbalancer modules |
| vpc_cidr | Security group rules (VPC-wide CIDR rules) |
| public_subnet_ids | loadbalancer module (NLB placement) |
| private_subnet_ids | compute module (EC2 instance placement) |
| control_plane_security_group_id | compute module, admin access rules in root module |
| worker_security_group_id | compute module, admin access rules in root module |
| nat_gateway_public_ips | Cluster output (for firewall whitelisting) |
Step 4: Deploy¶
The network module is deployed as part of the full infrastructure:
cd terraform/cluster/aws
terraform init
terraform plan -var-file=../envs/aws.tfvars
terraform apply -var-file=../envs/aws.tfvars
Terraform resolves all cross-module references and deploys resources in dependency order. Network resources (VPC, subnets, IGW, NAT, routes, security groups) are created before compute and load balancer resources that depend on them.
Customisation Summary¶
| What to Change | Where | Variable |
|---|---|---|
| VPC address space | aws.tfvars | vpc_cidr |
| Number of AZs | aws.tfvars | availability_zones, public_subnet_cidrs, private_subnet_cidrs |
| NAT gateway strategy | aws.tfvars | single_nat_gateway |
| Disable NAT entirely | aws.tfvars | enable_nat_gateway = false |
| CNI type (Cilium/Calico) | aws.tfvars | cni_type |
| NodePort access | aws.tfvars | enable_nodeport |
| Admin IP restrictions | aws.tfvars | allowed_admin_cidrs |
| ClusterMesh peer CIDRs | aws.tfvars | clustermesh_peer_cidrs |
Bare Metal¶
The network fabric for bare metal is your physical and logical network infrastructure: switches, VLANs, routers, and cabling.
Step 0: Rack Servers & Cable Network Interfaces¶
Before configuring switches and VLANs, physically install and cable all servers.
Rack installation checklist:
- [ ] Mount servers in assigned rack units (document U positions)
- [ ] Label each server on the front and rear panels with its hostname
- [ ] Connect redundant power supplies to separate PDUs (A + B feeds)
- [ ] Verify power LEDs on all servers before cabling network
Cable the 4 NICs per node:
| NIC | Speed | Connect To | Port Type | Purpose |
|---|---|---|---|---|
| NIC 1 (eth0) | 25 GbE | Core switch | VLAN trunk (VLAN 10 + 30) | Production traffic (primary) |
| NIC 2 (eth1) | 25 GbE | Core switch | VLAN trunk (VLAN 10 + 30) | Production traffic (secondary / bond) |
| NIC 3 (MGMT) | 1 GbE | OOB management switch | Access | Out-of-band management |
| NIC 4 (IPMI) | 1 GbE | IPMI / BMC network | Access | Baseboard management controller |
Verify before proceeding
Check that link lights are active on all ports before moving on to switch configuration. A missing link light now saves hours of debugging later.
Refer to the server inventory table in Provision Compute for hostname-to-IP mappings.
Network Architecture¶
Internet
│
┌───┴────────────────────────────┐
│ Router / Firewall │
│ (NAT, firewall rules) │
└───┬────────────────────────────┘
│
┌───┴────────────────────────────┐
│ Core Switch │
│ ├── VLAN 10: Management │
│ │ (IPMI/BMC interfaces) │
│ ├── VLAN 30: Kubernetes │
│ │ (Talos node interfaces) │
│ └── VLAN 1: Default │
│ (Admin workstations) │
└────────────────────────────────┘
VLAN Planning¶
| VLAN | Subnet | Purpose | Nodes |
|---|---|---|---|
| 10 | 192.168.10.0/24 | Management (IPMI/BMC) | Server BMCs |
| 30 | 192.168.30.0/24 | Kubernetes (data plane) | All Talos nodes |
| 1 | 192.168.1.0/24 | Default (admin access) | Workstations, PXE server |
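A quick check (illustrative Python, using the subnets from the plan above) confirms the VLAN subnets are disjoint, which is required for inter-VLAN routing at the gateway to work cleanly:

```python
import ipaddress

# VLAN subnets from the planning table above; adjust to your own addressing.
vlans = {
    10: ipaddress.ip_network("192.168.10.0/24"),  # Management (IPMI/BMC)
    30: ipaddress.ip_network("192.168.30.0/24"),  # Kubernetes data plane
    1:  ipaddress.ip_network("192.168.1.0/24"),   # Default / admin access
}
nets = list(vlans.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"
print("VLAN subnets are disjoint")
```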
Switch Configuration¶
- Create VLANs on your managed switch
- Configure trunk ports between switches to carry all VLANs
- Configure access ports for each server NIC:
- BMC/IPMI NIC → VLAN 10 (management)
- Data NIC → VLAN 30 (Kubernetes)
- Configure the gateway (router/firewall) with interfaces on each VLAN
IP Addressing¶
Assign static IPs to all Talos nodes. These will be configured in the Talos machine config:
| Hostname | IP Address | Gateway | Role |
|---|---|---|---|
| rciis-cp-01 | 192.168.30.31/24 | 192.168.30.1 | Control plane |
| rciis-cp-02 | 192.168.30.32/24 | 192.168.30.1 | Control plane |
| rciis-cp-03 | 192.168.30.33/24 | 192.168.30.1 | Control plane |
| rciis-wn-01 | 192.168.30.34/24 | 192.168.30.1 | Worker |
| rciis-wn-02 | 192.168.30.35/24 | 192.168.30.1 | Worker |
| rciis-wn-03 | 192.168.30.36/24 | 192.168.30.1 | Worker |
| VIP | 192.168.30.30 | — | Kubernetes API (kube-vip or HAProxy) |
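Before writing these addresses into the Talos machine configs, a short sketch can validate the plan (illustrative Python; the hostnames and addresses mirror the table above):

```python
import ipaddress

subnet = ipaddress.ip_network("192.168.30.0/24")
gateway = ipaddress.ip_address("192.168.30.1")
vip = ipaddress.ip_address("192.168.30.30")
node_ips = {
    "rciis-cp-01": "192.168.30.31", "rciis-cp-02": "192.168.30.32",
    "rciis-cp-03": "192.168.30.33", "rciis-wn-01": "192.168.30.34",
    "rciis-wn-02": "192.168.30.35", "rciis-wn-03": "192.168.30.36",
}
addrs = [ipaddress.ip_address(ip) for ip in node_ips.values()]

# No duplicate assignments, and no node may claim the gateway or the VIP.
assert len(set(addrs)) == len(addrs), "duplicate node IP"
for a in addrs:
    assert a in subnet and a != gateway and a != vip
print("static IP plan OK")
```

Keeping the VIP (192.168.30.30) outside the node range makes collisions impossible even if nodes are later re-addressed sequentially.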
Outbound Access¶
Talos nodes need outbound access to:
- ghcr.io, docker.io, quay.io — container images
- factory.talos.dev — Talos installer images
- NTP servers — time synchronisation
Configure NAT or a proxy on your gateway for the Kubernetes VLAN.
Proxmox¶
Proxmox networking for the Talos cluster uses Proxmox's Linux bridge (vmbr0) and, optionally, VLANs. The Terraform project configures each VM's network via cloud-init.
Network Architecture¶
Internet / Upstream Router
│
┌───┴────────────────────────────┐
│ Proxmox Host │
│ ├── vmbr0 (Linux bridge) │
│ │ └── Physical NIC (eno1) │
│ │ │
│ ├── CP VMs (192.168.30.31-33) │
│ ├── Worker VMs (.34-.36) │
│ └── VIP: 192.168.30.30 │
└────────────────────────────────┘
Bridge Configuration¶
The default Proxmox bridge vmbr0 is used. Verify it exists by inspecting /etc/network/interfaces on your Proxmox node; you should see something like:
auto vmbr0
iface vmbr0 inet static
address 192.168.30.225/24
gateway 192.168.30.1
bridge-ports eno1
bridge-stp off
bridge-fd 0
Terraform Network Configuration¶
The VM module configures networking via these variables in your .tfvars:
# Network bridge on Proxmox
network_bridge = "vmbr0"
# Optional VLAN tag (null = untagged)
# network_vlan_id = 30
# Gateway for all nodes
ipv4_gateway = "192.168.30.1"
# DNS resolvers
dns_servers = ["1.1.1.1", "192.168.10.17"]
# Static IPs (CIDR notation)
control_plane_ips = [
"192.168.30.31/24",
"192.168.30.32/24",
"192.168.30.33/24"
]
worker_ips = [
"192.168.30.34/24",
"192.168.30.35/24",
"192.168.30.36/24"
]
Cloud-init passes these IP addresses to each VM at creation time.
VLAN Support¶
To place VMs on a specific VLAN, set network_vlan_id in your .tfvars:
network_vlan_id = 30
The Proxmox bridge must be VLAN-aware for this to work. Enable it in /etc/network/interfaces:
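As a sketch (your addresses and NIC name will differ), a VLAN-aware vmbr0 stanza extends the bridge configuration shown earlier with the bridge-vlan-aware and bridge-vids options:

```
auto vmbr0
iface vmbr0 inet static
    address 192.168.30.225/24
    gateway 192.168.30.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

Apply the change with ifreload -a (or reboot the node), after which tagged VM traffic on network_vlan_id will be carried by the bridge.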