3.3 Set Up Load Balancing¶
The Kubernetes API (port 6443) and Talos API (port 50000) must be reachable via a stable endpoint that distributes traffic across control plane nodes. This requires a Layer 4 (TCP) load balancer or virtual IP with health checks.
Requirements¶
| Listener | Backend Port | Health Check | Protocol | Backend Nodes |
|---|---|---|---|---|
| Kubernetes API | 6443 | TCP connect on 6443 | TCP | All control plane nodes |
| Talos API | 50000 | TCP connect on 50000 | TCP | All control plane nodes |
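Both listeners use plain TCP connect checks. For a quick manual probe of the same kind the load balancer performs, here is a minimal sketch in Python (host and port placeholders are assumptions, substitute your own endpoint):

```python
import socket

def tcp_healthy(host: str, port: int, timeout: float = 5.0) -> bool:
    """Perform the same kind of check the load balancer does: a plain
    TCP connect with a timeout. Returns True if the port accepts the
    connection, False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Once the endpoint is up, `tcp_healthy("<endpoint>", 6443)` and `tcp_healthy("<endpoint>", 50000)` should both return True.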
The load balancer endpoint becomes the cluster_endpoint that all tools (kubectl, talosctl, worker nodes) use to reach the control plane.
The load balancer module (terraform/modules/aws/loadbalancer) creates a Network Load Balancer (NLB) that provides access to the Kubernetes API and Talos API on the control plane nodes. This is deployed as part of terraform apply via terraform/cluster/aws/main.tf.
Step 1: Configure Load Balancer Variables¶
Open terraform/cluster/envs/aws.tfvars and set the NLB parameters:
nlb_internal = false # true = internal, false = internet-facing
enable_deletion_protection = false
enable_cross_zone_load_balancing = false # Single AZ, not needed
Internal NLB (nlb_internal = true) -- requires VPN or bastion host to reach the control plane from outside the VPC. Use this for production.
Internet-facing NLB (nlb_internal = false) -- the NLB gets a public DNS name. Combine with allowed_admin_cidrs in the security group to restrict access. Use this for demo/testing:
nlb_internal = false
enable_deletion_protection = false
enable_cross_zone_load_balancing = false
allowed_admin_cidrs = [
  "196.45.28.20/32",
]

Health Check Tuning¶
The loadbalancer module exposes these variables for tuning how quickly failed nodes are detected (defaults shown):
variable "health_check_interval" {
  description = "NLB health check interval in seconds"
  type        = number
  default     = 10
}

variable "health_check_timeout" {
  description = "NLB health check timeout in seconds"
  type        = number
  default     = 5
}

variable "healthy_threshold" {
  description = "Consecutive successful checks before marking target healthy"
  type        = number
  default     = 2
}

variable "unhealthy_threshold" {
  description = "Consecutive failed checks before marking target unhealthy"
  type        = number
  default     = 2
}
Step 2: Understand the Module¶
The module is at terraform/modules/aws/loadbalancer/. The root module (main.tf) passes in the VPC ID and public subnet IDs from the network module.
Network Load Balancer¶
resource "aws_lb" "kubernetes_api" {
  name                             = "${var.environment}-talos-api-nlb"
  internal                         = var.internal
  load_balancer_type               = "network"
  subnets                          = var.subnet_ids
  enable_deletion_protection       = var.enable_deletion_protection
  enable_cross_zone_load_balancing = var.enable_cross_zone_load_balancing
}
Key points:
- internal -- toggles between internal and internet-facing based on var.internal (mapped from nlb_internal in the root module)
- subnets -- placed in public subnets (from module.network.public_subnet_ids)
Target Groups and Listeners¶
The loadbalancer module creates two target groups and listeners:
| Listener Port | Target Port | Protocol | Target Group | Purpose |
|---|---|---|---|---|
| 6443 | 6443 | TCP | <env>-talos-api-tg | Kubernetes API |
| 50000 | 50000 | TCP | <env>-talos-apid-tg | Talos API (control plane) |
The root module adds a third listener for worker Talos API access:
| Listener Port | Target Port | Protocol | Target Group | Purpose |
|---|---|---|---|---|
| 50001 | 50000 | TCP | <env>-talos-wk-apid-tg | Talos API (workers) |
All target groups use:
- preserve_client_ip = true -- the NLB preserves the original source IP
- deregistration_delay = 30 -- allows in-flight requests to complete before removing a target
- TCP health checks on the respective service port
Target Group Attachments¶
Target group attachments are defined in the root module (terraform/cluster/aws/main.tf) as separate resources, so that EC2 instances can be replaced without affecting the NLB:
resource "aws_lb_target_group_attachment" "cp_kubernetes_api" {
  count            = var.control_plane_count
  target_group_arn = module.loadbalancer.kubernetes_api_target_group_arn
  target_id        = module.compute.control_plane_instance_ids[count.index]
  port             = 6443
}

resource "aws_lb_target_group_attachment" "cp_talos_api" {
  count            = var.control_plane_count
  target_group_arn = module.loadbalancer.talos_api_target_group_arn
  target_id        = module.compute.control_plane_instance_ids[count.index]
  port             = 50000
}
Step 3: Module Outputs¶
The loadbalancer module exports:
| Output | Description |
|---|---|
| load_balancer_dns_name | NLB DNS name for external access |
| load_balancer_arn | NLB ARN (used for additional listeners in the root module) |
| kubernetes_api_endpoint | https://<nlb-dns>:6443 |
| talos_api_endpoint | <nlb-dns>:50000 |
| kubernetes_api_target_group_arn | For target group attachments |
| talos_api_target_group_arn | For target group attachments |
These are surfaced as root module outputs and used when configuring talosctl and kubectl after deployment:
# Get the NLB DNS name
terraform output nlb_dns_name
# Configure talosctl to use the NLB endpoint
talosctl config endpoint $(terraform output -raw nlb_dns_name)
# The kubeconfig will use the NLB DNS as the server URL
# https://<nlb-dns>:6443
terraform output kubernetes_api_endpoint
Customisation Summary¶
| What to Change | Where | Variable |
|---|---|---|
| Internal vs internet-facing | aws.tfvars | nlb_internal |
| Deletion protection | aws.tfvars | enable_deletion_protection |
| Cross-zone load balancing | aws.tfvars | enable_cross_zone_load_balancing |
| Health check interval | aws.tfvars | health_check_interval |
| Health check timeout | aws.tfvars | health_check_timeout |
| Healthy/unhealthy thresholds | aws.tfvars | healthy_threshold, unhealthy_threshold |
Warning
The deregistration delay (30s) and client IP preservation are hardcoded in the loadbalancer module. To change these, edit terraform/modules/aws/loadbalancer/main.tf directly.
For bare metal, use HAProxy + Keepalived for a highly available load balancer, or a single HAProxy instance for simpler setups.
Option 1: HAProxy + Keepalived (Recommended for HA)¶
Deploy HAProxy on two dedicated servers (or VMs) with Keepalived managing a floating VIP. If one HAProxy fails, the VIP moves to the other.
HAProxy Configuration¶
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 4096

defaults
    mode tcp
    timeout connect 10s
    timeout client 30s
    timeout server 30s
    option tcp-check

frontend kubernetes_api
    bind *:6443
    default_backend kubernetes_api_backend

backend kubernetes_api_backend
    balance roundrobin
    option tcp-check
    server cp-01 192.168.30.31:6443 check
    server cp-02 192.168.30.32:6443 check
    server cp-03 192.168.30.33:6443 check

frontend talos_api
    bind *:50000
    default_backend talos_api_backend

backend talos_api_backend
    balance roundrobin
    option tcp-check
    server cp-01 192.168.30.31:50000 check
    server cp-02 192.168.30.32:50000 check
    server cp-03 192.168.30.33:50000 check
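Before (re)starting the service, it is worth validating the file; `haproxy -c` parses the configuration without starting the proxy (paths assume the standard location used above):

```shell
# Validate the configuration, then reload without dropping connections.
haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy
```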
Keepalived Configuration¶
On each HAProxy server, configure Keepalived to manage the VIP:
# /etc/keepalived/keepalived.conf (on primary)
# The vrrp_script must be declared before the vrrp_instance that tracks it.
vrrp_script chk_haproxy {
    script "pidof haproxy"
    interval 2
    weight 20   # must exceed the priority gap (100 - 90) so a failed
                # check actually demotes the master
}

vrrp_instance VI_1 {
    state MASTER            # BACKUP on secondary
    interface eth0
    virtual_router_id 51
    priority 100            # 90 on secondary
    advert_int 1
    virtual_ipaddress {
        192.168.30.30/24
    }
    track_script {
        chk_haproxy
    }
}
The VIP (192.168.30.30) becomes your cluster endpoint:
talosctl config endpoint 192.168.30.30
kubectl config set-cluster rciis --server=https://192.168.30.30:6443
Option 2: Single HAProxy (Non-HA)¶
For smaller environments, a single HAProxy instance works. Use the same configuration as above without Keepalived. The HAProxy server's IP becomes the cluster endpoint.
Option 3: DNS Round-Robin (Simplest)¶
For development or testing, create DNS A records pointing to all control plane IPs:
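As a sketch, using a hypothetical name k8s.example.internal and the control plane IPs from the examples above (name and TTL are placeholders):

```
k8s.example.internal.  300  IN  A  192.168.30.31
k8s.example.internal.  300  IN  A  192.168.30.32
k8s.example.internal.  300  IN  A  192.168.30.33
```

A short TTL limits how long clients keep resolving a node that has gone down.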
Warning
DNS round-robin provides no health checking. If a CP node goes down, clients may still be directed to it until DNS TTL expires.
For Proxmox deployments, Terraform only provisions VMs with static IPs via cloud-init. Talos configuration — including load balancing — is applied separately using talosctl.
The recommended approach is the Talos built-in VIP: a lightweight virtual IP that requires no external infrastructure.
Option 1: Talos Built-in VIP (Recommended)¶
Talos has native VIP support. When configured in the machine config, one control plane node holds the VIP at any time. If that node fails, another CP node takes over automatically via GARP.
Configure the VIP in your Talos machine config (applied via talosctl apply-config):
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.30.31/24 # Node's own IP
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.30.1
        vip:
          ip: 192.168.30.30 # Shared VIP
Apply to each control plane node, changing the addresses field per node:
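A sketch of the apply step, assuming the per-node config files are named controlplane-N.yaml (the file names are placeholders; node IPs match the examples above):

```shell
# Apply the per-node config; only the addresses field differs between files.
talosctl apply-config --nodes 192.168.30.31 --file controlplane-1.yaml
talosctl apply-config --nodes 192.168.30.32 --file controlplane-2.yaml
talosctl apply-config --nodes 192.168.30.33 --file controlplane-3.yaml
```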
The VIP becomes the cluster endpoint used by all clients:
# talosctl uses the VIP
talosctl config endpoint 192.168.30.30
# kubectl uses the VIP
# https://192.168.30.30:6443
Both port 6443 (Kubernetes API) and port 50000 (Talos API) are available on the VIP. Traffic is forwarded to whichever CP node currently holds the VIP.
VIP Requirements¶
- The VIP must be an unused IP on the same subnet as the control plane nodes
- The VIP must not be assigned to any other device
- ARP must not be filtered on the network (GARP is used for failover)
- All control plane nodes must be on the same Layer 2 network
Option 2: Single Control Plane (No VIP)¶
For single control plane setups, use the node's IP directly as the cluster endpoint:
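The endpoint configuration mirrors the VIP case, just with the node's own address (the IP and cluster name below reuse the examples from earlier in this section):

```shell
# Point both clients at the single control plane node directly.
talosctl config endpoint 192.168.30.31
kubectl config set-cluster rciis --server=https://192.168.30.31:6443
```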
No VIP or load balancer is needed, but there is no failover.
Option 3: External Load Balancer¶
For environments where Talos VIP is not suitable (e.g., nodes on different L2 segments), use an external load balancer such as HAProxy on the Proxmox host or a network appliance. See the Bare Metal tab for HAProxy configuration examples.