4.4 Verify the Cluster

After bootstrapping in 4.3, verify that all components are healthy before deploying workloads.


Quick Health Check

Run the health check against a control plane node:

talosctl -n 192.168.200.11 health --server=false
kubectl get nodes

AWS Clusters

For AWS-based clusters, use the control plane node's private IP:

talosctl -n <cp-private-ip> health

All nodes should appear in Ready status.


Node Readiness

Confirm every node has joined the cluster and is reporting Ready:

kubectl get nodes -o wide

If a node is stuck in NotReady, inspect its conditions:

kubectl describe node <node-name>
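
To scan a larger cluster quickly, the node list can be filtered for anything that is not Ready. The snippet below is a sketch that runs against captured sample output (node names are illustrative); on a live cluster, pipe `kubectl get nodes` into the same awk filter instead.

```shell
# Print the names of nodes whose STATUS column is not "Ready".
# Sample `kubectl get nodes` output is inlined here (names are illustrative);
# live equivalent: kubectl get nodes | awk 'NR > 1 && $2 != "Ready" { print $1 }'
sample='NAME       STATUS     ROLES           AGE   VERSION
cp-1       Ready      control-plane   12m   v1.30.0
worker-1   Ready      <none>          11m   v1.30.0
worker-2   NotReady   <none>          11m   v1.30.0'

# NR > 1 skips the header row; $2 is the STATUS column.
printf '%s\n' "$sample" | awk 'NR > 1 && $2 != "Ready" { print $1 }'
```

With the sample above, only worker-2 is printed; an empty result means every node is Ready.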

Etcd Health

Check that the etcd cluster is healthy and all members are connected:

talosctl -n 192.168.200.11 etcd members
talosctl -n 192.168.200.11 etcd status

All control plane nodes should appear as etcd members with no alarms.


Pod Status

Verify that all system pods are running:

kubectl get pods -A

Key namespaces to check:

  • kube-system — core Kubernetes components (API server, scheduler, controller-manager, kube-proxy, CoreDNS)
  • tigera-operator / calico-system — if using Calico CNI
  • kube-system (cilium pods) — if using Cilium CNI

All pods should be in Running or Completed status, with restart counts that are low and not climbing.
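
Rather than eyeballing the full pod list, a small filter can surface only the problem pods. This sketch runs against inlined sample output (pod names are illustrative); on a live cluster, pipe `kubectl get pods -A` into the same awk filter.

```shell
# Print namespace/name of any pod whose STATUS is neither Running nor Completed.
# Sample `kubectl get pods -A` output is inlined (names are illustrative);
# live equivalent: kubectl get pods -A | awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
sample='NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE
kube-system   coredns-5d78c9869d-abcde   1/1     Running            0          10m
kube-system   kube-proxy-xyz12           1/1     Running            0          10m
kube-system   cilium-9k2lp               0/1     CrashLoopBackOff   4          10m'

# NR > 1 skips the header row; $4 is the STATUS column in `-A` output.
printf '%s\n' "$sample" | awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
```

With the sample above, only kube-system/cilium-9k2lp is printed; an empty result means every pod is healthy.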


Troubleshooting

kubectl logs / exec Returns Authorization Error

Symptom: Commands such as kubectl logs or kubectl exec fail with:

Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)

Cause: Fresh Talos clusters may be missing the system:kubelet-api-admin ClusterRoleBinding that grants the API server's kubelet client certificate permission to proxy to nodes.

Fix: Create the missing ClusterRoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kubelet-api-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kubelet-api-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: apiserver-kubelet-client

Save the manifest as kubelet-api-admin-crb.yaml, then apply it:

kubectl apply -f kubelet-api-admin-crb.yaml

Warning

Without this binding, any operation that requires the API server to reach the kubelet (logs, exec, port-forward, metrics) will fail.


DNS Resolution Fails From Pods After Node Restart

Symptom: Pods cannot resolve service DNS names (temporary failure in name resolution). CoreDNS pods are running and the kube-dns service IP appears correctly in pod /etc/resolv.conf.

Cause: Cilium's eBPF DNS proxy state becomes stale after node instance type changes or restarts. The cached eBPF maps no longer match the new network topology.

Fix: Restart the Cilium daemonset to rebuild eBPF maps:

kubectl rollout restart daemonset cilium -n kube-system

Tip

This rebuilds eBPF maps without disrupting running pods. Existing connections are preserved; only DNS proxy state is reset.


KafkaTopic Replication Factor Error

Symptom: A KafkaTopic custom resource reports an error and the Strimzi entity-operator fails to start or continuously restarts.

Cause: The topic's spec.replicas value exceeds the number of available Kafka brokers. Kafka cannot satisfy a replication factor higher than its broker count.

Fix: Set spec.replicas to be less than or equal to the number of brokers. For single-broker development or testing clusters:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic
  labels:
    strimzi.io/cluster: my-cluster  # must match your Kafka resource's name
spec:
  replicas: 1
  partitions: 3

Warning

A replication factor of 1 means no redundancy. Use this only in non-production environments. Production clusters should have at least 3 brokers with replicas: 3.
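
For production, a topic manifest with full redundancy might look like the following sketch. It assumes a cluster with at least 3 brokers; the topic name, cluster label, and partition count are illustrative.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic
  labels:
    strimzi.io/cluster: my-cluster  # must match your Kafka resource's name
spec:
  replicas: 3        # one copy per broker; survives broker loss
  partitions: 3
  config:
    min.insync.replicas: 2  # tolerate one broker outage without refusing acks=all writes
```

Pairing replicas: 3 with min.insync.replicas: 2 lets producers using acks=all keep writing while a single broker is down.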


Next Steps

Once all checks pass, proceed to deploying infrastructure components and workloads via FluxCD.