4.4 Verify the Cluster
After bootstrapping in 4.3, verify that all components are healthy before deploying workloads.
Quick Health Check
Run the health check against a control plane node:
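A minimal sketch, assuming `talosctl` is already configured with your cluster's talosconfig and `<CONTROL_PLANE_IP>` stands in for one of your control plane addresses:

```shell
# Run Talos' built-in health check against a control plane node.
# The command waits until all checks pass or the wait timeout expires.
talosctl health \
  --nodes <CONTROL_PLANE_IP> \
  --endpoints <CONTROL_PLANE_IP>
```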
AWS Clusters
For AWS-based clusters, use the control plane node's private IP:
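For example, with a hypothetical private IP of 10.0.1.10:

```shell
# Use the node's private IP, not the public one: Talos' API
# is typically only reachable inside the VPC.
talosctl health --nodes 10.0.1.10 --endpoints 10.0.1.10
```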
All nodes should appear in Ready status.
Node Readiness
Confirm every node has joined the cluster and is reporting Ready:
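The standard check (a sketch; `-o wide` also shows node IPs, OS, and kubelet versions):

```shell
# Every node should report STATUS "Ready"
kubectl get nodes -o wide
```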
If a node is stuck in NotReady, inspect its conditions:
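For example, assuming the stuck node is named `<node-name>`:

```shell
# Full node detail, including the Conditions table and recent Events
kubectl describe node <node-name>

# Or print just the condition list as JSON
kubectl get node <node-name> -o jsonpath='{.status.conditions}'
```

The `Ready` condition's `reason` and `message` fields usually point at the cause (for example, an unconfigured CNI or kubelet connectivity issues).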
Etcd Health
Check that the etcd cluster is healthy and all members are connected:
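One way to do this with `talosctl` (assuming it is configured for your cluster; the `etcd status` and `etcd alarm list` subcommands are available on recent Talos releases):

```shell
# List etcd members as seen from a control plane node
talosctl -n <CONTROL_PLANE_IP> etcd members

# Per-member status: DB size, leader, and any errors
talosctl -n <CONTROL_PLANE_IP> etcd status

# Check for active alarms (e.g. NOSPACE)
talosctl -n <CONTROL_PLANE_IP> etcd alarm list
```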
All control plane nodes should appear as etcd members with no alarms.
Pod Status
Verify that all system pods are running:
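A quick sketch of both the full listing and a filtered view of problem pods:

```shell
# All pods across all namespaces
kubectl get pods --all-namespaces

# Only pods that are NOT Running or Succeeded (i.e. potential problems)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
```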
Key namespaces to check:
- `kube-system` — core Kubernetes components (API server, scheduler, controller-manager, kube-proxy, CoreDNS)
- `tigera-operator` / `calico-system` — if using the Calico CNI
- `kube-system` (cilium pods) — if using the Cilium CNI
All pods should be in Running or Completed status with no restarts.
Troubleshooting
kubectl logs / exec Returns Authorization Error
Symptom: Commands such as `kubectl logs` or `kubectl exec` fail with a Forbidden error from the kubelet, along the lines of `Error from server (Forbidden): Forbidden (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)`.
Cause: Fresh Talos clusters may be missing the system:kubelet-api-admin ClusterRoleBinding that grants the API server's kubelet client certificate permission to proxy to nodes.
Fix: Create the missing ClusterRoleBinding:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kubelet-api-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kubelet-api-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: apiserver-kubelet-client
```
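Saved to a file (the filename below is illustrative), the manifest can be applied with:

```shell
kubectl apply -f kubelet-api-admin-binding.yaml
```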
Warning
Without this binding, any operation that requires the API server to reach the kubelet (logs, exec, port-forward, metrics) will fail.
DNS Resolution Fails From Pods After Node Restart
Symptom: Pods cannot resolve service DNS names (`temporary failure in name resolution`). CoreDNS pods are running and the kube-dns service IP appears correctly in pod `/etc/resolv.conf`.
Cause: Cilium's eBPF DNS proxy state becomes stale after node instance type changes or restarts. The cached eBPF maps no longer match the new network topology.
Fix: Restart the Cilium daemonset to rebuild eBPF maps:
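A sketch, assuming the default install layout (a DaemonSet named `cilium` in `kube-system`):

```shell
# Restart the Cilium agent pods one by one
kubectl -n kube-system rollout restart daemonset/cilium

# Wait until all Cilium pods are back up and ready
kubectl -n kube-system rollout status daemonset/cilium
```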
Tip
This rebuilds eBPF maps without disrupting running pods. Existing connections are preserved; only DNS proxy state is reset.
KafkaTopic Replication Factor Error
Symptom: A KafkaTopic custom resource reports an error and the Strimzi entity-operator fails to start or continuously restarts.
Cause: The topic's spec.replicas value exceeds the number of available Kafka brokers. Kafka cannot satisfy a replication factor higher than its broker count.
Fix: Set spec.replicas to be less than or equal to the number of brokers. For single-broker development or testing clusters:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic
spec:
  replicas: 1
  partitions: 3
```
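To confirm how many brokers are available before choosing `spec.replicas`, you can query the Kafka custom resource. The cluster name `my-cluster` and namespace `kafka` here are assumptions; on clusters using Strimzi node pools, the broker count lives on the KafkaNodePool resources instead:

```shell
kubectl -n kafka get kafka my-cluster -o jsonpath='{.spec.kafka.replicas}'
```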
Warning
A replication factor of 1 means no redundancy. Use this only in non-production environments. Production clusters should have at least 3 brokers with replicas: 3.
Next Steps
Once all checks pass, proceed to deploying infrastructure components and workloads via FluxCD.