Talos Troubleshooting Guide
This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
General Troubleshooting Methodology
1. Gather Information
# Node status and health
talosctl -n <IP> health
talosctl -n <IP> version
talosctl -n <IP> get members
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
talosctl -n <IP> processes | head -20
# Service status
talosctl -n <IP> services
2. Check Logs
# Kernel logs (system-level issues)
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs machined
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
# System events
talosctl -n <IP> events --since=1h
3. Network Connectivity
# Discovery and membership
talosctl get affiliates
talosctl get members
# Network interfaces
talosctl -n <IP> get links
talosctl -n <IP> get addresses
# Control plane connectivity
kubectl get nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
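The information-gathering steps above can be captured in one pass so that output from several nodes is easy to compare later. The following is a minimal sketch, not an official tool: the function name `collect_diagnostics`, the default output directory `./talos-diag`, and the choice of subcommands are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: save the output of selected talosctl commands
# from the steps above into one directory per node.
# Arguments: node IP, base output directory (default ./talos-diag).
collect_diagnostics() {
    node="$1"
    outdir="${2:-./talos-diag}/$node"
    mkdir -p "$outdir" || return 1

    # Each entry is a talosctl subcommand taken from the steps above.
    for cmd in "version" "services" "get members" "memory" "dmesg" "logs kubelet"; do
        # Build a file name such as "get-members.txt".
        file="$outdir/$(printf '%s' "$cmd" | tr ' ' '-').txt"
        # $cmd is intentionally unquoted so "get members" becomes two arguments.
        talosctl -n "$node" $cmd > "$file" 2>&1 || echo "command failed: $cmd" >> "$file"
    done

    # Print the directory that now holds the collected output.
    echo "$outdir"
}
```

Calling `collect_diagnostics <IP>` for each node leaves one text file per command, which can then be compared with `diff` or attached to a bug report.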
Bootstrap and Initial Setup Issues
Cluster Bootstrap Failures
Symptoms: Bootstrap command fails or times out.
Diagnosis:
# Check etcd service state
talosctl -n <IP> service etcd
# Check if node is trying to join instead of bootstrap
talosctl -n <IP> logs etcd | grep -i bootstrap
# Verify machine configuration
talosctl -n <IP> get machineconfig -o yaml
Common Causes & Solutions:
- Wrong node type: Ensure the node uses the controlplane machine type, not the deprecated init type
- Network issues: Verify control plane endpoint connectivity
- Configuration errors: Check machine configuration validity
- Previous bootstrap: etcd data exists from previous attempts
Resolution:
# Reset node if previous bootstrap data exists
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
# Re-apply configuration and bootstrap
talosctl apply-config --nodes <IP> --file controlplane.yaml
talosctl bootstrap --nodes <IP>
Node Join Issues
Symptoms: New nodes don't join the cluster.
Diagnosis:
# Check discovery
talosctl get affiliates
talosctl get members
# Check bootstrap token
kubectl get secrets -n kube-system | grep bootstrap-token
# Check kubelet logs
talosctl -n <IP> logs kubelet | grep -i certificate
Common Solutions:
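# Talos does not ship kubeadm; the join token is the cluster.token value in the machine config
# Confirm the node's token fields match those of a working control plane node
talosctl -n <IP> get machineconfig -o yaml | grep 'token:'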
# Verify discovery service connectivity
talosctl -n <IP> get affiliates --namespace=cluster-raw
# Check machine configuration matches cluster
talosctl -n <IP> get machineconfig -o yaml
Control Plane Issues
etcd Problems
etcd Won't Start:
# Check etcd service status and logs
talosctl -n <IP> service etcd
talosctl -n <IP> logs etcd
# Check etcd data directory
talosctl -n <IP> list /var/lib/etcd
# Check disk space and permissions
talosctl -n <IP> df
etcd Quorum Loss:
# Check member status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP> etcd members
# Identify healthy members
for ip in IP1 IP2 IP3; do  # replace IP1 IP2 IP3 with the control plane node addresses
  echo "=== Node $ip ==="
  talosctl -n "$ip" service etcd
done
Solution for Quorum Loss:
- If majority available: Remove failed members, add replacements
- If majority lost: Follow disaster recovery procedure
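Which of the two branches above applies follows from simple arithmetic: an etcd cluster of N members needs floor(N/2) + 1 healthy members to keep quorum. A small sketch of that check is shown here; the function names `etcd_quorum` and `has_quorum` are illustrative and not part of talosctl.

```shell
#!/bin/sh
# Print the quorum size for an etcd cluster with the given member count:
# floor(N / 2) + 1 (shell integer division rounds down).
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed (exit 0) if the healthy members still form a majority.
# Arguments: total member count, healthy member count.
has_quorum() {
    total="$1"
    healthy="$2"
    [ "$healthy" -ge "$(etcd_quorum "$total")" ]
}
```

For example, `has_quorum 3 2` succeeds (remove the failed member and add a replacement), while `has_quorum 3 1` fails (follow the disaster recovery procedure).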
API Server Issues
API Server Not Responding:
# Check API server pod status
kubectl get pods -n kube-system | grep apiserver
# Check API server configuration
talosctl -n <IP> get apiserverconfig -o yaml
# Check control plane endpoint
curl -k https://<control-plane-endpoint>:6443/healthz
Common Solutions:
# Restart kubelet to reload static pods
talosctl -n <IP> service kubelet restart
# Check for configuration issues
talosctl -n <IP> logs kubelet | grep apiserver
# Verify etcd connectivity
talosctl -n <IP> etcd status
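After a kubelet restart the API server can take a short while to come back, so a single curl check may give a misleading result. The loop below is a hedged sketch of polling the health endpoint. The function name `wait_for_healthz` and its defaults are hypothetical, and it assumes the endpoint answers unauthenticated requests; on a cluster that rejects anonymous requests, curl would see an HTTP error and the loop would report failure even though the API server is up.

```shell
#!/bin/sh
# Poll an HTTPS health endpoint until it answers or the attempts run out.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_healthz() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -k skips TLS verification (as in the curl check above),
        # -s silences progress output,
        # -f makes HTTP error responses return a non-zero exit code.
        if curl -ksf --max-time 5 "$url" > /dev/null; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "no healthy response after $attempts attempts" >&2
    return 1
}
```

A typical call would be `wait_for_healthz https://<control-plane-endpoint>:6443/healthz`.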
Node-Level Issues
Kubelet Problems
Kubelet Service Issues:
# Check kubelet status and logs
talosctl -n <IP> service kubelet
talosctl -n <IP> logs kubelet | tail -50
# Check kubelet configuration
talosctl -n <IP> get kubeletconfig -o yaml
# Check container runtime
talosctl -n <IP> service containerd
Common Kubelet Issues:
- Certificate problems: Check certificate expiration and rotation
- Container runtime issues: Verify containerd health
- Resource constraints: Check memory and disk space
- Network connectivity: Verify API server connectivity
Container Runtime Issues
Containerd Problems:
# Check containerd service
talosctl -n <IP> service containerd
talosctl -n <IP> logs containerd
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Check containerd configuration
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
Common Solutions:
# Restart containerd
talosctl -n <IP> service containerd restart
# Check disk space for container images
talosctl -n <IP> df
# Clean up unused containers/images
# (This happens automatically via kubelet GC)
Network Issues
Network Connectivity Problems
Node-to-Node Connectivity:
# Test basic network connectivity
talosctl -n <IP1> get links
talosctl -n <IP1> get routes
# Check the resolver configuration the node uses
talosctl -n <IP1> read /etc/resolv.conf
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
DNS Resolution Issues:
# Check DNS configuration
talosctl -n <IP> read /etc/resolv.conf
# Test in-cluster DNS resolution from a temporary pod (talosctl has no exec command)
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup kubernetes.default.svc.cluster.local
Discovery Service Issues
Discovery Not Working:
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Check affiliate discovery
talosctl get affiliates
talosctl get affiliates --namespace=cluster-raw
# Test discovery service connectivity
curl -v https://discovery.talos.dev/
KubeSpan Issues (if enabled):
# Check KubeSpan configuration
talosctl get kubespanconfig -o yaml
# Check peer status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Check WireGuard interface
talosctl -n <IP> get links | grep kubespan
Upgrade Issues
OS Upgrade Problems
Upgrade Fails or Hangs:
# Check upgrade status
talosctl -n <IP> dmesg | grep -i upgrade
talosctl -n <IP> events | grep -i upgrade
# Use staged upgrade for filesystem lock issues
talosctl upgrade --nodes <IP> --image <image> --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait --debug
Boot Issues After Upgrade:
# Check boot logs
talosctl -n <IP> dmesg | head -100
# System automatically rolls back on boot failure
# Check current version
talosctl -n <IP> version
# Manual rollback if needed
talosctl rollback --nodes <IP>
Kubernetes Upgrade Issues
K8s Upgrade Failures:
# Preview the upgrade without applying it
talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
# Check individual component status
kubectl get pods -n kube-system
talosctl -n <IP> get apiserverconfig -o yaml
Version Mismatch Issues:
# Check version consistency
kubectl get nodes -o wide
talosctl -n <IP1>,<IP2>,<IP3> version
# Check component versions
kubectl get pods -n kube-system -o wide
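Comparing versions by eye gets error-prone as the node count grows. A minimal sketch of an automated check follows; `versions_consistent` is a hypothetical helper that only compares the strings it is given and does not query the cluster itself.

```shell
#!/bin/sh
# Succeed (exit 0) only if every argument is the same version string.
# Fails when called without arguments.
versions_consistent() {
    [ "$#" -gt 0 ] || return 1
    # One line per argument, de-duplicated, then counted.
    distinct=$(printf '%s\n' "$@" | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}
```

One way to feed it is with the kubelet versions reported by the nodes: `versions_consistent $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}')`.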
Resource and Performance Issues
Memory and Storage Problems
Out of Memory:
# Check memory usage
talosctl -n <IP> memory
talosctl -n <IP> processes --sort rss | head -20
# Check for memory pressure
kubectl describe node <node-name> | grep -A 10 Conditions
# Check OOM events
talosctl -n <IP> dmesg | grep -i "out of memory"
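When the kernel log contains many OOM events, it helps to see which processes were killed most often. The filter below is a sketch that assumes the common kernel message form `Out of memory: Killed process <pid> (<name>)`; other kernel versions may word the line differently, in which case the pattern needs adjusting. The function name `oom_victims` is illustrative.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print each OOM-killed process name
# with the number of times it was killed, most frequent first.
oom_victims() {
    grep -i 'out of memory' |
        # Keep only the process name from "... Killed process 1234 (name) ...".
        sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Usage: `talosctl -n <IP> dmesg | oom_victims`.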
Disk Space Issues:
# Check disk usage
talosctl -n <IP> df
talosctl -n <IP> disks
# Check specific directories
talosctl -n <IP> list /var/lib/containerd
talosctl -n <IP> list /var/lib/etcd
# Check whether the kubelet reports disk pressure (automatic GC usually reclaims space)
kubectl describe node <node-name> | grep -i "DiskPressure"
Performance Issues
Slow Cluster Response:
# Check API server response time
time kubectl get nodes
# Check etcd performance
talosctl -n <IP> etcd status
# Look for high DB size vs IN USE ratio (fragmentation)
# Check system load
talosctl -n <IP> read /proc/loadavg
talosctl -n <IP> memory
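The DB size versus IN USE comparison mentioned above can be turned into a single number. This sketch computes the unused share of the database file as a percentage; `etcd_fragmentation_pct` is a hypothetical helper, and both sizes are assumed to have been converted to bytes first.

```shell
#!/bin/sh
# Print the share of the etcd DB file that is not in use, as a whole percentage.
# Arguments: DB size in bytes, in-use size in bytes.
etcd_fragmentation_pct() {
    db_size="$1"
    in_use="$2"
    if [ "$db_size" -le 0 ]; then
        echo "db size must be positive" >&2
        return 1
    fi
    # Integer arithmetic: the result is rounded down.
    echo $(( (db_size - in_use) * 100 / db_size ))
}
```

For a 100 MB database with 40 MB in use, `etcd_fragmentation_pct 100000000 40000000` prints `60`. What counts as too high is a judgment call for each cluster; a persistently large value is a reason to consider defragmenting with `talosctl etcd defrag`.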
High CPU/Memory Usage:
# Identify resource-heavy processes
talosctl -n <IP> processes --sort cpu | head -10
talosctl -n <IP> processes --sort rss | head -10
# Check cgroup usage
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
Configuration Issues
Machine Configuration Problems
Invalid Configuration:
# Validate configuration before applying
talosctl validate --config machineconfig.yaml --mode metal  # --mode is metal, cloud, or container
# Check current configuration
talosctl -n <IP> get machineconfig -o yaml
# Compare with expected configuration
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
Configuration Drift:
# Check configuration version
talosctl -n <IP> get machineconfig
# Re-apply configuration if needed
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
talosctl apply-config --nodes <IP> --file corrected-config.yaml
Emergency Procedures
Node Unresponsive
Complete Node Failure:
- Physical access required: Power cycle or hardware reset
- Check hardware: Memory, disk, network interface status
- Boot issues: May require bootable recovery media
Partial Connectivity:
# Try different network interfaces if multiple available
talosctl -e <alternate-ip> -n <IP> health
# Check if specific services are running
talosctl -n <IP> service machined
talosctl -n <IP> service apid
Cluster-Wide Failures
All Control Plane Nodes Down:
- Assess scope: Determine if data corruption or hardware failure
- Recovery strategy: Use etcd backup if available
- Rebuild process: May require complete cluster rebuild
Follow disaster recovery procedures as documented in etcd-management.md.
Emergency Reset Procedures
Single Node Reset:
# Graceful reset (default): cordons and drains the node and leaves etcd before wiping
talosctl -n <IP> reset
# Forced reset: skips cordon/drain and etcd leave, wipes all data, then reboots
talosctl -n <IP> reset --graceful=false --reboot
# Selective wipe (preserve STATE partition)
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
Cluster Reset (DESTRUCTIVE):
# Reset all nodes (DANGER: DATA LOSS)
for ip in IP1 IP2 IP3; do  # replace IP1 IP2 IP3 with the node addresses
  talosctl -n "$ip" reset --graceful=false --reboot
done
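Because the loop above is irreversible, it is worth putting a confirmation step in front of it when it lives in a script. The guard below is one possible sketch; `confirm_cluster_reset` is a hypothetical function, and requiring the operator to type the cluster name is an assumption about what makes a reasonable safeguard.

```shell
#!/bin/sh
# Ask the operator to type the cluster name before a destructive action.
# Reads one line from stdin and succeeds only on an exact match.
# Argument: the name the operator must type.
confirm_cluster_reset() {
    expected="$1"
    # The prompt goes to stderr so it does not mix with captured output.
    printf 'Type "%s" to wipe ALL nodes: ' "$expected" >&2
    IFS= read -r answer || return 1
    [ "$answer" = "$expected" ]
}
```

The reset loop can then be gated with `confirm_cluster_reset my-cluster || exit 1`, where `my-cluster` stands for the real cluster name.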
Monitoring and Alerting
Key Metrics to Monitor
- Node resource usage (CPU, memory, disk)
- etcd health and performance
- Control plane component status
- Network connectivity
- Certificate expiration
- Discovery service connectivity
Log Locations for External Monitoring
- Kernel logs: talosctl dmesg
- Service logs: talosctl logs <service>
- System events: talosctl events
- Kubernetes events: kubectl get events
This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.