# Talos Troubleshooting Guide
This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
## General Troubleshooting Methodology
### 1. Gather Information
```bash
# Node status and health
talosctl -n <IP> health
talosctl -n <IP> version
talosctl -n <IP> get members
# System resources
talosctl -n <IP> memory
talosctl -n <IP> get disks
talosctl -n <IP> processes | head -20
# Service status
talosctl -n <IP> services
```
### 2. Check Logs
```bash
# Kernel logs (system-level issues)
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs machined
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
# System events
talosctl -n <IP> events --since=1h
```
### 3. Network Connectivity
```bash
# Discovery and membership
talosctl get affiliates
talosctl get members
# Network interfaces
talosctl -n <IP> get links
talosctl -n <IP> get addresses
# Control plane connectivity
kubectl get nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
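When several nodes need the same checks, it can help to save the output to disk for later comparison. The following is a minimal sketch, assuming `talosctl` is already configured with a working talosconfig; `collect_diagnostics`, the default `./talos-diag` directory, and the file names are hypothetical helpers, not part of Talos.

```bash
# collect_diagnostics IP [OUTPUT_DIR]
# Runs a fixed set of read-only talosctl commands against one node and
# stores each command's output in OUTPUT_DIR/IP/<name>.txt.
collect_diagnostics() {
  local ip="$1"
  local outdir="${2:-./talos-diag}/$ip"
  local failed=0 name args
  mkdir -p "$outdir" || return 1
  # One check per line below: "<output file>:<talosctl arguments>".
  while IFS=: read -r name args; do
    # $args is unquoted on purpose so "get members" splits into two words.
    # stdin comes from /dev/null so talosctl cannot consume the check list.
    if ! talosctl -n "$ip" $args >"$outdir/$name.txt" 2>&1 </dev/null; then
      echo "warning: 'talosctl $args' failed on $ip" >&2
      failed=$((failed + 1))
    fi
  done <<'EOF'
version:version
services:services
members:get members
addresses:get addresses
dmesg:dmesg
kubelet:logs kubelet
EOF
  echo "$outdir"
  # Succeed only if every command succeeded.
  [ "$failed" -eq 0 ]
}
```

Calling `collect_diagnostics <IP>` prints the directory it wrote to. A non-zero exit status means at least one command failed, and the error text is kept in the corresponding file.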
## Bootstrap and Initial Setup Issues
### Cluster Bootstrap Failures
**Symptoms**: Bootstrap command fails or times out
**Diagnosis**:
```bash
# Check etcd service state
talosctl -n <IP> service etcd
# Check if node is trying to join instead of bootstrap
talosctl -n <IP> logs etcd | grep -i bootstrap
# Verify machine configuration
talosctl -n <IP> get machineconfig -o yaml
```
**Common Causes & Solutions**:
1. **Wrong node type**: Ensure the machine type is `controlplane`, not the deprecated `init`
2. **Network issues**: Verify control plane endpoint connectivity
3. **Configuration errors**: Check machine configuration validity
4. **Previous bootstrap**: etcd data exists from previous attempts
**Resolution**:
```bash
# Reset node if previous bootstrap data exists
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
# Re-apply configuration and bootstrap
talosctl apply-config --nodes <IP> --file controlplane.yaml
talosctl bootstrap --nodes <IP>
```
### Node Join Issues
**Symptoms**: New nodes don't join cluster
**Diagnosis**:
```bash
# Check discovery
talosctl get affiliates
talosctl get members
# Check bootstrap token
kubectl get secrets -n kube-system | grep bootstrap-token
# Check kubelet logs
talosctl -n <IP> logs kubelet | grep -i certificate
```
**Common Solutions**:
```bash
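# Talos does not use kubeadm; nodes join with the cluster.token from their machine config
# If the token does not match the cluster, regenerate the node's config from the cluster secrets and re-apply it
talosctl apply-config --nodes <IP> --file worker.yaml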
# Verify discovery service connectivity
talosctl -n <IP> get affiliates --namespace=cluster-raw
# Check machine configuration matches cluster
talosctl -n <IP> get machineconfig -o yaml
```
## Control Plane Issues
### etcd Problems
**etcd Won't Start**:
```bash
# Check etcd service status and logs
talosctl -n <IP> service etcd
talosctl -n <IP> logs etcd
# Check etcd data directory
talosctl -n <IP> list /var/lib/etcd
# Check disk space and permissions
talosctl -n <IP> mounts
```
**etcd Quorum Loss**:
```bash
# Check member status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP> etcd members
# Identify healthy members
for ip in IP1 IP2 IP3; do
  echo "=== Node $ip ==="
  talosctl -n "$ip" service etcd
done
```
**Solution for Quorum Loss**:
1. If majority available: Remove failed members, add replacements
2. If majority lost: Follow disaster recovery procedure
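Which of the two paths applies depends only on whether a strict majority of etcd members is still healthy. The arithmetic can be sketched as follows; the function names and the returned labels are illustrative, not Talos commands.

```bash
# quorum_size TOTAL
# Prints the smallest number of healthy members that still forms a majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum HEALTHY TOTAL
# Succeeds when the healthy members form a majority.
has_quorum() {
  [ "$1" -ge "$(quorum_size "$2")" ]
}

# recovery_path HEALTHY TOTAL
# Prints which of the two options applies.
recovery_path() {
  if has_quorum "$1" "$2"; then
    echo "replace-failed-members"
  else
    echo "disaster-recovery"
  fi
}
```

For a three-member cluster, `recovery_path 2 3` prints `replace-failed-members` and `recovery_path 1 3` prints `disaster-recovery`.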
### API Server Issues
**API Server Not Responding**:
```bash
# Check API server pod status
kubectl get pods -n kube-system | grep apiserver
# Check API server configuration
talosctl -n <IP> get apiserverconfig -o yaml
# Check control plane endpoint
curl -k https://<control-plane-endpoint>:6443/healthz
```
**Common Solutions**:
```bash
# Restart kubelet to reload static pods
talosctl -n <IP> service kubelet restart
# Check for configuration issues
talosctl -n <IP> logs kubelet | grep apiserver
# Verify etcd connectivity
talosctl -n <IP> etcd status
```
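After restarting kubelet, the API server can take a while to come back. A small polling loop around the `curl` check avoids retrying by hand. This is a sketch under assumptions: the function names are hypothetical, port 6443 is taken from the example above, and any HTTP status (including 401 or 403) is treated as "reachable", which shows the endpoint answers but does not prove the API server is healthy.

```bash
# apiserver_reachable ENDPOINT
# Succeeds if the endpoint returns any HTTP response on port 6443.
apiserver_reachable() {
  local code
  # curl reports status 000 when it received no HTTP response at all.
  code="$(curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' \
    "https://$1:6443/healthz")" || code=000
  [ -n "$code" ] && [ "$code" != "000" ]
}

# wait_for_apiserver ENDPOINT [ATTEMPTS] [DELAY_SECONDS]
# Polls until the endpoint answers or the attempts are used up.
wait_for_apiserver() {
  local endpoint="$1" attempts="${2:-10}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if apiserver_reachable "$endpoint"; then
      echo "reachable after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unreachable after $attempts attempt(s)" >&2
  return 1
}
```

For example, `wait_for_apiserver <control-plane-endpoint> 12 10` polls for up to roughly two minutes before giving up.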
## Node-Level Issues
### Kubelet Problems
**Kubelet Service Issues**:
```bash
# Check kubelet status and logs
talosctl -n <IP> service kubelet
talosctl -n <IP> logs kubelet | tail -50
# Check kubelet configuration
talosctl -n <IP> get kubeletconfig -o yaml
# Check container runtime
talosctl -n <IP> service containerd
```
**Common Kubelet Issues**:
1. **Certificate problems**: Check certificate expiration and rotation
2. **Container runtime issues**: Verify containerd health
3. **Resource constraints**: Check memory and disk space
4. **Network connectivity**: Verify API server connectivity
### Container Runtime Issues
**Containerd Problems**:
```bash
# Check containerd service
talosctl -n <IP> service containerd
talosctl -n <IP> logs containerd
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Check containerd configuration
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
```
**Common Solutions**:
```bash
# Restart containerd
talosctl -n <IP> service containerd restart
# Check disk space for container images
talosctl -n <IP> mounts
# Clean up unused containers/images
# (This happens automatically via kubelet GC)
```
## Network Issues
### Network Connectivity Problems
**Node-to-Node Connectivity**:
```bash
# Inspect links and routes
talosctl -n <IP1> get links
talosctl -n <IP1> get routes
# Check resolver configuration
talosctl -n <IP1> read /etc/resolv.conf
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
```
**DNS Resolution Issues**:
```bash
# Check DNS configuration
talosctl -n <IP> read /etc/resolv.conf
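# Test DNS resolution from inside the cluster (talosctl has no exec; use a temporary pod)
# The pod name and busybox image are examples; any image that ships nslookup works
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local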
```
### Discovery Service Issues
**Discovery Not Working**:
```bash
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Check affiliate discovery
talosctl get affiliates
talosctl get affiliates --namespace=cluster-raw
# Test discovery service connectivity
curl -v https://discovery.talos.dev/
```
**KubeSpan Issues** (if enabled):
```bash
# Check KubeSpan configuration
talosctl get kubespanconfig -o yaml
# Check peer status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Check WireGuard interface
talosctl -n <IP> get links | grep kubespan
```
## Upgrade Issues
### OS Upgrade Problems
**Upgrade Fails or Hangs**:
```bash
# Check upgrade status
talosctl -n <IP> dmesg | grep -i upgrade
talosctl -n <IP> events | grep -i upgrade
# Use staged upgrade for filesystem lock issues
talosctl upgrade --nodes <IP> --image <image> --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait --debug
```
**Boot Issues After Upgrade**:
```bash
# Check boot logs
talosctl -n <IP> dmesg | head -100
# System automatically rolls back on boot failure
# Check current version
talosctl -n <IP> version
# Manual rollback if needed
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrade Issues
**K8s Upgrade Failures**:
```bash
# Check upgrade status
talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
# Check individual component status
kubectl get pods -n kube-system
talosctl -n <IP> get apiserverconfig -o yaml
```
**Version Mismatch Issues**:
```bash
# Check version consistency
kubectl get nodes -o wide
talosctl -n <IP1>,<IP2>,<IP3> version
# Check component versions
kubectl get pods -n kube-system -o wide
```
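Spotting skew by eye in `kubectl get nodes -o wide` gets harder as the cluster grows. The sketch below reduces a list of version strings to a yes/no answer; `count_distinct` and `versions_consistent` are hypothetical helper names, and the `kubectl` jsonpath in the comment reads the standard `.status.nodeInfo` fields.

```bash
# count_distinct
# Reads one value per line on stdin and prints how many distinct
# non-empty values it saw.
count_distinct() {
  sort -u | grep -c .
}

# versions_consistent
# Succeeds when stdin contains exactly one distinct value.
versions_consistent() {
  [ "$(count_distinct)" -eq 1 ]
}

# Example (not run here): compare kubelet versions across all nodes.
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
#     versions_consistent || echo "kubelet version skew detected"
```

The same pipeline with `.status.nodeInfo.osImage` in place of `kubeletVersion` compares the OS image string each node reports.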
## Resource and Performance Issues
### Memory and Storage Problems
**Out of Memory**:
```bash
# Check memory usage
talosctl -n <IP> memory
talosctl -n <IP> processes --sort rss | head -20
# Check for memory pressure
kubectl describe node <node-name> | grep -A 10 Conditions
# Check OOM events
talosctl -n <IP> dmesg | grep -i "out of memory"
```
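To see which workloads the OOM killer hits most often, the matching kernel lines can be summarized per process name. This sketch assumes the message format printed by recent Linux kernels, `Out of memory: Killed process <pid> (<name>)`; older kernels word it differently, and `oom_victims` is a hypothetical helper.

```bash
# oom_victims
# Reads kernel log lines on stdin and prints "<count> <process name>" for
# every process the OOM killer reported, most frequent first.
oom_victims() {
  # Keep only the process name from each OOM kill line.
  sed -n 's/.*Out of memory: Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
    sort | uniq -c | sort -rn |
    # Strip the padding that uniq -c adds in front of the count.
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n, $0 }'
}

# Example (not run here):
#   talosctl -n <IP> dmesg | oom_victims
```

A process that appears repeatedly usually points to a pod whose memory limit is too low or a node that is overcommitted.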
**Disk Space Issues**:
```bash
# Check mounted filesystems and block devices
talosctl -n <IP> mounts
talosctl -n <IP> get disks
# Check space used by specific directories
talosctl -n <IP> usage -H -d 1 /var/lib/containerd
talosctl -n <IP> usage -H -d 1 /var/lib/etcd
# Check whether kubelet reports disk pressure (automatic GC usually reclaims space)
kubectl describe node <node-name> | grep -i diskpressure
```
### Performance Issues
**Slow Cluster Response**:
```bash
# Check API server response time
time kubectl get nodes
# Check etcd performance
talosctl -n <IP> etcd status
# Look for high DB size vs IN USE ratio (fragmentation)
# Check system load
talosctl -n <IP> processes --sort cpu | head -10
talosctl -n <IP> memory
```
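The DB size versus in-use comparison can be turned into a single percentage. The following sketch takes both values in bytes, so sizes shown in human-readable units by `etcd status` need converting first. The function names and the 50% default threshold are illustrative assumptions, not a Talos or etcd recommendation.

```bash
# etcd_fragmentation_pct DB_SIZE_BYTES IN_USE_BYTES
# Prints the share of the database file that is allocated but unused,
# rounded down to a whole percent.
etcd_fragmentation_pct() {
  local db="$1" in_use="$2"
  if [ "$db" -le 0 ] || [ "$in_use" -gt "$db" ]; then
    echo "invalid sizes: db=$db in_use=$in_use" >&2
    return 1
  fi
  echo $(( (db - in_use) * 100 / db ))
}

# needs_defrag DB_SIZE_BYTES IN_USE_BYTES [THRESHOLD_PCT]
# Succeeds when fragmentation is at or above the threshold (default 50).
needs_defrag() {
  local pct
  pct="$(etcd_fragmentation_pct "$1" "$2")" || return 2
  [ "$pct" -ge "${3:-50}" ]
}
```

When fragmentation is high, `talosctl -n <IP> etcd defrag` compacts the database file. Running it on one control plane node at a time keeps the other members responsive.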
**High CPU/Memory Usage**:
```bash
# Identify resource-heavy processes
talosctl -n <IP> processes --sort cpu | head -10
talosctl -n <IP> processes --sort rss | head -10
# Check cgroup usage
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
## Configuration Issues
### Machine Configuration Problems
**Invalid Configuration**:
```bash
# Validate configuration before applying
talosctl validate --config machineconfig.yaml --mode metal
# Check current configuration
talosctl -n <IP> get machineconfig -o yaml
# Compare with expected configuration
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
```
**Configuration Drift**:
```bash
# Check configuration version
talosctl -n <IP> get machineconfig
# Re-apply configuration if needed
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
talosctl apply-config --nodes <IP> --file corrected-config.yaml
```
## Emergency Procedures
### Node Unresponsive
**Complete Node Failure**:
1. **Physical access required**: Power cycle or hardware reset
2. **Check hardware**: Memory, disk, network interface status
3. **Boot issues**: May require bootable recovery media
**Partial Connectivity**:
```bash
# Try different network interfaces if multiple available
talosctl -e <alternate-ip> -n <IP> health
# Check if specific services are running
talosctl -n <IP> service machined
talosctl -n <IP> service apid
```
### Cluster-Wide Failures
**All Control Plane Nodes Down**:
1. **Assess scope**: Determine if data corruption or hardware failure
2. **Recovery strategy**: Use etcd backup if available
3. **Rebuild process**: May require complete cluster rebuild
**Follow disaster recovery procedures** as documented in etcd-management.md.
### Emergency Reset Procedures
**Single Node Reset**:
```bash
# Graceful reset (default): cordons and drains the node and leaves etcd before wiping
talosctl -n <IP> reset
# Forced reset: skips the drain and etcd leave steps, then wipes and reboots
talosctl -n <IP> reset --graceful=false --reboot
# Selective wipe (preserve STATE partition)
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
```
**Cluster Reset** (DESTRUCTIVE):
```bash
# Reset all nodes (DANGER: DATA LOSS)
for ip in IP1 IP2 IP3; do
  talosctl -n "$ip" reset --graceful=false --reboot
done
```
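Because the loop is irreversible, one option is to wrap it so that it only prints the commands unless an explicit confirmation is supplied. This is a sketch of that guard, not a Talos feature; `reset_cluster_nodes` and the `CONFIRM` variable are hypothetical.

```bash
# reset_cluster_nodes EXPECTED_CONFIRMATION IP...
# Prints the reset command for each node. The commands are executed only
# when the CONFIRM environment variable equals EXPECTED_CONFIRMATION.
reset_cluster_nodes() {
  local expected="$1" ip
  shift
  for ip in "$@"; do
    if [ "${CONFIRM:-}" = "$expected" ]; then
      talosctl -n "$ip" reset --graceful=false --reboot
    else
      echo "dry-run: talosctl -n $ip reset --graceful=false --reboot"
    fi
  done
}

# Example (not run here): review the dry-run output first, then confirm.
#   reset_cluster_nodes wipe-lab-cluster IP1 IP2 IP3
#   CONFIRM=wipe-lab-cluster reset_cluster_nodes wipe-lab-cluster IP1 IP2 IP3
```

Using the cluster name as the confirmation string makes it harder to run the destructive form against the wrong environment by accident.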
## Monitoring and Alerting
### Key Metrics to Monitor
- Node resource usage (CPU, memory, disk)
- etcd health and performance
- Control plane component status
- Network connectivity
- Certificate expiration
- Discovery service connectivity
### Log Locations for External Monitoring
- Kernel logs: `talosctl dmesg`
- Service logs: `talosctl logs <service>`
- System events: `talosctl events`
- Kubernetes events: `kubectl get events`
This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.