Adds docs useful for AI assistant context.

2025-09-28 15:25:43 -07:00
parent 838903e27d
commit 9877d73814
16 changed files with 5887 additions and 0 deletions
--- a/docs/agent-context/talos-v1.11/troubleshooting-guide.md
+++ b/docs/agent-context/talos-v1.11/troubleshooting-guide.md
@@ -0,0 +1,480 @@
+# Talos Troubleshooting Guide
+
+This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
+
+## General Troubleshooting Methodology
+
+### 1. Gather Information
+```bash
+# Node status and health
+talosctl -n <IP> health
+talosctl -n <IP> version
+talosctl -n <IP> get members
+
+# System resources
+talosctl -n <IP> memory
+talosctl -n <IP> disks
+talosctl -n <IP> processes | head -20
+
+# Service status
+talosctl -n <IP> services
+```
+
+### 2. Check Logs
+```bash
+# Kernel logs (system-level issues)
+talosctl -n <IP> dmesg | tail -100
+
+# Service logs
+talosctl -n <IP> logs machined
+talosctl -n <IP> logs kubelet
+talosctl -n <IP> logs containerd
+
+# System events
+talosctl -n <IP> events --since=1h
+```
+
+### 3. Network Connectivity
+```bash
+# Discovery and membership
+talosctl get affiliates
+talosctl get members
+
+# Network interfaces
+talosctl -n <IP> interfaces
+talosctl -n <IP> get addresses
+
+# Control plane connectivity
+kubectl get nodes
+talosctl -n <IP1>,<IP2>,<IP3> etcd status
+```
+
+## Bootstrap and Initial Setup Issues
+
+### Cluster Bootstrap Failures
+
+**Symptoms**: Bootstrap command fails or times out
+**Diagnosis**:
+```bash
+# Check etcd service state
+talosctl -n <IP> service etcd
+
+# Check if node is trying to join instead of bootstrap
+talosctl -n <IP> logs etcd | grep -i bootstrap
+
+# Verify machine configuration
+talosctl -n <IP> get machineconfig -o yaml
+```
+
+**Common Causes & Solutions**:
+1. **Wrong node type**: Ensure using `controlplane`, not deprecated `init`
+2. **Network issues**: Verify control plane endpoint connectivity
+3. **Configuration errors**: Check machine configuration validity
+4. **Previous bootstrap**: etcd data exists from previous attempts
+
+**Resolution**:
+```bash
+# Reset node if previous bootstrap data exists
+talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
+
+# Re-apply configuration and bootstrap
+talosctl apply-config --nodes <IP> --file controlplane.yaml
+talosctl bootstrap --nodes <IP>
+```
+
+### Node Join Issues
+
+**Symptoms**: New nodes don't join cluster
+**Diagnosis**:
+```bash
+# Check discovery
+talosctl get affiliates
+talosctl get members
+
+# Check bootstrap token
+kubectl get secrets -n kube-system | grep bootstrap-token
+
+# Check kubelet logs
+talosctl -n <IP> logs kubelet | grep -i certificate
+```
+
+**Common Solutions**:
+```bash
+# Regenerate bootstrap token if expired
+kubeadm token create --print-join-command
+
+# Verify discovery service connectivity
+talosctl -n <IP> get affiliates --namespace=cluster-raw
+
+# Check machine configuration matches cluster
+talosctl -n <IP> get machineconfig -o yaml
+```
+
+## Control Plane Issues
+
+### etcd Problems
+
+**etcd Won't Start**:
+```bash
+# Check etcd service status and logs
+talosctl -n <IP> service etcd
+talosctl -n <IP> logs etcd
+
+# Check etcd data directory
+talosctl -n <IP> list /var/lib/etcd
+
+# Check disk space and permissions
+talosctl -n <IP> df
+```
+
+**etcd Quorum Loss**:
+```bash
+# Check member status
+talosctl -n <IP1>,<IP2>,<IP3> etcd status
+talosctl -n <IP> etcd members
+
+# Identify healthy members
+for ip in IP1 IP2 IP3; do
+  echo "=== Node $ip ==="
+  talosctl -n $ip service etcd
+done
+```
+
+**Solution for Quorum Loss**:
+1. If majority available: Remove failed members, add replacements
+2. If majority lost: Follow disaster recovery procedure
+
+### API Server Issues
+
+**API Server Not Responding**:
+```bash
+# Check API server pod status
+kubectl get pods -n kube-system | grep apiserver
+
+# Check API server configuration
+talosctl -n <IP> get apiserverconfig -o yaml
+
+# Check control plane endpoint
+curl -k https://<control-plane-endpoint>:6443/healthz
+```
+
+**Common Solutions**:
+```bash
+# Restart kubelet to reload static pods
+talosctl -n <IP> service kubelet restart
+
+# Check for configuration issues
+talosctl -n <IP> logs kubelet | grep apiserver
+
+# Verify etcd connectivity
+talosctl -n <IP> etcd status
+```
+
+## Node-Level Issues
+
+### Kubelet Problems
+
+**Kubelet Service Issues**:
+```bash
+# Check kubelet status and logs
+talosctl -n <IP> service kubelet
+talosctl -n <IP> logs kubelet | tail -50
+
+# Check kubelet configuration
+talosctl -n <IP> get kubeletconfig -o yaml
+
+# Check container runtime
+talosctl -n <IP> service containerd
+```
+
+**Common Kubelet Issues**:
+1. **Certificate problems**: Check certificate expiration and rotation
+2. **Container runtime issues**: Verify containerd health
+3. **Resource constraints**: Check memory and disk space
+4. **Network connectivity**: Verify API server connectivity
+
+### Container Runtime Issues
+
+**Containerd Problems**:
+```bash
+# Check containerd service
+talosctl -n <IP> service containerd
+talosctl -n <IP> logs containerd
+
+# List containers
+talosctl -n <IP> containers
+talosctl -n <IP> containers -k  # Kubernetes containers
+
+# Check containerd configuration
+talosctl -n <IP> read /etc/cri/conf.d/cri.toml
+```
+
+**Common Solutions**:
+```bash
+# Restart containerd
+talosctl -n <IP> service containerd restart
+
+# Check disk space for container images
+talosctl -n <IP> df
+
+# Clean up unused containers/images
+# (This happens automatically via kubelet GC)
+```
+
+## Network Issues
+
+### Network Connectivity Problems
+
+**Node-to-Node Connectivity**:
+```bash
+# Test basic network connectivity
+talosctl -n <IP1> interfaces
+talosctl -n <IP1> get routes
+
+# Test specific connectivity
+talosctl -n <IP1> read /etc/resolv.conf
+
+# Check network configuration
+talosctl -n <IP> get networkconfig -o yaml
+```
+
+**DNS Resolution Issues**:
+```bash
+# Check DNS configuration
+talosctl -n <IP> read /etc/resolv.conf
+
+# Test DNS resolution
+talosctl -n <IP> exec --kubernetes coredns-pod -- nslookup kubernetes.default.svc.cluster.local
+```
+
+### Discovery Service Issues
+
+**Discovery Not Working**:
+```bash
+# Check discovery configuration
+talosctl get discoveryconfig -o yaml
+
+# Check affiliate discovery
+talosctl get affiliates
+talosctl get affiliates --namespace=cluster-raw
+
+# Test discovery service connectivity
+curl -v https://discovery.talos.dev/
+```
+
+**KubeSpan Issues** (if enabled):
+```bash
+# Check KubeSpan configuration
+talosctl get kubespanconfig -o yaml
+
+# Check peer status
+talosctl get kubespanpeerspecs
+talosctl get kubespanpeerstatuses
+
+# Check WireGuard interface
+talosctl -n <IP> interfaces | grep kubespan
+```
+
+## Upgrade Issues
+
+### OS Upgrade Problems
+
+**Upgrade Fails or Hangs**:
+```bash
+# Check upgrade status
+talosctl -n <IP> dmesg | grep -i upgrade
+talosctl -n <IP> events | grep -i upgrade
+
+# Use staged upgrade for filesystem lock issues
+talosctl upgrade --nodes <IP> --image <image> --stage
+
+# Monitor upgrade progress
+talosctl upgrade --nodes <IP> --image <image> --wait --debug
+```
+
+**Boot Issues After Upgrade**:
+```bash
+# Check boot logs
+talosctl -n <IP> dmesg | head -100
+
+# System automatically rolls back on boot failure
+# Check current version
+talosctl -n <IP> version
+
+# Manual rollback if needed
+talosctl rollback --nodes <IP>
+```
+
+### Kubernetes Upgrade Issues
+
+**K8s Upgrade Failures**:
+```bash
+# Check upgrade status
+talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
+
+# Check individual component status
+kubectl get pods -n kube-system
+talosctl -n <IP> get apiserverconfig -o yaml
+```
+
+**Version Mismatch Issues**:
+```bash
+# Check version consistency
+kubectl get nodes -o wide
+talosctl -n <IP1>,<IP2>,<IP3> version
+
+# Check component versions
+kubectl get pods -n kube-system -o wide
+```
+
+## Resource and Performance Issues
+
+### Memory and Storage Problems
+
+**Out of Memory**:
+```bash
+# Check memory usage
+talosctl -n <IP> memory
+talosctl -n <IP> processes --sort-by=memory | head -20
+
+# Check for memory pressure
+kubectl describe node <node-name> | grep -A 10 Conditions
+
+# Check OOM events
+talosctl -n <IP> dmesg | grep -i "out of memory"
+```
+
+**Disk Space Issues**:
+```bash
+# Check disk usage
+talosctl -n <IP> df
+talosctl -n <IP> disks
+
+# Check specific directories
+talosctl -n <IP> list /var/lib/containerd
+talosctl -n <IP> list /var/lib/etcd
+
+# Clean up if needed (automatic GC usually handles this)
+kubectl describe node <node-name> | grep -A 5 "Disk Pressure"
+```
+
+### Performance Issues
+
+**Slow Cluster Response**:
+```bash
+# Check API server response time
+time kubectl get nodes
+
+# Check etcd performance
+talosctl -n <IP> etcd status
+# Look for high DB size vs IN USE ratio (fragmentation)
+
+# Check system load
+talosctl -n <IP> cpu
+talosctl -n <IP> memory
+```
+
+**High CPU/Memory Usage**:
+```bash
+# Identify resource-heavy processes
+talosctl -n <IP> processes --sort-by=cpu | head -10
+talosctl -n <IP> processes --sort-by=memory | head -10
+
+# Check cgroup usage
+talosctl -n <IP> cgroups --preset memory
+talosctl -n <IP> cgroups --preset cpu
+```
+
+## Configuration Issues
+
+### Machine Configuration Problems
+
+**Invalid Configuration**:
+```bash
+# Validate configuration before applying
+talosctl validate -f machineconfig.yaml
+
+# Check current configuration
+talosctl -n <IP> get machineconfig -o yaml
+
+# Compare with expected configuration
+diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
+```
+
+**Configuration Drift**:
+```bash
+# Check configuration version
+talosctl -n <IP> get machineconfig
+
+# Re-apply configuration if needed
+talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
+talosctl apply-config --nodes <IP> --file corrected-config.yaml
+```
+
+## Emergency Procedures
+
+### Node Unresponsive
+
+**Complete Node Failure**:
+1. **Physical access required**: Power cycle or hardware reset
+2. **Check hardware**: Memory, disk, network interface status
+3. **Boot issues**: May require bootable recovery media
+
+**Partial Connectivity**:
+```bash
+# Try different network interfaces if multiple available
+talosctl -e <alternate-ip> -n <IP> health
+
+# Check if specific services are running
+talosctl -n <IP> service machined
+talosctl -n <IP> service apid
+```
+
+### Cluster-Wide Failures
+
+**All Control Plane Nodes Down**:
+1. **Assess scope**: Determine if data corruption or hardware failure
+2. **Recovery strategy**: Use etcd backup if available
+3. **Rebuild process**: May require complete cluster rebuild
+
+**Follow disaster recovery procedures** as documented in etcd-management.md.
+
+### Emergency Reset Procedures
+
+**Single Node Reset**:
+```bash
+# Graceful reset (preserves some data)
+talosctl -n <IP> reset
+
+# Force reset (wipes all data)
+talosctl -n <IP> reset --graceful=false --reboot
+
+# Selective wipe (preserve STATE partition)
+talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
+```
+
+**Cluster Reset** (DESTRUCTIVE):
+```bash
+# Reset all nodes (DANGER: DATA LOSS)
+for ip in IP1 IP2 IP3; do
+  talosctl -n $ip reset --graceful=false --reboot
+done
+```
+
+## Monitoring and Alerting
+
+### Key Metrics to Monitor
+- Node resource usage (CPU, memory, disk)
+- etcd health and performance
+- Control plane component status
+- Network connectivity
+- Certificate expiration
+- Discovery service connectivity
+
+### Log Locations for External Monitoring
+- Kernel logs: `talosctl dmesg`
+- Service logs: `talosctl logs <service>`
+- System events: `talosctl events`
+- Kubernetes events: `kubectl get events`
+
+This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.