Initial commit.

2025-10-11 18:08:04 +00:00
commit 8947da88eb
43 changed files with 7850 additions and 0 deletions

ai/talos-v1.11/README.md Normal file

@@ -0,0 +1,135 @@
# Talos v1.11 Agent Context Documentation
This directory contains comprehensive documentation extracted from the official Talos v1.11 documentation, organized specifically to help AI agents become expert Talos cluster administrators.
## Documentation Structure
### Core Operations
- **[cluster-operations.md](cluster-operations.md)** - Essential cluster operations including upgrades, node management, and configuration
- **[cli-essentials.md](cli-essentials.md)** - Key talosctl commands and usage patterns for daily administration
### System Understanding
- **[architecture-and-components.md](architecture-and-components.md)** - Deep dive into Talos architecture, components, and design principles
- **[discovery-and-networking.md](discovery-and-networking.md)** - Cluster discovery mechanisms and network configuration
### Specialized Operations
- **[etcd-management.md](etcd-management.md)** - etcd operations, maintenance, backup, and disaster recovery
- **[bare-metal-administration.md](bare-metal-administration.md)** - Bare metal specific configurations, security, and hardware management
- **[troubleshooting-guide.md](troubleshooting-guide.md)** - Systematic approaches to diagnosing and resolving common issues
## Quick Reference
### Essential Commands for New Agents
```bash
# Cluster health check
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
# Node information
talosctl get members
talosctl -n <IP> version
# Service status
talosctl -n <IP> services
talosctl -n <IP> service kubelet
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
# Logs and events
talosctl -n <IP> dmesg | tail -50
talosctl -n <IP> logs kubelet
talosctl -n <IP> events --tail-duration 1h
```
### Critical Procedures
- **Bootstrap**: `talosctl bootstrap --nodes <first-controlplane-ip>`
- **Backup etcd**: `talosctl -n <IP> etcd snapshot db.snapshot`
- **Upgrade OS**: `talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x`
- **Upgrade K8s**: `talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1`
### Emergency Commands
- **Node reset**: `talosctl -n <IP> reset`
- **Force reset**: `talosctl -n <IP> reset --graceful=false --reboot`
- **Disaster recovery**: `talosctl -n <IP> bootstrap --recover-from=./db.snapshot`
- **Rollback**: `talosctl rollback --nodes <IP>`
### Bare Metal Specific Commands
- **Check hardware**: `talosctl -n <IP> disks`, `talosctl -n <IP> read /proc/cpuinfo`
- **Network interfaces**: `talosctl -n <IP> get addresses`, `talosctl -n <IP> get routes`
- **Extensions**: `talosctl -n <IP> get extensions`
- **Encryption status**: `talosctl -n <IP> get encryptionconfig -o yaml`
- **Hardware monitoring**: `talosctl -n <IP> dmesg | grep -i error`
## Key Concepts for Agents
### Architecture Fundamentals
- **Immutable OS**: Single image, atomic updates, A-B rollback system
- **API-driven**: All management through gRPC API, no SSH/shell access
- **Controller pattern**: Kubernetes-style resource controllers for system management
- **Minimal attack surface**: Only services necessary for Kubernetes
### Control Plane Design
- **etcd quorum**: Requires majority for operations (3-node=2, 5-node=3)
- **Bootstrap process**: One-time initialization of etcd cluster
- **HA considerations**: Run an odd number of control plane nodes; an even count adds no extra fault tolerance
- **Upgrade strategy**: Rolling upgrades with automatic rollback on failure
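The quorum arithmetic above can be checked with a quick sketch:

```bash
# quorum = floor(n/2) + 1; fault tolerance = n - quorum
etcd_quorum() { echo $(( $1 / 2 + 1 )); }
etcd_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "$n nodes: quorum $(etcd_quorum "$n"), tolerates $(etcd_tolerance "$n") failure(s)"
done
```

Note that 4 nodes yield quorum 3 and still tolerate only 1 failure, which is why even-sized control planes are discouraged.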
### Network and Discovery
- **Service discovery**: Encrypted discovery service for cluster membership
- **KubeSpan**: Optional WireGuard mesh networking
- **mTLS everywhere**: All Talos API communication secured
- **Discovery registries**: Service (default) and Kubernetes (deprecated)
### Bare Metal Considerations
- **META configuration**: Network config embedded in disk images
- **Hardware compatibility**: Driver support and firmware requirements
- **Disk encryption**: LUKS2 with TPM, static keys, or node ID
- **SecureBoot**: UKI images with embedded signatures
- **System extensions**: Hardware-specific drivers and tools
- **Performance tuning**: CPU governors, IOMMU, memory management
## Common Administration Patterns
### Daily Operations
1. Check cluster health across all nodes
2. Monitor resource usage and capacity
3. Review system events and logs
4. Verify etcd health and backup status
5. Monitor discovery service connectivity
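A minimal sketch wrapping these five checks into one routine; the node IPs in `NODES` are placeholders, not values from this document:

```bash
# Daily read-only health sweep across all nodes (placeholder IPs)
NODES="10.0.0.1 10.0.0.2 10.0.0.3"

daily_check() {
  cp_list=$(echo "$NODES" | tr ' ' ',')
  talosctl -n "$cp_list" health --control-plane-nodes "$cp_list"  # 1. cluster health
  for n in $NODES; do
    echo "== $n =="
    talosctl -n "$n" memory                      # 2. resource usage
    talosctl -n "$n" events --tail-duration 1h   # 3. recent events
    talosctl -n "$n" etcd status                 # 4. etcd health (control plane nodes)
    talosctl -n "$n" get affiliates              # 5. discovery connectivity
  done
}
```

Run it as `daily_check > "daily-$(date +%F).log"` to keep a dated record.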
### Maintenance Windows
1. Plan the upgrade sequence (control plane nodes first, then workers)
2. Create etcd backup before major changes
3. Apply configuration changes with dry-run first
4. Monitor upgrade progress and be ready to rollback
5. Verify cluster functionality after changes
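Steps 2 and 3 can be scripted as a pre-change gate; the control plane IP and file name below are assumptions:

```bash
CP=10.0.0.1   # placeholder control plane IP

pre_change() {
  # Step 2: etcd backup before any major change
  talosctl -n "$CP" etcd snapshot "pre-change-$(date +%F).snapshot" || return 1
  # Step 3: dry-run the configuration change first
  talosctl -n "$CP" apply-config --file controlplane.yaml --dry-run
}
```

Only proceed to the real `apply-config` once the dry-run output shows the expected diff.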
### Troubleshooting Workflow
1. **Gather information**: Health, version, resources, logs
2. **Check connectivity**: Network, discovery, API endpoints
3. **Examine services**: Status of critical services
4. **Review logs**: System events, service logs, kernel messages
5. **Apply fixes**: Configuration patches, service restarts, node resets
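A hypothetical triage helper following that order (the IP passed in is a placeholder):

```bash
triage() {
  node="$1"
  talosctl -n "$node" version        # 1. gather information
  talosctl -n "$node" get members    # 2. connectivity and membership
  talosctl -n "$node" services       # 3. critical service status
  talosctl -n "$node" logs kubelet   # 4. service logs
  talosctl -n "$node" dmesg          # 4. kernel messages
}
```

Fixes (step 5) stay manual: patch, restart, or reset based on what the output shows.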
## Best Practices for Agents
### Configuration Management
- Use reproducible configuration workflow (secrets + patches)
- Always dry-run configuration changes first
- Store machine configurations in version control
- Test configuration changes in non-production first
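The secrets-plus-patches workflow can be sketched as below; the cluster name, endpoint, and patch file name are assumptions:

```bash
gen_cluster_config() {
  # Secrets bundle is generated once and kept (encrypted) in version control
  talosctl gen secrets -o secrets.yaml
  # Machine configs are regenerated deterministically from secrets + patches
  talosctl gen config my-cluster https://cluster.example.com:6443 \
    --with-secrets secrets.yaml \
    --config-patch @patch.yaml
}
```

Regenerating from the same inputs yields the same configs, which is what makes the workflow reproducible.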
### Operational Safety
- Take etcd snapshots before major changes
- Upgrade one node at a time
- Monitor upgrade progress and have rollback ready
- Test disaster recovery procedures regularly
### Performance Optimization
- Monitor etcd fragmentation and defragment when needed
- Scale vertically before horizontally for control plane
- Use appropriate hardware for etcd (fast storage, low network latency)
- Monitor resource usage trends and capacity planning
This documentation provides the essential knowledge needed to effectively administer Talos Linux clusters, organized by operational context and complexity level.


@@ -0,0 +1,248 @@
# Talos Architecture and Components Guide
This guide provides deep understanding of Talos Linux architecture and system components for effective cluster administration.
## Core Architecture Principles
Talos is designed to be:
- **Atomic**: Distributed as a single, versioned, signed, immutable image
- **Modular**: Composed of separate components with defined gRPC interfaces
- **Minimal**: Focused init system that runs only services necessary for Kubernetes
## File System Architecture
### Partition Layout
- **EFI**: Stores EFI boot data
- **BIOS**: Used for GRUB's second stage boot
- **BOOT**: Contains boot loader, initramfs, and kernel data
- **META**: Stores node metadata (node IDs, etc.)
- **STATE**: Stores machine configuration, node identity, cluster discovery, KubeSpan data
- **EPHEMERAL**: Stores ephemeral state, mounted at `/var`
### Root File System Structure
Three-layer design:
1. **Base Layer**: Read-only squashfs mounted as loop device (immutable base)
2. **Runtime Layer**: tmpfs filesystems for runtime needs (`/dev`, `/proc`, `/run`, `/sys`, `/tmp`, `/system`)
3. **Overlay Layer**: overlayfs for persistent data backed by XFS at `/var`
#### Special Directories
- `/system`: Internal files that need to be writable (recreated each boot)
- Example: `/system/etc/hosts` bind-mounted over `/etc/hosts`
- `/var`: Owned by Kubernetes, contains persistent data:
- etcd data (control plane nodes)
- kubelet data
- containerd data
- Survives reboots and upgrades, wiped on reset
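The three layers are visible from a live node; a small sketch (the IP argument is a placeholder):

```bash
show_layers() {
  node="$1"
  # squashfs = immutable base, tmpfs = runtime, overlay/xfs = /var persistence
  talosctl -n "$node" read /proc/mounts
  talosctl -n "$node" mounts
}
```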
## Core Components
### machined (PID 1)
**Role**: Talos replacement for traditional init process
**Functions**:
- Machine configuration management
- API handling
- Resource and controller management
- Service lifecycle management
**Managed Services**:
- apid
- containerd
- etcd (control plane nodes)
- kubelet
- trustd
- udevd
**Architecture**: Uses controller-runtime pattern similar to Kubernetes controllers
### apid (API Gateway)
**Role**: gRPC API endpoint for all Talos interactions
**Functions**:
- Routes requests to appropriate components
- Provides proxy capabilities for multi-node operations
- Handles authentication and authorization
**Usage Patterns**:
```bash
# Direct node communication
talosctl -e <node-ip> <command>
# Proxy through endpoint to specific nodes
talosctl -e <endpoint> -n <target-nodes> <command>
# Multi-node operations
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
```
### trustd (Trust Management)
**Role**: Establishes and maintains trust within the system
**Functions**:
- Root of Trust implementation
- PKI data distribution for control plane bootstrap
- Certificate management
- Secure file placement operations
### containerd (Container Runtime)
**Role**: Industry-standard container runtime
**Namespaces**:
- `system`: Talos services
- `k8s.io`: Kubernetes services
### udevd (Device Management)
**Role**: Device file manager (eudev implementation)
**Functions**:
- Kernel device notification handling
- Device node management in `/dev`
- Hardware discovery and setup
## Control Plane Architecture
### etcd Cluster Design
**Critical Concepts**:
- **Quorum**: Majority of members must agree on leader
- **Membership**: Formal etcd cluster membership required
- **Consensus**: Uses Raft protocol for distributed consensus
**Quorum Requirements**:
- 3 nodes: Requires 2 for quorum (tolerates 1 failure)
- 5 nodes: Requires 3 for quorum (tolerates 2 failures)
- Even numbers are worse than odd (4 nodes still only tolerates 1 failure)
### Control Plane Components
**Running as Static Pods on Control Plane Nodes**:
#### kube-apiserver
- Kubernetes API endpoint
- Connects to local etcd instance
- Handles all API operations
#### kube-controller-manager
- Runs control loops
- Manages cluster state reconciliation
- Handles node lifecycle, replication, etc.
#### kube-scheduler
- Pod placement decisions
- Resource-aware scheduling
- Constraint satisfaction
### Bootstrap Process
1. **etcd Bootstrap**: One node chosen as bootstrap node, initializes etcd cluster
2. **Static Pods**: Control plane components start as static pods via kubelet
3. **API Availability**: Control plane endpoint becomes available
4. **Manifest Injection**: Bootstrap manifests (join tokens, RBAC, etc.) injected
5. **Cluster Formation**: Other control plane nodes join etcd cluster
6. **HA Control Plane**: All control plane nodes run full component set
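From the operator's side, the sequence looks roughly like this (placeholder IP; run once against the chosen bootstrap node):

```bash
CP1=10.0.0.1   # placeholder: first control plane node

bootstrap_and_verify() {
  talosctl -n "$CP1" bootstrap       # step 1: initialize the etcd cluster
  talosctl -n "$CP1" service etcd    # etcd should reach the Running state
  talosctl -n "$CP1" service kubelet # step 2: kubelet launches static pods
  talosctl -n "$CP1" kubeconfig .    # step 3: API endpoint is now reachable
}
```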
## Resource System Architecture
### Controller-Runtime Pattern
Talos uses Kubernetes-style controller pattern:
- **Resources**: Typed configuration and state objects
- **Controllers**: Reconcile desired vs actual state
- **Events**: Reactive architecture for state changes
### Resource Namespaces
- `config`: Machine configuration resources
- `cluster`: Cluster membership and discovery
- `controlplane`: Control plane component configurations
- `secrets`: Certificate and key management
- `network`: Network configuration and state
### Key Resources
```bash
# Machine configuration
talosctl get machineconfig
talosctl get machinetype
# Cluster membership
talosctl get members
talosctl get affiliates
talosctl get identities
# Control plane
talosctl get apiserverconfig
talosctl get controllermanagerconfig
talosctl get schedulerconfig
# Network
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
```
## Network Architecture
### Network Stack
- **CNI**: Container Network Interface for pod networking
- **Host Networking**: Node-to-node communication
- **Service Discovery**: Built-in cluster member discovery
- **KubeSpan**: Optional WireGuard mesh networking
### Discovery Service Integration
- **Service Registry**: External discovery service (default: discovery.talos.dev)
- **Kubernetes Registry**: Deprecated, uses Kubernetes Node resources
- **Encrypted Communication**: All discovery data encrypted before transmission
## Security Architecture
### Immutable Base
- Read-only root filesystem
- Signed and verified boot process
- Atomic updates with rollback capability
### Process Isolation
- Minimal attack surface
- No shell access
- No arbitrary user services
- Container-based workload isolation
### Network Security
- Mutual TLS (mTLS) for all API communication
- Certificate-based node authentication
- Optional WireGuard mesh networking (KubeSpan)
- Encrypted service discovery
### Kernel Hardening
Configured according to Kernel Self Protection Project (KSPP) recommendations:
- Stack protection
- Control flow integrity
- Memory protection features
- Attack surface reduction
## Extension Points
### Machine Configuration
- Declarative configuration management
- Patch-based configuration updates
- Runtime configuration validation
### System Extensions
- Kernel modules
- System services (limited)
- Network configuration
- Storage configuration
### Kubernetes Integration
- Automatic kubelet configuration
- Bootstrap manifest management
- Certificate lifecycle management
- Node lifecycle automation
## Performance Characteristics
### etcd Performance
- Performance decreases with cluster size
- Network latency affects consensus performance
- Storage I/O directly impacts etcd performance
### Resource Requirements
- **Control Plane Nodes**: Higher memory for etcd, CPU for control plane
- **Worker Nodes**: Resources scale with workload requirements
- **Network**: Low latency crucial for etcd performance
### Scaling Patterns
- **Horizontal Scaling**: Add worker nodes for capacity
- **Vertical Scaling**: Increase control plane node resources for performance
- **Control Plane Scaling**: Odd numbers (3, 5) for availability
This architecture enables Talos to provide a secure, minimal, and operationally simple platform for running Kubernetes clusters while maintaining the reliability and performance characteristics needed for production workloads.


@@ -0,0 +1,506 @@
# Bare Metal Talos Administration Guide
This guide covers bare metal specific operations, configurations, and best practices for Talos Linux clusters.
## META-Based Network Configuration
Talos supports META-based network configuration for bare metal deployments where configuration is embedded in the disk image.
### Basic META Configuration
```yaml
# META configuration for bare metal networking
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
        mtu: 1500
    nameservers:
      - 8.8.8.8
      - 1.1.1.1
```
### Advanced Network Configurations
#### VLAN Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        vlans:
          - vlanId: 100 # VLAN 100
            addresses:
              - 192.168.100.10/24
            routes:
              - network: 192.168.100.0/24
```
#### Interface Bonding
```yaml
machine:
  network:
    interfaces:
      - interface: bond0
        bond:
          mode: 802.3ad
          lacpRate: fast
          xmitHashPolicy: layer3+4
          miimon: 100
          updelay: 200
          downdelay: 200
          interfaces:
            - eth0
            - eth1
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
```
#### Bridge Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: br0
        bridge:
          stp:
            enabled: false
          interfaces:
            - eth0
            - eth1
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
```
### Network Troubleshooting Commands
```bash
# Check interface configuration
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
talosctl -n <IP> get links
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
# Test network connectivity
talosctl -n <IP> list /sys/class/net
talosctl -n <IP> read /proc/net/dev
```
## Disk Encryption for Bare Metal
### LUKS2 Encryption Configuration
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          static:
            passphrase: "your-secure-passphrase"
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          nodeID: {}
```
### TPM-Based Encryption
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
```
### Key Management Operations
```bash
# Check encryption status
talosctl -n <IP> get encryptionconfig -o yaml
# Rotate encryption keys
talosctl -n <IP> apply-config --file updated-config.yaml --mode staged
```
## SecureBoot Implementation
### UKI (Unified Kernel Image) Setup
SecureBoot requires UKI format images with embedded signatures.
#### Generate SecureBoot Keys
```bash
# Generate UKI signing key and certificate
talosctl gen secureboot uki --common-name "SecureBoot Key"
# Generate PCR signing key
talosctl gen secureboot pcr
# Generate UEFI signature database entries
talosctl gen secureboot database
```
#### Machine Configuration for SecureBoot
SecureBoot itself comes from booting the signed UKI installer image; there is no `machine.secureboot` section. The machine configuration pairs it with TPM-backed disk encryption so the keys are sealed to the measured boot state:
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
```
### UEFI Configuration
- Enable SecureBoot in UEFI firmware
- Enroll platform keys and certificates
- Configure TPM 2.0 for PCR measurements
- Set boot order for UKI images
## Hardware-Specific Configurations
### Performance Tuning for Bare Metal
#### CPU Governor Configuration
```yaml
machine:
  sysfs:
    "devices.system.cpu.cpu0.cpufreq.scaling_governor": "performance"
    "devices.system.cpu.cpu1.cpufreq.scaling_governor": "performance"
```
#### Hardware Vulnerability Mitigations
```yaml
machine:
  install:
    extraKernelArgs:
      - mitigations=off # maximum performance (less secure)
      # or the default balanced approach instead:
      # - mitigations=auto
```
#### IOMMU Configuration
```yaml
machine:
  install:
    extraKernelArgs:
      - intel_iommu=on
      - iommu=pt
```
### Memory Management
```yaml
machine:
  install:
    extraKernelArgs:
      - hugepages=1024 # 1024 default-size (2 MiB) hugepages
      - transparent_hugepage=never
```
## Ingress Firewall for Bare Metal
### Basic Firewall Configuration
The ingress firewall is configured with separate `NetworkDefaultActionConfig` and `NetworkRuleConfig` documents appended to the machine configuration, not under `machine.network`:
```yaml
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-talos-api
portSelector:
  ports:
    - 50000
    - 50001
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-kubernetes-api
portSelector:
  ports:
    - 6443
  protocol: tcp
ingress:
  - subnet: 0.0.0.0/0
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-etcd
portSelector:
  ports:
    - 2379
    - 2380
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```
### Advanced Firewall Rules
Talos has no SSH, so the management rule below opens the Talos API to a management subnet instead:
```yaml
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-talos-api-management
portSelector:
  ports:
    - 50000
  protocol: tcp
ingress:
  - subnet: 10.0.1.0/24 # management network only
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-monitoring
portSelector:
  ports:
    - 9100  # node exporter
    - 10250 # kubelet metrics
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```
## System Extensions for Bare Metal
### Common Bare Metal Extensions
```yaml
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/iscsi-tools:latest
      - image: ghcr.io/siderolabs/util-linux-tools:latest
      - image: ghcr.io/siderolabs/drbd:latest
```
### Storage Extensions
```yaml
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/zfs:latest
      - image: ghcr.io/siderolabs/nut-client:latest
      - image: ghcr.io/siderolabs/smartmontools:latest
```
### Checking Extension Status
```bash
# List installed extensions
talosctl -n <IP> get extensions
# Check extension services
talosctl -n <IP> get extensionserviceconfigs
```
## Static Pod Configuration for Bare Metal
### Local Storage Static Pods
Entries under `machine.pods` are full Kubernetes Pod manifests:
```yaml
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: local-storage-provisioner
        namespace: kube-system
      spec:
        containers:
          - name: provisioner
            image: rancher/local-path-provisioner:v0.0.24
            args:
              - --config-path=/etc/config/config.json
            env:
              - name: POD_NAMESPACE
                value: kube-system
            volumeMounts:
              - name: config
                mountPath: /etc/config
              - name: local-storage
                mountPath: /opt/local-path-provisioner
        volumes:
          - name: config
            hostPath:
              path: /etc/local-storage
          - name: local-storage
            hostPath:
              path: /var/lib/local-storage
```
### Hardware Monitoring Static Pods
```yaml
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: node-exporter
        namespace: monitoring
      spec:
        containers:
          - name: node-exporter
            image: prom/node-exporter:latest
            args:
              - --path.rootfs=/host
              - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
            securityContext:
              runAsNonRoot: true
              runAsUser: 65534
            volumeMounts:
              - name: proc
                mountPath: /host/proc
                readOnly: true
              - name: sys
                mountPath: /host/sys
                readOnly: true
              - name: rootfs
                mountPath: /host
                readOnly: true
        volumes:
          - name: proc
            hostPath:
              path: /proc
          - name: sys
            hostPath:
              path: /sys
          - name: rootfs
            hostPath:
              path: /
```
## Bare Metal Boot Asset Management
### PXE Boot Configuration
For network booting, configure DHCP/TFTP with appropriate boot assets:
```bash
# Download kernel and initramfs for PXE
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/vmlinuz-amd64
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/initramfs-amd64.xz
```
### USB Boot Asset Creation
```bash
# Write installer image to USB
sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress
```
### Image Factory Integration
For custom bare metal images:
```bash
# Generate schematic for bare metal with extensions
curl -X POST --data-binary @schematic.yaml \
https://factory.talos.dev/schematics
# Download custom installer
curl -LO https://factory.talos.dev/image/<schematic-id>/v1.11.0/metal-amd64.iso
```
## Hardware Compatibility and Drivers
### Check Hardware Support
```bash
# Check PCI devices
talosctl -n <IP> read /proc/bus/pci/devices
# Check USB devices
talosctl -n <IP> list /sys/bus/usb/devices
# Check loaded kernel modules
talosctl -n <IP> read /proc/modules
# Check hardware information
talosctl -n <IP> read /proc/cpuinfo
talosctl -n <IP> read /proc/meminfo
```
### Common Hardware Issues
#### Network Interface Issues
```bash
# Check interface status
talosctl -n <IP> list /sys/class/net/
# Check driver information
talosctl -n <IP> read /sys/class/net/eth0/device/driver
# Check firmware loading
talosctl -n <IP> dmesg | grep firmware
```
#### Storage Controller Issues
```bash
# Check block devices
talosctl -n <IP> disks
# Check SMART status (if smartmontools extension installed)
talosctl -n <IP> list /dev/disk/by-id/
```
## Bare Metal Monitoring and Maintenance
### Hardware Health Monitoring
```bash
# Check system temperatures (if available)
talosctl -n <IP> read /sys/class/thermal/thermal_zone0/temp
# Check power supply status
talosctl -n <IP> list /sys/class/power_supply
# Monitor system events for hardware issues
talosctl -n <IP> dmesg | grep -i error
talosctl -n <IP> dmesg | grep -i "machine check"
```
### Performance Monitoring
```bash
# Check CPU performance
talosctl -n <IP> read /proc/cpuinfo | grep MHz
talosctl -n <IP> cgroups --preset cpu
# Check memory performance
talosctl -n <IP> memory
talosctl -n <IP> cgroups --preset memory
# Check I/O performance
talosctl -n <IP> read /proc/diskstats
```
## Security Hardening for Bare Metal
### BIOS/UEFI Security
- Enable SecureBoot
- Disable unused boot devices
- Set administrator passwords
- Enable TPM 2.0
- Disable legacy boot modes
### Physical Security
- Secure physical access to servers
- Use chassis intrusion detection
- Implement network port security
- Consider hardware-based attestation
### Network Security
```yaml
# Only allow necessary cluster services
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-cluster-traffic
portSelector:
  ports:
    - 6443  # Kubernetes API
    - 2379  # etcd client
    - 2380  # etcd peer
    - 10250 # kubelet API
    - 50000 # Talos API
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```
This bare metal guide provides comprehensive coverage of hardware-specific configurations, performance optimization, security hardening, and operational practices for Talos Linux on physical servers.

View File

@@ -0,0 +1,382 @@
# Talosctl CLI Essentials
This guide covers essential talosctl commands and usage patterns for effective Talos cluster administration.
## Command Structure and Context
### Basic Command Pattern
```bash
talosctl [global-flags] <command> [command-flags] [arguments]
# Examples:
talosctl -n <IP> get members
talosctl --nodes <IP1>,<IP2> service kubelet
talosctl -e <endpoint> -n <target-nodes> upgrade --image <image>
```
### Global Flags
- `-e, --endpoints`: API endpoints to connect to
- `-n, --nodes`: Target nodes for commands (defaults to first endpoint if omitted)
- `--talosconfig`: Path to Talos configuration file
- `--context`: Configuration context to use
### Configuration Management
```bash
# Use specific config file
export TALOSCONFIG=/path/to/talosconfig
# List available contexts
talosctl config contexts
# Switch context
talosctl config context <context-name>
# View current config
talosctl config info
```
## Cluster Management Commands
### Bootstrap and Node Management
```bash
# Bootstrap etcd cluster on first control plane node
talosctl bootstrap --nodes <first-controlplane-ip>
# Apply machine configuration
talosctl apply-config --nodes <IP> --file <config.yaml>
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
# Reset node (wipe and reboot)
talosctl reset --nodes <IP>
talosctl reset --nodes <IP> --graceful=false --reboot
# Reboot node
talosctl reboot --nodes <IP>
# Shutdown node
talosctl shutdown --nodes <IP>
```
### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml --mode reboot
# Edit machine config interactively
talosctl -n <IP> edit mc --mode staged
```
## System Information and Monitoring
### Node Status and Health
```bash
# Cluster member information
talosctl get members
talosctl get affiliates
talosctl get identities
# Node health check
talosctl -n <IP> health
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
# System information
talosctl -n <IP> version
talosctl -n <IP> get machineconfig
talosctl -n <IP> get machinetype
```
### Resource Monitoring
```bash
# CPU and memory usage
talosctl -n <IP> dashboard # interactive CPU/memory view
talosctl -n <IP> memory
# Disk usage and information
talosctl -n <IP> disks
talosctl -n <IP> mounts
# Network interfaces
talosctl -n <IP> get links
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
### Service Management
```bash
# List all services
talosctl -n <IP> services
# Check specific service status
talosctl -n <IP> service kubelet
talosctl -n <IP> service containerd
talosctl -n <IP> service etcd
# Restart service
talosctl -n <IP> service kubelet restart
# Start/stop service
talosctl -n <IP> service <service-name> start
talosctl -n <IP> service <service-name> stop
```
## Logging and Diagnostics
### Log Retrieval
```bash
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
talosctl -n <IP> logs etcd
talosctl -n <IP> logs machined
# Follow logs
talosctl -n <IP> logs kubelet -f
```
### System Events
```bash
# Monitor system events
talosctl -n <IP> events
talosctl -n <IP> events --tail-events 50
# Filter events
talosctl -n <IP> events --tail-duration 1h
talosctl -n <IP> events | grep -i error
```
## File System and Container Operations
### File Operations
```bash
# List files/directories
talosctl -n <IP> list /var/log
talosctl -n <IP> list /etc/kubernetes
# Copy files from node to local machine (copy is one-way, node to local)
talosctl -n <IP> copy /var/log ./log
talosctl -n <IP> cp /var/log/containers/app.log ./app.log
# Read file contents
talosctl -n <IP> read /etc/resolv.conf
talosctl -n <IP> read /var/log/audit/audit.log
```
### Container Operations
```bash
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Container logs
talosctl -n <IP> logs --kubernetes <container-name>
# talosctl has no exec command (no shell access by design); use kubectl for pods
kubectl exec -it <pod-name> -- <command>
```
## Kubernetes Integration
### Kubernetes Cluster Operations
```bash
# Get kubeconfig
talosctl kubeconfig
talosctl kubeconfig --nodes <controlplane-ip>
talosctl kubeconfig --force --nodes <controlplane-ip>
# Bootstrap manifests
talosctl -n <IP> get manifests
talosctl -n <IP> get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml
# Upgrade Kubernetes
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
```
### Resource Inspection
```bash
# Control plane component configs
talosctl -n <IP> get apiserverconfig -o yaml
talosctl -n <IP> get controllermanagerconfig -o yaml
talosctl -n <IP> get schedulerconfig -o yaml
# etcd configuration
talosctl -n <IP> get etcdconfig -o yaml
```
## etcd Management
### etcd Operations
```bash
# etcd cluster status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# etcd members
talosctl -n <IP> etcd members
# etcd snapshots
talosctl -n <IP> etcd snapshot db.snapshot
# etcd maintenance
talosctl -n <IP> etcd defrag
talosctl -n <IP> etcd alarm list
talosctl -n <IP> etcd alarm disarm
# Leadership management
talosctl -n <IP> etcd forfeit-leadership
```
### Disaster Recovery
```bash
# Bootstrap from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot --recover-skip-hash-check
```
## Upgrade and Maintenance
### OS Upgrades
```bash
# Upgrade Talos OS
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait
talosctl upgrade --nodes <IP> --image <image> --wait --debug
# Rollback
talosctl rollback --nodes <IP>
```
## Resource System Commands
### Resource Management
```bash
# List resource types
talosctl get rd
# Get specific resources
talosctl get <resource-type>
talosctl get <resource-type> -o yaml
talosctl get <resource-type> --namespace=<namespace>
# Watch resources
talosctl get <resource-type> --watch
# Common resource types
talosctl get machineconfig
talosctl get members
talosctl get services
talosctl get networkconfig
talosctl get secrets
```
## Local Development
### Local Cluster Management
```bash
# Create local cluster
talosctl cluster create
talosctl cluster create --controlplanes 3 --workers 2
# Destroy local cluster
talosctl cluster destroy
# Show local cluster status
talosctl cluster show
```
## Advanced Usage Patterns
### Multi-Node Operations
```bash
# Run command on multiple nodes
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
# Different endpoint and target nodes
talosctl -e <public-endpoint> -n <internal-node1>,<internal-node2> <command>
```
### Output Formatting
```bash
# JSON output
talosctl -n <IP> get members -o json
# YAML output
talosctl -n <IP> get machineconfig -o yaml
# Table output (default)
talosctl -n <IP> get members -o table
# Extract specific fields with jq
talosctl -n <IP> get members -o json | jq -r '.spec.hostname'
```
### Filtering and Selection
```bash
# Filter resources with grep
talosctl get members | grep <hostname>
talosctl get services | grep kubelet
# Namespace filtering
talosctl get secrets --namespace=secrets
talosctl get affiliates --namespace=cluster-raw
```
## Common Command Workflows
### Initial Cluster Setup
```bash
# 1. Generate configurations
talosctl gen config cluster-name https://cluster-endpoint:6443
# 2. Apply to nodes
talosctl apply-config --nodes <controlplane-1> --file controlplane.yaml
talosctl apply-config --nodes <worker-1> --file worker.yaml
# 3. Bootstrap cluster
talosctl bootstrap --nodes <controlplane-1>
# 4. Get kubeconfig
talosctl kubeconfig --nodes <controlplane-1>
```
### Cluster Health Check
```bash
# Check all aspects of cluster health
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP1>,<IP2>,<IP3> service kubelet
kubectl get nodes
kubectl get pods --all-namespaces
```
### Node Troubleshooting
```bash
# System diagnostics
talosctl -n <IP> dmesg | tail -100
talosctl -n <IP> services | grep -v Running
talosctl -n <IP> logs kubelet | tail -50
talosctl -n <IP> events --tail-duration 1h
# Resource usage
talosctl -n <IP> memory
talosctl -n <IP> df
talosctl -n <IP> processes | head -20
```
This CLI reference provides the essential commands and patterns needed for day-to-day Talos cluster administration and troubleshooting.

# Talos Cluster Operations Guide
This guide covers essential cluster operations for Talos Linux v1.11 administrators.
## Upgrading Operations
### Talos OS Upgrades
Talos uses an A-B image scheme: each upgrade retains the previous kernel and OS image, so a failed upgrade can roll back to the known-good version.
#### Upgrade Process
```bash
# Upgrade a single node
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl dmesg -f
talosctl upgrade --wait --debug
```
#### Upgrade Sequence
1. Node cordons itself in Kubernetes
2. Node drains existing workloads
3. Internal processes shut down
4. Filesystems unmount
5. Disk verification and image upgrade
6. Bootloader set to boot once with new image
7. Node reboots
8. Node rejoins cluster and uncordons
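Because each node runs this full sequence, multi-node upgrades should proceed strictly one node at a time. As a hedged sketch (the node IPs and image tag are placeholders), printing the commands first lets you review the rolling plan before anything touches the cluster:

```shell
# Hypothetical rolling-upgrade plan: emit one upgrade command per node so the
# plan can be reviewed (or piped to sh) before execution. Node IPs and the
# image tag are assumptions - substitute your own.
IMAGE="ghcr.io/siderolabs/installer:v1.11.2"
NODES="10.0.0.2 10.0.0.3 10.0.0.4"

for node in $NODES; do
  # --wait blocks until the node reboots and reports healthy, which keeps the
  # upgrade strictly serial
  echo "talosctl upgrade --nodes $node --image $IMAGE --wait"
done
```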
#### Rollback
```bash
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrades
Kubernetes upgrades are handled separately from OS upgrades and are non-disruptive to running workloads.
#### Automated Upgrade (Recommended)
```bash
# Check what will be upgraded
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
# Perform upgrade
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
```
#### Manual Component Upgrades
For manual control, patch each component individually:
**API Server:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'
```
**Controller Manager:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'
```
**Scheduler:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'
```
**Kubelet:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
```
## Node Management
### Adding Control Plane Nodes
1. Apply machine configuration to new node
2. Node automatically joins etcd cluster via control plane endpoint
3. Control plane components start automatically
### Removing Control Plane Nodes
```bash
# Recommended approach - reset then delete
talosctl -n <IP.of.node.to.remove> reset
kubectl delete node <node-name>
```
### Adding Worker Nodes
1. Apply worker machine configuration
2. Node automatically joins via bootstrap token
### Removing Worker Nodes
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
talosctl -n <IP> reset
```
## Configuration Management
### Applying Configuration Changes
```bash
# Apply config with automatic mode detection
talosctl apply-config --nodes <IP> --file <config.yaml>
# Apply with specific modes
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged
# Dry run to preview changes
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
```
### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml
```
### Retrieving Current Configuration
```bash
# Get machine configuration
talosctl -n <IP> get mc v1alpha1 -o yaml
# Get effective configuration
talosctl -n <IP> get machineconfig -o yaml
```
## Cluster Health Monitoring
### Node Status
```bash
# Check node status
talosctl -n <IP> get members
talosctl -n <IP> health
# Check system services
talosctl -n <IP> services
talosctl -n <IP> service <service-name>
```
### Resource Monitoring
```bash
# System resources
talosctl -n <IP> memory
talosctl -n <IP> cpu
talosctl -n <IP> disks
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
```
### Log Monitoring
```bash
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
# Service logs
talosctl -n <IP> logs <service-name>
talosctl -n <IP> logs kubelet
```
## Control Plane Best Practices
### Cluster Sizing Recommendations
- **3 nodes**: Sufficient for most use cases, tolerates 1 node failure
- **5 nodes**: Better availability (tolerates 2 node failures), higher resource cost
- **Avoid even numbers**: 2 or 4 nodes tolerate no more failures than 1 or 3, while adding more machines that can fail
### Node Replacement Strategy
- **Failed node**: Remove first, then add replacement
- **Healthy node**: Add replacement first, then remove old node
### Performance Considerations
- etcd performance decreases as cluster scales
- 5-node cluster commits ~5% fewer writes than 3-node cluster
- Vertically scale nodes for performance, don't add more nodes
## Machine Configuration Versioning
### Reproducible Configuration Workflow
Store only:
- `secrets.yaml` (generated once at cluster creation)
- Patch files (YAML/JSON patches describing differences from defaults)
Generate configs when needed:
```bash
# Generate fresh configs with existing secrets
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml
# Apply patches to generated configs
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml
```
This prevents configuration drift after automated upgrades.
## Troubleshooting Common Issues
### Upgrade Failures
- **Invalid installer image**: Check image reference and network connectivity
- **Filesystem unmount failure**: Use `--stage` flag
- **Boot failure**: System automatically rolls back to previous version
- **Workload issues**: Use `talosctl rollback` to revert
### Node Join Issues
- Verify network connectivity to control plane endpoint
- Check discovery service configuration
- Validate machine configuration syntax
- Ensure bootstrap process completed on initial control plane node
### Control Plane Quorum Loss
- Identify healthy nodes with `talosctl etcd status`
- Follow disaster recovery procedures if quorum cannot be restored
- Use etcd snapshots for cluster recovery
## Security Considerations
### Certificate Rotation
Talos automatically rotates certificates, but monitor expiration:
```bash
talosctl -n <IP> get secrets
```
### Pod Security
Control plane nodes are tainted by default to prevent workload scheduling. This protects:
- Control plane from resource starvation
- Credentials from workload exposure
### Network Security
- All API communication uses mutual TLS (mTLS)
- Discovery service data is encrypted before transmission
- WireGuard (KubeSpan) provides mesh networking security

# Discovery and Networking Guide
This guide covers Talos cluster discovery mechanisms, network configuration, and connectivity troubleshooting.
## Cluster Discovery System
Talos includes built-in node discovery that allows cluster members to find each other and maintain membership information.
### Discovery Registries
#### Service Registry (Default)
- **External Service**: Uses public discovery service at `https://discovery.talos.dev/`
- **Encryption**: All data encrypted with AES-GCM before transmission
- **Functionality**: Works without dependency on etcd/Kubernetes
- **Advantages**: Available even when control plane is down
#### Kubernetes Registry (Deprecated)
- **Data Source**: Uses Kubernetes Node resources and annotations
- **Limitation**: Incompatible with Kubernetes 1.32+ due to AuthorizeNodeWithSelectors
- **Status**: Disabled by default, deprecated
### Discovery Configuration
```yaml
cluster:
  discovery:
    enabled: true
    registries:
      service:
        disabled: false # Default
      kubernetes:
        disabled: true # Deprecated, disabled by default
```
**To disable service registry**:
```yaml
cluster:
  discovery:
    enabled: true
    registries:
      service:
        disabled: true
```
## Discovery Data Flow
### Service Registry Process
1. **Data Encryption**: Node encrypts affiliate data with cluster key
2. **Endpoint Encryption**: Endpoints separately encrypted for deduplication
3. **Data Submission**: Node submits own data + observed peer endpoints
4. **Server Processing**: Discovery service aggregates and deduplicates data
5. **Data Distribution**: Encrypted updates sent to all cluster members
6. **Local Processing**: Nodes decrypt data for cluster discovery and KubeSpan
### Data Protection
- **Cluster Isolation**: Cluster ID used as key selector
- **End-to-End Encryption**: Discovery service cannot decrypt affiliate data
- **Memory-Only Storage**: Data stored in memory with encrypted snapshots
- **No Sensitive Exposure**: Service only sees encrypted blobs and cluster metadata
## Discovery Resources
### Node Identity
```bash
# View node's unique identity
talosctl get identities -o yaml
```
**Output**:
```yaml
spec:
  nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd
```
**Identity Characteristics**:
- Base62 encoded random 32 bytes
- URL-safe encoding
- Preserved in STATE partition (`node-identity.yaml`)
- Survives reboots and upgrades
- Regenerated on reset/wipe
### Affiliates (Proposed Members)
```bash
# View discovered affiliates (proposed cluster members)
talosctl get affiliates
```
**Output**:
```
ID VERSION HOSTNAME MACHINE TYPE ADDRESSES
2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 2 talos-default-controlplane-2 controlplane ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
```
### Members (Approved Members)
```bash
# View cluster members
talosctl get members
```
**Output**:
```
ID VERSION HOSTNAME MACHINE TYPE OS ADDRESSES
talos-default-controlplane-1 2 talos-default-controlplane-1 controlplane Talos (v1.11.0) ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]
```
### Raw Registry Data
```bash
# View data from specific registries
talosctl get affiliates --namespace=cluster-raw
```
**Output shows registry sources**:
```
ID VERSION HOSTNAME
k8s/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 3 talos-default-controlplane-2
service/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 23 talos-default-controlplane-2
```
## Network Architecture
### Network Layers
#### Host Networking
- **Node-to-Node**: Direct IP connectivity between cluster nodes
- **Control Plane**: API server communication via control plane endpoint
- **Discovery**: HTTPS connection to discovery service (port 443)
#### Container Networking
- **CNI**: Container Network Interface for pod networking
- **Service Mesh**: Optional service mesh implementations
- **Network Policies**: Kubernetes network policy enforcement
#### Optional: KubeSpan (WireGuard Mesh)
- **Mesh Networking**: Full mesh WireGuard connections
- **Discovery Integration**: Uses discovery service for peer coordination
- **Encryption**: WireGuard public keys distributed via discovery
- **Use Cases**: Multi-cloud, hybrid, NAT traversal
### Network Configuration Patterns
#### Basic Network Setup
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
```
#### Static IP Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
        mtu: 1500
    nameservers:
      - 8.8.8.8
      - 1.1.1.1
```
#### Multiple Interface Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0 # Management interface
        dhcp: true
      - interface: eth1 # Kubernetes traffic
        addresses:
          - 10.0.1.100/24
        routes:
          - network: 10.0.0.0/16
            gateway: 10.0.1.1
```
## KubeSpan Configuration
### Basic KubeSpan Setup
```yaml
machine:
  network:
    kubespan:
      enabled: true
```
### Advanced KubeSpan Configuration
```yaml
machine:
  network:
    kubespan:
      enabled: true
      advertiseKubernetesNetworks: true
      allowDownPeerBypass: true
      mtu: 1420 # Account for WireGuard overhead
      filters:
        endpoints:
          - 0.0.0.0/0 # Allow all endpoints
```
**KubeSpan Features**:
- Automatic peer discovery via discovery service
- NAT traversal capabilities
- Encrypted mesh networking
- Kubernetes network advertisement
- Fault tolerance with peer bypass
## Network Troubleshooting
### Discovery Issues
#### Check Discovery Service Connectivity
```bash
# Test connectivity to discovery service
talosctl get affiliates
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Monitor discovery events
talosctl events --tail
```
#### Common Discovery Problems
1. **No Affiliates Discovered**:
- Check discovery service connectivity
- Verify cluster ID matches across nodes
- Confirm discovery is enabled
2. **Partial Affiliate List**:
- Network connectivity issues between nodes
- Discovery service regional availability
- Firewall blocking discovery traffic
3. **Discovery Service Unreachable**:
- Network connectivity to discovery.talos.dev:443
- Corporate firewall/proxy configuration
- DNS resolution issues
### Network Connectivity Testing
#### Basic Network Tests
```bash
# Test network interfaces
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
# Check network configuration
talosctl get networkconfig -o yaml
# Test connectivity
talosctl -n <IP> ping <target-ip>
```
#### Inter-Node Connectivity
```bash
# Test control plane endpoint
talosctl health --control-plane-nodes <IP1>,<IP2>,<IP3>
# Check etcd connectivity
talosctl -n <IP> etcd members
# Test Kubernetes API
kubectl get nodes
```
#### KubeSpan Troubleshooting
```bash
# Check KubeSpan status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Monitor WireGuard connections
talosctl -n <IP> interfaces
# Check KubeSpan logs
talosctl -n <IP> logs controller-runtime | grep kubespan
```
### Network Performance Optimization
#### Network Interface Tuning
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        mtu: 9000 # Jumbo frames if supported
        dhcp: true
```
#### KubeSpan Performance
- Adjust MTU for WireGuard overhead (typically -80 bytes)
- Consider endpoint filters for large clusters
- Monitor WireGuard peer connection stability
## Security Considerations
### Discovery Security
- **Encrypted Communication**: All discovery data encrypted end-to-end
- **Cluster Isolation**: Cluster ID prevents cross-cluster data access
- **No Sensitive Data**: Only encrypted metadata transmitted
- **Network Security**: HTTPS transport with certificate validation
### Network Security
- **mTLS**: All Talos API communication uses mutual TLS
- **Certificate Rotation**: Automatic certificate lifecycle management
- **Network Policies**: Implement Kubernetes network policies for workloads
- **Firewall Rules**: Restrict network access to necessary ports only
### Required Network Ports
- **6443**: Kubernetes API server
- **2379-2380**: etcd client/peer communication
- **10250**: kubelet API
- **50000**: Talos API (apid)
- **443**: Discovery service (outbound)
- **51820**: KubeSpan WireGuard (if enabled)
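A quick pre-flight sketch for verifying these ports are reachable from an administrative workstation; the node IPs in the commented sweep are placeholders, and the check assumes bash's `/dev/tcp` redirection and the coreutils `timeout` command are available:

```shell
# Hypothetical reachability check using bash's built-in /dev/tcp; returns 0
# when a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example sweep (node IPs are assumptions):
# for node in 10.0.0.2 10.0.0.3; do
#   for port in 6443 2379 2380 10250 50000; do
#     check_port "$node" "$port" && echo "$node:$port open" || echo "$node:$port closed"
#   done
# done
```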
## Operational Best Practices
### Monitoring
- Monitor discovery service connectivity
- Track cluster member changes
- Alert on network partitions
- Monitor KubeSpan peer status
### Backup and Recovery
- Document network configuration
- Backup discovery service configuration
- Test network recovery procedures
- Plan for discovery service outages
### Scaling Considerations
- Discovery service scales to thousands of nodes
- KubeSpan mesh scales to hundreds of nodes efficiently
- Consider network segmentation for large clusters
- Plan for multi-region deployments
This networking foundation enables Talos clusters to maintain connectivity and membership across various network topologies while providing security and performance optimization options.

# etcd Management and Disaster Recovery Guide
This guide covers etcd database operations, maintenance, and disaster recovery procedures for Talos Linux clusters.
## etcd Health Monitoring
### Basic Health Checks
```bash
# Check etcd status across all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Check etcd alarms
talosctl -n <IP> etcd alarm list
# Check etcd members
talosctl -n <IP> etcd members
# Check service status
talosctl -n <IP> service etcd
```
### Understanding etcd Status Output
```
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
172.20.0.2 a49c021e76e707db 17 MB 4.5 MB (26.10%) ecebb05b59a776f1 53391 4 53391 false
```
**Key Metrics**:
- **DB SIZE**: Total database size on disk
- **IN USE**: Actual data size (fragmentation = DB SIZE - IN USE)
- **LEADER**: Current etcd cluster leader
- **RAFT INDEX**: Consensus log position
- **LEARNER**: Whether node is still joining cluster
## Space Quota Management
### Default Configuration
- Default space quota: 2 GiB
- Recommended maximum: 8 GiB
- When the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes
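The `quota-backend-bytes` setting takes a raw byte count, so shell arithmetic makes the GiB conversion explicit, e.g. for 4 GiB:

```shell
# 4 GiB expressed in bytes, suitable for quota-backend-bytes
echo $((4 * 1024 * 1024 * 1024))   # prints 4294967296
```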
### Quota Exceeded Handling
**Symptoms**:
```bash
talosctl -n <IP> etcd alarm list
# Output: ALARM: NOSPACE
```
**Resolution**:
1. Increase quota in machine configuration:
```yaml
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: 4294967296 # 4 GiB
```
2. Apply configuration and reboot:
```bash
talosctl -n <IP> apply-config --file updated-config.yaml --mode reboot
```
3. Clear the alarm:
```bash
talosctl -n <IP> etcd alarm disarm
```
## Database Defragmentation
### When to Defragment
- In use/DB size ratio < 0.5 (heavily fragmented)
- Database size exceeds quota but actual data is small
- Performance degradation due to fragmentation
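The ratio check can be scripted against `talosctl etcd status` output. A hedged helper, assuming the in-use percentage is printed in parentheses as in the sample output above:

```shell
# Flags rows whose "(NN.NN%)" in-use figure is below a threshold (default 50%).
# Usage (hypothetical): talosctl -n <IP1>,<IP2>,<IP3> etcd status | frag_check 50
frag_check() {
  awk -v limit="${1:-50}" 'NR > 1 {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^\([0-9.]+%\)$/) {
        # Strip the surrounding parentheses and the % sign
        pct = substr($i, 2, length($i) - 3)
        if (pct + 0 < limit)
          print $1 " is only " pct "% in use - consider defragmenting"
      }
  }'
}
```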
### Defragmentation Process
```bash
# Check fragmentation status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Defragment single node (resource-intensive operation)
talosctl -n <IP1> etcd defrag
# Verify defragmentation results
talosctl -n <IP1> etcd status
```
**Important Notes**:
- Defragment one node at a time
- Operation blocks reads/writes during execution
- Can significantly improve performance if heavily fragmented
### Post-Defragmentation Verification
After successful defragmentation, DB size should closely match IN USE size:
```
NODE MEMBER DB SIZE IN USE
172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%)
```
## Backup Operations
### Regular Snapshots
```bash
# Create consistent snapshot
talosctl -n <IP> etcd snapshot db.snapshot
```
**Output Example**:
```
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
```
### Disaster Snapshots
When etcd cluster is unhealthy and normal snapshot fails:
```bash
# Copy database directly (may be inconsistent)
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
```
### Automated Backup Strategy
- Schedule regular snapshots (daily/hourly based on change frequency)
- Store snapshots in multiple locations
- Test restore procedures regularly
- Document recovery procedures
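The points above can be sketched as a small cron-driven script; the node IP, backup directory, and retention count are all assumptions, and the actual invocation is left commented out:

```shell
# Hypothetical snapshot-rotation sketch. Defines functions only; nothing runs
# against a cluster on load.
NODE="${NODE:-10.0.0.2}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
KEEP="${KEEP:-7}"

snapshot_name() {
  # Timestamped so snapshots sort chronologically by name
  printf 'etcd-%s.snapshot' "$(date +%Y%m%d-%H%M%S)"
}

take_snapshot() {
  mkdir -p "$BACKUP_DIR"
  talosctl -n "$NODE" etcd snapshot "$BACKUP_DIR/$(snapshot_name)"
  # Keep only the $KEEP most recent snapshots
  ls -1t "$BACKUP_DIR"/etcd-*.snapshot 2>/dev/null | tail -n +$((KEEP + 1)) | xargs -r rm --
}

# Hourly cron entry (assumption):
# 0 * * * * NODE=10.0.0.2 /usr/local/bin/etcd-backup.sh
# take_snapshot
```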
## Disaster Recovery
### Pre-Recovery Assessment
**Check if Recovery is Necessary**:
```bash
# Query etcd health on all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> service etcd
# Check member list consistency
talosctl -n <IP1> etcd members
talosctl -n <IP2> etcd members
talosctl -n <IP3> etcd members
```
**Recovery is needed when**:
- Quorum is lost (majority of nodes down)
- etcd data corruption
- Complete cluster failure
### Recovery Prerequisites
1. **Latest etcd snapshot** (preferably consistent)
2. **Machine configuration backup**:
```bash
talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
```
3. **No init-type nodes** (deprecated, incompatible with recovery)
### Recovery Procedure
#### Step 1: Prepare Control Plane Nodes
```bash
# If nodes have hardware issues, replace them with same configuration
# If nodes are running but etcd is corrupted, wipe EPHEMERAL partition:
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
```
#### Step 2: Verify etcd State
All etcd services should be in "Preparing" state:
```bash
talosctl -n <IP> service etcd
# Expected: STATE: Preparing
```
#### Step 3: Bootstrap from Snapshot
```bash
# Bootstrap cluster from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
# For direct database copies, skip hash check:
talosctl -n <IP> bootstrap --recover-from=./db --recover-skip-hash-check
```
#### Step 4: Verify Recovery
**Monitor kernel logs** for recovery progress:
```bash
talosctl -n <IP> dmesg -f
```
**Expected log entries**:
```
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot"}
```
**Verify cluster health**:
```bash
# etcd should become healthy on bootstrap node
talosctl -n <IP> service etcd
# Kubernetes control plane should start
kubectl get nodes
# Other control plane nodes should join automatically
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
## etcd Version Management
### Downgrade Process (v3.6 to v3.5)
**Prerequisites**:
- Healthy cluster running v3.6.x
- Recent backup snapshot
- Downgrade only one minor version at a time
#### Step 1: Validate Downgrade
```bash
talosctl -n <IP1> etcd downgrade validate 3.5
```
#### Step 2: Enable Downgrade
```bash
talosctl -n <IP1> etcd downgrade enable 3.5
```
#### Step 3: Verify Schema Migration
```bash
# Check storage version migrated to 3.5
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Verify STORAGE column shows 3.5.0
```
#### Step 4: Patch Machine Configuration
```bash
# Transfer leadership if node is leader
talosctl -n <IP1> etcd forfeit-leadership
# Create patch file
cat > etcd-patch.yaml <<EOF
cluster:
  etcd:
    image: gcr.io/etcd-development/etcd:v3.5.22
EOF
# Apply patch with reboot
talosctl -n <IP1> patch machineconfig --patch @etcd-patch.yaml --mode reboot
```
#### Step 5: Repeat for All Control Plane Nodes
Continue patching remaining control plane nodes one by one.
## Operational Best Practices
### Monitoring
- Monitor database size and fragmentation regularly
- Set up alerts for space quota approaching limits
- Track etcd performance metrics (request latency, leader changes)
- Monitor disk I/O and network latency
### Maintenance Windows
- Schedule defragmentation during low-traffic periods
- Coordinate with application teams for maintenance windows
- Test backup/restore procedures in non-production environments
### Performance Optimization
- Use fast storage (NVMe SSDs preferred)
- Minimize network latency between control plane nodes
- Monitor and tune etcd configuration based on workload
### Security
- Encrypt etcd data at rest
- Secure backup storage with appropriate access controls
- Regularly rotate certificates
- Monitor for unauthorized access attempts
## Troubleshooting Common Issues
### Split Brain Prevention
- Ensure odd number of control plane nodes
- Monitor network connectivity between nodes
- Use dedicated network for control plane communication when possible
### Performance Issues
- Check disk I/O latency
- Monitor memory usage
- Consider vertical scaling before adding nodes
- Review etcd request patterns and optimize applications
### Backup/Restore Issues
- Test restore procedures regularly
- Verify backup integrity
- Ensure consistent network and storage configuration
- Document and practice disaster recovery procedures

# Talos Troubleshooting Guide
This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
## General Troubleshooting Methodology
### 1. Gather Information
```bash
# Node status and health
talosctl -n <IP> health
talosctl -n <IP> version
talosctl -n <IP> get members
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
talosctl -n <IP> processes | head -20
# Service status
talosctl -n <IP> services
```
### 2. Check Logs
```bash
# Kernel logs (system-level issues)
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs machined
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
# System events
talosctl -n <IP> events --since=1h
```
### 3. Network Connectivity
```bash
# Discovery and membership
talosctl get affiliates
talosctl get members
# Network interfaces
talosctl -n <IP> interfaces
talosctl -n <IP> get addresses
# Control plane connectivity
kubectl get nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
## Bootstrap and Initial Setup Issues
### Cluster Bootstrap Failures
**Symptoms**: Bootstrap command fails or times out
**Diagnosis**:
```bash
# Check etcd service state
talosctl -n <IP> service etcd
# Check if node is trying to join instead of bootstrap
talosctl -n <IP> logs etcd | grep -i bootstrap
# Verify machine configuration
talosctl -n <IP> get machineconfig -o yaml
```
**Common Causes & Solutions**:
1. **Wrong node type**: Ensure using `controlplane`, not deprecated `init`
2. **Network issues**: Verify control plane endpoint connectivity
3. **Configuration errors**: Check machine configuration validity
4. **Previous bootstrap**: etcd data exists from previous attempts
**Resolution**:
```bash
# Reset node if previous bootstrap data exists
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
# Re-apply configuration and bootstrap
talosctl apply-config --nodes <IP> --file controlplane.yaml
talosctl bootstrap --nodes <IP>
```
### Node Join Issues
**Symptoms**: New nodes don't join cluster
**Diagnosis**:
```bash
# Check discovery
talosctl get affiliates
talosctl get members
# Check bootstrap token
kubectl get secrets -n kube-system | grep bootstrap-token
# Check kubelet logs
talosctl -n <IP> logs kubelet | grep -i certificate
```
**Common Solutions**:
```bash
# Verify the join token (Talos does not use kubeadm) - the worker config must
# be generated from the same secrets bundle as the cluster
talosctl -n <IP> get machineconfig -o yaml | grep 'token:'
# Verify discovery service connectivity
talosctl -n <IP> get affiliates --namespace=cluster-raw
# Check machine configuration matches cluster
talosctl -n <IP> get machineconfig -o yaml
```
## Control Plane Issues
### etcd Problems
**etcd Won't Start**:
```bash
# Check etcd service status and logs
talosctl -n <IP> service etcd
talosctl -n <IP> logs etcd
# Check etcd data directory
talosctl -n <IP> list /var/lib/etcd
# Check disk space and permissions
talosctl -n <IP> df
```
**etcd Quorum Loss**:
```bash
# Check member status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP> etcd members
# Identify healthy members
for ip in IP1 IP2 IP3; do
echo "=== Node $ip ==="
talosctl -n $ip service etcd
done
```
**Solution for Quorum Loss**:
1. If majority available: Remove failed members, add replacements
2. If majority lost: Follow disaster recovery procedure
### API Server Issues
**API Server Not Responding**:
```bash
# Check API server pod status
kubectl get pods -n kube-system | grep apiserver
# Check API server configuration
talosctl -n <IP> get apiserverconfig -o yaml
# Check control plane endpoint
curl -k https://<control-plane-endpoint>:6443/healthz
```
**Common Solutions**:
```bash
# Restart kubelet to reload static pods
talosctl -n <IP> service kubelet restart
# Check for configuration issues
talosctl -n <IP> logs kubelet | grep apiserver
# Verify etcd connectivity
talosctl -n <IP> etcd status
```
## Node-Level Issues
### Kubelet Problems
**Kubelet Service Issues**:
```bash
# Check kubelet status and logs
talosctl -n <IP> service kubelet
talosctl -n <IP> logs kubelet | tail -50
# Check kubelet configuration
talosctl -n <IP> get kubeletconfig -o yaml
# Check container runtime
talosctl -n <IP> service containerd
```
**Common Kubelet Issues**:
1. **Certificate problems**: Check certificate expiration and rotation
2. **Container runtime issues**: Verify containerd health
3. **Resource constraints**: Check memory and disk space
4. **Network connectivity**: Verify API server connectivity
### Container Runtime Issues
**Containerd Problems**:
```bash
# Check containerd service
talosctl -n <IP> service containerd
talosctl -n <IP> logs containerd
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Check containerd configuration
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
```
**Common Solutions**:
```bash
# Restart containerd
talosctl -n <IP> service containerd restart
# Check disk space for container images
talosctl -n <IP> df
# Clean up unused containers/images
# (This happens automatically via kubelet GC)
```
## Network Issues
### Network Connectivity Problems
**Node-to-Node Connectivity**:
```bash
# Test basic network connectivity
talosctl -n <IP1> interfaces
talosctl -n <IP1> get routes
# Test specific connectivity
talosctl -n <IP1> read /etc/resolv.conf
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
```
**DNS Resolution Issues**:
```bash
# Check DNS configuration
talosctl -n <IP> read /etc/resolv.conf
# Test DNS resolution from a throwaway pod
kubectl run -it --rm dnstest --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
```
### Discovery Service Issues
**Discovery Not Working**:
```bash
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Check affiliate discovery
talosctl get affiliates
talosctl get affiliates --namespace=cluster-raw
# Test discovery service connectivity
curl -v https://discovery.talos.dev/
```
**KubeSpan Issues** (if enabled):
```bash
# Check KubeSpan configuration
talosctl get kubespanconfig -o yaml
# Check peer status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Check WireGuard interface
talosctl -n <IP> interfaces | grep kubespan
```
## Upgrade Issues
### OS Upgrade Problems
**Upgrade Fails or Hangs**:
```bash
# Check upgrade status
talosctl -n <IP> dmesg | grep -i upgrade
talosctl -n <IP> events | grep -i upgrade
# Use staged upgrade for filesystem lock issues
talosctl upgrade --nodes <IP> --image <image> --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait --debug
```
**Boot Issues After Upgrade**:
```bash
# Check boot logs
talosctl -n <IP> dmesg | head -100
# System automatically rolls back on boot failure
# Check current version
talosctl -n <IP> version
# Manual rollback if needed
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrade Issues
**K8s Upgrade Failures**:
```bash
# Check upgrade status
talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
# Check individual component status
kubectl get pods -n kube-system
talosctl -n <IP> get apiserverconfig -o yaml
```
**Version Mismatch Issues**:
```bash
# Check version consistency
kubectl get nodes -o wide
talosctl -n <IP1>,<IP2>,<IP3> version
# Check component versions
kubectl get pods -n kube-system -o wide
```
## Resource and Performance Issues
### Memory and Storage Problems
**Out of Memory**:
```bash
# Check memory usage
talosctl -n <IP> memory
talosctl -n <IP> processes --sort-by=memory | head -20
# Check for memory pressure
kubectl describe node <node-name> | grep -A 10 Conditions
# Check OOM events
talosctl -n <IP> dmesg | grep -i "out of memory"
```
**Disk Space Issues**:
```bash
# Check disk usage
talosctl -n <IP> df
talosctl -n <IP> disks
# Check specific directories
talosctl -n <IP> list /var/lib/containerd
talosctl -n <IP> list /var/lib/etcd
# Clean up if needed (automatic GC usually handles this)
kubectl describe node <node-name> | grep -A 5 "Disk Pressure"
```
### Performance Issues
**Slow Cluster Response**:
```bash
# Check API server response time
time kubectl get nodes
# Check etcd performance
talosctl -n <IP> etcd status
# Look for high DB size vs IN USE ratio (fragmentation)
# Check system load
talosctl -n <IP> cpu
talosctl -n <IP> memory
```
**High CPU/Memory Usage**:
```bash
# Identify resource-heavy processes
talosctl -n <IP> processes --sort-by=cpu | head -10
talosctl -n <IP> processes --sort-by=memory | head -10
# Check cgroup usage
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
## Configuration Issues
### Machine Configuration Problems
**Invalid Configuration**:
```bash
# Validate configuration before applying
talosctl validate --config machineconfig.yaml --mode metal
# Check current configuration
talosctl -n <IP> get machineconfig -o yaml
# Compare with expected configuration
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
```
**Configuration Drift**:
```bash
# Check configuration version
talosctl -n <IP> get machineconfig
# Re-apply configuration if needed
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
talosctl apply-config --nodes <IP> --file corrected-config.yaml
```
## Emergency Procedures
### Node Unresponsive
**Complete Node Failure**:
1. **Physical access required**: Power cycle or hardware reset
2. **Check hardware**: Memory, disk, network interface status
3. **Boot issues**: May require bootable recovery media
**Partial Connectivity**:
```bash
# Try an alternate endpoint if the node has multiple reachable addresses
talosctl -e <alternate-ip> -n <IP> health
# Check if specific services are running
talosctl -n <IP> service machined
talosctl -n <IP> service apid
```
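If `talosctl` itself hangs, a raw TCP probe of apid's port (50000) helps distinguish a network problem from a stuck service. A sketch using bash's `/dev/tcp` pseudo-device; the loopback target below is only a placeholder:

```shell
# Sketch: probe a TCP port without talosctl. apid listens on 50000.
# The 127.0.0.1 target is a placeholder assumption, not a real node.
probe() {
  # bash opens a TCP connection via /dev/tcp; any failure means unreachable
  if timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}
probe 127.0.0.1 50000
```

An unreachable 50000 with a reachable host points at apid or the firewall rather than the network path.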
### Cluster-Wide Failures
**All Control Plane Nodes Down**:
1. **Assess scope**: Determine if data corruption or hardware failure
2. **Recovery strategy**: Use etcd backup if available
3. **Rebuild process**: May require complete cluster rebuild
**Follow disaster recovery procedures** as documented in etcd-management.md.
### Emergency Reset Procedures
**Single Node Reset**:
```bash
# Default reset (gracefully leaves etcd/cluster membership before wiping)
talosctl -n <IP> reset
# Forced reset (skips the graceful leave; use when the cluster is unhealthy)
talosctl -n <IP> reset --graceful=false --reboot
# Selective wipe (preserve STATE partition)
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
```
**Cluster Reset** (DESTRUCTIVE):
```bash
# Reset all nodes (DANGER: DATA LOSS)
for ip in IP1 IP2 IP3; do
talosctl -n $ip reset --graceful=false --reboot
done
```
## Monitoring and Alerting
### Key Metrics to Monitor
- Node resource usage (CPU, memory, disk)
- etcd health and performance
- Control plane component status
- Network connectivity
- Certificate expiration
- Discovery service connectivity
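Certificate expiration is the easiest of these to script externally once a certificate is exported: `openssl x509 -checkend` answers "does this cert survive the next N seconds". A sketch against a throwaway self-signed certificate, which stands in (as an assumption) for real cluster certificates:

```shell
# Sketch: alert when a certificate is near expiry.
# A 30-day self-signed cert is generated here as an assumed stand-in.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem -out /tmp/cert.pem \
  -days 30 -subj "/CN=expiry-demo" 2>/dev/null
# -checkend exits 0 only if the cert is still valid after the given seconds
if openssl x509 -in /tmp/cert.pem -noout -checkend $((60 * 24 * 3600)); then
  echo "certificate OK for the next 60 days"
else
  echo "certificate expires within 60 days"
fi
```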
### Log Locations for External Monitoring
- Kernel logs: `talosctl dmesg`
- Service logs: `talosctl logs <service>`
- System events: `talosctl events`
- Kubernetes events: `kubectl get events`
This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.