Initial commit.
ai/talos-v1.11/README.md
@@ -0,0 +1,135 @@
# Talos v1.11 Agent Context Documentation

This directory contains documentation extracted from the official Talos v1.11 documentation, organized to help AI agents become expert Talos cluster administrators.

## Documentation Structure

### Core Operations
- **[cluster-operations.md](cluster-operations.md)** - Essential cluster operations including upgrades, node management, and configuration
- **[cli-essentials.md](cli-essentials.md)** - Key talosctl commands and usage patterns for daily administration

### System Understanding
- **[architecture-and-components.md](architecture-and-components.md)** - Deep dive into Talos architecture, components, and design principles
- **[discovery-and-networking.md](discovery-and-networking.md)** - Cluster discovery mechanisms and network configuration

### Specialized Operations
- **[etcd-management.md](etcd-management.md)** - etcd operations, maintenance, backup, and disaster recovery
- **[bare-metal-administration.md](bare-metal-administration.md)** - Bare metal specific configurations, security, and hardware management
- **[troubleshooting-guide.md](troubleshooting-guide.md)** - Systematic approaches to diagnosing and resolving common issues

## Quick Reference

### Essential Commands for New Agents
```bash
# Cluster health check
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>

# Node information
talosctl get members
talosctl -n <IP> version

# Service status
talosctl -n <IP> services
talosctl -n <IP> service kubelet

# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks

# Logs and events
talosctl -n <IP> dmesg | tail -50
talosctl -n <IP> logs kubelet
talosctl -n <IP> events --since=1h
```

### Critical Procedures
- **Bootstrap**: `talosctl bootstrap --nodes <first-controlplane-ip>`
- **Backup etcd**: `talosctl -n <IP> etcd snapshot db.snapshot`
- **Upgrade OS**: `talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x`
- **Upgrade K8s**: `talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1`

### Emergency Commands
- **Node reset**: `talosctl -n <IP> reset`
- **Force reset**: `talosctl -n <IP> reset --graceful=false --reboot`
- **Disaster recovery**: `talosctl -n <IP> bootstrap --recover-from=./db.snapshot`
- **Rollback**: `talosctl rollback --nodes <IP>`

### Bare Metal Specific Commands
- **Check hardware**: `talosctl -n <IP> disks`, `talosctl -n <IP> read /proc/cpuinfo`
- **Network interfaces**: `talosctl -n <IP> get addresses`, `talosctl -n <IP> get routes`
- **Extensions**: `talosctl -n <IP> get extensions`
- **Encryption status**: `talosctl -n <IP> get encryptionconfig -o yaml`
- **Hardware monitoring**: `talosctl -n <IP> dmesg | grep -i error`

## Key Concepts for Agents

### Architecture Fundamentals
- **Immutable OS**: Single image, atomic updates, A-B rollback system
- **API-driven**: All management through gRPC API, no SSH/shell access
- **Controller pattern**: Kubernetes-style resource controllers for system management
- **Minimal attack surface**: Only services necessary for Kubernetes

### Control Plane Design
- **etcd quorum**: A majority of members is required for operations (2 of 3 nodes, 3 of 5 nodes)
- **Bootstrap process**: One-time initialization of etcd cluster
- **HA considerations**: Use an odd number of control plane nodes; an even number adds a member without adding fault tolerance
- **Upgrade strategy**: Rolling upgrades with automatic rollback on failure

### Network and Discovery
- **Service discovery**: Encrypted discovery service for cluster membership
- **KubeSpan**: Optional WireGuard mesh networking
- **mTLS everywhere**: All Talos API communication secured
- **Discovery registries**: Service (default) and Kubernetes (deprecated)

### Bare Metal Considerations
- **META configuration**: Network config embedded in disk images
- **Hardware compatibility**: Driver support and firmware requirements
- **Disk encryption**: LUKS2 with TPM, static keys, or node ID
- **SecureBoot**: UKI images with embedded signatures
- **System extensions**: Hardware-specific drivers and tools
- **Performance tuning**: CPU governors, IOMMU, memory management

## Common Administration Patterns

### Daily Operations
1. Check cluster health across all nodes
2. Monitor resource usage and capacity
3. Review system events and logs
4. Verify etcd health and backup status
5. Monitor discovery service connectivity

### Maintenance Windows
1. Plan upgrade sequence (workers first, then control plane)
2. Create etcd backup before major changes
3. Apply configuration changes with dry-run first
4. Monitor upgrade progress and be ready to rollback
5. Verify cluster functionality after changes
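
The "backup first, dry-run second" steps of a maintenance window can be sketched as a small shell function. This is only an illustration: the function name `pre_change`, the `TALOSCTL` override variable, and the timestamped snapshot file name are assumptions made for this example, while the `etcd snapshot` and `apply-config --dry-run` invocations are the ones listed elsewhere in this documentation.

```bash
#!/usr/bin/env bash
# Sketch only. TALOSCTL can be overridden (for example TALOSCTL=echo)
# to preview the commands instead of running them.
TALOSCTL="${TALOSCTL:-talosctl}"

# pre_change <control-plane-ip> <target-ip> <config-file>
pre_change() {
  local cp_ip="$1" target_ip="$2" config="$3"
  # Hypothetical naming scheme for the snapshot file.
  local snapshot="etcd-$(date +%Y%m%d-%H%M%S).snapshot"

  # Step 2 of the list: back up etcd before touching anything.
  "$TALOSCTL" --nodes "$cp_ip" etcd snapshot "$snapshot" || return 1

  # Step 3 of the list: show what the new config would change, without applying it.
  "$TALOSCTL" apply-config --nodes "$target_ip" --file "$config" --dry-run || return 1
}
```

A call such as `pre_change <controlplane-ip> <IP> config.yaml` stops at the first failing step, so a failed snapshot prevents the dry-run from being mistaken for a green light.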

### Troubleshooting Workflow
1. **Gather information**: Health, version, resources, logs
2. **Check connectivity**: Network, discovery, API endpoints
3. **Examine services**: Status of critical services
4. **Review logs**: System events, service logs, kernel messages
5. **Apply fixes**: Configuration patches, service restarts, node resets
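
The read-only part of this workflow (steps 1 to 4) could be collected into one helper. The sketch below is an assumption about how an agent might script it: `gather_diagnostics` and the `TALOSCTL` override are names invented here, and the selection of one command per step is illustrative. Step 5 is deliberately left out because fixes should be chosen after reading the output.

```bash
#!/usr/bin/env bash
# Sketch only. Set TALOSCTL=echo to print the commands instead of running them.
TALOSCTL="${TALOSCTL:-talosctl}"

# gather_diagnostics <node-ip>: run one read-only command per workflow step.
gather_diagnostics() {
  local node="$1"
  "$TALOSCTL" --nodes "$node" version            # 1. gather information
  "$TALOSCTL" --nodes "$node" get addresses      # 2. check connectivity
  "$TALOSCTL" --nodes "$node" services           # 3. examine services
  "$TALOSCTL" --nodes "$node" dmesg | tail -50   # 4. review logs (last 50 kernel messages)
}
```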

## Best Practices for Agents

### Configuration Management
- Use reproducible configuration workflow (secrets + patches)
- Always dry-run configuration changes first
- Store machine configurations in version control
- Test configuration changes in non-production first

### Operational Safety
- Take etcd snapshots before major changes
- Upgrade one node at a time
- Monitor upgrade progress and have rollback ready
- Test disaster recovery procedures regularly
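
"Upgrade one node at a time" can be expressed as a loop that stops at the first failure. The following is a minimal sketch, not an official procedure: `rolling_upgrade` and the `TALOSCTL` override are hypothetical names, and the sketch assumes that `talosctl upgrade` returns a non-zero exit code when the upgrade of a node fails.

```bash
#!/usr/bin/env bash
# Sketch only. TALOSCTL can be overridden for previews or tests.
TALOSCTL="${TALOSCTL:-talosctl}"

# rolling_upgrade <installer-image> <node-ip>...
rolling_upgrade() {
  local image="$1"; shift
  local node
  for node in "$@"; do
    echo "upgrading $node"
    # Stop at the first failure so a bad image never reaches the remaining nodes.
    "$TALOSCTL" upgrade --nodes "$node" --image "$image" || return 1
    # Confirm the node answers API requests again before moving on.
    "$TALOSCTL" --nodes "$node" version || return 1
  done
}
```

For example, `rolling_upgrade ghcr.io/siderolabs/installer:v1.11.x <IP1> <IP2>` would upgrade `<IP2>` only after `<IP1>` has completed and responded.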

### Performance Optimization
- Monitor etcd fragmentation and defragment when needed
- Scale vertically before horizontally for control plane
- Use appropriate hardware for etcd (fast storage, low network latency)
- Monitor resource usage trends for capacity planning

This documentation provides the essential knowledge needed to administer Talos Linux clusters, organized by operational context and complexity level.
ai/talos-v1.11/architecture-and-components.md
@@ -0,0 +1,248 @@
# Talos Architecture and Components Guide

This guide explains the Talos Linux architecture and system components as background for cluster administration.

## Core Architecture Principles

Talos is designed to be:
- **Atomic**: Distributed as a single, versioned, signed, immutable image
- **Modular**: Composed of separate components with defined gRPC interfaces
- **Minimal**: Focused init system that runs only services necessary for Kubernetes

## File System Architecture

### Partition Layout
- **EFI**: Stores EFI boot data
- **BIOS**: Used for GRUB's second stage boot
- **BOOT**: Contains boot loader, initramfs, and kernel data
- **META**: Stores node metadata (node IDs, etc.)
- **STATE**: Stores machine configuration, node identity, cluster discovery, KubeSpan data
- **EPHEMERAL**: Stores ephemeral state, mounted at `/var`

### Root File System Structure
Three-layer design:
1. **Base Layer**: Read-only squashfs mounted as loop device (immutable base)
2. **Runtime Layer**: tmpfs filesystems for runtime needs (`/dev`, `/proc`, `/run`, `/sys`, `/tmp`, `/system`)
3. **Overlay Layer**: overlayfs for persistent data backed by XFS at `/var`

#### Special Directories
- `/system`: Internal files that need to be writable (recreated each boot)
  - Example: `/system/etc/hosts` bind-mounted over `/etc/hosts`
- `/var`: Owned by Kubernetes, contains persistent data:
  - etcd data (control plane nodes)
  - kubelet data
  - containerd data
  - Survives reboots and upgrades, wiped on reset

## Core Components

### machined (PID 1)
**Role**: Talos replacement for traditional init process

**Functions**:
- Machine configuration management
- API handling
- Resource and controller management
- Service lifecycle management

**Managed Services**:
- containerd
- etcd (control plane nodes)
- kubelet
- networkd
- trustd
- udevd

**Architecture**: Uses controller-runtime pattern similar to Kubernetes controllers

### apid (API Gateway)
**Role**: gRPC API endpoint for all Talos interactions

**Functions**:
- Routes requests to appropriate components
- Provides proxy capabilities for multi-node operations
- Handles authentication and authorization

**Usage Patterns**:
```bash
# Direct node communication
talosctl -e <node-ip> <command>

# Proxy through endpoint to specific nodes
talosctl -e <endpoint> -n <target-nodes> <command>

# Multi-node operations
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
```

### trustd (Trust Management)
**Role**: Establishes and maintains trust within the system

**Functions**:
- Root of Trust implementation
- PKI data distribution for control plane bootstrap
- Certificate management
- Secure file placement operations

### containerd (Container Runtime)
**Role**: Industry-standard container runtime

**Namespaces**:
- `system`: Talos services
- `k8s.io`: Kubernetes services

### udevd (Device Management)
**Role**: Device file manager (eudev implementation)

**Functions**:
- Kernel device notification handling
- Device node management in `/dev`
- Hardware discovery and setup

## Control Plane Architecture

### etcd Cluster Design
**Critical Concepts**:
- **Quorum**: Majority of members must agree on leader
- **Membership**: Formal etcd cluster membership required
- **Consensus**: Uses Raft protocol for distributed consensus

**Quorum Requirements**:
- 3 nodes: Requires 2 for quorum (tolerates 1 failure)
- 5 nodes: Requires 3 for quorum (tolerates 2 failures)
- Even member counts are worse than odd ones (4 nodes require 3 for quorum and still tolerate only 1 failure)
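
The quorum numbers above follow from simple integer arithmetic: quorum is `floor(n / 2) + 1`, and the tolerated failures are whatever is left. The helper names `quorum` and `fault_tolerance` below are invented for this worked example.

```bash
#!/usr/bin/env bash
# quorum <members>: votes needed for a majority.
quorum() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance <members>: members that can fail while quorum is kept.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
```

The 4-member row shows why even sizes are discouraged: the extra member raises the quorum requirement without raising the number of failures the cluster survives.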

### Control Plane Components
**Running as Static Pods on Control Plane Nodes**:

#### kube-apiserver
- Kubernetes API endpoint
- Connects to local etcd instance
- Handles all API operations

#### kube-controller-manager
- Runs control loops
- Manages cluster state reconciliation
- Handles node lifecycle, replication, etc.

#### kube-scheduler
- Pod placement decisions
- Resource-aware scheduling
- Constraint satisfaction

### Bootstrap Process
1. **etcd Bootstrap**: One node chosen as bootstrap node, initializes etcd cluster
2. **Static Pods**: Control plane components start as static pods via kubelet
3. **API Availability**: Control plane endpoint becomes available
4. **Manifest Injection**: Bootstrap manifests (join tokens, RBAC, etc.) injected
5. **Cluster Formation**: Other control plane nodes join etcd cluster
6. **HA Control Plane**: All control plane nodes run full component set
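
From the operator's side, the whole sequence above is triggered by a single bootstrap call, after which the remaining steps happen on their own. A minimal sketch, assuming a hypothetical wrapper named `bootstrap_cluster` and a `TALOSCTL` override variable introduced only for this example:

```bash
#!/usr/bin/env bash
# Sketch only. Set TALOSCTL=echo to preview the commands.
TALOSCTL="${TALOSCTL:-talosctl}"

# bootstrap_cluster <first-controlplane-ip>
bootstrap_cluster() {
  local cp="$1"
  # Step 1: bootstrap etcd exactly once, on exactly one control plane node.
  "$TALOSCTL" bootstrap --nodes "$cp" || return 1
  # Steps 2-6 are automatic; the health check waits for them to converge.
  "$TALOSCTL" --nodes "$cp" health || return 1
}
```

Running the bootstrap command against a second node, or a second time, is not part of this flow; the other control plane nodes join the existing etcd cluster by themselves.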

## Resource System Architecture

### Controller-Runtime Pattern
Talos uses a Kubernetes-style controller pattern:
- **Resources**: Typed configuration and state objects
- **Controllers**: Reconcile desired vs actual state
- **Events**: Reactive architecture for state changes

### Resource Namespaces
- `config`: Machine configuration resources
- `cluster`: Cluster membership and discovery
- `controlplane`: Control plane component configurations
- `secrets`: Certificate and key management
- `network`: Network configuration and state

### Key Resources
```bash
# Machine configuration
talosctl get machineconfig
talosctl get machinetype

# Cluster membership
talosctl get members
talosctl get affiliates
talosctl get identities

# Control plane
talosctl get apiserverconfig
talosctl get controllermanagerconfig
talosctl get schedulerconfig

# Network
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
```

## Network Architecture

### Network Stack
- **CNI**: Container Network Interface for pod networking
- **Host Networking**: Node-to-node communication
- **Service Discovery**: Built-in cluster member discovery
- **KubeSpan**: Optional WireGuard mesh networking

### Discovery Service Integration
- **Service Registry**: External discovery service (default: discovery.talos.dev)
- **Kubernetes Registry**: Deprecated, uses Kubernetes Node resources
- **Encrypted Communication**: All discovery data encrypted before transmission

## Security Architecture

### Immutable Base
- Read-only root filesystem
- Signed and verified boot process
- Atomic updates with rollback capability

### Process Isolation
- Minimal attack surface
- No shell access
- No arbitrary user services
- Container-based workload isolation

### Network Security
- Mutual TLS (mTLS) for all API communication
- Certificate-based node authentication
- Optional WireGuard mesh networking (KubeSpan)
- Encrypted service discovery

### Kernel Hardening
Configured according to Kernel Self Protection Project (KSPP) recommendations:
- Stack protection
- Control flow integrity
- Memory protection features
- Attack surface reduction

## Extension Points

### Machine Configuration
- Declarative configuration management
- Patch-based configuration updates
- Runtime configuration validation

### System Extensions
- Kernel modules
- System services (limited)
- Network configuration
- Storage configuration

### Kubernetes Integration
- Automatic kubelet configuration
- Bootstrap manifest management
- Certificate lifecycle management
- Node lifecycle automation

## Performance Characteristics

### etcd Performance
- Write performance decreases as the number of etcd members grows
- Network latency affects consensus performance
- Storage I/O directly impacts etcd performance

### Resource Requirements
- **Control Plane Nodes**: Higher memory for etcd, CPU for control plane
- **Worker Nodes**: Resources scale with workload requirements
- **Network**: Low latency crucial for etcd performance

### Scaling Patterns
- **Horizontal Scaling**: Add worker nodes for capacity
- **Vertical Scaling**: Increase control plane node resources for performance
- **Control Plane Scaling**: Odd numbers (3, 5) for availability

This architecture lets Talos provide a secure, minimal, and operationally simple platform for running Kubernetes clusters while keeping the reliability and performance needed for production workloads.
ai/talos-v1.11/bare-metal-administration.md
@@ -0,0 +1,506 @@
# Bare Metal Talos Administration Guide

This guide covers bare metal specific operations, configurations, and best practices for Talos Linux clusters.

## META-Based Network Configuration

Talos supports META-based network configuration for bare metal deployments where configuration is embedded in the disk image.

### Basic META Configuration
```yaml
# META configuration for bare metal networking
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
        mtu: 1500
    nameservers:
      - 8.8.8.8
      - 1.1.1.1
```

### Advanced Network Configurations

#### VLAN Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0 # parent device
        vlans:
          - vlanId: 100 # VLAN 100
            addresses:
              - 192.168.100.10/24
            routes:
              - network: 192.168.100.0/24
```

#### Interface Bonding
```yaml
machine:
  network:
    interfaces:
      - interface: bond0
        bond:
          mode: 802.3ad
          lacpRate: fast
          xmitPolicy: layer3+4
          miimon: 100
          updelay: 200
          downdelay: 200
          interfaces:
            - eth0
            - eth1
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
```

#### Bridge Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: br0
        bridge:
          stp:
            enabled: false
          interfaces:
            - eth0
            - eth1
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
```

### Network Troubleshooting Commands
```bash
# Check interface configuration
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
talosctl -n <IP> get links

# Check network configuration
talosctl -n <IP> get networkconfig -o yaml

# Inspect kernel-level interfaces and per-interface traffic counters
talosctl -n <IP> list /sys/class/net
talosctl -n <IP> read /proc/net/dev
```

## Disk Encryption for Bare Metal

### LUKS2 Encryption Configuration
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          static:
            passphrase: "your-secure-passphrase"
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          nodeID: {}
```

### TPM-Based Encryption
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
```

### Key Management Operations
```bash
# Check encryption status
talosctl -n <IP> get encryptionconfig -o yaml

# Rotate encryption keys
talosctl -n <IP> apply-config --file updated-config.yaml --mode staged
```

## SecureBoot Implementation

### UKI (Unified Kernel Image) Setup
SecureBoot requires UKI format images with embedded signatures.

#### Generate SecureBoot Keys
```bash
# Generate the UKI signing key and certificate (written to _out/ by default)
talosctl gen secureboot uki --common-name "SecureBoot Key"

# Generate PCR signing key
talosctl gen secureboot pcr

# Generate database entries for enrollment in firmware
talosctl gen secureboot database
```

#### Machine Configuration for SecureBoot
```yaml
# SecureBoot is enabled by booting a signed SecureBoot (UKI) image and enrolling
# keys in firmware; the machine configuration only covers disk encryption.
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          tpm: {} # key is sealed to the TPM and tied to the SecureBoot state
```

### UEFI Configuration
- Enable SecureBoot in UEFI firmware
- Enroll platform keys and certificates
- Configure TPM 2.0 for PCR measurements
- Set boot order for UKI images

## Hardware-Specific Configurations

### Performance Tuning for Bare Metal

#### CPU Governor Configuration
```yaml
machine:
  sysfs:
    "devices.system.cpu.cpu0.cpufreq.scaling_governor": "performance"
    "devices.system.cpu.cpu1.cpufreq.scaling_governor": "performance"
```
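
Because the governor is set per CPU, machines with many cores need one key per core. The sketch below generates such a patch; `governor_patch` is a hypothetical helper written for this example, and the CPU count is passed in by hand rather than detected.

```bash
#!/usr/bin/env bash
# governor_patch <cpu-count> <governor>: print a machine.sysfs patch
# covering cpu0 .. cpu(N-1).
governor_patch() {
  local count="$1" governor="$2" i
  echo "machine:"
  echo "  sysfs:"
  for (( i = 0; i < count; i++ )); do
    echo "    \"devices.system.cpu.cpu${i}.cpufreq.scaling_governor\": \"${governor}\""
  done
}

# Example: print a patch for a 4-core machine.
governor_patch 4 performance
```

The output can be redirected to a file and applied with the patch workflow, for instance `talosctl -n <IP> patch mc --patch @governor-patch.yaml`.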

#### Hardware Vulnerability Mitigations
```yaml
machine:
  install:
    extraKernelArgs:
      - mitigations=off # For maximum performance (less secure)
      # or:
      # - mitigations=auto # Default balanced approach
```

#### IOMMU Configuration
```yaml
machine:
  install:
    extraKernelArgs:
      - intel_iommu=on
      - iommu=pt
```

### Memory Management
```yaml
machine:
  install:
    extraKernelArgs:
      - hugepages=1024 # 1024 pages of the default huge page size (2 MiB on x86_64, 2 GiB total)
      - transparent_hugepage=never
```
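
The memory reserved by `hugepages=N` depends on the huge page size, so it is worth doing the arithmetic before rebooting a node. A small worked example follows; `hugepage_total_mib` is a name made up for this sketch, and the page sizes are the common x86_64 values (2 MiB default, 1 GiB when `hugepagesz=1G` is also set).

```bash
#!/usr/bin/env bash
# hugepage_total_mib <pages> <page-size-kib>: memory reserved, in MiB.
hugepage_total_mib() { echo $(( $1 * $2 / 1024 )); }

hugepage_total_mib 1024 2048      # 1024 x 2 MiB pages -> 2048 MiB
hugepage_total_mib 4 1048576      # 4 x 1 GiB pages    -> 4096 MiB
```

Reserved huge pages are unavailable to ordinary allocations, so the result should be compared against the node's total memory.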

## Ingress Firewall for Bare Metal

### Basic Firewall Configuration
```yaml
# Default action: block ingress traffic that no rule allows
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-talos-api
portSelector:
  ports:
    - 50000
    - 50001
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-kubernetes-api
portSelector:
  ports:
    - 6443
  protocol: tcp
ingress:
  - subnet: 0.0.0.0/0
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-etcd
portSelector:
  ports:
    - 2379
    - 2380
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```

### Advanced Firewall Rules
```yaml
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
# Talos itself runs no SSH daemon; this rule only matters if a host-network
# workload listens on port 22.
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-ssh-management
portSelector:
  ports:
    - 22
  protocol: tcp
ingress:
  - subnet: 10.0.1.0/24 # Management network only
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-monitoring
portSelector:
  ports:
    - 9100 # Node exporter
    - 10250 # kubelet metrics
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```

## System Extensions for Bare Metal

### Common Bare Metal Extensions
```yaml
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/iscsi-tools:latest
      - image: ghcr.io/siderolabs/util-linux-tools:latest
      - image: ghcr.io/siderolabs/drbd:latest
```

### Storage Extensions
```yaml
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/zfs:latest
      - image: ghcr.io/siderolabs/nut-client:latest
      - image: ghcr.io/siderolabs/smartmontools:latest
```

### Checking Extension Status
```bash
# List installed extensions
talosctl -n <IP> get extensions

# Check extension services
talosctl -n <IP> get extensionserviceconfigs
```

## Static Pod Configuration for Bare Metal

### Local Storage Static Pods
```yaml
# Each entry under machine.pods is a complete Pod manifest
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: local-storage-provisioner
        namespace: kube-system
      spec:
        containers:
          - name: local-storage-provisioner
            image: rancher/local-path-provisioner:v0.0.24
            args:
              - --config-path=/etc/config/config.json
            env:
              - name: POD_NAMESPACE
                value: kube-system
            volumeMounts:
              - name: config
                mountPath: /etc/config
              - name: local-storage
                mountPath: /opt/local-path-provisioner
        volumes:
          - name: config
            hostPath:
              path: /etc/local-storage
          - name: local-storage
            hostPath:
              path: /var/lib/local-storage
```

### Hardware Monitoring Static Pods
```yaml
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: node-exporter
        namespace: monitoring
      spec:
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        containers:
          - name: node-exporter
            image: prom/node-exporter:latest
            args:
              - --path.rootfs=/host
              - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
            volumeMounts:
              - name: proc
                mountPath: /host/proc
                readOnly: true
              - name: sys
                mountPath: /host/sys
                readOnly: true
              - name: rootfs
                mountPath: /host
                readOnly: true
        volumes:
          - name: proc
            hostPath:
              path: /proc
          - name: sys
            hostPath:
              path: /sys
          - name: rootfs
            hostPath:
              path: /
```

## Bare Metal Boot Asset Management

### PXE Boot Configuration
For network booting, configure DHCP/TFTP with appropriate boot assets:

```bash
# Download kernel and initramfs for PXE
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/vmlinuz-amd64
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/initramfs-amd64.xz
```

### USB Boot Asset Creation
```bash
# Write the ISO image to USB (replace /dev/sdX with the USB device; this erases it)
sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress
```

### Image Factory Integration
For custom bare metal images:
```bash
# Generate schematic for bare metal with extensions
curl -X POST --data-binary @schematic.yaml \
  https://factory.talos.dev/schematics

# Download the custom ISO built from that schematic
curl -LO https://factory.talos.dev/image/<schematic-id>/v1.11.0/metal-amd64.iso
```
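
The two Image Factory steps above depend on a `schematic.yaml` file and on a download URL assembled from the returned schematic ID. The sketch below shows one way to produce both; `write_schematic` and `factory_image_url` are helper names invented here, and the extension names passed in (for example `siderolabs/iscsi-tools`) are placeholders to be replaced with the extensions actually required.

```bash
#!/usr/bin/env bash
# write_schematic <file> <official-extension>...: write a minimal schematic.
write_schematic() {
  local file="$1" ext
  shift
  {
    echo "customization:"
    echo "  systemExtensions:"
    echo "    officialExtensions:"
    for ext in "$@"; do
      echo "      - $ext"
    done
  } > "$file"
}

# factory_image_url <schematic-id> <talos-version> <asset>: build the download URL.
factory_image_url() {
  echo "https://factory.talos.dev/image/$1/$2/$3"
}
```

For instance, `write_schematic schematic.yaml siderolabs/iscsi-tools` prepares the file for the `curl -X POST` call, and `factory_image_url <schematic-id> v1.11.0 metal-amd64.iso` prints the URL used in the download step.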

## Hardware Compatibility and Drivers

### Check Hardware Support
```bash
# Check PCI devices
talosctl -n <IP> read /proc/bus/pci/devices

# Check USB devices
talosctl -n <IP> list /sys/bus/usb/devices

# Check loaded kernel modules
talosctl -n <IP> read /proc/modules

# Check hardware information
talosctl -n <IP> read /proc/cpuinfo
talosctl -n <IP> read /proc/meminfo
```

### Common Hardware Issues

#### Network Interface Issues
```bash
# Check interface status
talosctl -n <IP> list /sys/class/net/

# Check driver information (the DRIVER= line names the bound driver)
talosctl -n <IP> read /sys/class/net/eth0/device/uevent

# Check firmware loading
talosctl -n <IP> dmesg | grep firmware
```

#### Storage Controller Issues
```bash
# Check block devices
talosctl -n <IP> disks

# List stable disk identifiers (SMART data itself requires an extension such as smartmontools)
talosctl -n <IP> list /dev/disk/by-id/
```

## Bare Metal Monitoring and Maintenance

### Hardware Health Monitoring
```bash
# Check system temperatures (if available)
talosctl -n <IP> read /sys/class/thermal/thermal_zone0/temp

# Check power supply status (list the supplies first, then read one by name)
talosctl -n <IP> list /sys/class/power_supply
talosctl -n <IP> read /sys/class/power_supply/<name>/status

# Monitor system events for hardware issues
talosctl -n <IP> dmesg | grep -i error
talosctl -n <IP> dmesg | grep -i "machine check"
```

### Performance Monitoring
```bash
# Check CPU performance
talosctl -n <IP> read /proc/cpuinfo | grep MHz
talosctl -n <IP> cgroups --preset cpu

# Check memory performance
talosctl -n <IP> memory
talosctl -n <IP> cgroups --preset memory

# Check I/O performance
talosctl -n <IP> read /proc/diskstats
```

## Security Hardening for Bare Metal

### BIOS/UEFI Security
- Enable SecureBoot
- Disable unused boot devices
- Set administrator passwords
- Enable TPM 2.0
- Disable legacy boot modes

### Physical Security
- Secure physical access to servers
- Use chassis intrusion detection
- Implement network port security
- Consider hardware-based attestation

### Network Security
```yaml
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
# Only allow necessary services
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-cluster-traffic
portSelector:
  ports:
    - 6443 # Kubernetes API
    - 2379 # etcd client
    - 2380 # etcd peer
    - 10250 # kubelet API
    - 50000 # Talos API
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```

This bare metal guide covers hardware-specific configurations, performance optimization, security hardening, and operational practices for Talos Linux on physical servers.
ai/talos-v1.11/cli-essentials.md
@@ -0,0 +1,382 @@
# Talosctl CLI Essentials

This guide covers essential talosctl commands and usage patterns for Talos cluster administration.

## Command Structure and Context

### Basic Command Pattern
```bash
talosctl [global-flags] <command> [command-flags] [arguments]

# Examples:
talosctl -n <IP> get members
talosctl --nodes <IP1>,<IP2> service kubelet
talosctl -e <endpoint> -n <target-nodes> upgrade --image <image>
```

### Global Flags
- `-e, --endpoints`: API endpoints to connect to
- `-n, --nodes`: Target nodes for commands (defaults to first endpoint if omitted)
- `--talosconfig`: Path to Talos configuration file
- `--context`: Configuration context to use

### Configuration Management
```bash
# Use specific config file
export TALOSCONFIG=/path/to/talosconfig

# List available contexts
talosctl config contexts

# Switch context
talosctl config context <context-name>

# View current config
talosctl config info
```

## Cluster Management Commands

### Bootstrap and Node Management
```bash
# Bootstrap etcd cluster on first control plane node
talosctl bootstrap --nodes <first-controlplane-ip>

# Apply machine configuration
talosctl apply-config --nodes <IP> --file <config.yaml>
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run

# Reset node (wipe and reboot)
talosctl reset --nodes <IP>
talosctl reset --nodes <IP> --graceful=false --reboot

# Reboot node
talosctl reboot --nodes <IP>

# Shutdown node
talosctl shutdown --nodes <IP>
```

### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'

# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml --mode reboot

# Edit machine config interactively
talosctl -n <IP> edit mc --mode staged
```
## System Information and Monitoring
|
||||
|
||||
### Node Status and Health
|
||||
```bash
|
||||
# Cluster member information
|
||||
talosctl get members
|
||||
talosctl get affiliates
|
||||
talosctl get identities
|
||||
|
||||
# Node health check
|
||||
talosctl -n <IP> health
|
||||
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
|
||||
|
||||
# System information
|
||||
talosctl -n <IP> version
|
||||
talosctl -n <IP> get machineconfig
|
||||
talosctl -n <IP> get machinetype
|
||||
```
|
||||
|
||||
### Resource Monitoring
|
||||
```bash
# CPU and memory usage
talosctl -n <IP> dashboard   # Live CPU, memory, and process view
talosctl -n <IP> memory

# Disk usage and information
talosctl -n <IP> get disks
talosctl -n <IP> df

# Network interfaces
talosctl -n <IP> get links
talosctl -n <IP> get addresses
talosctl -n <IP> get routes

# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
|
||||
|
||||
### Service Management
|
||||
```bash
|
||||
# List all services
|
||||
talosctl -n <IP> services
|
||||
|
||||
# Check specific service status
|
||||
talosctl -n <IP> service kubelet
|
||||
talosctl -n <IP> service containerd
|
||||
talosctl -n <IP> service etcd
|
||||
|
||||
# Restart service
|
||||
talosctl -n <IP> service kubelet restart
|
||||
|
||||
# Start/stop service
|
||||
talosctl -n <IP> service <service-name> start
|
||||
talosctl -n <IP> service <service-name> stop
|
||||
```
|
||||
|
||||
## Logging and Diagnostics
|
||||
|
||||
### Log Retrieval
|
||||
```bash
|
||||
# Kernel logs
|
||||
talosctl -n <IP> dmesg
|
||||
talosctl -n <IP> dmesg -f # Follow mode
|
||||
talosctl -n <IP> dmesg -f --tail   # Follow, showing only new messages
|
||||
|
||||
# Service logs
|
||||
talosctl -n <IP> logs kubelet
|
||||
talosctl -n <IP> logs containerd
|
||||
talosctl -n <IP> logs etcd
|
||||
talosctl -n <IP> logs machined
|
||||
|
||||
# Follow logs
|
||||
talosctl -n <IP> logs kubelet -f
|
||||
```
|
||||
|
||||
### System Events
|
||||
```bash
# Monitor system events
talosctl -n <IP> events
talosctl -n <IP> events --tail 20   # Include the 20 most recent past events

# Filter events
talosctl -n <IP> events --since=1h
talosctl -n <IP> events --tail -1 | grep -i error   # Full history, filtered client-side
```
|
||||
|
||||
## File System and Container Operations
|
||||
|
||||
### File Operations
|
||||
```bash
|
||||
# List files/directories
|
||||
talosctl -n <IP> list /var/log
|
||||
talosctl -n <IP> list /etc/kubernetes
|
||||
|
||||
# Copy files from a node to the local machine (talosctl cannot copy files onto a node)
talosctl -n <IP> copy /remote/path /local/dir
talosctl -n <IP> cp /var/log/containers/app.log ./app.log

# Read file contents
talosctl -n <IP> read /etc/resolv.conf
talosctl -n <IP> read /var/log/audit/audit.log
|
||||
```
|
||||
|
||||
### Container Operations
|
||||
```bash
|
||||
# List containers
|
||||
talosctl -n <IP> containers
|
||||
talosctl -n <IP> containers -k # Kubernetes containers
|
||||
|
||||
# Container logs
|
||||
talosctl -n <IP> logs --kubernetes <container-name>
|
||||
|
||||
# Execute in container (talosctl has no exec subcommand; use kubectl)
kubectl exec <pod-name> -- <command>
|
||||
```
|
||||
|
||||
## Kubernetes Integration
|
||||
|
||||
### Kubernetes Cluster Operations
|
||||
```bash
|
||||
# Get kubeconfig
|
||||
talosctl kubeconfig
|
||||
talosctl kubeconfig --nodes <controlplane-ip>
|
||||
talosctl kubeconfig --force --nodes <controlplane-ip>
|
||||
|
||||
# Bootstrap manifests
|
||||
talosctl -n <IP> get manifests
|
||||
talosctl -n <IP> get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml
|
||||
|
||||
# Upgrade Kubernetes
|
||||
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
|
||||
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
|
||||
```
|
||||
|
||||
### Resource Inspection
|
||||
```bash
|
||||
# Control plane component configs
|
||||
talosctl -n <IP> get apiserverconfig -o yaml
|
||||
talosctl -n <IP> get controllermanagerconfig -o yaml
|
||||
talosctl -n <IP> get schedulerconfig -o yaml
|
||||
|
||||
# etcd configuration
|
||||
talosctl -n <IP> get etcdconfig -o yaml
|
||||
```
|
||||
|
||||
## etcd Management
|
||||
|
||||
### etcd Operations
|
||||
```bash
|
||||
# etcd cluster status
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
|
||||
# etcd members
|
||||
talosctl -n <IP> etcd members
|
||||
|
||||
# etcd snapshots
|
||||
talosctl -n <IP> etcd snapshot db.snapshot
|
||||
|
||||
# etcd maintenance
|
||||
talosctl -n <IP> etcd defrag
|
||||
talosctl -n <IP> etcd alarm list
|
||||
talosctl -n <IP> etcd alarm disarm
|
||||
|
||||
# Leadership management
|
||||
talosctl -n <IP> etcd forfeit-leadership
|
||||
```
|
||||
|
||||
### Disaster Recovery
|
||||
```bash
|
||||
# Bootstrap from snapshot
|
||||
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
|
||||
talosctl -n <IP> bootstrap --recover-from=./db.snapshot --recover-skip-hash-check
|
||||
```
|
||||
|
||||
## Upgrade and Maintenance
|
||||
|
||||
### OS Upgrades
|
||||
```bash
|
||||
# Upgrade Talos OS
|
||||
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
|
||||
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
|
||||
|
||||
# Monitor upgrade progress
|
||||
talosctl upgrade --nodes <IP> --image <image> --wait
|
||||
talosctl upgrade --nodes <IP> --image <image> --wait --debug
|
||||
|
||||
# Rollback
|
||||
talosctl rollback --nodes <IP>
|
||||
```
|
||||
|
||||
## Resource System Commands
|
||||
|
||||
### Resource Management
|
||||
```bash
|
||||
# List resource types
|
||||
talosctl get rd
|
||||
|
||||
# Get specific resources
|
||||
talosctl get <resource-type>
|
||||
talosctl get <resource-type> -o yaml
|
||||
talosctl get <resource-type> --namespace=<namespace>
|
||||
|
||||
# Watch resources
|
||||
talosctl get <resource-type> --watch
|
||||
|
||||
# Common resource types
|
||||
talosctl get machineconfig
|
||||
talosctl get members
|
||||
talosctl get services
|
||||
talosctl get networkconfig
|
||||
talosctl get secrets
|
||||
```
|
||||
|
||||
## Local Development
|
||||
|
||||
### Local Cluster Management
|
||||
```bash
|
||||
# Create local cluster
|
||||
talosctl cluster create
|
||||
talosctl cluster create --controlplanes 3 --workers 2
|
||||
|
||||
# Destroy local cluster
|
||||
talosctl cluster destroy
|
||||
|
||||
# Show local cluster status
|
||||
talosctl cluster show
|
||||
```
|
||||
|
||||
## Advanced Usage Patterns
|
||||
|
||||
### Multi-Node Operations
|
||||
```bash
|
||||
# Run command on multiple nodes
|
||||
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
|
||||
|
||||
# Different endpoint and target nodes
|
||||
talosctl -e <public-endpoint> -n <internal-node1>,<internal-node2> <command>
|
||||
```
|
||||
|
||||
### Output Formatting
|
||||
```bash
|
||||
# JSON output
|
||||
talosctl -n <IP> get members -o json
|
||||
|
||||
# YAML output
|
||||
talosctl -n <IP> get machineconfig -o yaml
|
||||
|
||||
# Table output (default)
|
||||
talosctl -n <IP> get members -o table
|
||||
|
||||
# Custom column output
|
||||
talosctl -n <IP> get members -o columns=HOSTNAME,MACHINE\ TYPE,OS
|
||||
```
|
||||
|
||||
### Filtering and Selection
|
||||
```bash
|
||||
# Filter resources
|
||||
talosctl get members --search <hostname>
|
||||
talosctl get services --search kubelet
|
||||
|
||||
# Namespace filtering
|
||||
talosctl get secrets --namespace=secrets
|
||||
talosctl get affiliates --namespace=cluster-raw
|
||||
```
|
||||
|
||||
## Common Command Workflows
|
||||
|
||||
### Initial Cluster Setup
|
||||
```bash
|
||||
# 1. Generate configurations
|
||||
talosctl gen config cluster-name https://cluster-endpoint:6443
|
||||
|
||||
# 2. Apply to nodes
|
||||
talosctl apply-config --nodes <controlplane-1> --file controlplane.yaml
|
||||
talosctl apply-config --nodes <worker-1> --file worker.yaml
|
||||
|
||||
# 3. Bootstrap cluster
|
||||
talosctl bootstrap --nodes <controlplane-1>
|
||||
|
||||
# 4. Get kubeconfig
|
||||
talosctl kubeconfig --nodes <controlplane-1>
|
||||
```
|
||||
|
||||
### Cluster Health Check
|
||||
```bash
|
||||
# Check all aspects of cluster health
|
||||
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
talosctl -n <IP1>,<IP2>,<IP3> service kubelet
|
||||
kubectl get nodes
|
||||
kubectl get pods --all-namespaces
|
||||
```
|
||||
|
||||
### Node Troubleshooting
|
||||
```bash
|
||||
# System diagnostics
|
||||
talosctl -n <IP> dmesg | tail -100
|
||||
talosctl -n <IP> services | grep -v Running
|
||||
talosctl -n <IP> logs kubelet | tail -50
|
||||
talosctl -n <IP> events --since=1h
|
||||
|
||||
# Resource usage
|
||||
talosctl -n <IP> memory
|
||||
talosctl -n <IP> df
|
||||
talosctl -n <IP> processes | head -20
|
||||
```
|
||||
|
||||
This CLI reference provides the essential commands and patterns needed for day-to-day Talos cluster administration and troubleshooting.
|
||||
239
ai/talos-v1.11/cluster-operations.md
Normal file
@@ -0,0 +1,239 @@
|
||||
# Talos Cluster Operations Guide
|
||||
|
||||
This guide covers essential cluster operations for Talos Linux v1.11 administrators.
|
||||
|
||||
## Upgrading Operations
|
||||
|
||||
### Talos OS Upgrades
|
||||
|
||||
Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image.
|
||||
|
||||
#### Upgrade Process
|
||||
```bash
|
||||
# Upgrade a single node
|
||||
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
|
||||
|
||||
# Use --stage flag if upgrade fails due to open files
|
||||
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
|
||||
|
||||
# Monitor upgrade progress
talosctl -n <IP> dmesg -f
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --wait --debug
|
||||
```
|
||||
|
||||
#### Upgrade Sequence
|
||||
1. Node cordons itself in Kubernetes
|
||||
2. Node drains existing workloads
|
||||
3. Internal processes shut down
|
||||
4. Filesystems unmount
|
||||
5. Disk verification and image upgrade
|
||||
6. Bootloader set to boot once with new image
|
||||
7. Node reboots
|
||||
8. Node rejoins cluster and uncordons
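
Because this sequence runs per node, a whole cluster is usually upgraded one node at a time. The following is a minimal sketch of that loop, not an official procedure: `rolling_upgrade` is a hypothetical helper name, and it assumes `--wait` blocks until the node has finished upgrading.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of talosctl): upgrade nodes sequentially.
# Usage: rolling_upgrade <installer-image> <node-ip>...
rolling_upgrade() {
  local image="$1"
  shift
  local node
  for node in "$@"; do
    echo "upgrading ${node}"
    # Stop at the first failure so the remaining nodes stay on the old version.
    talosctl upgrade --nodes "${node}" --image "${image}" --wait || return 1
  done
}

# Example call (addresses are placeholders):
#   rolling_upgrade ghcr.io/siderolabs/installer:v1.11.x 10.0.0.2 10.0.0.3 10.0.0.4
```

Stopping on the first failure keeps at most one node in an unknown state, which matters most for control plane nodes because etcd needs a majority of members to stay available.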
|
||||
|
||||
#### Rollback
|
||||
```bash
|
||||
talosctl rollback --nodes <IP>
|
||||
```
|
||||
|
||||
### Kubernetes Upgrades
|
||||
|
||||
Kubernetes upgrades are separate from OS upgrades and non-disruptive.
|
||||
|
||||
#### Automated Upgrade (Recommended)
|
||||
```bash
|
||||
# Check what will be upgraded
|
||||
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
|
||||
|
||||
# Perform upgrade
|
||||
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
|
||||
```
|
||||
|
||||
#### Manual Component Upgrades
|
||||
For manual control, patch each component individually:
|
||||
|
||||
**API Server:**
|
||||
```bash
|
||||
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'
|
||||
```
|
||||
|
||||
**Controller Manager:**
|
||||
```bash
|
||||
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'
|
||||
```
|
||||
|
||||
**Scheduler:**
|
||||
```bash
|
||||
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'
|
||||
```
|
||||
|
||||
**Kubelet:**
|
||||
```bash
|
||||
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
|
||||
```
|
||||
|
||||
## Node Management
|
||||
|
||||
### Adding Control Plane Nodes
|
||||
1. Apply machine configuration to new node
|
||||
2. Node automatically joins etcd cluster via control plane endpoint
|
||||
3. Control plane components start automatically
|
||||
|
||||
### Removing Control Plane Nodes
|
||||
```bash
|
||||
# Recommended approach - reset then delete
|
||||
talosctl -n <IP.of.node.to.remove> reset
|
||||
kubectl delete node <node-name>
|
||||
```
|
||||
|
||||
### Adding Worker Nodes
|
||||
1. Apply worker machine configuration
|
||||
2. Node automatically joins via bootstrap token
|
||||
|
||||
### Removing Worker Nodes
|
||||
```bash
|
||||
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
|
||||
kubectl delete node <node-name>
|
||||
talosctl -n <IP> reset
|
||||
```
|
||||
|
||||
## Configuration Management
|
||||
|
||||
### Applying Configuration Changes
|
||||
```bash
|
||||
# Apply config with automatic mode detection
|
||||
talosctl apply-config --nodes <IP> --file <config.yaml>
|
||||
|
||||
# Apply with specific modes
|
||||
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
|
||||
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
|
||||
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged
|
||||
|
||||
# Dry run to preview changes
|
||||
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
|
||||
```
|
||||
|
||||
### Configuration Patching
|
||||
```bash
|
||||
# Patch machine configuration
|
||||
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
|
||||
|
||||
# Patch with file
|
||||
talosctl -n <IP> patch mc --patch @patch.yaml
|
||||
```
|
||||
|
||||
### Retrieving Current Configuration
|
||||
```bash
|
||||
# Get machine configuration
|
||||
talosctl -n <IP> get mc v1alpha1 -o yaml
|
||||
|
||||
# Get effective configuration
|
||||
talosctl -n <IP> get machineconfig -o yaml
|
||||
```
|
||||
|
||||
## Cluster Health Monitoring
|
||||
|
||||
### Node Status
|
||||
```bash
|
||||
# Check node status
|
||||
talosctl -n <IP> get members
|
||||
talosctl -n <IP> health
|
||||
|
||||
# Check system services
|
||||
talosctl -n <IP> services
|
||||
talosctl -n <IP> service <service-name>
|
||||
```
|
||||
|
||||
### Resource Monitoring
|
||||
```bash
# System resources
talosctl -n <IP> memory
talosctl -n <IP> dashboard   # Live CPU, memory, and process view
talosctl -n <IP> get disks

# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
```
|
||||
|
||||
### Log Monitoring
|
||||
```bash
|
||||
# Kernel logs
|
||||
talosctl -n <IP> dmesg
|
||||
talosctl -n <IP> dmesg -f # Follow mode
|
||||
|
||||
# Service logs
|
||||
talosctl -n <IP> logs <service-name>
|
||||
talosctl -n <IP> logs kubelet
|
||||
```
|
||||
|
||||
## Control Plane Best Practices
|
||||
|
||||
### Cluster Sizing Recommendations
|
||||
- **3 nodes**: Sufficient for most use cases, tolerates 1 node failure
|
||||
- **5 nodes**: Better availability (tolerates 2 node failures), higher resource cost
|
||||
- **Avoid even numbers**: 2 or 4 nodes provide worse availability than odd numbers
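
The sizing advice follows from etcd's majority rule. The sketch below works through the arithmetic; the function names are illustrative only.

```bash
#!/usr/bin/env bash
# etcd needs a majority (quorum) of members to accept writes.
etcd_quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while a quorum is still reachable.
etcd_fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 2 3 4 5; do
  echo "members=${n} quorum=$(etcd_quorum "${n}") tolerates=$(etcd_fault_tolerance "${n}")"
done
# members=2 quorum=2 tolerates=0
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
```

A 4-member cluster tolerates the same single failure as a 3-member cluster while having one more member that can fail, which is why even sizes are discouraged.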
|
||||
|
||||
### Node Replacement Strategy
|
||||
- **Failed node**: Remove first, then add replacement
|
||||
- **Healthy node**: Add replacement first, then remove old node
|
||||
|
||||
### Performance Considerations
|
||||
- etcd write performance decreases as control plane members are added
- A 5-node cluster commits ~5% fewer writes than a 3-node cluster
- To improve performance, scale control plane nodes vertically rather than adding more nodes
|
||||
|
||||
## Machine Configuration Versioning
|
||||
|
||||
### Reproducible Configuration Workflow
|
||||
Store only:
|
||||
- `secrets.yaml` (generated once at cluster creation)
|
||||
- Patch files (YAML/JSON patches describing differences from defaults)
|
||||
|
||||
Generate configs when needed:
|
||||
```bash
|
||||
# Generate fresh configs with existing secrets
|
||||
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml
|
||||
|
||||
# Apply patches to generated configs
|
||||
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml
|
||||
```
|
||||
|
||||
This prevents configuration drift after automated upgrades.
|
||||
|
||||
## Troubleshooting Common Issues
|
||||
|
||||
### Upgrade Failures
|
||||
- **Invalid installer image**: Check image reference and network connectivity
|
||||
- **Filesystem unmount failure**: Use `--stage` flag
|
||||
- **Boot failure**: System automatically rolls back to previous version
|
||||
- **Workload issues**: Use `talosctl rollback` to revert
|
||||
|
||||
### Node Join Issues
|
||||
- Verify network connectivity to control plane endpoint
|
||||
- Check discovery service configuration
|
||||
- Validate machine configuration syntax
|
||||
- Ensure bootstrap process completed on initial control plane node
|
||||
|
||||
### Control Plane Quorum Loss
|
||||
- Identify healthy nodes with `talosctl etcd status`
|
||||
- Follow disaster recovery procedures if quorum cannot be restored
|
||||
- Use etcd snapshots for cluster recovery
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Certificate Rotation
|
||||
Talos automatically rotates certificates, but monitor expiration:
|
||||
```bash
|
||||
talosctl -n <IP> get secrets
|
||||
```
|
||||
|
||||
### Pod Security
|
||||
Control plane nodes are tainted by default to prevent workload scheduling. This protects:
|
||||
- Control plane from resource starvation
|
||||
- Credentials from workload exposure
|
||||
|
||||
### Network Security
|
||||
- All API communication uses mutual TLS (mTLS)
|
||||
- Discovery service data is encrypted before transmission
|
||||
- WireGuard (KubeSpan) provides mesh networking security
|
||||
344
ai/talos-v1.11/discovery-and-networking.md
Normal file
@@ -0,0 +1,344 @@
|
||||
# Discovery and Networking Guide
|
||||
|
||||
This guide covers Talos cluster discovery mechanisms, network configuration, and connectivity troubleshooting.
|
||||
|
||||
## Cluster Discovery System
|
||||
|
||||
Talos includes built-in node discovery that allows cluster members to find each other and maintain membership information.
|
||||
|
||||
### Discovery Registries
|
||||
|
||||
#### Service Registry (Default)
|
||||
- **External Service**: Uses public discovery service at `https://discovery.talos.dev/`
|
||||
- **Encryption**: All data encrypted with AES-GCM before transmission
|
||||
- **Functionality**: Works without dependency on etcd/Kubernetes
|
||||
- **Advantages**: Available even when control plane is down
|
||||
|
||||
#### Kubernetes Registry (Deprecated)
|
||||
- **Data Source**: Uses Kubernetes Node resources and annotations
|
||||
- **Limitation**: Incompatible with Kubernetes 1.32+ due to AuthorizeNodeWithSelectors
|
||||
- **Status**: Disabled by default, deprecated
|
||||
|
||||
### Discovery Configuration
|
||||
```yaml
|
||||
cluster:
|
||||
discovery:
|
||||
enabled: true
|
||||
registries:
|
||||
service:
|
||||
disabled: false # Default
|
||||
kubernetes:
|
||||
disabled: true # Deprecated, disabled by default
|
||||
```
|
||||
|
||||
**To disable service registry**:
|
||||
```yaml
|
||||
cluster:
|
||||
discovery:
|
||||
enabled: true
|
||||
registries:
|
||||
service:
|
||||
disabled: true
|
||||
```
|
||||
|
||||
## Discovery Data Flow
|
||||
|
||||
### Service Registry Process
|
||||
1. **Data Encryption**: Node encrypts affiliate data with cluster key
|
||||
2. **Endpoint Encryption**: Endpoints separately encrypted for deduplication
|
||||
3. **Data Submission**: Node submits own data + observed peer endpoints
|
||||
4. **Server Processing**: Discovery service aggregates and deduplicates data
|
||||
5. **Data Distribution**: Encrypted updates sent to all cluster members
|
||||
6. **Local Processing**: Nodes decrypt data for cluster discovery and KubeSpan
|
||||
|
||||
### Data Protection
|
||||
- **Cluster Isolation**: Cluster ID used as key selector
|
||||
- **End-to-End Encryption**: Discovery service cannot decrypt affiliate data
|
||||
- **Memory-Only Storage**: Data stored in memory with encrypted snapshots
|
||||
- **No Sensitive Exposure**: Service only sees encrypted blobs and cluster metadata
|
||||
|
||||
## Discovery Resources
|
||||
|
||||
### Node Identity
|
||||
```bash
|
||||
# View node's unique identity
|
||||
talosctl get identities -o yaml
|
||||
```
|
||||
|
||||
**Output**:
|
||||
```yaml
|
||||
spec:
|
||||
nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd
|
||||
```
|
||||
|
||||
**Identity Characteristics**:
|
||||
- Base62 encoded random 32 bytes
|
||||
- URL-safe encoding
|
||||
- Preserved in STATE partition (`node-identity.yaml`)
|
||||
- Survives reboots and upgrades
|
||||
- Regenerated on reset/wipe
|
||||
|
||||
### Affiliates (Proposed Members)
|
||||
```bash
|
||||
# View discovered affiliates (proposed cluster members)
|
||||
talosctl get affiliates
|
||||
```
|
||||
|
||||
**Output**:
|
||||
```
|
||||
ID VERSION HOSTNAME MACHINE TYPE ADDRESSES
|
||||
2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 2 talos-default-controlplane-2 controlplane ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
|
||||
```
|
||||
|
||||
### Members (Approved Members)
|
||||
```bash
|
||||
# View cluster members
|
||||
talosctl get members
|
||||
```
|
||||
|
||||
**Output**:
|
||||
```
|
||||
ID VERSION HOSTNAME MACHINE TYPE OS ADDRESSES
|
||||
talos-default-controlplane-1 2 talos-default-controlplane-1 controlplane Talos (v1.11.0) ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]
|
||||
```
|
||||
|
||||
### Raw Registry Data
|
||||
```bash
|
||||
# View data from specific registries
|
||||
talosctl get affiliates --namespace=cluster-raw
|
||||
```
|
||||
|
||||
**Output shows registry sources**:
|
||||
```
|
||||
ID VERSION HOSTNAME
|
||||
k8s/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 3 talos-default-controlplane-2
|
||||
service/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 23 talos-default-controlplane-2
|
||||
```
|
||||
|
||||
## Network Architecture
|
||||
|
||||
### Network Layers
|
||||
|
||||
#### Host Networking
|
||||
- **Node-to-Node**: Direct IP connectivity between cluster nodes
|
||||
- **Control Plane**: API server communication via control plane endpoint
|
||||
- **Discovery**: HTTPS connection to discovery service (port 443)
|
||||
|
||||
#### Container Networking
|
||||
- **CNI**: Container Network Interface for pod networking
|
||||
- **Service Mesh**: Optional service mesh implementations
|
||||
- **Network Policies**: Kubernetes network policy enforcement
|
||||
|
||||
#### Optional: KubeSpan (WireGuard Mesh)
|
||||
- **Mesh Networking**: Full mesh WireGuard connections
|
||||
- **Discovery Integration**: Uses discovery service for peer coordination
|
||||
- **Encryption**: WireGuard public keys distributed via discovery
|
||||
- **Use Cases**: Multi-cloud, hybrid, NAT traversal
|
||||
|
||||
### Network Configuration Patterns
|
||||
|
||||
#### Basic Network Setup
|
||||
```yaml
|
||||
machine:
|
||||
network:
|
||||
interfaces:
|
||||
- interface: eth0
|
||||
dhcp: true
|
||||
```
|
||||
|
||||
#### Static IP Configuration
|
||||
```yaml
|
||||
machine:
|
||||
network:
|
||||
interfaces:
|
||||
- interface: eth0
|
||||
addresses:
|
||||
- 192.168.1.100/24
|
||||
routes:
|
||||
- network: 0.0.0.0/0
|
||||
gateway: 192.168.1.1
|
||||
mtu: 1500
|
||||
nameservers:
|
||||
- 8.8.8.8
|
||||
- 1.1.1.1
|
||||
```
|
||||
|
||||
#### Multiple Interface Configuration
|
||||
```yaml
|
||||
machine:
|
||||
network:
|
||||
interfaces:
|
||||
- interface: eth0 # Management interface
|
||||
dhcp: true
|
||||
- interface: eth1 # Kubernetes traffic
|
||||
addresses:
|
||||
- 10.0.1.100/24
|
||||
routes:
|
||||
- network: 10.0.0.0/16
|
||||
gateway: 10.0.1.1
|
||||
```
|
||||
|
||||
## KubeSpan Configuration
|
||||
|
||||
### Basic KubeSpan Setup
|
||||
```yaml
|
||||
machine:
|
||||
network:
|
||||
kubespan:
|
||||
enabled: true
|
||||
```
|
||||
|
||||
### Advanced KubeSpan Configuration
|
||||
```yaml
|
||||
machine:
|
||||
network:
|
||||
kubespan:
|
||||
enabled: true
|
||||
advertiseKubernetesNetworks: true
|
||||
allowDownPeerBypass: true
|
||||
mtu: 1420 # Account for WireGuard overhead
|
||||
filters:
|
||||
endpoints:
|
||||
- 0.0.0.0/0 # Allow all endpoints
|
||||
```
|
||||
|
||||
**KubeSpan Features**:
|
||||
- Automatic peer discovery via discovery service
|
||||
- NAT traversal capabilities
|
||||
- Encrypted mesh networking
|
||||
- Kubernetes network advertisement
|
||||
- Fault tolerance with peer bypass
|
||||
|
||||
## Network Troubleshooting
|
||||
|
||||
### Discovery Issues
|
||||
|
||||
#### Check Discovery Service Connectivity
|
||||
```bash
|
||||
# Test connectivity to discovery service
|
||||
talosctl get affiliates
|
||||
|
||||
# Check discovery configuration
|
||||
talosctl get discoveryconfig -o yaml
|
||||
|
||||
# Monitor discovery events
|
||||
talosctl events --tail
|
||||
```
|
||||
|
||||
#### Common Discovery Problems
|
||||
1. **No Affiliates Discovered**:
|
||||
- Check discovery service connectivity
|
||||
- Verify cluster ID matches across nodes
|
||||
- Confirm discovery is enabled
|
||||
|
||||
2. **Partial Affiliate List**:
|
||||
- Network connectivity issues between nodes
|
||||
- Discovery service regional availability
|
||||
- Firewall blocking discovery traffic
|
||||
|
||||
3. **Discovery Service Unreachable**:
|
||||
- Network connectivity to discovery.talos.dev:443
|
||||
- Corporate firewall/proxy configuration
|
||||
- DNS resolution issues
|
||||
|
||||
### Network Connectivity Testing
|
||||
|
||||
#### Basic Network Tests
|
||||
```bash
|
||||
# Test network interfaces
|
||||
talosctl get addresses
|
||||
talosctl get routes
|
||||
talosctl get nodeaddresses
|
||||
|
||||
# Check network configuration
|
||||
talosctl get networkconfig -o yaml
|
||||
|
||||
# Inspect connections (talosctl has no ping subcommand)
talosctl -n <IP> netstat
|
||||
```
|
||||
|
||||
#### Inter-Node Connectivity
|
||||
```bash
|
||||
# Test control plane endpoint
|
||||
talosctl health --control-plane-nodes <IP1>,<IP2>,<IP3>
|
||||
|
||||
# Check etcd connectivity
|
||||
talosctl -n <IP> etcd members
|
||||
|
||||
# Test Kubernetes API
|
||||
kubectl get nodes
|
||||
```
|
||||
|
||||
#### KubeSpan Troubleshooting
|
||||
```bash
|
||||
# Check KubeSpan status
|
||||
talosctl get kubespanpeerspecs
|
||||
talosctl get kubespanpeerstatuses
|
||||
|
||||
# Monitor WireGuard connections
|
||||
talosctl -n <IP> get links
|
||||
|
||||
# Check KubeSpan logs
|
||||
talosctl -n <IP> logs controller-runtime | grep kubespan
|
||||
```
|
||||
|
||||
### Network Performance Optimization
|
||||
|
||||
#### Network Interface Tuning
|
||||
```yaml
|
||||
machine:
|
||||
network:
|
||||
interfaces:
|
||||
- interface: eth0
|
||||
mtu: 9000 # Jumbo frames if supported
|
||||
dhcp: true
|
||||
```
|
||||
|
||||
#### KubeSpan Performance
|
||||
- Adjust MTU for WireGuard overhead (typically -80 bytes)
|
||||
- Consider endpoint filters for large clusters
|
||||
- Monitor WireGuard peer connection stability
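
As a rough illustration of the MTU adjustment, the sketch below subtracts the 80-byte figure quoted above from the underlying link MTU. The overhead value is taken from this guide as an assumption, and `kubespan_mtu` is an illustrative name.

```bash
#!/usr/bin/env bash
# Assumption: 80 bytes of WireGuard overhead, the figure quoted in this guide.
WG_OVERHEAD=80

# Print the KubeSpan MTU for a given link MTU.
kubespan_mtu() { echo $(( $1 - WG_OVERHEAD )); }

kubespan_mtu 1500   # standard Ethernet -> 1420
kubespan_mtu 9000   # jumbo frames      -> 8920
```

The 1500-byte case yields 1420, which matches the `mtu: 1420` value used in the advanced KubeSpan configuration earlier in this guide.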
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Discovery Security
|
||||
- **Encrypted Communication**: All discovery data encrypted end-to-end
|
||||
- **Cluster Isolation**: Cluster ID prevents cross-cluster data access
|
||||
- **No Sensitive Data**: Only encrypted metadata transmitted
|
||||
- **Network Security**: HTTPS transport with certificate validation
|
||||
|
||||
### Network Security
|
||||
- **mTLS**: All Talos API communication uses mutual TLS
|
||||
- **Certificate Rotation**: Automatic certificate lifecycle management
|
||||
- **Network Policies**: Implement Kubernetes network policies for workloads
|
||||
- **Firewall Rules**: Restrict network access to necessary ports only
|
||||
|
||||
### Required Network Ports
|
||||
- **6443**: Kubernetes API server
|
||||
- **2379-2380**: etcd client/peer communication
|
||||
- **10250**: kubelet API
|
||||
- **50000**: Talos API (apid)
|
||||
- **443**: Discovery service (outbound)
|
||||
- **51820**: KubeSpan WireGuard (if enabled)
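
When writing firewall rules it can help to probe these ports from another machine. The sketch below is one possible approach, using bash's built-in `/dev/tcp` and the coreutils `timeout` command; the function names and the node address in the example are hypothetical.

```bash
#!/usr/bin/env bash
# Inbound TCP ports on a node, copied from the list above (2379-2380 expanded).
# Excluded: 443 is outbound to the discovery service, and 51820 (WireGuard)
# is UDP, so a TCP probe says nothing about it.
talos_node_tcp_ports() {
  printf '%s\n' 6443 2379 2380 10250 50000
}

# Return 0 if a TCP connection to <host> <port> opens within 3 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example (placeholder address):
#   for p in $(talos_node_tcp_ports); do
#     port_open 10.0.0.2 "$p" && echo "$p open" || echo "$p closed or filtered"
#   done
```

Not every port is expected to be open on every node: 6443 and 2379-2380 are served by control plane nodes, so a closed result on a worker is normal.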
|
||||
|
||||
## Operational Best Practices
|
||||
|
||||
### Monitoring
|
||||
- Monitor discovery service connectivity
|
||||
- Track cluster member changes
|
||||
- Alert on network partitions
|
||||
- Monitor KubeSpan peer status
|
||||
|
||||
### Backup and Recovery
|
||||
- Document network configuration
|
||||
- Backup discovery service configuration
|
||||
- Test network recovery procedures
|
||||
- Plan for discovery service outages
|
||||
|
||||
### Scaling Considerations
|
||||
- Discovery service scales to thousands of nodes
|
||||
- KubeSpan mesh scales to hundreds of nodes efficiently
|
||||
- Consider network segmentation for large clusters
|
||||
- Plan for multi-region deployments
|
||||
|
||||
This networking foundation enables Talos clusters to maintain connectivity and membership across various network topologies while providing security and performance optimization options.
|
||||
287
ai/talos-v1.11/etcd-management.md
Normal file
@@ -0,0 +1,287 @@
|
||||
# etcd Management and Disaster Recovery Guide
|
||||
|
||||
This guide covers etcd database operations, maintenance, and disaster recovery procedures for Talos Linux clusters.
|
||||
|
||||
## etcd Health Monitoring
|
||||
|
||||
### Basic Health Checks
|
||||
```bash
|
||||
# Check etcd status across all control plane nodes
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
|
||||
# Check etcd alarms
|
||||
talosctl -n <IP> etcd alarm list
|
||||
|
||||
# Check etcd members
|
||||
talosctl -n <IP> etcd members
|
||||
|
||||
# Check service status
|
||||
talosctl -n <IP> service etcd
|
||||
```
|
||||
|
||||
### Understanding etcd Status Output
|
||||
```
|
||||
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
|
||||
172.20.0.2 a49c021e76e707db 17 MB 4.5 MB (26.10%) ecebb05b59a776f1 53391 4 53391 false
|
||||
```
|
||||
|
||||
**Key Metrics**:
|
||||
- **DB SIZE**: Total database size on disk
|
||||
- **IN USE**: Actual data size (fragmentation = DB SIZE - IN USE)
|
||||
- **LEADER**: Current etcd cluster leader
|
||||
- **RAFT INDEX**: Consensus log position
|
||||
- **LEARNER**: Whether node is still joining cluster
|
||||
|
||||
## Space Quota Management
|
||||
|
||||
### Default Configuration
|
||||
- Default space quota: 2 GiB
|
||||
- Recommended maximum: 8 GiB
|
||||
- Database locks when quota exceeded
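
The quota is configured in bytes, so the GiB figures above need converting before they go into the machine configuration. A small sketch of that conversion (`gib_to_bytes` is an illustrative helper, not a Talos tool):

```bash
#!/usr/bin/env bash
# Convert GiB to bytes for etcd's quota-backend-bytes setting.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

gib_to_bytes 2   # default quota        -> 2147483648
gib_to_bytes 4   #                      -> 4294967296
gib_to_bytes 8   # recommended maximum  -> 8589934592
```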
|
||||
|
||||
### Quota Exceeded Handling
|
||||
**Symptoms**:
|
||||
```bash
|
||||
talosctl -n <IP> etcd alarm list
|
||||
# Output: ALARM: NOSPACE
|
||||
```
|
||||
|
||||
**Resolution**:
|
||||
1. Increase quota in machine configuration:
|
||||
```yaml
|
||||
cluster:
|
||||
etcd:
|
||||
extraArgs:
|
||||
quota-backend-bytes: 4294967296 # 4 GiB
|
||||
```
|
||||
|
||||
2. Apply configuration and reboot:
|
||||
```bash
|
||||
talosctl -n <IP> apply-config --file updated-config.yaml --mode reboot
|
||||
```
|
||||
|
||||
3. Clear the alarm:
|
||||
```bash
|
||||
talosctl -n <IP> etcd alarm disarm
|
||||
```
|
||||
|
||||
## Database Defragmentation
|
||||
|
||||
### When to Defragment
|
||||
- In use/DB size ratio < 0.5 (heavily fragmented)
|
||||
- Database size exceeds quota but actual data is small
|
||||
- Performance degradation due to fragmentation
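
The first criterion can be checked with simple integer arithmetic once the DB SIZE and IN USE values are expressed in bytes. This is a sketch under that assumption; the helper names are illustrative, and the 0.5 threshold is the one given above.

```bash
#!/usr/bin/env bash
# Return 0 (true) when less than half of the database file holds live data.
# Arguments are byte counts: <db-size> <in-use>.
needs_defrag() {
  local db_size="$1" in_use="$2"
  (( in_use * 2 < db_size ))
}

# Whole-number percentage of the file that is in use (db-size must be non-zero).
in_use_percent() { echo $(( $2 * 100 / $1 )); }

# Approximate values from the status example: 17 MB on disk, 4.5 MB in use.
needs_defrag 17000000 4500000 && echo "defragment recommended"
in_use_percent 17000000 4500000   # -> 26
```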
|
||||
|
||||
### Defragmentation Process
|
||||
```bash
|
||||
# Check fragmentation status
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
|
||||
# Defragment single node (resource-intensive operation)
|
||||
talosctl -n <IP1> etcd defrag
|
||||
|
||||
# Verify defragmentation results
|
||||
talosctl -n <IP1> etcd status
|
||||
```
|
||||
|
||||
**Important Notes**:
|
||||
- Defragment one node at a time
|
||||
- Operation blocks reads/writes during execution
|
||||
- Can significantly improve performance if heavily fragmented
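
One way to enforce the "one node at a time" rule is a loop that stops at the first failure. This is a sketch under assumptions: `defrag_sequential` and the `TALOSCTL` override variable are hypothetical names for illustration, and the node addresses are placeholders.

```bash
# Sketch: defragment control plane nodes strictly one at a time.
# TALOSCTL is a hypothetical override so the loop can be dry-run with a stub.
defrag_sequential() {
  local node
  for node in "$@"; do
    echo "defragmenting $node" >&2
    # Stop at the first failure rather than continuing to the next member.
    "${TALOSCTL:-talosctl}" -n "$node" etcd defrag || return 1
    "${TALOSCTL:-talosctl}" -n "$node" etcd status || return 1
  done
}

# Usage (placeholder addresses): defrag_sequential 10.0.0.1 10.0.0.2 10.0.0.3
```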
|
||||
|
||||
### Post-Defragmentation Verification
|
||||
After successful defragmentation, DB size should closely match IN USE size:
|
||||
```
|
||||
NODE MEMBER DB SIZE IN USE
|
||||
172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%)
|
||||
```
|
||||
|
||||
## Backup Operations
|
||||
|
||||
### Regular Snapshots
|
||||
```bash
|
||||
# Create consistent snapshot
|
||||
talosctl -n <IP> etcd snapshot db.snapshot
|
||||
```
|
||||
|
||||
**Output Example**:
|
||||
```
|
||||
etcd snapshot saved to "db.snapshot" (2015264 bytes)
|
||||
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
|
||||
```
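
Recording the hash and revision from this output makes it possible to compare them later against the recovery log. A small sketch, assuming the `snapshot info:` line keeps the comma-separated layout shown above; `snapshot_field` is a hypothetical helper.

```bash
# Hypothetical helper: pull one field out of a "snapshot info" line read from stdin.
# Argument: field name, e.g. "hash", "revision", "total keys", "total size"
snapshot_field() {
  sed -n "s/.*$1 \([^,]*\).*/\1/p"
}

line='snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136'
echo "$line" | snapshot_field hash
echo "$line" | snapshot_field revision
```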
|
||||
|
||||
### Disaster Snapshots
|
||||
When the etcd cluster is unhealthy and a normal snapshot fails:
|
||||
```bash
|
||||
# Copy database directly (may be inconsistent)
|
||||
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
|
||||
```
|
||||
|
||||
### Automated Backup Strategy
|
||||
- Schedule regular snapshots (daily/hourly based on change frequency)
|
||||
- Store snapshots in multiple locations
|
||||
- Test restore procedures regularly
|
||||
- Document recovery procedures
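
A scheduled backup job also needs a retention policy. The sketch below assumes a hypothetical naming scheme (`etcd-YYYYmmdd-HHMMSS.snapshot`) and a hypothetical helper `prune_snapshots`. Neither comes from Talos itself.

```bash
# Sketch: keep only the newest <keep> snapshots in <dir>.
# Assumes names like etcd-YYYYmmdd-HHMMSS.snapshot, so lexical order
# equals chronological order.
prune_snapshots() {
  local dir="$1" keep="$2" f
  ls -1 "$dir"/etcd-*.snapshot 2>/dev/null \
    | sort -r \
    | tail -n "+$(( keep + 1 ))" \
    | while IFS= read -r f; do
        rm -- "$f"
      done
}

# A scheduled job could take a snapshot and then prune (placeholder path):
#   talosctl -n <IP> etcd snapshot "/backups/etcd-$(date -u +%Y%m%d-%H%M%S).snapshot"
#   prune_snapshots /backups 7
```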
|
||||
|
||||
## Disaster Recovery
|
||||
|
||||
### Pre-Recovery Assessment
|
||||
**Check if Recovery is Necessary**:
|
||||
```bash
|
||||
# Query etcd health on all control plane nodes
|
||||
talosctl -n <IP1>,<IP2>,<IP3> service etcd
|
||||
|
||||
# Check member list consistency
|
||||
talosctl -n <IP1> etcd members
|
||||
talosctl -n <IP2> etcd members
|
||||
talosctl -n <IP3> etcd members
|
||||
```
|
||||
|
||||
**Recovery is needed when**:
|
||||
- Quorum is lost (majority of nodes down)
|
||||
- etcd data corruption
|
||||
- Complete cluster failure
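
Whether quorum is lost follows from simple majority arithmetic. A minimal sketch with two hypothetical helpers, `quorum_size` and `has_quorum`:

```bash
# quorum_size <members>: votes needed for a majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum <members> <healthy>: succeeds when a majority is still healthy.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

if ! has_quorum 3 1; then
  echo "quorum lost: disaster recovery is required"
fi
```

A three-member cluster tolerates one failed member and a five-member cluster tolerates two.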
|
||||
|
||||
### Recovery Prerequisites
|
||||
1. **Latest etcd snapshot** (preferably consistent)
|
||||
2. **Machine configuration backup**:
|
||||
```bash
|
||||
talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
|
||||
```
|
||||
3. **No init-type nodes** (deprecated, incompatible with recovery)
|
||||
|
||||
### Recovery Procedure
|
||||
|
||||
#### Step 1: Prepare Control Plane Nodes
|
||||
```bash
|
||||
# If nodes have hardware issues, replace them with same configuration
|
||||
# If nodes are running but etcd is corrupted, wipe EPHEMERAL partition:
|
||||
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
|
||||
```
|
||||
|
||||
#### Step 2: Verify etcd State
|
||||
All etcd services should be in "Preparing" state:
|
||||
```bash
|
||||
talosctl -n <IP> service etcd
|
||||
# Expected: STATE: Preparing
|
||||
```
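
When checking several nodes, the state can be extracted from the command output. This sketch assumes `talosctl service etcd` prints key/value rows including a `STATE` row. The sample text below is an assumed layout, not captured output, and `service_state` is a hypothetical helper.

```bash
# Hypothetical helper: extract the STATE value from service output on stdin.
service_state() {
  awk '$1 == "STATE" { print $2; exit }'
}

# Sample output (assumed layout)
sample='NODE     172.20.0.2
ID       etcd
STATE    Preparing
HEALTH   ?'
printf '%s\n' "$sample" | service_state

# Against a real node: talosctl -n <IP> service etcd | service_state
```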
|
||||
|
||||
#### Step 3: Bootstrap from Snapshot
|
||||
```bash
|
||||
# Bootstrap cluster from snapshot
|
||||
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
|
||||
|
||||
# For direct database copies, skip hash check:
|
||||
talosctl -n <IP> bootstrap --recover-from=./db --recover-skip-hash-check
|
||||
```
|
||||
|
||||
#### Step 4: Verify Recovery
|
||||
**Monitor kernel logs** for recovery progress:
|
||||
```bash
|
||||
talosctl -n <IP> dmesg -f
|
||||
```
|
||||
|
||||
**Expected log entries**:
|
||||
```
|
||||
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
|
||||
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot"}
|
||||
```
|
||||
|
||||
**Verify cluster health**:
|
||||
```bash
|
||||
# etcd should become healthy on bootstrap node
|
||||
talosctl -n <IP> service etcd
|
||||
|
||||
# Kubernetes control plane should start
|
||||
kubectl get nodes
|
||||
|
||||
# Other control plane nodes should join automatically
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
```
|
||||
|
||||
## etcd Version Management
|
||||
|
||||
### Downgrade Process (v3.6 to v3.5)
|
||||
**Prerequisites**:
|
||||
- Healthy cluster running v3.6.x
|
||||
- Recent backup snapshot
|
||||
- Downgrade only one minor version at a time
|
||||
|
||||
#### Step 1: Validate Downgrade
|
||||
```bash
|
||||
talosctl -n <IP1> etcd downgrade validate 3.5
|
||||
```
|
||||
|
||||
#### Step 2: Enable Downgrade
|
||||
```bash
|
||||
talosctl -n <IP1> etcd downgrade enable 3.5
|
||||
```
|
||||
|
||||
#### Step 3: Verify Schema Migration
|
||||
```bash
|
||||
# Check storage version migrated to 3.5
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
# Verify STORAGE column shows 3.5.0
|
||||
```
|
||||
|
||||
#### Step 4: Patch Machine Configuration
|
||||
```bash
|
||||
# Transfer leadership if node is leader
|
||||
talosctl -n <IP1> etcd forfeit-leadership
|
||||
|
||||
# Create patch file
|
||||
cat > etcd-patch.yaml <<EOF
|
||||
cluster:
|
||||
etcd:
|
||||
image: gcr.io/etcd-development/etcd:v3.5.22
|
||||
EOF
|
||||
|
||||
# Apply patch with reboot
|
||||
talosctl -n <IP1> patch machineconfig --patch @etcd-patch.yaml --mode reboot
|
||||
```
|
||||
|
||||
#### Step 5: Repeat for All Control Plane Nodes
|
||||
Continue patching remaining control plane nodes one by one.
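
Because the same patch is applied to every control plane node, generating it from a single function avoids typos between nodes. A sketch, assuming a hypothetical helper `write_etcd_image_patch`. The image repository is the one used in Step 4.

```bash
# Hypothetical helper: write a machine configuration patch pinning the etcd image tag.
# Arguments: <etcd_tag> <output_file>
write_etcd_image_patch() {
  cat > "$2" <<EOF
cluster:
  etcd:
    image: gcr.io/etcd-development/etcd:$1
EOF
}

write_etcd_image_patch v3.5.22 etcd-patch.yaml
# Then, one control plane node at a time:
#   talosctl -n <IP> etcd forfeit-leadership
#   talosctl -n <IP> patch machineconfig --patch @etcd-patch.yaml --mode reboot
```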
|
||||
|
||||
## Operational Best Practices
|
||||
|
||||
### Monitoring
|
||||
- Monitor database size and fragmentation regularly
|
||||
- Set up alerts for space quota approaching limits
|
||||
- Track etcd performance metrics (request latency, leader changes)
|
||||
- Monitor disk I/O and network latency
|
||||
|
||||
### Maintenance Windows
|
||||
- Schedule defragmentation during low-traffic periods
|
||||
- Coordinate with application teams for maintenance windows
|
||||
- Test backup/restore procedures in non-production environments
|
||||
|
||||
### Performance Optimization
|
||||
- Use fast storage (NVMe SSDs preferred)
|
||||
- Minimize network latency between control plane nodes
|
||||
- Monitor and tune etcd configuration based on workload
|
||||
|
||||
### Security
|
||||
- Encrypt etcd data at rest
|
||||
- Secure backup storage with appropriate access controls
|
||||
- Regularly rotate certificates
|
||||
- Monitor for unauthorized access attempts
|
||||
|
||||
## Troubleshooting Common Issues
|
||||
|
||||
### Split Brain Prevention
|
||||
- Ensure odd number of control plane nodes
|
||||
- Monitor network connectivity between nodes
|
||||
- Use dedicated network for control plane communication when possible
|
||||
|
||||
### Performance Issues
|
||||
- Check disk I/O latency
|
||||
- Monitor memory usage
|
||||
- Consider vertical scaling before adding nodes
|
||||
- Review etcd request patterns and optimize applications
|
||||
|
||||
### Backup/Restore Issues
|
||||
- Test restore procedures regularly
|
||||
- Verify backup integrity
|
||||
- Ensure consistent network and storage configuration
|
||||
- Document and practice disaster recovery procedures
|
||||
480
ai/talos-v1.11/troubleshooting-guide.md
Normal file
@@ -0,0 +1,480 @@
|
||||
# Talos Troubleshooting Guide
|
||||
|
||||
This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
|
||||
|
||||
## General Troubleshooting Methodology
|
||||
|
||||
### 1. Gather Information
|
||||
```bash
|
||||
# Node status and health
|
||||
talosctl -n <IP> health
|
||||
talosctl -n <IP> version
|
||||
talosctl -n <IP> get members
|
||||
|
||||
# System resources
|
||||
talosctl -n <IP> memory
|
||||
talosctl -n <IP> get disks
|
||||
talosctl -n <IP> processes | head -20
|
||||
|
||||
# Service status
|
||||
talosctl -n <IP> services
|
||||
```
|
||||
|
||||
### 2. Check Logs
|
||||
```bash
|
||||
# Kernel logs (system-level issues)
|
||||
talosctl -n <IP> dmesg | tail -100
|
||||
|
||||
# Service logs
|
||||
talosctl -n <IP> logs machined
|
||||
talosctl -n <IP> logs kubelet
|
||||
talosctl -n <IP> logs containerd
|
||||
|
||||
# System events
|
||||
talosctl -n <IP> events --since=1h
|
||||
```
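
The information-gathering commands above can be captured to files so they can be attached to an incident report. This is a sketch only: `collect_diagnostics` and the `TALOSCTL` override are hypothetical names, and the command list is a subset of the commands shown in this guide.

```bash
# Sketch: save the output of several read-only talosctl commands for one node.
# TALOSCTL is a hypothetical override that allows a dry run with a stub.
collect_diagnostics() {
  local node="$1" out="$2" cmd
  mkdir -p "$out" || return 1
  for cmd in version services memory dmesg; do
    "${TALOSCTL:-talosctl}" -n "$node" "$cmd" > "$out/$cmd.txt" 2>&1 \
      || echo "warning: '$cmd' failed on $node" >&2
  done
}

# Usage (placeholder address): collect_diagnostics 10.0.0.1 ./diag-10.0.0.1
```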
|
||||
|
||||
### 3. Network Connectivity
|
||||
```bash
|
||||
# Discovery and membership
|
||||
talosctl get affiliates
|
||||
talosctl get members
|
||||
|
||||
# Network interfaces
|
||||
talosctl -n <IP> get links
|
||||
talosctl -n <IP> get addresses
|
||||
|
||||
# Control plane connectivity
|
||||
kubectl get nodes
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
```
|
||||
|
||||
## Bootstrap and Initial Setup Issues
|
||||
|
||||
### Cluster Bootstrap Failures
|
||||
|
||||
**Symptoms**: Bootstrap command fails or times out
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check etcd service state
|
||||
talosctl -n <IP> service etcd
|
||||
|
||||
# Check if node is trying to join instead of bootstrap
|
||||
talosctl -n <IP> logs etcd | grep -i bootstrap
|
||||
|
||||
# Verify machine configuration
|
||||
talosctl -n <IP> get machineconfig -o yaml
|
||||
```
|
||||
|
||||
**Common Causes & Solutions**:
|
||||
1. **Wrong node type**: Ensure the machine type is `controlplane`, not the deprecated `init`
|
||||
2. **Network issues**: Verify control plane endpoint connectivity
|
||||
3. **Configuration errors**: Check machine configuration validity
|
||||
4. **Previous bootstrap**: etcd data exists from previous attempts
|
||||
|
||||
**Resolution**:
|
||||
```bash
|
||||
# Reset node if previous bootstrap data exists
|
||||
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
|
||||
|
||||
# Re-apply configuration and bootstrap
|
||||
talosctl apply-config --nodes <IP> --file controlplane.yaml
|
||||
talosctl bootstrap --nodes <IP>
|
||||
```
|
||||
|
||||
### Node Join Issues
|
||||
|
||||
**Symptoms**: New nodes don't join cluster
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check discovery
|
||||
talosctl get affiliates
|
||||
talosctl get members
|
||||
|
||||
# Check bootstrap token
|
||||
kubectl get secrets -n kube-system | grep bootstrap-token
|
||||
|
||||
# Check kubelet logs
|
||||
talosctl -n <IP> logs kubelet | grep -i certificate
|
||||
```
|
||||
|
||||
**Common Solutions**:
|
||||
```bash
|
||||
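# Talos does not ship kubeadm; join credentials come from the machine configuration
|
||||
# Confirm the node's config was generated from the same cluster secrets (token, CA) as the control plane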
|
||||
|
||||
# Verify discovery service connectivity
|
||||
talosctl -n <IP> get affiliates --namespace=cluster-raw
|
||||
|
||||
# Check machine configuration matches cluster
|
||||
talosctl -n <IP> get machineconfig -o yaml
|
||||
```
|
||||
|
||||
## Control Plane Issues
|
||||
|
||||
### etcd Problems
|
||||
|
||||
**etcd Won't Start**:
|
||||
```bash
|
||||
# Check etcd service status and logs
|
||||
talosctl -n <IP> service etcd
|
||||
talosctl -n <IP> logs etcd
|
||||
|
||||
# Check etcd data directory
|
||||
talosctl -n <IP> list /var/lib/etcd
|
||||
|
||||
# Check disk space and permissions
|
||||
talosctl -n <IP> mounts
|
||||
```
|
||||
|
||||
**etcd Quorum Loss**:
|
||||
```bash
|
||||
# Check member status
|
||||
talosctl -n <IP1>,<IP2>,<IP3> etcd status
|
||||
talosctl -n <IP> etcd members
|
||||
|
||||
# Identify healthy members
|
||||
for ip in IP1 IP2 IP3; do
|
||||
echo "=== Node $ip ==="
|
||||
talosctl -n $ip service etcd
|
||||
done
|
||||
```
|
||||
|
||||
**Solution for Quorum Loss**:
|
||||
1. If majority available: Remove failed members, add replacements
|
||||
2. If majority lost: Follow disaster recovery procedure
|
||||
|
||||
### API Server Issues
|
||||
|
||||
**API Server Not Responding**:
|
||||
```bash
|
||||
# Check API server pod status
|
||||
kubectl get pods -n kube-system | grep apiserver
|
||||
|
||||
# Check API server configuration
|
||||
talosctl -n <IP> get apiserverconfig -o yaml
|
||||
|
||||
# Check control plane endpoint
|
||||
curl -k https://<control-plane-endpoint>:6443/healthz
|
||||
```
|
||||
|
||||
**Common Solutions**:
|
||||
```bash
|
||||
# Restart kubelet to reload static pods
|
||||
talosctl -n <IP> service kubelet restart
|
||||
|
||||
# Check for configuration issues
|
||||
talosctl -n <IP> logs kubelet | grep apiserver
|
||||
|
||||
# Verify etcd connectivity
|
||||
talosctl -n <IP> etcd status
|
||||
```
|
||||
|
||||
## Node-Level Issues
|
||||
|
||||
### Kubelet Problems
|
||||
|
||||
**Kubelet Service Issues**:
|
||||
```bash
|
||||
# Check kubelet status and logs
|
||||
talosctl -n <IP> service kubelet
|
||||
talosctl -n <IP> logs kubelet | tail -50
|
||||
|
||||
# Check kubelet configuration
|
||||
talosctl -n <IP> get kubeletconfig -o yaml
|
||||
|
||||
# Check container runtime
|
||||
talosctl -n <IP> service containerd
|
||||
```
|
||||
|
||||
**Common Kubelet Issues**:
|
||||
1. **Certificate problems**: Check certificate expiration and rotation
|
||||
2. **Container runtime issues**: Verify containerd health
|
||||
3. **Resource constraints**: Check memory and disk space
|
||||
4. **Network connectivity**: Verify API server connectivity
|
||||
|
||||
### Container Runtime Issues
|
||||
|
||||
**Containerd Problems**:
|
||||
```bash
|
||||
# Check containerd service
|
||||
talosctl -n <IP> service containerd
|
||||
talosctl -n <IP> logs containerd
|
||||
|
||||
# List containers
|
||||
talosctl -n <IP> containers
|
||||
talosctl -n <IP> containers -k # Kubernetes containers
|
||||
|
||||
# Check containerd configuration
|
||||
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
|
||||
```
|
||||
|
||||
**Common Solutions**:
|
||||
```bash
|
||||
# Restart containerd
|
||||
talosctl -n <IP> service containerd restart
|
||||
|
||||
# Check disk space for container images
|
||||
talosctl -n <IP> mounts
|
||||
|
||||
# Clean up unused containers/images
|
||||
# (This happens automatically via kubelet GC)
|
||||
```
|
||||
|
||||
## Network Issues
|
||||
|
||||
### Network Connectivity Problems
|
||||
|
||||
**Node-to-Node Connectivity**:
|
||||
```bash
|
||||
# Test basic network connectivity
|
||||
talosctl -n <IP1> get links
|
||||
talosctl -n <IP1> get routes
|
||||
|
||||
# Check resolver configuration
|
||||
talosctl -n <IP1> read /etc/resolv.conf
|
||||
|
||||
# Check network configuration
|
||||
talosctl -n <IP> get networkconfig -o yaml
|
||||
```
|
||||
|
||||
**DNS Resolution Issues**:
|
||||
```bash
|
||||
# Check DNS configuration
|
||||
talosctl -n <IP> read /etc/resolv.conf
|
||||
|
||||
# Test in-cluster DNS resolution from a temporary pod (talosctl has no exec command)
|
||||
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
|
||||
```
|
||||
|
||||
### Discovery Service Issues
|
||||
|
||||
**Discovery Not Working**:
|
||||
```bash
|
||||
# Check discovery configuration
|
||||
talosctl get discoveryconfig -o yaml
|
||||
|
||||
# Check affiliate discovery
|
||||
talosctl get affiliates
|
||||
talosctl get affiliates --namespace=cluster-raw
|
||||
|
||||
# Test discovery service connectivity
|
||||
curl -v https://discovery.talos.dev/
|
||||
```
|
||||
|
||||
**KubeSpan Issues** (if enabled):
|
||||
```bash
|
||||
# Check KubeSpan configuration
|
||||
talosctl get kubespanconfig -o yaml
|
||||
|
||||
# Check peer status
|
||||
talosctl get kubespanpeerspecs
|
||||
talosctl get kubespanpeerstatuses
|
||||
|
||||
# Check WireGuard interface
|
||||
talosctl -n <IP> get links | grep kubespan
|
||||
```
|
||||
|
||||
## Upgrade Issues
|
||||
|
||||
### OS Upgrade Problems
|
||||
|
||||
**Upgrade Fails or Hangs**:
|
||||
```bash
|
||||
# Check upgrade status
|
||||
talosctl -n <IP> dmesg | grep -i upgrade
|
||||
talosctl -n <IP> events | grep -i upgrade
|
||||
|
||||
# Use staged upgrade for filesystem lock issues
|
||||
talosctl upgrade --nodes <IP> --image <image> --stage
|
||||
|
||||
# Monitor upgrade progress
|
||||
talosctl upgrade --nodes <IP> --image <image> --wait --debug
|
||||
```
|
||||
|
||||
**Boot Issues After Upgrade**:
|
||||
```bash
|
||||
# Check boot logs
|
||||
talosctl -n <IP> dmesg | head -100
|
||||
|
||||
# System automatically rolls back on boot failure
|
||||
# Check current version
|
||||
talosctl -n <IP> version
|
||||
|
||||
# Manual rollback if needed
|
||||
talosctl rollback --nodes <IP>
|
||||
```
|
||||
|
||||
### Kubernetes Upgrade Issues
|
||||
|
||||
**K8s Upgrade Failures**:
|
||||
```bash
|
||||
# Check upgrade status
|
||||
talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
|
||||
|
||||
# Check individual component status
|
||||
kubectl get pods -n kube-system
|
||||
talosctl -n <IP> get apiserverconfig -o yaml
|
||||
```
|
||||
|
||||
**Version Mismatch Issues**:
|
||||
```bash
|
||||
# Check version consistency
|
||||
kubectl get nodes -o wide
|
||||
talosctl -n <IP1>,<IP2>,<IP3> version
|
||||
|
||||
# Check component versions
|
||||
kubectl get pods -n kube-system -o wide
|
||||
```
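
Version consistency can also be checked mechanically once the versions are listed one per line. A sketch, assuming a hypothetical helper `versions_consistent`. The version strings in the example are made up for illustration.

```bash
# Hypothetical helper: succeed when every version read from stdin (one per line) is identical.
versions_consistent() {
  [ "$(sort -u | grep -c .)" -le 1 ]
}

# Example input could come from:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
if printf 'v1.33.1\nv1.33.1\nv1.32.0\n' | versions_consistent; then
  echo "all nodes report the same version"
else
  echo "version mismatch detected"
fi
```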
|
||||
|
||||
## Resource and Performance Issues
|
||||
|
||||
### Memory and Storage Problems
|
||||
|
||||
**Out of Memory**:
|
||||
```bash
|
||||
# Check memory usage
|
||||
talosctl -n <IP> memory
|
||||
talosctl -n <IP> processes --sort rss | head -20
|
||||
|
||||
# Check for memory pressure
|
||||
kubectl describe node <node-name> | grep -A 10 Conditions
|
||||
|
||||
# Check OOM events
|
||||
talosctl -n <IP> dmesg | grep -i "out of memory"
|
||||
```
|
||||
|
||||
**Disk Space Issues**:
|
||||
```bash
|
||||
# Check disk usage
|
||||
talosctl -n <IP> mounts
|
||||
talosctl -n <IP> get disks
|
||||
|
||||
# Check specific directories
|
||||
talosctl -n <IP> list /var/lib/containerd
|
||||
talosctl -n <IP> list /var/lib/etcd
|
||||
|
||||
# Check whether the kubelet reports disk pressure (automatic image and container GC normally reclaims space)
|
||||
kubectl describe node <node-name> | grep DiskPressure
|
||||
```
|
||||
|
||||
### Performance Issues
|
||||
|
||||
**Slow Cluster Response**:
|
||||
```bash
|
||||
# Check API server response time
|
||||
time kubectl get nodes
|
||||
|
||||
# Check etcd performance
|
||||
talosctl -n <IP> etcd status
|
||||
# Look for high DB size vs IN USE ratio (fragmentation)
|
||||
|
||||
# Check system load
|
||||
talosctl -n <IP> processes --sort cpu | head -10
|
||||
talosctl -n <IP> memory
|
||||
```
|
||||
|
||||
**High CPU/Memory Usage**:
|
||||
```bash
|
||||
# Identify resource-heavy processes
|
||||
talosctl -n <IP> processes --sort cpu | head -10
|
||||
talosctl -n <IP> processes --sort rss | head -10
|
||||
|
||||
# Check cgroup usage
|
||||
talosctl -n <IP> cgroups --preset memory
|
||||
talosctl -n <IP> cgroups --preset cpu
|
||||
```
|
||||
|
||||
## Configuration Issues
|
||||
|
||||
### Machine Configuration Problems
|
||||
|
||||
**Invalid Configuration**:
|
||||
```bash
|
||||
# Validate configuration before applying
|
||||
talosctl validate --config machineconfig.yaml --mode metal
|
||||
|
||||
# Check current configuration
|
||||
talosctl -n <IP> get machineconfig -o yaml
|
||||
|
||||
# Compare with expected configuration
|
||||
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -) expected-config.yaml
|
||||
```
|
||||
|
||||
**Configuration Drift**:
|
||||
```bash
|
||||
# Check configuration version
|
||||
talosctl -n <IP> get machineconfig
|
||||
|
||||
# Re-apply configuration if needed
|
||||
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
|
||||
talosctl apply-config --nodes <IP> --file corrected-config.yaml
|
||||
```
|
||||
|
||||
## Emergency Procedures
|
||||
|
||||
### Node Unresponsive
|
||||
|
||||
**Complete Node Failure**:
|
||||
1. **Physical access required**: Power cycle or hardware reset
|
||||
2. **Check hardware**: Memory, disk, network interface status
|
||||
3. **Boot issues**: May require bootable recovery media
|
||||
|
||||
**Partial Connectivity**:
|
||||
```bash
|
||||
# Try different network interfaces if multiple available
|
||||
talosctl -e <alternate-ip> -n <IP> health
|
||||
|
||||
# Check if specific services are running
|
||||
talosctl -n <IP> service machined
|
||||
talosctl -n <IP> service apid
|
||||
```
|
||||
|
||||
### Cluster-Wide Failures
|
||||
|
||||
**All Control Plane Nodes Down**:
|
||||
1. **Assess scope**: Determine if data corruption or hardware failure
|
||||
2. **Recovery strategy**: Use etcd backup if available
|
||||
3. **Rebuild process**: May require complete cluster rebuild
|
||||
|
||||
**Follow disaster recovery procedures** as documented in etcd-management.md.
|
||||
|
||||
### Emergency Reset Procedures
|
||||
|
||||
**Single Node Reset**:
|
||||
```bash
|
||||
# Graceful reset (default): cordons and drains the node and leaves etcd before wiping
|
||||
talosctl -n <IP> reset
|
||||
|
||||
# Forced reset: skips cordon/drain and etcd leave, then wipes the node and reboots
|
||||
talosctl -n <IP> reset --graceful=false --reboot
|
||||
|
||||
# Selective wipe (preserve STATE partition)
|
||||
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
|
||||
```
|
||||
|
||||
**Cluster Reset** (DESTRUCTIVE):
|
||||
```bash
|
||||
# Reset all nodes (DANGER: DATA LOSS)
|
||||
for ip in IP1 IP2 IP3; do
|
||||
talosctl -n $ip reset --graceful=false --reboot
|
||||
done
|
||||
```
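
Because the loop above is irreversible, it may be worth guarding it with an explicit confirmation. The sketch below assumes a hypothetical helper `confirm_destructive`. The cluster name and addresses in the usage comment are placeholders.

```bash
# Sketch: require the operator to type the cluster name before a destructive action.
# confirm_destructive <expected_name>: reads one line from stdin and compares it.
confirm_destructive() {
  local expected="$1" answer
  printf 'Type "%s" to confirm the reset: ' "$expected" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$expected" ]
}

# Usage (placeholder cluster name and node addresses):
#   confirm_destructive prod-cluster && for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
#     talosctl -n "$ip" reset --graceful=false --reboot
#   done
```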
|
||||
|
||||
## Monitoring and Alerting
|
||||
|
||||
### Key Metrics to Monitor
|
||||
- Node resource usage (CPU, memory, disk)
|
||||
- etcd health and performance
|
||||
- Control plane component status
|
||||
- Network connectivity
|
||||
- Certificate expiration
|
||||
- Discovery service connectivity
|
||||
|
||||
### Log Locations for External Monitoring
|
||||
- Kernel logs: `talosctl dmesg`
|
||||
- Service logs: `talosctl logs <service>`
|
||||
- System events: `talosctl events`
|
||||
- Kubernetes events: `kubectl get events`
|
||||
|
||||
This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.
|
||||