This directory contains comprehensive documentation extracted from the official Talos v1.11 documentation, organized specifically to help AI agents become expert Talos cluster administrators.

Documentation Structure

Core Operations

cluster-operations.md - Essential cluster operations including upgrades, node management, and configuration
cli-essentials.md - Key talosctl commands and usage patterns for daily administration

System Understanding

architecture-and-components.md - Deep dive into Talos architecture, components, and design principles
discovery-and-networking.md - Cluster discovery mechanisms and network configuration

Specialized Operations

etcd-management.md - etcd operations, maintenance, backup, and disaster recovery
bare-metal-administration.md - Bare metal specific configurations, security, and hardware management
troubleshooting-guide.md - Systematic approaches to diagnosing and resolving common issues

Quick Reference

Essential Commands for New Agents

# Cluster health check
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>

# Node information
talosctl get members
talosctl -n <IP> version

# Service status
talosctl -n <IP> services
talosctl -n <IP> service kubelet

# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks

# Logs and events
talosctl -n <IP> dmesg | tail -50
talosctl -n <IP> logs kubelet
talosctl -n <IP> events --since=1h

Critical Procedures

Bootstrap: talosctl bootstrap --nodes <first-controlplane-ip>
Backup etcd: talosctl -n <IP> etcd snapshot db.snapshot
Upgrade OS: talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
Upgrade K8s: talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1

Emergency Commands

Node reset: talosctl -n <IP> reset
Force reset: talosctl -n <IP> reset --graceful=false --reboot
Disaster recovery: talosctl -n <IP> bootstrap --recover-from=./db.snapshot
Rollback: talosctl rollback --nodes <IP>

Bare Metal Specific Commands

Check hardware: talosctl -n <IP> disks, talosctl -n <IP> read /proc/cpuinfo
Network interfaces: talosctl -n <IP> get addresses, talosctl -n <IP> get routes
Extensions: talosctl -n <IP> get extensions
Encryption status: talosctl -n <IP> get encryptionconfig -o yaml
Hardware monitoring: talosctl -n <IP> dmesg | grep -i error

Key Concepts for Agents

Architecture Fundamentals

Immutable OS: Single image, atomic updates, A-B rollback system
API-driven: All management through gRPC API, no SSH/shell access
Controller pattern: Kubernetes-style resource controllers for system management
Minimal attack surface: Only services necessary for Kubernetes

Control Plane Design

etcd quorum: Requires majority for operations (3-node=2, 5-node=3)
Bootstrap process: One-time initialization of etcd cluster
HA considerations: Odd numbers of nodes, avoid even numbers
Upgrade strategy: Rolling upgrades with automatic rollback on failure

Network and Discovery

Service discovery: Encrypted discovery service for cluster membership
KubeSpan: Optional WireGuard mesh networking
mTLS everywhere: All Talos API communication secured
Discovery registries: Service (default) and Kubernetes (deprecated)

Bare Metal Considerations

META configuration: Network config embedded in disk images
Hardware compatibility: Driver support and firmware requirements
Disk encryption: LUKS2 with TPM, static keys, or node ID
SecureBoot: UKI images with embedded signatures
System extensions: Hardware-specific drivers and tools
Performance tuning: CPU governors, IOMMU, memory management

Common Administration Patterns

Daily Operations

Check cluster health across all nodes
Monitor resource usage and capacity
Review system events and logs
Verify etcd health and backup status
Monitor discovery service connectivity

Maintenance Windows

Plan upgrade sequence (workers first, then control plane)
Create etcd backup before major changes
Apply configuration changes with dry-run first
Monitor upgrade progress and be ready to rollback
Verify cluster functionality after changes

Troubleshooting Workflow

Gather information: Health, version, resources, logs
Check connectivity: Network, discovery, API endpoints
Examine services: Status of critical services
Review logs: System events, service logs, kernel messages
Apply fixes: Configuration patches, service restarts, node resets

Best Practices for Agents

Configuration Management

Use reproducible configuration workflow (secrets + patches)
Always dry-run configuration changes first
Store machine configurations in version control
Test configuration changes in non-production first

Operational Safety

Take etcd snapshots before major changes
Upgrade one node at a time
Monitor upgrade progress and have rollback ready
Test disaster recovery procedures regularly

Performance Optimization

Monitor etcd fragmentation and defragment when needed
Scale vertically before horizontally for control plane
Use appropriate hardware for etcd (fast storage, low network latency)
Monitor resource usage trends and capacity planning

This documentation provides the essential knowledge needed to effectively administer Talos Linux clusters, organized by operational context and complexity level.