Talos v1.11 Agent Context Documentation

This directory contains comprehensive documentation extracted from the official Talos v1.11 documentation, organized specifically to help AI agents become expert Talos cluster administrators.

Documentation Structure

Core Operations

  • cluster-operations.md - Essential cluster operations including upgrades, node management, and configuration
  • cli-essentials.md - Key talosctl commands and usage patterns for daily administration

System Understanding

Specialized Operations

Quick Reference

Essential Commands for New Agents

# Cluster health check
talosctl -n <IP1> health --control-plane-nodes <IP1>,<IP2>,<IP3>

# Node information
talosctl get members
talosctl -n <IP> version

# Service status
talosctl -n <IP> services
talosctl -n <IP> service kubelet

# System resources
talosctl -n <IP> memory
talosctl -n <IP> get disks

# Logs and events
talosctl -n <IP> dmesg | tail -50
talosctl -n <IP> logs kubelet
talosctl -n <IP> events --since=1h

Critical Procedures

  • Bootstrap: talosctl bootstrap --nodes <first-controlplane-ip>
  • Backup etcd: talosctl -n <IP> etcd snapshot db.snapshot
  • Upgrade OS: talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
  • Upgrade K8s: talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1

Emergency Commands

  • Node reset: talosctl -n <IP> reset
  • Force reset: talosctl -n <IP> reset --graceful=false --reboot
  • Disaster recovery: talosctl -n <IP> bootstrap --recover-from=./db.snapshot
  • Rollback: talosctl rollback --nodes <IP>

Bare Metal Specific Commands

  • Check hardware: talosctl -n <IP> get disks, talosctl -n <IP> read /proc/cpuinfo
  • Network interfaces: talosctl -n <IP> get addresses, talosctl -n <IP> get routes
  • Extensions: talosctl -n <IP> get extensions
  • Encryption status: talosctl -n <IP> get encryptionconfig -o yaml
  • Hardware monitoring: talosctl -n <IP> dmesg | grep -i error

Key Concepts for Agents

Architecture Fundamentals

  • Immutable OS: Single image, atomic updates, A-B rollback system
  • API-driven: All management through gRPC API, no SSH/shell access
  • Controller pattern: Kubernetes-style resource controllers for system management
  • Minimal attack surface: Only services necessary for Kubernetes

Control Plane Design

  • etcd quorum: A majority of members must be available for operations (2 of 3 nodes, 3 of 5 nodes)
  • Bootstrap process: One-time initialization of etcd cluster
  • HA considerations: Use an odd number of control plane nodes; an even number adds no fault tolerance over the next lower odd number
  • Upgrade strategy: Rolling upgrades with automatic rollback on failure
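
The quorum figures above follow from majority arithmetic. As a minimal sketch, the hypothetical shell helpers below (illustrative names, not part of talosctl) compute the quorum size and the number of member failures a cluster of a given size can tolerate.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they do not call talosctl.

# Members that must be healthy in a cluster of $1 members: floor(n/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Member failures the cluster can survive: n minus the quorum size.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print quorum and fault tolerance for cluster sizes 1 through 5.
quorum_table() {
  for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
  done
}
```

Running quorum_table shows that 4 members tolerate one failure, the same as 3 members, which is why even member counts are avoided.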

Network and Discovery

  • Service discovery: Encrypted discovery service for cluster membership
  • KubeSpan: Optional WireGuard mesh networking
  • mTLS everywhere: All Talos API communication secured
  • Discovery registries: Service (default) and Kubernetes (deprecated)
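
To see what a node has learned through discovery, the resources behind these features can be read directly. The sketch below wraps three read-only queries in a hypothetical function, check_discovery (an illustrative name, not a talosctl feature); the KubeSpan query is only meaningful when KubeSpan is enabled.

```shell
#!/bin/sh
# check_discovery NODE
#   Hypothetical wrapper: read-only view of the peers one node knows about.
check_discovery() {
  node="$1"
  echo "== affiliates =="
  talosctl -n "$node" get affiliates
  echo "== members =="
  talosctl -n "$node" get members
  echo "== KubeSpan peer statuses (only meaningful with KubeSpan enabled) =="
  talosctl -n "$node" get kubespanpeerstatuses
}
```

Usage: check_discovery <IP>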

Bare Metal Considerations

  • META configuration: Network config embedded in disk images
  • Hardware compatibility: Driver support and firmware requirements
  • Disk encryption: LUKS2 with TPM, static keys, or node ID
  • SecureBoot: UKI images with embedded signatures
  • System extensions: Hardware-specific drivers and tools
  • Performance tuning: CPU governors, IOMMU, memory management

Common Administration Patterns

Daily Operations

  1. Check cluster health across all nodes
  2. Monitor resource usage and capacity
  3. Review system events and logs
  4. Verify etcd health and backup status
  5. Monitor discovery service connectivity
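
The daily checklist above could be scripted roughly as follows. This is a sketch under assumptions: daily_check is a hypothetical function name, the argument order is arbitrary, and the script only chains commands already listed in this document plus talosctl etcd status.

```shell
#!/bin/sh
# daily_check CP_NODES SNAPSHOT_PATH
#   CP_NODES       comma-separated control plane IPs, e.g. 10.0.0.1,10.0.0.2,10.0.0.3
#   SNAPSHOT_PATH  local file that receives the etcd snapshot
daily_check() {
  cp_nodes="$1"
  snapshot_path="$2"
  first_cp="${cp_nodes%%,*}"   # text before the first comma

  # 1. Cluster health, queried through a single control plane node.
  talosctl -n "$first_cp" health --control-plane-nodes "$cp_nodes" || return 1
  # 2. Memory usage on every control plane node.
  talosctl -n "$cp_nodes" memory
  # 3. Most recent kernel messages.
  talosctl -n "$first_cp" dmesg | tail -n 50
  # 4. etcd member status, then a fresh snapshot as the backup.
  talosctl -n "$cp_nodes" etcd status || return 1
  talosctl -n "$first_cp" etcd snapshot "$snapshot_path" || return 1
  # 5. Cluster membership as seen through discovery.
  talosctl -n "$first_cp" get members
}
```

Usage: daily_check <IP1>,<IP2>,<IP3> ./db.snapshot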

Maintenance Windows

  1. Plan upgrade sequence (workers first, then control plane)
  2. Create etcd backup before major changes
  3. Apply configuration changes with dry-run first
  4. Monitor upgrade progress and be ready to roll back
  5. Verify cluster functionality after changes
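
One way to encode the upgrade part of this sequence is sketched below. maintenance_upgrade is a hypothetical function, not a talosctl feature: it takes an etcd snapshot first, then upgrades the given nodes one at a time in the order passed (so the caller decides the sequence), and stops at the first failure so a rollback can be considered.

```shell
#!/bin/sh
# maintenance_upgrade IMAGE SNAPSHOT_PATH CP_NODE NODE...
#   IMAGE          installer image to upgrade to
#   SNAPSHOT_PATH  local file that receives the pre-upgrade etcd snapshot
#   CP_NODE        control plane node used for the snapshot and health checks
#   NODE...        nodes to upgrade, in the order they should be upgraded
maintenance_upgrade() {
  image="$1"
  snapshot_path="$2"
  cp_node="$3"
  shift 3

  # Back up etcd before changing anything.
  talosctl -n "$cp_node" etcd snapshot "$snapshot_path" || return 1

  for node in "$@"; do
    # Upgrade a single node, then confirm the cluster is healthy
    # before touching the next one.
    talosctl upgrade --nodes "$node" --image "$image" || return 1
    talosctl -n "$cp_node" health || return 1
  done
}
```

Usage: maintenance_upgrade ghcr.io/siderolabs/installer:v1.11.x ./db.snapshot <controlplane> <worker1> <worker2> <controlplane>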

Troubleshooting Workflow

  1. Gather information: Health, version, resources, logs
  2. Check connectivity: Network, discovery, API endpoints
  3. Examine services: Status of critical services
  4. Review logs: System events, service logs, kernel messages
  5. Apply fixes: Configuration patches, service restarts, node resets
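
The information-gathering steps (1 to 4) lend themselves to a small read-only script. The sketch below assumes a hypothetical function name, gather_diagnostics, and an arbitrary choice of 50 log lines; step 5 is deliberately left to the operator.

```shell
#!/bin/sh
# gather_diagnostics NODE
#   Hypothetical helper: prints read-only diagnostics for one node to stdout.
gather_diagnostics() {
  node="$1"

  echo "== 1. version and memory =="
  talosctl -n "$node" version
  talosctl -n "$node" memory

  echo "== 2. connectivity =="
  talosctl -n "$node" get addresses
  talosctl -n "$node" get routes
  talosctl -n "$node" get members

  echo "== 3. services =="
  talosctl -n "$node" services

  echo "== 4. logs =="
  talosctl -n "$node" logs kubelet | tail -n 50
  talosctl -n "$node" dmesg | tail -n 50
}
```

Usage: gather_diagnostics <IP> > node-report.txt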

Best Practices for Agents

Configuration Management

  • Use reproducible configuration workflow (secrets + patches)
  • Always dry-run configuration changes first
  • Store machine configurations in version control
  • Test configuration changes in non-production first
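
A minimal sketch of the secrets-plus-patches workflow, assuming hypothetical function names (generate_configs, preview_config) and a single patch file: the secrets bundle is generated once and reused, so machine configs can be regenerated from the same inputs at any time.

```shell
#!/bin/sh
# generate_configs CLUSTER_NAME ENDPOINT PATCH_FILE OUT_DIR
#   Hypothetical helper: regenerates machine configs from a stored secrets
#   bundle plus a patch file.
generate_configs() {
  cluster_name="$1"
  endpoint="$2"
  patch_file="$3"
  out_dir="$4"

  # Create the secrets bundle only on the first run. It holds the cluster's
  # private keys, so keep it out of plain-text version control.
  if [ ! -f "$out_dir/secrets.yaml" ]; then
    talosctl gen secrets -o "$out_dir/secrets.yaml" || return 1
  fi

  # Same secrets + same patches => same machine configs.
  talosctl gen config "$cluster_name" "$endpoint" \
    --with-secrets "$out_dir/secrets.yaml" \
    --config-patch "@$patch_file" \
    --output "$out_dir" --force
}

# preview_config NODE CONFIG_FILE
#   Shows what applying CONFIG_FILE would change without applying it.
preview_config() {
  talosctl -n "$1" apply-config --file "$2" --dry-run
}
```

Usage: generate_configs <cluster-name> https://<controlplane>:6443 patch.yaml ./out, then preview_config <IP> ./out/controlplane.yaml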

Operational Safety

  • Take etcd snapshots before major changes
  • Upgrade one node at a time
  • Monitor upgrade progress and have rollback ready
  • Test disaster recovery procedures regularly

Performance Optimization

  • Monitor etcd fragmentation and defragment when needed
  • Scale vertically before horizontally for control plane
  • Use appropriate hardware for etcd (fast storage, low network latency)
  • Monitor resource usage trends for capacity planning
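
For the etcd fragmentation point, talosctl etcd status reports both the database size and the size in use. The sketch below uses two hypothetical helpers: fragmentation_pct applies one common estimate (the share of the database file not in use), and defrag_one_at_a_time defragments the given control plane nodes sequentially rather than all at once.

```shell
#!/bin/sh
# fragmentation_pct DB_SIZE_BYTES IN_USE_BYTES
#   Hypothetical helper: percentage of the etcd database file that is not
#   in use (integer arithmetic, rounded down).
fragmentation_pct() {
  if [ "$1" -le 0 ]; then
    echo "database size must be positive" >&2
    return 1
  fi
  echo $(( ($1 - $2) * 100 / $1 ))
}

# defrag_one_at_a_time NODE...
#   Defragments etcd on each node in turn and prints its status afterwards.
#   Stops at the first failure.
defrag_one_at_a_time() {
  for node in "$@"; do
    talosctl -n "$node" etcd defrag || return 1
    talosctl -n "$node" etcd status || return 1
  done
}
```

Usage: defrag_one_at_a_time <IP1> <IP2> <IP3>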

This documentation provides the essential knowledge needed to effectively administer Talos Linux clusters, organized by operational context and complexity level.