Initial commit.

2025-10-11 18:08:04 +00:00
commit 8947da88eb
43 changed files with 7850 additions and 0 deletions

ai/talos-v1.11/README.md Normal file

@@ -0,0 +1,135 @@
# Talos v1.11 Agent Context Documentation
This directory contains comprehensive documentation extracted from the official Talos v1.11 documentation, organized specifically to help AI agents become expert Talos cluster administrators.
## Documentation Structure
### Core Operations
- **[cluster-operations.md](cluster-operations.md)** - Essential cluster operations including upgrades, node management, and configuration
- **[cli-essentials.md](cli-essentials.md)** - Key talosctl commands and usage patterns for daily administration
### System Understanding
- **[architecture-and-components.md](architecture-and-components.md)** - Deep dive into Talos architecture, components, and design principles
- **[discovery-and-networking.md](discovery-and-networking.md)** - Cluster discovery mechanisms and network configuration
### Specialized Operations
- **[etcd-management.md](etcd-management.md)** - etcd operations, maintenance, backup, and disaster recovery
- **[bare-metal-administration.md](bare-metal-administration.md)** - Bare metal specific configurations, security, and hardware management
- **[troubleshooting-guide.md](troubleshooting-guide.md)** - Systematic approaches to diagnosing and resolving common issues
## Quick Reference
### Essential Commands for New Agents
```bash
# Cluster health check
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
# Node information
talosctl get members
talosctl -n <IP> version
# Service status
talosctl -n <IP> services
talosctl -n <IP> service kubelet
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
# Logs and events
talosctl -n <IP> dmesg | tail -50
talosctl -n <IP> logs kubelet
talosctl -n <IP> events --tail-duration 1h
```
### Critical Procedures
- **Bootstrap**: `talosctl bootstrap --nodes <first-controlplane-ip>`
- **Backup etcd**: `talosctl -n <IP> etcd snapshot db.snapshot`
- **Upgrade OS**: `talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x`
- **Upgrade K8s**: `talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1`
### Emergency Commands
- **Node reset**: `talosctl -n <IP> reset`
- **Force reset**: `talosctl -n <IP> reset --graceful=false --reboot`
- **Disaster recovery**: `talosctl -n <IP> bootstrap --recover-from=./db.snapshot`
- **Rollback**: `talosctl rollback --nodes <IP>`
### Bare Metal Specific Commands
- **Check hardware**: `talosctl -n <IP> disks`, `talosctl -n <IP> read /proc/cpuinfo`
- **Network interfaces**: `talosctl -n <IP> get addresses`, `talosctl -n <IP> get routes`
- **Extensions**: `talosctl -n <IP> get extensions`
- **Encryption status**: `talosctl -n <IP> get encryptionconfig -o yaml`
- **Hardware monitoring**: `talosctl -n <IP> dmesg | grep -i error`
## Key Concepts for Agents
### Architecture Fundamentals
- **Immutable OS**: Single image, atomic updates, A-B rollback system
- **API-driven**: All management through gRPC API, no SSH/shell access
- **Controller pattern**: Kubernetes-style resource controllers for system management
- **Minimal attack surface**: Only services necessary for Kubernetes
### Control Plane Design
- **etcd quorum**: Requires majority for operations (3-node=2, 5-node=3)
- **Bootstrap process**: One-time initialization of etcd cluster
- **HA considerations**: Run an odd number of control plane nodes; an even count adds no extra fault tolerance
- **Upgrade strategy**: Rolling upgrades with automatic rollback on failure
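The quorum arithmetic above can be checked with a quick sketch:

```bash
# quorum = floor(n/2) + 1; fault tolerance = n - quorum
etcd_quorum() { echo $(( $1 / 2 + 1 )); }
etcd_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "$n nodes: quorum $(etcd_quorum "$n"), tolerates $(etcd_tolerance "$n") failure(s)"
done
```

Note that 4 nodes yield quorum 3 and still tolerate only 1 failure, which is why even-sized control planes are discouraged.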
### Network and Discovery
- **Service discovery**: Encrypted discovery service for cluster membership
- **KubeSpan**: Optional WireGuard mesh networking
- **mTLS everywhere**: All Talos API communication secured
- **Discovery registries**: Service (default) and Kubernetes (deprecated)
### Bare Metal Considerations
- **META configuration**: Network config embedded in disk images
- **Hardware compatibility**: Driver support and firmware requirements
- **Disk encryption**: LUKS2 with TPM, static keys, or node ID
- **SecureBoot**: UKI images with embedded signatures
- **System extensions**: Hardware-specific drivers and tools
- **Performance tuning**: CPU governors, IOMMU, memory management
## Common Administration Patterns
### Daily Operations
1. Check cluster health across all nodes
2. Monitor resource usage and capacity
3. Review system events and logs
4. Verify etcd health and backup status
5. Monitor discovery service connectivity
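A minimal sketch wrapping these five checks into one routine; the node IPs in `NODES` are placeholders, not values from this document:

```bash
# Daily read-only health sweep across all nodes (placeholder IPs)
NODES="10.0.0.1 10.0.0.2 10.0.0.3"

daily_check() {
  cp_list=$(echo "$NODES" | tr ' ' ',')
  talosctl -n "$cp_list" health --control-plane-nodes "$cp_list"  # 1. cluster health
  for n in $NODES; do
    echo "== $n =="
    talosctl -n "$n" memory                      # 2. resource usage
    talosctl -n "$n" events --tail-duration 1h   # 3. recent events
    talosctl -n "$n" etcd status                 # 4. etcd health (control plane nodes)
    talosctl -n "$n" get affiliates              # 5. discovery connectivity
  done
}
```

Run it as `daily_check > "daily-$(date +%F).log"` to keep a dated record.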
### Maintenance Windows
1. Plan the upgrade sequence (control plane nodes first, then workers)
2. Create etcd backup before major changes
3. Apply configuration changes with dry-run first
4. Monitor upgrade progress and be ready to rollback
5. Verify cluster functionality after changes
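Steps 2 and 3 can be scripted as a pre-change gate; the control plane IP and file name below are assumptions:

```bash
CP=10.0.0.1   # placeholder control plane IP

pre_change() {
  # Step 2: etcd backup before any major change
  talosctl -n "$CP" etcd snapshot "pre-change-$(date +%F).snapshot" || return 1
  # Step 3: dry-run the configuration change first
  talosctl -n "$CP" apply-config --file controlplane.yaml --dry-run
}
```

Only proceed to the real `apply-config` once the dry-run output shows the expected diff.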
### Troubleshooting Workflow
1. **Gather information**: Health, version, resources, logs
2. **Check connectivity**: Network, discovery, API endpoints
3. **Examine services**: Status of critical services
4. **Review logs**: System events, service logs, kernel messages
5. **Apply fixes**: Configuration patches, service restarts, node resets
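A hypothetical triage helper following that order (the IP passed in is a placeholder):

```bash
triage() {
  node="$1"
  talosctl -n "$node" version        # 1. gather information
  talosctl -n "$node" get members    # 2. connectivity and membership
  talosctl -n "$node" services       # 3. critical service status
  talosctl -n "$node" logs kubelet   # 4. service logs
  talosctl -n "$node" dmesg          # 4. kernel messages
}
```

Fixes (step 5) stay manual: patch, restart, or reset based on what the output shows.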
## Best Practices for Agents
### Configuration Management
- Use reproducible configuration workflow (secrets + patches)
- Always dry-run configuration changes first
- Store machine configurations in version control
- Test configuration changes in non-production first
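The secrets-plus-patches workflow can be sketched as below; the cluster name, endpoint, and patch file name are assumptions:

```bash
gen_cluster_config() {
  # Secrets bundle is generated once and kept (encrypted) in version control
  talosctl gen secrets -o secrets.yaml
  # Machine configs are regenerated deterministically from secrets + patches
  talosctl gen config my-cluster https://cluster.example.com:6443 \
    --with-secrets secrets.yaml \
    --config-patch @patch.yaml
}
```

Regenerating from the same inputs yields the same configs, which is what makes the workflow reproducible.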
### Operational Safety
- Take etcd snapshots before major changes
- Upgrade one node at a time
- Monitor upgrade progress and have rollback ready
- Test disaster recovery procedures regularly
### Performance Optimization
- Monitor etcd fragmentation and defragment when needed
- Scale vertically before horizontally for control plane
- Use appropriate hardware for etcd (fast storage, low network latency)
- Monitor resource usage trends and capacity planning
This documentation provides the essential knowledge needed to effectively administer Talos Linux clusters, organized by operational context and complexity level.


@@ -0,0 +1,248 @@
# Talos Architecture and Components Guide
This guide provides deep understanding of Talos Linux architecture and system components for effective cluster administration.
## Core Architecture Principles
Talos is designed to be:
- **Atomic**: Distributed as a single, versioned, signed, immutable image
- **Modular**: Composed of separate components with defined gRPC interfaces
- **Minimal**: Focused init system that runs only services necessary for Kubernetes
## File System Architecture
### Partition Layout
- **EFI**: Stores EFI boot data
- **BIOS**: Used for GRUB's second stage boot
- **BOOT**: Contains boot loader, initramfs, and kernel data
- **META**: Stores node metadata (node IDs, etc.)
- **STATE**: Stores machine configuration, node identity, cluster discovery, KubeSpan data
- **EPHEMERAL**: Stores ephemeral state, mounted at `/var`
### Root File System Structure
Three-layer design:
1. **Base Layer**: Read-only squashfs mounted as loop device (immutable base)
2. **Runtime Layer**: tmpfs filesystems for runtime needs (`/dev`, `/proc`, `/run`, `/sys`, `/tmp`, `/system`)
3. **Overlay Layer**: overlayfs for persistent data backed by XFS at `/var`
#### Special Directories
- `/system`: Internal files that need to be writable (recreated each boot)
- Example: `/system/etc/hosts` bind-mounted over `/etc/hosts`
- `/var`: Owned by Kubernetes, contains persistent data:
- etcd data (control plane nodes)
- kubelet data
- containerd data
- Survives reboots and upgrades, wiped on reset
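The three layers are visible from a live node; a small sketch (the IP argument is a placeholder):

```bash
show_layers() {
  node="$1"
  # squashfs = immutable base, tmpfs = runtime, overlay/xfs = /var persistence
  talosctl -n "$node" read /proc/mounts
  talosctl -n "$node" mounts
}
```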
## Core Components
### machined (PID 1)
**Role**: Talos replacement for traditional init process
**Functions**:
- Machine configuration management
- API handling
- Resource and controller management
- Service lifecycle management
**Managed Services**:
- apid
- containerd
- etcd (control plane nodes)
- kubelet
- trustd
- udevd
**Architecture**: Uses controller-runtime pattern similar to Kubernetes controllers
### apid (API Gateway)
**Role**: gRPC API endpoint for all Talos interactions
**Functions**:
- Routes requests to appropriate components
- Provides proxy capabilities for multi-node operations
- Handles authentication and authorization
**Usage Patterns**:
```bash
# Direct node communication
talosctl -e <node-ip> <command>
# Proxy through endpoint to specific nodes
talosctl -e <endpoint> -n <target-nodes> <command>
# Multi-node operations
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
```
### trustd (Trust Management)
**Role**: Establishes and maintains trust within the system
**Functions**:
- Root of Trust implementation
- PKI data distribution for control plane bootstrap
- Certificate management
- Secure file placement operations
### containerd (Container Runtime)
**Role**: Industry-standard container runtime
**Namespaces**:
- `system`: Talos services
- `k8s.io`: Kubernetes services
### udevd (Device Management)
**Role**: Device file manager (eudev implementation)
**Functions**:
- Kernel device notification handling
- Device node management in `/dev`
- Hardware discovery and setup
## Control Plane Architecture
### etcd Cluster Design
**Critical Concepts**:
- **Quorum**: Majority of members must agree on leader
- **Membership**: Formal etcd cluster membership required
- **Consensus**: Uses Raft protocol for distributed consensus
**Quorum Requirements**:
- 3 nodes: Requires 2 for quorum (tolerates 1 failure)
- 5 nodes: Requires 3 for quorum (tolerates 2 failures)
- Even numbers are worse than odd (4 nodes still only tolerates 1 failure)
### Control Plane Components
**Running as Static Pods on Control Plane Nodes**:
#### kube-apiserver
- Kubernetes API endpoint
- Connects to local etcd instance
- Handles all API operations
#### kube-controller-manager
- Runs control loops
- Manages cluster state reconciliation
- Handles node lifecycle, replication, etc.
#### kube-scheduler
- Pod placement decisions
- Resource-aware scheduling
- Constraint satisfaction
### Bootstrap Process
1. **etcd Bootstrap**: One node chosen as bootstrap node, initializes etcd cluster
2. **Static Pods**: Control plane components start as static pods via kubelet
3. **API Availability**: Control plane endpoint becomes available
4. **Manifest Injection**: Bootstrap manifests (join tokens, RBAC, etc.) injected
5. **Cluster Formation**: Other control plane nodes join etcd cluster
6. **HA Control Plane**: All control plane nodes run full component set
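From the operator's side, the sequence looks roughly like this (placeholder IP; run once against the chosen bootstrap node):

```bash
CP1=10.0.0.1   # placeholder: first control plane node

bootstrap_and_verify() {
  talosctl -n "$CP1" bootstrap       # step 1: initialize the etcd cluster
  talosctl -n "$CP1" service etcd    # etcd should reach the Running state
  talosctl -n "$CP1" service kubelet # step 2: kubelet launches static pods
  talosctl -n "$CP1" kubeconfig .    # step 3: API endpoint is now reachable
}
```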
## Resource System Architecture
### Controller-Runtime Pattern
Talos uses Kubernetes-style controller pattern:
- **Resources**: Typed configuration and state objects
- **Controllers**: Reconcile desired vs actual state
- **Events**: Reactive architecture for state changes
### Resource Namespaces
- `config`: Machine configuration resources
- `cluster`: Cluster membership and discovery
- `controlplane`: Control plane component configurations
- `secrets`: Certificate and key management
- `network`: Network configuration and state
### Key Resources
```bash
# Machine configuration
talosctl get machineconfig
talosctl get machinetype
# Cluster membership
talosctl get members
talosctl get affiliates
talosctl get identities
# Control plane
talosctl get apiserverconfig
talosctl get controllermanagerconfig
talosctl get schedulerconfig
# Network
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
```
## Network Architecture
### Network Stack
- **CNI**: Container Network Interface for pod networking
- **Host Networking**: Node-to-node communication
- **Service Discovery**: Built-in cluster member discovery
- **KubeSpan**: Optional WireGuard mesh networking
### Discovery Service Integration
- **Service Registry**: External discovery service (default: discovery.talos.dev)
- **Kubernetes Registry**: Deprecated, uses Kubernetes Node resources
- **Encrypted Communication**: All discovery data encrypted before transmission
## Security Architecture
### Immutable Base
- Read-only root filesystem
- Signed and verified boot process
- Atomic updates with rollback capability
### Process Isolation
- Minimal attack surface
- No shell access
- No arbitrary user services
- Container-based workload isolation
### Network Security
- Mutual TLS (mTLS) for all API communication
- Certificate-based node authentication
- Optional WireGuard mesh networking (KubeSpan)
- Encrypted service discovery
### Kernel Hardening
Configured according to Kernel Self Protection Project (KSPP) recommendations:
- Stack protection
- Control flow integrity
- Memory protection features
- Attack surface reduction
## Extension Points
### Machine Configuration
- Declarative configuration management
- Patch-based configuration updates
- Runtime configuration validation
### System Extensions
- Kernel modules
- System services (limited)
- Network configuration
- Storage configuration
### Kubernetes Integration
- Automatic kubelet configuration
- Bootstrap manifest management
- Certificate lifecycle management
- Node lifecycle automation
## Performance Characteristics
### etcd Performance
- Performance decreases with cluster size
- Network latency affects consensus performance
- Storage I/O directly impacts etcd performance
### Resource Requirements
- **Control Plane Nodes**: Higher memory for etcd, CPU for control plane
- **Worker Nodes**: Resources scale with workload requirements
- **Network**: Low latency crucial for etcd performance
### Scaling Patterns
- **Horizontal Scaling**: Add worker nodes for capacity
- **Vertical Scaling**: Increase control plane node resources for performance
- **Control Plane Scaling**: Odd numbers (3, 5) for availability
This architecture enables Talos to provide a secure, minimal, and operationally simple platform for running Kubernetes clusters while maintaining the reliability and performance characteristics needed for production workloads.


@@ -0,0 +1,506 @@
# Bare Metal Talos Administration Guide
This guide covers bare metal specific operations, configurations, and best practices for Talos Linux clusters.
## META-Based Network Configuration
Talos supports META-based network configuration for bare metal deployments where configuration is embedded in the disk image.
### Basic META Configuration
```yaml
# META configuration for bare metal networking
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
        mtu: 1500
    nameservers:
      - 8.8.8.8
      - 1.1.1.1
```
### Advanced Network Configurations
#### VLAN Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        vlans:
          - vlanId: 100 # VLAN 100
            addresses:
              - 192.168.100.10/24
            routes:
              - network: 192.168.100.0/24
```
#### Interface Bonding
```yaml
machine:
  network:
    interfaces:
      - interface: bond0
        bond:
          mode: 802.3ad
          lacpRate: fast
          xmitHashPolicy: layer3+4
          miimon: 100
          updelay: 200
          downdelay: 200
          interfaces:
            - eth0
            - eth1
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
```
#### Bridge Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: br0
        bridge:
          stp:
            enabled: false
          interfaces:
            - eth0
            - eth1
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
```
### Network Troubleshooting Commands
```bash
# Check interface configuration
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
talosctl -n <IP> get links
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
# Test network connectivity
talosctl -n <IP> list /sys/class/net
talosctl -n <IP> read /proc/net/dev
```
## Disk Encryption for Bare Metal
### LUKS2 Encryption Configuration
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          static:
            passphrase: "your-secure-passphrase"
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          nodeID: {}
```
### TPM-Based Encryption
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
```
### Key Management Operations
```bash
# Check encryption status
talosctl -n <IP> get encryptionconfig -o yaml
# Rotate encryption keys
talosctl -n <IP> apply-config --file updated-config.yaml --mode staged
```
## SecureBoot Implementation
### UKI (Unified Kernel Image) Setup
SecureBoot requires UKI format images with embedded signatures.
#### Generate SecureBoot Keys
```bash
# Generate UKI signing key and certificate
talosctl gen secureboot uki --common-name "SecureBoot Key"
# Generate PCR signing key
talosctl gen secureboot pcr
# Generate UEFI signature database entries
talosctl gen secureboot database
```
#### Machine Configuration for SecureBoot
SecureBoot itself comes from booting the signed UKI installer image; there is no `machine.secureboot` section. The machine configuration pairs it with TPM-backed disk encryption so the keys are sealed to the measured boot state:
```yaml
machine:
  systemDiskEncryption:
    state:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
    ephemeral:
      provider: luks2
      keys:
        - slot: 0
          tpm: {}
```
### UEFI Configuration
- Enable SecureBoot in UEFI firmware
- Enroll platform keys and certificates
- Configure TPM 2.0 for PCR measurements
- Set boot order for UKI images
## Hardware-Specific Configurations
### Performance Tuning for Bare Metal
#### CPU Governor Configuration
```yaml
machine:
  sysfs:
    "devices.system.cpu.cpu0.cpufreq.scaling_governor": "performance"
    "devices.system.cpu.cpu1.cpufreq.scaling_governor": "performance"
```
#### Hardware Vulnerability Mitigations
```yaml
machine:
  install:
    extraKernelArgs:
      - mitigations=off # maximum performance (less secure)
      # or the default balanced approach instead:
      # - mitigations=auto
```
#### IOMMU Configuration
```yaml
machine:
  install:
    extraKernelArgs:
      - intel_iommu=on
      - iommu=pt
```
### Memory Management
```yaml
machine:
  install:
    extraKernelArgs:
      - hugepages=1024 # 1024 default-size (2 MiB) hugepages
      - transparent_hugepage=never
```
## Ingress Firewall for Bare Metal
### Basic Firewall Configuration
The ingress firewall is configured with separate `NetworkDefaultActionConfig` and `NetworkRuleConfig` documents appended to the machine configuration, not under `machine.network`:
```yaml
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-talos-api
portSelector:
  ports:
    - 50000
    - 50001
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-kubernetes-api
portSelector:
  ports:
    - 6443
  protocol: tcp
ingress:
  - subnet: 0.0.0.0/0
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-etcd
portSelector:
  ports:
    - 2379
    - 2380
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```
### Advanced Firewall Rules
Talos has no SSH, so the management rule below opens the Talos API to a management subnet instead:
```yaml
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-talos-api-management
portSelector:
  ports:
    - 50000
  protocol: tcp
ingress:
  - subnet: 10.0.1.0/24 # management network only
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-monitoring
portSelector:
  ports:
    - 9100  # node exporter
    - 10250 # kubelet metrics
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```
## System Extensions for Bare Metal
### Common Bare Metal Extensions
```yaml
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/iscsi-tools:latest
      - image: ghcr.io/siderolabs/util-linux-tools:latest
      - image: ghcr.io/siderolabs/drbd:latest
```
### Storage Extensions
```yaml
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/zfs:latest
      - image: ghcr.io/siderolabs/nut-client:latest
      - image: ghcr.io/siderolabs/smartmontools:latest
```
### Checking Extension Status
```bash
# List installed extensions
talosctl -n <IP> get extensions
# Check extension services
talosctl -n <IP> get extensionserviceconfigs
```
## Static Pod Configuration for Bare Metal
### Local Storage Static Pods
Entries under `machine.pods` are full Kubernetes Pod manifests:
```yaml
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: local-storage-provisioner
        namespace: kube-system
      spec:
        containers:
          - name: provisioner
            image: rancher/local-path-provisioner:v0.0.24
            args:
              - --config-path=/etc/config/config.json
            env:
              - name: POD_NAMESPACE
                value: kube-system
            volumeMounts:
              - name: config
                mountPath: /etc/config
              - name: local-storage
                mountPath: /opt/local-path-provisioner
        volumes:
          - name: config
            hostPath:
              path: /etc/local-storage
          - name: local-storage
            hostPath:
              path: /var/lib/local-storage
```
### Hardware Monitoring Static Pods
```yaml
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: node-exporter
        namespace: monitoring
      spec:
        containers:
          - name: node-exporter
            image: prom/node-exporter:latest
            args:
              - --path.rootfs=/host
              - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
            securityContext:
              runAsNonRoot: true
              runAsUser: 65534
            volumeMounts:
              - name: proc
                mountPath: /host/proc
                readOnly: true
              - name: sys
                mountPath: /host/sys
                readOnly: true
              - name: rootfs
                mountPath: /host
                readOnly: true
        volumes:
          - name: proc
            hostPath:
              path: /proc
          - name: sys
            hostPath:
              path: /sys
          - name: rootfs
            hostPath:
              path: /
```
## Bare Metal Boot Asset Management
### PXE Boot Configuration
For network booting, configure DHCP/TFTP with appropriate boot assets:
```bash
# Download kernel and initramfs for PXE
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/vmlinuz-amd64
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/initramfs-amd64.xz
```
### USB Boot Asset Creation
```bash
# Write installer image to USB
sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress
```
### Image Factory Integration
For custom bare metal images:
```bash
# Generate schematic for bare metal with extensions
curl -X POST --data-binary @schematic.yaml \
https://factory.talos.dev/schematics
# Download custom installer
curl -LO https://factory.talos.dev/image/<schematic-id>/v1.11.0/metal-amd64.iso
```
## Hardware Compatibility and Drivers
### Check Hardware Support
```bash
# Check PCI devices
talosctl -n <IP> read /proc/bus/pci/devices
# Check USB devices
talosctl -n <IP> list /sys/bus/usb/devices
# Check loaded kernel modules
talosctl -n <IP> read /proc/modules
# Check hardware information
talosctl -n <IP> read /proc/cpuinfo
talosctl -n <IP> read /proc/meminfo
```
### Common Hardware Issues
#### Network Interface Issues
```bash
# Check interface status
talosctl -n <IP> list /sys/class/net/
# Check driver information
talosctl -n <IP> read /sys/class/net/eth0/device/driver
# Check firmware loading
talosctl -n <IP> dmesg | grep firmware
```
#### Storage Controller Issues
```bash
# Check block devices
talosctl -n <IP> disks
# Check SMART status (if smartmontools extension installed)
talosctl -n <IP> list /dev/disk/by-id/
```
## Bare Metal Monitoring and Maintenance
### Hardware Health Monitoring
```bash
# Check system temperatures (if available)
talosctl -n <IP> read /sys/class/thermal/thermal_zone0/temp
# Check power supply status
talosctl -n <IP> list /sys/class/power_supply
# Monitor system events for hardware issues
talosctl -n <IP> dmesg | grep -i error
talosctl -n <IP> dmesg | grep -i "machine check"
```
### Performance Monitoring
```bash
# Check CPU performance
talosctl -n <IP> read /proc/cpuinfo | grep MHz
talosctl -n <IP> cgroups --preset cpu
# Check memory performance
talosctl -n <IP> memory
talosctl -n <IP> cgroups --preset memory
# Check I/O performance
talosctl -n <IP> read /proc/diskstats
```
## Security Hardening for Bare Metal
### BIOS/UEFI Security
- Enable SecureBoot
- Disable unused boot devices
- Set administrator passwords
- Enable TPM 2.0
- Disable legacy boot modes
### Physical Security
- Secure physical access to servers
- Use chassis intrusion detection
- Implement network port security
- Consider hardware-based attestation
### Network Security
```yaml
# Only allow necessary cluster services
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: allow-cluster-traffic
portSelector:
  ports:
    - 6443  # Kubernetes API
    - 2379  # etcd client
    - 2380  # etcd peer
    - 10250 # kubelet API
    - 50000 # Talos API
  protocol: tcp
ingress:
  - subnet: 192.168.1.0/24
```
This bare metal guide provides comprehensive coverage of hardware-specific configurations, performance optimization, security hardening, and operational practices for Talos Linux on physical servers.

View File

@@ -0,0 +1,382 @@
# Talosctl CLI Essentials
This guide covers essential talosctl commands and usage patterns for effective Talos cluster administration.
## Command Structure and Context
### Basic Command Pattern
```bash
talosctl [global-flags] <command> [command-flags] [arguments]
# Examples:
talosctl -n <IP> get members
talosctl --nodes <IP1>,<IP2> service kubelet
talosctl -e <endpoint> -n <target-nodes> upgrade --image <image>
```
### Global Flags
- `-e, --endpoints`: API endpoints to connect to
- `-n, --nodes`: Target nodes for commands (defaults to first endpoint if omitted)
- `--talosconfig`: Path to Talos configuration file
- `--context`: Configuration context to use
### Configuration Management
```bash
# Use specific config file
export TALOSCONFIG=/path/to/talosconfig
# List available contexts
talosctl config contexts
# Switch context
talosctl config context <context-name>
# View current config
talosctl config info
```
## Cluster Management Commands
### Bootstrap and Node Management
```bash
# Bootstrap etcd cluster on first control plane node
talosctl bootstrap --nodes <first-controlplane-ip>
# Apply machine configuration
talosctl apply-config --nodes <IP> --file <config.yaml>
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
# Reset node (wipe and reboot)
talosctl reset --nodes <IP>
talosctl reset --nodes <IP> --graceful=false --reboot
# Reboot node
talosctl reboot --nodes <IP>
# Shutdown node
talosctl shutdown --nodes <IP>
```
### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml --mode reboot
# Edit machine config interactively
talosctl -n <IP> edit mc --mode staged
```
## System Information and Monitoring
### Node Status and Health
```bash
# Cluster member information
talosctl get members
talosctl get affiliates
talosctl get identities
# Node health check
talosctl -n <IP> health
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
# System information
talosctl -n <IP> version
talosctl -n <IP> get machineconfig
talosctl -n <IP> get machinetype
```
### Resource Monitoring
```bash
# CPU and memory usage
talosctl -n <IP> dashboard # interactive CPU/memory view
talosctl -n <IP> memory
# Disk usage and information
talosctl -n <IP> disks
talosctl -n <IP> mounts
# Network interfaces
talosctl -n <IP> get links
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
### Service Management
```bash
# List all services
talosctl -n <IP> services
# Check specific service status
talosctl -n <IP> service kubelet
talosctl -n <IP> service containerd
talosctl -n <IP> service etcd
# Restart service
talosctl -n <IP> service kubelet restart
# Start/stop service
talosctl -n <IP> service <service-name> start
talosctl -n <IP> service <service-name> stop
```
## Logging and Diagnostics
### Log Retrieval
```bash
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
talosctl -n <IP> logs etcd
talosctl -n <IP> logs machined
# Follow logs
talosctl -n <IP> logs kubelet -f
```
### System Events
```bash
# Monitor system events
talosctl -n <IP> events
talosctl -n <IP> events --tail-events 50
# Filter events
talosctl -n <IP> events --tail-duration 1h
talosctl -n <IP> events | grep -i error
```
## File System and Container Operations
### File Operations
```bash
# List files/directories
talosctl -n <IP> list /var/log
talosctl -n <IP> list /etc/kubernetes
# Copy files from node to local machine (copy is one-way, node to local)
talosctl -n <IP> copy /var/log ./log
talosctl -n <IP> cp /var/log/containers/app.log ./app.log
# Read file contents
talosctl -n <IP> read /etc/resolv.conf
talosctl -n <IP> read /var/log/audit/audit.log
```
### Container Operations
```bash
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Container logs
talosctl -n <IP> logs --kubernetes <container-name>
# talosctl has no exec command (no shell access by design); use kubectl for pods
kubectl exec -it <pod-name> -- <command>
```
## Kubernetes Integration
### Kubernetes Cluster Operations
```bash
# Get kubeconfig
talosctl kubeconfig
talosctl kubeconfig --nodes <controlplane-ip>
talosctl kubeconfig --force --nodes <controlplane-ip>
# Bootstrap manifests
talosctl -n <IP> get manifests
talosctl -n <IP> get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml
# Upgrade Kubernetes
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
```
### Resource Inspection
```bash
# Control plane component configs
talosctl -n <IP> get apiserverconfig -o yaml
talosctl -n <IP> get controllermanagerconfig -o yaml
talosctl -n <IP> get schedulerconfig -o yaml
# etcd configuration
talosctl -n <IP> get etcdconfig -o yaml
```
## etcd Management
### etcd Operations
```bash
# etcd cluster status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# etcd members
talosctl -n <IP> etcd members
# etcd snapshots
talosctl -n <IP> etcd snapshot db.snapshot
# etcd maintenance
talosctl -n <IP> etcd defrag
talosctl -n <IP> etcd alarm list
talosctl -n <IP> etcd alarm disarm
# Leadership management
talosctl -n <IP> etcd forfeit-leadership
```
### Disaster Recovery
```bash
# Bootstrap from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot --recover-skip-hash-check
```
## Upgrade and Maintenance
### OS Upgrades
```bash
# Upgrade Talos OS
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait
talosctl upgrade --nodes <IP> --image <image> --wait --debug
# Rollback
talosctl rollback --nodes <IP>
```
## Resource System Commands
### Resource Management
```bash
# List resource types
talosctl get rd
# Get specific resources
talosctl get <resource-type>
talosctl get <resource-type> -o yaml
talosctl get <resource-type> --namespace=<namespace>
# Watch resources
talosctl get <resource-type> --watch
# Common resource types
talosctl get machineconfig
talosctl get members
talosctl get services
talosctl get networkconfig
talosctl get secrets
```
## Local Development
### Local Cluster Management
```bash
# Create local cluster
talosctl cluster create
talosctl cluster create --controlplanes 3 --workers 2
# Destroy local cluster
talosctl cluster destroy
# Show local cluster status
talosctl cluster show
```
## Advanced Usage Patterns
### Multi-Node Operations
```bash
# Run command on multiple nodes
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
# Different endpoint and target nodes
talosctl -e <public-endpoint> -n <internal-node1>,<internal-node2> <command>
```
### Output Formatting
```bash
# JSON output
talosctl -n <IP> get members -o json
# YAML output
talosctl -n <IP> get machineconfig -o yaml
# Table output (default)
talosctl -n <IP> get members -o table
# Extract specific fields with jq
talosctl -n <IP> get members -o json | jq -r '.spec.hostname'
```
### Filtering and Selection
```bash
# Filter resources with grep
talosctl get members | grep <hostname>
talosctl get services | grep kubelet
# Namespace filtering
talosctl get secrets --namespace=secrets
talosctl get affiliates --namespace=cluster-raw
```
## Common Command Workflows
### Initial Cluster Setup
```bash
# 1. Generate configurations
talosctl gen config cluster-name https://cluster-endpoint:6443
# 2. Apply to nodes
talosctl apply-config --nodes <controlplane-1> --file controlplane.yaml
talosctl apply-config --nodes <worker-1> --file worker.yaml
# 3. Bootstrap cluster
talosctl bootstrap --nodes <controlplane-1>
# 4. Get kubeconfig
talosctl kubeconfig --nodes <controlplane-1>
```
### Cluster Health Check
```bash
# Check all aspects of cluster health
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP1>,<IP2>,<IP3> service kubelet
kubectl get nodes
kubectl get pods --all-namespaces
```
### Node Troubleshooting
```bash
# System diagnostics
talosctl -n <IP> dmesg | tail -100
talosctl -n <IP> services | grep -v Running
talosctl -n <IP> logs kubelet | tail -50
talosctl -n <IP> events --tail-duration 1h
# Resource usage
talosctl -n <IP> memory
talosctl -n <IP> df
talosctl -n <IP> processes | head -20
```
This CLI reference provides the essential commands and patterns needed for day-to-day Talos cluster administration and troubleshooting.

# Talos Cluster Operations Guide
This guide covers essential cluster operations for Talos Linux v1.11 administrators.
## Upgrading Operations
### Talos OS Upgrades
Talos uses an A-B image scheme: each upgrade retains the previous kernel and OS image, so a failed upgrade can roll back to the known-good version.
#### Upgrade Process
```bash
# Upgrade a single node
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl dmesg -f
talosctl upgrade --wait --debug
```
#### Upgrade Sequence
1. Node cordons itself in Kubernetes
2. Node drains existing workloads
3. Internal processes shut down
4. Filesystems unmount
5. Disk verification and image upgrade
6. Bootloader set to boot once with new image
7. Node reboots
8. Node rejoins cluster and uncordons
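Because each node runs this full sequence, multi-node upgrades should proceed strictly one node at a time. As a hedged sketch (the node IPs and image tag are placeholders), printing the commands first lets you review the rolling plan before anything touches the cluster:

```shell
# Hypothetical rolling-upgrade plan: emit one upgrade command per node so the
# plan can be reviewed (or piped to sh) before execution. Node IPs and the
# image tag are assumptions - substitute your own.
IMAGE="ghcr.io/siderolabs/installer:v1.11.2"
NODES="10.0.0.2 10.0.0.3 10.0.0.4"

for node in $NODES; do
  # --wait blocks until the node reboots and reports healthy, which keeps the
  # upgrade strictly serial
  echo "talosctl upgrade --nodes $node --image $IMAGE --wait"
done
```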
#### Rollback
```bash
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrades
Kubernetes upgrades are handled separately from OS upgrades and are non-disruptive to running workloads.
#### Automated Upgrade (Recommended)
```bash
# Check what will be upgraded
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
# Perform upgrade
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
```
#### Manual Component Upgrades
For manual control, patch each component individually:
**API Server:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'
```
**Controller Manager:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'
```
**Scheduler:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'
```
**Kubelet:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
```
## Node Management
### Adding Control Plane Nodes
1. Apply machine configuration to new node
2. Node automatically joins etcd cluster via control plane endpoint
3. Control plane components start automatically
### Removing Control Plane Nodes
```bash
# Recommended approach - reset then delete
talosctl -n <IP.of.node.to.remove> reset
kubectl delete node <node-name>
```
### Adding Worker Nodes
1. Apply worker machine configuration
2. Node automatically joins via bootstrap token
### Removing Worker Nodes
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
talosctl -n <IP> reset
```
## Configuration Management
### Applying Configuration Changes
```bash
# Apply config with automatic mode detection
talosctl apply-config --nodes <IP> --file <config.yaml>
# Apply with specific modes
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged
# Dry run to preview changes
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
```
### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml
```
### Retrieving Current Configuration
```bash
# Get machine configuration
talosctl -n <IP> get mc v1alpha1 -o yaml
# Get effective configuration
talosctl -n <IP> get machineconfig -o yaml
```
## Cluster Health Monitoring
### Node Status
```bash
# Check node status
talosctl -n <IP> get members
talosctl -n <IP> health
# Check system services
talosctl -n <IP> services
talosctl -n <IP> service <service-name>
```
### Resource Monitoring
```bash
# System resources
talosctl -n <IP> memory
talosctl -n <IP> cpu
talosctl -n <IP> disks
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
```
### Log Monitoring
```bash
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
# Service logs
talosctl -n <IP> logs <service-name>
talosctl -n <IP> logs kubelet
```
## Control Plane Best Practices
### Cluster Sizing Recommendations
- **3 nodes**: Sufficient for most use cases, tolerates 1 node failure
- **5 nodes**: Better availability (tolerates 2 node failures), higher resource cost
- **Avoid even numbers**: 2 or 4 nodes tolerate no more failures than 1 or 3, while adding more machines that can fail
### Node Replacement Strategy
- **Failed node**: Remove first, then add replacement
- **Healthy node**: Add replacement first, then remove old node
### Performance Considerations
- etcd performance decreases as cluster scales
- 5-node cluster commits ~5% fewer writes than 3-node cluster
- Vertically scale nodes for performance, don't add more nodes
## Machine Configuration Versioning
### Reproducible Configuration Workflow
Store only:
- `secrets.yaml` (generated once at cluster creation)
- Patch files (YAML/JSON patches describing differences from defaults)
Generate configs when needed:
```bash
# Generate fresh configs with existing secrets
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml
# Apply patches to generated configs
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml
```
This prevents configuration drift after automated upgrades.
## Troubleshooting Common Issues
### Upgrade Failures
- **Invalid installer image**: Check image reference and network connectivity
- **Filesystem unmount failure**: Use `--stage` flag
- **Boot failure**: System automatically rolls back to previous version
- **Workload issues**: Use `talosctl rollback` to revert
### Node Join Issues
- Verify network connectivity to control plane endpoint
- Check discovery service configuration
- Validate machine configuration syntax
- Ensure bootstrap process completed on initial control plane node
### Control Plane Quorum Loss
- Identify healthy nodes with `talosctl etcd status`
- Follow disaster recovery procedures if quorum cannot be restored
- Use etcd snapshots for cluster recovery
## Security Considerations
### Certificate Rotation
Talos automatically rotates certificates, but monitor expiration:
```bash
talosctl -n <IP> get secrets
```
### Pod Security
Control plane nodes are tainted by default to prevent workload scheduling. This protects:
- Control plane from resource starvation
- Credentials from workload exposure
### Network Security
- All API communication uses mutual TLS (mTLS)
- Discovery service data is encrypted before transmission
- WireGuard (KubeSpan) provides mesh networking security

# Discovery and Networking Guide
This guide covers Talos cluster discovery mechanisms, network configuration, and connectivity troubleshooting.
## Cluster Discovery System
Talos includes built-in node discovery that allows cluster members to find each other and maintain membership information.
### Discovery Registries
#### Service Registry (Default)
- **External Service**: Uses public discovery service at `https://discovery.talos.dev/`
- **Encryption**: All data encrypted with AES-GCM before transmission
- **Functionality**: Works without dependency on etcd/Kubernetes
- **Advantages**: Available even when control plane is down
#### Kubernetes Registry (Deprecated)
- **Data Source**: Uses Kubernetes Node resources and annotations
- **Limitation**: Incompatible with Kubernetes 1.32+ due to AuthorizeNodeWithSelectors
- **Status**: Disabled by default, deprecated
### Discovery Configuration
```yaml
cluster:
  discovery:
    enabled: true
    registries:
      service:
        disabled: false # Default
      kubernetes:
        disabled: true # Deprecated, disabled by default
```
**To disable service registry**:
```yaml
cluster:
  discovery:
    enabled: true
    registries:
      service:
        disabled: true
```
## Discovery Data Flow
### Service Registry Process
1. **Data Encryption**: Node encrypts affiliate data with cluster key
2. **Endpoint Encryption**: Endpoints separately encrypted for deduplication
3. **Data Submission**: Node submits own data + observed peer endpoints
4. **Server Processing**: Discovery service aggregates and deduplicates data
5. **Data Distribution**: Encrypted updates sent to all cluster members
6. **Local Processing**: Nodes decrypt data for cluster discovery and KubeSpan
### Data Protection
- **Cluster Isolation**: Cluster ID used as key selector
- **End-to-End Encryption**: Discovery service cannot decrypt affiliate data
- **Memory-Only Storage**: Data stored in memory with encrypted snapshots
- **No Sensitive Exposure**: Service only sees encrypted blobs and cluster metadata
## Discovery Resources
### Node Identity
```bash
# View node's unique identity
talosctl get identities -o yaml
```
**Output**:
```yaml
spec:
  nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd
```
**Identity Characteristics**:
- Base62 encoded random 32 bytes
- URL-safe encoding
- Preserved in STATE partition (`node-identity.yaml`)
- Survives reboots and upgrades
- Regenerated on reset/wipe
### Affiliates (Proposed Members)
```bash
# View discovered affiliates (proposed cluster members)
talosctl get affiliates
```
**Output**:
```
ID VERSION HOSTNAME MACHINE TYPE ADDRESSES
2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 2 talos-default-controlplane-2 controlplane ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
```
### Members (Approved Members)
```bash
# View cluster members
talosctl get members
```
**Output**:
```
ID VERSION HOSTNAME MACHINE TYPE OS ADDRESSES
talos-default-controlplane-1 2 talos-default-controlplane-1 controlplane Talos (v1.11.0) ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]
```
### Raw Registry Data
```bash
# View data from specific registries
talosctl get affiliates --namespace=cluster-raw
```
**Output shows registry sources**:
```
ID VERSION HOSTNAME
k8s/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 3 talos-default-controlplane-2
service/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 23 talos-default-controlplane-2
```
## Network Architecture
### Network Layers
#### Host Networking
- **Node-to-Node**: Direct IP connectivity between cluster nodes
- **Control Plane**: API server communication via control plane endpoint
- **Discovery**: HTTPS connection to discovery service (port 443)
#### Container Networking
- **CNI**: Container Network Interface for pod networking
- **Service Mesh**: Optional service mesh implementations
- **Network Policies**: Kubernetes network policy enforcement
#### Optional: KubeSpan (WireGuard Mesh)
- **Mesh Networking**: Full mesh WireGuard connections
- **Discovery Integration**: Uses discovery service for peer coordination
- **Encryption**: WireGuard public keys distributed via discovery
- **Use Cases**: Multi-cloud, hybrid, NAT traversal
### Network Configuration Patterns
#### Basic Network Setup
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
```
#### Static IP Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        addresses:
          - 192.168.1.100/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
        mtu: 1500
    nameservers:
      - 8.8.8.8
      - 1.1.1.1
```
#### Multiple Interface Configuration
```yaml
machine:
  network:
    interfaces:
      - interface: eth0 # Management interface
        dhcp: true
      - interface: eth1 # Kubernetes traffic
        addresses:
          - 10.0.1.100/24
        routes:
          - network: 10.0.0.0/16
            gateway: 10.0.1.1
```
## KubeSpan Configuration
### Basic KubeSpan Setup
```yaml
machine:
  network:
    kubespan:
      enabled: true
```
### Advanced KubeSpan Configuration
```yaml
machine:
  network:
    kubespan:
      enabled: true
      advertiseKubernetesNetworks: true
      allowDownPeerBypass: true
      mtu: 1420 # Account for WireGuard overhead
      filters:
        endpoints:
          - 0.0.0.0/0 # Allow all endpoints
```
**KubeSpan Features**:
- Automatic peer discovery via discovery service
- NAT traversal capabilities
- Encrypted mesh networking
- Kubernetes network advertisement
- Fault tolerance with peer bypass
## Network Troubleshooting
### Discovery Issues
#### Check Discovery Service Connectivity
```bash
# Test connectivity to discovery service
talosctl get affiliates
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Monitor discovery events
talosctl events --tail
```
#### Common Discovery Problems
1. **No Affiliates Discovered**:
- Check discovery service connectivity
- Verify cluster ID matches across nodes
- Confirm discovery is enabled
2. **Partial Affiliate List**:
- Network connectivity issues between nodes
- Discovery service regional availability
- Firewall blocking discovery traffic
3. **Discovery Service Unreachable**:
- Network connectivity to discovery.talos.dev:443
- Corporate firewall/proxy configuration
- DNS resolution issues
### Network Connectivity Testing
#### Basic Network Tests
```bash
# Test network interfaces
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
# Check network configuration
talosctl get networkconfig -o yaml
# Test connectivity
talosctl -n <IP> ping <target-ip>
```
#### Inter-Node Connectivity
```bash
# Test control plane endpoint
talosctl health --control-plane-nodes <IP1>,<IP2>,<IP3>
# Check etcd connectivity
talosctl -n <IP> etcd members
# Test Kubernetes API
kubectl get nodes
```
#### KubeSpan Troubleshooting
```bash
# Check KubeSpan status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Monitor WireGuard connections
talosctl -n <IP> interfaces
# Check KubeSpan logs
talosctl -n <IP> logs controller-runtime | grep kubespan
```
### Network Performance Optimization
#### Network Interface Tuning
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        mtu: 9000 # Jumbo frames if supported
        dhcp: true
```
#### KubeSpan Performance
- Adjust MTU for WireGuard overhead (typically -80 bytes)
- Consider endpoint filters for large clusters
- Monitor WireGuard peer connection stability
## Security Considerations
### Discovery Security
- **Encrypted Communication**: All discovery data encrypted end-to-end
- **Cluster Isolation**: Cluster ID prevents cross-cluster data access
- **No Sensitive Data**: Only encrypted metadata transmitted
- **Network Security**: HTTPS transport with certificate validation
### Network Security
- **mTLS**: All Talos API communication uses mutual TLS
- **Certificate Rotation**: Automatic certificate lifecycle management
- **Network Policies**: Implement Kubernetes network policies for workloads
- **Firewall Rules**: Restrict network access to necessary ports only
### Required Network Ports
- **6443**: Kubernetes API server
- **2379-2380**: etcd client/peer communication
- **10250**: kubelet API
- **50000**: Talos API (apid)
- **443**: Discovery service (outbound)
- **51820**: KubeSpan WireGuard (if enabled)
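A quick pre-flight sketch for verifying these ports are reachable from an administrative workstation; the node IPs in the commented sweep are placeholders, and the check assumes bash's `/dev/tcp` redirection and the coreutils `timeout` command are available:

```shell
# Hypothetical reachability check using bash's built-in /dev/tcp; returns 0
# when a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example sweep (node IPs are assumptions):
# for node in 10.0.0.2 10.0.0.3; do
#   for port in 6443 2379 2380 10250 50000; do
#     check_port "$node" "$port" && echo "$node:$port open" || echo "$node:$port closed"
#   done
# done
```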
## Operational Best Practices
### Monitoring
- Monitor discovery service connectivity
- Track cluster member changes
- Alert on network partitions
- Monitor KubeSpan peer status
### Backup and Recovery
- Document network configuration
- Backup discovery service configuration
- Test network recovery procedures
- Plan for discovery service outages
### Scaling Considerations
- Discovery service scales to thousands of nodes
- KubeSpan mesh scales to hundreds of nodes efficiently
- Consider network segmentation for large clusters
- Plan for multi-region deployments
This networking foundation enables Talos clusters to maintain connectivity and membership across various network topologies while providing security and performance optimization options.

# etcd Management and Disaster Recovery Guide
This guide covers etcd database operations, maintenance, and disaster recovery procedures for Talos Linux clusters.
## etcd Health Monitoring
### Basic Health Checks
```bash
# Check etcd status across all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Check etcd alarms
talosctl -n <IP> etcd alarm list
# Check etcd members
talosctl -n <IP> etcd members
# Check service status
talosctl -n <IP> service etcd
```
### Understanding etcd Status Output
```
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
172.20.0.2 a49c021e76e707db 17 MB 4.5 MB (26.10%) ecebb05b59a776f1 53391 4 53391 false
```
**Key Metrics**:
- **DB SIZE**: Total database size on disk
- **IN USE**: Actual data size (fragmentation = DB SIZE - IN USE)
- **LEADER**: Current etcd cluster leader
- **RAFT INDEX**: Consensus log position
- **LEARNER**: Whether node is still joining cluster
## Space Quota Management
### Default Configuration
- Default space quota: 2 GiB
- Recommended maximum: 8 GiB
- When the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes
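The `quota-backend-bytes` setting takes a raw byte count, so shell arithmetic makes the GiB conversion explicit, e.g. for 4 GiB:

```shell
# 4 GiB expressed in bytes, suitable for quota-backend-bytes
echo $((4 * 1024 * 1024 * 1024))   # prints 4294967296
```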
### Quota Exceeded Handling
**Symptoms**:
```bash
talosctl -n <IP> etcd alarm list
# Output: ALARM: NOSPACE
```
**Resolution**:
1. Increase quota in machine configuration:
```yaml
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: 4294967296 # 4 GiB
```
2. Apply configuration and reboot:
```bash
talosctl -n <IP> apply-config --file updated-config.yaml --mode reboot
```
3. Clear the alarm:
```bash
talosctl -n <IP> etcd alarm disarm
```
## Database Defragmentation
### When to Defragment
- In use/DB size ratio < 0.5 (heavily fragmented)
- Database size exceeds quota but actual data is small
- Performance degradation due to fragmentation
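The ratio check can be scripted against `talosctl etcd status` output. A hedged helper, assuming the in-use percentage is printed in parentheses as in the sample output above:

```shell
# Flags rows whose "(NN.NN%)" in-use figure is below a threshold (default 50%).
# Usage (hypothetical): talosctl -n <IP1>,<IP2>,<IP3> etcd status | frag_check 50
frag_check() {
  awk -v limit="${1:-50}" 'NR > 1 {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^\([0-9.]+%\)$/) {
        # Strip the surrounding parentheses and the % sign
        pct = substr($i, 2, length($i) - 3)
        if (pct + 0 < limit)
          print $1 " is only " pct "% in use - consider defragmenting"
      }
  }'
}
```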
### Defragmentation Process
```bash
# Check fragmentation status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Defragment single node (resource-intensive operation)
talosctl -n <IP1> etcd defrag
# Verify defragmentation results
talosctl -n <IP1> etcd status
```
**Important Notes**:
- Defragment one node at a time
- Operation blocks reads/writes during execution
- Can significantly improve performance if heavily fragmented
### Post-Defragmentation Verification
After successful defragmentation, DB size should closely match IN USE size:
```
NODE MEMBER DB SIZE IN USE
172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%)
```
## Backup Operations
### Regular Snapshots
```bash
# Create consistent snapshot
talosctl -n <IP> etcd snapshot db.snapshot
```
**Output Example**:
```
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
```
### Disaster Snapshots
When etcd cluster is unhealthy and normal snapshot fails:
```bash
# Copy database directly (may be inconsistent)
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
```
### Automated Backup Strategy
- Schedule regular snapshots (daily/hourly based on change frequency)
- Store snapshots in multiple locations
- Test restore procedures regularly
- Document recovery procedures
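The points above can be sketched as a small cron-driven script; the node IP, backup directory, and retention count are all assumptions, and the actual invocation is left commented out:

```shell
# Hypothetical snapshot-rotation sketch. Defines functions only; nothing runs
# against a cluster on load.
NODE="${NODE:-10.0.0.2}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
KEEP="${KEEP:-7}"

snapshot_name() {
  # Timestamped so snapshots sort chronologically by name
  printf 'etcd-%s.snapshot' "$(date +%Y%m%d-%H%M%S)"
}

take_snapshot() {
  mkdir -p "$BACKUP_DIR"
  talosctl -n "$NODE" etcd snapshot "$BACKUP_DIR/$(snapshot_name)"
  # Keep only the $KEEP most recent snapshots
  ls -1t "$BACKUP_DIR"/etcd-*.snapshot 2>/dev/null | tail -n +$((KEEP + 1)) | xargs -r rm --
}

# Hourly cron entry (assumption):
# 0 * * * * NODE=10.0.0.2 /usr/local/bin/etcd-backup.sh
# take_snapshot
```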
## Disaster Recovery
### Pre-Recovery Assessment
**Check if Recovery is Necessary**:
```bash
# Query etcd health on all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> service etcd
# Check member list consistency
talosctl -n <IP1> etcd members
talosctl -n <IP2> etcd members
talosctl -n <IP3> etcd members
```
**Recovery is needed when**:
- Quorum is lost (majority of nodes down)
- etcd data corruption
- Complete cluster failure
### Recovery Prerequisites
1. **Latest etcd snapshot** (preferably consistent)
2. **Machine configuration backup**:
```bash
talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
```
3. **No init-type nodes** (deprecated, incompatible with recovery)
### Recovery Procedure
#### Step 1: Prepare Control Plane Nodes
```bash
# If nodes have hardware issues, replace them with same configuration
# If nodes are running but etcd is corrupted, wipe EPHEMERAL partition:
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
```
#### Step 2: Verify etcd State
All etcd services should be in "Preparing" state:
```bash
talosctl -n <IP> service etcd
# Expected: STATE: Preparing
```
#### Step 3: Bootstrap from Snapshot
```bash
# Bootstrap cluster from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
# For direct database copies, skip hash check:
talosctl -n <IP> bootstrap --recover-from=./db --recover-skip-hash-check
```
#### Step 4: Verify Recovery
**Monitor kernel logs** for recovery progress:
```bash
talosctl -n <IP> dmesg -f
```
**Expected log entries**:
```
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot"}
```
**Verify cluster health**:
```bash
# etcd should become healthy on bootstrap node
talosctl -n <IP> service etcd
# Kubernetes control plane should start
kubectl get nodes
# Other control plane nodes should join automatically
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
## etcd Version Management
### Downgrade Process (v3.6 to v3.5)
**Prerequisites**:
- Healthy cluster running v3.6.x
- Recent backup snapshot
- Downgrade only one minor version at a time
#### Step 1: Validate Downgrade
```bash
talosctl -n <IP1> etcd downgrade validate 3.5
```
#### Step 2: Enable Downgrade
```bash
talosctl -n <IP1> etcd downgrade enable 3.5
```
#### Step 3: Verify Schema Migration
```bash
# Check storage version migrated to 3.5
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Verify STORAGE column shows 3.5.0
```
#### Step 4: Patch Machine Configuration
```bash
# Transfer leadership if node is leader
talosctl -n <IP1> etcd forfeit-leadership
# Create patch file
cat > etcd-patch.yaml <<EOF
cluster:
  etcd:
    image: gcr.io/etcd-development/etcd:v3.5.22
EOF
# Apply patch with reboot
talosctl -n <IP1> patch machineconfig --patch @etcd-patch.yaml --mode reboot
```
#### Step 5: Repeat for All Control Plane Nodes
Continue patching remaining control plane nodes one by one.
## Operational Best Practices
### Monitoring
- Monitor database size and fragmentation regularly
- Set up alerts for space quota approaching limits
- Track etcd performance metrics (request latency, leader changes)
- Monitor disk I/O and network latency
### Maintenance Windows
- Schedule defragmentation during low-traffic periods
- Coordinate with application teams for maintenance windows
- Test backup/restore procedures in non-production environments
### Performance Optimization
- Use fast storage (NVMe SSDs preferred)
- Minimize network latency between control plane nodes
- Monitor and tune etcd configuration based on workload
### Security
- Encrypt etcd data at rest
- Secure backup storage with appropriate access controls
- Regularly rotate certificates
- Monitor for unauthorized access attempts
## Troubleshooting Common Issues
### Split Brain Prevention
- Ensure odd number of control plane nodes
- Monitor network connectivity between nodes
- Use dedicated network for control plane communication when possible
### Performance Issues
- Check disk I/O latency
- Monitor memory usage
- Consider vertical scaling before adding nodes
- Review etcd request patterns and optimize applications
### Backup/Restore Issues
- Test restore procedures regularly
- Verify backup integrity
- Ensure consistent network and storage configuration
- Document and practice disaster recovery procedures

# Talos Troubleshooting Guide
This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
## General Troubleshooting Methodology
### 1. Gather Information
```bash
# Node status and health
talosctl -n <IP> health
talosctl -n <IP> version
talosctl -n <IP> get members
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
talosctl -n <IP> processes | head -20
# Service status
talosctl -n <IP> services
```
### 2. Check Logs
```bash
# Kernel logs (system-level issues)
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs machined
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
# System events
talosctl -n <IP> events --since=1h
```
### 3. Network Connectivity
```bash
# Discovery and membership
talosctl get affiliates
talosctl get members
# Network interfaces
talosctl -n <IP> interfaces
talosctl -n <IP> get addresses
# Control plane connectivity
kubectl get nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
## Bootstrap and Initial Setup Issues
### Cluster Bootstrap Failures
**Symptoms**: Bootstrap command fails or times out
**Diagnosis**:
```bash
# Check etcd service state
talosctl -n <IP> service etcd
# Check if node is trying to join instead of bootstrap
talosctl -n <IP> logs etcd | grep -i bootstrap
# Verify machine configuration
talosctl -n <IP> get machineconfig -o yaml
```
**Common Causes & Solutions**:
1. **Wrong node type**: Ensure using `controlplane`, not deprecated `init`
2. **Network issues**: Verify control plane endpoint connectivity
3. **Configuration errors**: Check machine configuration validity
4. **Previous bootstrap**: etcd data exists from previous attempts
**Resolution**:
```bash
# Reset node if previous bootstrap data exists
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
# Re-apply configuration and bootstrap
talosctl apply-config --nodes <IP> --file controlplane.yaml
talosctl bootstrap --nodes <IP>
```
### Node Join Issues
**Symptoms**: New nodes don't join cluster
**Diagnosis**:
```bash
# Check discovery
talosctl get affiliates
talosctl get members
# Check bootstrap token
kubectl get secrets -n kube-system | grep bootstrap-token
# Check kubelet logs
talosctl -n <IP> logs kubelet | grep -i certificate
```
**Common Solutions**:
```bash
# Verify the join token (Talos does not use kubeadm) - the worker config must
# be generated from the same secrets bundle as the cluster
talosctl -n <IP> get machineconfig -o yaml | grep 'token:'
# Verify discovery service connectivity
talosctl -n <IP> get affiliates --namespace=cluster-raw
# Check machine configuration matches cluster
talosctl -n <IP> get machineconfig -o yaml
```
## Control Plane Issues
### etcd Problems
**etcd Won't Start**:
```bash
# Check etcd service status and logs
talosctl -n <IP> service etcd
talosctl -n <IP> logs etcd
# Check etcd data directory
talosctl -n <IP> list /var/lib/etcd
# Check disk space and permissions
talosctl -n <IP> df
```
**etcd Quorum Loss**:
```bash
# Check member status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP> etcd members
# Identify healthy members
for ip in IP1 IP2 IP3; do
echo "=== Node $ip ==="
talosctl -n $ip service etcd
done
```
**Solution for Quorum Loss**:
1. If majority available: Remove failed members, add replacements
2. If majority lost: Follow disaster recovery procedure
### API Server Issues
**API Server Not Responding**:
```bash
# Check API server pod status
kubectl get pods -n kube-system | grep apiserver
# Check API server configuration
talosctl -n <IP> get apiserverconfig -o yaml
# Check control plane endpoint
curl -k https://<control-plane-endpoint>:6443/healthz
```
**Common Solutions**:
```bash
# Restart kubelet to reload static pods
talosctl -n <IP> service kubelet restart
# Check for configuration issues
talosctl -n <IP> logs kubelet | grep apiserver
# Verify etcd connectivity
talosctl -n <IP> etcd status
```
## Node-Level Issues
### Kubelet Problems
**Kubelet Service Issues**:
```bash
# Check kubelet status and logs
talosctl -n <IP> service kubelet
talosctl -n <IP> logs kubelet | tail -50
# Check kubelet configuration
talosctl -n <IP> get kubeletconfig -o yaml
# Check container runtime
talosctl -n <IP> service containerd
```
**Common Kubelet Issues**:
1. **Certificate problems**: Check certificate expiration and rotation
2. **Container runtime issues**: Verify containerd health
3. **Resource constraints**: Check memory and disk space
4. **Network connectivity**: Verify API server connectivity
### Container Runtime Issues
**Containerd Problems**:
```bash
# Check containerd service
talosctl -n <IP> service containerd
talosctl -n <IP> logs containerd
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Check containerd configuration
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
```
**Common Solutions**:
```bash
# Restart containerd
talosctl -n <IP> service containerd restart
# Check disk space for container images
talosctl -n <IP> df
# Clean up unused containers/images
# (This happens automatically via kubelet GC)
```
## Network Issues
### Network Connectivity Problems
**Node-to-Node Connectivity**:
```bash
# Test basic network connectivity
talosctl -n <IP1> interfaces
talosctl -n <IP1> get routes
# Test specific connectivity
talosctl -n <IP1> read /etc/resolv.conf
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
```
**DNS Resolution Issues**:
```bash
# Check DNS configuration
talosctl -n <IP> read /etc/resolv.conf
# Test DNS resolution from a throwaway pod
kubectl run -it --rm dnstest --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
```
### Discovery Service Issues
**Discovery Not Working**:
```bash
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Check affiliate discovery
talosctl get affiliates
talosctl get affiliates --namespace=cluster-raw
# Test discovery service connectivity
curl -v https://discovery.talos.dev/
```
**KubeSpan Issues** (if enabled):
```bash
# Check KubeSpan configuration
talosctl get kubespanconfig -o yaml
# Check peer status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Check WireGuard interface
talosctl -n <IP> interfaces | grep kubespan
```
## Upgrade Issues
### OS Upgrade Problems
**Upgrade Fails or Hangs**:
```bash
# Check upgrade status
talosctl -n <IP> dmesg | grep -i upgrade
talosctl -n <IP> events | grep -i upgrade
# Use staged upgrade for filesystem lock issues
talosctl upgrade --nodes <IP> --image <image> --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait --debug
```
**Boot Issues After Upgrade**:
```bash
# Check boot logs
talosctl -n <IP> dmesg | head -100
# System automatically rolls back on boot failure
# Check current version
talosctl -n <IP> version
# Manual rollback if needed
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrade Issues
**K8s Upgrade Failures**:
```bash
# Check upgrade status
talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
# Check individual component status
kubectl get pods -n kube-system
talosctl -n <IP> get apiserverconfig -o yaml
```
**Version Mismatch Issues**:
```bash
# Check version consistency
kubectl get nodes -o wide
talosctl -n <IP1>,<IP2>,<IP3> version
# Check component versions
kubectl get pods -n kube-system -o wide
```
## Resource and Performance Issues
### Memory and Storage Problems
**Out of Memory**:
```bash
# Check memory usage
talosctl -n <IP> memory
talosctl -n <IP> processes --sort-by=memory | head -20
# Check for memory pressure
kubectl describe node <node-name> | grep -A 10 Conditions
# Check OOM events
talosctl -n <IP> dmesg | grep -i "out of memory"
```
**Disk Space Issues**:
```bash
# Check disk usage
talosctl -n <IP> df
talosctl -n <IP> disks
# Check specific directories
talosctl -n <IP> list /var/lib/containerd
talosctl -n <IP> list /var/lib/etcd
# Clean up if needed (automatic GC usually handles this)
kubectl describe node <node-name> | grep -A 5 "Disk Pressure"
```
### Performance Issues
**Slow Cluster Response**:
```bash
# Check API server response time
time kubectl get nodes
# Check etcd performance
talosctl -n <IP> etcd status
# Look for high DB size vs IN USE ratio (fragmentation)
# Check system load
talosctl -n <IP> cpu
talosctl -n <IP> memory
```
**High CPU/Memory Usage**:
```bash
# Identify resource-heavy processes
talosctl -n <IP> processes --sort-by=cpu | head -10
talosctl -n <IP> processes --sort-by=memory | head -10
# Check cgroup usage
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
## Configuration Issues
### Machine Configuration Problems
**Invalid Configuration**:
```bash
# Validate configuration before applying
talosctl validate --config machineconfig.yaml --mode metal
# Check current configuration
talosctl -n <IP> get machineconfig -o yaml
# Compare with expected configuration
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
```
**Configuration Drift**:
```bash
# Check configuration version
talosctl -n <IP> get machineconfig
# Re-apply configuration if needed
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
talosctl apply-config --nodes <IP> --file corrected-config.yaml
```
## Emergency Procedures
### Node Unresponsive
**Complete Node Failure**:
1. **Physical access required**: Power cycle or hardware reset
2. **Check hardware**: Memory, disk, network interface status
3. **Boot issues**: May require bootable recovery media
**Partial Connectivity**:
```bash
# Try an alternate endpoint if the node has multiple reachable addresses
talosctl -e <alternate-ip> -n <IP> health
# Check if specific services are running
talosctl -n <IP> service machined
talosctl -n <IP> service apid
```
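If `talosctl` itself hangs, a raw TCP probe of apid's port (50000) helps distinguish a network problem from a stuck service. A sketch using bash's `/dev/tcp` pseudo-device; the loopback target below is only a placeholder:

```shell
# Sketch: probe a TCP port without talosctl. apid listens on 50000.
# The 127.0.0.1 target is a placeholder assumption, not a real node.
probe() {
  # bash opens a TCP connection via /dev/tcp; any failure means unreachable
  if timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}
probe 127.0.0.1 50000
```

An unreachable 50000 with a reachable host points at apid or the firewall rather than the network path.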
### Cluster-Wide Failures
**All Control Plane Nodes Down**:
1. **Assess scope**: Determine if data corruption or hardware failure
2. **Recovery strategy**: Use etcd backup if available
3. **Rebuild process**: May require complete cluster rebuild
**Follow disaster recovery procedures** as documented in etcd-management.md.
### Emergency Reset Procedures
**Single Node Reset**:
```bash
# Default reset (gracefully leaves etcd/cluster membership before wiping)
talosctl -n <IP> reset
# Forced reset (skips the graceful leave; use when the cluster is unhealthy)
talosctl -n <IP> reset --graceful=false --reboot
# Selective wipe (preserve STATE partition)
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
```
**Cluster Reset** (DESTRUCTIVE):
```bash
# Reset all nodes (DANGER: DATA LOSS)
for ip in IP1 IP2 IP3; do
talosctl -n $ip reset --graceful=false --reboot
done
```
## Monitoring and Alerting
### Key Metrics to Monitor
- Node resource usage (CPU, memory, disk)
- etcd health and performance
- Control plane component status
- Network connectivity
- Certificate expiration
- Discovery service connectivity
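Certificate expiration is the easiest of these to script externally once a certificate is exported: `openssl x509 -checkend` answers "does this cert survive the next N seconds". A sketch against a throwaway self-signed certificate, which stands in (as an assumption) for real cluster certificates:

```shell
# Sketch: alert when a certificate is near expiry.
# A 30-day self-signed cert is generated here as an assumed stand-in.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem -out /tmp/cert.pem \
  -days 30 -subj "/CN=expiry-demo" 2>/dev/null
# -checkend exits 0 only if the cert is still valid after the given seconds
if openssl x509 -in /tmp/cert.pem -noout -checkend $((60 * 24 * 3600)); then
  echo "certificate OK for the next 60 days"
else
  echo "certificate expires within 60 days"
fi
```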
### Log Locations for External Monitoring
- Kernel logs: `talosctl dmesg`
- Service logs: `talosctl logs <service>`
- System events: `talosctl events`
- Kubernetes events: `kubectl get events`
This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.