# Talos Cluster Operations Guide

This guide covers essential cluster operations for Talos Linux v1.11 administrators.

## Upgrading Operations

### Talos OS Upgrades

Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image.

#### Upgrade Process

```bash
# Upgrade a single node
talosctl upgrade --nodes <node-ip> --image ghcr.io/siderolabs/installer:v1.11.x

# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <node-ip> --image ghcr.io/siderolabs/installer:v1.11.x --stage

# Monitor upgrade progress
talosctl --nodes <node-ip> dmesg -f
talosctl upgrade --nodes <node-ip> --image ghcr.io/siderolabs/installer:v1.11.x --wait --debug
```

#### Upgrade Sequence

1. Node cordons itself in Kubernetes
2. Node drains existing workloads
3. Internal processes shut down
4. Filesystems unmount
5. Disk verification and image upgrade
6. Bootloader set to boot once with new image
7. Node reboots
8. Node rejoins cluster and uncordons

#### Rollback

```bash
talosctl rollback --nodes <node-ip>
```

### Kubernetes Upgrades

Kubernetes upgrades are separate from OS upgrades and are non-disruptive.

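
Before choosing a `--to` target, it can help to check how many minor versions separate the running and target Kubernetes releases, since upstream Kubernetes guidance is to upgrade one minor version at a time. The sketch below is a hypothetical helper (`minor_gap` is not part of `talosctl`); it assumes both versions share the same major version, and the version strings are illustrative only.

```bash
# Hypothetical helper: prints how many minor versions separate two
# Kubernetes versions written as vMAJOR.MINOR.PATCH (the leading "v" is optional).
# Assumption: both versions have the same major version.
minor_gap() {
  local current="${1#v}"
  local target="${2#v}"
  local cur_minor tgt_minor
  cur_minor="$(echo "$current" | cut -d. -f2)"
  tgt_minor="$(echo "$target" | cut -d. -f2)"
  echo $(( tgt_minor - cur_minor ))
}

minor_gap v1.33.4 v1.34.1   # prints 1
minor_gap v1.32.0 v1.34.1   # prints 2
```

A gap greater than 1 suggests upgrading through the intermediate minor release first.
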
#### Automated Upgrade (Recommended)

```bash
# Check what will be upgraded
talosctl --nodes <controlplane-ip> upgrade-k8s --to v1.34.1 --dry-run

# Perform upgrade
talosctl --nodes <controlplane-ip> upgrade-k8s --to v1.34.1
```

#### Manual Component Upgrades

For manual control, patch each component individually:

**API Server:**

```bash
talosctl -n <controlplane-ip> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'
```

**Controller Manager:**

```bash
talosctl -n <controlplane-ip> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'
```

**Scheduler:**

```bash
talosctl -n <controlplane-ip> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'
```

**Kubelet:**

```bash
talosctl -n <node-ip> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
```

## Node Management

### Adding Control Plane Nodes

1. Apply machine configuration to new node
2. Node automatically joins etcd cluster via control plane endpoint
3. Control plane components start automatically

### Removing Control Plane Nodes

```bash
# Recommended approach - reset then delete
talosctl -n <node-ip> reset
kubectl delete node <node-name>
```

### Adding Worker Nodes

1. Apply worker machine configuration
2. Node automatically joins via bootstrap token

### Removing Worker Nodes

```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
talosctl -n <node-ip> reset
```

## Configuration Management

### Applying Configuration Changes

```bash
# Apply config with automatic mode detection
talosctl apply-config --nodes <node-ip> --file <config-file>

# Apply with specific modes
talosctl apply-config --nodes <node-ip> --file <config-file> --mode no-reboot
talosctl apply-config --nodes <node-ip> --file <config-file> --mode reboot
talosctl apply-config --nodes <node-ip> --file <config-file> --mode staged

# Dry run to preview changes
talosctl apply-config --nodes <node-ip> --file <config-file> --dry-run
```

### Configuration Patching

```bash
# Patch machine configuration
talosctl -n <node-ip> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'

# Patch with file
talosctl -n <node-ip> patch mc --patch @patch.yaml
```

### Retrieving Current Configuration

```bash
# Get machine configuration
talosctl -n <node-ip> get mc v1alpha1 -o yaml

# Get effective configuration
talosctl -n <node-ip> get machineconfig -o yaml
```

## Cluster Health Monitoring

### Node Status

```bash
# Check node status
talosctl -n <node-ip> get members
talosctl -n <node-ip> health

# Check system services
talosctl -n <node-ip> services
talosctl -n <node-ip> service <service-name>
```

### Resource Monitoring

```bash
# System resources
talosctl -n <node-ip> memory
talosctl -n <node-ip> cpu
talosctl -n <node-ip> disks

# Process information
talosctl -n <node-ip> processes
talosctl -n <node-ip> cgroups --preset memory
```

### Log Monitoring

```bash
# Kernel logs
talosctl -n <node-ip> dmesg
talosctl -n <node-ip> dmesg -f  # Follow mode

# Service logs
talosctl -n <node-ip> logs <service-name>
talosctl -n <node-ip> logs kubelet
```

## Control Plane Best Practices

### Cluster Sizing Recommendations

- **3 nodes**: Sufficient for most use cases, tolerates 1 node failure
- **5 nodes**: Better availability (tolerates 2 node failures), higher resource cost
- **Avoid even numbers**: 2 or 4 nodes provide worse availability than odd numbers (a 2-node cluster tolerates no failures, and a 4-node cluster still tolerates only 1)

### Node Replacement Strategy

- **Failed node**: Remove first, then add replacement
- **Healthy node**: Add replacement first, then remove old node

### Performance Considerations

- etcd performance decreases as more control plane nodes are added
- A 5-node cluster commits ~5% fewer writes than a 3-node cluster
- Scale nodes vertically for performance rather than adding more nodes

## Machine Configuration Versioning

### Reproducible Configuration Workflow

Store only:

- `secrets.yaml` (generated once at cluster creation)
- Patch files (YAML/JSON patches describing differences from defaults)

Generate configs when needed:

```bash
# Generate fresh configs with existing secrets
talosctl gen config --with-secrets secrets.yaml <cluster-name> <cluster-endpoint>

# Apply patches to generated configs
talosctl gen config --with-secrets secrets.yaml --config-patch @patch.yaml <cluster-name> <cluster-endpoint>
```

This prevents configuration drift after automated upgrades.

## Troubleshooting Common Issues

### Upgrade Failures

- **Invalid installer image**: Check image reference and network connectivity
- **Filesystem unmount failure**: Use `--stage` flag
- **Boot failure**: System automatically rolls back to previous version
- **Workload issues**: Use `talosctl rollback` to revert

### Node Join Issues

- Verify network connectivity to control plane endpoint
- Check discovery service configuration
- Validate machine configuration syntax
- Ensure bootstrap process completed on initial control plane node

### Control Plane Quorum Loss

- Identify healthy nodes with `talosctl etcd status`
- Follow disaster recovery procedures if quorum cannot be restored
- Use etcd snapshots for cluster recovery

## Security Considerations

### Certificate Rotation

Talos automatically rotates certificates, but monitor expiration:

```bash
talosctl -n <node-ip> get secrets
```

### Pod Security

Control plane nodes are tainted by default to prevent workload scheduling.
This protects:

- Control plane from resource starvation
- Credentials from workload exposure

### Network Security

- All API communication uses mutual TLS (mTLS)
- Discovery service data is encrypted before transmission
- WireGuard (KubeSpan) provides mesh networking security
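
As a closing reference for the cluster sizing recommendations in this guide: they follow from etcd's majority quorum rule, under which a cluster of n members needs floor(n/2) + 1 of them available to commit writes. The following is a minimal sketch of that arithmetic; the helper names `quorum` and `fault_tolerance` are hypothetical and not part of any Talos or etcd tooling.

```bash
# Hypothetical helpers illustrating etcd's majority quorum arithmetic.

# Members that must be available for the cluster to commit writes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while quorum is still met.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4 5; do
  echo "$n members: quorum $(quorum "$n"), tolerates $(fault_tolerance "$n") failure(s)"
done
```

The loop shows that 4 members tolerate the same single failure as 3, and that 2 members tolerate none, which is why odd sizes are preferred.
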