6.8 KiB
6.8 KiB
Talos Cluster Operations Guide
This guide covers essential cluster operations for Talos Linux v1.11 administrators.
Upgrading Operations
Talos OS Upgrades
Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image.
Upgrade Process
# Upgrade a single node
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl dmesg -f
talosctl upgrade --wait --debug
Upgrade Sequence
- Node cordons itself in Kubernetes
- Node drains existing workloads
- Internal processes shut down
- Filesystems unmount
- Disk verification and image upgrade
- Bootloader set to boot once with new image
- Node reboots
- Node rejoins cluster and uncordons
Rollback
talosctl rollback --nodes <IP>
Kubernetes Upgrades
Kubernetes upgrades are separate from OS upgrades and non-disruptive.
Automated Upgrade (Recommended)
# Check what will be upgraded
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
# Perform upgrade
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
Manual Component Upgrades
For manual control, patch each component individually:
API Server:
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'
Controller Manager:
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'
Scheduler:
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'
Kubelet:
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
Node Management
Adding Control Plane Nodes
- Apply machine configuration to new node
- Node automatically joins etcd cluster via control plane endpoint
- Control plane components start automatically
Removing Control Plane Nodes
# Recommended approach - reset then delete
talosctl -n <IP.of.node.to.remove> reset
kubectl delete node <node-name>
Adding Worker Nodes
- Apply worker machine configuration
- Node automatically joins via bootstrap token
Removing Worker Nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
talosctl -n <IP> reset
Configuration Management
Applying Configuration Changes
# Apply config with automatic mode detection
talosctl apply-config --nodes <IP> --file <config.yaml>
# Apply with specific modes
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged
# Dry run to preview changes
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
Configuration Patching
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml
Retrieving Current Configuration
# Get machine configuration
talosctl -n <IP> get mc v1alpha1 -o yaml
# Get effective configuration
talosctl -n <IP> get machineconfig -o yaml
Cluster Health Monitoring
Node Status
# Check node status
talosctl -n <IP> get members
talosctl -n <IP> health
# Check system services
talosctl -n <IP> services
talosctl -n <IP> service <service-name>
Resource Monitoring
# System resources
talosctl -n <IP> memory
talosctl -n <IP> cpu
talosctl -n <IP> disks
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
Log Monitoring
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
# Service logs
talosctl -n <IP> logs <service-name>
talosctl -n <IP> logs kubelet
Control Plane Best Practices
Cluster Sizing Recommendations
- 3 nodes: Sufficient for most use cases, tolerates 1 node failure
- 5 nodes: Better availability (tolerates 2 node failures), higher resource cost
- Avoid even numbers: 2 or 4 nodes provide worse availability than odd numbers
Node Replacement Strategy
- Failed node: Remove first, then add replacement
- Healthy node: Add replacement first, then remove old node
Performance Considerations
- etcd performance decreases as cluster scales
- 5-node cluster commits ~5% fewer writes than 3-node cluster
- Vertically scale nodes for performance, don't add more nodes
Machine Configuration Versioning
Reproducible Configuration Workflow
Store only:
secrets.yaml(generated once at cluster creation)- Patch files (YAML/JSON patches describing differences from defaults)
Generate configs when needed:
# Generate fresh configs with existing secrets
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml
# Apply patches to generated configs
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml
This prevents configuration drift after automated upgrades.
Troubleshooting Common Issues
Upgrade Failures
- Invalid installer image: Check image reference and network connectivity
- Filesystem unmount failure: Use
--stageflag - Boot failure: System automatically rolls back to previous version
- Workload issues: Use
talosctl rollbackto revert
Node Join Issues
- Verify network connectivity to control plane endpoint
- Check discovery service configuration
- Validate machine configuration syntax
- Ensure bootstrap process completed on initial control plane node
Control Plane Quorum Loss
- Identify healthy nodes with
talosctl etcd status - Follow disaster recovery procedures if quorum cannot be restored
- Use etcd snapshots for cluster recovery
Security Considerations
Certificate Rotation
Talos automatically rotates certificates, but monitor expiration:
talosctl -n <IP> get secrets
Pod Security
Control plane nodes are tainted by default to prevent workload scheduling. This protects:
- Control plane from resource starvation
- Credentials from workload exposure
Network Security
- All API communication uses mutual TLS (mTLS)
- Discovery service data is encrypted before transmission
- WireGuard (KubeSpan) provides mesh networking security