Files
wild-cloud-dev/ai/talos-v1.11/cluster-operations.md
2025-10-11 18:08:04 +00:00

6.8 KiB

Talos Cluster Operations Guide

This guide covers essential cluster operations for Talos Linux v1.11 administrators.

Upgrading Operations

Talos OS Upgrades

Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image.

Upgrade Process

# Upgrade a single node
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x

# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage

# Monitor upgrade progress
talosctl dmesg -f
talosctl upgrade --wait --debug

Upgrade Sequence

  1. Node cordons itself in Kubernetes
  2. Node drains existing workloads
  3. Internal processes shut down
  4. Filesystems unmount
  5. Disk verification and image upgrade
  6. Bootloader set to boot once with new image
  7. Node reboots
  8. Node rejoins cluster and uncordons

Rollback

talosctl rollback --nodes <IP>

Kubernetes Upgrades

Kubernetes upgrades are separate from OS upgrades and non-disruptive.

# Check what will be upgraded
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run

# Perform upgrade
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1

Manual Component Upgrades

For manual control, patch each component individually:

API Server:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'

Controller Manager:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'

Scheduler:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'

Kubelet:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'

Node Management

Adding Control Plane Nodes

  1. Apply machine configuration to new node
  2. Node automatically joins etcd cluster via control plane endpoint
  3. Control plane components start automatically

Removing Control Plane Nodes

# Recommended approach - reset then delete
talosctl -n <IP.of.node.to.remove> reset
kubectl delete node <node-name>

Adding Worker Nodes

  1. Apply worker machine configuration
  2. Node automatically joins via bootstrap token

Removing Worker Nodes

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
talosctl -n <IP> reset

Configuration Management

Applying Configuration Changes

# Apply config with automatic mode detection
talosctl apply-config --nodes <IP> --file <config.yaml>

# Apply with specific modes
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged

# Dry run to preview changes
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run

Configuration Patching

# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'

# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml

Retrieving Current Configuration

# Get machine configuration
talosctl -n <IP> get mc v1alpha1 -o yaml

# Get effective configuration
talosctl -n <IP> get machineconfig -o yaml

Cluster Health Monitoring

Node Status

# Check node status
talosctl -n <IP> get members
talosctl -n <IP> health

# Check system services
talosctl -n <IP> services
talosctl -n <IP> service <service-name>

Resource Monitoring

# System resources
talosctl -n <IP> memory
talosctl -n <IP> cpu
talosctl -n <IP> disks

# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory

Log Monitoring

# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f  # Follow mode

# Service logs
talosctl -n <IP> logs <service-name>
talosctl -n <IP> logs kubelet

Control Plane Best Practices

Cluster Sizing Recommendations

  • 3 nodes: Sufficient for most use cases, tolerates 1 node failure
  • 5 nodes: Better availability (tolerates 2 node failures), higher resource cost
  • Avoid even numbers: 2 or 4 nodes provide worse availability than odd numbers

Node Replacement Strategy

  • Failed node: Remove first, then add replacement
  • Healthy node: Add replacement first, then remove old node

Performance Considerations

  • etcd performance decreases as cluster scales
  • 5-node cluster commits ~5% fewer writes than 3-node cluster
  • Vertically scale nodes for performance, don't add more nodes

Machine Configuration Versioning

Reproducible Configuration Workflow

Store only:

  • secrets.yaml (generated once at cluster creation)
  • Patch files (YAML/JSON patches describing differences from defaults)

Generate configs when needed:

# Generate fresh configs with existing secrets
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml

# Apply patches to generated configs
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml

This prevents configuration drift after automated upgrades.

Troubleshooting Common Issues

Upgrade Failures

  • Invalid installer image: Check image reference and network connectivity
  • Filesystem unmount failure: Use --stage flag
  • Boot failure: System automatically rolls back to previous version
  • Workload issues: Use talosctl rollback to revert

Node Join Issues

  • Verify network connectivity to control plane endpoint
  • Check discovery service configuration
  • Validate machine configuration syntax
  • Ensure bootstrap process completed on initial control plane node

Control Plane Quorum Loss

  • Identify healthy nodes with talosctl etcd status
  • Follow disaster recovery procedures if quorum cannot be restored
  • Use etcd snapshots for cluster recovery

Security Considerations

Certificate Rotation

Talos automatically rotates certificates, but monitor expiration:

talosctl -n <IP> get secrets

Pod Security

Control plane nodes are tainted by default to prevent workload scheduling. This protects:

  • Control plane from resource starvation
  • Credentials from workload exposure

Network Security

  • All API communication uses mutual TLS (mTLS)
  • Discovery service data is encrypted before transmission
  • WireGuard (KubeSpan) provides mesh networking security