Paul Payne 8947da88eb Initial commit.

2025-10-11 18:08:04 +00:00

Talos Cluster Operations Guide

This guide covers essential cluster operations for Talos Linux v1.11 administrators.

Upgrading Operations

Talos OS Upgrades

Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image.

Upgrade Process

# Upgrade a single node
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x

# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage

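# Monitor upgrade progress by following kernel logs on the node
talosctl dmesg -f --nodes <IP>
# Or stream kernel logs as part of the upgrade command itself
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --wait --debug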

Upgrade Sequence

Node cordons itself in Kubernetes
Node drains existing workloads
Internal processes shut down
Filesystems unmount
Disk verification and image upgrade
Bootloader set to boot once with new image
Node reboots
Node rejoins cluster and uncordons
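
Because each node drains and reboots during this sequence, upgrades are usually applied one node at a time. The sketch below shows one way to script that. The `run` and `rolling_upgrade` helpers and the example addresses are hypothetical, and `DRY_RUN=1` (the default here) only prints the commands instead of executing them.

```shell
# Sketch only: helper names and node addresses are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rolling_upgrade IMAGE NODE...: upgrade nodes strictly one after another
rolling_upgrade() {
    image="$1"
    shift
    for node in "$@"; do
        # --wait blocks until the upgrade of this node has completed;
        # stop at the first failure so a bad image never reaches the next node
        run talosctl upgrade --nodes "$node" --image "$image" --wait || return 1
    done
}

# Example (dry run): prints one upgrade command per node
rolling_upgrade ghcr.io/siderolabs/installer:v1.11.x 10.0.0.11 10.0.0.12
```

Stopping at the first failure keeps the remaining nodes on the known-good image, which matters most for control plane nodes.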

Rollback

talosctl rollback --nodes <IP>

Kubernetes Upgrades

Kubernetes upgrades are separate from OS upgrades and non-disruptive.

Automated Upgrade (Recommended)

# Check what will be upgraded
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run

# Perform upgrade
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1

Manual Component Upgrades

For manual control, patch each component individually:

API Server:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'

Controller Manager:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'

Scheduler:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'

Kubelet:

talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
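
The four patches above differ only in the config path and image name, so they can be generated from one helper. This is a minimal sketch; `k8s_image_patch` is a hypothetical function name, and the paths and image repositories are copied from the commands above.

```shell
# k8s_image_patch COMPONENT VERSION
# Prints the JSON patch that sets the image for one Kubernetes component.
# Hypothetical helper; component keys mirror the machine config field names.
k8s_image_patch() {
    case "$1" in
        apiServer)
            path="/cluster/apiServer/image"
            image="registry.k8s.io/kube-apiserver" ;;
        controllerManager)
            path="/cluster/controllerManager/image"
            image="registry.k8s.io/kube-controller-manager" ;;
        scheduler)
            path="/cluster/scheduler/image"
            image="registry.k8s.io/kube-scheduler" ;;
        kubelet)
            path="/machine/kubelet/image"
            image="ghcr.io/siderolabs/kubelet" ;;
        *)
            echo "unknown component: $1" >&2
            return 1 ;;
    esac
    printf '[{"op": "replace", "path": "%s", "value": "%s:%s"}]\n' "$path" "$image" "$2"
}

# Example: print the scheduler patch for v1.34.1
k8s_image_patch scheduler v1.34.1
```

The output can be passed straight to the patch command, for example `talosctl -n <IP> patch mc --mode=no-reboot -p "$(k8s_image_patch scheduler v1.34.1)"`.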

Node Management

Adding Control Plane Nodes

Apply machine configuration to new node
Node automatically joins etcd cluster via control plane endpoint
Control plane components start automatically

Removing Control Plane Nodes

# Recommended approach - reset then delete
talosctl -n <IP.of.node.to.remove> reset
kubectl delete node <node-name>

Adding Worker Nodes

Apply worker machine configuration
Node automatically joins via bootstrap token

Removing Worker Nodes

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Reset before deleting the Node object so the kubelet cannot re-register it
talosctl -n <IP> reset
kubectl delete node <node-name>

Configuration Management

Applying Configuration Changes

# Apply config with automatic mode detection
talosctl apply-config --nodes <IP> --file <config.yaml>

# Apply with specific modes
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged

# Dry run to preview changes
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
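
A common pattern is to preview a change with `--dry-run` and then apply it with an explicit mode. The sketch below only prints the two commands so they can be reviewed first. `apply_config_cmd` is a hypothetical helper, and it assumes `auto` is the name of the default mode.

```shell
# apply_config_cmd NODE FILE [MODE]
# Prints a dry-run preview command followed by the real apply command.
# Hypothetical helper; it rejects modes not covered in this guide.
apply_config_cmd() {
    node="$1"
    file="$2"
    mode="${3:-auto}"   # assumption: "auto" selects automatic mode detection
    case "$mode" in
        auto|no-reboot|reboot|staged) ;;
        *)
            echo "unsupported mode: $mode" >&2
            return 1 ;;
    esac
    echo "talosctl apply-config --nodes $node --file $file --dry-run"
    echo "talosctl apply-config --nodes $node --file $file --mode $mode"
}

# Example: print the preview and apply commands for a no-reboot change
apply_config_cmd 10.0.0.11 worker.yaml no-reboot
```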

Configuration Patching

# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'

# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml

Retrieving Current Configuration

# Get the active machine configuration (resource ID v1alpha1)
talosctl -n <IP> get mc v1alpha1 -o yaml

# Get all machine configuration resources (mc is the short alias for machineconfig)
talosctl -n <IP> get machineconfig -o yaml

Cluster Health Monitoring

Node Status

# Check node status
talosctl -n <IP> get members
talosctl -n <IP> health

# Check system services
talosctl -n <IP> services
talosctl -n <IP> service <service-name>
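
To run the same status check across several nodes, the commands above can be wrapped in a loop. This is a sketch under assumptions: `run` and `check_nodes` are hypothetical helpers, the addresses are placeholders, and `DRY_RUN=1` (the default here) prints the commands instead of executing them.

```shell
# Sketch only: helper names and node addresses are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# check_nodes NODE...: query service status on every node, remembering failures
check_nodes() {
    failed=""
    for node in "$@"; do
        run talosctl -n "$node" services || failed="$failed $node"
    done
    if [ -n "$failed" ]; then
        echo "could not query:$failed" >&2
        return 1
    fi
}

# Example (dry run): prints one command per node
check_nodes 10.0.0.11 10.0.0.12
```

Collecting failures instead of stopping at the first one gives a full picture of which nodes are unreachable.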

Resource Monitoring

# System resources
talosctl -n <IP> memory
talosctl -n <IP> dashboard  # Interactive view of CPU, memory, and process usage
talosctl -n <IP> get disks

# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory

Log Monitoring

# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f  # Follow mode

# Service logs
talosctl -n <IP> logs <service-name>
talosctl -n <IP> logs kubelet

Control Plane Best Practices

Cluster Sizing Recommendations

3 nodes: Sufficient for most use cases, tolerates 1 node failure
5 nodes: Better availability (tolerates 2 node failures), higher resource cost
Avoid even numbers: 4 nodes tolerate only 1 failure (the same as 3), and 2 nodes tolerate none
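
These figures follow from etcd's majority rule: a cluster of N members needs floor(N/2) + 1 of them available to commit writes. The short calculation below is a sketch of that arithmetic; the function names `quorum` and `fault_tolerance` are illustrative.

```shell
# quorum N: smallest majority of N voting etcd members
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while a majority is still available
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the table for common control plane sizes
for n in 1 2 3 4 5; do
    printf '%s nodes: quorum %s, tolerates %s failure(s)\n' \
        "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

For 3, 4, and 5 nodes this reports a tolerance of 1, 1, and 2 failures, which is why a fourth node adds cost without adding availability.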

Node Replacement Strategy

Failed node: Remove first, then add replacement
Healthy node: Add replacement first, then remove old node

Performance Considerations

etcd write performance decreases as control plane nodes are added
A 5-node cluster commits about 5% fewer writes than a 3-node cluster
Scale control plane nodes vertically for performance rather than adding more nodes

Machine Configuration Versioning

Reproducible Configuration Workflow

Store only:

secrets.yaml (generated once at cluster creation)
Patch files (YAML/JSON patches describing differences from defaults)

Generate configs when needed:

# Generate fresh configs with existing secrets
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml

# Apply patches to generated configs
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml

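Regenerating configs from secrets and patches avoids drift between stored configs and live nodes, because automated upgrades modify the machine configuration on each node.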
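
The workflow above can be sketched as a small script. The `run` and `regen_configs` helpers, the cluster name, the endpoint, and the output directory are all hypothetical, and `DRY_RUN=1` (the default here) prints the commands instead of executing them.

```shell
# Sketch only: helper names, cluster name, and endpoint are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# regen_configs CLUSTER ENDPOINT OUTDIR [PATCH_FILE]
regen_configs() {
    cluster="$1"
    endpoint="$2"
    outdir="$3"
    patch="${4:-patch.yaml}"
    # Secrets are created once and reused for every later regeneration
    if [ ! -f secrets.yaml ]; then
        run talosctl gen secrets -o secrets.yaml || return 1
    fi
    # --force overwrites configs left over from a previous run
    run talosctl gen config "$cluster" "$endpoint" \
        --with-secrets secrets.yaml \
        --config-patch "@$patch" \
        --output "$outdir" --force
}

# Example (dry run)
regen_configs my-cluster https://10.0.0.10:6443 ./generated
```

Only `secrets.yaml` and the patch file need to be kept under version control (with the secrets file stored securely); the generated output directory can be treated as disposable.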

Troubleshooting Common Issues

Upgrade Failures

Invalid installer image: Check image reference and network connectivity
Filesystem unmount failure: Use --stage flag
Boot failure: System automatically rolls back to previous version
Workload issues: Use talosctl rollback to revert

Node Join Issues

Verify network connectivity to control plane endpoint
Check discovery service configuration
Validate machine configuration syntax
Ensure bootstrap process completed on initial control plane node

Control Plane Quorum Loss

Identify healthy nodes with talosctl etcd status
Follow disaster recovery procedures if quorum cannot be restored
Use etcd snapshots for cluster recovery
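
Snapshots are only useful for recovery if they are taken while etcd is still healthy. The sketch below checks etcd status and then saves a snapshot from one control plane node. `run` and `etcd_backup` are hypothetical helpers, the address and file name are placeholders, and `DRY_RUN=1` (the default here) prints the commands instead of executing them.

```shell
# Sketch only: helper names, node address, and file name are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# etcd_backup NODE FILE: inspect etcd on NODE, then save a snapshot to FILE
etcd_backup() {
    node="$1"
    file="$2"
    run talosctl -n "$node" etcd status || return 1
    run talosctl -n "$node" etcd snapshot "$file"
}

# Example (dry run)
etcd_backup 10.0.0.11 ./etcd.snapshot

# During disaster recovery, a control plane node can later be bootstrapped
# from the saved snapshot:
#   talosctl -n <IP> bootstrap --recover-from=./etcd.snapshot
```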

Security Considerations

Certificate Rotation

Talos automatically rotates server-side certificates, but client certificates in talosconfig and kubeconfig must be renewed by the operator. Monitor expiration:

talosctl -n <IP> get secrets

Pod Security

Control plane nodes are tainted by default to prevent workload scheduling. This protects:

Control plane from resource starvation
Credentials from workload exposure

Network Security

All API communication uses mutual TLS (mTLS)
Discovery service data is encrypted before transmission
KubeSpan, when enabled, secures node-to-node mesh networking with WireGuard
