Settle on v1 setup method. Test run completed successfully from bootstrap to service setup.

- Refactored dnsmasq configuration and scripts for improved variable handling and clarity.
- Updated dnsmasq configuration files to use direct variable references instead of data-source functions for better readability.
- Modified setup scripts to ensure they run from the correct environment and directory, checking for the `WC_HOME` variable.
- Changed paths in the README and scripts to reflect the new directory structure.
- Enhanced error handling in setup scripts to provide clearer guidance on required configuration.
- Adjusted kernel and initramfs URLs in `boot.ipxe` to use the updated variable references.
# Cluster Node Setup

This directory contains automation for setting up Talos Kubernetes cluster nodes with static IP configuration. Cluster node setup is still evolving; we have working clusters built with each of the methods below and are standardizing on the automated Talos workflow.

## Hardware Detection and Setup (Recommended)

The automated setup discovers hardware configuration from nodes in maintenance mode and generates machine configurations with the correct interface names and disk paths.

### Prerequisites

1. `source .env`
2. Boot the nodes with the Talos ISO in maintenance mode
3. Make sure the nodes are reachable on the network
### Hardware Discovery Workflow

```bash
# ONE-TIME CLUSTER INITIALIZATION (run once per cluster)
./init-cluster.sh

# FOR EACH CONTROL PLANE NODE:
# 1. Boot the node with the Talos ISO (it will get a DHCP IP in maintenance mode)

# 2. Detect hardware and update config.yaml
./detect-node-hardware.sh <maintenance-ip> <node-number>
# Example: the node boots at 192.168.8.168; register it as node 1
./detect-node-hardware.sh 192.168.8.168 1

# 3. Generate machine configs for the registered nodes
./generate-machine-configs.sh

# 4. Apply the configuration - the node will reboot with its static IP
talosctl apply-config --insecure -n 192.168.8.168 --file final/controlplane-node-1.yaml

# 5. Wait for the reboot; the node should come up at its target static IP (192.168.8.31)

# Repeat steps 1-5 for each additional control plane node
```
The `detect-node-hardware.sh` script will:

- Connect to nodes in maintenance mode via talosctl
- Discover the active ethernet interfaces (e.g., `enp4s0` instead of a hardcoded `eth0`)
- Discover the available installation disks (>10GB)
- Update `config.yaml` with per-node hardware configuration
- Print the next steps for machine config generation

The `init-cluster.sh` script will:

- Generate the Talos cluster secrets and base configurations (once per cluster)
- Set up the talosctl context with the cluster certificates
- Configure the VIP endpoint for cluster communication

The `generate-machine-configs.sh` script will:

- Check which nodes have been hardware-detected
- Compile the network configuration templates with the discovered hardware settings
- Create final machine configurations for the registered nodes only
- Include the system extensions required by Longhorn (iscsi-tools, util-linux-tools)
- Update the talosctl context with the registered node IPs

### Cluster Bootstrap

After all control plane nodes are configured with static IPs:

```bash
# Bootstrap the cluster using any control node
talosctl bootstrap --nodes 192.168.8.31 --endpoint 192.168.8.31

# Get kubeconfig
talosctl kubeconfig

# Verify cluster is ready
kubectl get nodes
```
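Once the bootstrap call returns, it can take a minute or two for etcd and the API server to settle. A quick check that the control plane formed correctly, using the example IPs above:

```bash
# List etcd members once the control plane has formed
talosctl -n 192.168.8.31 etcd members
```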
## k3s cluster node setup

K3s provides a fully-compliant Kubernetes distribution in a small footprint.

To set up control nodes:

```bash
# Install K3s without the default load balancer (we'll use MetalLB)
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode=644 --disable servicelb --disable metallb

# Set up kubectl configuration
mkdir -p ~/.kube
sudo cat /etc/rancher/k3s/k3s.yaml > ~/.kube/config
chmod 600 ~/.kube/config
```

Set up the infrastructure services after these are running, then you can add more worker nodes with:

```bash
# On your master node, get the node token
NODE_TOKEN=$(sudo cat /var/lib/rancher/k3s/server/node-token)
MASTER_IP=192.168.8.222

# On each new node, join the cluster
curl -sfL https://get.k3s.io | K3S_URL=https://$MASTER_IP:6443 K3S_TOKEN=$NODE_TOKEN sh -
```

## Manual Talos cluster node setup

This is the original, fully manual method for setting up Talos cluster nodes. Copy this entire directory to your personal cloud folder and modify it as necessary as you install. We suggest putting it in `cluster/bootstrap`.

```bash
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Install talosctl
curl -sL https://talos.dev/install | sh

# Reserve these addresses in your LAN router (which is also your DHCP server).
CLUSTER_NAME=test-cluster
VIP=192.168.8.20  # Must be outside the DHCP range

# Boot your nodes with the ISO and put their IP addresses here. Pin them in DHCP.
# Nodes must all be on the same switch.
# TODO: How to set these static on boot?
CONTROL_NODE_1=192.168.8.21
CONTROL_NODE_2=192.168.8.22
CONTROL_NODE_3=192.168.8.23

# Generate cluster config files (including PKI and tokens)
cd generated
talosctl gen secrets -o secrets.yaml
talosctl gen config --with-secrets secrets.yaml $CLUSTER_NAME https://$VIP:6443
talosctl config merge ./talosconfig
cd ..

# If the disk you want to install Talos on isn't /dev/sda, update the disk in
# patch/controlplane.yaml and patch/worker.yaml. If a previous install attempt
# failed because /dev/sda could not be found, list the disks available on the
# node with:
#
#   talosctl -n <node-ip> get disks --insecure
#
# See https://www.talos.dev/v1.10/talos-guides/configuration/patching/
talosctl machineconfig patch generated/controlplane.yaml --patch @patch/controlplane.yaml -o final/controlplane.yaml
talosctl machineconfig patch generated/worker.yaml --patch @patch/worker.yaml -o final/worker.yaml

# Apply the control plane config
talosctl apply-config --insecure -n $CONTROL_NODE_1,$CONTROL_NODE_2,$CONTROL_NODE_3 --file final/controlplane.yaml

# Bootstrap the cluster on a control plane node
talosctl bootstrap -n $CONTROL_NODE_1

# Merge the new cluster information into your kubeconfig
talosctl kubeconfig

# You are now ready to use both `talosctl` and `kubectl` against your new cluster.
kubectl get nodes
```
## Complete Example
Here's a complete example of setting up a 3-node control plane:
```bash
# CLUSTER INITIALIZATION (once per cluster)
./init-cluster.sh
# NODE 1
# Boot node with Talos ISO, it gets DHCP IP 192.168.8.168
./detect-node-hardware.sh 192.168.8.168 1
./generate-machine-configs.sh
talosctl apply-config --insecure -n 192.168.8.168 --file final/controlplane-node-1.yaml
# Node reboots and comes up at 192.168.8.31
# NODE 2
# Boot second node with Talos ISO, it gets DHCP IP 192.168.8.169
./detect-node-hardware.sh 192.168.8.169 2
./generate-machine-configs.sh
talosctl apply-config --insecure -n 192.168.8.169 --file final/controlplane-node-2.yaml
# Node reboots and comes up at 192.168.8.32
# NODE 3
# Boot third node with Talos ISO, it gets DHCP IP 192.168.8.170
./detect-node-hardware.sh 192.168.8.170 3
./generate-machine-configs.sh
talosctl apply-config --insecure -n 192.168.8.170 --file final/controlplane-node-3.yaml
# Node reboots and comes up at 192.168.8.33
# CLUSTER BOOTSTRAP
talosctl bootstrap -n 192.168.8.31
talosctl kubeconfig
kubectl get nodes
```
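Once `kubectl get nodes` shows all three control plane nodes, Talos' built-in health checks give a fuller readiness report. A minimal check against the first node (a standard talosctl command; it probes the rest of the cluster from there):

```bash
# Run the Talos cluster health checks from the first control plane node
talosctl -n 192.168.8.31 health
```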
## Configuration Details
### Per-Node Configuration
Each control plane node has its own configuration block in `config.yaml`:
```yaml
cluster:
  nodes:
    control:
      vip: 192.168.8.30
      node1:
        ip: 192.168.8.31
        interface: enp4s0  # Discovered automatically
        disk: /dev/sdb     # Selected during hardware detection
      node2:
        ip: 192.168.8.32
        # interface and disk added after hardware detection
      node3:
        ip: 192.168.8.33
        # interface and disk added after hardware detection
```
Worker nodes use DHCP by default. You can use the same hardware detection process for worker nodes if static IPs are needed.
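For a worker that keeps DHCP, the flow is shorter: boot it with the Talos ISO and apply the compiled worker config directly. A minimal sketch, assuming the worker boots at the hypothetical DHCP address `192.168.8.180` and that the compiled config lands at `final/worker.yaml`:

```bash
# Apply the worker machine config to a node booted in maintenance mode
talosctl apply-config --insecure -n 192.168.8.180 --file final/worker.yaml
```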
## Talosconfig Management
### Context Naming and Conflicts
When running `talosctl config merge ./generated/talosconfig`, if a context with the same name already exists, talosctl will create an enumerated version (e.g., `demo-cluster-2`).
**For a clean setup:**
- Delete existing contexts before merging: `talosctl config contexts` then `talosctl config context <name> --remove`
- Or use `--force` to overwrite: `talosctl config merge ./generated/talosconfig --force`
**Recommended approach for new clusters:**
```bash
# Remove old context if rebuilding cluster
talosctl config context demo-cluster --remove || true
# Merge new configuration
talosctl config merge ./generated/talosconfig
talosctl config endpoint 192.168.8.30
talosctl config node 192.168.8.31 # Add nodes as they are registered
```
### Context Configuration Timeline
1. **After first node hardware detection**: Merge talosconfig and set endpoint/first node
2. **After additional nodes**: Add them to the existing context with `talosctl config node <ip1> <ip2> <ip3>`
3. **Before cluster bootstrap**: Ensure all control plane nodes are in the node list
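Using the example addresses from this README, that timeline maps onto commands like these (a sketch, not a script):

```bash
# 1. After the first node is registered
talosctl config merge ./generated/talosconfig
talosctl config endpoint 192.168.8.30
talosctl config node 192.168.8.31

# 2. After additional nodes are registered, set the full node list
talosctl config node 192.168.8.31 192.168.8.32 192.168.8.33

# 3. Before bootstrap, confirm the context looks right
talosctl config contexts
```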
### System Extensions
All nodes include:
- `siderolabs/iscsi-tools`: Required for Longhorn storage
- `siderolabs/util-linux-tools`: Utility tools for storage operations
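After a node has installed and rebooted, you can confirm the extensions actually made it into the image (a standard talosctl resource query; the exact output format depends on the Talos version):

```bash
# Should list iscsi-tools and util-linux-tools among the extension statuses
talosctl -n 192.168.8.31 get extensions
```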
### Hardware Detection
The `detect-node-hardware.sh` script automatically discovers:
- **Network interfaces**: Finds active ethernet interfaces (no more hardcoded `eth0`)
- **Installation disks**: Lists available disks >10GB for interactive selection
- **Per-node settings**: Updates `config.yaml` with hardware-specific configuration
This eliminates the need to manually configure hardware settings and handles different hardware configurations across nodes.
### Template Structure
Configuration templates are stored in `patch.templates/` and use gomplate syntax:
- `controlplane-node-1.yaml`: Template for first control plane node
- `controlplane-node-2.yaml`: Template for second control plane node
- `controlplane-node-3.yaml`: Template for third control plane node
- `worker.yaml`: Template for worker nodes
Templates use per-node variables from `config.yaml`:
- `{{ .cluster.nodes.control.node1.ip }}`
- `{{ .cluster.nodes.control.node1.interface }}`
- `{{ .cluster.nodes.control.node1.disk }}`
- `{{ .cluster.nodes.control.vip }}`
The `wild-compile-template-dir` command processes all templates and outputs compiled configurations to the `patch/` directory.
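For illustration, a control plane patch template might reference these values roughly as follows. This is only a sketch of how the gomplate variables plug into Talos machine-config fields, not the actual contents of `patch.templates/controlplane-node-1.yaml`:

```yaml
# Sketch only - the real templates in patch.templates/ may differ
machine:
  install:
    disk: {{ .cluster.nodes.control.node1.disk }}
  network:
    interfaces:
      - interface: {{ .cluster.nodes.control.node1.interface }}
        dhcp: false
        addresses:
          - {{ .cluster.nodes.control.node1.ip }}/24
        vip:
          ip: {{ .cluster.nodes.control.vip }}
```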
## Troubleshooting
### Hardware Detection Issues
```bash
# Check if node is accessible in maintenance mode
talosctl -n <NODE_IP> version --insecure
# View available network interfaces
talosctl -n <NODE_IP> get links --insecure
# View available disks
talosctl -n <NODE_IP> get disks --insecure
```
### Manual Hardware Discovery
If the automatic detection fails, you can manually inspect hardware:
```bash
# Find active ethernet interfaces
talosctl -n <NODE_IP> get links --insecure -o json | jq -s '.[] | select(.spec.operationalState == "up" and .spec.type == "ether" and .metadata.id != "lo") | .metadata.id'
# Find suitable installation disks
talosctl -n <NODE_IP> get disks --insecure -o json | jq -s '.[] | select(.spec.size > 10000000000) | .metadata.id'
```
### Node Status
```bash
# View machine configuration (only works after config is applied)
talosctl -n <NODE_IP> get machineconfig
```