Settle on v1 setup method. Test run completed successfully from bootstrap to service setup.

- Refactored dnsmasq configuration and scripts for improved variable handling and clarity.
- Updated dnsmasq configuration files to use direct variable references instead of data-source functions for better readability.
- Modified setup scripts to ensure they run from the correct environment and directory, checking for the `WC_HOME` variable.
- Changed paths in the README and scripts to reflect the new directory structure.
- Enhanced error handling in setup scripts to provide clearer guidance on required configuration.
- Adjusted kernel and initramfs URLs in `boot.ipxe` to use the updated variable references.
# Cluster Node Setup

This directory contains automation for setting up Talos Kubernetes cluster nodes with static IP configuration. Cluster node setup is still evolving; we have working clusters built with each of the methods below and are standardizing on the automated Talos workflow.

## Hardware Detection and Setup (Recommended)

The automated setup discovers hardware configuration from nodes in maintenance mode and generates machine configurations with the correct interface names and disk paths.

### Prerequisites

1. `source .env`
2. Boot the nodes with the Talos ISO in maintenance mode
3. Make sure the nodes are reachable on the network
### Hardware Discovery Workflow

```bash
# ONE-TIME CLUSTER INITIALIZATION (run once per cluster)
./init-cluster.sh

# FOR EACH CONTROL PLANE NODE:
# 1. Boot the node with the Talos ISO (it will get a DHCP IP in maintenance mode)

# 2. Detect hardware and update config.yaml
./detect-node-hardware.sh <maintenance-ip> <node-number>
# Example: the node boots at 192.168.8.168; register it as node 1
./detect-node-hardware.sh 192.168.8.168 1

# 3. Generate machine configs for the registered nodes
./generate-machine-configs.sh

# 4. Apply the configuration - the node will reboot with its static IP
talosctl apply-config --insecure -n 192.168.8.168 --file final/controlplane-node-1.yaml

# 5. Wait for the reboot; the node should come up at its target static IP (192.168.8.31)

# Repeat steps 1-5 for each additional control plane node
```
The `detect-node-hardware.sh` script will:

- Connect to nodes in maintenance mode via talosctl
- Discover the active ethernet interfaces (e.g., `enp4s0` instead of a hardcoded `eth0`)
- Discover the available installation disks (>10GB)
- Update `config.yaml` with per-node hardware configuration
- Print the next steps for machine config generation

The `init-cluster.sh` script will:

- Generate the Talos cluster secrets and base configurations (once per cluster)
- Set up the talosctl context with the cluster certificates
- Configure the VIP endpoint for cluster communication

The `generate-machine-configs.sh` script will:

- Check which nodes have been hardware-detected
- Compile the network configuration templates with the discovered hardware settings
- Create final machine configurations for the registered nodes only
- Include the system extensions required by Longhorn (iscsi-tools, util-linux-tools)
- Update the talosctl context with the registered node IPs

### Cluster Bootstrap

After all control plane nodes are configured with static IPs:

```bash
# Bootstrap the cluster using any control node
talosctl bootstrap --nodes 192.168.8.31 --endpoint 192.168.8.31

# Get kubeconfig
talosctl kubeconfig

# Verify cluster is ready
kubectl get nodes
```
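Once the bootstrap call returns, it can take a minute or two for etcd and the API server to settle. A quick check that the control plane formed correctly, using the example IPs above:

```bash
# List etcd members once the control plane has formed
talosctl -n 192.168.8.31 etcd members
```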
## k3s cluster node setup

K3s provides a fully-compliant Kubernetes distribution in a small footprint.

To set up control nodes:

```bash
# Install K3s without the default load balancer (we'll use MetalLB)
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode=644 --disable servicelb --disable metallb

# Set up kubectl configuration
mkdir -p ~/.kube
sudo cat /etc/rancher/k3s/k3s.yaml > ~/.kube/config
chmod 600 ~/.kube/config
```

Set up the infrastructure services after these are running, then you can add more worker nodes with:

```bash
# On your master node, get the node token
NODE_TOKEN=$(sudo cat /var/lib/rancher/k3s/server/node-token)
MASTER_IP=192.168.8.222

# On each new node, join the cluster
curl -sfL https://get.k3s.io | K3S_URL=https://$MASTER_IP:6443 K3S_TOKEN=$NODE_TOKEN sh -
```

## Manual Talos cluster node setup

This is the original, fully manual method for setting up Talos cluster nodes. Copy this entire directory to your personal cloud folder and modify it as necessary as you install. We suggest putting it in `cluster/bootstrap`.

```bash
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Install talosctl
curl -sL https://talos.dev/install | sh

# Reserve these addresses in your LAN router (which is also your DHCP server).
CLUSTER_NAME=test-cluster
VIP=192.168.8.20  # Must be outside the DHCP range

# Boot your nodes with the ISO and put their IP addresses here. Pin them in DHCP.
# Nodes must all be on the same switch.
# TODO: How to set these static on boot?
CONTROL_NODE_1=192.168.8.21
CONTROL_NODE_2=192.168.8.22
CONTROL_NODE_3=192.168.8.23

# Generate cluster config files (including PKI and tokens)
cd generated
talosctl gen secrets -o secrets.yaml
talosctl gen config --with-secrets secrets.yaml $CLUSTER_NAME https://$VIP:6443
talosctl config merge ./talosconfig
cd ..

# If the disk you want to install Talos on isn't /dev/sda, update the disk in
# patch/controlplane.yaml and patch/worker.yaml. If a previous install attempt
# failed because /dev/sda could not be found, list the disks available on the
# node with:
#
#   talosctl -n <node-ip> get disks --insecure
#
# See https://www.talos.dev/v1.10/talos-guides/configuration/patching/
talosctl machineconfig patch generated/controlplane.yaml --patch @patch/controlplane.yaml -o final/controlplane.yaml
talosctl machineconfig patch generated/worker.yaml --patch @patch/worker.yaml -o final/worker.yaml

# Apply the control plane config
talosctl apply-config --insecure -n $CONTROL_NODE_1,$CONTROL_NODE_2,$CONTROL_NODE_3 --file final/controlplane.yaml

# Bootstrap the cluster on a control plane node
talosctl bootstrap -n $CONTROL_NODE_1

# Merge the new cluster information into your kubeconfig
talosctl kubeconfig

# You are now ready to use both `talosctl` and `kubectl` against your new cluster.
kubectl get nodes
```
## Complete Example
Here's a complete example of setting up a 3-node control plane:
```bash
# CLUSTER INITIALIZATION (once per cluster)
./init-cluster.sh
# NODE 1
# Boot node with Talos ISO, it gets DHCP IP 192.168.8.168
./detect-node-hardware.sh 192.168.8.168 1
./generate-machine-configs.sh
talosctl apply-config --insecure -n 192.168.8.168 --file final/controlplane-node-1.yaml
# Node reboots and comes up at 192.168.8.31
# NODE 2
# Boot second node with Talos ISO, it gets DHCP IP 192.168.8.169
./detect-node-hardware.sh 192.168.8.169 2
./generate-machine-configs.sh
talosctl apply-config --insecure -n 192.168.8.169 --file final/controlplane-node-2.yaml
# Node reboots and comes up at 192.168.8.32
# NODE 3
# Boot third node with Talos ISO, it gets DHCP IP 192.168.8.170
./detect-node-hardware.sh 192.168.8.170 3
./generate-machine-configs.sh
talosctl apply-config --insecure -n 192.168.8.170 --file final/controlplane-node-3.yaml
# Node reboots and comes up at 192.168.8.33
# CLUSTER BOOTSTRAP
talosctl bootstrap -n 192.168.8.31
talosctl kubeconfig
kubectl get nodes
```
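Once `kubectl get nodes` shows all three control plane nodes, Talos' built-in health checks give a fuller readiness report. A minimal check against the first node (a standard talosctl command; it probes the rest of the cluster from there):

```bash
# Run the Talos cluster health checks from the first control plane node
talosctl -n 192.168.8.31 health
```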
## Configuration Details
### Per-Node Configuration
Each control plane node has its own configuration block in `config.yaml`:
```yaml
cluster:
  nodes:
    control:
      vip: 192.168.8.30
      node1:
        ip: 192.168.8.31
        interface: enp4s0  # Discovered automatically
        disk: /dev/sdb     # Selected during hardware detection
      node2:
        ip: 192.168.8.32
        # interface and disk added after hardware detection
      node3:
        ip: 192.168.8.33
        # interface and disk added after hardware detection
```
Worker nodes use DHCP by default. You can use the same hardware detection process for worker nodes if static IPs are needed.
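For a worker that keeps DHCP, the flow is shorter: boot it with the Talos ISO and apply the compiled worker config directly. A minimal sketch, assuming the worker boots at the hypothetical DHCP address `192.168.8.180` and that the compiled config lands at `final/worker.yaml`:

```bash
# Apply the worker machine config to a node booted in maintenance mode
talosctl apply-config --insecure -n 192.168.8.180 --file final/worker.yaml
```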
## Talosconfig Management
### Context Naming and Conflicts
When running `talosctl config merge ./generated/talosconfig`, if a context with the same name already exists, talosctl will create an enumerated version (e.g., `demo-cluster-2`).
**For a clean setup:**
- Delete existing contexts before merging: `talosctl config contexts` then `talosctl config context <name> --remove`
- Or use `--force` to overwrite: `talosctl config merge ./generated/talosconfig --force`
**Recommended approach for new clusters:**
```bash
# Remove old context if rebuilding cluster
talosctl config context demo-cluster --remove || true
# Merge new configuration
talosctl config merge ./generated/talosconfig
talosctl config endpoint 192.168.8.30
talosctl config node 192.168.8.31 # Add nodes as they are registered
```
### Context Configuration Timeline
1. **After first node hardware detection**: Merge talosconfig and set endpoint/first node
2. **After additional nodes**: Add them to the existing context with `talosctl config node <ip1> <ip2> <ip3>`
3. **Before cluster bootstrap**: Ensure all control plane nodes are in the node list
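Using the example addresses from this README, that timeline maps onto commands like these (a sketch, not a script):

```bash
# 1. After the first node is registered
talosctl config merge ./generated/talosconfig
talosctl config endpoint 192.168.8.30
talosctl config node 192.168.8.31

# 2. After additional nodes are registered, set the full node list
talosctl config node 192.168.8.31 192.168.8.32 192.168.8.33

# 3. Before bootstrap, confirm the context looks right
talosctl config contexts
```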
### System Extensions
All nodes include:
- `siderolabs/iscsi-tools`: Required for Longhorn storage
- `siderolabs/util-linux-tools`: Utility tools for storage operations
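After a node has installed and rebooted, you can confirm the extensions actually made it into the image (a standard talosctl resource query; the exact output format depends on the Talos version):

```bash
# Should list iscsi-tools and util-linux-tools among the extension statuses
talosctl -n 192.168.8.31 get extensions
```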
### Hardware Detection
The `detect-node-hardware.sh` script automatically discovers:
- **Network interfaces**: Finds active ethernet interfaces (no more hardcoded `eth0`)
- **Installation disks**: Lists available disks >10GB for interactive selection
- **Per-node settings**: Updates `config.yaml` with hardware-specific configuration
This eliminates the need to manually configure hardware settings and handles different hardware configurations across nodes.
### Template Structure
Configuration templates are stored in `patch.templates/` and use gomplate syntax:
- `controlplane-node-1.yaml`: Template for first control plane node
- `controlplane-node-2.yaml`: Template for second control plane node
- `controlplane-node-3.yaml`: Template for third control plane node
- `worker.yaml`: Template for worker nodes
Templates use per-node variables from `config.yaml`:
- `{{ .cluster.nodes.control.node1.ip }}`
- `{{ .cluster.nodes.control.node1.interface }}`
- `{{ .cluster.nodes.control.node1.disk }}`
- `{{ .cluster.nodes.control.vip }}`
The `wild-compile-template-dir` command processes all templates and outputs compiled configurations to the `patch/` directory.
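For illustration, a control plane patch template might reference these values roughly as follows. This is only a sketch of how the gomplate variables plug into Talos machine-config fields, not the actual contents of `patch.templates/controlplane-node-1.yaml`:

```yaml
# Sketch only - the real templates in patch.templates/ may differ
machine:
  install:
    disk: {{ .cluster.nodes.control.node1.disk }}
  network:
    interfaces:
      - interface: {{ .cluster.nodes.control.node1.interface }}
        dhcp: false
        addresses:
          - {{ .cluster.nodes.control.node1.ip }}/24
        vip:
          ip: {{ .cluster.nodes.control.vip }}
```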
## Troubleshooting
### Hardware Detection Issues
```bash
# Check if node is accessible in maintenance mode
talosctl -n <NODE_IP> version --insecure
# View available network interfaces
talosctl -n <NODE_IP> get links --insecure
# View available disks
talosctl -n <NODE_IP> get disks --insecure
```
### Manual Hardware Discovery
If the automatic detection fails, you can manually inspect hardware:
```bash
# Find active ethernet interfaces
talosctl -n <NODE_IP> get links --insecure -o json | jq -s '.[] | select(.spec.operationalState == "up" and .spec.type == "ether" and .metadata.id != "lo") | .metadata.id'
# Find suitable installation disks
talosctl -n <NODE_IP> get disks --insecure -o json | jq -s '.[] | select(.spec.size > 10000000000) | .metadata.id'
```
### Node Status
```bash
# View machine configuration (only works after config is applied)
talosctl -n <NODE_IP> get machineconfig
```