Paul Payne 3f97dce86a docs: Update all guides to reflect current CLI, API, and web app

Rewrote backup/restore guides to document current system (native
pg_dump/Longhorn/tar.gz tools, blue-green restore, scheduling) and
remove outdated restic references. Rewrote monitoring guide to replace
K3s/Helm/Velero placeholders with actual capabilities. Filled in all
four upgrade guides (Talos, Kubernetes, applications, Wild Cloud) that
were previously TBD stubs. Expanded troubleshooting guides with correct
namespaces, Wild Cloud CLI commands, and Talos-specific diagnostics.
Added verification commands to cluster networking health checklist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-24 21:54:11 +00:00

Disaster Recovery

This guide covers recovering a Wild Cloud cluster after catastrophic failure — hardware death, corrupted storage, or any scenario where you need to rebuild from scratch.

What You Need

To rebuild a cluster you need two things:

Cluster config backup — The tar.gz archive from Wild Cloud's cluster config backup feature, containing kubeconfig, talosconfig, config.yaml, secrets.yaml, and Talos node configs.
App backups — The per-app backup archives (database dumps, PVC snapshots, config files) stored at your backup destination (S3, NFS, or local).

If your instance data directory was a git repository (recommended), you also have the full history of compiled manifests and config.yaml in git. The git repo alone is enough to redeploy apps — but without secrets.yaml and kubeconfig, you can't authenticate to the cluster or decrypt app secrets.

Recovery Scenarios

Scenario 1: Wild Central Device Failure (Cluster Intact)

The Raspberry Pi or server running Wild Central died, but the Kubernetes cluster nodes are still running.

Steps:

Set up a new Wild Central device:

sudo dpkg -i wild-cloud-central_*.deb
sudo systemctl enable wild-cloud-central

Restore your data directory from git (for manifests and config) plus your cluster config backup (for secrets and credentials):

# Clone instance data from git
git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

# Extract cluster config backup over the top
# This restores kubeconfig, secrets.yaml, talosconfig, etc.
tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/

Start Wild Central:

sudo systemctl start wild-cloud-central

Verify connectivity:

wild instance use your-instance
wild cluster status

The cluster is still running — your apps are live. Wild Central is just the management plane.

Scenario 2: Single Node Failure (Cluster Degraded)

One or more nodes died, but the cluster still has quorum: a majority of control plane nodes is still running (for example, 2 of 3), or only worker nodes were lost.
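
Quorum here means a strict majority of etcd members, which is how etcd's Raft consensus decides whether the control plane can keep accepting writes. As a rough sketch of the arithmetic (the function names are illustrative and not part of Wild Cloud or Talos):

```shell
# etcd needs a strict majority of members: floor(n / 2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when the surviving members still form a majority.
# Usage: has_quorum <surviving-members> <total-members>
has_quorum() {
  [ "$1" -ge "$(etcd_quorum "$2")" ]
}

has_quorum 2 3 && echo "2 of 3: quorum held"
has_quorum 1 3 || echo "1 of 3: quorum lost"
```

With three control plane nodes the cluster tolerates one failure; losing two means this scenario no longer applies and Scenario 3 is the likely path.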

Steps:

Check cluster health from Wild Central:

talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  health --nodes <surviving-node-ip>

Remove the dead node from the cluster:

# Remove from Kubernetes
kubectl --kubeconfig /var/lib/wild-central/instances/your-instance/kubeconfig \
  delete node <dead-node-name>

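# Remove from etcd (if control plane node)
# Depending on the Talos version, remove-member expects the member's hostname or its member ID.
# List the current members with `talosctl etcd members` to find the right value.
talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  etcd remove-member <dead-node-name> --nodes <surviving-node-ip>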

PXE boot a replacement node using Wild Central's PXE service, or manually install Talos Linux on the new hardware.
Add the new node through the Wild Cloud web UI or CLI:
```
wild node add --role worker --ip <new-node-ip>
```

Verify workloads reschedule to the new node:

kubectl get pods --all-namespaces -o wide
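
To find the node name to pass to `kubectl delete node` in the removal step above, one option is to filter the node list for entries whose STATUS column starts with `NotReady`. This is a minimal sketch: `not_ready_nodes` is a hypothetical helper, and the node names and versions in the sample are made up for illustration.

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# names of nodes whose STATUS column (second field) starts with NotReady.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Illustrative sample in the same column layout (NAME STATUS ROLES AGE VERSION).
printf '%s\n' \
  'cp-1      Ready      control-plane   40d   v1.30.0' \
  'cp-2      NotReady   control-plane   40d   v1.30.0' \
  'worker-1  Ready      <none>          12d   v1.30.0' \
  | not_ready_nodes
```

Against a real cluster, the input would come from `kubectl --kubeconfig <path> get nodes --no-headers` piped into `not_ready_nodes`. A node can be briefly NotReady during a reboot, so treat the output as a list of candidates to confirm by hand.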

Scenario 3: Total Cluster Loss (Rebuild from Scratch)

All nodes are gone. You need to rebuild everything.

Prerequisites:

New hardware (or repaired existing hardware) with network boot capability or Talos Linux installed
Your cluster config backup (tar.gz with kubeconfig, talosconfig, secrets.yaml, Talos configs)
Access to your backup destination (S3 bucket, NFS share, etc.)
Your instance data git repo (if available — contains compiled manifests)

Steps:

Set up Wild Central on a fresh device:
```
sudo dpkg -i wild-cloud-central_*.deb
```

Restore your data directory:

# If you have a git repo:
git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

# Extract cluster config over the top:
tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/

If you don't have a git repo, just extract the cluster config backup into a fresh instance directory. You'll re-add apps from the Wild Directory.

Bootstrap new Talos nodes using the restored Talos configs:

# Apply control plane config to the first node
talosctl apply-config \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <node-ip> \
  --file /var/lib/wild-central/instances/your-instance/talos/generated/controlplane.yaml \
  --insecure

The restored controlplane.yaml and worker.yaml contain your cluster's identity (cluster name, secrets, certificates). Using them ensures the new cluster has the same identity as the old one.

Bootstrap the cluster:

talosctl bootstrap \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <first-control-plane-ip>

Wait for the cluster to be healthy:

talosctl health \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <first-control-plane-ip>

Update kubeconfig (the new cluster may issue a fresh kubeconfig):

talosctl kubeconfig \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <first-control-plane-ip> \
  /var/lib/wild-central/instances/your-instance/kubeconfig

Deploy infrastructure services first (order matters):

wild instance use your-instance
wild service install metallb
wild service install traefik
wild service install cert-manager
wild service install external-dns
wild service install longhorn    # If using Longhorn for PVCs

Deploy apps (dependencies first, then apps):
```
# Deploy database services first
wild app deploy pg
wild app deploy redis

# Then deploy apps
wild app deploy gitea
wild app deploy immich
# ... etc
```
If your git repo has compiled manifests, these deploys apply the exact same manifests that were running before. If not, you'll need to re-add apps from the Wild Directory first:
```
wild app add gitea
wild app deploy gitea
```
Restore app data from backups:
```
# Restore each app's data (database + PVC) from the backup destination
# Use the Web UI: navigate to Backups > [app] > Restore
# Or via CLI:
wild restore gitea --auto
wild restore immich --auto
```
The --auto flag runs the full blue-green restore cycle: restore to standby, switch traffic, then clean up the old namespace. For more control, run each phase separately — see Restoring Backups.

Verify everything is working:

wild app status gitea
wild app status immich
kubectl get pods --all-namespaces
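
The deploy, restore, and verify steps above depend on running in order, so it may be convenient to wrap them in a small function that stops at the first failure. The sketch below uses only the `wild` subcommands shown in this guide; the function name `redeploy_and_restore` and its three-argument shape are assumptions made for illustration.

```shell
# Runs the deploy, restore, and verify phases in order and stops at the
# first failure. Each argument is a space-separated list of names.
# Usage: redeploy_and_restore "<services>" "<dependencies>" "<apps>"
redeploy_and_restore() {
  services=$1 deps=$2 apps=$3

  # Infrastructure services first, in the order given.
  for svc in $services; do
    wild service install "$svc" || return 1
  done
  # Dependencies (databases, caches) go before the apps that use them.
  for app in $deps $apps; do
    wild app deploy "$app" || return 1
  done
  # Restore data for each app, then check its status.
  for app in $apps; do
    wild restore "$app" --auto || return 1
    wild app status "$app" || return 1
  done
}

# Example call, using the service and app names from the steps above:
# redeploy_and_restore "metallb traefik cert-manager external-dns longhorn" \
#   "pg redis" "gitea immich"
```

Stopping on the first error keeps a failed service install from being buried under later output. Whether a step can safely be re-run after a failure depends on the `wild` command involved, which this sketch does not try to handle.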

Cluster Config Backup

The cluster config backup feature archives the credentials and secrets needed to access the cluster, which are not tracked in git, along with a copy of config.yaml.

What Gets Backed Up

| File | Purpose |
| --- | --- |
| `kubeconfig` | Kubernetes API credentials |
| `config.yaml` | Full instance configuration |
| `secrets.yaml` | App secrets (database passwords, API keys) |
| `talos/generated/talosconfig` | Talos API credentials |
| `talos/generated/controlplane.yaml` | Control plane node config |
| `talos/generated/worker.yaml` | Worker node config |
| `talos/generated/secrets.yaml` | Talos bootstrap secrets (cluster identity) |
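
Before relying on an archive, it can be worth checking that it lists every file in the table. The sketch below assumes the archive stores paths relative to the instance directory, which matches the table and the `tar -xzf ... -C` extraction used in the recovery scenarios but is not confirmed here. `check_cluster_backup` is a hypothetical helper name.

```shell
# Lists the archive and reports any expected file that is absent.
# Returns 0 if all are present, 1 if any is missing, 2 if tar cannot read it.
check_cluster_backup() {
  listing=$(tar -tzf "$1") || return 2
  missing=0
  for f in kubeconfig config.yaml secrets.yaml \
           talos/generated/talosconfig \
           talos/generated/controlplane.yaml \
           talos/generated/worker.yaml \
           talos/generated/secrets.yaml; do
    # Accept entries stored with or without a leading "./".
    if ! printf '%s\n' "$listing" | sed 's|^\./||' | grep -xF -- "$f" >/dev/null; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_cluster_backup cluster-config-backup.tar.gz` only confirms that the entries exist. It does not prove that the credentials inside are still valid, so a periodic test restore remains the real check.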

Creating Cluster Config Backups

Web UI: Navigate to Backups, click "Backup" on the "Cluster Config" row.

CLI (calling the API with curl):

# Via API
curl -X POST http://localhost:5055/api/v1/instances/your-instance/backup/cluster

Scheduled: Create a backup schedule with target type "cluster" to automatically back up cluster config on a recurring basis. See Making Backups for scheduling details.
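
When the API call is run from a script, it helps if HTTP errors surface as a non-zero exit code. The sketch below wraps the same endpoint in a function; `trigger_cluster_backup` is a hypothetical name, the default base URL is taken from the example above, and the response body is not parsed because its format is not described in this guide.

```shell
# POSTs to the cluster backup endpoint shown above.
# -f makes curl exit non-zero on HTTP errors (4xx/5xx); -sS hides the
# progress meter but still prints error messages.
# Usage: trigger_cluster_backup <instance> [base-url]
trigger_cluster_backup() {
  instance=$1
  base_url=${2:-http://localhost:5055}
  curl -fsS -X POST "$base_url/api/v1/instances/$instance/backup/cluster"
}
```

For example, `trigger_cluster_backup your-instance || echo "backup request failed" >&2` reports a failed request. This does not replace the built-in schedules; it is only a way to trigger a one-off backup from a script.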

Downloading a Cluster Config Backup

Cluster config backups are stored at your configured backup destination under the key cluster-config/{instance}/{timestamp}.tar.gz. To retrieve one:

S3/Azure: Download from the bucket/container using your cloud provider's CLI
NFS: Navigate to the NFS mount point and find the archive
Local: Find it at {data-dir}/instances/{instance}/backups/cluster-config/...

Store a copy of the latest cluster config backup in a secure offsite location (encrypted USB drive, password manager, separate cloud storage). If your primary backup destination is on the cluster itself, a total cluster loss takes the backups with it.
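
For a local or NFS destination, copying the newest archive to a second location could look like the sketch below. It assumes the `{timestamp}` part of the file name sorts lexicographically (for example, a year-first format), which is not confirmed here, and the helper names are hypothetical. The archive contains secrets, so the target should be an encrypted volume.

```shell
# Prints the newest *.tar.gz in a directory, assuming the timestamp in
# the file name sorts lexicographically (an assumption about the format).
latest_backup() {
  latest=
  for f in "$1"/*.tar.gz; do
    # Globs expand in sorted order, so the last existing match wins.
    [ -e "$f" ] && latest=$f
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Copies the newest archive to a second location and verifies the copy
# byte-for-byte with cmp.
# Usage: copy_latest_backup <backup-dir> <target-dir>
copy_latest_backup() {
  src=$(latest_backup "$1") || return 1
  cp "$src" "$2/" && cmp -s "$src" "$2/$(basename "$src")"
}
```

`copy_latest_backup` returns non-zero if no archive is found or the copy differs from the source, so it can be chained with a notification command in a scheduled job.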

Prevention Checklist

Cluster config backups are scheduled and running
App backups are scheduled for all critical apps
Backup destination is offsite or on separate infrastructure from the cluster
Instance data directory is pushed to a git remote (excludes secrets.yaml)
Cluster config backup archive is stored in a second location (not just on the cluster)
Test a restore periodically — backups are worthless if restore doesn't work

Related Guides

Making Backups — Setting up backup destinations and schedules
Restoring Backups — Blue-green restore process in detail
Upgrade Talos — Talos node upgrade and rollback
Troubleshoot Cluster — Diagnosing cluster issues after recovery
