Disaster Recovery
This guide covers recovering a Wild Cloud cluster after catastrophic failure — hardware death, corrupted storage, or any scenario where you need to rebuild from scratch.
What You Need
To rebuild a cluster you need two things:
- Cluster config backup — The tar.gz archive from Wild Cloud's cluster config backup feature, containing kubeconfig, talosconfig, config.yaml, secrets.yaml, and Talos node configs.
- App backups — The per-app backup archives (database dumps, PVC snapshots, config files) stored at your backup destination (S3, NFS, or local).
If your instance data directory was a git repository (recommended), you also have the full history of compiled manifests and config.yaml in git. The git repo alone is enough to redeploy apps — but without secrets.yaml and kubeconfig, you can't authenticate to the cluster or decrypt app secrets.
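Before relying on an archive during recovery, it can help to confirm that it actually contains the credentials listed above. The following is a minimal sketch, not part of Wild Cloud: the function name `check_cluster_backup` is hypothetical, and it assumes the archive stores files relative to the instance directory, using the paths from the "What Gets Backed Up" table in this guide.

```shell
#!/usr/bin/env bash
# Sketch: confirm a cluster config backup archive contains the files
# needed for recovery. Assumption: entries are stored relative to the
# instance directory (for example "kubeconfig" or "./kubeconfig").

check_cluster_backup() {
  local archive="$1"
  local listing f missing=0

  # List archive contents without extracting; fail early if unreadable.
  listing="$(tar -tzf "$archive")" || return 2

  local required=(
    kubeconfig
    config.yaml
    secrets.yaml
    talos/generated/talosconfig
  )

  for f in "${required[@]}"; do
    # Strip an optional leading "./" and require an exact path match.
    if ! printf '%s\n' "$listing" | sed 's|^\./||' | grep -Fxq "$f"; then
      echo "missing: $f" >&2
      missing=1
    fi
  done

  return "$missing"
}
```

Calling `check_cluster_backup cluster-config-backup.tar.gz` returns 0 when all four files are present, 1 when any is missing, and 2 when the archive cannot be read.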
Recovery Scenarios
Scenario 1: Wild Central Device Failure (Cluster Intact)
The Raspberry Pi or server running Wild Central died, but the Kubernetes cluster nodes are still running.
Steps:
- Set up a new Wild Central device:

      sudo dpkg -i wild-cloud-central_*.deb
      sudo systemctl enable wild-cloud-central

- Restore your data directory from git (for manifests and config) plus your cluster config backup (for secrets and credentials):

      # Clone instance data from git
      git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

      # Extract cluster config backup over the top
      # This restores kubeconfig, secrets.yaml, talosconfig, etc.
      tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/

- Start Wild Central:

      sudo systemctl start wild-cloud-central

- Verify connectivity:

      wild instance use your-instance
      wild cluster status
The cluster is still running — your apps are live. Wild Central is just the management plane.
Scenario 2: Single Node Failure (Cluster Degraded)
One or more nodes died but the cluster still has quorum (at least 2 of 3 control plane nodes, or workers are replaceable).
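The "2 of 3" figure follows from etcd's majority rule: a cluster of N control plane members needs floor(N/2) + 1 of them healthy to keep quorum. A small sketch of that arithmetic is shown here; the function names `quorum_size` and `has_quorum` are illustrative only and are not part of any Wild Cloud or Talos tooling.

```shell
#!/usr/bin/env bash
# Sketch: majority quorum for a control plane of N etcd members.

# Print the minimum number of healthy members needed for quorum.
quorum_size() {
  local total="$1"
  echo $(( total / 2 + 1 ))
}

# Succeed (exit 0) if the healthy count still meets quorum.
has_quorum() {
  local total="$1" healthy="$2"
  [ "$healthy" -ge "$(quorum_size "$total")" ]
}
```

For example, `has_quorum 3 2` succeeds, while `has_quorum 3 1` fails. Once quorum is lost, the situation is closer to Scenario 3 than to this one.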
Steps:
- Check cluster health from Wild Central:

      talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
        health --nodes <surviving-node-ip>

- Remove the dead node from the cluster:

      # Remove from Kubernetes
      kubectl --kubeconfig /var/lib/wild-central/instances/your-instance/kubeconfig \
        delete node <dead-node-name>

      # Remove from etcd (if control plane node)
      # Note: some Talos versions expect the etcd member ID here rather than the
      # node name; `talosctl etcd members` lists the IDs.
      talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
        etcd remove-member <dead-node-name> --nodes <surviving-node-ip>

- PXE boot a replacement node using Wild Central's PXE service, or manually install Talos Linux on the new hardware.

- Add the new node through the Wild Cloud web UI or CLI:

      wild node add --role worker --ip <new-node-ip>

- Verify workloads reschedule to the new node:

      kubectl get pods --all-namespaces -o wide
Scenario 3: Total Cluster Loss (Rebuild from Scratch)
All nodes are gone. You need to rebuild everything.
Prerequisites:
- New hardware (or repaired existing hardware) with network boot capability or Talos Linux installed
- Your cluster config backup (tar.gz with kubeconfig, talosconfig, secrets.yaml, Talos configs)
- Access to your backup destination (S3 bucket, NFS share, etc.)
- Your instance data git repo (if available — contains compiled manifests)
Steps:
- Set up Wild Central on a fresh device:

      sudo dpkg -i wild-cloud-central_*.deb

- Restore your data directory:

      # If you have a git repo:
      git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

      # Extract cluster config over the top:
      tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/

  If you don't have a git repo, just extract the cluster config backup into a fresh instance directory. You'll re-add apps from the Wild Directory.

- Bootstrap new Talos nodes using the restored Talos configs:

      # Apply control plane config to the first node
      talosctl apply-config \
        --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
        --nodes <node-ip> \
        --file /var/lib/wild-central/instances/your-instance/talos/generated/controlplane.yaml \
        --insecure

  The restored controlplane.yaml and worker.yaml contain your cluster's identity (cluster name, secrets, certificates). Using them ensures the new cluster has the same identity as the old one.

- Bootstrap the cluster:

      talosctl bootstrap \
        --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
        --nodes <first-control-plane-ip>

- Wait for the cluster to be healthy:

      talosctl health \
        --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
        --nodes <first-control-plane-ip>

- Update kubeconfig (the new cluster may issue a fresh kubeconfig):

      talosctl kubeconfig \
        --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
        --nodes <first-control-plane-ip> \
        /var/lib/wild-central/instances/your-instance/kubeconfig

- Deploy infrastructure services first (order matters):

      wild instance use your-instance
      wild service install metallb
      wild service install traefik
      wild service install cert-manager
      wild service install external-dns
      wild service install longhorn  # If using Longhorn for PVCs

- Deploy apps (dependencies first, then apps):

      # Deploy database services first
      wild app deploy pg
      wild app deploy redis

      # Then deploy apps
      wild app deploy gitea
      wild app deploy immich
      # ... etc

  If your git repo has compiled manifests, these deploys apply the exact same manifests that were running before. If not, you'll need to re-add apps from the Wild Directory first:

      wild app add gitea
      wild app deploy gitea

- Restore app data from backups:

      # Restore each app's data (database + PVC) from the backup destination
      # Use the Web UI: navigate to Backups > [app] > Restore
      # Or via CLI:
      wild restore gitea --auto
      wild restore immich --auto

  The --auto flag runs the full blue-green restore cycle: restore to standby, switch traffic, then clean up the old namespace. For more control, run each phase separately (see Restoring Backups).

- Verify everything is working:

      wild app status gitea
      wild app status immich
      kubectl get pods --all-namespaces
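Because the infrastructure services must be installed in a fixed order, a rebuild script should stop at the first failure and not carry on with later services. A minimal sketch of that pattern follows. It wraps the `wild service install` command used in the steps above; the function name `install_services_in_order` and the `WILD_BIN` override are hypothetical and exist only so the loop can be dry-run against a stub.

```shell
#!/usr/bin/env bash
# Sketch: install services in the given order, stopping at the first failure.
# WILD_BIN is a hypothetical override for testing; it defaults to the `wild` CLI.

install_services_in_order() {
  local wild="${WILD_BIN:-wild}"
  local svc

  for svc in "$@"; do
    echo "installing $svc"
    if ! "$wild" service install "$svc"; then
      echo "install failed for $svc; stopping" >&2
      return 1
    fi
  done
}
```

With the order from this guide, the call would be `install_services_in_order metallb traefik cert-manager external-dns longhorn`. If traefik fails, cert-manager and the services after it are not attempted.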
Cluster Config Backup
The cluster config backup feature archives the credentials and secrets needed to access the cluster, which are not tracked in git. It also includes a copy of config.yaml.
What Gets Backed Up
| File | Purpose |
|---|---|
| kubeconfig | Kubernetes API credentials |
| config.yaml | Full instance configuration |
| secrets.yaml | App secrets (database passwords, API keys) |
| talos/generated/talosconfig | Talos API credentials |
| talos/generated/controlplane.yaml | Control plane node config |
| talos/generated/worker.yaml | Worker node config |
| talos/generated/secrets.yaml | Talos bootstrap secrets (cluster identity) |
Creating Cluster Config Backups
Web UI: Navigate to Backups, click "Backup" on the "Cluster Config" row.
CLI:
# Via API
curl -X POST http://localhost:5055/api/v1/instances/your-instance/backup/cluster
Scheduled: Create a backup schedule with target type "cluster" to automatically back up cluster config on a recurring basis. See Making Backups for scheduling details.
Downloading a Cluster Config Backup
Cluster config backups are stored at your configured backup destination under the key cluster-config/{instance}/{timestamp}.tar.gz. To retrieve one:
- S3/Azure: Download from the bucket/container using your cloud provider's CLI
- NFS: Navigate to the NFS mount point and find the archive
- Local: Find it at {data-dir}/instances/{instance}/backups/cluster-config/...
Store a copy of the latest cluster config backup in a secure offsite location (encrypted USB drive, password manager, separate cloud storage). If your primary backup destination is on the cluster itself, a total cluster loss takes the backups with it.
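For a local or NFS destination, picking out the most recent archive to copy offsite can be scripted. The sketch below assumes that the `{timestamp}` part of the file name sorts lexicographically in chronological order (for example an ISO-style timestamp). This guide does not state the timestamp format, so treat that as an assumption to verify. The function name `latest_cluster_backup` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the path of the newest cluster config archive under a directory.
# Assumption: archive names embed a timestamp that sorts chronologically
# when compared as plain text.

latest_cluster_backup() {
  local dir="$1"
  # Sort paths bytewise and keep the last one; prints nothing if none exist.
  find "$dir" -type f -name '*.tar.gz' | LC_ALL=C sort | tail -n 1
}
```

The directory argument is wherever your destination keeps the `cluster-config/{instance}/` archives. The printed path can then be passed to whatever tool you use for the offsite copy.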
Prevention Checklist
- Cluster config backups are scheduled and running
- App backups are scheduled for all critical apps
- Backup destination is offsite or on separate infrastructure from the cluster
- Instance data directory is pushed to a git remote (excludes secrets.yaml)
- Cluster config backup archive is stored in a second location (not just on the cluster)
- Test a restore periodically — backups are worthless if restore doesn't work
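The first checklist item ("scheduled and running") can be spot-checked on a local or NFS destination by confirming that at least one archive was written recently. This is a rough sketch that relies on file modification times, not on Wild Cloud's own backup records; the function name `has_recent_backup` and the age threshold are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: succeed if any .tar.gz under DIR was modified within the last N days.
# Uses file modification time only; it does not validate archive contents.

has_recent_backup() {
  local dir="$1" max_age_days="$2"
  local max_age_min=$(( max_age_days * 24 * 60 ))
  # -mmin -N matches files modified less than N minutes ago.
  [ -n "$(find "$dir" -type f -name '*.tar.gz' -mmin "-$max_age_min" | head -n 1)" ]
}
```

A cron job or monitoring check could call `has_recent_backup <backup-dir> 7` and alert when it fails. It complements, and does not replace, a periodic test restore.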
Related Guides
- Making Backups — Setting up backup destinations and schedules
- Restoring Backups — Blue-green restore process in detail
- Upgrade Talos — Talos node upgrade and rollback
- Troubleshoot Cluster — Diagnosing cluster issues after recovery