wild-cloud/docs/guides/disaster-recovery.md
Paul Payne 3f97dce86a docs: Update all guides to reflect current CLI, API, and web app
Rewrote backup/restore guides to document current system (native
pg_dump/Longhorn/tar.gz tools, blue-green restore, scheduling) and
remove outdated restic references. Rewrote monitoring guide to replace
K3s/Helm/Velero placeholders with actual capabilities. Filled in all
four upgrade guides (Talos, Kubernetes, applications, Wild Cloud) that
were previously TBD stubs. Expanded troubleshooting guides with correct
namespaces, Wild Cloud CLI commands, and Talos-specific diagnostics.
Added verification commands to cluster networking health checklist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-24 21:54:11 +00:00

Disaster Recovery

This guide covers recovering a Wild Cloud cluster after catastrophic failure — hardware death, corrupted storage, or any scenario where you need to rebuild from scratch.

What You Need

To rebuild a cluster you need two things:

  1. Cluster config backup — The tar.gz archive from Wild Cloud's cluster config backup feature, containing kubeconfig, talosconfig, config.yaml, secrets.yaml, and Talos node configs.
  2. App backups — The per-app backup archives (database dumps, PVC snapshots, config files) stored at your backup destination (S3, NFS, or local).

If your instance data directory was a git repository (recommended), you also have the full history of compiled manifests and config.yaml in git. The git repo alone is enough to redeploy apps — but without secrets.yaml and kubeconfig, you can't authenticate to the cluster or decrypt app secrets.
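The split between git and the cluster config backup only protects you if secrets.yaml really is absent from the repository. A minimal sketch for checking that, assuming the data directory is a git working tree; `is_tracked` is an illustrative helper, not part of Wild Cloud:

```shell
# is_tracked REPO_DIR PATH
# Succeeds (exit 0) if PATH is tracked by git in REPO_DIR, fails otherwise.
# --error-unmatch makes `git ls-files` exit non-zero for untracked paths.
is_tracked() {
  git -C "$1" ls-files --error-unmatch -- "$2" >/dev/null 2>&1
}

# Example (paths follow the layout used in this guide):
#   if is_tracked /var/lib/wild-central instances/your-instance/secrets.yaml; then
#     echo "WARNING: secrets.yaml is committed to git" >&2
#   fi
```

If the check reports that secrets.yaml is tracked, treat the repository itself as sensitive and store it accordingly.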

Recovery Scenarios

Scenario 1: Wild Central Device Failure (Cluster Intact)

The Raspberry Pi or server running Wild Central died, but the Kubernetes cluster nodes are still running.

Steps:

  1. Set up a new Wild Central device:

    sudo dpkg -i wild-cloud-central_*.deb
    sudo systemctl enable wild-cloud-central
    
  2. Restore your data directory from git (for manifests and config) plus your cluster config backup (for secrets and credentials):

    # Clone instance data from git
    git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central
    
    # Extract cluster config backup over the top
    # This restores kubeconfig, secrets.yaml, talosconfig, etc.
    tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/
    
  3. Start Wild Central:

    sudo systemctl start wild-cloud-central
    
  4. Verify connectivity:

    wild instance use your-instance
    wild cluster status
    

The cluster is still running — your apps are live. Wild Central is just the management plane.
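Before extracting the archive over the cloned repository in step 2, it can help to confirm that it actually holds the credentials you expect. A small sketch using only tar; `archive_contains` is a hypothetical helper, and the assumption that archive members are stored relative to the instance directory (with or without a leading `./`) is inferred from the extraction path above rather than confirmed:

```shell
# archive_contains ARCHIVE MEMBER
# Succeeds if the gzip-compressed tar ARCHIVE lists MEMBER exactly.
# A leading "./" on member names is stripped before comparing.
archive_contains() {
  tar -tzf "$1" | sed 's|^\./||' | grep -Fxq -- "$2"
}

# Example:
#   for f in kubeconfig secrets.yaml talos/generated/talosconfig; do
#     archive_contains cluster-config-backup.tar.gz "$f" || echo "missing: $f" >&2
#   done
```

Listing with `tar -tzf` does not write anything to disk, so it is safe to run against the only copy of a backup.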

Scenario 2: Single Node Failure (Cluster Degraded)

One or more nodes died, but the cluster still has quorum: a majority of control plane nodes is still running (for example, 2 of 3), or only replaceable worker nodes were lost.
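The quorum condition is simple arithmetic: etcd needs a strict majority of its control plane members, that is floor(n/2) + 1 out of n. A sketch of that check in shell (the function names are illustrative):

```shell
# quorum_size TOTAL
# Prints the number of members needed for a majority of TOTAL.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL ALIVE
# Succeeds if ALIVE members are enough for a majority of TOTAL.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example:
#   has_quorum 3 2 && echo "quorum intact, follow this scenario"
```

With three control plane nodes, losing one still leaves a majority; losing two does not, and the steps in this scenario no longer apply.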

Steps:

  1. Check cluster health from Wild Central:

    talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
      health --nodes <surviving-node-ip>
    
  2. Remove the dead node from the cluster:

    # Remove from Kubernetes
    kubectl --kubeconfig /var/lib/wild-central/instances/your-instance/kubeconfig \
      delete node <dead-node-name>
    
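    # Remove from etcd (if control plane node).
    # Note: recent Talos releases expect the member ID reported by
    # `talosctl etcd members` here instead of the node name.
    talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
      etcd remove-member <dead-node-name> --nodes <surviving-node-ip>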
    
  3. PXE boot a replacement node using Wild Central's PXE service, or manually install Talos Linux on the new hardware.

  4. Add the new node through the Wild Cloud web UI or CLI:

    wild node add --role worker --ip <new-node-ip>
    
  5. Verify workloads reschedule to the new node:

    kubectl get pods --all-namespaces -o wide
    

Scenario 3: Total Cluster Loss (Rebuild from Scratch)

All nodes are gone. You need to rebuild everything.

Prerequisites:

  • New hardware (or repaired existing hardware) with network boot capability or Talos Linux installed
  • Your cluster config backup (tar.gz with kubeconfig, talosconfig, secrets.yaml, Talos configs)
  • Access to your backup destination (S3 bucket, NFS share, etc.)
  • Your instance data git repo (if available — contains compiled manifests)

Steps:

  1. Set up Wild Central on a fresh device:

    sudo dpkg -i wild-cloud-central_*.deb
    
  2. Restore your data directory:

    # If you have a git repo:
    git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central
    
    # Extract cluster config over the top:
    tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/
    

    If you don't have a git repo, just extract the cluster config backup into a fresh instance directory. You'll re-add apps from the Wild Directory.

  3. Bootstrap new Talos nodes using the restored Talos configs:

    # Apply control plane config to the first node
    talosctl apply-config \
      --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
      --nodes <node-ip> \
      --file /var/lib/wild-central/instances/your-instance/talos/generated/controlplane.yaml \
      --insecure
    

    The restored controlplane.yaml and worker.yaml contain your cluster's identity (cluster name, secrets, certificates), so applying them gives the new cluster the same identity as the old one. Repeat the command for each remaining node, using worker.yaml for worker nodes.

  4. Bootstrap the cluster:

    talosctl bootstrap \
      --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
      --nodes <first-control-plane-ip>
    
  5. Wait for the cluster to be healthy:

    talosctl health \
      --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
      --nodes <first-control-plane-ip>
    
  6. Update kubeconfig (the new cluster may issue a fresh kubeconfig):

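    # If the restored kubeconfig already exists at this path, talosctl merges
    # into it by default; add --force to overwrite it instead.
    talosctl kubeconfig \
      --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
      --nodes <first-control-plane-ip> \
      /var/lib/wild-central/instances/your-instance/kubeconfig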
    
  7. Deploy infrastructure services first (order matters):

    wild instance use your-instance
    wild service install metallb
    wild service install traefik
    wild service install cert-manager
    wild service install external-dns
    wild service install longhorn    # If using Longhorn for PVCs
    
  8. Deploy apps (dependencies first, then apps):

    # Deploy database services first
    wild app deploy pg
    wild app deploy redis
    
    # Then deploy apps
    wild app deploy gitea
    wild app deploy immich
    # ... etc
    

    If your git repo has compiled manifests, these deploys apply the exact same manifests that were running before. If not, you'll need to re-add apps from the Wild Directory first:

    wild app add gitea
    wild app deploy gitea
    
  9. Restore app data from backups:

    # Restore each app's data (database + PVC) from the backup destination
    # Use the Web UI: navigate to Backups > [app] > Restore
    # Or via CLI:
    wild restore gitea --auto
    wild restore immich --auto
    

    The --auto flag runs the full blue-green restore cycle: restore to standby, switch traffic, then clean up the old namespace. For more control, run each phase separately — see Restoring Backups.

  10. Verify everything is working:

    wild app status gitea
    wild app status immich
    kubectl get pods --all-namespaces
    

Cluster Config Backup

The cluster config backup feature archives the credentials and secrets needed to access the cluster. These files are NOT tracked in git; the archive also carries a copy of config.yaml, which is.

What Gets Backed Up

File                                 Purpose
-----------------------------------  ------------------------------------------
kubeconfig                           Kubernetes API credentials
config.yaml                          Full instance configuration
secrets.yaml                         App secrets (database passwords, API keys)
talos/generated/talosconfig          Talos API credentials
talos/generated/controlplane.yaml    Control plane node config
talos/generated/worker.yaml          Worker node config
talos/generated/secrets.yaml         Talos bootstrap secrets (cluster identity)
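After extracting a backup, the list above can double as a checklist. A minimal sketch that reports which of these files are absent or empty in an instance directory; the file list is copied from this guide, while `missing_files` is an illustrative helper:

```shell
# Files covered by the cluster config backup, relative to the instance directory.
EXPECTED_FILES="kubeconfig
config.yaml
secrets.yaml
talos/generated/talosconfig
talos/generated/controlplane.yaml
talos/generated/worker.yaml
talos/generated/secrets.yaml"

# missing_files INSTANCE_DIR
# Prints one line for each expected file that is absent or empty.
missing_files() {
  for mf_path in $EXPECTED_FILES; do
    [ -s "$1/$mf_path" ] || echo "$mf_path"
  done
}

# Example:
#   missing_files /var/lib/wild-central/instances/your-instance
```

Empty output means every listed file is present and non-empty; it does not prove the credentials inside are still valid.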

Creating Cluster Config Backups

Web UI: Navigate to Backups, click "Backup" on the "Cluster Config" row.

CLI (via the API):

    curl -X POST http://localhost:5055/api/v1/instances/your-instance/backup/cluster

Scheduled: Create a backup schedule with target type "cluster" to automatically back up cluster config on a recurring basis. See Making Backups for scheduling details.
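When the API call is made from a script, it is worth making failures visible. The sketch below wraps the endpoint shown above; the helper names are hypothetical, and it assumes (without confirmation from the API documentation) that the server answers a failed request with an HTTP error status:

```shell
# cluster_backup_url BASE_URL INSTANCE
# Builds the cluster config backup endpoint; a trailing "/" on BASE_URL is dropped.
cluster_backup_url() {
  printf '%s/api/v1/instances/%s/backup/cluster\n' "${1%/}" "$2"
}

# trigger_cluster_backup BASE_URL INSTANCE
# --fail makes curl exit non-zero on an HTTP error status, so the caller
# can react; --silent --show-error hides progress output but keeps errors.
trigger_cluster_backup() {
  curl --fail --silent --show-error -X POST "$(cluster_backup_url "$1" "$2")"
}

# Example:
#   trigger_cluster_backup http://localhost:5055 your-instance || echo "backup request failed" >&2
```

A zero exit code only means the request was accepted; confirm that the archive appeared at the backup destination.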

Downloading a Cluster Config Backup

Cluster config backups are stored at your configured backup destination under the key cluster-config/{instance}/{timestamp}.tar.gz. To retrieve one:

  • S3/Azure: Download from the bucket/container using your cloud provider's CLI
  • NFS: Navigate to the NFS mount point and find the archive
  • Local: Find it at {data-dir}/instances/{instance}/backups/cluster-config/...

Store a copy of the latest cluster config backup in a secure offsite location (encrypted USB drive, password manager, separate cloud storage). If your primary backup destination is on the cluster itself, a total cluster loss takes the backups with it.
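For a local or NFS destination, picking the newest archive to copy offsite can be scripted. This sketch assumes the {timestamp} part of the file name sorts chronologically when sorted as text (true for ISO-style timestamps, but not confirmed for Wild Cloud); `latest_backup` is an illustrative helper and the directory is whatever path holds your archives:

```shell
# latest_backup DIR
# Prints the *.tar.gz in DIR whose name sorts last; fails if there is none.
# Glob results are sorted by name, so the last match has the highest name.
latest_backup() {
  lb_latest=""
  for lb_file in "$1"/*.tar.gz; do
    [ -e "$lb_file" ] || continue
    lb_latest=$lb_file
  done
  [ -n "$lb_latest" ] && printf '%s\n' "$lb_latest"
}

# Example:
#   cp "$(latest_backup /path/to/cluster-config-archives)" /mnt/encrypted-usb/
```

If the timestamps do not sort as text, select by modification time instead.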

Prevention Checklist

  • Cluster config backups are scheduled and running
  • App backups are scheduled for all critical apps
  • Backup destination is offsite or on separate infrastructure from the cluster
  • Instance data directory is pushed to a git remote (excludes secrets.yaml)
  • Cluster config backup archive is stored in a second location (not just on the cluster)
  • Test a restore periodically — backups are worthless if restore doesn't work
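The first checklist item can be checked mechanically when the destination is a local or NFS path. A minimal sketch that tests whether any archive was modified within the last N days; `has_recent_backup` is a hypothetical helper, and the directory to pass depends on your destination:

```shell
# has_recent_backup DIR DAYS
# Succeeds if DIR (searched recursively) holds at least one *.tar.gz
# modified less than DAYS days ago.
has_recent_backup() {
  find "$1" -type f -name '*.tar.gz' -mtime "-$2" 2>/dev/null | grep -q .
}

# Example:
#   has_recent_backup /path/to/cluster-config-archives 7 || echo "no backup in the last 7 days" >&2
```

This only shows that backups are being written; it is not a substitute for the periodic restore test.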