# Disaster Recovery

This guide covers recovering a Wild Cloud cluster after catastrophic failure — hardware death, corrupted storage, or any scenario where you need to rebuild from scratch.

## What You Need

To rebuild a cluster you need two things:

1. **Cluster config backup** — The tar.gz archive from Wild Cloud's cluster config backup feature, containing kubeconfig, talosconfig, config.yaml, secrets.yaml, and Talos node configs.
2. **App backups** — The per-app backup archives (database dumps, PVC snapshots, config files) stored at your backup destination (S3, NFS, or local).

If your instance data directory was a git repository (recommended), you also have the full history of compiled manifests and config.yaml in git. The git repo alone is enough to redeploy apps — but without secrets.yaml and kubeconfig, you can't authenticate to the cluster or decrypt app secrets.
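
Before starting a recovery, it can help to confirm that the command-line tools used throughout this guide are installed on the replacement device. The helper below is a minimal sketch: `require_tools` is a hypothetical name, not part of Wild Cloud, and the tool list in the example is simply the set of commands this guide invokes.

```bash
# Hypothetical helper: report any required command that is not on PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools git tar curl talosctl kubectl wild
```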

## Recovery Scenarios

### Scenario 1: Wild Central Device Failure (Cluster Intact)

The Raspberry Pi or server running Wild Central died, but the Kubernetes cluster nodes are still running.

**Steps:**

1. **Set up a new Wild Central device**:
   ```bash
   sudo dpkg -i wild-cloud-central_*.deb
   sudo systemctl enable wild-cloud-central
   ```

2. **Restore your data directory** from git (for manifests and config) plus your cluster config backup (for secrets and credentials):
   ```bash
   # Clone instance data from git (the target directory must be empty or not exist yet)
   git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

   # Extract cluster config backup over the top
   # This restores kubeconfig, secrets.yaml, talosconfig, etc.
   tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/
   ```

3. **Start Wild Central**:
   ```bash
   sudo systemctl start wild-cloud-central
   ```

4. **Verify connectivity**:
   ```bash
   wild instance use your-instance
   wild cluster status
   ```

The cluster is still running — your apps are live. Wild Central is just the management plane.
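
After extracting the cluster config backup, the instance directory holds live credentials. The sketch below checks that the credential files were actually restored and limits them to owner-only access. It assumes the archive extracts to the relative paths listed under "What Gets Backed Up"; `secure_restored_files` is a hypothetical helper name.

```bash
# Hypothetical helper: check that the restored credential files exist in an
# instance directory and make them readable and writable by their owner only.
# The relative paths are assumed to match the "What Gets Backed Up" table.
secure_restored_files() {
  instance_dir=$1
  status=0
  for rel in kubeconfig secrets.yaml \
             talos/generated/talosconfig talos/generated/secrets.yaml; do
    if [ -f "$instance_dir/$rel" ]; then
      chmod 600 "$instance_dir/$rel"
    else
      echo "not restored: $rel" >&2
      status=1
    fi
  done
  return "$status"
}

# Example: secure_restored_files /var/lib/wild-central/instances/your-instance
```

Owner-only permissions work only if the files are owned by the user that Wild Central runs as, so check ownership before applying this on a real device.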

### Scenario 2: Single Node Failure (Cluster Degraded)

One or more nodes died, but the cluster still has quorum: a majority of control plane nodes is healthy (for example, 2 of 3), or only worker nodes were lost.

**Steps:**

1. **Check cluster health** from Wild Central:
   ```bash
   talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     health --nodes <surviving-node-ip>
   ```

2. **Remove the dead node** from the cluster:
   ```bash
   # Remove from Kubernetes
   kubectl --kubeconfig /var/lib/wild-central/instances/your-instance/kubeconfig \
     delete node <dead-node-name>

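   # Remove from etcd (if control plane node).
   # Recent talosctl releases take the member ID listed by `etcd members`;
   # older releases accepted the node hostname instead.
   talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     etcd members --nodes <surviving-node-ip>
   talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     etcd remove-member <dead-member-id> --nodes <surviving-node-ip>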
   ```

3. **PXE boot a replacement node** using Wild Central's PXE service, or manually install Talos Linux on the new hardware.

4. **Add the new node** through the Wild Cloud web UI or CLI:
   ```bash
   wild node add --role worker --ip <new-node-ip>
   ```

5. **Verify workloads reschedule** to the new node:
   ```bash
   kubectl get pods --all-namespaces -o wide
   ```
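
The quorum condition for this scenario follows the usual etcd majority rule: a cluster with n control plane members needs floor(n/2) + 1 healthy members to keep accepting writes. A small sketch of that arithmetic follows; the function names are illustrative and not part of any Wild Cloud or Talos tooling.

```bash
# Majority needed for an etcd cluster with $1 voting members: floor(n/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) if $2 healthy members out of $1 total still form a quorum.
has_quorum() {
  [ "$2" -ge "$(etcd_quorum "$1")" ]
}

etcd_quorum 3          # prints 2
has_quorum 3 2 && echo "quorum held: this scenario applies"
has_quorum 3 1 || echo "quorum lost: this scenario does not apply"
```

With three control plane nodes, losing one leaves a working majority, while losing two does not.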

### Scenario 3: Total Cluster Loss (Rebuild from Scratch)

All nodes are gone. You need to rebuild everything.

**Prerequisites:**
- New hardware (or repaired existing hardware) with network boot capability or Talos Linux installed
- Your cluster config backup (tar.gz with kubeconfig, talosconfig, secrets.yaml, Talos configs)
- Access to your backup destination (S3 bucket, NFS share, etc.)
- Your instance data git repo (if available — contains compiled manifests)

**Steps:**

1. **Set up Wild Central** on a fresh device:
   ```bash
   sudo dpkg -i wild-cloud-central_*.deb
   ```

2. **Restore your data directory**:
   ```bash
   # If you have a git repo (the target directory must be empty or not exist yet):
   git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

   # Extract cluster config over the top:
   tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/
   ```

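   If you don't have a git repo, extract the cluster config backup into a fresh instance directory instead. You'll re-add apps from the Wild Directory in step 8.

   Once the data directory is in place, enable and start Wild Central as in Scenario 1 (`sudo systemctl enable wild-cloud-central`, then `sudo systemctl start wild-cloud-central`).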

3. **Bootstrap new Talos nodes** using the restored Talos configs:
   ```bash
   # Apply control plane config to the first node
   talosctl apply-config \
     --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     --nodes <node-ip> \
     --file /var/lib/wild-central/instances/your-instance/talos/generated/controlplane.yaml \
     --insecure
   ```

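   Repeat this for each remaining node, using `worker.yaml` for worker nodes. The `--insecure` flag is needed because a node without a config is in maintenance mode, where the Talos API does not yet use client certificate authentication.

   The restored `controlplane.yaml` and `worker.yaml` contain your cluster's identity (cluster name, secrets, certificates). Using them ensures the new cluster has the same identity as the old one.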

4. **Bootstrap the cluster** (run this once, against a single control plane node):
   ```bash
   talosctl bootstrap \
     --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     --nodes <first-control-plane-ip>
   ```

5. **Wait for the cluster to be healthy**:
   ```bash
   talosctl health \
     --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     --nodes <first-control-plane-ip>
   ```

6. **Update kubeconfig** (the new cluster may issue a fresh kubeconfig):
   ```bash
   talosctl kubeconfig \
     --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
     --nodes <first-control-plane-ip> \
     /var/lib/wild-central/instances/your-instance/kubeconfig
   ```

7. **Deploy infrastructure services first** (order matters):
   ```bash
   wild instance use your-instance
   wild service install metallb
   wild service install traefik
   wild service install cert-manager
   wild service install external-dns
   wild service install longhorn    # If using Longhorn for PVCs
   ```

8. **Deploy apps** (dependencies first, then apps):
   ```bash
   # Deploy database services first
   wild app deploy pg
   wild app deploy redis

   # Then deploy apps
   wild app deploy gitea
   wild app deploy immich
   # ... etc
   ```

   If your git repo has compiled manifests, these deploys apply the exact same manifests that were running before. If not, you'll need to re-add apps from the Wild Directory first:
   ```bash
   wild app add gitea
   wild app deploy gitea
   ```

9. **Restore app data from backups**:
   ```bash
   # Restore each app's data (database + PVC) from the backup destination
   # Use the Web UI: navigate to Backups > [app] > Restore
   # Or via CLI:
   wild restore gitea --auto
   wild restore immich --auto
   ```

   The `--auto` flag runs the full blue-green restore cycle: restore to standby, switch traffic, then clean up the old namespace. For more control, run each phase separately — see [Restoring Backups](restoring-backups.md).

10. **Verify everything is working**:
    ```bash
    wild app status gitea
    wild app status immich
    kubectl get pods --all-namespaces
    ```
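
The ordering in steps 7 to 9 (infrastructure services, then shared dependencies, then apps, then data restores) can be captured in a small script and reviewed before anything runs. This is a sketch under stated assumptions: `run` and `rebuild_plan` are hypothetical helpers, the service and app names are the examples used in this guide, and only `wild` subcommands already shown above appear. With `DRY_RUN=1` (the default here) the commands are printed rather than executed.

```bash
# Print each command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical plan that follows the order of steps 7-9 in this scenario.
# Arguments are the app names to deploy and then restore.
rebuild_plan() {
  # Infrastructure services first (drop longhorn if you do not use it for PVCs).
  for svc in metallb traefik cert-manager external-dns longhorn; do
    run wild service install "$svc" || return 1
  done
  # Shared dependencies before the apps that rely on them.
  for dep in pg redis; do
    run wild app deploy "$dep" || return 1
  done
  for app in "$@"; do
    run wild app deploy "$app" || return 1
  done
  # Restore data only after every app has been deployed.
  for app in "$@"; do
    run wild restore "$app" --auto || return 1
  done
}

# Example: review the plan first, then rerun with DRY_RUN=0 to execute it.
# rebuild_plan gitea immich
```

In a real recovery you would likely pause between phases and confirm each app with `wild app status` before restoring its data. The sketch stops at the first failing command but does not wait for readiness.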

## Cluster Config Backup

The cluster config backup feature archives the credentials and secrets needed to access the cluster, which are not tracked in git, along with a copy of `config.yaml`.

### What Gets Backed Up

| File | Purpose |
|------|---------|
| `kubeconfig` | Kubernetes API credentials |
| `config.yaml` | Full instance configuration |
| `secrets.yaml` | App secrets (database passwords, API keys) |
| `talos/generated/talosconfig` | Talos API credentials |
| `talos/generated/controlplane.yaml` | Control plane node config |
| `talos/generated/worker.yaml` | Worker node config |
| `talos/generated/secrets.yaml` | Talos bootstrap secrets (cluster identity) |
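
A downloaded archive can be checked against this table before you rely on it. The function below is a sketch: `missing_from_archive` is a hypothetical name, and it assumes the archive stores paths relative to the instance directory, exactly as listed above. Adjust the paths if your archive uses a prefix.

```bash
# Hypothetical helper: print every expected file that is absent from a
# cluster config archive. Returns 0 if nothing is missing, 1 if something
# is missing, 2 if the archive cannot be read.
missing_from_archive() {
  archive=$1
  listing=$(tar -tzf "$archive") || return 2
  rc=0
  for f in kubeconfig config.yaml secrets.yaml \
           talos/generated/talosconfig talos/generated/controlplane.yaml \
           talos/generated/worker.yaml talos/generated/secrets.yaml; do
    # Accept both "path" and "./path" entry styles.
    if ! printf '%s\n' "$listing" | grep -qxF -e "$f" -e "./$f"; then
      echo "$f"
      rc=1
    fi
  done
  return "$rc"
}

# Example: missing_from_archive cluster-config-backup.tar.gz
```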

### Creating Cluster Config Backups

**Web UI:** Navigate to Backups, click "Backup" on the "Cluster Config" row.

**CLI (via the API):**
```bash
# Trigger a cluster config backup; --fail makes curl exit non-zero on an HTTP error
curl --fail -X POST http://localhost:5055/api/v1/instances/your-instance/backup/cluster
```

**Scheduled:** Create a backup schedule with target type "cluster" to automatically back up cluster config on a recurring basis. See [Making Backups](making-backups.md) for scheduling details.

### Downloading a Cluster Config Backup

Cluster config backups are stored at your configured backup destination under the key `cluster-config/{instance}/{timestamp}.tar.gz`. To retrieve one:

- **S3/Azure:** Download from the bucket/container using your cloud provider's CLI
- **NFS:** Navigate to the NFS mount point and find the archive
- **Local:** Find it at `{data-dir}/instances/{instance}/backups/cluster-config/...`

Store a copy of the latest cluster config backup in a secure offsite location (encrypted USB drive, password manager, separate cloud storage). If your primary backup destination is on the cluster itself, a total cluster loss takes the backups with it.
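
When copying an archive offsite, recording a checksum next to it lets you confirm later that the copy is intact. The sketch below assumes GNU coreutils `sha256sum` is available, as on a Debian-based Wild Central device. The function names are hypothetical, and encryption is left to whichever tool you already use.

```bash
# Write "<archive>.sha256" next to the archive so copies can be verified later.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeeds only if the archive still matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Example:
#   write_checksum cluster-config-backup.tar.gz    # before copying offsite
#   verify_checksum cluster-config-backup.tar.gz   # after copying, or before a restore
```

Copy the `.sha256` file together with the archive. The checksum file stores only the file name, so verification works from whichever directory holds both files.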

## Prevention Checklist

- [ ] **Cluster config backups** are scheduled and running
- [ ] **App backups** are scheduled for all critical apps
- [ ] **Backup destination** is offsite or on separate infrastructure from the cluster
- [ ] **Instance data directory** is pushed to a git remote (excludes secrets.yaml)
- [ ] **Cluster config backup archive** is stored in a second location (not just on the cluster)
- [ ] **Test a restore** periodically — backups are worthless if restore doesn't work
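
The first checklist item can be spot-checked from a shell when the backup destination is a local directory or a mounted share. This is a sketch: `has_recent_backup` is a hypothetical helper, and it treats an archive's file modification time as the time the backup was taken.

```bash
# Succeeds if at least one *.tar.gz under directory $1 was modified
# within the last $2 days (find's -mtime counts whole 24-hour periods).
has_recent_backup() {
  [ -n "$(find "$1" -type f -name '*.tar.gz' -mtime "-$2" 2>/dev/null | head -n 1)" ]
}

# Example (replace <backup-dir> with your cluster config backup directory):
#   has_recent_backup <backup-dir> 7 || echo "no cluster config backup in the last 7 days"
```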

## Related Guides

- [Making Backups](making-backups.md) — Setting up backup destinations and schedules
- [Restoring Backups](restoring-backups.md) — Blue-green restore process in detail
- [Upgrade Talos](upgrade-talos.md) — Talos node upgrade and rollback
- [Troubleshoot Cluster](troubleshoot-cluster.md) — Diagnosing cluster issues after recovery