# Disaster Recovery

This guide covers recovering a Wild Cloud cluster after catastrophic failure: hardware death, corrupted storage, or any scenario where you need to rebuild from scratch.

## What You Need

To rebuild a cluster you need two things:

1. **Cluster config backup**: The tar.gz archive from Wild Cloud's cluster config backup feature, containing kubeconfig, talosconfig, config.yaml, secrets.yaml, and Talos node configs.
2. **App backups**: The per-app backup archives (database dumps, PVC snapshots, config files) stored at your backup destination (S3, NFS, or local).

If your instance data directory was a git repository (recommended), you also have the full history of compiled manifests and config.yaml in git. The git repo alone is enough to redeploy apps, but without secrets.yaml and kubeconfig you can't authenticate to the cluster or decrypt app secrets.

## Recovery Scenarios

### Scenario 1: Wild Central Device Failure (Cluster Intact)

The Raspberry Pi or server running Wild Central died, but the Kubernetes cluster nodes are still running.

**Steps:**

1. **Set up a new Wild Central device**:
```bash
sudo dpkg -i wild-cloud-central_*.deb
sudo systemctl enable wild-cloud-central
```

2. **Restore your data directory** from git (for manifests and config) plus your cluster config backup (for secrets and credentials):
```bash
# Clone instance data from git
git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

# Extract cluster config backup over the top
# This restores kubeconfig, secrets.yaml, talosconfig, etc.
tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/
```

3. **Start Wild Central**:
```bash
sudo systemctl start wild-cloud-central
```

4. **Verify connectivity**:
```bash
wild instance use your-instance
wild cluster status
```

The cluster is still running, so your apps are live. Wild Central is just the management plane.

### Scenario 2: Single Node Failure (Cluster Degraded)

One or more nodes died, but the cluster is still operational: the control plane still has quorum (at least 2 of 3 control plane nodes), or only replaceable worker nodes were lost.

**Steps:**

1. **Check cluster health** from Wild Central:
```bash
talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  health --nodes <surviving-node-ip>
```

2. **Remove the dead node** from the cluster:
```bash
# Remove from Kubernetes
kubectl --kubeconfig /var/lib/wild-central/instances/your-instance/kubeconfig \
  delete node <dead-node-name>

# Remove from etcd (if control plane node)
talosctl --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  etcd remove-member <dead-node-name> --nodes <surviving-node-ip>
```

3. **PXE boot a replacement node** using Wild Central's PXE service, or manually install Talos Linux on the new hardware.

4. **Add the new node** through the Wild Cloud web UI or CLI:
```bash
wild node add --role worker --ip <new-node-ip>
```

5. **Verify workloads reschedule** to the new node:
```bash
kubectl get pods --all-namespaces -o wide
```

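Before deleting a node, it can help to confirm which nodes Kubernetes itself reports as unhealthy. The sketch below is one way to do that, assuming the default column layout of `kubectl get nodes --no-headers` (name first, status second). `not_ready_nodes` is a hypothetical helper name, not part of the Wild Cloud CLI.

```bash
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column does not start with "Ready".
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not listed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example (kubeconfig path taken from the steps above):
# kubectl --kubeconfig /var/lib/wild-central/instances/your-instance/kubeconfig \
#   get nodes --no-headers | not_ready_nodes
```

For a control plane node, `talosctl etcd members --nodes <surviving-node-ip>` (with the same `--talosconfig` flag as above) lists the current etcd members, which is worth comparing against the node you plan to remove.
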
### Scenario 3: Total Cluster Loss (Rebuild from Scratch)

All nodes are gone. You need to rebuild everything.

**Prerequisites:**
- New hardware (or repaired existing hardware) with network boot capability or Talos Linux installed
- Your cluster config backup (tar.gz with kubeconfig, talosconfig, secrets.yaml, Talos configs)
- Access to your backup destination (S3 bucket, NFS share, etc.)
- Your instance data git repo (if available; it contains compiled manifests)

**Steps:**

1. **Set up Wild Central** on a fresh device:
```bash
sudo dpkg -i wild-cloud-central_*.deb
```

2. **Restore your data directory**:
```bash
# If you have a git repo:
git clone https://your-git-server/wild-cloud-data.git /var/lib/wild-central

# Extract cluster config over the top:
tar -xzf cluster-config-backup.tar.gz -C /var/lib/wild-central/instances/your-instance/
```

If you don't have a git repo, just extract the cluster config backup into a fresh instance directory. You'll re-add apps from the Wild Directory.

3. **Bootstrap new Talos nodes** using the restored Talos configs:
```bash
# Apply control plane config to the first node
talosctl apply-config \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <node-ip> \
  --file /var/lib/wild-central/instances/your-instance/talos/generated/controlplane.yaml \
  --insecure
```

The restored `controlplane.yaml` and `worker.yaml` contain your cluster's identity (cluster name, secrets, certificates). Using them ensures the new cluster has the same identity as the old one.

4. **Bootstrap the cluster**:
```bash
talosctl bootstrap \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <first-control-plane-ip>
```

5. **Wait for the cluster to be healthy**:
```bash
talosctl health \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <first-control-plane-ip>
```

6. **Update kubeconfig** (the new cluster may issue a fresh kubeconfig):
```bash
talosctl kubeconfig \
  --talosconfig /var/lib/wild-central/instances/your-instance/talos/generated/talosconfig \
  --nodes <first-control-plane-ip> \
  /var/lib/wild-central/instances/your-instance/kubeconfig
```

7. **Deploy infrastructure services first** (order matters):
```bash
wild instance use your-instance
wild service install metallb
wild service install traefik
wild service install cert-manager
wild service install external-dns
wild service install longhorn  # If using Longhorn for PVCs
```

8. **Deploy apps** (dependencies first, then apps):
```bash
# Deploy database services first
wild app deploy pg
wild app deploy redis

# Then deploy apps
wild app deploy gitea
wild app deploy immich
# ... etc
```

If your git repo has compiled manifests, these deploys apply the exact same manifests that were running before. If not, you'll need to re-add apps from the Wild Directory first:
```bash
wild app add gitea
wild app deploy gitea
```

9. **Restore app data from backups**:
```bash
# Restore each app's data (database + PVC) from the backup destination
# Use the Web UI: navigate to Backups > [app] > Restore
# Or via CLI:
wild restore gitea --auto
wild restore immich --auto
```

The `--auto` flag runs the full blue-green restore cycle: restore to standby, switch traffic, then clean up the old namespace. For more control, run each phase separately; see [Restoring Backups](restoring-backups.md).

10. **Verify everything is working**:
```bash
wild app status gitea
wild app status immich
kubectl get pods --all-namespaces
```

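Because the install and deploy order in steps 7 and 8 matters, a scripted rebuild should stop at the first failure instead of carrying on with later services. The following is a minimal sketch of that idea. `run_in_order` is a hypothetical helper, not a Wild Cloud command, and it only wraps the `wild` commands already shown above.

```bash
# run_in_order "COMMAND PREFIX" ITEM...
# Runs "COMMAND PREFIX ITEM" for each item in the order given and stops at
# the first failure, so later steps never run on top of a broken dependency.
run_in_order() {
  local cmd="$1"
  local item
  shift
  for item in "$@"; do
    echo "==> $cmd $item"
    # $cmd is deliberately unquoted so "wild service install" splits into words.
    $cmd "$item" || { echo "failed: $cmd $item" >&2; return 1; }
  done
}

# Example, using the order from steps 7 and 8:
# run_in_order "wild service install" metallb traefik cert-manager external-dns longhorn \
#   && run_in_order "wild app deploy" pg redis gitea immich
```
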
## Cluster Config Backup

The cluster config backup feature archives the credentials and secrets needed to access the cluster, which are NOT tracked in git, together with a copy of `config.yaml`.

### What Gets Backed Up

| File | Purpose |
|------|---------|
| `kubeconfig` | Kubernetes API credentials |
| `config.yaml` | Full instance configuration |
| `secrets.yaml` | App secrets (database passwords, API keys) |
| `talos/generated/talosconfig` | Talos API credentials |
| `talos/generated/controlplane.yaml` | Control plane node config |
| `talos/generated/worker.yaml` | Worker node config |
| `talos/generated/secrets.yaml` | Talos bootstrap secrets (cluster identity) |

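It is worth checking that an archive really contains these files before you depend on it. The sketch below lists the archive and reports anything missing. It assumes the paths inside the archive match the table (relative to the instance directory, as the `tar -xzf ... -C` restore step implies); `missing_from_archive` is a hypothetical helper name.

```bash
# Print each expected file that is absent from a cluster config archive.
# Returns 0 if nothing is missing, 1 if files are missing, 2 if the archive
# cannot be read.
missing_from_archive() {
  local archive="$1"
  local listing
  local missing=0
  local f
  # Capture the listing first so a corrupt or unreadable archive is reported.
  listing="$(tar -tzf "$archive")" || return 2
  for f in \
    kubeconfig \
    config.yaml \
    secrets.yaml \
    talos/generated/talosconfig \
    talos/generated/controlplane.yaml \
    talos/generated/worker.yaml \
    talos/generated/secrets.yaml
  do
    # Strip a leading "./" so both "kubeconfig" and "./kubeconfig" match.
    if ! printf '%s\n' "$listing" | sed 's|^\./||' | grep -qxF -- "$f"; then
      echo "$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# missing_from_archive cluster-config-backup.tar.gz && echo "archive looks complete"
```
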
### Creating Cluster Config Backups

**Web UI:** Navigate to Backups, click "Backup" on the "Cluster Config" row.

**CLI:**
```bash
# Via API
curl -X POST http://localhost:5055/api/v1/instances/your-instance/backup/cluster
```

**Scheduled:** Create a backup schedule with target type "cluster" to automatically back up cluster config on a recurring basis. See [Making Backups](making-backups.md) for scheduling details.

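If you call the API from your own script, it helps for the script to fail when the request fails. The sketch below wraps the same endpoint shown above; the function names are hypothetical, and the only addition is curl's `--fail` flag, which makes curl exit non-zero on HTTP error responses.

```bash
# Build the cluster backup endpoint URL for an instance.
# $1: base URL of Wild Central, e.g. http://localhost:5055
# $2: instance name
cluster_backup_url() {
  local base="$1"
  local instance="$2"
  # ${base%/} drops one trailing slash so the URL has no "//" in the path.
  printf '%s/api/v1/instances/%s/backup/cluster\n' "${base%/}" "$instance"
}

# Trigger a cluster config backup and return non-zero if the request fails.
trigger_cluster_backup() {
  curl --fail --silent --show-error -X POST "$(cluster_backup_url "$1" "$2")"
}

# Example:
# trigger_cluster_backup http://localhost:5055 your-instance
```
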
### Downloading a Cluster Config Backup

Cluster config backups are stored at your configured backup destination under the key `cluster-config/{instance}/{timestamp}.tar.gz`. To retrieve one:

- **S3/Azure:** Download from the bucket/container using your cloud provider's CLI
- **NFS:** Navigate to the NFS mount point and find the archive
- **Local:** Find it at `{data-dir}/instances/{instance}/backups/cluster-config/...`

Store a copy of the latest cluster config backup in a secure offsite location (encrypted USB drive, password manager, separate cloud storage). If your primary backup destination is on the cluster itself, a total cluster loss takes the backups with it.

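For NFS or local destinations, picking out the latest archive can be scripted. This is a rough sketch that assumes the `{timestamp}` part of the file name sorts chronologically as plain text (for example an ISO-8601-style stamp); check that against your own file names before relying on it. `latest_cluster_backup` is a hypothetical helper name.

```bash
# Print the path of the newest cluster config archive under a directory.
# ASSUMPTION: timestamps in the file names sort chronologically as text.
# Prints nothing if the directory contains no .tar.gz files.
latest_cluster_backup() {
  local dir="$1"
  find "$dir" -type f -name '*.tar.gz' | LC_ALL=C sort | tail -n 1
}

# Example (both paths are placeholders for your own mount points):
# cp "$(latest_cluster_backup /mnt/backups/cluster-config/your-instance)" /media/offsite-usb/
```
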
## Prevention Checklist

- [ ] **Cluster config backups** are scheduled and running
- [ ] **App backups** are scheduled for all critical apps
- [ ] **Backup destination** is offsite or on separate infrastructure from the cluster
- [ ] **Instance data directory** is pushed to a git remote (excludes secrets.yaml)
- [ ] **Cluster config backup archive** is stored in a second location (not just on the cluster)
- [ ] **Test a restore** periodically: backups are worthless if restore doesn't work

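One way to check the "scheduled and running" items is to confirm that the offsite copy of an archive is recent. The sketch below is a hypothetical helper, not a Wild Cloud feature; it relies only on the file's modification time, so it says nothing about whether the archive contents are valid.

```bash
# Succeed if FILE is a regular file modified within the last MAX_DAYS days.
backup_is_fresh() {
  local file="$1"
  local max_days="$2"
  # find prints the path only when the age test matches; empty output means
  # the file is stale, missing, or not a regular file.
  [ -n "$(find "$file" -maxdepth 0 -type f -mtime "-$max_days" 2>/dev/null)" ]
}

# Example (path is a placeholder):
# backup_is_fresh /media/offsite-usb/cluster-config-backup.tar.gz 7 \
#   || echo "offsite cluster config backup is older than 7 days" >&2
```
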
## Related Guides

- [Making Backups](making-backups.md): Setting up backup destinations and schedules
- [Restoring Backups](restoring-backups.md): Blue-green restore process in detail
- [Upgrade Talos](upgrade-talos.md): Talos node upgrade and rollback
- [Troubleshoot Cluster](troubleshoot-cluster.md): Diagnosing cluster issues after recovery