Disaster Recovery Backup System
Core Requirements
- True disaster recovery: All backup data on NFS (or other external destination)
- Migration capability: Restore apps from one instance/cluster to another
- Simplicity: Keep only the latest backup per app/cluster
- Incremental: Use Longhorn's incremental backup capability to minimize storage and transfer time
- Cluster backup: Include kubeconfig, talosconfig, and cluster-level configs
Backup Structure on Destination
nfs:/data/{instance-name}/backups/
├── cluster/                     # Cluster-level backup (latest only)
│   ├── kubeconfig               # Kubernetes access
│   ├── talosconfig              # Talos node access
│   ├── config.yaml              # Instance configuration
│   └── setup/                   # Cluster services configs
│       └── cluster-services/
└── apps/
    └── {app-name}/              # Per-app backup (latest only)
        ├── manifest.yaml        # App manifest with dependencies
        ├── config.tar.gz        # App YAML files from apps/{app}/
        ├── app-config.yaml      # App section from config.yaml
        ├── app-secrets.yaml     # App section from secrets.yaml
        └── volumes/
            └── {pvc-name}.qcow2 # Longhorn volume export
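As a rough illustration, the layout above could be resolved by a small path helper. This is a sketch only: the BackupLayout type and its method names are hypothetical, and Root stands in for the /data prefix on the NFS export.

```go
package main

import (
	"fmt"
	"path"
)

// BackupLayout resolves destination paths for one instance.
// Hypothetical type: the design only fixes the directory tree, not this API.
type BackupLayout struct {
	Root     string // e.g. "/data" on the NFS export
	Instance string // {instance-name}
}

func (l BackupLayout) base() string {
	return path.Join(l.Root, l.Instance, "backups")
}

// ClusterDir is where kubeconfig, talosconfig and cluster configs go.
func (l BackupLayout) ClusterDir() string {
	return path.Join(l.base(), "cluster")
}

// AppDir is the per-app backup directory (latest backup only).
func (l BackupLayout) AppDir(app string) string {
	return path.Join(l.base(), "apps", app)
}

// VolumeFile is the export file for one PVC of an app.
func (l BackupLayout) VolumeFile(app, pvc string) string {
	return path.Join(l.AppDir(app), "volumes", pvc+".qcow2")
}

func main() {
	l := BackupLayout{Root: "/data", Instance: "home"}
	fmt.Println(l.ClusterDir())
	fmt.Println(l.VolumeFile("gitea", "gitea-data"))
}
```

Keeping path construction in one place means backup, restore, and cleanup cannot disagree about where a file lives.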
Blue-Green Backup-Restore Algorithm
Strategies
Cluster
What to back up:
- kubeconfig (cluster access)
- talosconfig (node management)
- config.yaml (minus apps section)
- secrets.yaml (minus apps section)
- setup/cluster-services/* (all service configs)
Cluster Backup Process:
- Copy kubeconfig to NFS
- Copy talosconfig to NFS
- Extract non-app config → cluster-config.yaml
- Extract non-app secrets → cluster-secrets.yaml
- Tar cluster-services → setup.tar.gz
Cluster Restore Process:
- Verify cluster is accessible
- Restore kubeconfig and talosconfig
- Merge cluster config (preserve existing apps)
- Merge cluster secrets (preserve existing apps)
- Extract and apply cluster services
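The "extract non-app config" and "merge, preserving existing apps" steps can be sketched as two map operations. This assumes config.yaml has already been decoded into a map[string]any by whatever YAML library the project uses; stripApps and mergeCluster are hypothetical names, and the merge is shallow (top-level keys only).

```go
package main

import "fmt"

// stripApps returns a shallow copy of cfg without the "apps" key,
// i.e. the cluster-level portion that goes into cluster-config.yaml.
func stripApps(cfg map[string]any) map[string]any {
	out := make(map[string]any, len(cfg))
	for k, v := range cfg {
		if k != "apps" {
			out[k] = v
		}
	}
	return out
}

// mergeCluster overlays backed-up cluster keys onto the existing config
// while leaving the existing "apps" section untouched.
func mergeCluster(existing, backup map[string]any) map[string]any {
	out := make(map[string]any, len(existing)+len(backup))
	for k, v := range existing {
		out[k] = v
	}
	for k, v := range backup {
		if k != "apps" {
			out[k] = v
		}
	}
	return out
}

func main() {
	current := map[string]any{
		"cluster": map[string]any{"name": "old"},
		"apps":    map[string]any{"gitea": map[string]any{"port": 3000}},
	}
	backup := stripApps(map[string]any{
		"cluster": map[string]any{"name": "restored"},
		"apps":    map[string]any{"stale": true},
	})
	merged := mergeCluster(current, backup)
	fmt.Println(merged["cluster"], merged["apps"])
}
```

The same two functions would apply to secrets.yaml, since it uses the same top-level "apps" split.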
App Config & Secrets
Config Restore:
- Load existing config.yaml
- Extract app section from backup
- Merge: existingConfig["apps"][appName] = backupAppConfig
- Write back config.yaml (preserving other apps)
Secret Restore:
- Load existing secrets.yaml
- Extract app section from backup
- Merge: existingSecrets["apps"][appName] = backupAppSecrets
- Write back secrets.yaml
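A minimal sketch of the per-app merge, assuming the YAML document is already decoded into a map[string]any. The function name restoreAppSection is hypothetical; loading and writing the file are left out. The same function would serve both config.yaml and secrets.yaml.

```go
package main

import "fmt"

// restoreAppSection sets doc["apps"][appName] to the backed-up section,
// creating the "apps" map if it is missing. Other apps are left as they are.
func restoreAppSection(doc map[string]any, appName string, backup map[string]any) error {
	apps, ok := doc["apps"].(map[string]any)
	if !ok {
		if doc["apps"] != nil {
			// "apps" exists but is not a mapping: refuse to overwrite it.
			return fmt.Errorf("apps section has unexpected type %T", doc["apps"])
		}
		apps = map[string]any{}
		doc["apps"] = apps
	}
	apps[appName] = backup
	return nil
}

func main() {
	existing := map[string]any{
		"apps": map[string]any{
			"immich": map[string]any{"domain": "photos.example.com"},
		},
	}
	backup := map[string]any{"domain": "git.example.com"}
	if err := restoreAppSection(existing, "gitea", backup); err != nil {
		panic(err)
	}
	fmt.Println(len(existing["apps"].(map[string]any)), "apps after restore")
}
```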
Longhorn
Backup:
- Create a Longhorn Backup custom resource that points to the volume
- Longhorn handles snapshot + export to NFS automatically
- Track backup name and metadata locally
- Stream progress via SSE using operations package
- Clean up old backups (keep only latest)
Restore:
- Create new namespace: {app}-restore
- Create PVCs from Longhorn backup
- Copy apps/{app}/ to apps/{app}-restore/
- Deploy from apps/{app}-restore/ to the restore namespace via kubectl apply -k
- Wait for pods to be ready
- After verification, swap directories:
  - mv apps/{app} apps/{app}-old
  - mv apps/{app}-restore apps/{app}
- Switch ingress to restored namespace
Longhorn qcow2 Export
- Longhorn supports direct export to NFS via URL
- Incremental backups track only changed blocks
- Format: nfs://server/path/file.qcow2
- Authentication: Uses node's NFS mount permissions
CLI
# Backup (no timestamp needed)
wild app backup gitea
wild app backup --all # All apps
wild cluster backup
# Restore (blue-green deployment)
wild app restore gitea # Creates gitea-restore namespace
wild app restore gitea --from-instance prod-cloud # Migration
wild cluster restore
# After verification
wild app restore-switch gitea # Switch to restored version
wild app restore-cleanup gitea # Remove old deployment
API Endpoints
Backup Endpoints
POST /api/v1/instances/{instance}/apps/{app}/backup
- Creates a backup on NFS using Longhorn's native backup mechanism
- Returns operation ID for SSE tracking
POST /api/v1/instances/{instance}/backup
- Backs up all apps and cluster config
- Returns operation ID
GET /api/v1/instances/{instance}/apps/{app}/backup
- Returns latest backup metadata (time, size, location)
GET /api/v1/instances/{instance}/backups
- Lists all backups (apps + cluster)
Restore Endpoints
POST /api/v1/instances/{instance}/apps/{app}/restore
Body: {
  "fromInstance": "source-instance", // Optional, for migration
  "strategy": "blue-green"           // Default
}
- Creates restore namespace and copies to apps/{app}-restore/
- Returns operation ID
POST /api/v1/instances/{instance}/apps/{app}/restore/switch
- Switches ingress to restored namespace
- Swaps directories: apps/{app} → apps/{app}-old, apps/{app}-restore → apps/{app}
POST /api/v1/instances/{instance}/apps/{app}/restore/cleanup
- Deletes old namespace and apps/{app}-old/
GET /api/v1/instances/{instance}/apps/{app}/restore/status
- Returns restore operation status and health checks
Error Handling
- NFS unavailable: "Cannot reach backup storage. Check network connection."
- Backup corruption: Keep previous backup until new one verified
- Restore failures: Blue-green means old app keeps running
- Timeout: 5 min default, extend for large PVCs
- No space: "Not enough space on backup storage (need X GB)"
Operations Tracking
// Leverage existing operations package:
1. Create operation: operations.NewOperation("backup_app_gitea")
2. Update progress: op.UpdateProgress(45, "Exporting volume...")
3. Stream via SSE: events.Publish(Event{Type: "operation:progress", Data: op})
4. Complete: op.Complete(result) or op.Fail(error)
5. Auto-cleanup: Operations older than 24h are purged
// Progress milestones for backup:
- 10%: Creating snapshot
- 30%: Connecting to NFS
- 50-90%: Exporting data (based on size)
- 95%: Verifying backup
- 100%: Cleanup and complete
// Progress milestones for restore:
- 10%: Validating backup
- 20%: Creating restore namespace
- 30-70%: Importing volumes from NFS
- 80%: Deploying application
- 90%: Waiting for pods ready
- 100%: Restore complete
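The size-based ranges in the milestones (50-90% for export, 30-70% for import) can be computed by mapping bytes transferred onto the range. A minimal sketch, with phaseProgress as a hypothetical helper; the result would be passed to op.UpdateProgress.

```go
package main

import "fmt"

// phaseProgress maps done/total onto the percentage range [lo, hi].
// Out-of-range inputs are clamped to the ends of the range.
func phaseProgress(lo, hi int, done, total int64) int {
	if total <= 0 || done <= 0 {
		return lo
	}
	if done >= total {
		return hi
	}
	return lo + int(float64(hi-lo)*float64(done)/float64(total))
}

func main() {
	const total = int64(8) << 30 // 8 GiB volume, illustrative
	for _, done := range []int64{0, total / 4, total / 2, total} {
		fmt.Printf("export: %d%%  import: %d%%\n",
			phaseProgress(50, 90, done, total),
			phaseProgress(30, 70, done, total))
	}
}
```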