wild-cloud-dev/docs/future/backup-restore.md
2026-05-16 22:24:30 +00:00


# Wild Cloud Backup & Restore System
**Status**: Core system implemented and tested. Deterministic patching enhancement planned.

---
## What's Been Built
Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores it into a color-tagged (blue or green) standby environment for validation before switching traffic.
### Architecture
```
Backup: [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch: [Standby App] becomes active, old active cleaned up
```
### Strategy Pattern
Each app's backup is composed of one or more strategies that run independently:
| Strategy | What it backs up | How it restores |
|----------|-----------------|-----------------|
| `config` | `config.yaml`, `secrets.yaml`, app manifests | Copies files to standby app directory |
| `postgres` | `pg_dump` of the app's database | Creates `{dbName}_{color}` database, `pg_restore` into it |
| `mysql` | `mysqldump` of the app's database | Creates `{dbName}_{color}` database, restores into it |
| `longhorn-native` | Longhorn volume snapshots via API | Creates `{pvcName}-{color}` volumes from backup |

Strategies are selected automatically from the app's manifest dependencies (e.g., an app that requires `pg` gets the `postgres` strategy).
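
A minimal sketch of how that selection could work. Only the `pg` to `postgres` pairing comes from the text above; the `mysql` dependency name, the always-on `config` strategy, and the `hasVolumes` flag are assumptions made for illustration, and `selectStrategies` is a hypothetical name.

```go
package main

import "fmt"

// dependencyStrategies maps a manifest dependency to the strategy it implies.
// Only "pg" -> "postgres" is stated in the document; "mysql" is an assumption.
var dependencyStrategies = map[string]string{
	"pg":    "postgres",
	"mysql": "mysql",
}

// selectStrategies returns the strategy names for an app.
// Assumptions: "config" always runs, and "longhorn-native" is added
// whenever the app declares persistent volumes.
func selectStrategies(requires []string, hasVolumes bool) []string {
	strategies := []string{"config"}
	seen := map[string]bool{"config": true}
	for _, dep := range requires {
		if s, ok := dependencyStrategies[dep]; ok && !seen[s] {
			strategies = append(strategies, s)
			seen[s] = true
		}
	}
	if hasVolumes {
		strategies = append(strategies, "longhorn-native")
	}
	return strategies
}

func main() {
	// Dependencies without a matching strategy (here "smtp") are ignored.
	fmt.Println(selectStrategies([]string{"pg", "smtp"}, true))
	// Prints: [config postgres longhorn-native]
}
```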
### RecoveryPlan
A YAML coordination record tracks the entire lifecycle:
```yaml
app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
```
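
The colored names in the plan above follow three patterns: `{app}-{color}` for the namespace, `{dbName}_{color}` for the database, and `{pvcName}-{color}` for volumes. The helpers below are a hypothetical sketch of that naming, not the code in `types.go`; treating any non-blue active color as "standby is blue" is an assumption.

```go
package main

import "fmt"

// standbyColor returns the color opposite the active one.
// Assumption: anything other than "blue" means the standby is blue.
func standbyColor(active string) string {
	if active == "blue" {
		return "green"
	}
	return "blue"
}

// Colored resource names, following the patterns in the strategy table
// and the example RecoveryPlan.
func coloredNamespace(app, color string) string { return app + "-" + color }
func coloredDatabase(db, color string) string   { return db + "_" + color }
func coloredVolume(pvc, color string) string    { return pvc + "-" + color }

func main() {
	color := standbyColor("blue")
	fmt.Println(coloredNamespace("listmonk", color))   // listmonk-green
	fmt.Println(coloredDatabase("listmonk", color))    // listmonk_green
	fmt.Println(coloredVolume("listmonk-data", color)) // listmonk-data-green
}
```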
### Standby Deployment
When restoring, the system:
1. Copies app manifests to a standby directory
2. Patches `kustomization.yaml` namespace to `{app}-{color}`
3. Patches `namespace.yaml` to match
4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
5. Runs `kubectl apply -k` on the standby directory
6. Creates Kubernetes secrets from `secrets.yaml` (source of truth) in the standby namespace
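
Step 4 has to handle two shapes of value: an env var that holds the database name exactly, and one that holds a connection URL ending in the database name. The sketch below shows one way to rewrite both; `rewriteDBRef` is a hypothetical helper, and it assumes the URL path consists of the database name alone.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteDBRef swaps oldDB for newDB in an env var value.
// It handles an exact match ("listmonk") and a connection URL whose
// path is the database name ("postgres://u:p@host:5432/listmonk").
// Any other value is returned unchanged.
func rewriteDBRef(value, oldDB, newDB string) string {
	if value == oldDB {
		return newDB
	}
	u, err := url.Parse(value)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return value // not a URL, leave it alone
	}
	if strings.TrimPrefix(u.Path, "/") != oldDB {
		return value // URL points at some other database
	}
	u.Path = "/" + newDB
	return u.String()
}

func main() {
	fmt.Println(rewriteDBRef("listmonk", "listmonk", "listmonk_green"))
	fmt.Println(rewriteDBRef(
		"postgres://user:pass@db:5432/listmonk?sslmode=disable",
		"listmonk", "listmonk_green"))
	fmt.Println(rewriteDBRef("listmonk-user", "listmonk", "listmonk_green"))
}
```

Parsing the URL rather than doing a plain string replace avoids touching a user name or host that happens to contain the database name. Note that `u.String()` may re-encode unusual characters in credentials.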
### Key Files
```
api/internal/backup/
├── backup.go                  # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go             # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│   ├── config.go              # Config file backup/restore
│   ├── postgres.go            # PostgreSQL dump/restore
│   ├── mysql.go               # MySQL dump/restore
│   └── longhorn_native.go     # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go                 # NFS backup storage
    └── local.go               # Local filesystem backup storage
```
### API Endpoints
```
POST /api/v1/instances/{name}/apps/{app}/backup # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup # Clean up old active
GET /api/v1/instances/{name}/apps/{app}/backups # List backups
GET /api/v1/instances/{name}/apps/{app}/backup/{ts} # Get specific backup plan
```
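
As an illustration of how a client might address the four lifecycle endpoints, the sketch below builds the POST request for a given action. The base URL and the `newAppActionRequest` helper are assumptions; the document does not specify request bodies or authentication, so none are shown.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newAppActionRequest builds the POST request for one lifecycle action.
// baseURL is a hypothetical API address supplied by the caller.
func newAppActionRequest(baseURL, instance, app, action string) (*http.Request, error) {
	switch action {
	case "backup", "restore", "switch", "cleanup":
		// valid POST actions from the endpoint list
	default:
		return nil, fmt.Errorf("unknown action %q", action)
	}
	endpoint, err := url.JoinPath(baseURL, "api/v1/instances", instance, "apps", app, action)
	if err != nil {
		return nil, err
	}
	return http.NewRequest(http.MethodPost, endpoint, nil)
}

func main() {
	// The address below is a placeholder, not the real API port.
	req, err := newAppActionRequest("http://localhost:8000", "test-cloud", "listmonk", "backup")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```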
### What's Been Tested
Full end-to-end backup/restore cycle for listmonk on `test-cloud`:
- Config strategy: config.yaml, secrets.yaml, app manifests backed up and restored
- Postgres strategy: database dumped, colored database created, data restored
- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
- Secrets deployed from `secrets.yaml` (source of truth) to standby namespace
- Database name patching in env vars (both exact match and connection string URLs)
### Bugs Found and Fixed During Testing
1. **Secret deployment ordering**: Secrets were being created before `kubectl apply -k`, which is the step that creates the namespace. Fixed by moving secret creation to after the apply.
2. **Secret source of truth**: The restore was copying Kubernetes secrets between namespaces. Changed to create secrets from `secrets.yaml` (matching the normal deploy flow).
3. **Longhorn stale snapshots**: Engine-level snapshots persisted after the Kubernetes snapshot custom resources were deleted. Required Longhorn API `snapshotDelete` + `snapshotPurge` to clean up before new backups could succeed.
---
## What Remains To Be Done
### 1. Deterministic Blue-Green Patching (Next Priority)
**Problem**: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like `DATABASE`, `DB_NAME`, etc., and excludes names containing `USER`, `HOST`, etc. This works for common cases but is fragile:
- Env var names that don't follow conventions get missed
- Apps with multiple databases need multiple distinct mappings
- PVC name patching for volumes is not yet wired into the standby deployment
- Secret values containing database URLs need rewriting but currently aren't
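
To make the fragility concrete, here is a simplified sketch of a name-based heuristic of the kind described. The pattern lists contain only the names quoted above (the full lists are elided in the text), and `looksLikeDBNameVar` is a hypothetical stand-in for the real check.

```go
package main

import (
	"fmt"
	"strings"
)

// Only the patterns named in the document; the real lists are longer.
var includePatterns = []string{"DATABASE", "DB_NAME"}
var excludePatterns = []string{"USER", "HOST"}

// looksLikeDBNameVar guesses whether an env var holds a database name.
// Exclusions win over inclusions, so DATABASE_USER is rejected.
func looksLikeDBNameVar(name string) bool {
	upper := strings.ToUpper(name)
	for _, p := range excludePatterns {
		if strings.Contains(upper, p) {
			return false
		}
	}
	for _, p := range includePatterns {
		if strings.Contains(upper, p) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"DB_NAME", "DATABASE_USER", "LISTMONK_db__database", "POSTGRES_DB"} {
		fmt.Printf("%-24s %v\n", name, looksLikeDBNameVar(name))
	}
}
```

With these lists, `POSTGRES_DB` is not recognized even though it plainly holds a database name, which is the kind of miss an explicit declaration avoids.
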
**Proposed solution**: Add a `restore` field to app manifests that explicitly declares what needs patching:
```yaml
# In manifest.yaml
restore:
  databases:
    - configKey: dbName       # Path in defaultConfig holding the database name
      secretKeys: [dbUrl]     # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data           # PVC names to create colored copies of
```
**Implementation plan**: See `~/.claude/plans/floofy-waddling-locket.md` for detailed design.
**Changes needed**:
- Add `RestoreConfig` struct to `api/internal/apps/models.go`
- Refactor `updateDatabaseRefsFromPlan` in `backup.go` to use manifest declarations (with heuristic fallback)
- Refactor `deploySecretsToNamespace` to rewrite secret values for declared secret keys
- Add `updateVolumeRefsFromManifest` for PVC name patching in standby kustomize
- Add `restore` field to ~13 app manifests in `wild-directory/`
- Document the `restore` field in `wild-directory/ADDING-APPS.md`
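
One possible shape for the `RestoreConfig` struct and for resolving its declarations into concrete renames is sketched below. The field names and YAML tags are inferred from the manifest example and are not the final design; `RestoreDatabase`, `Rename`, and `plannedRenames` are hypothetical, and the app config is simplified to a flat map.

```go
package main

import "fmt"

// RestoreConfig mirrors the proposed `restore` manifest field.
// Names and tags are assumptions derived from the YAML example.
type RestoreConfig struct {
	Databases []RestoreDatabase `yaml:"databases,omitempty"`
	Volumes   []string          `yaml:"volumes,omitempty"`
}

// RestoreDatabase declares where one database name appears.
type RestoreDatabase struct {
	ConfigKey  string   `yaml:"configKey"`
	SecretKeys []string `yaml:"secretKeys,omitempty"`
}

// Rename pairs an original resource name with its colored replacement.
type Rename struct{ From, To string }

// plannedRenames resolves the declarations against the app's config
// (simplified here to a flat key/value map) for a standby color.
func plannedRenames(rc RestoreConfig, config map[string]string, color string) ([]Rename, []Rename, error) {
	var dbs, volumes []Rename
	for _, d := range rc.Databases {
		name, ok := config[d.ConfigKey]
		if !ok {
			return nil, nil, fmt.Errorf("config key %q not found", d.ConfigKey)
		}
		dbs = append(dbs, Rename{From: name, To: name + "_" + color})
	}
	for _, v := range rc.Volumes {
		volumes = append(volumes, Rename{From: v, To: v + "-" + color})
	}
	return dbs, volumes, nil
}

func main() {
	rc := RestoreConfig{
		Databases: []RestoreDatabase{{ConfigKey: "dbName", SecretKeys: []string{"dbUrl"}}},
		Volumes:   []string{"listmonk-data"},
	}
	dbs, volumes, err := plannedRenames(rc, map[string]string{"dbName": "listmonk"}, "green")
	fmt.Println(dbs, volumes, err)
}
```

Failing on a missing config key, rather than silently skipping it, keeps a typo in a manifest from producing a standby that still points at the active database.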
### 2. Volume Patching in Standby Deployment
**Problem**: The longhorn-native strategy creates colored volumes (`{pvcName}-{color}`) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.
**Solution**: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite `claimName` references in the standby `kustomization.yaml`, similar to how database env vars are patched today.
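
The sketch below generates JSON Patch (RFC 6902) operations that rewrite `claimName` for each pod volume that has a colored copy. It assumes the workload is a Deployment or StatefulSet, so the volumes live under `/spec/template/spec/volumes`; `podVolume` and `claimNamePatches` are hypothetical names, and `updateVolumeRefsFromManifest` may well be structured differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// podVolume is the subset of a pod volume needed here. ClaimName is
// empty for non-PVC volumes such as configMaps.
type podVolume struct {
	Name      string
	ClaimName string
}

// claimNamePatches emits a replace op for every volume whose claim has
// a colored copy. The index in the path is the volume's position in
// the pod spec, so volumes must be passed in manifest order.
func claimNamePatches(volumes []podVolume, renames map[string]string) []patchOp {
	var ops []patchOp
	for i, v := range volumes {
		to, ok := renames[v.ClaimName]
		if !ok {
			continue
		}
		ops = append(ops, patchOp{
			Op:    "replace",
			Path:  fmt.Sprintf("/spec/template/spec/volumes/%d/persistentVolumeClaim/claimName", i),
			Value: to,
		})
	}
	return ops
}

func main() {
	volumes := []podVolume{
		{Name: "config"}, // configMap volume, no claim
		{Name: "data", ClaimName: "listmonk-data"},
	}
	ops := claimNamePatches(volumes, map[string]string{"listmonk-data": "listmonk-data-green"})
	out, err := json.MarshalIndent(ops, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```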
### 3. Longhorn Backup Cleanup
**Problem**: `cleanupOldBackups` in `longhorn_native.go` is a no-op. Old Longhorn backups and engine-level snapshots accumulate.
**Solution**: Implement retention-based cleanup that:
- Keeps the N most recent backups per volume (configurable via retention policy)
- Deletes old Longhorn Backup custom resources
- Cleans up engine-level snapshots via Longhorn API (`snapshotDelete` + `snapshotPurge`)
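
The retention step can be sketched as a pure function that decides which backups to delete, leaving the Longhorn calls to the caller. `backupRecord` and `expiredBackups` are hypothetical; the sketch assumes timestamps in the fixed-width UTC form used by the RecoveryPlan (`20260308T231537Z`), which sort correctly as plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// backupRecord is a hypothetical, minimal view of one volume backup.
type backupRecord struct {
	Name      string
	Volume    string
	Timestamp string // e.g. "20260308T231537Z"; string order is time order
}

// expiredBackups returns the backups to delete so that only the `keep`
// most recent backups of each volume remain.
func expiredBackups(all []backupRecord, keep int) []backupRecord {
	if keep < 0 {
		keep = 0
	}
	byVolume := map[string][]backupRecord{}
	for _, b := range all {
		byVolume[b.Volume] = append(byVolume[b.Volume], b)
	}
	var expired []backupRecord
	for _, group := range byVolume {
		// Newest first, so everything past index `keep` is expired.
		sort.Slice(group, func(i, j int) bool {
			return group[i].Timestamp > group[j].Timestamp
		})
		if len(group) > keep {
			expired = append(expired, group[keep:]...)
		}
	}
	return expired
}

func main() {
	all := []backupRecord{
		{"b1", "listmonk-data", "20260301T000000Z"},
		{"b2", "listmonk-data", "20260302T000000Z"},
		{"b3", "listmonk-data", "20260303T000000Z"},
	}
	for _, b := range expiredBackups(all, 2) {
		fmt.Println("delete", b.Name)
	}
}
```

Keeping the decision separate from the deletion makes the policy easy to test without a cluster.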
### 4. Longhorn Port-Forward Cleanup
**Problem**: `getLonghornAPIEndpoint` starts a `kubectl port-forward` process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.
**Solution**: Track the port-forward process and kill it when the operation completes. Use a `defer` pattern or add cleanup to the strategy's `Cleanup` method.
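
A minimal sketch of the `defer` approach: start the helper process, hand back a stop function, and defer it for the lifetime of the operation. `startProcess` and `withLonghornAPI` are hypothetical names, and the `kubectl` arguments (namespace, service, target port) are assumptions about a default Longhorn install rather than what `getLonghornAPIEndpoint` actually runs.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startProcess launches a long-running helper and returns a stop
// function that kills it and reaps the child process.
func startProcess(name string, args ...string) (func(), error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start %s: %w", name, err)
	}
	return func() {
		_ = cmd.Process.Kill() // ignore "already finished" errors
		_ = cmd.Wait()         // reap the child so it does not linger
	}, nil
}

// withLonghornAPI runs fn while a port-forward is active and always
// tears the port-forward down afterwards, even if fn fails.
func withLonghornAPI(fn func(endpoint string) error) error {
	// Assumed arguments; adjust to the real service and namespace.
	stop, err := startProcess("kubectl", "port-forward",
		"-n", "longhorn-system", "svc/longhorn-frontend", "8080:80")
	if err != nil {
		return err
	}
	defer stop()
	// A real implementation would wait for the port to accept connections.
	return fn("http://localhost:8080")
}

func main() {
	err := withLonghornAPI(func(endpoint string) error {
		fmt.Println("would call Longhorn API at", endpoint)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```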
### 5. Switch and Cleanup Phases
**Status**: Implemented but not thoroughly tested end-to-end.
**Remaining work**:
- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
- Test cleanup of the old active namespace after switch
- Handle edge cases: switch failure mid-way, partial cleanup
### 6. Scheduled Backups
**Status**: Data model exists (`BackupConfiguration` with schedules and retention), but no scheduler is implemented.
**Remaining work**:
- Implement a cron-based scheduler that reads `BackupConfiguration` from instance config
- Trigger backups on schedule
- Apply retention policies to clean up old backups
- Web UI for configuring backup schedules
### 7. Backup Verification
**Status**: `Verify` method exists on the Strategy interface but implementations are minimal.
**Remaining work**:
- Implement meaningful verification (e.g., check pg_dump integrity, verify Longhorn backup exists and is restorable)
- Optional scheduled verification per `VerificationConfig`
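
As a cheap first integrity check for the postgres strategy, a verifier could confirm that the dump file begins with the `PGDMP` signature that `pg_dump` writes at the start of a custom-format archive. This sketch assumes the dump is produced with the custom format (`-Fc`), which the document does not state; `hasPgCustomHeader` and `verifyDumpFile` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// pgCustomMagic is the signature at the start of a pg_dump
// custom-format archive.
var pgCustomMagic = []byte("PGDMP")

// hasPgCustomHeader reports whether header starts with the signature.
func hasPgCustomHeader(header []byte) bool {
	return bytes.HasPrefix(header, pgCustomMagic)
}

// verifyDumpFile checks that the file exists, is long enough to hold
// the signature, and starts with it. It does not validate the contents.
func verifyDumpFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	header := make([]byte, len(pgCustomMagic))
	if _, err := io.ReadFull(f, header); err != nil {
		return fmt.Errorf("read header of %s: %w", path, err)
	}
	if !hasPgCustomHeader(header) {
		return fmt.Errorf("%s is not a custom-format pg_dump archive", path)
	}
	return nil
}

func main() {
	fmt.Println(hasPgCustomHeader([]byte("PGDMP\x01\x0e"))) // true
	fmt.Println(hasPgCustomHeader([]byte("-- PostgreSQL")))  // false (plain SQL dump)
	fmt.Println(verifyDumpFile("no-such-file.dump") != nil)  // true
}
```

A header check only catches empty or wrong-format files. A stronger check is to run `pg_restore --list` against the archive, which reads its table of contents without restoring anything.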
### 8. Web UI Integration
**Status**: API endpoints exist but no web UI for backup/restore.
**Remaining work**:
- Backup management page (list backups, trigger backup, view recovery plan)
- Restore workflow (select backup, monitor restore progress, confirm switch)
- Backup schedule configuration
- Backup status in app detail view
### 9. S3/Azure Backup Destinations
**Status**: Types defined (`S3Config`, `AzureConfig`) but only NFS and local filesystem destinations are implemented.
**Remaining work**:
- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
- Implement Azure Blob Storage destination
- Test with real cloud storage providers