Future work.
docs/future/backup-restore.md
# Wild Cloud Backup & Restore System

**Status**: Core system implemented and tested. Deterministic patching enhancement planned.

---

## What's Been Built

Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores it to a colored standby environment for validation before switching traffic.

### Architecture

```
Backup:  [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch:  [Standby App] becomes active, old active cleaned up
```

### Strategy Pattern

Each app's backup is composed of one or more strategies that run independently:

| Strategy | What it backs up | How it restores |
|----------|-----------------|-----------------|
| `config` | `config.yaml`, `secrets.yaml`, app manifests | Copies files to standby app directory |
| `postgres` | `pg_dump` of the app's database | Creates `{dbName}_{color}` database, `pg_restore` into it |
| `mysql` | `mysqldump` of the app's database | Creates `{dbName}_{color}` database, restores into it |
| `longhorn-native` | Longhorn volume snapshots via API | Creates `{pvcName}-{color}` volumes from backup |

Strategies are selected automatically based on the app's manifest dependencies (e.g., an app that requires `pg` triggers the `postgres` strategy).
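
As a rough illustration of the selection rule and the naming patterns in the table, the sketch below maps manifest dependencies to strategy names and derives colored resource names. The function names, the `mysql` dependency key, the `hasVolumes` flag, and the assumption that `config` always runs are hypothetical; only the `pg` to `postgres` mapping and the `{dbName}_{color}` / `{pvcName}-{color}` patterns come from this document.

```go
package main

import "fmt"

// strategiesFor returns strategy names for an app given the dependency names
// in its manifest. Assumptions: "config" always runs, "mysql" is the MySQL
// dependency key, and volumes are signalled by a simple flag.
func strategiesFor(requires []string, hasVolumes bool) []string {
	out := []string{"config"}
	for _, dep := range requires {
		switch dep {
		case "pg":
			out = append(out, "postgres")
		case "mysql":
			out = append(out, "mysql")
		}
	}
	if hasVolumes {
		out = append(out, "longhorn-native")
	}
	return out
}

// coloredDB applies the {dbName}_{color} pattern.
func coloredDB(dbName, color string) string { return dbName + "_" + color }

// coloredPVC applies the {pvcName}-{color} pattern.
func coloredPVC(pvcName, color string) string { return pvcName + "-" + color }

func main() {
	fmt.Println(strategiesFor([]string{"pg"}, true))
	fmt.Println(coloredDB("listmonk", "green"))
	fmt.Println(coloredPVC("listmonk-data", "green"))
}
```

Note the different separators: databases use an underscore, while PVC and namespace names use a hyphen.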

### RecoveryPlan

A YAML coordination record tracks the entire lifecycle:

```yaml
app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
```

### Standby Deployment

When restoring, the system:

1. Copies app manifests to a standby directory
2. Patches `kustomization.yaml` namespace to `{app}-{color}`
3. Patches `namespace.yaml` to match
4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
5. Runs `kubectl apply -k` on the standby directory
6. Creates Kubernetes secrets from `secrets.yaml` (source of truth) in the standby namespace
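
Step 4 has to handle two shapes of value: an env var that is exactly the database name, and a connection URL that embeds it. The sketch below shows one way to rewrite both without touching a username or host that happens to match the database name. It is not the code in `backup.go`; the function name, the hostname, and the credentials are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteDBValue returns value with the database name swapped for its colored
// counterpart. Exact matches are replaced directly; for connection URLs only
// the path component (the database name) is rewritten.
func rewriteDBValue(value, dbName, coloredName string) string {
	if value == dbName {
		return coloredName
	}
	u, err := url.Parse(value)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return value // not a connection URL; leave untouched
	}
	if strings.TrimPrefix(u.Path, "/") == dbName {
		u.Path = "/" + coloredName
		return u.String()
	}
	return value
}

func main() {
	// Hypothetical values; note the username equals the database name.
	fmt.Println(rewriteDBValue("listmonk", "listmonk", "listmonk_green"))
	fmt.Println(rewriteDBValue(
		"postgres://listmonk:pw@postgres.example.svc:5432/listmonk?sslmode=disable",
		"listmonk", "listmonk_green"))
}
```

Parsing the URL instead of doing a plain string replace keeps the `listmonk` username intact while the database path becomes `listmonk_green`.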

### Key Files

```
api/internal/backup/
├── backup.go              # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go         # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│   ├── config.go          # Config file backup/restore
│   ├── postgres.go        # PostgreSQL dump/restore
│   ├── mysql.go           # MySQL dump/restore
│   └── longhorn_native.go # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go             # NFS backup storage
    └── local.go           # Local filesystem backup storage
```

### API Endpoints

```
POST /api/v1/instances/{name}/apps/{app}/backup       # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore      # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch       # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup      # Clean up old active
GET  /api/v1/instances/{name}/apps/{app}/backups      # List backups
GET  /api/v1/instances/{name}/apps/{app}/backup/{ts}  # Get specific backup plan
```
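
For scripting against these routes, a small helper that builds the action URLs may be convenient. This is only a sketch: the base address is a placeholder, the helper name is hypothetical, and request bodies and response formats are not covered here because this document does not specify them.

```go
package main

import (
	"fmt"
	"net/url"
)

// backupActionURL builds the URL for one of the POST actions listed above
// ("backup", "restore", "switch", "cleanup") for a given instance and app.
func backupActionURL(baseURL, instance, app, action string) string {
	return fmt.Sprintf("%s/api/v1/instances/%s/apps/%s/%s",
		baseURL, url.PathEscape(instance), url.PathEscape(app), action)
}

func main() {
	base := "http://localhost:8000" // placeholder; use your API's address
	for _, action := range []string{"backup", "restore", "switch", "cleanup"} {
		fmt.Println("POST", backupActionURL(base, "test-cloud", "listmonk", action))
	}
}
```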

### What's Been Tested

Full end-to-end backup/restore cycle for listmonk on `test-cloud`:
- Config strategy: `config.yaml`, `secrets.yaml`, and app manifests backed up and restored
- Postgres strategy: database dumped, colored database created, data restored
- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
- Secrets deployed from `secrets.yaml` (source of truth) to standby namespace
- Database name patching in env vars (both exact match and connection string URLs)

### Bugs Found and Fixed During Testing

1. **Secret deployment ordering**: Secrets were created before `kubectl apply -k`, which is the step that creates the namespace. Fixed by moving secret creation to after the apply.

2. **Secret source of truth**: The restore was copying Kubernetes secrets between namespaces. Changed to create secrets from `secrets.yaml` (matching the normal deploy flow).

3. **Longhorn stale snapshots**: Engine-level snapshots persisted after the corresponding Kubernetes custom resources were deleted. Cleaning them up required the Longhorn API actions `snapshotDelete` + `snapshotPurge` before new backups could succeed.

---

## What Remains To Be Done

### 1. Deterministic Blue-Green Patching (Next Priority)

**Problem**: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like `DATABASE`, `DB_NAME`, etc., and excludes names containing `USER`, `HOST`, etc. This works for common cases but is fragile:

- Env var names that don't follow conventions get missed
- Apps with multiple databases need multiple distinct mappings
- PVC name patching for volumes is not yet wired into the standby deployment
- Secret values containing database URLs need rewriting but currently aren't
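
To make the fragility concrete, here is a simplified stand-in for a name-based heuristic of this kind. It is not the actual matching code: the pattern lists contain only the examples named above (the real lists are longer), and `PGDB` is an invented env var name used to show a miss.

```go
package main

import (
	"fmt"
	"strings"
)

// Only the patterns named in this document; the real lists are assumed to be longer.
var includePatterns = []string{"DATABASE", "DB_NAME"}
var excludePatterns = []string{"USER", "HOST"}

// looksLikeDBNameVar guesses from the env var name alone whether its value
// holds a database name. Exclusions are checked before inclusions.
func looksLikeDBNameVar(name string) bool {
	upper := strings.ToUpper(name)
	for _, p := range excludePatterns {
		if strings.Contains(upper, p) {
			return false
		}
	}
	for _, p := range includePatterns {
		if strings.Contains(upper, p) {
			return true
		}
	}
	return false
}

func main() {
	// PGDB is a hypothetical unconventional name that the heuristic misses.
	for _, name := range []string{"DB_NAME", "DATABASE_URL", "DATABASE_USER", "PGDB"} {
		fmt.Printf("%-14s %v\n", name, looksLikeDBNameVar(name))
	}
}
```

A variable such as `PGDB` is skipped silently, so the standby app would keep pointing at the active database.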

**Proposed solution**: Add a `restore` field to app manifests that explicitly declares what needs patching:

```yaml
# In manifest.yaml
restore:
  databases:
    - configKey: dbName       # Path in defaultConfig holding the database name
      secretKeys: [dbUrl]     # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data           # PVC names to create colored copies of
```

**Implementation plan**: See `~/.claude/plans/floofy-waddling-locket.md` for detailed design.

**Changes needed**:
- Add `RestoreConfig` struct to `api/internal/apps/models.go`
- Refactor `updateDatabaseRefsFromPlan` in `backup.go` to use manifest declarations (with heuristic fallback)
- Refactor `deploySecretsToNamespace` to rewrite secret values for declared secret keys
- Add `updateVolumeRefsFromManifest` for PVC name patching in standby kustomize
- Add `restore` field to ~13 app manifests in `wild-directory/`
- Document the `restore` field in `wild-directory/ADDING-APPS.md`
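
One possible shape for the planned `RestoreConfig` struct, together with a helper that expands the declarations into explicit renames, is sketched below. Everything beyond the YAML keys shown above is an assumption: the Go field names, the `Rename` type, the `plan` function, and the flat `map[string]string` standing in for the app's config.

```go
package main

import "fmt"

// RestoreConfig mirrors the proposed `restore` manifest field (assumed shape).
type RestoreConfig struct {
	Databases []DatabaseRestore `yaml:"databases,omitempty"`
	Volumes   []string          `yaml:"volumes,omitempty"`
}

// DatabaseRestore declares where one database name lives.
type DatabaseRestore struct {
	ConfigKey  string   `yaml:"configKey"`
	SecretKeys []string `yaml:"secretKeys,omitempty"`
}

// Rename is one old-name to new-name mapping to apply in the standby.
type Rename struct{ Kind, From, To string }

// plan expands the declarations into explicit renames for a given color.
// config is a simplified flat view of the app's config values.
func plan(rc RestoreConfig, config map[string]string, color string) []Rename {
	var out []Rename
	for _, db := range rc.Databases {
		name, ok := config[db.ConfigKey]
		if !ok {
			continue // key missing: a caller could fall back to heuristics here
		}
		out = append(out, Rename{"database", name, name + "_" + color})
	}
	for _, pvc := range rc.Volumes {
		out = append(out, Rename{"volume", pvc, pvc + "-" + color})
	}
	return out
}

func main() {
	rc := RestoreConfig{
		Databases: []DatabaseRestore{{ConfigKey: "dbName", SecretKeys: []string{"dbUrl"}}},
		Volumes:   []string{"listmonk-data"},
	}
	for _, r := range plan(rc, map[string]string{"dbName": "listmonk"}, "green") {
		fmt.Printf("%s: %s -> %s\n", r.Kind, r.From, r.To)
	}
}
```

Because every rename comes from a declaration, apps with several databases simply list several `databases` entries, and nothing depends on how env vars are named.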

### 2. Volume Patching in Standby Deployment

**Problem**: The longhorn-native strategy creates colored volumes (`{pvcName}-{color}`) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.

**Solution**: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite `claimName` references in the standby `kustomization.yaml`, similar to how database env vars are patched today.
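
A JSON patch for a `claimName` could be generated along these lines. The sketch assumes the workload is a Deployment (so the volume list sits at `/spec/template/spec/volumes`) and that the caller already knows the index of the volume to rewrite; the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// claimNamePatch builds a patch that replaces the claimName of the volume at
// volumeIndex in a Deployment's pod template with the colored PVC name.
func claimNamePatch(volumeIndex int, pvcName, color string) ([]byte, error) {
	ops := []patchOp{{
		Op:    "replace",
		Path:  fmt.Sprintf("/spec/template/spec/volumes/%d/persistentVolumeClaim/claimName", volumeIndex),
		Value: pvcName + "-" + color,
	}}
	return json.Marshal(ops)
}

func main() {
	patch, err := claimNamePatch(0, "listmonk-data", "green")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

The resulting JSON can be referenced from the `patches` list in the standby `kustomization.yaml` with a target that selects the workload.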

### 3. Longhorn Backup Cleanup

**Problem**: `cleanupOldBackups` in `longhorn_native.go` is a no-op. Old Longhorn backups and engine-level snapshots accumulate.

**Solution**: Implement retention-based cleanup that:
- Keeps the N most recent backups per volume (configurable via retention policy)
- Deletes the Longhorn Backup custom resources for older backups
- Cleans up engine-level snapshots via Longhorn API (`snapshotDelete` + `snapshotPurge`)
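
The "keep the N most recent" rule can be expressed as a small pure function, sketched below under the assumption that backups are identified by timestamps in the same fixed-width UTC form as the RecoveryPlan (`20260308T231537Z`), so that lexical order matches chronological order. The function name is hypothetical, and the actual deletion calls are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// backupsToDelete returns the timestamps that fall outside the newest `keep`
// entries. The input slice is not modified.
func backupsToDelete(timestamps []string, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	sorted := append([]string(nil), timestamps...)
	// Newest first: fixed-width UTC timestamps sort chronologically as strings.
	sort.Sort(sort.Reverse(sort.StringSlice(sorted)))
	if len(sorted) <= keep {
		return nil
	}
	return sorted[keep:]
}

func main() {
	all := []string{"20260301T000000Z", "20260308T231537Z", "20260305T120000Z"}
	fmt.Println(backupsToDelete(all, 2))
}
```

Keeping the selection separate from the deletion makes the retention policy easy to unit test without a Longhorn cluster.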

### 4. Longhorn Port-Forward Cleanup

**Problem**: `getLonghornAPIEndpoint` starts a `kubectl port-forward` process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.

**Solution**: Track the port-forward process and kill it when the operation completes. Use a `defer` pattern or add cleanup to the strategy's `Cleanup` method.
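
A minimal sketch of the `defer` approach follows. It assumes the port-forward targets the `longhorn-frontend` service on port 80 in the `longhorn-system` namespace; adjust these to whatever `getLonghornAPIEndpoint` actually forwards. The helper names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// portForwardArgs builds the kubectl arguments for forwarding a local port to
// a service port.
func portForwardArgs(namespace, service string, localPort, remotePort int) []string {
	return []string{
		"port-forward", "-n", namespace,
		"svc/" + service,
		fmt.Sprintf("%d:%d", localPort, remotePort),
	}
}

// startPortForward launches kubectl and returns a stop function that kills
// the process and waits for it to exit.
func startPortForward(ctx context.Context, args []string) (func(), error) {
	ctx, cancel := context.WithCancel(ctx)
	cmd := exec.CommandContext(ctx, "kubectl", args...)
	if err := cmd.Start(); err != nil {
		cancel()
		return nil, err
	}
	stop := func() {
		cancel()       // CommandContext kills the process once ctx is cancelled
		_ = cmd.Wait() // reap the child so it does not linger
	}
	return stop, nil
}

func main() {
	args := portForwardArgs("longhorn-system", "longhorn-frontend", 8080, 80)
	fmt.Println("kubectl", args)
	// Inside a strategy method (needs kubectl and cluster access):
	//   stop, err := startPortForward(ctx, args)
	//   if err != nil { return err }
	//   defer stop()
}
```

Tying the process to a context also means a cancelled or timed-out operation tears the port-forward down without extra bookkeeping.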

### 5. Switch and Cleanup Phases

**Status**: Implemented but not thoroughly tested end-to-end.

**Remaining work**:
- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
- Test cleanup of the old active namespace after switch
- Handle edge cases: switch failure mid-way, partial cleanup

### 6. Scheduled Backups

**Status**: Data model exists (`BackupConfiguration` with schedules and retention), but no scheduler is implemented.

**Remaining work**:
- Implement a cron-based scheduler that reads `BackupConfiguration` from instance config
- Trigger backups on schedule
- Apply retention policies to clean up old backups
- Web UI for configuring backup schedules

### 7. Backup Verification

**Status**: `Verify` method exists on the Strategy interface but implementations are minimal.

**Remaining work**:
- Implement meaningful verification (e.g., check pg_dump integrity, verify Longhorn backup exists and is restorable)
- Optional scheduled verification per `VerificationConfig`
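
As one inexpensive check for the postgres strategy, `pg_restore --list` can be asked to read the archive's table of contents, which fails if the file is not a readable archive. The sketch below assumes the dump is in a non-plain format (for example `pg_dump -Fc`) and that `pg_restore` is on the `PATH`; the function names and the example path are hypothetical. It confirms the archive can be read, not that every row restores correctly.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// verifyDumpArgs builds the pg_restore arguments that list the archive's
// table of contents without restoring anything.
func verifyDumpArgs(dumpPath string) []string {
	return []string{"--list", dumpPath}
}

// verifyDump runs pg_restore --list and reports an error, including the
// tool's output, if the archive cannot be read.
func verifyDump(ctx context.Context, dumpPath string) error {
	out, err := exec.CommandContext(ctx, "pg_restore", verifyDumpArgs(dumpPath)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("pg_restore --list failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical dump path; call verifyDump(ctx, path) where pg_restore is available.
	fmt.Println("pg_restore", verifyDumpArgs("/backups/listmonk/postgres.dump"))
}
```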

### 8. Web UI Integration

**Status**: API endpoints exist but no web UI for backup/restore.

**Remaining work**:
- Backup management page (list backups, trigger backup, view recovery plan)
- Restore workflow (select backup, monitor restore progress, confirm switch)
- Backup schedule configuration
- Backup status in app detail view

### 9. S3/Azure Backup Destinations

**Status**: Types defined (`S3Config`, `AzureConfig`) but only NFS and local filesystem destinations are implemented.

**Remaining work**:
- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
- Implement Azure Blob Storage destination
- Test with real cloud storage providers