# Wild Cloud Backup & Restore System

**Status**: Core system implemented and tested. Deterministic patching enhancement planned.

---

## What's Been Built

Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores to a colored standby environment for validation before switching traffic.

### Architecture

```
Backup:  [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch:  [Standby App] becomes active, old active cleaned up
```

### Strategy Pattern

Each app's backup is composed of one or more strategies that run independently:

| Strategy | What it backs up | How it restores |
|----------|------------------|-----------------|
| `config` | `config.yaml`, `secrets.yaml`, app manifests | Copies files to standby app directory |
| `postgres` | `pg_dump` of the app's database | Creates `{dbName}_{color}` database, `pg_restore` into it |
| `mysql` | `mysqldump` of the app's database | Creates `{dbName}_{color}` database, restores into it |
| `longhorn-native` | Longhorn volume snapshots via API | Creates `{pvcName}-{color}` volumes from backup |

Strategies are selected automatically based on the app's manifest dependencies (e.g., an app that requires `pg` triggers the `postgres` strategy).

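
The selection rule above could be sketched roughly as follows. This is an illustration, not the code in `backup.go`: only the `pg` to `postgres` pairing comes from this document, while the `mysql` dependency key, the `hasVolumes` flag, and the choice to always include `config` are assumptions made for the example.

```go
package main

// strategyForDependency maps a manifest dependency to a strategy name.
// Only "pg" -> "postgres" is stated in this document; the "mysql" key
// is an assumed dependency name.
var strategyForDependency = map[string]string{
	"pg":    "postgres",
	"mysql": "mysql", // assumption
}

// selectStrategies returns the strategies to run for an app.
// hasVolumes is a hypothetical flag standing in for "the app has PVCs".
func selectStrategies(requires []string, hasVolumes bool) []string {
	// Assumption: the config strategy applies to every app.
	strategies := []string{"config"}
	seen := map[string]bool{"config": true}

	for _, dep := range requires {
		name, ok := strategyForDependency[dep]
		if !ok || seen[name] {
			continue // unknown dependency or strategy already selected
		}
		seen[name] = true
		strategies = append(strategies, name)
	}
	if hasVolumes {
		strategies = append(strategies, "longhorn-native")
	}
	return strategies
}
```

Under these assumptions, an app that requires `pg` and has a persistent volume would get `config`, `postgres`, and `longhorn-native`, which matches the listmonk example later in this document.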
### RecoveryPlan

A YAML coordination record tracks the entire lifecycle:

```yaml
app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
```

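
To show how a consumer might read this record, here is a minimal Go sketch of a subset of the plan. The field set is inferred from the YAML above and is not the actual definition in `types/types.go`; the `readyToSwitch` method and its rule (plan and every strategy must be `restored`) are assumptions for illustration.

```go
package main

// StrategyEntry is a trimmed, hypothetical view of one entry under
// "strategies" in the plan; the real type likely carries more fields.
type StrategyEntry struct {
	Name   string `yaml:"name"`
	Status string `yaml:"status"`
}

// RecoveryPlan is a trimmed, hypothetical view of the YAML record.
type RecoveryPlan struct {
	App          string          `yaml:"app"`
	Status       string          `yaml:"status"`
	StandbyColor string          `yaml:"standbyColor"`
	Strategies   []StrategyEntry `yaml:"strategies"`
}

// readyToSwitch reports whether the plan and all of its strategies
// have reached the "restored" status. Treating this as the gate for
// a switch is an assumption, not documented behavior.
func (p RecoveryPlan) readyToSwitch() bool {
	if p.Status != "restored" || len(p.Strategies) == 0 {
		return false
	}
	for _, s := range p.Strategies {
		if s.Status != "restored" {
			return false
		}
	}
	return true
}
```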
### Standby Deployment

When restoring, the system:

1. Copies app manifests to a standby directory
2. Patches `kustomization.yaml` namespace to `{app}-{color}`
3. Patches `namespace.yaml` to match
4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
5. Runs `kubectl apply -k` on the standby directory
6. Creates Kubernetes secrets from `secrets.yaml` (source of truth) in the standby namespace

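
The naming and URL rewriting in steps 2 and 4 can be illustrated with a small Go sketch. The helper names are hypothetical and the real patching happens through kustomize JSON patches. The point shown here is that rewriting only the path component of a connection URL avoids touching a username that happens to match the database name.

```go
package main

import (
	"fmt"
	"net/url"
)

// standbyNamespace builds the "{app}-{color}" namespace name.
func standbyNamespace(app, color string) string {
	return app + "-" + color
}

// coloredDBName builds the "{dbName}_{color}" database name.
func coloredDBName(dbName, color string) string {
	return dbName + "_" + color
}

// rewriteDBURL (hypothetical helper) replaces the database name in a
// connection URL. Only the URL path is changed, so user, password,
// host, and query parameters are left as they were.
func rewriteDBURL(raw, oldDB, newDB string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse connection URL: %w", err)
	}
	if u.Path != "/"+oldDB {
		return raw, nil // URL does not point at oldDB; leave it untouched
	}
	u.Path = "/" + newDB
	return u.String(), nil
}
```

For example, `rewriteDBURL("postgres://listmonk:secret@postgres:5432/listmonk?sslmode=disable", "listmonk", coloredDBName("listmonk", "green"))` would change only the trailing `/listmonk` to `/listmonk_green`.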
### Key Files

```
api/internal/backup/
├── backup.go                  # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go             # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│   ├── config.go              # Config file backup/restore
│   ├── postgres.go            # PostgreSQL dump/restore
│   ├── mysql.go               # MySQL dump/restore
│   └── longhorn_native.go     # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go                 # NFS backup storage
    └── local.go               # Local filesystem backup storage
```

### API Endpoints

```
POST /api/v1/instances/{name}/apps/{app}/backup        # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore       # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch        # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup       # Clean up old active
GET  /api/v1/instances/{name}/apps/{app}/backups       # List backups
GET  /api/v1/instances/{name}/apps/{app}/backup/{ts}   # Get specific backup plan
```

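
As a rough illustration of calling the four `POST` endpoints above from Go, the sketch below builds the request. The function name and the base URL are hypothetical, and request bodies and response shapes are not described in this document, so the request is sent without a body.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newAppActionRequest (hypothetical helper) builds a POST request for
// one of the backup lifecycle actions listed above. baseURL is an
// assumption, e.g. the address where the Wild Cloud API is served,
// and is expected without a trailing slash.
func newAppActionRequest(baseURL, instance, app, action string) (*http.Request, error) {
	switch action {
	case "backup", "restore", "switch", "cleanup":
		// supported POST actions
	default:
		return nil, fmt.Errorf("unsupported action %q", action)
	}
	endpoint := fmt.Sprintf("%s/api/v1/instances/%s/apps/%s/%s",
		baseURL, url.PathEscape(instance), url.PathEscape(app), action)
	return http.NewRequest(http.MethodPost, endpoint, nil)
}
```

The returned request can be passed to `http.DefaultClient.Do`; a real client would also want a timeout and response handling.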
### What's Been Tested

Full end-to-end backup/restore cycle for listmonk on `test-cloud`:

- Config strategy: `config.yaml`, `secrets.yaml`, app manifests backed up and restored
- Postgres strategy: database dumped, colored database created, data restored
- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
- Secrets deployed from `secrets.yaml` (source of truth) to standby namespace
- Database name patching in env vars (both exact match and connection string URLs)

### Bugs Found and Fixed During Testing

1. **Secret deployment ordering**: Secrets were created before `kubectl apply -k` had run, but that command is what creates the namespace. Fixed by moving secret creation to after the apply.

2. **Secret source of truth**: The restore originally copied Kubernetes secrets between namespaces. It now creates secrets from `secrets.yaml` (matching the normal deploy flow).

3. **Longhorn stale snapshots**: Engine-level snapshots persisted after the corresponding Kubernetes custom resources were deleted. Cleaning them up required the Longhorn API calls `snapshotDelete` + `snapshotPurge` before new backups could succeed.

---

## What Remains To Be Done

### 1. Deterministic Blue-Green Patching (Next Priority)

**Problem**: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like `DATABASE`, `DB_NAME`, etc., and excludes names containing `USER`, `HOST`, etc. This works for common cases but is fragile:

- Env var names that don't follow conventions get missed
- Apps with multiple databases need multiple distinct mappings
- PVC name patching for volumes is not yet wired into the standby deployment
- Secret values containing database URLs need rewriting but currently aren't

**Proposed solution**: Add a `restore` field to app manifests that explicitly declares what needs patching:

```yaml
# In manifest.yaml
restore:
  databases:
    - configKey: dbName        # Path in defaultConfig holding the database name
      secretKeys: [dbUrl]      # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data            # PVC names to create colored copies of
```

**Implementation plan**: See `~/.claude/plans/floofy-waddling-locket.md` for detailed design.

**Changes needed**:

- Add `RestoreConfig` struct to `api/internal/apps/models.go`
- Refactor `updateDatabaseRefsFromPlan` in `backup.go` to use manifest declarations (with heuristic fallback)
- Refactor `deploySecretsToNamespace` to rewrite secret values for declared secret keys
- Add `updateVolumeRefsFromManifest` for PVC name patching in standby kustomize
- Add `restore` field to ~13 app manifests in `wild-directory/`
- Document the `restore` field in `wild-directory/ADDING-APPS.md`

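
The contrast between the current heuristic and the proposed declarations could look roughly like the Go sketch below. The struct fields follow the `restore` YAML proposed above, but the type and function names other than `RestoreConfig` are hypothetical, and the heuristic only includes the patterns named in this section (the real list is described as longer).

```go
package main

import "strings"

// RestoreDatabase mirrors one entry under restore.databases
// (hypothetical type name).
type RestoreDatabase struct {
	ConfigKey  string   `yaml:"configKey"`
	SecretKeys []string `yaml:"secretKeys"`
}

// RestoreConfig mirrors the proposed manifest "restore" field.
type RestoreConfig struct {
	Databases []RestoreDatabase `yaml:"databases"`
	Volumes   []string          `yaml:"volumes"`
}

// looksLikeDBNameVar approximates the current heuristic using only
// the patterns named in the text. It would remain as a fallback for
// manifests without a restore section.
func looksLikeDBNameVar(envName string) bool {
	upper := strings.ToUpper(envName)
	for _, excluded := range []string{"USER", "HOST"} {
		if strings.Contains(upper, excluded) {
			return false
		}
	}
	for _, pattern := range []string{"DATABASE", "DB_NAME"} {
		if strings.Contains(upper, pattern) {
			return true
		}
	}
	return false
}

// coloredVolumes maps each declared PVC to its "{pvcName}-{color}" copy.
func (c RestoreConfig) coloredVolumes(color string) map[string]string {
	out := make(map[string]string, len(c.Volumes))
	for _, pvc := range c.Volumes {
		out[pvc] = pvc + "-" + color
	}
	return out
}
```

With this sketch, an env var named `PGDB` is missed by the heuristic even though it holds a database name, which is the kind of gap the explicit `restore` declaration is meant to close.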
### 2. Volume Patching in Standby Deployment

**Problem**: The longhorn-native strategy creates colored volumes (`{pvcName}-{color}`) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.

**Solution**: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite `claimName` references in the standby `kustomization.yaml`, similar to how database env vars are patched today.

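
A JSON patch for one `claimName` could be generated as in the sketch below. This is an assumption-laden illustration: the helper name is hypothetical, the path assumes a Deployment or StatefulSet whose pod template lists the PVC at a known index under `spec.template.spec.volumes`, and the real code would need to locate that index from the manifests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonPatchOp is a single RFC 6902 operation.
type jsonPatchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// claimNamePatch (hypothetical helper) returns a JSON patch that points
// the volume at volumeIndex to the colored PVC "{pvcName}-{color}".
func claimNamePatch(volumeIndex int, pvcName, color string) (string, error) {
	ops := []jsonPatchOp{{
		Op:    "replace",
		Path:  fmt.Sprintf("/spec/template/spec/volumes/%d/persistentVolumeClaim/claimName", volumeIndex),
		Value: pvcName + "-" + color,
	}}
	out, err := json.Marshal(ops)
	if err != nil {
		return "", fmt.Errorf("marshal claimName patch: %w", err)
	}
	return string(out), nil
}
```

The resulting string could be written into the `patches` section of the standby `kustomization.yaml` with a target that selects the workload.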
### 3. Longhorn Backup Cleanup

**Problem**: `cleanupOldBackups` in `longhorn_native.go` is a no-op. Old Longhorn backups and engine-level snapshots accumulate.

**Solution**: Implement retention-based cleanup that:

- Keeps the N most recent backups per volume (configurable via retention policy)
- Deletes old Longhorn Backup custom resources
- Cleans up engine-level snapshots via Longhorn API (`snapshotDelete` + `snapshotPurge`)

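
The "keep the N most recent per volume" rule could be sketched as below. The `volumeBackup` type and its fields are hypothetical stand-ins for whatever the Longhorn API returns; the sketch only decides which backups are expired and leaves the actual deletion calls out.

```go
package main

import "sort"

// volumeBackup is a hypothetical, minimal view of one Longhorn backup.
type volumeBackup struct {
	Volume      string
	Name        string
	CreatedUnix int64 // creation time as Unix seconds
}

// backupsToDelete returns every backup beyond the keep most recent
// ones for each volume, sorted by volume and then name so the result
// is deterministic.
func backupsToDelete(backups []volumeBackup, keep int) []volumeBackup {
	if keep < 0 {
		keep = 0
	}
	byVolume := map[string][]volumeBackup{}
	for _, b := range backups {
		byVolume[b.Volume] = append(byVolume[b.Volume], b)
	}

	var expired []volumeBackup
	for _, group := range byVolume {
		// Newest first, so everything past index keep is expired.
		sort.Slice(group, func(i, j int) bool {
			return group[i].CreatedUnix > group[j].CreatedUnix
		})
		if len(group) > keep {
			expired = append(expired, group[keep:]...)
		}
	}
	sort.Slice(expired, func(i, j int) bool {
		if expired[i].Volume != expired[j].Volume {
			return expired[i].Volume < expired[j].Volume
		}
		return expired[i].Name < expired[j].Name
	})
	return expired
}
```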
### 4. Longhorn Port-Forward Cleanup

**Problem**: `getLonghornAPIEndpoint` starts a `kubectl port-forward` process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.

**Solution**: Track the port-forward process and kill it when the operation completes. Use a `defer` pattern or add cleanup to the strategy's `Cleanup` method.

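
One way the `defer` pattern could look is sketched below. The helper name and signature are hypothetical, and the sketch does not wait for the forwarded port to become ready, which a real implementation would need to do before calling the Longhorn API.

```go
package main

import (
	"fmt"
	"os/exec"
)

// withPortForward (hypothetical helper) starts a helper process, runs
// fn while it is alive, and always kills and reaps the process before
// returning, so no orphan is left behind.
func withPortForward(bin string, args []string, fn func() error) error {
	cmd := exec.Command(bin, args...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start port-forward: %w", err)
	}
	defer func() {
		_ = cmd.Process.Kill() // stop the child even if fn failed
		_ = cmd.Wait()         // reap it so it does not linger as a zombie
	}()
	return fn()
}
```

A call might look like `withPortForward("kubectl", []string{"port-forward", "-n", "longhorn-system", "svc/longhorn-frontend", "8080:80"}, doBackup)`, where the namespace, service name, and `doBackup` are assumptions rather than values taken from the codebase.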
### 5. Switch and Cleanup Phases

**Status**: Implemented but not thoroughly tested end-to-end.

**Remaining work**:

- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
- Test cleanup of the old active namespace after switch
- Handle edge cases: switch failure mid-way, partial cleanup

### 6. Scheduled Backups

**Status**: Data model exists (`BackupConfiguration` with schedules and retention), but no scheduler is implemented.

**Remaining work**:

- Implement a cron-based scheduler that reads `BackupConfiguration` from instance config
- Trigger backups on schedule
- Apply retention policies to clean up old backups
- Web UI for configuring backup schedules

### 7. Backup Verification

**Status**: `Verify` method exists on the Strategy interface but implementations are minimal.

**Remaining work**:

- Implement meaningful verification (e.g., check `pg_dump` integrity, verify Longhorn backup exists and is restorable)
- Optional scheduled verification per `VerificationConfig`

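
As one cheap first check for the Postgres strategy, a verifier could confirm that the dump file starts with the `PGDMP` magic bytes that `pg_dump` writes for custom-format archives. This sketch assumes the strategy uses the custom format (the document only says the dump is restored with `pg_restore`), and the function name is hypothetical. It does not replace a fuller check such as listing the archive with `pg_restore --list`.

```go
package main

import "bytes"

// pgCustomDumpMagic is the header that pg_dump writes at the start of
// a custom-format archive.
var pgCustomDumpMagic = []byte("PGDMP")

// hasPgCustomDumpMagic (hypothetical helper) reports whether the first
// bytes of a dump file look like a custom-format pg_dump archive.
// Plain SQL dumps and empty or truncated files return false.
func hasPgCustomDumpMagic(header []byte) bool {
	return bytes.HasPrefix(header, pgCustomDumpMagic)
}
```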
### 8. Web UI Integration

**Status**: API endpoints exist but no web UI for backup/restore.

**Remaining work**:

- Backup management page (list backups, trigger backup, view recovery plan)
- Restore workflow (select backup, monitor restore progress, confirm switch)
- Backup schedule configuration
- Backup status in app detail view

### 9. S3/Azure Backup Destinations

**Status**: Types defined (`S3Config`, `AzureConfig`) but only NFS and local filesystem destinations are implemented.

**Remaining work**:

- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
- Implement Azure Blob Storage destination
- Test with real cloud storage providers