# Wild Cloud Backup & Restore System

**Status**: Core system implemented and tested. Deterministic patching enhancement planned.

---

## What's Been Built

Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores it to a colored standby environment for validation before switching traffic.

### Architecture

```
Backup:   [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore:  [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch:   [Standby App] becomes active, old active cleaned up
```

### Strategy Pattern

Each app's backup is composed of one or more strategies that run independently:

| Strategy | What it backs up | How it restores |
|----------|-----------------|-----------------|
| `config` | `config.yaml`, `secrets.yaml`, app manifests | Copies files to standby app directory |
| `postgres` | `pg_dump` of the app's database | Creates `{dbName}_{color}` database, `pg_restore` into it |
| `mysql` | `mysqldump` of the app's database | Creates `{dbName}_{color}` database, restores into it |
| `longhorn-native` | Longhorn volume snapshots via API | Creates `{pvcName}-{color}` volumes from backup |

Strategies are selected automatically based on the app's manifest dependencies (e.g., an app that requires `pg` triggers the `postgres` strategy).

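
To make the selection rule and the colored naming conventions concrete, here is a minimal Go sketch. It is not the code in `backup.go`: `selectStrategies`, `strategyForDependency`, `coloredDBName`, and `coloredVolumeName` are hypothetical names. Apart from the `pg` to `postgres` pairing stated above, the rules it encodes (the `mysql` dependency key, `config` always running, `longhorn-native` being tied to the presence of volumes) are assumptions.

```go
package main

import "fmt"

// strategyForDependency maps a manifest dependency to a backup strategy.
// Only the "pg" -> "postgres" pairing comes from the text above; the
// "mysql" dependency key is an assumption.
var strategyForDependency = map[string]string{
	"pg":    "postgres",
	"mysql": "mysql", // assumed dependency name
}

// selectStrategies returns the strategy names for an app, given the
// dependencies its manifest requires and whether it owns any volumes.
func selectStrategies(requires []string, hasVolumes bool) []string {
	// Assumption: the config strategy runs for every app.
	selected := []string{"config"}
	for _, dep := range requires {
		if strategy, ok := strategyForDependency[dep]; ok {
			selected = append(selected, strategy)
		}
	}
	// Assumption: apps with persistent volumes get the Longhorn strategy.
	if hasVolumes {
		selected = append(selected, "longhorn-native")
	}
	return selected
}

// coloredDBName follows the {dbName}_{color} convention from the table.
func coloredDBName(dbName, color string) string {
	return dbName + "_" + color
}

// coloredVolumeName follows the {pvcName}-{color} convention from the table.
func coloredVolumeName(pvcName, color string) string {
	return pvcName + "-" + color
}

func main() {
	// "redis" has no mapping here, so it contributes no strategy.
	fmt.Println(selectStrategies([]string{"pg", "redis"}, true))
	fmt.Println(coloredDBName("listmonk", "green"))
	fmt.Println(coloredVolumeName("listmonk-data", "green"))
}
```

Keeping the mapping in a table rather than a chain of conditionals means a new database strategy only needs one new entry.
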
### RecoveryPlan

A YAML coordination record tracks the entire lifecycle:

```yaml
app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
```

### Standby Deployment

When restoring, the system:

1. Copies app manifests to a standby directory
2. Patches `kustomization.yaml` namespace to `{app}-{color}`
3. Patches `namespace.yaml` to match
4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
5. Runs `kubectl apply -k` on the standby directory
6. Creates Kubernetes secrets from `secrets.yaml` (source of truth) in the standby namespace

### Key Files

```
api/internal/backup/
├── backup.go              # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go         # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│   ├── config.go          # Config file backup/restore
│   ├── postgres.go        # PostgreSQL dump/restore
│   ├── mysql.go           # MySQL dump/restore
│   └── longhorn_native.go # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go             # NFS backup storage
    └── local.go           # Local filesystem backup storage
```

### API Endpoints

```
POST /api/v1/instances/{name}/apps/{app}/backup       # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore      # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch       # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup      # Clean up old active
GET  /api/v1/instances/{name}/apps/{app}/backups      # List backups
GET  /api/v1/instances/{name}/apps/{app}/backup/{ts}  # Get specific backup plan
```

### What's Been Tested

Full end-to-end backup/restore cycle for listmonk on `test-cloud`:

- Config strategy: `config.yaml`, `secrets.yaml`, and app manifests backed up and restored
- Postgres strategy: database dumped, colored database created, data restored
- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
- Secrets deployed from `secrets.yaml` (source of truth) to standby namespace
- Database name patching in env vars (both exact match and connection string URLs)

### Bugs Found and Fixed During Testing

1. **Secret deployment ordering**: Secrets were being created before `kubectl apply -k`, which creates the namespace. Fixed by moving secret creation to after the apply.
2. **Secret source of truth**: The restore was copying Kubernetes secrets between namespaces. Changed to create secrets from `secrets.yaml` (matching the normal deploy flow).
3. **Longhorn stale snapshots**: Engine-level snapshots persisted after the corresponding Kubernetes custom resources were deleted. Cleaning them up required the Longhorn API calls `snapshotDelete` + `snapshotPurge` before new backups could succeed.

---

## What Remains To Be Done

### 1. Deterministic Blue-Green Patching (Next Priority)

**Problem**: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like `DATABASE`, `DB_NAME`, etc., and excludes names containing `USER`, `HOST`, etc. This works for common cases but is fragile:

- Env var names that don't follow conventions get missed
- Apps with multiple databases need multiple distinct mappings
- PVC name patching for volumes is not yet wired into the standby deployment
- Secret values containing database URLs need rewriting but currently aren't rewritten

**Proposed solution**: Add a `restore` field to app manifests that explicitly declares what needs patching:

```yaml
# In manifest.yaml
restore:
  databases:
    - configKey: dbName        # Path in defaultConfig holding the database name
      secretKeys: [dbUrl]      # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data            # PVC names to create colored copies of
```

**Implementation plan**: See `~/.claude/plans/floofy-waddling-locket.md` for detailed design.

**Changes needed**:

- Add `RestoreConfig` struct to `api/internal/apps/models.go`
- Refactor `updateDatabaseRefsFromPlan` in `backup.go` to use manifest declarations (with heuristic fallback)
- Refactor `deploySecretsToNamespace` to rewrite secret values for declared secret keys
- Add `updateVolumeRefsFromManifest` for PVC name patching in standby kustomize
- Add `restore` field to ~13 app manifests in `wild-directory/`
- Document the `restore` field in `wild-directory/ADDING-APPS.md`

### 2. Volume Patching in Standby Deployment

**Problem**: The longhorn-native strategy creates colored volumes (`{pvcName}-{color}`) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.

**Solution**: This is part of the deterministic patching work above. Use kustomize JSON patches to rewrite `claimName` references in the standby `kustomization.yaml`, similar to how database env vars are patched today.

### 3. Longhorn Backup Cleanup

**Problem**: `cleanupOldBackups` in `longhorn_native.go` is a no-op. Old Longhorn backups and engine-level snapshots accumulate.

**Solution**: Implement retention-based cleanup that:

- Keeps the N most recent backups per volume (configurable via retention policy)
- Deletes old Longhorn Backup custom resources
- Cleans up engine-level snapshots via Longhorn API (`snapshotDelete` + `snapshotPurge`)

### 4. Longhorn Port-Forward Cleanup

**Problem**: `getLonghornAPIEndpoint` starts a `kubectl port-forward` process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.

**Solution**: Track the port-forward process and kill it when the operation completes. Use a `defer` pattern or add cleanup to the strategy's `Cleanup` method.

### 5. Switch and Cleanup Phases

**Status**: Implemented but not thoroughly tested end-to-end.

**Remaining work**:

- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
- Test cleanup of the old active namespace after switch
- Handle edge cases: switch failure midway, partial cleanup

### 6. Scheduled Backups

**Status**: Data model exists (`BackupConfiguration` with schedules and retention), but no scheduler is implemented.

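
The retention half of that data model could be applied along these lines. This is a minimal sketch, assuming a keep-the-N-most-recent policy and backup IDs in the timestamp format shown in the RecoveryPlan example (`20260308T231537Z`). `expiredBackups` and `timestampLayout` are hypothetical names, not part of the existing code.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// timestampLayout is the Go time layout matching RecoveryPlan timestamps
// such as "20260308T231537Z".
const timestampLayout = "20060102T150405Z"

// expiredBackups returns the backup timestamps that fall outside the
// "keep the N most recent" window, ordered newest to oldest.
// It fails on any timestamp that does not match timestampLayout.
func expiredBackups(timestamps []string, keep int) ([]string, error) {
	if keep < 0 {
		keep = 0
	}

	type entry struct {
		raw string
		at  time.Time
	}
	entries := make([]entry, 0, len(timestamps))
	for _, ts := range timestamps {
		at, err := time.Parse(timestampLayout, ts)
		if err != nil {
			return nil, fmt.Errorf("parse backup timestamp %q: %w", ts, err)
		}
		entries = append(entries, entry{raw: ts, at: at})
	}

	// Sort newest first so the first `keep` entries are the ones to retain.
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].at.After(entries[j].at)
	})

	if len(entries) <= keep {
		return nil, nil
	}
	expired := make([]string, 0, len(entries)-keep)
	for _, e := range entries[keep:] {
		expired = append(expired, e.raw)
	}
	return expired, nil
}

func main() {
	backups := []string{
		"20260308T231537Z",
		"20260301T000000Z",
		"20260310T120000Z",
	}
	expired, err := expiredBackups(backups, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would delete:", expired)
}
```

Returning the list of expired backups instead of deleting them directly keeps the policy testable and lets each strategy decide how to remove its own artifacts.
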

**Remaining work**:

- Implement a cron-based scheduler that reads `BackupConfiguration` from instance config
- Trigger backups on schedule
- Apply retention policies to clean up old backups
- Web UI for configuring backup schedules

### 7. Backup Verification

**Status**: `Verify` method exists on the Strategy interface but implementations are minimal.

**Remaining work**:

- Implement meaningful verification (e.g., check `pg_dump` integrity, verify Longhorn backup exists and is restorable)
- Optional scheduled verification per `VerificationConfig`

### 8. Web UI Integration

**Status**: API endpoints exist but no web UI for backup/restore.

**Remaining work**:

- Backup management page (list backups, trigger backup, view recovery plan)
- Restore workflow (select backup, monitor restore progress, confirm switch)
- Backup schedule configuration
- Backup status in app detail view

### 9. S3/Azure Backup Destinations

**Status**: Types defined (`S3Config`, `AzureConfig`) but only NFS and local filesystem destinations are implemented.

**Remaining work**:

- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
- Implement Azure Blob Storage destination
- Test with real cloud storage providers
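
For orientation, the contract that a new S3 or Azure destination would need to satisfy might look roughly like the sketch below. The `Destination` interface, its `Put`/`Get` methods, `LocalDestination`, and `roundTripDemo` are all hypothetical; the actual interface in `api/internal/backup/destinations/` is not shown in this document and may differ. A local-filesystem implementation stands in for the cloud ones so the example runs without credentials.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// Destination is a hypothetical storage contract for backup artifacts.
// Keys are slash-separated, e.g. "listmonk/20260308T231537Z/plan.yaml".
type Destination interface {
	Put(key string, r io.Reader) error
	Get(key string) (io.ReadCloser, error)
}

// LocalDestination stores each key as a file under Root.
type LocalDestination struct {
	Root string
}

// path resolves a key to a file path and rejects keys that would
// escape Root (for example "../outside").
func (d LocalDestination) path(key string) (string, error) {
	p := filepath.Join(d.Root, filepath.FromSlash(key))
	rel, err := filepath.Rel(d.Root, p)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("key %q escapes destination root", key)
	}
	return p, nil
}

// Put writes the reader's content to the file for key, creating parent
// directories as needed.
func (d LocalDestination) Put(key string, r io.Reader) error {
	p, err := d.path(key)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	f, err := os.Create(p)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// Get opens the file for key; the caller must close it.
func (d LocalDestination) Get(key string) (io.ReadCloser, error) {
	p, err := d.path(key)
	if err != nil {
		return nil, err
	}
	return os.Open(p)
}

// roundTripDemo stores a small artifact in a temporary directory and
// reads it back through the Destination interface.
func roundTripDemo() (string, error) {
	root, err := os.MkdirTemp("", "backup-dest-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	var dest Destination = LocalDestination{Root: root}
	key := "listmonk/20260308T231537Z/plan.yaml"
	if err := dest.Put(key, strings.NewReader("status: restored\n")); err != nil {
		return "", err
	}
	rc, err := dest.Get(key)
	if err != nil {
		return "", err
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	return string(data), err
}

func main() {
	content, err := roundTripDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(content)
}
```

Working in terms of `io.Reader` and `io.ReadCloser` lets large dumps stream to object storage without being buffered in memory, which is one reason a stream-based contract may suit S3-style backends.
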