Wild Cloud Backup & Restore System
Status: Core system implemented and tested. Deterministic patching enhancement planned.
What's Been Built
Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores to a colored standby environment for validation before switching traffic.
Architecture
Backup: [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch: [Standby App] becomes active, old active cleaned up
Strategy Pattern
Each app's backup is composed of one or more strategies that run independently:
| Strategy | What it backs up | How it restores |
|---|---|---|
| config | config.yaml, secrets.yaml, app manifests | Copies files to standby app directory |
| postgres | pg_dump of the app's database | Creates {dbName}_{color} database, pg_restore into it |
| mysql | mysqldump of the app's database | Creates {dbName}_{color} database, restores into it |
| longhorn-native | Longhorn volume snapshots via API | Creates {pvcName}-{color} volumes from backup |
Strategies are selected automatically based on the app's manifest dependencies (e.g., an app whose manifest requires pg gets the postgres strategy).
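The selection rule can be sketched roughly as below. This is an illustration only, not the code in backup.go: the function name, the accepted dependency spellings ("pg", "postgres", "mysql"), and the hasVolumes flag are all assumptions.

```go
package main

import "fmt"

// selectStrategies maps an app's manifest dependencies to strategy names.
// Hypothetical helper: the dependency spellings and the hasVolumes flag
// are assumptions made for this sketch.
func selectStrategies(requires []string, hasVolumes bool) []string {
	// Assumption: the config strategy applies to every app.
	strategies := []string{"config"}
	for _, dep := range requires {
		switch dep {
		case "pg", "postgres":
			strategies = append(strategies, "postgres")
		case "mysql":
			strategies = append(strategies, "mysql")
		}
	}
	if hasVolumes {
		strategies = append(strategies, "longhorn-native")
	}
	return strategies
}

func main() {
	fmt.Println(selectStrategies([]string{"pg"}, true))
}
```

Because each strategy runs independently, the order of the returned slice should not matter for correctness; it is kept stable here only to make the result easy to inspect.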
RecoveryPlan
A YAML coordination record tracks the entire lifecycle:
app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
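The colored names in the plan follow simple conventions: {app}-{color} for the namespace, {dbName}_{color} for the database, and {pvcName}-{color} for volumes. A minimal sketch of those conventions follows; the helper names are hypothetical, and the assumption that only blue and green exist is inferred from the example plan.

```go
package main

import "fmt"

// standbyColor returns the color not currently active.
// Assumption: exactly two colors, "blue" and "green".
func standbyColor(active string) string {
	if active == "blue" {
		return "green"
	}
	return "blue"
}

// The helpers below encode the naming conventions shown in the plan.
func coloredNamespace(app, color string) string   { return app + "-" + color }
func coloredDatabase(dbName, color string) string { return dbName + "_" + color }
func coloredVolume(pvcName, color string) string  { return pvcName + "-" + color }

func main() {
	color := standbyColor("blue")
	fmt.Println(coloredNamespace("listmonk", color))
	fmt.Println(coloredDatabase("listmonk", color))
	fmt.Println(coloredVolume("listmonk-data", color))
}
```

The database name uses an underscore rather than a hyphen, presumably because hyphens in PostgreSQL and MySQL identifiers require quoting.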
Standby Deployment
When restoring, the system:
- Copies app manifests to a standby directory
- Patches kustomization.yaml namespace to {app}-{color}
- Patches namespace.yaml to match
- Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
- Runs kubectl apply -k on the standby directory
- Creates Kubernetes secrets from secrets.yaml (source of truth) in the standby namespace
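The database reference patching has to handle two shapes of value: an env var that is exactly the database name, and a connection URL whose path is the database name. A rough sketch of that rewrite is shown here; rewriteDBRef is a hypothetical name and the example URL is made up, so treat it as an illustration rather than the logic in backup.go.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteDBRef replaces oldDB with newDB when value is either exactly the
// database name or a URL whose path is the database name. Any other value
// is returned unchanged. Hypothetical helper for illustration.
func rewriteDBRef(value, oldDB, newDB string) string {
	if value == oldDB {
		return newDB
	}
	u, err := url.Parse(value)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return value // not a connection URL
	}
	if strings.TrimPrefix(u.Path, "/") != oldDB {
		return value // URL points at a different database
	}
	u.Path = "/" + newDB
	return u.String()
}

func main() {
	fmt.Println(rewriteDBRef("listmonk", "listmonk", "listmonk_green"))
	fmt.Println(rewriteDBRef(
		"postgres://listmonk:secret@db.example:5432/listmonk?sslmode=disable",
		"listmonk", "listmonk_green"))
}
```

Parsing the URL instead of doing a plain string replace leaves the username untouched even when it matches the database name, as it does in the second call. One caveat: re-serializing with u.String() may normalize percent-encoding in the credentials.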
Key Files
api/internal/backup/
├── backup.go # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│ ├── config.go # Config file backup/restore
│ ├── postgres.go # PostgreSQL dump/restore
│ ├── mysql.go # MySQL dump/restore
│ └── longhorn_native.go # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go # NFS backup storage
    └── local.go # Local filesystem backup storage
API Endpoints
POST /api/v1/instances/{name}/apps/{app}/backup # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup # Clean up old active
GET /api/v1/instances/{name}/apps/{app}/backups # List backups
GET /api/v1/instances/{name}/apps/{app}/backup/{ts} # Get specific backup plan
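As a usage illustration, a client could call the POST endpoints along these lines. The base URL (host and port) is hypothetical, and sending an empty request body is an assumption; the document does not specify request or response payloads.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// actionURL builds the URL for one of the POST actions listed above
// (backup, restore, switch, cleanup). The base URL is an assumption.
func actionURL(base, instance, app, action string) string {
	return fmt.Sprintf("%s/api/v1/instances/%s/apps/%s/%s",
		strings.TrimRight(base, "/"),
		url.PathEscape(instance), url.PathEscape(app), action)
}

// postAction sends the POST with no body (an assumption) and reports
// any non-2xx status as an error.
func postAction(client *http.Client, base, instance, app, action string) error {
	resp, err := client.Post(actionURL(base, instance, app, action), "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("%s failed: %s", action, resp.Status)
	}
	return nil
}

func main() {
	// Only prints the URL; no request is sent in this sketch.
	fmt.Println(actionURL("http://localhost:5055", "test-cloud", "listmonk", "backup"))
}
```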
What's Been Tested
Full end-to-end backup/restore cycle for listmonk on test-cloud:
- Config strategy: config.yaml, secrets.yaml, app manifests backed up and restored
- Postgres strategy: database dumped, colored database created, data restored
- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
- Secrets deployed from secrets.yaml (source of truth) to standby namespace
- Database name patching in env vars (both exact match and connection string URLs)
Bugs Found and Fixed During Testing
- Secret deployment ordering: Secrets were being created before kubectl apply -k, which creates the namespace. Fixed by moving secret creation to after the apply.
- Secret source of truth: Was copying Kubernetes secrets between namespaces. Changed to create secrets from secrets.yaml (matching the normal deploy flow).
- Longhorn stale snapshots: Engine-level snapshots persisted after deleting Kubernetes CRDs. Required Longhorn API snapshotDelete + snapshotPurge to clean up before new backups could succeed.
What Remains To Be Done
1. Deterministic Blue-Green Patching (Next Priority)
Problem: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like DATABASE, DB_NAME, etc., and excludes names containing USER, HOST, etc. This works for common cases but is fragile:
- Env var names that don't follow conventions get missed
- Apps with multiple databases need multiple distinct mappings
- PVC name patching for volumes is not yet wired into the standby deployment
- Secret values containing database URLs need rewriting but currently aren't
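To make the fragility concrete, the name-based heuristic behaves roughly like the sketch below. The include and exclude lists contain only the patterns named above; the real lists are longer ("etc."), and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeDBNameVar approximates the heuristic: an env var is a candidate
// for database-name patching if its name matches an include pattern and
// no exclude pattern. The pattern lists are abbreviated assumptions.
func looksLikeDBNameVar(name string) bool {
	upper := strings.ToUpper(name)
	for _, exclude := range []string{"USER", "HOST"} {
		if strings.Contains(upper, exclude) {
			return false
		}
	}
	for _, include := range []string{"DATABASE", "DB_NAME"} {
		if strings.Contains(upper, include) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"DATABASE_URL", "DB_NAME", "DATABASE_USER", "PGDB"} {
		fmt.Println(name, looksLikeDBNameVar(name))
	}
}
```

PGDB illustrates the first failure mode in the list: a variable that does hold the database name but does not follow the naming convention is silently skipped.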
Proposed solution: Add a restore field to app manifests that explicitly declares what needs patching:
# In manifest.yaml
restore:
  databases:
    - configKey: dbName # Path in defaultConfig holding the database name
      secretKeys: [dbUrl] # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data # PVC names to create colored copies of
Implementation plan: See ~/.claude/plans/floofy-waddling-locket.md for detailed design.
Changes needed:
- Add RestoreConfig struct to api/internal/apps/models.go
- Refactor updateDatabaseRefsFromPlan in backup.go to use manifest declarations (with heuristic fallback)
- Refactor deploySecretsToNamespace to rewrite secret values for declared secret keys
- Add updateVolumeRefsFromManifest for PVC name patching in standby kustomize
- Add restore field to ~13 app manifests in wild-directory/
- Document the restore field in wild-directory/ADDING-APPS.md
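One possible shape for the RestoreConfig struct, derived from the proposed manifest fragment, is sketched here. The field layout, the RestoreDatabase type, and the plannedRenames helper are assumptions; in particular, configKey is treated as a flat key into a string map, whereas the actual design may resolve a nested path.

```go
package main

import "fmt"

// RestoreDatabase mirrors one entry under restore.databases.
type RestoreDatabase struct {
	ConfigKey  string   `yaml:"configKey"`
	SecretKeys []string `yaml:"secretKeys,omitempty"`
}

// RestoreConfig mirrors the proposed restore field in manifest.yaml.
type RestoreConfig struct {
	Databases []RestoreDatabase `yaml:"databases,omitempty"`
	Volumes   []string          `yaml:"volumes,omitempty"`
}

// plannedRenames resolves the declarations into explicit old -> new names.
// Assumption: defaultConfig is a flat map keyed by configKey.
func plannedRenames(rc RestoreConfig, defaultConfig map[string]string, color string) (dbs, vols map[string]string) {
	dbs, vols = map[string]string{}, map[string]string{}
	for _, d := range rc.Databases {
		if name, ok := defaultConfig[d.ConfigKey]; ok && name != "" {
			dbs[name] = name + "_" + color
		}
	}
	for _, v := range rc.Volumes {
		vols[v] = v + "-" + color
	}
	return dbs, vols
}

func main() {
	rc := RestoreConfig{
		Databases: []RestoreDatabase{{ConfigKey: "dbName", SecretKeys: []string{"dbUrl"}}},
		Volumes:   []string{"listmonk-data"},
	}
	dbs, vols := plannedRenames(rc, map[string]string{"dbName": "listmonk"}, "green")
	fmt.Println(dbs, vols)
}
```

With an explicit rename map, the patching code no longer needs to guess from env var names, and apps with several databases simply declare several entries.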
2. Volume Patching in Standby Deployment
Problem: The longhorn-native strategy creates colored volumes ({pvcName}-{color}) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.
Solution: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite claimName references in the standby kustomization.yaml, similar to how database env vars are patched today.
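A claimName rewrite could be generated as a JSON 6902 operation along the following lines. This is a sketch under assumptions: the target is a Deployment-style pod template, and the caller already knows the index of the volume within spec.template.spec.volumes. The function name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single JSON 6902 operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// claimNamePatch builds a patch pointing the volume at volumeIndex to the
// colored PVC. The JSON pointer assumes a Deployment-style pod template.
func claimNamePatch(volumeIndex int, pvcName, color string) ([]byte, error) {
	ops := []patchOp{{
		Op:    "replace",
		Path:  fmt.Sprintf("/spec/template/spec/volumes/%d/persistentVolumeClaim/claimName", volumeIndex),
		Value: pvcName + "-" + color,
	}}
	return json.Marshal(ops)
}

func main() {
	patch, err := claimNamePatch(0, "listmonk-data", "green")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

Index-based paths are brittle if the volume order changes between manifest versions, so the index should be looked up from the rendered manifest at restore time rather than hard-coded.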
3. Longhorn Backup Cleanup
Problem: cleanupOldBackups in longhorn_native.go is a no-op. Old Longhorn backups and engine-level snapshots accumulate.
Solution: Implement retention-based cleanup that:
- Keeps the N most recent backups per volume (configurable via retention policy)
- Deletes old Longhorn Backup CRDs
- Cleans up engine-level snapshots via Longhorn API (snapshotDelete + snapshotPurge)
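The "keep the N most recent" selection can be sketched as below. It assumes backups are identified by timestamps in the 20260308T231537Z format used by the RecoveryPlan, which sort chronologically as plain strings; the function name is hypothetical and the actual deletion calls are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// backupsToDelete returns the timestamps that fall outside the retention
// window, newest first. Assumes UTC timestamps in a fixed-width format
// such as 20260308T231537Z, so string order equals time order.
func backupsToDelete(timestamps []string, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	sorted := append([]string(nil), timestamps...) // do not mutate the input
	sort.Sort(sort.Reverse(sort.StringSlice(sorted)))
	if len(sorted) <= keep {
		return nil
	}
	return sorted[keep:]
}

func main() {
	all := []string{"20260306T000000Z", "20260308T231537Z", "20260307T120000Z"}
	fmt.Println(backupsToDelete(all, 2))
}
```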
4. Longhorn Port-Forward Cleanup
Problem: getLonghornAPIEndpoint starts a kubectl port-forward process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.
Solution: Track the port-forward process and kill it when the operation completes. Use a defer pattern or add cleanup to the strategy's Cleanup method.
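The defer pattern might look like the sketch below. The type and function names are hypothetical, and sleep stands in for the real kubectl port-forward command so the example runs without a cluster (it assumes a Unix-like system with sleep on the PATH).

```go
package main

import (
	"fmt"
	"os/exec"
)

// childProcess tracks a started helper process so it can be stopped later.
type childProcess struct{ cmd *exec.Cmd }

func startProcess(name string, args ...string) (*childProcess, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return &childProcess{cmd: cmd}, nil
}

// Running reports whether the process has not been waited on yet.
func (p *childProcess) Running() bool { return p.cmd.ProcessState == nil }

// Stop kills the process and waits for it, so no orphan or zombie remains.
func (p *childProcess) Stop() {
	if !p.Running() {
		return
	}
	_ = p.cmd.Process.Kill()
	_ = p.cmd.Wait()
}

// withPortForward runs fn while the helper process is alive and always
// stops the process afterwards, even when fn returns an error.
func withPortForward(name string, args []string, fn func() error) error {
	pf, err := startProcess(name, args...)
	if err != nil {
		return fmt.Errorf("start port-forward: %w", err)
	}
	defer pf.Stop()
	return fn()
}

func main() {
	// "sleep 30" is a stand-in for the kubectl port-forward invocation.
	err := withPortForward("sleep", []string{"30"}, func() error {
		fmt.Println("talking to the Longhorn API...")
		return nil
	})
	fmt.Println("done, err =", err)
}
```

Calling Wait after Kill matters: without it the killed process stays in the process table as a zombie until the API server itself exits.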
5. Switch and Cleanup Phases
Status: Implemented but not thoroughly tested end-to-end.
Remaining work:
- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
- Test cleanup of the old active namespace after switch
- Handle edge cases: switch failure mid-way, partial cleanup
6. Scheduled Backups
Status: Data model exists (BackupConfiguration with schedules and retention), but no scheduler is implemented.
Remaining work:
- Implement a cron-based scheduler that reads BackupConfiguration from instance config
- Trigger backups on schedule
- Apply retention policies to clean up old backups
- Web UI for configuring backup schedules
7. Backup Verification
Status: Verify method exists on the Strategy interface but implementations are minimal.
Remaining work:
- Implement meaningful verification (e.g., check pg_dump integrity, verify Longhorn backup exists and is restorable)
- Optional scheduled verification per VerificationConfig
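As one cheap integrity check for the postgres strategy, a verifier could confirm that the dump file starts with the PGDMP magic bytes that pg_dump writes at the start of a custom-format archive. This assumes the strategy dumps in custom format, which the document does not state; the function name is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// pgCustomMagic is the prefix of a pg_dump custom-format archive.
var pgCustomMagic = []byte("PGDMP")

// looksLikePgCustomDump checks the first bytes of a dump file.
// It only detects truncated-to-empty or wrong-format files; it does not
// prove the archive is restorable.
func looksLikePgCustomDump(header []byte) bool {
	return bytes.HasPrefix(header, pgCustomMagic)
}

func main() {
	fmt.Println(looksLikePgCustomDump([]byte("PGDMP\x01\x0e")))
	fmt.Println(looksLikePgCustomDump([]byte("-- PostgreSQL database dump")))
}
```

A stronger check would be to run pg_restore --list against the file, which reads the archive's table of contents without touching a database.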
8. Web UI Integration
Status: API endpoints exist but no web UI for backup/restore.
Remaining work:
- Backup management page (list backups, trigger backup, view recovery plan)
- Restore workflow (select backup, monitor restore progress, confirm switch)
- Backup schedule configuration
- Backup status in app detail view
9. S3/Azure Backup Destinations
Status: Types defined (S3Config, AzureConfig) but only NFS and local filesystem destinations are implemented.
Remaining work:
- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
- Implement Azure Blob Storage destination
- Test with real cloud storage providers