wild-cloud-dev/docs/future/backup-restore.md
2026-05-16 22:24:30 +00:00

Wild Cloud Backup & Restore System

Status: Core system implemented and tested. Deterministic patching enhancement planned.


What's Been Built

Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores that data into a color-tagged (blue or green) standby environment, where it can be validated before traffic is switched over.

Architecture

Backup:  [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch:  [Standby App] becomes active, old active cleaned up

Strategy Pattern

Each app's backup is composed of one or more strategies that run independently:

| Strategy | What it backs up | How it restores |
| --- | --- | --- |
| config | config.yaml, secrets.yaml, app manifests | Copies files to standby app directory |
| postgres | pg_dump of the app's database | Creates {dbName}_{color} database, pg_restore into it |
| mysql | mysqldump of the app's database | Creates {dbName}_{color} database, restores into it |
| longhorn-native | Longhorn volume snapshots via API | Creates {pvcName}-{color} volumes from backup |

Strategies are selected automatically from the dependencies declared in the app's manifest (e.g., an app that requires pg gets the postgres strategy).
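
As a rough illustration of that selection step, the sketch below maps manifest dependencies to strategy names. The function name, the `hasVolumes` flag, the `mysql` dependency key, and the assumption that `config` always runs are hypothetical; the real logic lives in the backup package and may differ.

```go
package main

import "fmt"

// selectStrategies returns the strategy names to run for an app.
// requires holds dependency names from the app manifest (assumed shape);
// hasVolumes says whether the app declares persistent volumes (assumed flag).
func selectStrategies(requires []string, hasVolumes bool) []string {
	// Assumption: the config strategy applies to every app.
	strategies := []string{"config"}
	for _, dep := range requires {
		switch dep {
		case "pg":
			strategies = append(strategies, "postgres")
		case "mysql": // assumed dependency key
			strategies = append(strategies, "mysql")
		}
	}
	if hasVolumes {
		strategies = append(strategies, "longhorn-native")
	}
	return strategies
}

func main() {
	fmt.Println(selectStrategies([]string{"pg"}, true))
}
```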

RecoveryPlan

A YAML coordination record tracks the entire lifecycle:

app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
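
The colored names in the plan above follow fixed patterns: {app}-{color} for the namespace, {dbName}_{color} for the database, and {pvcName}-{color} for volumes. A minimal sketch of those naming rules follows; the helper names are illustrative and not taken from the codebase.

```go
package main

import "fmt"

// otherColor returns the standby color for a given active color.
func otherColor(active string) string {
	if active == "blue" {
		return "green"
	}
	return "blue"
}

// standbyNamespace follows the {app}-{color} pattern.
func standbyNamespace(app, color string) string { return app + "-" + color }

// coloredDBName follows the {dbName}_{color} pattern.
func coloredDBName(dbName, color string) string { return dbName + "_" + color }

// coloredVolumeName follows the {pvcName}-{color} pattern.
func coloredVolumeName(pvcName, color string) string { return pvcName + "-" + color }

func main() {
	color := otherColor("blue")
	fmt.Println(standbyNamespace("listmonk", color))
	fmt.Println(coloredDBName("listmonk", color))
	fmt.Println(coloredVolumeName("listmonk-data", color))
}
```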

Standby Deployment

When restoring, the system:

  1. Copies app manifests to a standby directory
  2. Patches kustomization.yaml namespace to {app}-{color}
  3. Patches namespace.yaml to match
  4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
  5. Runs kubectl apply -k on the standby directory
  6. Creates Kubernetes secrets from secrets.yaml (source of truth) in the standby namespace
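
Step 4 can be pictured as generating a small JSON patch (RFC 6902) that kustomize applies to the workload. The sketch below builds one such patch for a single env var; the container and env indices are assumed to be known already, and the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// envValuePatch returns a JSON patch that replaces the value of one env var,
// addressed by container index and env index inside a pod template.
func envValuePatch(containerIdx, envIdx int, newValue string) (string, error) {
	ops := []patchOp{{
		Op:    "replace",
		Path:  fmt.Sprintf("/spec/template/spec/containers/%d/env/%d/value", containerIdx, envIdx),
		Value: newValue,
	}}
	b, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	patch, err := envValuePatch(0, 2, "listmonk_green")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(patch)
}
```

Index-based paths are brittle if the env list is reordered, which is one reason the patch has to be generated from the copied manifests rather than written by hand.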

Key Files

api/internal/backup/
├── backup.go                    # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go               # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│   ├── config.go                # Config file backup/restore
│   ├── postgres.go              # PostgreSQL dump/restore
│   ├── mysql.go                 # MySQL dump/restore
│   └── longhorn_native.go       # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go                   # NFS backup storage
    └── local.go                 # Local filesystem backup storage

API Endpoints

POST /api/v1/instances/{name}/apps/{app}/backup     # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore     # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch      # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup     # Clean up old active
GET  /api/v1/instances/{name}/apps/{app}/backups     # List backups
GET  /api/v1/instances/{name}/apps/{app}/backup/{ts} # Get specific backup plan
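
A client for the POST endpoints might look like the sketch below. The base URL is a placeholder, and whether these endpoints expect a request body or return a particular payload is not specified here, so the example sends an empty body and only reports the status code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// actionURL builds the endpoint for an app-level action such as
// "backup", "restore", "switch", or "cleanup".
func actionURL(baseURL, instance, app, action string) string {
	return fmt.Sprintf("%s/api/v1/instances/%s/apps/%s/%s",
		baseURL, url.PathEscape(instance), url.PathEscape(app), action)
}

// postAction triggers an action and returns the HTTP status code.
// Assumption: the endpoint accepts an empty request body.
func postAction(baseURL, instance, app, action string) (int, error) {
	resp, err := http.Post(actionURL(baseURL, instance, app, action), "application/json", nil)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// "http://wild-api.example" is a placeholder base URL.
	fmt.Println(actionURL("http://wild-api.example", "test-cloud", "listmonk", "backup"))
}
```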

What's Been Tested

Full end-to-end backup/restore cycle for listmonk on test-cloud:

  • Config strategy: config.yaml, secrets.yaml, app manifests backed up and restored
  • Postgres strategy: database dumped, colored database created, data restored
  • Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
  • Secrets deployed from secrets.yaml (source of truth) to standby namespace
  • Database name patching in env vars (both exact match and connection string URLs)
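
The two patching cases in the last item (exact match and connection-string URL) can be sketched as follows. This is an illustration of the idea rather than the project's code; the function name and the example URL are made up.

```go
package main

import (
	"fmt"
	"net/url"
)

// rewriteDBValue replaces oldName with newName when value is either exactly
// the database name or a URL whose path is "/<oldName>". Any other value is
// returned unchanged.
func rewriteDBValue(value, oldName, newName string) string {
	if value == oldName {
		return newName
	}
	u, err := url.Parse(value)
	if err != nil || u.Scheme == "" || u.Path != "/"+oldName {
		return value
	}
	// Only the path segment changes; user, host, and query stay as they are.
	u.Path = "/" + newName
	return u.String()
}

func main() {
	fmt.Println(rewriteDBValue("listmonk", "listmonk", "listmonk_green"))
	fmt.Println(rewriteDBValue(
		"postgres://listmonk:secret@db.example:5432/listmonk?sslmode=disable",
		"listmonk", "listmonk_green"))
}
```

Parsing the URL instead of doing a plain string replace keeps a username that happens to equal the database name from being rewritten.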

Bugs Found and Fixed During Testing

  1. Secret deployment ordering: Secrets were being created before kubectl apply -k, which is the step that creates the namespace. Fixed by moving secret creation to after the apply.

  2. Secret source of truth: The restore was copying Kubernetes secrets between namespaces. Changed to create secrets from secrets.yaml (matching the normal deploy flow).

  3. Longhorn stale snapshots: Engine-level snapshots persisted after their Kubernetes custom resources were deleted. Cleaning them up required the Longhorn API's snapshotDelete + snapshotPurge before new backups could succeed.


What Remains To Be Done

1. Deterministic Blue-Green Patching (Next Priority)

Problem: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like DATABASE, DB_NAME, etc., and excludes names containing USER, HOST, etc. This works for common cases but is fragile:

  • Env var names that don't follow conventions get missed
  • Apps with multiple databases need multiple distinct mappings
  • PVC name patching for volumes is not yet wired into the standby deployment
  • Secret values containing database URLs need rewriting but currently aren't

Proposed solution: Add a restore field to app manifests that explicitly declares what needs patching:

# In manifest.yaml
restore:
  databases:
    - configKey: dbName           # Path in defaultConfig holding the database name
      secretKeys: [dbUrl]         # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data              # PVC names to create colored copies of

Implementation plan: See ~/.claude/plans/floofy-waddling-locket.md for detailed design.

Changes needed:

  • Add RestoreConfig struct to api/internal/apps/models.go
  • Refactor updateDatabaseRefsFromPlan in backup.go to use manifest declarations (with heuristic fallback)
  • Refactor deploySecretsToNamespace to rewrite secret values for declared secret keys
  • Add updateVolumeRefsFromManifest for PVC name patching in standby kustomize
  • Add restore field to ~13 app manifests in wild-directory/
  • Document the restore field in wild-directory/ADDING-APPS.md
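
One possible shape for the RestoreConfig struct, mirroring the proposed manifest YAML, is sketched below. The field names follow the YAML keys; everything else (the helper, and treating the app config as a flat map rather than a nested path lookup) is a simplifying assumption.

```go
package main

import "fmt"

// RestoreDatabase declares one database that needs a colored copy.
type RestoreDatabase struct {
	ConfigKey  string   `yaml:"configKey"`            // key in defaultConfig holding the db name
	SecretKeys []string `yaml:"secretKeys,omitempty"` // secret keys whose values embed the db name
}

// RestoreConfig mirrors the proposed `restore` field in manifest.yaml.
type RestoreConfig struct {
	Databases []RestoreDatabase `yaml:"databases,omitempty"`
	Volumes   []string          `yaml:"volumes,omitempty"`
}

// databaseRenames maps each declared database name to its colored name.
// Assumption: config is a flat key/value view of the app's config.
func databaseRenames(rc RestoreConfig, config map[string]string, color string) map[string]string {
	renames := make(map[string]string)
	for _, db := range rc.Databases {
		name, ok := config[db.ConfigKey]
		if !ok || name == "" {
			continue // declared key is missing from config; nothing to rename
		}
		renames[name] = name + "_" + color
	}
	return renames
}

func main() {
	rc := RestoreConfig{
		Databases: []RestoreDatabase{{ConfigKey: "dbName", SecretKeys: []string{"dbUrl"}}},
		Volumes:   []string{"listmonk-data"},
	}
	fmt.Println(databaseRenames(rc, map[string]string{"dbName": "listmonk"}, "green"))
}
```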

2. Volume Patching in Standby Deployment

Problem: The longhorn-native strategy creates colored volumes ({pvcName}-{color}) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.

Solution: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite claimName references in the standby kustomization.yaml, similar to how database env vars are patched today.
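
A sketch of how those claimName patches could be generated is shown below. It assumes the caller has already read the pod template's volumes list in order, and that a claim named {pvcName}-{color} exists in the standby namespace; both are assumptions, and the names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// claimPatches takes the claimName of each entry in a pod template's volumes
// list ("" for non-PVC volumes), in order, and returns replace operations for
// every claim that has a restored colored copy.
func claimPatches(claimNames []string, restored map[string]bool, color string) []patchOp {
	var ops []patchOp
	for i, claim := range claimNames {
		if !restored[claim] {
			continue
		}
		ops = append(ops, patchOp{
			Op:    "replace",
			Path:  fmt.Sprintf("/spec/template/spec/volumes/%d/persistentVolumeClaim/claimName", i),
			Value: claim + "-" + color,
		})
	}
	return ops
}

func main() {
	ops := claimPatches([]string{"", "listmonk-data"}, map[string]bool{"listmonk-data": true}, "green")
	b, err := json.Marshal(ops)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(b))
}
```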

3. Longhorn Backup Cleanup

Problem: cleanupOldBackups in longhorn_native.go is a no-op. Old Longhorn backups and engine-level snapshots accumulate.

Solution: Implement retention-based cleanup that:

  • Keeps the N most recent backups per volume (configurable via retention policy)
  • Deletes old Longhorn Backup custom resources
  • Cleans up engine-level snapshots via Longhorn API (snapshotDelete + snapshotPurge)
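
The "keep the N most recent" selection could be as small as the sketch below. It relies on backup timestamps in the 20260308T231537Z style, which sort chronologically as plain strings; the function name is hypothetical, and the actual deletion calls are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// backupsToDelete returns the timestamps that fall outside the retention
// window, keeping the `keep` most recent entries. The input is not modified.
func backupsToDelete(timestamps []string, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	sorted := append([]string(nil), timestamps...)
	// Fixed-width UTC timestamps sort chronologically as strings; newest first.
	sort.Sort(sort.Reverse(sort.StringSlice(sorted)))
	if len(sorted) <= keep {
		return nil
	}
	return sorted[keep:]
}

func main() {
	all := []string{"20260308T231537Z", "20260301T000000Z", "20260310T120000Z"}
	fmt.Println(backupsToDelete(all, 2))
}
```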

4. Longhorn Port-Forward Cleanup

Problem: getLonghornAPIEndpoint starts a kubectl port-forward process on port 8080 that is never cleaned up. These orphaned processes accumulate across backup/restore operations.

Solution: Track the port-forward process and kill it when the operation completes. Use a defer pattern or add cleanup to the strategy's Cleanup method.
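
The defer pattern could look roughly like this sketch, which ties a child process to the lifetime of a callback. The helper name is hypothetical, and the kubectl arguments in main (namespace and service name) are assumptions about the Longhorn install. The sketch also does not wait for the forward to become ready before running the callback, which a real implementation would need to do.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// withProcess starts a command, runs fn while it is alive, and always
// kills and reaps the command before returning.
func withProcess(name string, args []string, fn func() error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start %s: %w", name, err)
	}
	defer func() {
		cancel()       // cancelling the context kills the child process
		_ = cmd.Wait() // reap it so no orphan or zombie is left behind
	}()

	return fn()
}

func main() {
	// Assumed namespace and service name for the Longhorn API.
	args := []string{"-n", "longhorn-system", "port-forward", "svc/longhorn-frontend", "8080:80"}
	err := withProcess("kubectl", args, func() error {
		// Longhorn API calls would go here.
		return nil
	})
	fmt.Println("finished, err =", err)
}
```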

5. Switch and Cleanup Phases

Status: Implemented but not thoroughly tested end-to-end.

Remaining work:

  • Test the full switch flow (DNS/ingress cutover from old namespace to standby)
  • Test cleanup of the old active namespace after switch
  • Handle edge cases: switch failure mid-way, partial cleanup

6. Scheduled Backups

Status: Data model exists (BackupConfiguration with schedules and retention), but no scheduler is implemented.

Remaining work:

  • Implement a cron-based scheduler that reads BackupConfiguration from instance config
  • Trigger backups on schedule
  • Apply retention policies to clean up old backups
  • Web UI for configuring backup schedules

7. Backup Verification

Status: Verify method exists on the Strategy interface but implementations are minimal.

Remaining work:

  • Implement meaningful verification (e.g., check pg_dump integrity, verify Longhorn backup exists and is restorable)
  • Optional scheduled verification per VerificationConfig
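
As one cheap first check for the postgres strategy, a verifier could confirm that the dump file starts with the PGDMP magic bytes that pg_dump writes at the start of custom-format archives. This assumes the strategy uses custom format (pg_dump -Fc); tar or directory dumps would need a different check, and a header check alone does not prove the archive is restorable. The function names and the file path are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// pgCustomMagic is the header of a pg_dump custom-format archive.
var pgCustomMagic = []byte("PGDMP")

// hasPgCustomMagic reports whether header begins with the magic bytes.
func hasPgCustomMagic(header []byte) bool {
	return bytes.HasPrefix(header, pgCustomMagic)
}

// checkDumpFile reads the first bytes of a dump file and checks the magic.
// A file shorter than the magic is reported as invalid, not as an error.
func checkDumpFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	header := make([]byte, len(pgCustomMagic))
	n, err := io.ReadFull(f, header)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return false, err
	}
	return hasPgCustomMagic(header[:n]), nil
}

func main() {
	// "listmonk.dump" is a placeholder path.
	ok, err := checkDumpFile("listmonk.dump")
	fmt.Println("valid header:", ok, "err:", err)
}
```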

8. Web UI Integration

Status: API endpoints exist, but there is no web UI for backup/restore yet.

Remaining work:

  • Backup management page (list backups, trigger backup, view recovery plan)
  • Restore workflow (select backup, monitor restore progress, confirm switch)
  • Backup schedule configuration
  • Backup status in app detail view

9. S3/Azure Backup Destinations

Status: Types defined (S3Config, AzureConfig) but only NFS and local filesystem destinations are implemented.

Remaining work:

  • Implement S3-compatible destination (for MinIO, AWS S3, etc.)
  • Implement Azure Blob Storage destination
  • Test with real cloud storage providers