# Wild Cloud Backup & Restore System

**Status**: Core system implemented and tested. Deterministic patching enhancement planned.

---

## What's Been Built

Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores it into a color-suffixed standby environment (`blue` or `green`) so the restored app can be validated before traffic is switched.

### Architecture

```
Backup:  [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
Switch:  [Standby App] becomes active, old active cleaned up
```

### Strategy Pattern

Each app's backup is composed of one or more strategies that run independently:

| Strategy | What it backs up | How it restores |
|----------|-----------------|-----------------|
| `config` | `config.yaml`, `secrets.yaml`, app manifests | Copies files to standby app directory |
| `postgres` | `pg_dump` of the app's database | Creates `{dbName}_{color}` database, `pg_restore` into it |
| `mysql` | `mysqldump` of the app's database | Creates `{dbName}_{color}` database, restores into it |
| `longhorn-native` | Longhorn volume snapshots via API | Creates `{pvcName}-{color}` volumes from backup |

Strategies are selected automatically from the dependencies declared in the app's manifest (e.g., an app that requires `pg` gets the `postgres` strategy).
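The selection rule can be sketched as a lookup from dependency names to strategy names. This is a minimal illustration, not the code in `backup.go`: only the `pg` to `postgres` pairing comes from this document, while `strategyForDependency`, `selectStrategies`, the `mysql` dependency name, the always-on `config` strategy, and the `hasVolumes` flag are assumptions.

```go
package main

import "fmt"

// strategyForDependency maps a manifest dependency to a backup strategy.
// Only pg -> postgres is stated in this document; the mysql entry is an
// assumed analogue.
var strategyForDependency = map[string]string{
	"pg":    "postgres",
	"mysql": "mysql", // assumption
}

// selectStrategies returns the strategy names for one app. It assumes the
// config strategy always runs, adds one strategy per recognised dependency,
// and adds longhorn-native when the app has persistent volumes.
func selectStrategies(requires []string, hasVolumes bool) []string {
	selected := []string{"config"}
	seen := map[string]bool{"config": true}
	for _, dep := range requires {
		name, ok := strategyForDependency[dep]
		if !ok || seen[name] {
			continue // unknown dependency or strategy already selected
		}
		seen[name] = true
		selected = append(selected, name)
	}
	if hasVolumes {
		selected = append(selected, "longhorn-native")
	}
	return selected
}

func main() {
	// An app that requires pg and redis and owns a PVC.
	fmt.Println(selectStrategies([]string{"pg", "redis"}, true))
}
```

Dependencies without a matching strategy (such as `redis` here) are simply skipped, so adding a new strategy would only need a new map entry.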

### RecoveryPlan

A YAML coordination record tracks the entire lifecycle:

```yaml
app: listmonk
instance: test-cloud
timestamp: "20260308T231537Z"
status: restored
standbyColor: green
source:
  activeColor: blue
  namespace: listmonk
  appDir: instances/test-cloud/apps/listmonk
standby:
  namespace: listmonk-green
  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
strategies:
  - name: config
    status: restored
  - name: postgres
    status: restored
    params: { dbName: listmonk }
    restore: { dbName: listmonk_green }
  - name: longhorn-native
    status: restored
    backup:
      volumes:
        - pvcName: listmonk-data
          backupURL: backup://pvc-xxx/backup-yyy
    restore:
      volumes:
        - pvcName: listmonk-data
          volumeName: listmonk-data-green
phases:
  backup: { startedAt: ..., completedAt: ... }
  restore: { startedAt: ..., completedAt: ... }
```
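The colored names in the plan follow a fixed pattern: `{app}-{color}` for the namespace, `{dbName}_{color}` for the database, and `{pvcName}-{color}` for volumes. The sketch below derives them from the active color. The `StandbyNames` type and both helper functions are hypothetical names used for illustration, and the two-color toggle is an assumption.

```go
package main

import "fmt"

// StandbyNames holds the color-suffixed names used by one restore.
// Hypothetical type; the real RecoveryPlan lives in types/types.go.
type StandbyNames struct {
	Namespace string
	Database  string
	Volume    string
}

// otherColor returns the standby color for a given active color,
// assuming only blue and green are in use.
func otherColor(active string) string {
	if active == "blue" {
		return "green"
	}
	return "blue"
}

// standbyNames applies the naming patterns described in this document.
func standbyNames(app, dbName, pvcName, activeColor string) StandbyNames {
	color := otherColor(activeColor)
	return StandbyNames{
		Namespace: app + "-" + color,     // {app}-{color}
		Database:  dbName + "_" + color,  // {dbName}_{color}
		Volume:    pvcName + "-" + color, // {pvcName}-{color}
	}
}

func main() {
	n := standbyNames("listmonk", "listmonk", "listmonk-data", "blue")
	fmt.Println(n.Namespace, n.Database, n.Volume)
}
```

With the listmonk values from the plan above, this yields `listmonk-green`, `listmonk_green`, and `listmonk-data-green`.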

### Standby Deployment

When restoring, the system:

1. Copies app manifests to a standby directory
2. Patches `kustomization.yaml` namespace to `{app}-{color}`
3. Patches `namespace.yaml` to match
4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
5. Runs `kubectl apply -k` on the standby directory
6. Creates Kubernetes secrets from `secrets.yaml` (source of truth) in the standby namespace
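Step 4 can be illustrated by generating RFC 6902 JSON Patch operations for a Deployment's env entries. This is a sketch under assumptions: it matches on env values only (exact database name, or a URL whose last path segment is the database name), it targets container index 0, and `patchOp`, `envVar`, and `dbPatchOps` are hypothetical names rather than the functions in `backup.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// envVar mirrors the name/value pair of a container env entry.
type envVar struct {
	Name  string
	Value string
}

// dbPatchOps builds "replace" operations for env entries whose value is
// exactly oldDB or a connection URL ending in "/"+oldDB. URLs with query
// parameters are not handled in this sketch.
func dbPatchOps(env []envVar, oldDB, newDB string) []patchOp {
	var ops []patchOp
	for i, e := range env {
		var newValue string
		switch {
		case e.Value == oldDB:
			newValue = newDB
		case strings.Contains(e.Value, "://") && strings.HasSuffix(e.Value, "/"+oldDB):
			newValue = strings.TrimSuffix(e.Value, oldDB) + newDB
		default:
			continue
		}
		ops = append(ops, patchOp{
			Op:    "replace",
			Path:  fmt.Sprintf("/spec/template/spec/containers/0/env/%d/value", i),
			Value: newValue,
		})
	}
	return ops
}

func main() {
	env := []envVar{
		{Name: "DB_HOST", Value: "postgres.postgres.svc"},
		{Name: "DB_NAME", Value: "listmonk"},
		{Name: "DB_URL", Value: "postgres://listmonk@postgres.postgres.svc:5432/listmonk"},
	}
	out, err := json.MarshalIndent(dbPatchOps(env, "listmonk", "listmonk_green"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The resulting JSON array is the form kustomize accepts as an inline or file-based patch for a targeted resource.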

### Key Files

```
api/internal/backup/
├── backup.go                    # Core orchestration (Backup, Restore, Switch, Cleanup)
├── types/types.go               # RecoveryPlan, Strategy interface, StrategyEntry
├── strategies/
│   ├── config.go                # Config file backup/restore
│   ├── postgres.go              # PostgreSQL dump/restore
│   ├── mysql.go                 # MySQL dump/restore
│   └── longhorn_native.go       # Longhorn volume snapshot backup/restore
└── destinations/
    ├── nfs.go                   # NFS backup storage
    └── local.go                 # Local filesystem backup storage
```

### API Endpoints

```
POST /api/v1/instances/{name}/apps/{app}/backup     # Create backup
POST /api/v1/instances/{name}/apps/{app}/restore     # Restore from latest backup
POST /api/v1/instances/{name}/apps/{app}/switch      # Switch traffic to standby
POST /api/v1/instances/{name}/apps/{app}/cleanup     # Clean up old active
GET  /api/v1/instances/{name}/apps/{app}/backups     # List backups
GET  /api/v1/instances/{name}/apps/{app}/backup/{ts} # Get specific backup plan
```

### What's Been Tested

Full end-to-end backup/restore cycle for listmonk on `test-cloud`:
- Config strategy: config.yaml, secrets.yaml, app manifests backed up and restored
- Postgres strategy: database dumped, colored database created, data restored
- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
- Secrets deployed from `secrets.yaml` (source of truth) to standby namespace
- Database name patching in env vars (both exact match and connection string URLs)

### Bugs Found and Fixed During Testing

1. **Secret deployment ordering**: Secrets were being created before `kubectl apply -k`, which is the step that creates the namespace. Fixed by moving secret creation to after the apply.

2. **Secret source of truth**: The restore was copying Kubernetes secrets between namespaces. It now creates secrets from `secrets.yaml`, matching the normal deploy flow.

3. **Longhorn stale snapshots**: Engine-level snapshots persisted after the corresponding Kubernetes custom resources were deleted. Cleaning them up required Longhorn API `snapshotDelete` + `snapshotPurge` before new backups could succeed.

---

## What Remains To Be Done

### 1. Deterministic Blue-Green Patching (Next Priority)

**Problem**: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like `DATABASE`, `DB_NAME`, etc., and excludes names containing `USER`, `HOST`, etc. This works for common cases but is fragile:

- Env var names that don't follow conventions get missed
- Apps with multiple databases need multiple distinct mappings
- PVC name patching for volumes is not yet wired into the standby deployment
- Secret values containing database URLs need rewriting but currently aren't
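The name-based heuristic described above can be approximated as follows, which also makes its blind spots easy to see. The pattern lists contain only the examples named in this document (the real lists are longer), and `looksLikeDBNameVar` is a hypothetical function name.

```go
package main

import (
	"fmt"
	"strings"
)

// Patterns taken from the examples in this document; the real
// implementation is described as having more ("etc.").
var (
	includePatterns = []string{"DATABASE", "DB_NAME"}
	excludePatterns = []string{"USER", "HOST"}
)

// looksLikeDBNameVar reports whether an env var name suggests that its
// value holds a database name. Exclusions take precedence over inclusions.
func looksLikeDBNameVar(name string) bool {
	upper := strings.ToUpper(name)
	for _, p := range excludePatterns {
		if strings.Contains(upper, p) {
			return false
		}
	}
	for _, p := range includePatterns {
		if strings.Contains(upper, p) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{"DB_NAME", "DATABASE_HOST", "LISTMONK_db__database", "APP_SCHEMA"}
	for _, name := range names {
		fmt.Printf("%-24s patched=%v\n", name, looksLikeDBNameVar(name))
	}
}
```

A variable such as the made-up `APP_SCHEMA` would be skipped even if it held the database name, which is the kind of miss that explicit manifest declarations are meant to remove.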

**Proposed solution**: Add a `restore` field to app manifests that explicitly declares what needs patching:

```yaml
# In manifest.yaml
restore:
  databases:
    - configKey: dbName           # Path in defaultConfig holding the database name
      secretKeys: [dbUrl]         # Secret keys containing the db name (for URL rewriting)
  volumes:
    - listmonk-data              # PVC names to create colored copies of
```

**Implementation plan**: See `~/.claude/plans/floofy-waddling-locket.md` for detailed design.

**Changes needed**:
- Add `RestoreConfig` struct to `api/internal/apps/models.go`
- Refactor `updateDatabaseRefsFromPlan` in `backup.go` to use manifest declarations (with heuristic fallback)
- Refactor `deploySecretsToNamespace` to rewrite secret values for declared secret keys
- Add `updateVolumeRefsFromManifest` for PVC name patching in standby kustomize
- Add `restore` field to ~13 app manifests in `wild-directory/`
- Document the `restore` field in `wild-directory/ADDING-APPS.md`
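One possible shape for the `RestoreConfig` struct and the secret-value rewriting is sketched below. The field names and YAML tags are inferred from the proposed manifest snippet, and `RestoreDatabase` and `rewriteDBURL` are hypothetical names. The rewrite assumes URL-style connection strings; DSN formats that are not valid URLs would need separate handling.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// RestoreConfig mirrors the proposed `restore` manifest field.
// Names and tags are assumptions based on the YAML example.
type RestoreConfig struct {
	Databases []RestoreDatabase `yaml:"databases,omitempty"`
	Volumes   []string          `yaml:"volumes,omitempty"`
}

// RestoreDatabase declares where one database name appears.
type RestoreDatabase struct {
	ConfigKey  string   `yaml:"configKey"`
	SecretKeys []string `yaml:"secretKeys,omitempty"`
}

// rewriteDBURL replaces the database name in the path of a connection URL.
// URLs whose path is not exactly "/"+oldDB are returned unchanged.
func rewriteDBURL(raw, oldDB, newDB string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if strings.TrimPrefix(u.Path, "/") != oldDB {
		return raw, nil
	}
	u.Path = "/" + newDB
	return u.String(), nil
}

func main() {
	cfg := RestoreConfig{
		Databases: []RestoreDatabase{{ConfigKey: "dbName", SecretKeys: []string{"dbUrl"}}},
		Volumes:   []string{"listmonk-data"},
	}
	// Illustrative secret value; not taken from a real secrets.yaml.
	secrets := map[string]string{
		"dbUrl": "postgres://listmonk:s3cret@postgres.postgres.svc:5432/listmonk?sslmode=disable",
	}
	for _, db := range cfg.Databases {
		for _, key := range db.SecretKeys {
			rewritten, err := rewriteDBURL(secrets[key], "listmonk", "listmonk_green")
			if err != nil {
				fmt.Println("skip", key, err)
				continue
			}
			fmt.Println(key, "=", rewritten)
		}
	}
}
```

Parsing the URL rather than doing a plain string replace avoids touching a username or host that happens to equal the database name.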

### 2. Volume Patching in Standby Deployment

**Problem**: The longhorn-native strategy creates colored volumes (`{pvcName}-{color}`) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.

**Solution**: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite `claimName` references in the standby `kustomization.yaml`, similar to how database env vars are patched today.
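A sketch of what the `claimName` rewrite might generate, assuming a Deployment-shaped resource and assuming a PVC with the colored name exists in the standby namespace. `podVolume` and `claimPatchOps` are hypothetical names, and the `restored` map stands in for the `restore.volumes` entries of the RecoveryPlan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// podVolume is a simplified pod volume; ClaimName is empty for volumes
// that are not backed by a PVC (configMap, emptyDir, and so on).
type podVolume struct {
	Name      string
	ClaimName string
}

// claimPatchOps emits a replace operation for every PVC-backed volume
// whose claim has a restored, colored counterpart.
func claimPatchOps(volumes []podVolume, restored map[string]string) []patchOp {
	var ops []patchOp
	for i, v := range volumes {
		colored, ok := restored[v.ClaimName]
		if v.ClaimName == "" || !ok {
			continue
		}
		ops = append(ops, patchOp{
			Op:    "replace",
			Path:  fmt.Sprintf("/spec/template/spec/volumes/%d/persistentVolumeClaim/claimName", i),
			Value: colored,
		})
	}
	return ops
}

func main() {
	volumes := []podVolume{
		{Name: "config"}, // not PVC-backed, left alone
		{Name: "data", ClaimName: "listmonk-data"},
	}
	restored := map[string]string{"listmonk-data": "listmonk-data-green"}
	out, err := json.MarshalIndent(claimPatchOps(volumes, restored), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```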

### 3. Longhorn Backup Cleanup

**Problem**: `cleanupOldBackups` in `longhorn_native.go` is a no-op. Old Longhorn backups and engine-level snapshots accumulate.

**Solution**: Implement retention-based cleanup that:
- Keeps the N most recent backups per volume (configurable via retention policy)
- Deletes old Longhorn Backup custom resources
- Cleans up engine-level snapshots via Longhorn API (`snapshotDelete` + `snapshotPurge`)
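The "keep the N most recent per volume" rule could be computed along these lines before any deletion calls are made. `backupRecord` and `expiredBackups` are hypothetical names, and the sketch relies on timestamps in the `20260308T231537Z` format used by the RecoveryPlan, which sort chronologically as plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// backupRecord identifies one Longhorn backup of one volume.
// Hypothetical type for illustration.
type backupRecord struct {
	Volume    string
	Name      string
	Timestamp string // e.g. 20260308T231537Z; sorts chronologically as a string
}

// expiredBackups returns the backups to delete so that only the `keep`
// newest backups remain per volume. The result is sorted by volume and
// timestamp so callers see a stable order.
func expiredBackups(backups []backupRecord, keep int) []backupRecord {
	if keep < 0 {
		keep = 0
	}
	byVolume := map[string][]backupRecord{}
	for _, b := range backups {
		byVolume[b.Volume] = append(byVolume[b.Volume], b)
	}
	var expired []backupRecord
	for _, group := range byVolume {
		// Newest first, then drop everything past the first `keep`.
		sort.Slice(group, func(i, j int) bool { return group[i].Timestamp > group[j].Timestamp })
		if len(group) > keep {
			expired = append(expired, group[keep:]...)
		}
	}
	sort.Slice(expired, func(i, j int) bool {
		if expired[i].Volume != expired[j].Volume {
			return expired[i].Volume < expired[j].Volume
		}
		return expired[i].Timestamp < expired[j].Timestamp
	})
	return expired
}

func main() {
	backups := []backupRecord{
		{Volume: "listmonk-data", Name: "backup-a", Timestamp: "20260301T000000Z"},
		{Volume: "listmonk-data", Name: "backup-b", Timestamp: "20260305T000000Z"},
		{Volume: "listmonk-data", Name: "backup-c", Timestamp: "20260308T231537Z"},
	}
	for _, b := range expiredBackups(backups, 2) {
		fmt.Println("delete", b.Volume, b.Name)
	}
}
```

Separating the selection from the deletion keeps the retention rule testable without a Longhorn cluster.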

### 4. Longhorn Port-Forward Cleanup

**Problem**: `getLonghornAPIEndpoint` starts a `kubectl port-forward` process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.

**Solution**: Track the port-forward process and kill it when the operation completes. Use a `defer` pattern or add cleanup to the strategy's `Cleanup` method.
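The `defer` pattern could look like the sketch below, which ties the child process to a cancellable context and reaps it on stop. `startForward` is a hypothetical helper, and the `longhorn-system` namespace, `svc/longhorn-frontend` service, and `8080:80` port mapping are assumptions about the cluster rather than values taken from `getLonghornAPIEndpoint`.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// startForward starts a long-running command and returns a stop function.
// Calling stop cancels the context, which makes exec.CommandContext kill
// the process, and then waits for it so no orphan or zombie is left behind.
func startForward(name string, args ...string) (stop func(), err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		cancel()
		return nil, err
	}
	return func() {
		cancel()
		_ = cmd.Wait() // error is expected here because the process was killed
	}, nil
}

func main() {
	// Namespace, service name, and ports are assumptions for illustration.
	stop, err := startForward("kubectl", "port-forward",
		"-n", "longhorn-system", "svc/longhorn-frontend", "8080:80")
	if err != nil {
		fmt.Println("port-forward failed to start:", err)
		return
	}
	defer stop() // runs even if the work below returns early

	// ... call the Longhorn API on localhost:8080 here ...
	fmt.Println("port-forward running; it is stopped when main returns")
}
```

Returning the stop function also lets a strategy store it and call it from its `Cleanup` method when the forward has to outlive a single function call.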

### 5. Switch and Cleanup Phases

**Status**: Implemented but not thoroughly tested end-to-end.

**Remaining work**:
- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
- Test cleanup of the old active namespace after switch
- Handle edge cases: switch failure mid-way, partial cleanup

### 6. Scheduled Backups

**Status**: Data model exists (`BackupConfiguration` with schedules and retention), but no scheduler is implemented.

**Remaining work**:
- Implement a cron-based scheduler that reads `BackupConfiguration` from instance config
- Trigger backups on schedule
- Apply retention policies to clean up old backups
- Web UI for configuring backup schedules

### 7. Backup Verification

**Status**: `Verify` method exists on the Strategy interface but implementations are minimal.

**Remaining work**:
- Implement meaningful verification (e.g., check pg_dump integrity, verify Longhorn backup exists and is restorable)
- Optional scheduled verification per `VerificationConfig`
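As a first cheap check, a `Verify` implementation could confirm that a dump file at least starts with the signature of a `pg_dump` custom-format archive, which begins with the bytes `PGDMP`. This is a sketch assuming the postgres strategy writes custom-format dumps (`pg_dump -Fc`). It does not prove the archive is restorable, and `hasPgCustomHeader` and `verifyDumpFile` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// pgCustomMagic is the signature at the start of a pg_dump
// custom-format archive.
var pgCustomMagic = []byte("PGDMP")

// hasPgCustomHeader reports whether data starts with the signature.
func hasPgCustomHeader(data []byte) bool {
	return bytes.HasPrefix(data, pgCustomMagic)
}

// verifyDumpFile reads only the first bytes of the file and checks them.
// It catches empty, truncated-to-zero, or wrong-format files, nothing more.
func verifyDumpFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	header := make([]byte, len(pgCustomMagic))
	if _, err := io.ReadFull(f, header); err != nil {
		return fmt.Errorf("dump %s is too short: %w", path, err)
	}
	if !hasPgCustomHeader(header) {
		return fmt.Errorf("dump %s is not a pg_dump custom-format archive", path)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: verify <dump-file>")
		return
	}
	if err := verifyDumpFile(os.Args[1]); err != nil {
		fmt.Println("verification failed:", err)
		return
	}
	fmt.Println("header looks like a pg_dump custom-format archive")
}
```

A stronger check would list the archive's table of contents or restore it into a scratch database, at the cost of needing the PostgreSQL client tools at verification time.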

### 8. Web UI Integration

**Status**: API endpoints exist but no web UI for backup/restore.

**Remaining work**:
- Backup management page (list backups, trigger backup, view recovery plan)
- Restore workflow (select backup, monitor restore progress, confirm switch)
- Backup schedule configuration
- Backup status in app detail view

### 9. S3/Azure Backup Destinations

**Status**: Types defined (`S3Config`, `AzureConfig`) but only NFS and local filesystem destinations are implemented.

**Remaining work**:
- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
- Implement Azure Blob Storage destination
- Test with real cloud storage providers