Future work.

2026-05-16 22:24:30 +00:00
parent b9e212fbde
commit bb7f7b0577
3 changed files with 491 additions and 0 deletions
--- a/docs/future/backup-restore.md
+++ b/docs/future/backup-restore.md
@@ -0,0 +1,226 @@
+# Wild Cloud Backup & Restore System
+
+**Status**: Core system implemented and tested. Deterministic patching enhancement planned.
+
+---
+
+## What's Been Built
+
+Wild Cloud has a blue-green backup and restore system for apps. It backs up an app's data (config files, database, persistent volumes), then restores it into a color-tagged (blue or green) standby environment for validation before switching traffic.
+
+### Architecture
+
+```
+Backup:  [Active App] → config snapshot + DB dump + volume snapshot → [Backup Storage]
+Restore: [Backup Storage] → colored standby namespace + colored DB + colored volumes → [Standby App]
+Switch:  [Standby App] becomes active, old active cleaned up
+```
+
+### Strategy Pattern
+
+Each app's backup is composed of one or more strategies that run independently:
+
+| Strategy | What it backs up | How it restores |
+|----------|-----------------|-----------------|
+| `config` | `config.yaml`, `secrets.yaml`, app manifests | Copies files to standby app directory |
+| `postgres` | `pg_dump` of the app's database | Creates `{dbName}_{color}` database, `pg_restore` into it |
+| `mysql` | `mysqldump` of the app's database | Creates `{dbName}_{color}` database, restores into it |
+| `longhorn-native` | Longhorn volume snapshots via API | Creates `{pvcName}-{color}` volumes from backup |
+
+Strategies are selected automatically based on the app's manifest dependencies (e.g., an app that requires `pg` triggers the `postgres` strategy).
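+
+A minimal sketch of this selection, not the actual code in `backup.go`. The table `strategyForDependency` and the helper `selectStrategies` are hypothetical names. Only the `pg` to `postgres` mapping is stated above; the `mysql` key, the always-on `config` strategy, and the `hasVolumes` flag are illustrative assumptions.
+
+```go
+package main
+
+import "fmt"
+
+// strategyForDependency maps a manifest dependency to a backup strategy.
+// Only "pg" -> "postgres" comes from this document; "mysql" is an assumed key.
+var strategyForDependency = map[string]string{
+    "pg":    "postgres",
+    "mysql": "mysql",
+}
+
+// selectStrategies returns the strategy names for an app. It assumes that
+// "config" always runs and that "longhorn-native" runs when the app has PVCs.
+func selectStrategies(requires []string, hasVolumes bool) []string {
+    strategies := []string{"config"}
+    seen := map[string]bool{"config": true}
+    for _, dep := range requires {
+        name, ok := strategyForDependency[dep]
+        if !ok || seen[name] {
+            continue // unknown dependency or strategy already selected
+        }
+        seen[name] = true
+        strategies = append(strategies, name)
+    }
+    if hasVolumes {
+        strategies = append(strategies, "longhorn-native")
+    }
+    return strategies
+}
+
+func main() {
+    // Prints: [config postgres longhorn-native]
+    fmt.Println(selectStrategies([]string{"pg"}, true))
+}
+```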
+
+### RecoveryPlan
+
+A YAML coordination record tracks the entire lifecycle:
+
+```yaml
+app: listmonk
+instance: test-cloud
+timestamp: "20260308T231537Z"
+status: restored
+standbyColor: green
+source:
+  activeColor: blue
+  namespace: listmonk
+  appDir: instances/test-cloud/apps/listmonk
+standby:
+  namespace: listmonk-green
+  appDir: instances/test-cloud/backups/listmonk/20260308T231537Z/standby-app
+strategies:
+  - name: config
+    status: restored
+  - name: postgres
+    status: restored
+    params: { dbName: listmonk }
+    restore: { dbName: listmonk_green }
+  - name: longhorn-native
+    status: restored
+    backup:
+      volumes:
+        - pvcName: listmonk-data
+          backupURL: backup://pvc-xxx/backup-yyy
+    restore:
+      volumes:
+        - pvcName: listmonk-data
+          volumeName: listmonk-data-green
+phases:
+  backup: { startedAt: ..., completedAt: ... }
+  restore: { startedAt: ..., completedAt: ... }
+```
+
+### Standby Deployment
+
+When restoring, the system:
+
+1. Copies app manifests to a standby directory
+2. Patches `kustomization.yaml` namespace to `{app}-{color}`
+3. Patches `namespace.yaml` to match
+4. Patches database references in env vars (db name and connection URLs) using kustomize JSON patches
+5. Runs `kubectl apply -k` on the standby directory
+6. Creates Kubernetes secrets from `secrets.yaml` (source of truth) in the standby namespace
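+
+The naming conventions used in these steps can be sketched as follows. This is an illustration under assumptions, not the code in `backup.go`: the helper names (`standbyNamespace`, `coloredDBName`, `coloredPVCName`, `rewriteDBURL`) are hypothetical, and the real system applies the database change through kustomize JSON patches rather than string handling alone. The sketch only shows how the colored names are derived and why a connection URL should be rewritten by its path component instead of a blanket string replace.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/url"
+    "strings"
+)
+
+// standbyNamespace returns the "{app}-{color}" namespace name.
+func standbyNamespace(app, color string) string { return app + "-" + color }
+
+// coloredDBName returns the "{dbName}_{color}" database name.
+func coloredDBName(dbName, color string) string { return dbName + "_" + color }
+
+// coloredPVCName returns the "{pvcName}-{color}" volume name.
+func coloredPVCName(pvcName, color string) string { return pvcName + "-" + color }
+
+// rewriteDBURL replaces the database name in the path of a connection URL.
+// User, host, and query are left untouched, so a user named like the
+// database is not renamed by accident.
+func rewriteDBURL(raw, oldName, newName string) (string, error) {
+    u, err := url.Parse(raw)
+    if err != nil {
+        return "", fmt.Errorf("parsing connection URL: %w", err)
+    }
+    if strings.TrimPrefix(u.Path, "/") != oldName {
+        return raw, nil // URL points at a different database; leave it as is
+    }
+    u.Path = "/" + newName
+    return u.String(), nil
+}
+
+func main() {
+    color := "green"
+    fmt.Println(standbyNamespace("listmonk", color))
+    fmt.Println(coloredDBName("listmonk", color))
+    fmt.Println(coloredPVCName("listmonk-data", color))
+
+    rewritten, err := rewriteDBURL(
+        "postgres://listmonk:pw@db:5432/listmonk?sslmode=disable",
+        "listmonk", coloredDBName("listmonk", color))
+    if err != nil {
+        fmt.Println("error:", err)
+        return
+    }
+    fmt.Println(rewritten)
+}
+```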
+
+### Key Files
+
+```
+api/internal/backup/
+├── backup.go                    # Core orchestration (Backup, Restore, Switch, Cleanup)
+├── types/types.go               # RecoveryPlan, Strategy interface, StrategyEntry
+├── strategies/
+│   ├── config.go                # Config file backup/restore
+│   ├── postgres.go              # PostgreSQL dump/restore
+│   ├── mysql.go                 # MySQL dump/restore
+│   └── longhorn_native.go       # Longhorn volume snapshot backup/restore
+└── destinations/
+    ├── nfs.go                   # NFS backup storage
+    └── local.go                 # Local filesystem backup storage
+```
+
+### API Endpoints
+
+```
+POST /api/v1/instances/{name}/apps/{app}/backup     # Create backup
+POST /api/v1/instances/{name}/apps/{app}/restore     # Restore from latest backup
+POST /api/v1/instances/{name}/apps/{app}/switch      # Switch traffic to standby
+POST /api/v1/instances/{name}/apps/{app}/cleanup     # Clean up old active
+GET  /api/v1/instances/{name}/apps/{app}/backups     # List backups
+GET  /api/v1/instances/{name}/apps/{app}/backup/{ts} # Get specific backup plan
+```
+
+### What's Been Tested
+
+Full end-to-end backup/restore cycle for listmonk on `test-cloud`:
+- Config strategy: config.yaml, secrets.yaml, app manifests backed up and restored
+- Postgres strategy: database dumped, colored database created, data restored
+- Longhorn-native strategy: PVC snapshot created, colored volume restored from backup
+- Secrets deployed from `secrets.yaml` (source of truth) to standby namespace
+- Database name patching in env vars (both exact match and connection string URLs)
+
+### Bugs Found and Fixed During Testing
+
+1. **Secret deployment ordering**: Secrets were being created before `kubectl apply -k`, which creates the namespace. Fixed by moving secret creation to after the apply.
+
+2. **Secret source of truth**: The restore was copying Kubernetes secrets between namespaces. Changed to create secrets from `secrets.yaml` (matching the normal deploy flow).
+
+3. **Longhorn stale snapshots**: Engine-level snapshots persisted after deleting Kubernetes CRDs. Required Longhorn API `snapshotDelete` + `snapshotPurge` to clean up before new backups could succeed.
+
+---
+
+## What Remains To Be Done
+
+### 1. Deterministic Blue-Green Patching (Next Priority)
+
+**Problem**: The current system uses heuristics to decide which env vars need their database name replaced during restore. It checks env var names for patterns like `DATABASE`, `DB_NAME`, etc., and excludes names containing `USER`, `HOST`, etc. This works for common cases but is fragile:
+
+- Env var names that don't follow conventions get missed
+- Apps with multiple databases need multiple distinct mappings
+- PVC name patching for volumes is not yet wired into the standby deployment
+- Secret values containing database URLs need rewriting but currently aren't
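+
+To make the fragility concrete, here is a sketch of the kind of name-based heuristic described above. It is not the code in `backup.go`: the function `looksLikeDBNameVar` is a hypothetical name, and the pattern lists contain only the four examples named in this section (the real lists are longer).
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// Patterns taken from the examples in this section; the real lists are assumed to be longer.
+var includePatterns = []string{"DATABASE", "DB_NAME"}
+var excludePatterns = []string{"USER", "HOST"}
+
+// looksLikeDBNameVar guesses whether an env var holds a database name.
+// Exclusions win over inclusions, so DATABASE_USER is skipped.
+func looksLikeDBNameVar(envName string) bool {
+    upper := strings.ToUpper(envName)
+    for _, p := range excludePatterns {
+        if strings.Contains(upper, p) {
+            return false
+        }
+    }
+    for _, p := range includePatterns {
+        if strings.Contains(upper, p) {
+            return true
+        }
+    }
+    return false
+}
+
+func main() {
+    // APP_SCHEMA stands in for a name that does not follow conventions and is missed.
+    for _, name := range []string{"DB_NAME", "DATABASE_USER", "APP_SCHEMA"} {
+        fmt.Printf("%-14s %v\n", name, looksLikeDBNameVar(name))
+    }
+}
+```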
+
+**Proposed solution**: Add a `restore` field to app manifests that explicitly declares what needs patching:
+
+```yaml
+# In manifest.yaml
+restore:
+  databases:
+    - configKey: dbName           # Path in defaultConfig holding the database name
+      secretKeys: [dbUrl]         # Secret keys containing the db name (for URL rewriting)
+  volumes:
+    - listmonk-data              # PVC names to create colored copies of
+```
+
+**Implementation plan**: See `~/.claude/plans/floofy-waddling-locket.md` for detailed design.
+
+**Changes needed**:
+- Add `RestoreConfig` struct to `api/internal/apps/models.go`
+- Refactor `updateDatabaseRefsFromPlan` in `backup.go` to use manifest declarations (with heuristic fallback)
+- Refactor `deploySecretsToNamespace` to rewrite secret values for declared secret keys
+- Add `updateVolumeRefsFromManifest` for PVC name patching in standby kustomize
+- Add `restore` field to ~13 app manifests in `wild-directory/`
+- Document the `restore` field in `wild-directory/ADDING-APPS.md`
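+
+One possible shape for the `RestoreConfig` struct, mirroring the proposed `restore` YAML above. This is a sketch under assumptions: the field names follow the YAML keys, but the `RestoreDatabase` type name and the `ColoredVolumes` helper are hypothetical, and the final design may differ.
+
+```go
+package main
+
+import "fmt"
+
+// RestoreDatabase declares one database whose name must be patched on restore.
+type RestoreDatabase struct {
+    ConfigKey  string   `yaml:"configKey"`            // path in defaultConfig holding the database name
+    SecretKeys []string `yaml:"secretKeys,omitempty"` // secret keys whose values embed the database name
+}
+
+// RestoreConfig mirrors the proposed `restore` field of an app manifest.
+type RestoreConfig struct {
+    Databases []RestoreDatabase `yaml:"databases,omitempty"`
+    Volumes   []string          `yaml:"volumes,omitempty"` // PVC names to create colored copies of
+}
+
+// ColoredVolumes maps each declared PVC name to its "{pvcName}-{color}" copy.
+func (r RestoreConfig) ColoredVolumes(color string) map[string]string {
+    out := make(map[string]string, len(r.Volumes))
+    for _, pvc := range r.Volumes {
+        out[pvc] = pvc + "-" + color
+    }
+    return out
+}
+
+func main() {
+    cfg := RestoreConfig{
+        Databases: []RestoreDatabase{{ConfigKey: "dbName", SecretKeys: []string{"dbUrl"}}},
+        Volumes:   []string{"listmonk-data"},
+    }
+    // Prints: map[listmonk-data:listmonk-data-green]
+    fmt.Println(cfg.ColoredVolumes("green"))
+}
+```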
+
+### 2. Volume Patching in Standby Deployment
+
+**Problem**: The longhorn-native strategy creates colored volumes (`{pvcName}-{color}`) during restore, but the standby deployment's kustomize files still reference the original PVC names. The standby pods mount the original volumes, not the restored colored ones.
+
+**Solution**: Part of the deterministic patching work above. Use kustomize JSON patches to rewrite `claimName` references in the standby `kustomization.yaml`, similar to how database env vars are patched today.
+
+### 3. Longhorn Backup Cleanup
+
+**Problem**: `cleanupOldBackups` in `longhorn_native.go` is a no-op. Old Longhorn backups and engine-level snapshots accumulate.
+
+**Solution**: Implement retention-based cleanup that:
+- Keeps the N most recent backups per volume (configurable via retention policy)
+- Deletes old Longhorn Backup CRDs
+- Cleans up engine-level snapshots via Longhorn API (`snapshotDelete` + `snapshotPurge`)
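+
+The selection step of such a retention policy could look like the sketch below. It only decides which backups are expired; deleting the Backup resources and purging engine-level snapshots through the Longhorn API is left out. The `backupRecord` type and `selectExpired` function are hypothetical, and the sketch assumes timestamps in the fixed-width `20260308T231537Z` form used by the RecoveryPlan, which sort chronologically as strings.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// backupRecord is a hypothetical, simplified view of one Longhorn backup.
+type backupRecord struct {
+    Volume    string
+    Name      string
+    Timestamp string // fixed-width UTC timestamp, e.g. "20260308T231537Z"
+}
+
+// selectExpired returns the backups to delete so that only the `keep`
+// most recent backups remain for each volume.
+func selectExpired(backups []backupRecord, keep int) []backupRecord {
+    if keep < 0 {
+        keep = 0
+    }
+    byVolume := map[string][]backupRecord{}
+    for _, b := range backups {
+        byVolume[b.Volume] = append(byVolume[b.Volume], b)
+    }
+    var expired []backupRecord
+    for _, group := range byVolume {
+        // Newest first, so everything past index `keep` is expired.
+        sort.Slice(group, func(i, j int) bool { return group[i].Timestamp > group[j].Timestamp })
+        if len(group) > keep {
+            expired = append(expired, group[keep:]...)
+        }
+    }
+    // Map iteration order is random; sort by name for deterministic output.
+    sort.Slice(expired, func(i, j int) bool { return expired[i].Name < expired[j].Name })
+    return expired
+}
+
+func main() {
+    backups := []backupRecord{
+        {Volume: "listmonk-data", Name: "b1", Timestamp: "20260306T000000Z"},
+        {Volume: "listmonk-data", Name: "b2", Timestamp: "20260307T000000Z"},
+        {Volume: "listmonk-data", Name: "b3", Timestamp: "20260308T000000Z"},
+    }
+    for _, b := range selectExpired(backups, 2) {
+        fmt.Println("expired:", b.Name) // prints "expired: b1"
+    }
+}
+```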
+
+### 4. Longhorn Port-Forward Cleanup
+
+**Problem**: `getLonghornAPIEndpoint` starts a `kubectl port-forward` process on port 8080 that is never cleaned up. These orphan processes accumulate during backup/restore operations.
+
+**Solution**: Track the port-forward process and kill it when the operation completes. Use a `defer` pattern or add cleanup to the strategy's `Cleanup` method.
+
+### 5. Switch and Cleanup Phases
+
+**Status**: Implemented but not thoroughly tested end-to-end.
+
+**Remaining work**:
+- Test the full switch flow (DNS/ingress cutover from old namespace to standby)
+- Test cleanup of the old active namespace after switch
+- Handle edge cases: switch failure mid-way, partial cleanup
+
+### 6. Scheduled Backups
+
+**Status**: Data model exists (`BackupConfiguration` with schedules and retention), but no scheduler is implemented.
+
+**Remaining work**:
+- Implement a cron-based scheduler that reads `BackupConfiguration` from instance config
+- Trigger backups on schedule
+- Apply retention policies to clean up old backups
+- Web UI for configuring backup schedules
+
+### 7. Backup Verification
+
+**Status**: `Verify` method exists on the Strategy interface but implementations are minimal.
+
+**Remaining work**:
+- Implement meaningful verification (e.g., check pg_dump integrity, verify Longhorn backup exists and is restorable)
+- Optional scheduled verification per `VerificationConfig`
+
+### 8. Web UI Integration
+
+**Status**: API endpoints exist but no web UI for backup/restore.
+
+**Remaining work**:
+- Backup management page (list backups, trigger backup, view recovery plan)
+- Restore workflow (select backup, monitor restore progress, confirm switch)
+- Backup schedule configuration
+- Backup status in app detail view
+
+### 9. S3/Azure Backup Destinations
+
+**Status**: Types defined (`S3Config`, `AzureConfig`) but only NFS and local filesystem destinations are implemented.
+
+**Remaining work**:
+- Implement S3-compatible destination (for MinIO, AWS S3, etc.)
+- Implement Azure Blob Storage destination
+- Test with real cloud storage providers
--- a/docs/future/longhorn-disk-configuration.md
+++ b/docs/future/longhorn-disk-configuration.md
@@ -0,0 +1,265 @@
+# Longhorn Storage Disk Configuration
+
+## Current Problem
+
+Wild Cloud currently doesn't properly configure additional storage disks for Longhorn during node setup, causing Longhorn to use the OS disk instead of dedicated storage disks. This leads to:
+
+- **Insufficient storage capacity** - OS disks are typically small (100-200GB)
+- **Performance issues** - OS and storage I/O compete for the same disk
+- **Disk pressure warnings** - Longhorn marks nodes as unschedulable when OS disk fills up
+
+### Example Case
+
+Worker-1 in production has three disks:
+- `/dev/sdb`: 117GB (OS disk) - **Previously used by Longhorn** (now removed)
+- `/dev/nvme0n1`: 976GB NVMe - **Now configured as Longhorn storage** via ExistingVolumeConfig
+- `/dev/sda`: 1.9TB SATA (unused)
+
+## Root Cause
+
+### 1. Talos Doesn't Auto-Mount Additional Disks
+
+Talos Linux requires explicit configuration for additional disks. They are not mounted or made available automatically.
+
+### 2. Wild Cloud's Incomplete Configuration
+
+The current worker patch template (`api/internal/setup/cluster-nodes/patch.templates/worker.yaml`) only configures a self-referencing bind mount:
+
+```yaml
+machine:
+  kubelet:
+    extraMounts:
+      - destination: /var/lib/longhorn
+        type: bind
+        source: /var/lib/longhorn  # This just binds to itself!
+        options:
+          - bind
+          - rshared
+          - rw
+```
+
+This doesn't actually mount a different disk - it just creates a bind mount from `/var/lib/longhorn` to itself, which remains on the OS disk.
+
+### 3. No Disk Detection or Configuration
+
+Wild Cloud doesn't:
+- Detect available storage disks during node configuration
+- Configure the disk specified in `config.yaml` (e.g., `disk: /dev/sda`)
+- Provide UI/CLI options for selecting which disk to use for storage
+
+## Implemented Solution (Worker-1)
+
+Worker-1 has been manually configured as a reference implementation. The approach uses Talos ExistingVolumeConfig (v1.11+) combined with Longhorn DaemonSet hostPath volumes.
+
+### How It Works
+
+#### 1. Talos ExistingVolumeConfig
+
+Mounts an existing partition by UUID at `/var/mnt/longhorn-nvme` without reformatting:
+
+```yaml
+---
+apiVersion: v1alpha1
+kind: ExistingVolumeConfig
+name: longhorn-nvme
+discovery:
+  volumeSelector:
+    match: volume.partition_uuid == "54e9771a-74d6-4242-bdcb-9c2398ef5d91"
+mount: {}
+```
+
+This is applied as a second config document alongside the main machine config via `talosctl apply-config`. The NVMe mounts at `/var/mnt/longhorn-nvme` (Talos convention for ExistingVolumeConfig).
+
+**Key details:**
+- Requires Talos v1.11+ (ExistingVolumeConfig not available in v1.10)
+- The partition must already exist with a filesystem (XFS in our case)
+- Get the partition UUID with: `talosctl get discoveredvolumes <partition> -o yaml`
+- Validate with: `talosctl validate -m metal -c <config-file>`
+- Apply with: `talosctl apply-config -f <config-file> --mode no-reboot`
+
+#### 2. Longhorn DaemonSet hostPath Volume
+
+The longhorn-manager DaemonSet needs a hostPath volume to access the NVMe:
+
+```yaml
+# Volume definition
+volumes:
+- hostPath:
+    path: /var/mnt/longhorn-nvme    # Host path (where Talos mounts the NVMe)
+  name: longhorn-nvme
+
+# Volume mount in container
+volumeMounts:
+- mountPath: /var/mnt/longhorn-nvme  # Container path (MUST match host path)
+  mountPropagation: Bidirectional
+  name: longhorn-nvme
+```
+
+#### 3. Longhorn Node Disk Configuration
+
+The Longhorn node spec points to the mount path:
+
+```yaml
+spec:
+  disks:
+    nvme-disk:
+      allowScheduling: true
+      diskType: filesystem
+      path: /var/mnt/longhorn-nvme/
+      storageReserved: 0
+      tags:
+      - nvme
+```
+
+The `longhorn-disk.cfg` file at the root of the NVMe filesystem stores the disk identity:
+```json
+{"diskName":"nvme-disk","diskUUID":"3dd490e4-5c5f-4422-bcd8-f11d18580431","diskDriver":""}
+```
+
+#### 4. No kubelet extraMounts Needed
+
+The kubelet `extraMounts` are NOT needed for this approach. The DaemonSet hostPath volume handles the mount directly. kubelet extraMounts only affect the kubelet container's mount namespace and do not propagate to pod hostPath volumes.
+
+### Critical Lessons Learned
+
+#### Path Alignment Between Pods
+
+Longhorn uses multiple pod types that access the disk:
+- **longhorn-manager**: Accesses the disk via its DaemonSet hostPath volume
+- **instance-manager**: Accesses the host filesystem via `/host/` mount (host root)
+
+The disk path in the Longhorn node spec must work from BOTH perspectives:
+- longhorn-manager sees it via the DaemonSet volumeMount
+- instance-manager sees it via `/host/<path>`
+
+**The container mountPath MUST equal the host path.** If the DaemonSet maps host `/var/mnt/longhorn-nvme` to container `/var/lib/longhorn-nvme`, the longhorn-manager sees the NVMe at `/var/lib/longhorn-nvme` but the instance-manager sees the OS disk at `/host/var/lib/longhorn-nvme`. As a result, Longhorn reports the wrong storage capacity.
+
+The fix: use the same path everywhere (`/var/mnt/longhorn-nvme`).
+
+#### kubelet extraMounts Don't Affect hostPath Volumes
+
+Talos `machine.kubelet.extraMounts` add bind mounts to the kubelet container's CRI sandbox. They do NOT affect pod hostPath volume resolution. Pods with hostPath volumes always resolve from the actual host filesystem. Don't use extraMounts for Longhorn disk mounting.
+
+#### longhorn-disk.cfg Regeneration
+
+When the NVMe is unmounted (e.g., after a reboot before ExistingVolumeConfig is applied), the longhorn-manager may write a new `longhorn-disk.cfg` with a fresh UUID to the mount point on the EPHEMERAL partition. When the NVMe is remounted, the old `longhorn-disk.cfg` on the NVMe still has the correct UUID. If the cfg on the NVMe is overwritten with the wrong UUID, restore the correct one:
+
+```bash
+# Inside the longhorn-manager pod
+echo '{"diskName":"nvme-disk","diskUUID":"<correct-uuid>","diskDriver":""}' > /var/mnt/longhorn-nvme/longhorn-disk.cfg
+```
+
+#### Talos Version Requirements
+
+- ExistingVolumeConfig requires Talos v1.11+
+- Upgrade path: v1.10 -> v1.11 (one minor version at a time)
+- After upgrading a worker, the talosctl endpoint must use the node's actual current IP (check `kubectl get nodes -o wide`), not the config's `targetIp`
+- Control plane nodes on v1.10 can still manage workers on v1.11 via the VIP proxy
+
+#### Replica Recovery After Path Changes
+
+When changing the Longhorn disk path, some replicas may be left with stale `dataDirectoryName` values that don't exist on disk. These stopped replicas should be deleted so Longhorn creates fresh replacements that rebuild from healthy replicas on other nodes.
+
+## Proposed Automation
+
+### 1. Automatic Storage Disk Detection
+
+During node configuration, Wild Cloud should:
+
+```go
+// Detect available disks on the node
+disks := detectAvailableDisks(nodeIP)
+
+// Filter out OS disk and find suitable storage disks
+storageDisk := selectBestStorageDisk(disks, config.Disk)
+
+// Generate ExistingVolumeConfig for the storage disk partition
+if storageDisk != "" {
+    partitionUUID := getPartitionUUID(storageDisk)
+    // Add ExistingVolumeConfig document to the node's Talos config
+}
+```
+
+### 2. Configuration Schema Updates
+
+Add storage disk configuration to the node configuration:
+
+```yaml
+cluster:
+  nodes:
+    active:
+      worker-1:
+        role: worker
+        disk: /dev/sdb          # OS installation disk
+        storageDisk: /dev/nvme0n1  # Dedicated storage disk
+        storagePartitionUUID: 54e9771a-74d6-4242-bdcb-9c2398ef5d91
+        currentIp: 192.168.8.158
+        targetIp: 192.168.8.158
+```
+
+### 3. Longhorn DaemonSet Management
+
+When a storage disk is configured, Wild Cloud should:
+1. Add a hostPath volume to the longhorn-manager DaemonSet for the mount path
+2. Configure the Longhorn node spec with the disk path and tags
+3. Write the `longhorn-disk.cfg` to the disk if not present
+
+### 4. Web UI Enhancements
+
+Add storage disk selection during node configuration:
+
+- Show available disks when configuring a node
+- Allow selection of OS disk and storage disk separately
+- Validate disk selections (ensure they're different)
+- Show disk sizes to help users make informed choices
+
+### 5. Migration Path for Existing Clusters
+
+For clusters already using the wrong disk:
+
+1. Add new disk to Longhorn via ExistingVolumeConfig + DaemonSet update
+2. Evict replicas from OS disk to storage disk (disable scheduling, request eviction)
+3. Remove OS disk from Longhorn node spec
+
+## Implementation Steps
+
+### Phase 1: Detection and Configuration (Priority: High)
+1. Add disk detection to node configuration API
+2. Update node configuration to include storage disk selection
+3. Generate ExistingVolumeConfig documents for worker nodes with storage disks
+4. Manage longhorn-manager DaemonSet hostPath volumes
+5. Update Web UI to show disk options during node setup
+
+### Phase 2: Validation and Safety (Priority: Medium)
+1. Validate disk isn't already in use
+2. Check disk size meets minimum requirements (>100GB)
+3. Prevent selection of OS disk as storage disk
+4. Add warnings when storage disk isn't configured
+5. Validate `longhorn-disk.cfg` UUID consistency
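+
+A sketch of checks 1 to 3 from this phase. The `diskInfo` type, the `validateStorageDisk` function, and the `InUse` flag are hypothetical; how disk usage and size would actually be discovered (for example from Talos disk resources) is not covered here. The 100GB threshold comes from the list above and is taken as decimal gigabytes.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// minStorageBytes is the ">100GB" minimum from the list above (decimal GB assumed).
+const minStorageBytes uint64 = 100_000_000_000
+
+// diskInfo is a hypothetical summary of a candidate storage disk.
+type diskInfo struct {
+    Path      string
+    SizeBytes uint64
+    InUse     bool // already mounted or otherwise claimed
+}
+
+// validateStorageDisk rejects a storage disk that is the OS disk,
+// is already in use, or does not exceed the minimum size.
+func validateStorageDisk(osDisk string, storage diskInfo) error {
+    switch {
+    case storage.Path == osDisk:
+        return fmt.Errorf("storage disk %s is the OS disk", storage.Path)
+    case storage.InUse:
+        return fmt.Errorf("storage disk %s is already in use", storage.Path)
+    case storage.SizeBytes <= minStorageBytes:
+        return errors.New("storage disk must be larger than 100GB")
+    }
+    return nil
+}
+
+func main() {
+    nvme := diskInfo{Path: "/dev/nvme0n1", SizeBytes: 976_000_000_000}
+    fmt.Println(validateStorageDisk("/dev/sdb", nvme))                           // <nil>
+    fmt.Println(validateStorageDisk("/dev/sdb", diskInfo{Path: "/dev/sdb"})) // OS disk error
+}
+```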
+
+### Phase 3: Migration Tools (Priority: Low)
+1. Create tools to migrate existing Longhorn data between disks
+2. Add disk reconfiguration workflow for existing nodes
+3. Provide backup/restore path for disk changes
+
+## Testing Requirements
+
+1. **New Installation**: Verify storage disk is properly configured during initial setup
+2. **Upgrade Path**: Ensure existing clusters continue working without breaking changes
+3. **Multi-Disk Scenarios**: Test with various disk configurations (NVMe, SATA, mixed)
+4. **Failure Cases**: Test behavior when storage disk fails or is removed
+5. **Reboot Persistence**: Verify NVMe mount survives node reboots via ExistingVolumeConfig
+
+## References
+
+- [Talos Disk Configuration](https://www.talos.dev/v1.11/reference/configuration/)
+- [Talos ExistingVolumeConfig](https://www.talos.dev/v1.11/reference/configuration/extensions/existingvolumeconfig/)
+- [Longhorn Best Practices](https://longhorn.io/docs/latest/best-practices/)
+
+## Timeline
+
+- **Done**: Manual fix for worker-1 (ExistingVolumeConfig + DaemonSet + Longhorn node spec)
+- **Next**: Apply same pattern to worker-2 and worker-3
+- **v0.2.0**: Implement basic storage disk selection in UI
+- **v0.3.0**: Add automatic disk detection and validation
+- **v0.4.0**: Provide migration tools for existing clusters
--- a/docs/future/security.md
+++ b/docs/future/security.md