diff --git a/docs/app-states.md b/docs/app-states.md new file mode 100644 index 0000000..7fcea7f --- /dev/null +++ b/docs/app-states.md @@ -0,0 +1,1902 @@ +# Wild Cloud App Lifecycle: State and Operations + +## Overview + +Wild Cloud manages applications across multiple independent systems with different consistency guarantees. Understanding these systems, their interactions, and how app packages are structured is critical for reliable app lifecycle management. + +This document covers: +- **System architecture**: The three independent systems managing app state +- **User workflows**: Two distinct approaches (git-based vs Web UI) +- **App package structure**: How apps are defined in Wild Directory +- **State lifecycle**: Complete state transitions from add to delete +- **Operations**: How each lifecycle operation works across systems +- **Edge cases**: Common failure modes and automatic recovery + +## User Workflows + +Wild Cloud supports two fundamentally different workflows for managing app lifecycle: + +### Advanced Users: Git-Based Infrastructure-as-Code + +**Target Audience**: DevOps engineers, systems administrators, users comfortable with git and command-line tools. + +**Key Characteristics**: +- Instance data directory is a git repository +- Wild Directory tracked as upstream remote +- Manual edits tracked in git with commit messages +- Wild Directory updates merged using standard git workflows +- Full version control and audit trail +- SSH/command-line access to Wild Central device + +**Typical Workflow**: +```bash +# Clone instance repository +git clone user@wild-central:/var/lib/wild-central/instances/my-cloud + +# Make custom changes +vim apps/myapp/deployment.yaml +git commit -m "Increase CPU limits for production" + +# Merge upstream Wild Directory updates +git remote add wild-directory https://github.com/wildcloud/wild-directory.git +git fetch wild-directory +git merge wild-directory/main +# Resolve conflicts if needed + +# Deploy changes +wild app deploy myapp +``` + +**Philosophy**: Treat cluster configuration like application code - version controlled, reviewed, tested, and deployed through established git workflows. + +### Regular Users: Web UI-Based Management + +**Target Audience**: Non-technical users, small teams, users who prefer graphical interfaces. + +**Key Characteristics**: +- All management through Web UI or CLI (no SSH access) +- Configuration changes via forms (config.yaml, secrets.yaml) +- Wild Directory updates applied automatically with config merging +- Cannot directly edit manifest files (prevents divergence) +- Simplified workflow with automatic safety checks + +**Typical Workflow**: +1. Browse available apps in Web UI +2. Click "Add" to add app to instance +3. Configure via form fields (port, storage, domain, etc.) +4. Click "Deploy" to deploy to cluster +5. System notifies when Wild Directory updates available +6. Click "Update" to merge changes (config preserved) +7. Review changes in diff view +8. Click "Deploy" to apply updates + +**Philosophy**: Abstract away complexity - users manage apps like installing software, not like managing infrastructure code. 
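For regular users, the same lifecycle is also scriptable from the CLI without SSH access to the instance files. Below is a minimal sketch of the CLI path that mirrors the Web UI steps above, assuming a configured `wild` CLI and using `ghost` purely as an example app name:

```bash
# Steps 1-2: add the app; config.yaml and secrets.yaml are populated from the
# app's defaultConfig and defaultSecrets
wild app add ghost

# Step 3: adjust configuration through the Web UI form before deploying

# Step 4: deploy the compiled manifests to the cluster
wild app deploy ghost
```

Subsequent updates follow the same pattern: the system merges Wild Directory changes into the preserved configuration, and a redeploy applies them.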
+ +### Key Differences + +| Aspect | Advanced Users (Git) | Regular Users (Web UI) | +|--------|----------------------|------------------------| +| **Access** | SSH + command line | Web UI + CLI | +| **Manifest Editing** | Direct file editing | Via config forms only | +| **Version Control** | Git (full history) | System managed | +| **Wild Directory Updates** | Manual git merge | Automatic merge with review | +| **Customization** | Unlimited | Configuration-based only | +| **Drift** | Intentional (git-tracked) | Unintentional (reconcile) | +| **Collaboration** | Git branches/PRs | Shared Web UI access | +| **Rollback** | `git revert` | Re-deploy previous state | + +The rest of this document covers both workflows, with sections clearly marked for each user type where behavior differs. + +## System Architecture + +### The Multi-System Challenge + +Wild Cloud app state spans **three independent systems**: + +1. **Wild Directory** (Source of Truth) + - Location: `/path/to/wild-directory/{app-name}/` + - Consistency: Immutable, version controlled + - Purpose: Template definitions shared across all instances + +2. **Instance Data** (Local State) + - Location: `/path/to/data-dir/instances/{instance}/` + - Consistency: Immediately consistent, file-system based + - Purpose: Instance-specific configuration and compiled manifests + +3. **Kubernetes Cluster** (Runtime State) + - Location: Kubernetes API and etcd + - Consistency: Eventually consistent + - Purpose: Running application workloads + +**Critical Insight**: These systems have fundamentally different consistency models, creating inherent challenges for atomic operations across system boundaries. + +## State Components + +### 1. Wild Directory (Immutable Source) + +``` +wild-directory/ +└── {app-name}/ + ├── manifest.yaml # App metadata, dependencies, defaults + ├── kustomization.yaml # Kustomize configuration + ├── deployment.yaml # Kubernetes workload (template) + ├── service.yaml # Kubernetes service (template) + ├── ingress.yaml # Kubernetes ingress (template) + ├── namespace.yaml # Namespace definition (template) + ├── pvc.yaml # Storage claims (template) + ├── db-init-job.yaml # Database initialization (optional) + └── README.md # Documentation +``` + +**Characteristics**: +- Read-only during operations +- Contains gomplate template variables: `{{ .cloud.domain }}`, `{{ .app.port }}` +- Shared across all Wild Cloud instances +- Version controlled (git) + +#### App Manifest Structure + +The `manifest.yaml` file defines everything about an app: + +```yaml +name: myapp # App identifier (matches directory name) +is: myapp # Unique app type identifier +description: Brief description +version: 1.0.0 # Follow upstream versioning +icon: https://example.com/icon.svg + +requires: # Dependencies (optional) + - name: postgres # Dependency app type (matches 'is' field) + alias: db # Optional: reference name in templates + - name: redis # No alias = use 'redis' as reference + +defaultConfig: # Merged into instance config.yaml + namespace: myapp + image: myapp/myapp:latest + port: "8080" + storage: 10Gi + domain: myapp.{{ .cloud.domain }} + # Can reference dependencies: + dbHost: "{{ .apps.db.host }}" + redisHost: "{{ .apps.redis.host }}" + +defaultSecrets: # App's own secrets + - key: apiKey # Auto-generated random if no default + - key: dbUrl # Can use template with config/secrets + default: "postgresql://{{ .app.dbUser }}:{{ .secrets.dbPassword }}@{{ .app.dbHost }}/{{ .app.dbName }}" + +requiredSecrets: # Secrets from dependencies + - db.password # 
Format: . + - redis.auth # Copied from dependency's secrets +``` + +**Template Variable Resolution**: + +In `manifest.yaml` only: +- `{{ .cloud.* }}` - Infrastructure config (domain, smtp, etc.) +- `{{ .cluster.* }}` - Cluster config (IPs, versions, etc.) +- `{{ .operator.* }}` - Operator info (email) +- `{{ .app.* }}` - This app's config from defaultConfig +- `{{ .apps..* }}` - Dependency app's config (via requires mapping) +- `{{ .secrets.* }}` - This app's secrets (in defaultSecrets default only) + +In `*.yaml` resource templates: +- `{{ .* }}` - Only this app's config (all from defaultConfig) +- No access to secrets, cluster config, or other apps + +**Dependency Resolution**: +1. `requires` lists app types needed (matches `is` field) +2. At add time, user maps to actual installed apps +3. System stores mapping in `installedAs` field in instance manifest +4. Templates resolve `{{ .apps.db.* }}` using this mapping + +### 2. Instance Data (Local State) + +``` +data-dir/instances/{instance}/ +├── config.yaml # App configuration (user-editable) +├── secrets.yaml # App secrets (generated + user-editable) +├── kubeconfig # Cluster access credentials +├── apps/ +│ └── {app-name}/ +│ ├── manifest.yaml # Copy with installedAs mappings +│ ├── deployment.yaml # Compiled (variables resolved) +│ ├── service.yaml # Compiled +│ ├── ingress.yaml # Compiled +│ └── ... # All manifests compiled +└── operations/ + └── op_{action}_app_{app-name}_{timestamp}.json +``` + +#### config.yaml Structure + +```yaml +apps: + postgres: + namespace: postgres + image: pgvector/pgvector:pg15 + port: "5432" + storage: 10Gi + host: postgres.postgres.svc.cluster.local + # ... all defaultConfig values from manifest +``` + +#### secrets.yaml Structure + +```yaml +apps: + postgres: + password: + ghost: + dbPassword: + adminPassword: + smtpPassword: + # defaultSecrets + requiredSecrets +``` + +**Characteristics**: +- Immediately consistent (filesystem) +- File-locked during updates (`config.yaml.lock`, `secrets.yaml.lock`) +- Version controlled (recommended but optional) +- User-editable (advanced users can SSH and modify) + +### 3. 
Kubernetes Cluster (Runtime State) + +``` +Kubernetes Cluster +└── Namespace: {app-name} + ├── Deployment: {app-name}-* + ├── ReplicaSet: {app-name}-* + ├── Pod: {app-name}-* + ├── Service: {app-name} + ├── Ingress: {app-name} + ├── PVC: {app-name}-pvc + ├── Secret: {app-name}-secrets + ├── ConfigMap: {app-name}-* (optional) + └── Job: {app-name}-db-init (optional) +``` + +**Namespace Lifecycle**: +- `Active`: Normal operating state +- `Terminating`: Deletion in progress (may take time) +- Finalizers: `[kubernetes]` prevents deletion until resources cleaned + +**Characteristics**: +- Eventually consistent (distributed system) +- Cascade deletion: Deleting namespace deletes all child resources +- Finalizers block deletion until cleared +- May enter stuck states requiring automatic intervention + +#### Kubernetes Resource Labeling + +All Wild Cloud apps use standard labels automatically applied via Kustomize: + +```yaml +# In kustomization.yaml +labels: + - includeSelectors: true # Apply to resources AND selectors + pairs: + app: myapp # App name + managedBy: kustomize + partOf: wild-cloud +``` + +This auto-expands selectors: +```yaml +# You write: +selector: + component: web + +# Kustomize expands to: +selector: + app: myapp + managedBy: kustomize + partOf: wild-cloud + component: web +``` + +**Important**: Use simple component labels (`component: web`), not Helm-style labels (`app.kubernetes.io/name`). + +### 4. External System State (Kubernetes Controller-Managed) + +These systems are not directly controlled by Wild Cloud API but are integral to app lifecycle: + +#### External DNS (via external-dns controller) + +**Location**: External DNS provider (Cloudflare, Route53, etc.) + +**Trigger**: Ingress with external-dns annotations +```yaml +annotations: + external-dns.alpha.kubernetes.io/target: {{ .domain }} +``` + +**State Flow**: +``` +1. Deploy creates Ingress with annotations +2. external-dns controller watches Ingress resources +3. Controller creates DNS records at provider +4. DNS propagates (eventual consistency, 30-300 seconds) +5. Domain resolves to cluster load balancer IP +``` + +**Lifecycle**: +- **Create**: Automatic when Ingress deployed +- **Update**: Automatic when Ingress annotations change +- **Delete**: Automatic when Ingress deleted (DNS records cleaned up) + +**Eventual Consistency**: DNS changes take 30s-5min to propagate globally. + +**Edge Cases**: +- DNS propagation delay (app deployed but domain not resolving yet) +- Provider rate limits (too many updates) +- Stale records if external-dns controller down during deletion +- Multiple ingresses with same hostname (last write wins) + +**Debugging**: +```bash +# View external-dns logs +kubectl logs -n external-dns deployment/external-dns + +# Check what DNS records external-dns is managing +kubectl get ingress -A -o yaml | grep external-dns +``` + +#### TLS Certificates (via cert-manager) + +**Location**: Both cluster (Kubernetes Secret) and external CA (Let's Encrypt) + +**Trigger**: Ingress with cert-manager annotations +```yaml +annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod +spec: + tls: + - hosts: + - myapp.cloud.example.com + secretName: myapp-tls +``` + +**State Flow**: +``` +1. Deploy creates Ingress with TLS config +2. cert-manager creates Certificate resource +3. cert-manager creates Order with ACME DNS-01 challenge +4. cert-manager updates DNS via provider (for challenge) +5. Let's Encrypt validates domain ownership via DNS +6. cert-manager receives certificate and stores in Secret +7. 
Ingress controller uses Secret for TLS termination +``` + +**Lifecycle**: +- **Create**: Automatic when Ingress with cert-manager annotation deployed +- **Renew**: Automatic (starts 30 days before expiry) +- **Delete**: Secret deleted with namespace, CA record persists + +**Eventual Consistency**: Certificate issuance takes 30s-2min (DNS challenge + CA validation). + +**Edge Cases**: +- DNS-01 challenge timeout (DNS not propagated yet) +- Rate limits (Let's Encrypt: 50 certs/domain/week, 5 failed validations/hour) +- Expired certificates (cert-manager should auto-renew but may fail) +- Namespace stuck terminating (cert-manager challenges may block finalizers) + +**Debugging**: +```bash +# View certificates and their status +kubectl get certificate -n myapp +kubectl describe certificate myapp-tls -n myapp + +# View ACME challenge progress +kubectl get certificaterequest -n myapp +kubectl get order -n myapp +kubectl get challenge -n myapp + +# Check cert-manager logs +kubectl logs -n cert-manager deployment/cert-manager +``` + +#### Wildcard Certificates (Shared Resource Pattern) + +Wild Cloud uses **two shared wildcard certificates** to avoid rate limits: + +**1. Public Wildcard Certificate** (`wildcard-wild-cloud-tls`) +```yaml +# Created once in cert-manager namespace +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: wildcard-wild-cloud-tls + namespace: cert-manager +spec: + secretName: wildcard-wild-cloud-tls + dnsNames: + - "*.cloud.example.com" +``` + +**2. Internal Wildcard Certificate** (`wildcard-internal-wild-cloud-tls`) +```yaml +# For internal-only apps not exposed via external-dns +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: wildcard-internal-wild-cloud-tls + namespace: cert-manager +spec: + secretName: wildcard-internal-wild-cloud-tls + dnsNames: + - "*.internal.cloud.example.com" +``` + +**Usage Pattern**: +- **Public apps** (exposed externally): Use `wildcard-wild-cloud-tls` + - Domain: `myapp.cloud.example.com` + - Has external-dns annotation (creates public DNS record) + +- **Internal apps** (cluster-only): Use `wildcard-internal-wild-cloud-tls` + - Domain: `myapp.internal.cloud.example.com` + - No external-dns annotation (only accessible within cluster/LAN) + - Examples: Docker registry, internal dashboards + +**Shared Pattern**: +1. One wildcard cert per domain covers all subdomains +2. Apps reference via `tlsSecretName: wildcard-wild-cloud-tls` (or `wildcard-internal-wild-cloud-tls`) +3. Deploy operation copies secret from cert-manager namespace to app namespace +4. 
All apps on same domain share the certificate + +**Advantages**: +- Avoids Let's Encrypt rate limits (50 certs/domain/week) +- Faster deployment (no ACME challenge per app) +- Survives app delete/redeploy (cert persists in cert-manager namespace) + +**Trade-offs**: +- All apps on same domain share same cert (if compromised, affects all apps) +- Cert must be copied to each app namespace (handled by Deploy operation) + +**Copy Operation**: +```go +// In apps.Deploy() +// Copies both wildcard certs if referenced by ingress +wildcardSecrets := []string{"wildcard-wild-cloud-tls", "wildcard-internal-wild-cloud-tls"} +for _, secretName := range wildcardSecrets { + if bytes.Contains(ingressContent, []byte(secretName)) { + utilities.CopySecretBetweenNamespaces(kubeconfigPath, secretName, "cert-manager", appName) + } +} +``` + +#### Load Balancer IPs (via MetalLB) + +**Location**: MetalLB controller state + cluster network + +**Trigger**: Service with `type: LoadBalancer` +```yaml +apiVersion: v1 +kind: Service +metadata: + name: traefik + namespace: traefik +spec: + type: LoadBalancer + loadBalancerIP: 192.168.8.80 # Optional: request specific IP +``` + +**State Flow**: +``` +1. Service created with type: LoadBalancer +2. MetalLB controller assigns IP from configured pool +3. MetalLB announces IP via ARP (Layer 2) or BGP (Layer 3) +4. Network routes traffic to assigned IP +5. kube-proxy on nodes routes to service endpoints +``` + +**Lifecycle**: +- **Create**: Automatic when LoadBalancer Service deployed +- **Persist**: IP sticky (same IP across pod restarts) +- **Delete**: IP returned to pool when Service deleted + +**Eventual Consistency**: ARP cache clearing takes 0-60 seconds. + +**Edge Cases**: +- IP pool exhaustion (no IPs available from MetalLB pool) +- IP conflicts (pool overlaps with DHCP or static assignments) +- ARP cache issues (old MAC address cached, traffic fails until cleared) +- Split-brain scenarios (multiple nodes announce same IP) + +**Debugging**: +```bash +# View services with assigned IPs +kubectl get svc -A --field-selector spec.type=LoadBalancer + +# Check MetalLB IP pools +kubectl get ipaddresspool -n metallb-system + +# View MetalLB controller state +kubectl logs -n metallb-system deployment/controller +kubectl logs -n metallb-system daemonset/speaker +``` + +### Cross-System Dependency Chain + +A complete app deployment triggers this cascade across systems: + +``` +Wild Cloud API (Deploy) + ↓ +Kubernetes (kubectl apply) + ↓ +Namespace + Resources Created + ↓ +┌─────────────────┬──────────────────┬─────────────────┐ +│ │ │ │ +external-dns cert-manager MetalLB +watches Ingress watches Ingress watches Service + ↓ ↓ ↓ +DNS Provider Let's Encrypt Network ARP/BGP +(Cloudflare) (ACME CA) (Local Network) + ↓ ↓ ↓ +CNAME Record TLS Certificate IP Address +Created Issued Announced +(30s-5min) (30s-2min) (0-60s) + ↓ ↓ ↓ +Domain Resolves + HTTPS Works + Traffic Routes +``` + +**Total Time to Fully Operational**: +- Kubernetes resources: 5-30 seconds (image pull + pod start) +- DNS propagation: 30 seconds - 5 minutes +- TLS certificate: 30 seconds - 2 minutes +- Network ARP: 0-60 seconds + +**Worst case**: 5-7 minutes from deploy command to app fully accessible via HTTPS. 
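While these controllers converge, each layer can be checked independently rather than waiting blindly. A minimal verification sketch, assuming an example app `myapp` exposed at `myapp.cloud.example.com` and that `dig` and `curl` are available where the checks run:

```bash
# Kubernetes layer: workloads, endpoints, ingress
kubectl get pods -n myapp
kubectl get endpoints -n myapp
kubectl get ingress -n myapp

# TLS layer: per-app Certificate (if used) or the copied wildcard secret
kubectl get certificate -n myapp
kubectl get secrets -n myapp

# DNS layer: has the record propagated yet?
dig +short myapp.cloud.example.com

# End to end: expect an HTTP status line once all layers have converged
curl -sSI https://myapp.cloud.example.com | head -n 1
```

If the Kubernetes resources are healthy but the last two checks fail, the deployment is usually still inside the DNS and TLS propagation window described above.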
+ +## App Lifecycle States + +### State 0: NOT_ADDED + +``` +Wild Directory: {app-name}/ exists (templates) +Instance Apps: (does not exist) +config.yaml: (no apps.{app-name} entry) +secrets.yaml: (no apps.{app-name} entry) +Cluster: (no namespace) +``` + +**Invariants**: +- App can be added from Wild Directory +- No local or cluster state exists + +--- + +### State 1: ADDED + +**After**: `wild app add {app-name}` + +``` +Wild Directory: {app-name}/ (unchanged) +Instance Apps: {app-name}/ created with compiled manifests + manifest.yaml has installedAs dependency mappings +config.yaml: apps.{app-name} populated from defaultConfig +secrets.yaml: apps.{app-name} populated with generated secrets +Cluster: (no namespace yet) +``` + +**Operations**: +1. Read `wild-directory/{app-name}/manifest.yaml` +2. Resolve gomplate variables using instance config +3. Generate random secrets for `defaultSecrets` (if no default provided) +4. Copy secrets from dependencies for `requiredSecrets` +5. Compile templates → write to `instance/apps/{app-name}/` +6. Append to `config.yaml` (file-locked) +7. Append to `secrets.yaml` (file-locked) + +**Invariants**: +- Local state consistent: config, secrets, and compiled manifests all exist +- Cluster state empty: nothing deployed yet +- Idempotent: Can re-add without side effects (overwrites local state) + +--- + +### State 2: DEPLOYING + +**During**: `wild app deploy {app-name}` + +``` +Wild Directory: (unchanged) +Instance Apps: (unchanged) +config.yaml: (unchanged) +secrets.yaml: (unchanged) +Cluster: namespace: Active (or being created) + resources: Creating/Pending + secret/{app-name}-secrets: Created +``` + +**Operations**: +1. Check namespace status (pre-flight check) +2. Create/update namespace (idempotent) +3. Create Kubernetes secret from `secrets.yaml` (overwrite if exists) +4. Copy dependency secrets (e.g., postgres-secrets) +5. Copy TLS certificates (e.g., wildcard certs from cert-manager) +6. Apply manifests: `kubectl apply -k instance/apps/{app-name}/` + +**Invariants Being Established**: +- Namespace must be `Active` or `NotFound` (not `Terminating`) +- Kubernetes secret created before workloads +- All dependencies deployed first + +--- + +### State 3: DEPLOYED + +**After successful deploy**: + +``` +Wild Directory: (unchanged) +Instance Apps: (unchanged) +config.yaml: (unchanged) +secrets.yaml: (unchanged) +Cluster: namespace: Active + deployment: Ready (replicas running) + pods: Running + service: Available (endpoints ready) + ingress: Ready (external-dns created DNS) + pvc: Bound (storage provisioned) + secret: Exists +``` + +**Invariants**: +- **Strong consistency**: Local state matches cluster intent +- All pods healthy and running +- Services have endpoints +- DNS records created (via external-dns) +- TLS certificates valid (via cert-manager) + +**Health Checks**: +```bash +kubectl get pods -n {app-name} +kubectl get ingress -n {app-name} +kubectl get pvc -n {app-name} +``` + +--- + +### State 3a: UPDATING (Configuration/Secret Changes) + +**Scenario**: User modifies config.yaml or secrets.yaml and redeploys. + +**Operations**: + +#### Update Configuration Only +``` +1. User edits config.yaml (e.g., change port, storage size) +2. User runs: wild app deploy {app-name} +3. System re-compiles templates with new config +4. System applies updated manifests: kubectl apply -k +5. 
Kubernetes performs rolling update (if applicable) +``` + +**State Flow**: +``` +config.yaml: Modified (new values) +Instance Apps: Templates re-compiled with new config +secrets.yaml: (unchanged) +Cluster: Rolling update (pods recreated with new config) +``` + +**Important**: Config changes trigger template recompilation. The `.package` directory preserves original templates, but deployed manifests are regenerated. + +#### Update Secrets Only +``` +1. User edits secrets.yaml (e.g., change password) +2. User runs: wild app deploy {app-name} +3. System deletes old Kubernetes secret +4. System creates new Kubernetes secret with updated values +5. Pods must be restarted to pick up new secrets +``` + +**State Flow**: +``` +config.yaml: (unchanged) +Instance Apps: (unchanged - no template changes) +secrets.yaml: Modified (new secrets) +Cluster: Secret updated, pods may need manual restart +``` + +**Critical**: Most apps don't auto-reload secrets. May require manual pod restart: +```bash +kubectl rollout restart deployment/{app-name} -n {app-name} +``` + +#### Update Both Config and Secrets +``` +1. User edits both config.yaml and secrets.yaml +2. User runs: wild app deploy {app-name} +3. System re-compiles templates + updates secrets +4. System applies manifests (rolling update) +5. Pods restart with new config and secrets +``` + +--- + +### State 3b: UPDATING (Manifest/Template Changes) + +**Scenario**: User directly edits Kustomize files in instance apps directory. + +This workflow differs significantly for **advanced users** (git-based) vs **regular users** (Web UI/CLI). + +#### Advanced User Workflow (Git-Based) + +**Instance directory as git repository**: +```bash +# Instance data directory is a git repo +cd /var/lib/wild-central/instances/my-cloud +git status +git log +``` + +**Operations**: +``` +1. User SSHs to Wild Central device (or uses VSCode Remote SSH) +2. User edits: apps/{app-name}/deployment.yaml +3. User commits changes: git add . && git commit -m "Custom resource limits" +4. User runs: wild app deploy {app-name} OR kubectl apply -k apps/{app-name}/ +5. Changes applied to cluster +``` + +**State Flow**: +``` +Wild Directory: (unchanged - original templates intact) +Instance Apps: Modified and git-tracked (intentional divergence) +config.yaml: (unchanged) +secrets.yaml: (unchanged) +Cluster: Updated with manual changes +.package/: (unchanged - preserves original templates) +Git History: Tracks all manual edits with commit messages +``` + +**Benefits**: +- **Version Control**: Full audit trail of all changes +- **Rollback**: `git revert` to undo changes +- **Infrastructure as Code**: Instance config managed like application code +- **Collaboration**: Multiple admins can work on same cluster config +- **Merge Workflow**: Wild Directory updates handled as upstream merges + +**Example Git Workflow**: +```bash +# Make custom changes +vim apps/myapp/deployment.yaml +git add apps/myapp/deployment.yaml +git commit -m "Increase CPU limit for production load" + +# Deploy changes +wild app deploy myapp + +# Later, merge upstream Wild Directory updates +git pull upstream main # Pull Wild Directory changes +git merge upstream/main # Merge with local customizations +# Resolve any conflicts +git push origin main +``` + +#### Regular User Workflow (Web UI/CLI) + +**Operations**: +``` +1. User cannot directly edit manifests (no SSH access) +2. User modifies config.yaml or secrets.yaml via Web UI +3. System re-compiles templates automatically +4. 
User deploys via Web UI +``` + +**State Flow**: +``` +Wild Directory: (unchanged) +Instance Apps: Re-compiled from templates (stays in sync) +config.yaml: Modified via Web UI +secrets.yaml: Modified via Web UI +Cluster: Updated via Web UI deploy +``` + +**Protection**: +- No manual manifest editing (prevents divergence) +- All changes through config/secrets (stays synchronized) +- Wild Directory updates apply cleanly (no merge conflicts) + +--- + +### State 3c: UPDATING (Wild Directory Version Update) + +**Scenario**: Wild Directory app updated (bug fix, new version, new features). + +This workflow differs significantly for **advanced users** (git-based) vs **regular users** (Web UI/CLI). + +#### Advanced User Workflow (Git Merge) + +**Wild Directory as upstream remote**: +```bash +# Add Wild Directory as upstream remote (one-time setup) +git remote add wild-directory https://github.com/wildcloud/wild-directory.git +git fetch wild-directory +``` + +**Detection**: +```bash +# Check for upstream updates +git fetch wild-directory +git log HEAD..wild-directory/main --oneline + +# See what changed in specific app +git diff HEAD wild-directory/main -- apps/myapp/ +``` + +**Merge Operations**: +```bash +# 1. Fetch latest Wild Directory changes +git fetch wild-directory + +# 2. Merge upstream changes with local customizations +git merge wild-directory/main + +# 3. Resolve any conflicts +# Git will show conflicts in manifest files, config, etc. +# User resolves conflicts preserving their custom changes + +# 4. Test changes +wild app deploy myapp --dry-run + +# 5. Deploy updated app +wild app deploy myapp + +# 6. Commit merge +git push origin main +``` + +**Conflict Resolution Example**: +```yaml +# Conflict in apps/myapp/deployment.yaml +<<<<<<< HEAD + resources: + limits: + cpu: "2000m" # Local customization + memory: "4Gi" # Local customization +======= + resources: + limits: + cpu: "1000m" # Wild Directory default + memory: "2Gi" # Wild Directory default +>>>>>>> wild-directory/main +``` + +User resolves by keeping their custom values or adopting new defaults. + +**Benefits**: +- **Full Control**: User decides what to merge and when +- **Conflict Resolution**: Git's standard merge tools handle conflicts +- **Audit Trail**: Git history shows what changed and why +- **Selective Updates**: Can cherry-pick specific app updates + +**State Flow**: +``` +Wild Directory: (tracked as remote, fetched regularly) +Instance Apps: Merged with git (custom + upstream changes) +config.yaml: Manually merged (conflicts resolved by user) +secrets.yaml: Preserved (not in Wild Directory) +.package/: Updated after merge +Git History: Shows merge commits and conflict resolutions +Cluster: Updated when user deploys after merge +``` + +#### Regular User Workflow (Automated Merge) + +**Detection Methods**: + +**Method 1: Compare .package with Wild Directory** +```bash +# System compares checksums/timestamps +diff -r instance/apps/{app-name}/.package/ wild-directory/{app-name}/ +``` + +If differences exist: New version available in Wild Directory. + +**Method 2: Check manifest version field** +```yaml +# wild-directory/{app-name}/manifest.yaml +version: 2.0.0 + +# instance/apps/{app-name}/manifest.yaml +version: 1.0.0 # Older version +``` + +**Safe Update (Preserves Local Config)** +``` +1. System detects Wild Directory changes +2. User initiates update (via UI or CLI) +3. 
System backs up current instance state: + - Saves current config.yaml section + - Saves current secrets.yaml section + - Saves current manifest.yaml (with installedAs mappings) +4. System re-adds app from Wild Directory: + - Copies new templates to instance/apps/{app-name}/ + - Updates .package/ with new source files + - Merges new defaultConfig with existing config + - Preserves existing secrets (doesn't regenerate) +5. System re-compiles templates with preserved config +6. User reviews changes (diff shown in UI) +7. User deploys updated app +``` + +**State Flow**: +``` +Wild Directory: (unchanged - new version available) +Instance Apps: Updated templates + recompiled manifests +config.yaml: Merged (new fields added, existing preserved) +secrets.yaml: (unchanged - existing secrets preserved) +.package/: Updated with new source files +Cluster: (not changed until user deploys) +``` + +**Merge Strategy for Config**: +```yaml +# Old config.yaml (version 1.0.0) +apps: + myapp: + port: "8080" + storage: 10Gi + +# New Wild Directory manifest (version 2.0.0) adds "replicas" field +defaultConfig: + port: "8080" + storage: 10Gi + replicas: "3" # New field + +# Merged config.yaml (after update) +apps: + myapp: + port: "8080" # Preserved + storage: 10Gi # Preserved + replicas: "3" # Added +``` + +**Breaking Changes**: +If Wild Directory update has breaking changes (renamed fields, removed features): +- System cannot auto-merge +- User must manually reconcile +- UI shows conflicts and requires resolution + +#### Destructive Update (Fresh Install) +``` +1. User deletes app: wild app delete {app-name} +2. User re-adds app: wild app add {app-name} +3. Config and secrets regenerated (loses customizations) +4. User must manually reconfigure +``` + +**Use When**: +- Major version upgrade with breaking changes +- Significant manifest restructuring +- User wants clean slate + +--- + +### State 3d: DEPLOYED with Drift + +**Scenario**: Cluster state diverged from instance state. + +This state has different meanings for **advanced users** vs **regular users**. + +#### Advanced Users: Intentional Drift (Git-Tracked) + +**Scenario**: User made direct cluster changes and committed them to git. + +**Example**: +```bash +# User edits deployment directly +kubectl edit deployment myapp -n myapp + +# User documents change in git +vim apps/myapp/deployment.yaml # Update manifest to match +git add apps/myapp/deployment.yaml +git commit -m "Emergency CPU limit increase for production incident" +``` + +**State Flow**: +``` +Instance Apps: Updated and git-tracked (intentional) +Git History: Documents why change was made +Cluster: Matches updated instance state +``` + +**This is NOT drift** - it's infrastructure-as-code in action. The instance directory reflects the true desired state, tracked in git. + +**Reconciliation**: Not needed (intentional state). + +#### Regular Users: Unintentional Drift + +**Scenario**: Cluster state diverged from instance state (unexpected). 
+ +**Causes**: +- User ran `kubectl edit` directly (shouldn't happen - no SSH access) +- Another admin modified cluster resources +- Partial deployment failure (some resources applied, others failed) +- Kubernetes controller modified resources (e.g., HPA changed replicas) + +**Detection**: +```bash +# Compare desired vs actual state +kubectl diff -k instance/apps/{app-name}/ + +# Or use declarative check +kubectl apply -k instance/apps/{app-name}/ --dry-run=server +``` + +**State Flow**: +``` +Instance Apps: Unchanged (desired state) +Cluster: Diverged (actual state differs) +``` + +**Reconciliation**: +``` +1. User runs: wild app deploy {app-name} +2. kubectl apply re-applies desired state +3. Kubernetes reconciles differences (three-way merge) +4. Cluster returns to matching instance state +``` + +**Important**: `kubectl apply` is idempotent and safe for reconciliation. + +#### Distinguishing Intentional vs Unintentional Drift + +**Advanced users (git-based)**: +- Check git status: `git status` shows no uncommitted changes → intentional +- Check git log: `git log -- apps/myapp/` shows recent commits → intentional +- Cluster state matches git-tracked files → intentional + +**Regular users (Web UI)**: +- Any divergence is unintentional (no way to edit manifests directly) +- Reconcile immediately by redeploying + +--- + +### State 4: DELETING + +**During**: `wild app delete {app-name}` + +``` +Wild Directory: (unchanged) +Instance Apps: Being removed +config.yaml: apps.{app-name} being removed +secrets.yaml: apps.{app-name} being removed +Cluster: namespace: Active → Terminating + resources: Deleting (cascade) +``` + +**Operations** (Two-Phase): + +**Phase 1: Cluster Cleanup (Best Effort)** +```bash +# Try graceful deletion +kubectl delete namespace {app-name} --timeout=30s --wait=true + +# If stuck, force cleanup +kubectl patch namespace {app-name} --type=merge -p '{"metadata":{"finalizers":null}}' +``` + +**Phase 2: Local Cleanup (Always Succeeds)** +```bash +rm -rf instance/apps/{app-name}/ +yq delete config.yaml '.apps.{app-name}' +yq delete secrets.yaml '.apps.{app-name}' +``` + +**Critical Design Decision**: +- **Don't wait indefinitely for cluster cleanup** +- Local state is immediately consistent after Phase 2 +- Cluster cleanup is eventually consistent + +--- + +### State 5: DELETED + +**After successful delete**: + +``` +Wild Directory: (unchanged - still available for re-add) +Instance Apps: (removed) +config.yaml: (no apps.{app-name} entry) +secrets.yaml: (no apps.{app-name} entry) +Cluster: namespace: NotFound + all resources: (removed) +``` + +**Invariants**: +- Local state has no trace of app +- Cluster has no namespace or resources +- App can be re-added cleanly + +--- + +### State X: STUCK_TERMINATING (Edge Case) + +**Problematic state when namespace won't delete**: + +``` +Wild Directory: (unchanged) +Instance Apps: May or may not exist (depends on delete progress) +config.yaml: May or may not have entry +secrets.yaml: May or may not have entry +Cluster: namespace: Terminating (STUCK!) + finalizers: Blocking deletion + resources: Some exist, some terminating +``` + +**Why This Happens**: +1. Resources with custom finalizers +2. Webhooks or admission controllers blocking deletion +3. Network issues during deletion +4. 
StatefulSet with orphaned PVCs + +**Resolution**: +- Handled automatically by Deploy pre-flight checks +- Force cleanup finalizers after retries +- User never needs manual intervention + +## System Boundaries and Consistency + +### Consistency Guarantees by System + +| System | Consistency Model | Synchronization | +|--------|------------------|-----------------| +| Wild Directory | Immutable | Read-only | +| Instance Data | Immediately Consistent | File locks | +| Kubernetes | Eventually Consistent | Reconciliation loops | + +### Cross-System Operations + +#### Delete Operation (Spans 2 Systems) + +``` +┌─────────────────────────────────────────────────┐ +│ Delete Operation Timeline │ +├─────────────────────────────────────────────────┤ +│ │ +│ T=0s: kubectl delete namespace (initiated) │ +│ └─ Cluster enters eventual consistency │ +│ │ +│ T=1s: rm apps/{app-name}/ (completes) │ +│ yq delete config.yaml (completes) │ +│ yq delete secrets.yaml (completes) │ +│ └─ Local state immediately consistent │ +│ │ +│ T=2s: Return success to user │ +│ │ +│ T=30s: Namespace still terminating in cluster │ +│ └─ This is OK! Eventually consistent │ +│ │ +│ T=60s: Cluster cleanup completes │ +│ └─ Both systems now consistent │ +└─────────────────────────────────────────────────┘ +``` + +**Key Insight**: We accept temporary inconsistency at the system boundary. + +#### Deploy Operation (Spans 2 Systems) + +``` +┌─────────────────────────────────────────────────┐ +│ Deploy Operation Timeline │ +├─────────────────────────────────────────────────┤ +│ │ +│ T=0s: Check namespace status (pre-flight) │ +│ If Terminating: Force cleanup + retry │ +│ │ +│ T=5s: Create namespace (idempotent) │ +│ Create secrets │ +│ Apply manifests │ +│ └─ Cluster enters reconciliation │ +│ │ +│ T=30s: Pods starting, images pulling │ +│ │ +│ T=60s: All pods Running, services ready │ +│ └─ Deployment successful │ +└─────────────────────────────────────────────────┘ +``` + +**Key Insight**: Deploy owns making cluster match local state. + +## Idempotency and Safety + +### Idempotent Operations + +| Operation | Idempotent? | Why | +|-----------|-------------|-----| +| `app add` | ✅ Yes | Overwrites local state | +| `app deploy` | ✅ Yes | `kubectl apply` is idempotent | +| `app delete` | ✅ Yes | `kubectl delete --ignore-not-found` | + +### Non-Idempotent Danger Zones + +1. **Secret Generation**: Regenerating secrets breaks running apps + - Solution: Only generate if key doesn't exist + +2. **Database Initialization**: Running twice can cause conflicts + - Solution: Job uses `CREATE IF NOT EXISTS`, `ALTER IF EXISTS` + +3. **Finalizer Removal**: Skips cleanup logic + - Solution: Only as last resort after graceful attempts + +## Edge Cases and Error Handling + +### Edge Case 1: Namespace Stuck Terminating + +**Scenario**: Previous delete left namespace in Terminating state. + +**Detection**: +```bash +kubectl get namespace {app-name} -o jsonpath='{.status.phase}' +# Returns: "Terminating" +``` + +**Resolution** (Automatic): +1. Deploy pre-flight check detects Terminating state +2. Attempts force cleanup: removes finalizers +3. Waits 5 seconds +4. Retries up to 3 times +5. If still stuck, returns clear error message + +**Code**: +```go +if status == "Terminating" { + forceNamespaceCleanup(kubeconfigPath, appName) + time.Sleep(5 * time.Second) + // Retry deploy +} +``` + +### Edge Case 2: Concurrent Delete + Deploy + +**Scenario**: User deletes app, then immediately redeploys. 
+ +**Timeline**: +``` +T=0s: Delete initiated +T=1s: Local state cleaned up +T=2s: User clicks "Deploy" +T=3s: Deploy detects Terminating namespace +T=4s: Deploy force cleanups and retries +T=10s: Deploy succeeds +``` + +**Why This Works**: +- Delete doesn't block on cluster cleanup +- Deploy handles any namespace state +- Eventual consistency at system boundary + +### Edge Case 3: Dependency Not Deployed + +**Scenario**: User tries to deploy app requiring postgres, but postgres isn't deployed. + +**Current Behavior**: Deployment succeeds but pods crash (CrashLoopBackOff). + +**Detection**: +```bash +kubectl get pods -n {app-name} +# Shows: CrashLoopBackOff +kubectl logs {pod-name} -n {app-name} +# Shows: "Connection refused to postgres.postgres.svc.cluster.local" +``` + +**Future Enhancement**: Pre-flight dependency check in Deploy operation. + +### Edge Case 4: Secrets Out of Sync + +**Scenario**: User manually updates password in Kubernetes but not in `secrets.yaml`. + +**Impact**: +- Next deploy overwrites Kubernetes secret +- App may lose access if password changed elsewhere + +**Best Practice**: Always update `secrets.yaml` as source of truth. + +### Edge Case 5: PVC Retention + +**Scenario**: Delete removes namespace but PVCs may persist (depends on reclaim policy). + +**Behavior**: +- PVC with `ReclaimPolicy: Retain` stays after delete +- Redeploy creates new PVC (data orphaned) + +**Resolution**: Document PVC backup/restore procedures. + +## App Package Development Best Practices + +### Security Requirements + +**All pods must include security contexts**: + +```yaml +spec: + template: + spec: + securityContext: + runAsNonRoot: true + runAsUser: 999 # Use appropriate non-root UID + runAsGroup: 999 + seccompProfile: + type: RuntimeDefault + containers: + - name: app + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: [ALL] + readOnlyRootFilesystem: false # true when possible +``` + +Common UIDs: PostgreSQL/Redis use 999. + +### Database Initialization Pattern + +Apps requiring databases should include `db-init-job.yaml`: + +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: myapp-db-init +spec: + template: + spec: + restartPolicy: OnFailure + containers: + - name: db-init + image: postgres:15 + command: + - /bin/bash + - -c + - | + # Create database if doesn't exist + # Create/update user with password + # Grant permissions +``` + +**Critical**: Use idempotent SQL: +- `CREATE DATABASE IF NOT EXISTS` +- `CREATE USER IF NOT EXISTS ... ELSE ALTER USER ... WITH PASSWORD` +- Jobs retry on failure until success + +### Database URL Secrets + +Never use runtime variable substitution - it doesn't work with Kustomize: + +```yaml +# ❌ Wrong: +- name: DB_URL + value: "postgres://user:$(DB_PASSWORD)@host/db" + +# ✅ Correct: +- name: DB_URL + valueFrom: + secretKeyRef: + name: myapp-secrets + key: dbUrl +``` + +Define `dbUrl` in manifest's `defaultSecrets` with template: +```yaml +defaultSecrets: + - key: dbUrl + default: "postgres://{{ .app.dbUser }}:{{ .secrets.dbPassword }}@{{ .app.dbHost }}/{{ .app.dbName }}" +``` + +### External DNS Integration + +Ingresses should include external-dns annotations: + +```yaml +metadata: + annotations: + external-dns.alpha.kubernetes.io/target: {{ .domain }} + external-dns.alpha.kubernetes.io/cloudflare-proxied: "false" +``` + +This creates: `myapp.cloud.example.com` → `cloud.example.com` (CNAME) + +### Converting from Helm Charts + +1. 
Extract and render Helm chart: + ```bash + helm fetch --untar --untardir charts repo/chart-name + helm template --output-dir base --namespace myapp myapp charts/chart-name + ``` + +2. Create Wild Cloud structure: + - Add `namespace.yaml` + - Run `kustomize create --autodetect` + - Create `manifest.yaml` + - Replace values with gomplate variables + - Update labels (remove Helm-style, add Wild Cloud standard) + - Add security contexts + - Add external-dns annotations + +## Testing Strategies + +### Unit Tests + +Test individual operations in isolation: + +```go +func TestDelete_NamespaceNotFound(t *testing.T) { + // Test delete when namespace doesn't exist + // Should succeed without error +} + +func TestDelete_NamespaceTerminating(t *testing.T) { + // Test delete when namespace stuck terminating + // Should force cleanup and succeed +} + +func TestDeploy_NamespaceTerminating(t *testing.T) { + // Test deploy when namespace terminating + // Should retry and eventually succeed +} +``` + +### Integration Tests + +Test cross-system operations: + +```go +func TestDeleteThenDeploy(t *testing.T) { + // 1. Deploy app + // 2. Delete app + // 3. Immediately redeploy + // Should succeed without manual intervention +} + +func TestConcurrentOperations(t *testing.T) { + // Test multiple operations on same app + // File locks should prevent corruption +} +``` + +### Chaos Tests + +Test resilience to failures: + +```go +func TestDeleteWithNetworkPartition(t *testing.T) { + // Simulate network failure during delete + // Local state should still be cleaned up +} + +func TestDeployWithStuckFinalizer(t *testing.T) { + // Manually add finalizer to namespace + // Deploy should detect and force cleanup +} +``` + +## Operational Procedures + +### Manual Inspection + +**Check all state locations**: +```bash +# 1. Local state +ls instance/apps/{app-name}/ +yq eval '.apps.{app-name}' config.yaml +yq eval '.apps.{app-name}' secrets.yaml + +# 2. Cluster state +kubectl get namespace {app-name} +kubectl get all -n {app-name} +kubectl get pvc -n {app-name} +kubectl get secrets -n {app-name} +kubectl get ingress -n {app-name} +``` + +**Check operation status**: +```bash +ls -lt instance/operations/ | head -5 +cat instance/operations/op_deploy_app_{app-name}_*.json +``` + +### Manual Recovery + +**If namespace stuck terminating**: +```bash +# This should never be needed - Deploy handles automatically +# But for understanding: +kubectl get namespace {app-name} -o json | \ + jq '.spec.finalizers = []' | \ + kubectl replace --raw /api/v1/namespaces/{app-name}/finalize -f - +``` + +**If local state corrupted**: +```bash +# Re-add from Wild Directory +wild app add {app-name} +# This regenerates local state from source +``` + +**If secrets lost**: +```bash +# Secrets are auto-generated on add +# If lost, must re-add app (regenerates new secrets) +# Apps will need reconfiguration with new credentials +``` + +## Design Principles + +### 1. Eventual Consistency at Boundaries + +Accept that cluster state and local state may temporarily diverge. Design operations to handle any state. + +### 2. Local State as Source of Truth + +Instance data (config.yaml, secrets.yaml) is authoritative for intended state. Cluster reflects current state. + +### 3. Idempotent Everything + +Every operation should be safely repeatable. Use: +- `kubectl apply` (not `create`) +- `kubectl delete --ignore-not-found` +- `CREATE IF NOT EXISTS` in SQL + +### 4. 
Fail Forward, Not Backward + +If operation partially completes, retry should make progress (not start over). + +### 5. No Indefinite Waits + +Operations timeout and fail explicitly rather than hanging forever. + +### 6. User Never Needs Manual Intervention + +Automated recovery from all known edge cases (stuck namespaces, etc.). + +## Future Enhancements + +### 1. Dependency Validation + +Pre-flight check that required apps are deployed: +```go +if manifest.Requires != nil { + for _, dep := range manifest.Requires { + if !isAppDeployed(dep.Name) { + return fmt.Errorf("dependency %s not deployed", dep.Name) + } + } +} +``` + +### 2. State Reconciliation + +Periodic background job to ensure consistency: +```go +func ReconcileAppState(appName string) { + localState := readLocalState(appName) + clusterState := readClusterState(appName) + + if !statesMatch(localState, clusterState) { + // Alert or auto-correct + } +} +``` + +### 3. Backup/Restore Workflows + +Built-in PVC backup before delete: +```bash +wild app backup {app-name} +wild app restore {app-name} --from-backup {timestamp} +``` + +### 4. Dry-Run Mode + +Preview changes without applying: +```bash +wild app deploy {app-name} --dry-run +# Shows: resources that would be created/updated +``` + +## Git Workflow Best Practices (Advanced Users) + +This section provides operational guidance for advanced users managing Wild Cloud instances as git repositories. + +### Initial Repository Setup + +```bash +# Initialize instance directory as git repo +cd /var/lib/wild-central/instances/my-cloud +git init +git add . +git commit -m "Initial Wild Cloud instance configuration" + +# Add Wild Directory as upstream remote +git remote add wild-directory https://github.com/wildcloud/wild-directory.git +git fetch wild-directory + +# Add origin for your team's instance repo +git remote add origin git@github.com:myorg/wild-cloud-instances.git +git push -u origin main +``` + +### .gitignore Configuration + +```bash +# Create .gitignore for instance directory +cat > .gitignore <>>>>>> wild-directory/main + +# Resolution: Keep our production values +resources: + limits: + cpu: "4000m" + memory: "16Gi" + requests: + cpu: "2000m" + memory: "8Gi" +``` + +### Commit Message Conventions + +**Format**: `(): ` + +**Types**: +- `feat`: New app or feature +- `fix`: Bug fix or correction +- `config`: Configuration change +- `scale`: Resource scaling +- `upgrade`: Version upgrade +- `security`: Security-related change +- `docs`: Documentation change + +**Examples**: +```bash +git commit -m "feat(redis): Add Redis cache for session storage" +git commit -m "scale(postgres): Increase CPU limits for production load" +git commit -m "fix(ghost): Correct domain configuration for SSL" +git commit -m "upgrade(immich): Update to v1.2.0 with new ML features" +git commit -m "security(all): Rotate database passwords" +git commit -m "config(mastodon): Enable SMTP for email notifications" +``` + +### Rollback Procedures + +**Rollback entire app configuration**: +```bash +# Find commit to rollback to +git log --oneline -- apps/myapp/ + +# Revert specific commit +git revert abc123 + +# Or rollback to specific point +git checkout abc123 -- apps/myapp/ +git commit -m "rollback(myapp): Revert to stable configuration" + +# Deploy reverted state +wild app deploy myapp +``` + +**Emergency rollback (production incident)**: +```bash +# Immediately revert to last known good state +git log --oneline -5 +git reset --hard abc123 # Last working commit +wild app deploy myapp + +# Document the incident 
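# Note: `git reset --hard` discards any uncommitted local work, and the force
# push below rewrites remote history. If other admins may have pushed since
# your last pull, a lease-protected push avoids silently dropping their commits:
#   git push --force-with-lease origin main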
+git commit --allow-empty -m "emergency: Rolled back myapp due to production incident" +git push --force origin main # Force push to update remote +``` + +### Collaboration Patterns + +**Multiple admins working on same cluster**: + +```bash +# Always pull before making changes +git pull origin main + +# Use descriptive branch names +git checkout -b alice/add-monitoring +git checkout -b bob/upgrade-postgres + +# Push branches for review +git push origin alice/add-monitoring + +# Use PRs/MRs for review before merging to main +# This prevents conflicts and ensures peer review +``` + +**Code review checklist**: +- [ ] Changes tested in non-production environment +- [ ] Resource limits appropriate for workload +- [ ] Secrets not committed +- [ ] Dependencies deployed (if new app) +- [ ] Commit message follows conventions +- [ ] Breaking changes documented + +### Backup and Disaster Recovery + +**Regular backups**: +```bash +# Create tagged backup of current state +git tag -a backup-$(date +%Y%m%d) -m "Daily backup" +git push origin backup-$(date +%Y%m%d) + +# Automated daily backup (cron) +0 2 * * * cd /var/lib/wild-central/instances/my-cloud && git tag backup-$(date +%Y%m%d-%H%M) && git push origin --tags +``` + +**Disaster recovery**: +```bash +# Clone instance repository to new Wild Central device +git clone git@github.com:myorg/wild-cloud-instances.git /var/lib/wild-central/instances/my-cloud + +# Restore secrets from secure backup (NOT in git) +# (From password manager, Vault, encrypted backup, etc.) +cp ~/secure-backup/secrets.yaml /var/lib/wild-central/instances/my-cloud/ + +# Deploy all apps +cd /var/lib/wild-central/instances/my-cloud +for app in apps/*/; do + wild app deploy $(basename $app) +done +``` + +### Git Workflow vs Web UI + +**When git is better**: +- Complex changes requiring review +- Multi-app updates +- Compliance/audit requirements +- Team collaboration +- Emergency rollbacks + +**When Web UI is better**: +- Quick configuration tweaks +- Adding single app +- Viewing current state +- Non-technical team members + +**Hybrid approach**: Advanced users can use git for complex changes, Web UI for quick operations. The two workflows coexist peacefully since both modify the same instance directory. + +## Conclusion + +Wild Cloud's app lifecycle management spans three independent systems with different consistency guarantees. By understanding these systems and their boundaries, we can design operations that are: + +- **Reliable**: Handle edge cases automatically +- **Simple**: Two-phase operations (cluster + local) +- **Safe**: Idempotent and recoverable +- **Fast**: Don't wait unnecessarily for eventual consistency + +Additionally, for advanced users, the git-based workflow provides: +- **Auditable**: Full version control history +- **Collaborative**: Standard git workflows for team management +- **Recoverable**: Git revert/rollback capabilities +- **Professional**: Infrastructure-as-code best practices + +The key insight is accepting eventual consistency at system boundaries while maintaining immediate consistency within each system. This allows operations to complete quickly for users while ensuring the system eventually reaches a consistent state. 
diff --git a/docs/future/backups.md b/docs/future/backups.md new file mode 100644 index 0000000..5a42892 --- /dev/null +++ b/docs/future/backups.md @@ -0,0 +1,2529 @@ +# Wild Cloud Backup System - Complete Implementation Guide + +**Date:** 2025-11-26 +**Status:** 📋 READY FOR IMPLEMENTATION +**Estimated Effort:** Phase 1: 2-3 days | Phase 2: 5-7 days | Phase 3: 3-5 days + +--- + +## Table of Contents + +1. [Executive Summary](#executive-summary) +2. [Background and Context](#background-and-context) +3. [Problem Analysis](#problem-analysis) +4. [Architecture Overview](#architecture-overview) +5. [Configuration Design](#configuration-design) +6. [Phase 1: Core Backup Fix](#phase-1-core-backup-fix) +7. [Phase 2: Restic Integration](#phase-2-restic-integration) +8. [Phase 3: Restore from Restic](#phase-3-restore-from-restic) +9. [API Specifications](#api-specifications) +10. [Web UI Design](#web-ui-design) +11. [Testing Strategy](#testing-strategy) +12. [Deployment Guide](#deployment-guide) +13. [Task Breakdown](#task-breakdown) +14. [Success Criteria](#success-criteria) + +--- + +## Executive Summary + +### Current State +App backups are completely broken - they create only metadata files (`backup.json`) without any actual backup data: +- ❌ No database dump files (`.sql`, `.dump`) +- ❌ No PVC archive files (`.tar.gz`) +- ❌ Users cannot restore from these "backups" +- ✅ Cluster backups work correctly (different code path) + +### Root Cause +Database detection uses pod label-based discovery (`app=gitea` in `postgres` namespace), but database pods are shared infrastructure labeled `app=postgres`. Detection always returns empty, so no backups are created. + +### Why This Matters +- **Scale**: Applications like Immich may host terabyte-scale photo libraries +- **Storage**: Wild Central devices may not have sufficient local storage +- **Flexibility**: Need flexible destinations: local, NFS, S3, Backblaze B2, SFTP, etc. +- **Deduplication**: Critical for TB-scale data (60-80% space savings) + +### Solution: Three-Phase Approach + +**Phase 1 (CRITICAL - 2-3 days)**: Fix broken app backups +- Manifest-based database detection (declarative) +- kubectl exec for database dumps +- PVC discovery and backup +- Store files locally in staging directory + +**Phase 2 (HIGH PRIORITY - 5-7 days)**: Restic integration +- Upload staged files to restic repository +- Configuration via config.yaml and web UI +- Support multiple backends (local, S3, B2, SFTP) +- Repository initialization and testing + +**Phase 3 (MEDIUM PRIORITY - 3-5 days)**: Restore from restic +- List available snapshots +- Restore from any snapshot +- Database and PVC restoration +- Web UI for restore operations + +--- + +## Background and Context + +### Project Philosophy + +Wild Cloud follows strict KISS/YAGNI principles: +- **KISS**: Keep implementations as simple as possible +- **YAGNI**: Build only what's needed now, not speculative features +- **No future-proofing**: Let complexity emerge from actual requirements +- **Trust in emergence**: Start simple, enhance when requirements proven + +### Key Design Decisions + +1. **Manifest-based detection**: Read app dependencies from `manifest.yaml` (declarative), not runtime pod discovery +2. **kubectl exec approach**: Use standard Kubernetes operations for dumps and tar archives +3. **Restic for scale**: Use battle-tested restic tool for TB-scale data and flexible backends +4. **Phased implementation**: Fix core bugs first, add features incrementally + +### Why Restic? 
+ +**Justified by actual requirements** (not premature optimization): +- **Scale**: Handle TB-scale data (Immich with terabytes of photos) +- **Flexibility**: Multiple backends (local, S3, B2, SFTP, Azure, GCS) +- **Efficiency**: 60-80% space savings via deduplication +- **Security**: Built-in AES-256 encryption +- **Reliability**: Battle-tested, widely adopted +- **Incremental**: Only backup changed blocks + +--- + +## Problem Analysis + +### Critical Bug: App Backups Create No Files + +**Evidence** from `/home/payne/repos/wild-cloud-dev/.working/in-progress-fix.md`: + +``` +Backup structure: +apps/ +└── gitea/ + └── 20241124T143022Z/ + └── backup.json ← Only this file exists! + +Expected structure: +apps/ +└── gitea/ + └── 20241124T143022Z/ + ├── backup.json + ├── postgres.sql ← Missing! + └── data.tar.gz ← Missing! +``` + +### Root Cause Analysis + +**File**: `wild-central-api/internal/backup/backup.go` (lines 544-569) + +```go +func (m *Manager) detectDatabaseType(ctx context.Context, namespace, appLabel string) (string, error) { + // This looks for pods with label "app=gitea" in namespace "postgres" + // But database pods are labeled "app=postgres" in namespace "postgres" + // This ALWAYS returns empty result! + + cmd := exec.CommandContext(ctx, "kubectl", "get", "pods", + "-n", namespace, + "-l", fmt.Sprintf("app=%s", appLabel), // ← Wrong label! + "-o", "jsonpath={.items[0].metadata.name}") + + output, err := cmd.Output() + if err != nil || len(output) == 0 { + return "", nil // ← Returns empty, no backup created + } + // ... +} +``` + +**Why It's Broken**: +1. Gitea backup tries to find pod with label `app=gitea` in namespace `postgres` +2. But PostgreSQL pod is labeled `app=postgres` in namespace `postgres` +3. Detection always fails → no database dump created +4. Same problem for PVC detection → no PVC archive created +5. 
Only `backup.json` metadata file is written + +### Why Cluster Backups Work + +Cluster backups don't use app-specific detection: +- Directly use `kubectl get` to find etcd pods +- Use hardcoded paths for config files +- Don't rely on app-based pod labels +- Actually create `.tar.gz` files with real data + +--- + +## Architecture Overview + +### System Components + +``` +┌─────────────────────────────────────────────────────────┐ +│ Wild Cloud Backup System │ +├─────────────────────────────────────────────────────────┤ +│ │ +│ ┌─ Web UI (wild-web-app) ─────────────────────┐ │ +│ │ - Backup configuration form │ │ +│ │ - Repository status display │ │ +│ │ - Backup creation/restore UI │ │ +│ └──────────────────┬───────────────────────────┘ │ +│ │ REST API │ +│ ┌─ API Layer (wild-central-api) ───────────────┐ │ +│ │ - Backup configuration endpoints │ │ +│ │ - Backup/restore operation handlers │ │ +│ │ - Restic integration layer │ │ +│ └──────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─ Backup Engine ────────────────────────────┐ │ +│ │ - Manifest parser │ │ +│ │ - Database backup (kubectl exec pg_dump) │ │ +│ │ - PVC backup (kubectl exec tar) │ │ +│ │ - Restic upload (Phase 2) │ │ +│ └──────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─ Storage Layer ────────────────────────────┐ │ +│ │ Phase 1: Local staging directory │ │ +│ │ Phase 2: Restic repository (local/remote) │ │ +│ └─────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────┘ +``` + +### Data Flow + +**Phase 1 (Local Staging)**: +``` +User clicks "Backup" → API Handler + ↓ + Read manifest.yaml (detect databases) + ↓ + kubectl exec pg_dump → postgres.sql + ↓ + kubectl exec tar → pvc-data.tar.gz + ↓ + Save to /var/lib/wild-central/backup-staging/ + ↓ + Write backup.json metadata +``` + +**Phase 2 (Restic Upload)**: +``` +[Same as Phase 1] → Local staging files created + ↓ + restic backup + ↓ + Upload to repository (S3/B2/local/etc) + ↓ + Clean staging directory + ↓ + Update metadata with snapshot ID +``` + +**Phase 3 (Restore)**: +``` +User selects snapshot → restic restore + ↓ + Download to staging directory + ↓ + kubectl exec psql < postgres.sql + ↓ + kubectl cp tar file → pod + ↓ + kubectl exec tar -xzf → restore PVC data +``` + +--- + +## Configuration Design + +### Schema: config.yaml + +```yaml +cloud: + domain: "wildcloud.local" + dns: + ip: "192.168.8.50" + + backup: + # Restic repository location (native restic URI format) + repository: "/mnt/backups/wild-cloud" # or "s3:bucket" or "sftp:user@host:/path" + + # Local staging directory (always on Wild Central filesystem) + staging: "/var/lib/wild-central/backup-staging" + + # Retention policy (restic forget flags) + retention: + keepDaily: 7 + keepWeekly: 4 + keepMonthly: 6 + keepYearly: 2 + + # Backend-specific configuration (optional, backend-dependent) + backend: + # For S3-compatible backends (B2, Wasabi, MinIO) + endpoint: "s3.us-west-002.backblazeb2.com" + region: "us-west-002" + + # For SFTP + port: 22 +``` + +### Schema: secrets.yaml + +```yaml +cloud: + backup: + # Restic repository encryption password + password: "strong-encryption-password" + + # Backend credentials (conditional on backend type) + credentials: + # For S3/B2/S3-compatible (auto-detected from repository prefix) + s3: + accessKeyId: "KEY_ID" + secretAccessKey: "SECRET_KEY" + + # For SFTP + sftp: + password: "ssh-password" + # OR + privateKey: | + -----BEGIN OPENSSH PRIVATE KEY----- + ... 
+ -----END OPENSSH PRIVATE KEY----- + + # For Azure + azure: + accountName: "account" + accountKey: "key" + + # For Google Cloud + gcs: + projectId: "project-id" + serviceAccountKey: | + { "type": "service_account", ... } +``` + +### Configuration Examples + +#### Example 1: Local Testing + +**config.yaml**: +```yaml +cloud: + backup: + repository: "/mnt/external-drive/wild-cloud-backups" + staging: "/var/lib/wild-central/backup-staging" + retention: + keepDaily: 7 + keepWeekly: 4 + keepMonthly: 6 +``` + +**secrets.yaml**: +```yaml +cloud: + backup: + password: "test-backup-password-123" +``` + +#### Example 2: Backblaze B2 + +**config.yaml**: +```yaml +cloud: + backup: + repository: "b2:wild-cloud-backups" + staging: "/var/lib/wild-central/backup-staging" + retention: + keepDaily: 7 + keepWeekly: 4 + keepMonthly: 6 + backend: + endpoint: "s3.us-west-002.backblazeb2.com" + region: "us-west-002" +``` + +**secrets.yaml**: +```yaml +cloud: + backup: + password: "strong-encryption-password" + credentials: + s3: + accessKeyId: "0020123456789abcdef" + secretAccessKey: "K002abcdefghijklmnop" +``` + +#### Example 3: AWS S3 + +**config.yaml**: +```yaml +cloud: + backup: + repository: "s3:s3.amazonaws.com/my-wild-cloud-backups" + staging: "/var/lib/wild-central/backup-staging" + retention: + keepDaily: 14 + keepWeekly: 8 + keepMonthly: 12 + backend: + region: "us-east-1" +``` + +**secrets.yaml**: +```yaml +cloud: + backup: + password: "prod-encryption-password" + credentials: + s3: + accessKeyId: "AKIAIOSFODNN7EXAMPLE" + secretAccessKey: "wJalrXUtnFEMI/K7MDENG/bPxRfiCY" +``` + +#### Example 4: SFTP Remote Server + +**config.yaml**: +```yaml +cloud: + backup: + repository: "sftp:backup-user@backup.example.com:/wild-cloud-backups" + staging: "/var/lib/wild-central/backup-staging" + retention: + keepDaily: 7 + keepWeekly: 4 + keepMonthly: 6 + backend: + port: 2222 +``` + +**secrets.yaml**: +```yaml +cloud: + backup: + password: "restic-repo-password" + credentials: + sftp: + privateKey: | + -----BEGIN OPENSSH PRIVATE KEY----- + ... 
+ -----END OPENSSH PRIVATE KEY----- +``` + +#### Example 5: NFS/SMB Mount (as Local Path) + +**config.yaml**: +```yaml +cloud: + backup: + repository: "/mnt/nas-backups/wild-cloud" # NFS mounted via OS + staging: "/var/lib/wild-central/backup-staging" + retention: + keepDaily: 7 + keepWeekly: 4 + keepMonthly: 6 +``` + +**secrets.yaml**: +```yaml +cloud: + backup: + password: "backup-encryption-password" +``` + +### Backend Detection Logic + +```go +func DetectBackendType(repository string) string { + if strings.HasPrefix(repository, "/") { + return "local" + } else if strings.HasPrefix(repository, "sftp:") { + return "sftp" + } else if strings.HasPrefix(repository, "s3:") || strings.HasPrefix(repository, "b2:") { + return "s3" + } else if strings.HasPrefix(repository, "azure:") { + return "azure" + } else if strings.HasPrefix(repository, "gs:") { + return "gcs" + } else if strings.HasPrefix(repository, "rclone:") { + return "rclone" + } + return "unknown" +} +``` + +### Environment Variable Mapping + +```go +func BuildResticEnv(config BackupConfig, secrets BackupSecrets) map[string]string { + env := map[string]string{ + "RESTIC_REPOSITORY": config.Repository, + "RESTIC_PASSWORD": secrets.Password, + } + + backendType := DetectBackendType(config.Repository) + + switch backendType { + case "s3": + env["AWS_ACCESS_KEY_ID"] = secrets.Credentials.S3.AccessKeyID + env["AWS_SECRET_ACCESS_KEY"] = secrets.Credentials.S3.SecretAccessKey + + if config.Backend.Endpoint != "" { + env["AWS_S3_ENDPOINT"] = config.Backend.Endpoint + } + if config.Backend.Region != "" { + env["AWS_DEFAULT_REGION"] = config.Backend.Region + } + + case "sftp": + if secrets.Credentials.SFTP.Password != "" { + env["RESTIC_SFTP_PASSWORD"] = secrets.Credentials.SFTP.Password + } + // SSH key handling done via temp file + + case "azure": + env["AZURE_ACCOUNT_NAME"] = secrets.Credentials.Azure.AccountName + env["AZURE_ACCOUNT_KEY"] = secrets.Credentials.Azure.AccountKey + + case "gcs": + // Write service account key to temp file, set GOOGLE_APPLICATION_CREDENTIALS + } + + return env +} +``` + +--- + +## Phase 1: Core Backup Fix + +### Goal +Fix critical bugs and create actual backup files (no restic yet). + +### Priority +🔴 **CRITICAL** - Users cannot restore from current backups + +### Timeline +2-3 days + +### Overview + +Replace broken pod label-based detection with manifest-based detection. Use kubectl exec to create actual database dumps and PVC archives. 
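+
+The tasks below read and write a `BackupInfo` metadata structure that is serialized to `backup.json` in each backup directory. The struct itself is not shown elsewhere in this document; the following is a minimal sketch inferred from how Tasks 1.5 and 2.3 use it (field names and JSON tags are assumptions, not existing code):
+
+```go
+package backup
+
+// BackupInfo is the metadata written to backup.json for each app backup.
+// Field set inferred from Tasks 1.5 and 2.3; names and tags are illustrative.
+type BackupInfo struct {
+    Type       string   `json:"type"`       // "app" or "cluster"
+    AppName    string   `json:"appName"`    // app being backed up
+    Status     string   `json:"status"`     // "in_progress", "completed", or "failed"
+    CreatedAt  string   `json:"createdAt"`  // RFC3339 UTC timestamp
+    Files      []string `json:"files"`      // staged artifacts (postgres.sql, *.tar.gz, ...)
+    Size       int64    `json:"size"`       // total size of staged files, in bytes
+    Error      string   `json:"error"`      // populated when Status == "failed"
+    SnapshotID string   `json:"snapshotId"` // restic snapshot ID (Phase 2 only)
+}
+```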
+ +### Task 1.1: Implement Manifest-Based Database Detection + +**File**: `wild-central-api/internal/backup/backup.go` + +**Add New Structures**: +```go +type AppDependencies struct { + HasPostgres bool + HasMySQL bool + HasRedis bool +} +``` + +**Implement Detection Function**: +```go +func (m *Manager) getAppDependencies(appName string) (*AppDependencies, error) { + manifestPath := filepath.Join(m.directoryPath, appName, "manifest.yaml") + + manifest, err := directory.LoadManifest(manifestPath) + if err != nil { + return nil, fmt.Errorf("failed to load manifest: %w", err) + } + + deps := &AppDependencies{ + HasPostgres: contains(manifest.Requires, "postgres"), + HasMySQL: contains(manifest.Requires, "mysql"), + HasRedis: contains(manifest.Requires, "redis"), + } + + return deps, nil +} + +func contains(slice []string, item string) bool { + for _, s := range slice { + if s == item { + return true + } + } + return false +} +``` + +**Changes Required**: +- Add import: `"github.com/wild-cloud/wild-central/daemon/internal/directory"` +- Remove old `detectDatabaseType()` function (lines 544-569) + +**Acceptance Criteria**: +- Reads manifest.yaml for app +- Correctly identifies postgres dependency +- Correctly identifies mysql dependency +- Returns error if manifest not found +- Unit test: parse manifest with postgres +- Unit test: parse manifest without databases + +**Estimated Effort**: 2 hours + +--- + +### Task 1.2: Implement PostgreSQL Backup via kubectl exec + +**File**: `wild-central-api/internal/backup/backup.go` + +**Implementation**: +```go +func (m *Manager) backupPostgres(ctx context.Context, instanceName, appName, backupDir string) (string, error) { + dbName := appName // Database name convention + + // Find postgres pod in postgres namespace + podName, err := m.findPodInNamespace(ctx, "postgres", "app=postgres") + if err != nil { + return "", fmt.Errorf("postgres pod not found: %w", err) + } + + // Execute pg_dump + dumpFile := filepath.Join(backupDir, "postgres.sql") + cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "postgres", + podName, "--", "pg_dump", "-U", "postgres", dbName) + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("pg_dump failed: %w", err) + } + + // Write dump to file + if err := os.WriteFile(dumpFile, output, 0600); err != nil { + return "", fmt.Errorf("failed to write dump: %w", err) + } + + return dumpFile, nil +} + +// Helper function to find pod by label +func (m *Manager) findPodInNamespace(ctx context.Context, namespace, labelSelector string) (string, error) { + cmd := exec.CommandContext(ctx, "kubectl", "get", "pods", + "-n", namespace, + "-l", labelSelector, + "-o", "jsonpath={.items[0].metadata.name}") + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("kubectl get pods failed: %w", err) + } + + podName := strings.TrimSpace(string(output)) + if podName == "" { + return "", fmt.Errorf("no pod found with label %s in namespace %s", labelSelector, namespace) + } + + return podName, nil +} +``` + +**Acceptance Criteria**: +- Finds postgres pod correctly +- Executes pg_dump successfully +- Creates .sql file with actual data +- Handles errors gracefully +- Integration test: backup Gitea database + +**Estimated Effort**: 3 hours + +--- + +### Task 1.3: Implement MySQL Backup via kubectl exec + +**File**: `wild-central-api/internal/backup/backup.go` + +**Implementation**: +```go +func (m *Manager) backupMySQL(ctx context.Context, instanceName, appName, backupDir string) (string, error) { + dbName := 
appName + + // Find mysql pod + podName, err := m.findPodInNamespace(ctx, "mysql", "app=mysql") + if err != nil { + return "", fmt.Errorf("mysql pod not found: %w", err) + } + + // Get MySQL root password from secret + password, err := m.getMySQLPassword(ctx) + if err != nil { + return "", fmt.Errorf("failed to get mysql password: %w", err) + } + + // Execute mysqldump + dumpFile := filepath.Join(backupDir, "mysql.sql") + cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "mysql", + podName, "--", "mysqldump", + "-uroot", + fmt.Sprintf("-p%s", password), + "--single-transaction", + "--routines", + "--triggers", + dbName) + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("mysqldump failed: %w", err) + } + + if err := os.WriteFile(dumpFile, output, 0600); err != nil { + return "", fmt.Errorf("failed to write dump: %w", err) + } + + return dumpFile, nil +} + +func (m *Manager) getMySQLPassword(ctx context.Context) (string, error) { + cmd := exec.CommandContext(ctx, "kubectl", "get", "secret", + "-n", "mysql", + "mysql-root-password", + "-o", "jsonpath={.data.password}") + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("failed to get secret: %w", err) + } + + // Decode base64 + decoded, err := base64.StdEncoding.DecodeString(string(output)) + if err != nil { + return "", fmt.Errorf("failed to decode password: %w", err) + } + + return string(decoded), nil +} +``` + +**Acceptance Criteria**: +- Finds mysql pod correctly +- Retrieves password from secret +- Executes mysqldump successfully +- Creates .sql file with actual data +- Handles errors gracefully + +**Estimated Effort**: 3 hours + +--- + +### Task 1.4: Implement PVC Discovery and Backup + +**File**: `wild-central-api/internal/backup/backup.go` + +**Implementation**: +```go +func (m *Manager) findAppPVCs(ctx context.Context, appName string) ([]string, error) { + // Get namespace for app (convention: app name) + namespace := appName + + cmd := exec.CommandContext(ctx, "kubectl", "get", "pvc", + "-n", namespace, + "-o", "jsonpath={.items[*].metadata.name}") + + output, err := cmd.Output() + if err != nil { + return nil, fmt.Errorf("kubectl get pvc failed: %w", err) + } + + pvcNames := strings.Fields(string(output)) + return pvcNames, nil +} + +func (m *Manager) backupPVC(ctx context.Context, namespace, pvcName, backupDir string) (string, error) { + // Find pod using this PVC + podName, err := m.findPodUsingPVC(ctx, namespace, pvcName) + if err != nil { + return "", fmt.Errorf("no pod found using PVC %s: %w", pvcName, err) + } + + // Get mount path for PVC + mountPath, err := m.getPVCMountPath(ctx, namespace, podName, pvcName) + if err != nil { + return "", fmt.Errorf("failed to get mount path: %w", err) + } + + // Create tar archive of PVC data + tarFile := filepath.Join(backupDir, fmt.Sprintf("%s.tar.gz", pvcName)) + cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace, + podName, "--", "tar", "czf", "-", "-C", mountPath, ".") + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("tar command failed: %w", err) + } + + if err := os.WriteFile(tarFile, output, 0600); err != nil { + return "", fmt.Errorf("failed to write tar file: %w", err) + } + + return tarFile, nil +} + +func (m *Manager) findPodUsingPVC(ctx context.Context, namespace, pvcName string) (string, error) { + cmd := exec.CommandContext(ctx, "kubectl", "get", "pods", + "-n", namespace, + "-o", "json") + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("kubectl get pods 
failed: %w", err) + } + + // Parse JSON to find pod using this PVC + var podList struct { + Items []struct { + Metadata struct { + Name string `json:"name"` + } `json:"metadata"` + Spec struct { + Volumes []struct { + PersistentVolumeClaim *struct { + ClaimName string `json:"claimName"` + } `json:"persistentVolumeClaim"` + } `json:"volumes"` + } `json:"spec"` + } `json:"items"` + } + + if err := json.Unmarshal(output, &podList); err != nil { + return "", fmt.Errorf("failed to parse pod list: %w", err) + } + + for _, pod := range podList.Items { + for _, volume := range pod.Spec.Volumes { + if volume.PersistentVolumeClaim != nil && + volume.PersistentVolumeClaim.ClaimName == pvcName { + return pod.Metadata.Name, nil + } + } + } + + return "", fmt.Errorf("no pod found using PVC %s", pvcName) +} + +func (m *Manager) getPVCMountPath(ctx context.Context, namespace, podName, pvcName string) (string, error) { + cmd := exec.CommandContext(ctx, "kubectl", "get", "pod", + "-n", namespace, + podName, + "-o", "json") + + output, err := cmd.Output() + if err != nil { + return "", fmt.Errorf("kubectl get pod failed: %w", err) + } + + var pod struct { + Spec struct { + Volumes []struct { + Name string `json:"name"` + PersistentVolumeClaim *struct { + ClaimName string `json:"claimName"` + } `json:"persistentVolumeClaim"` + } `json:"volumes"` + Containers []struct { + VolumeMounts []struct { + Name string `json:"name"` + MountPath string `json:"mountPath"` + } `json:"volumeMounts"` + } `json:"containers"` + } `json:"spec"` + } + + if err := json.Unmarshal(output, &pod); err != nil { + return "", fmt.Errorf("failed to parse pod: %w", err) + } + + // Find volume name for PVC + var volumeName string + for _, volume := range pod.Spec.Volumes { + if volume.PersistentVolumeClaim != nil && + volume.PersistentVolumeClaim.ClaimName == pvcName { + volumeName = volume.Name + break + } + } + + if volumeName == "" { + return "", fmt.Errorf("PVC %s not found in pod volumes", pvcName) + } + + // Find mount path for volume + for _, container := range pod.Spec.Containers { + for _, mount := range container.VolumeMounts { + if mount.Name == volumeName { + return mount.MountPath, nil + } + } + } + + return "", fmt.Errorf("mount path not found for volume %s", volumeName) +} +``` + +**Acceptance Criteria**: +- Discovers PVCs in app namespace +- Finds pod using PVC +- Gets correct mount path +- Creates tar.gz with actual data +- Handles multiple PVCs +- Integration test: backup Immich PVCs + +**Estimated Effort**: 4 hours + +--- + +### Task 1.5: Update BackupApp Flow + +**File**: `wild-central-api/internal/backup/backup.go` + +**Replace BackupApp function** (complete rewrite): + +```go +func (m *Manager) BackupApp(instanceName, appName string) (*BackupInfo, error) { + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute) + defer cancel() + + // Create timestamped backup directory + timestamp := time.Now().UTC().Format("20060102T150405Z") + stagingDir := filepath.Join(m.dataDir, "instances", instanceName, "backups", "staging") + backupDir := filepath.Join(stagingDir, "apps", appName, timestamp) + + if err := os.MkdirAll(backupDir, 0755); err != nil { + return nil, fmt.Errorf("failed to create backup directory: %w", err) + } + + // Initialize backup info with in_progress status + info := &BackupInfo{ + Type: "app", + AppName: appName, + Status: "in_progress", + CreatedAt: time.Now().UTC().Format(time.RFC3339), + Files: []string{}, + } + + // Save initial metadata + if err := m.saveBackupMetadata(backupDir, 
info); err != nil { + return nil, fmt.Errorf("failed to save initial metadata: %w", err) + } + + // Read app dependencies from manifest + deps, err := m.getAppDependencies(appName) + if err != nil { + info.Status = "failed" + info.Error = fmt.Sprintf("Failed to read manifest: %v", err) + m.saveBackupMetadata(backupDir, info) + return info, err + } + + var backupFiles []string + + // Backup PostgreSQL if required + if deps.HasPostgres { + file, err := m.backupPostgres(ctx, instanceName, appName, backupDir) + if err != nil { + info.Status = "failed" + info.Error = fmt.Sprintf("PostgreSQL backup failed: %v", err) + m.saveBackupMetadata(backupDir, info) + return info, err + } + backupFiles = append(backupFiles, file) + } + + // Backup MySQL if required + if deps.HasMySQL { + file, err := m.backupMySQL(ctx, instanceName, appName, backupDir) + if err != nil { + info.Status = "failed" + info.Error = fmt.Sprintf("MySQL backup failed: %v", err) + m.saveBackupMetadata(backupDir, info) + return info, err + } + backupFiles = append(backupFiles, file) + } + + // Discover and backup PVCs + pvcNames, err := m.findAppPVCs(ctx, appName) + if err != nil { + // Log warning but don't fail if no PVCs found + log.Printf("Warning: failed to find PVCs for %s: %v", appName, err) + } else { + for _, pvcName := range pvcNames { + file, err := m.backupPVC(ctx, appName, pvcName, backupDir) + if err != nil { + log.Printf("Warning: failed to backup PVC %s: %v", pvcName, err) + continue + } + backupFiles = append(backupFiles, file) + } + } + + // Calculate total backup size + var totalSize int64 + for _, file := range backupFiles { + stat, err := os.Stat(file) + if err == nil { + totalSize += stat.Size() + } + } + + // Update final metadata + info.Status = "completed" + info.Files = backupFiles + info.Size = totalSize + info.Error = "" + + if err := m.saveBackupMetadata(backupDir, info); err != nil { + return info, fmt.Errorf("failed to save final metadata: %w", err) + } + + return info, nil +} + +func (m *Manager) saveBackupMetadata(backupDir string, info *BackupInfo) error { + metadataFile := filepath.Join(backupDir, "backup.json") + data, err := json.MarshalIndent(info, "", " ") + if err != nil { + return fmt.Errorf("failed to marshal metadata: %w", err) + } + return os.WriteFile(metadataFile, data, 0644) +} +``` + +**Acceptance Criteria**: +- Creates timestamped backup directories +- Reads manifest to detect dependencies +- Backs up databases if present +- Backs up PVCs if present +- Calculates accurate backup size +- Saves complete metadata +- Handles errors gracefully +- Integration test: Full Gitea backup + +**Estimated Effort**: 4 hours + +--- + +### Task 1.6: Build and Test + +**Steps**: +1. Build wild-central-api +2. Deploy to test environment +3. Test Gitea backup (PostgreSQL + PVC) +4. Test Immich backup (PostgreSQL + multiple PVCs) +5. Verify backup files exist and have data +6. Verify metadata accuracy +7. Test manual restore + +**Acceptance Criteria**: +- All builds succeed +- App backups create actual files +- Metadata is accurate +- Manual restore works + +**Estimated Effort**: 4 hours + +--- + +## Phase 2: Restic Integration + +### Goal +Upload staged backups to restic repository with flexible backends. + +### Priority +🟡 **HIGH PRIORITY** (after Phase 1 complete) + +### Timeline +5-7 days + +### Prerequisites +- Phase 1 completed and tested +- Restic installed on Wild Central device +- Backup destination configured (S3, B2, local, etc.) 
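+
+As a quick sanity check on how repository URIs map to backends, the `DetectBackendType` helper shown in the Configuration Design section (and again in Task 2.1) can be exercised with a small table-driven test. This is an illustrative sketch; the cases simply mirror the configuration examples above:
+
+```go
+package backup
+
+import "testing"
+
+// Verifies URI-prefix detection against the repository formats used in the
+// configuration examples (local path, SFTP, S3/B2, Azure, GCS, rclone).
+func TestDetectBackendType(t *testing.T) {
+    cases := map[string]string{
+        "/mnt/backups/wild-cloud":                                 "local",
+        "sftp:backup-user@backup.example.com:/wild-cloud-backups": "sftp",
+        "s3:s3.amazonaws.com/my-wild-cloud-backups":               "s3",
+        "b2:wild-cloud-backups":                                   "s3",
+        "azure:container:/path":                                   "azure",
+        "gs:bucket:/path":                                         "gcs",
+        "rclone:remote:path":                                      "rclone",
+        "ftp://unsupported":                                       "unknown",
+    }
+    for repo, want := range cases {
+        if got := DetectBackendType(repo); got != want {
+            t.Errorf("DetectBackendType(%q) = %q, want %q", repo, got, want)
+        }
+    }
+}
+```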
+ +### Task 2.1: Configuration Management + +**File**: `wild-central-api/internal/backup/config.go` (new file) + +**Implementation**: +```go +package backup + +import ( + "fmt" + "strings" + + "github.com/wild-cloud/wild-central/daemon/internal/config" +) + +type BackupConfig struct { + Repository string + Staging string + Retention RetentionPolicy + Backend BackendConfig +} + +type RetentionPolicy struct { + KeepDaily int + KeepWeekly int + KeepMonthly int + KeepYearly int +} + +type BackendConfig struct { + Type string + Endpoint string + Region string + Port int +} + +type BackupSecrets struct { + Password string + Credentials BackendCredentials +} + +type BackendCredentials struct { + S3 *S3Credentials + SFTP *SFTPCredentials + Azure *AzureCredentials + GCS *GCSCredentials +} + +type S3Credentials struct { + AccessKeyID string + SecretAccessKey string +} + +type SFTPCredentials struct { + Password string + PrivateKey string +} + +type AzureCredentials struct { + AccountName string + AccountKey string +} + +type GCSCredentials struct { + ProjectID string + ServiceAccountKey string +} + +func LoadBackupConfig(instanceName string) (*BackupConfig, *BackupSecrets, error) { + cfg, err := config.LoadInstanceConfig(instanceName) + if err != nil { + return nil, nil, fmt.Errorf("failed to load config: %w", err) + } + + secrets, err := config.LoadInstanceSecrets(instanceName) + if err != nil { + return nil, nil, fmt.Errorf("failed to load secrets: %w", err) + } + + backupCfg := &BackupConfig{ + Repository: cfg.Cloud.Backup.Repository, + Staging: cfg.Cloud.Backup.Staging, + Retention: RetentionPolicy{ + KeepDaily: cfg.Cloud.Backup.Retention.KeepDaily, + KeepWeekly: cfg.Cloud.Backup.Retention.KeepWeekly, + KeepMonthly: cfg.Cloud.Backup.Retention.KeepMonthly, + KeepYearly: cfg.Cloud.Backup.Retention.KeepYearly, + }, + Backend: BackendConfig{ + Type: DetectBackendType(cfg.Cloud.Backup.Repository), + Endpoint: cfg.Cloud.Backup.Backend.Endpoint, + Region: cfg.Cloud.Backup.Backend.Region, + Port: cfg.Cloud.Backup.Backend.Port, + }, + } + + backupSecrets := &BackupSecrets{ + Password: secrets.Cloud.Backup.Password, + Credentials: BackendCredentials{ + S3: secrets.Cloud.Backup.Credentials.S3, + SFTP: secrets.Cloud.Backup.Credentials.SFTP, + Azure: secrets.Cloud.Backup.Credentials.Azure, + GCS: secrets.Cloud.Backup.Credentials.GCS, + }, + } + + return backupCfg, backupSecrets, nil +} + +func DetectBackendType(repository string) string { + if strings.HasPrefix(repository, "/") { + return "local" + } else if strings.HasPrefix(repository, "sftp:") { + return "sftp" + } else if strings.HasPrefix(repository, "s3:") || strings.HasPrefix(repository, "b2:") { + return "s3" + } else if strings.HasPrefix(repository, "azure:") { + return "azure" + } else if strings.HasPrefix(repository, "gs:") { + return "gcs" + } else if strings.HasPrefix(repository, "rclone:") { + return "rclone" + } + return "unknown" +} + +func ValidateBackupConfig(cfg *BackupConfig, secrets *BackupSecrets) error { + if cfg.Repository == "" { + return fmt.Errorf("repository is required") + } + + if secrets.Password == "" { + return fmt.Errorf("repository password is required") + } + + // Validate backend-specific credentials + switch cfg.Backend.Type { + case "s3": + if secrets.Credentials.S3 == nil { + return fmt.Errorf("S3 credentials required for S3 backend") + } + if secrets.Credentials.S3.AccessKeyID == "" || secrets.Credentials.S3.SecretAccessKey == "" { + return fmt.Errorf("S3 access key and secret key required") + } + case "sftp": + if 
secrets.Credentials.SFTP == nil { + return fmt.Errorf("SFTP credentials required for SFTP backend") + } + if secrets.Credentials.SFTP.Password == "" && secrets.Credentials.SFTP.PrivateKey == "" { + return fmt.Errorf("SFTP password or private key required") + } + case "azure": + if secrets.Credentials.Azure == nil { + return fmt.Errorf("Azure credentials required for Azure backend") + } + if secrets.Credentials.Azure.AccountName == "" || secrets.Credentials.Azure.AccountKey == "" { + return fmt.Errorf("Azure account name and key required") + } + case "gcs": + if secrets.Credentials.GCS == nil { + return fmt.Errorf("GCS credentials required for GCS backend") + } + if secrets.Credentials.GCS.ServiceAccountKey == "" { + return fmt.Errorf("GCS service account key required") + } + } + + return nil +} +``` + +**Estimated Effort**: 3 hours + +--- + +### Task 2.2: Restic Operations Module + +**File**: `wild-central-api/internal/backup/restic.go` (new file) + +**Implementation**: +```go +package backup + +import ( + "context" + "encoding/json" + "fmt" + "os" + "os/exec" + "strings" +) + +type ResticClient struct { + config *BackupConfig + secrets *BackupSecrets +} + +func NewResticClient(config *BackupConfig, secrets *BackupSecrets) *ResticClient { + return &ResticClient{ + config: config, + secrets: secrets, + } +} + +func (r *ResticClient) buildEnv() map[string]string { + env := map[string]string{ + "RESTIC_REPOSITORY": r.config.Repository, + "RESTIC_PASSWORD": r.secrets.Password, + } + + switch r.config.Backend.Type { + case "s3": + if r.secrets.Credentials.S3 != nil { + env["AWS_ACCESS_KEY_ID"] = r.secrets.Credentials.S3.AccessKeyID + env["AWS_SECRET_ACCESS_KEY"] = r.secrets.Credentials.S3.SecretAccessKey + } + if r.config.Backend.Endpoint != "" { + env["AWS_S3_ENDPOINT"] = r.config.Backend.Endpoint + } + if r.config.Backend.Region != "" { + env["AWS_DEFAULT_REGION"] = r.config.Backend.Region + } + + case "sftp": + if r.secrets.Credentials.SFTP != nil && r.secrets.Credentials.SFTP.Password != "" { + env["RESTIC_SFTP_PASSWORD"] = r.secrets.Credentials.SFTP.Password + } + + case "azure": + if r.secrets.Credentials.Azure != nil { + env["AZURE_ACCOUNT_NAME"] = r.secrets.Credentials.Azure.AccountName + env["AZURE_ACCOUNT_KEY"] = r.secrets.Credentials.Azure.AccountKey + } + } + + return env +} + +func (r *ResticClient) Init(ctx context.Context) error { + cmd := exec.CommandContext(ctx, "restic", "init") + + // Set environment variables + cmd.Env = os.Environ() + for k, v := range r.buildEnv() { + cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v)) + } + + output, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("restic init failed: %w: %s", err, string(output)) + } + + return nil +} + +func (r *ResticClient) Backup(ctx context.Context, path string, tags []string) (string, error) { + args := []string{"backup", path} + for _, tag := range tags { + args = append(args, "--tag", tag) + } + + cmd := exec.CommandContext(ctx, "restic", args...) 
+ + cmd.Env = os.Environ() + for k, v := range r.buildEnv() { + cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v)) + } + + output, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("restic backup failed: %w: %s", err, string(output)) + } + + // Parse snapshot ID from output + snapshotID := r.parseSnapshotID(string(output)) + + return snapshotID, nil +} + +func (r *ResticClient) ListSnapshots(ctx context.Context, tags []string) ([]Snapshot, error) { + args := []string{"snapshots", "--json"} + for _, tag := range tags { + args = append(args, "--tag", tag) + } + + cmd := exec.CommandContext(ctx, "restic", args...) + + cmd.Env = os.Environ() + for k, v := range r.buildEnv() { + cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v)) + } + + output, err := cmd.Output() + if err != nil { + return nil, fmt.Errorf("restic snapshots failed: %w", err) + } + + var snapshots []Snapshot + if err := json.Unmarshal(output, &snapshots); err != nil { + return nil, fmt.Errorf("failed to parse snapshots: %w", err) + } + + return snapshots, nil +} + +func (r *ResticClient) Restore(ctx context.Context, snapshotID, targetPath string) error { + cmd := exec.CommandContext(ctx, "restic", "restore", snapshotID, "--target", targetPath) + + cmd.Env = os.Environ() + for k, v := range r.buildEnv() { + cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v)) + } + + output, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("restic restore failed: %w: %s", err, string(output)) + } + + return nil +} + +func (r *ResticClient) Stats(ctx context.Context) (*RepositoryStats, error) { + cmd := exec.CommandContext(ctx, "restic", "stats", "--json") + + cmd.Env = os.Environ() + for k, v := range r.buildEnv() { + cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v)) + } + + output, err := cmd.Output() + if err != nil { + return nil, fmt.Errorf("restic stats failed: %w", err) + } + + var stats RepositoryStats + if err := json.Unmarshal(output, &stats); err != nil { + return nil, fmt.Errorf("failed to parse stats: %w", err) + } + + return &stats, nil +} + +func (r *ResticClient) TestConnection(ctx context.Context) error { + cmd := exec.CommandContext(ctx, "restic", "cat", "config") + + cmd.Env = os.Environ() + for k, v := range r.buildEnv() { + cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v)) + } + + _, err := cmd.Output() + if err != nil { + return fmt.Errorf("connection test failed: %w", err) + } + + return nil +} + +func (r *ResticClient) parseSnapshotID(output string) string { + lines := strings.Split(output, "\n") + for _, line := range lines { + if strings.Contains(line, "snapshot") && strings.Contains(line, "saved") { + parts := strings.Fields(line) + for i, part := range parts { + if part == "snapshot" && i+1 < len(parts) { + return parts[i+1] + } + } + } + } + return "" +} + +type Snapshot struct { + ID string `json:"id"` + Time string `json:"time"` + Hostname string `json:"hostname"` + Tags []string `json:"tags"` + Paths []string `json:"paths"` +} + +type RepositoryStats struct { + TotalSize int64 `json:"total_size"` + TotalFileCount int64 `json:"total_file_count"` + SnapshotCount int `json:"snapshot_count"` +} +``` + +**Estimated Effort**: 4 hours + +--- + +### Task 2.3: Update Backup Flow to Upload to Restic + +**File**: `wild-central-api/internal/backup/backup.go` + +**Modify BackupApp function** to add restic upload after staging: + +```go +func (m *Manager) BackupApp(instanceName, appName string) (*BackupInfo, error) { + // ... existing Phase 1 code to create local backup ... 
+ + // After local backup succeeds, upload to restic if configured + cfg, secrets, err := LoadBackupConfig(instanceName) + if err == nil && cfg.Repository != "" { + // Restic is configured, upload backup + client := NewResticClient(cfg, secrets) + + tags := []string{ + fmt.Sprintf("type:app"), + fmt.Sprintf("app:%s", appName), + fmt.Sprintf("instance:%s", instanceName), + } + + snapshotID, err := client.Backup(ctx, backupDir, tags) + if err != nil { + log.Printf("Warning: restic upload failed: %v", err) + // Don't fail the backup, local files still exist + } else { + info.SnapshotID = snapshotID + + // Clean up staging directory after successful upload + if err := os.RemoveAll(backupDir); err != nil { + log.Printf("Warning: failed to clean staging directory: %v", err) + } + } + } + + // Save final metadata + if err := m.saveBackupMetadata(backupDir, info); err != nil { + return info, fmt.Errorf("failed to save final metadata: %w", err) + } + + return info, nil +} +``` + +**Estimated Effort**: 2 hours + +--- + +### Task 2.4: API Client Updates + +**File**: `wild-web-app/src/services/api/backups.ts` + +**Add configuration endpoints**: + +```typescript +export interface BackupConfiguration { + repository: string; + staging: string; + retention: { + keepDaily: number; + keepWeekly: number; + keepMonthly: number; + keepYearly: number; + }; + backend: { + type: string; + endpoint?: string; + region?: string; + port?: number; + }; +} + +export interface BackupConfigurationWithCredentials extends BackupConfiguration { + password: string; + credentials?: { + s3?: { + accessKeyId: string; + secretAccessKey: string; + }; + sftp?: { + password?: string; + privateKey?: string; + }; + azure?: { + accountName: string; + accountKey: string; + }; + gcs?: { + projectId: string; + serviceAccountKey: string; + }; + }; +} + +export interface RepositoryStatus { + initialized: boolean; + reachable: boolean; + lastBackup?: string; + snapshotCount: number; +} + +export interface RepositoryStats { + repositorySize: number; + repositorySizeHuman: string; + snapshotCount: number; + fileCount: number; + uniqueChunks: number; + compressionRatio: number; + oldestSnapshot?: string; + latestSnapshot?: string; +} + +export async function getBackupConfiguration( + instanceId: string +): Promise<{ config: BackupConfiguration; status: RepositoryStatus }> { + const response = await api.get(`/instances/${instanceId}/backup/config`); + return response.data; +} + +export async function updateBackupConfiguration( + instanceId: string, + config: BackupConfigurationWithCredentials +): Promise { + await api.put(`/instances/${instanceId}/backup/config`, config); +} + +export async function testBackupConnection( + instanceId: string, + config: BackupConfigurationWithCredentials +): Promise { + const response = await api.post(`/instances/${instanceId}/backup/test`, config); + return response.data; +} + +export async function initializeBackupRepository( + instanceId: string, + config: BackupConfigurationWithCredentials +): Promise<{ repositoryId: string }> { + const response = await api.post(`/instances/${instanceId}/backup/init`, config); + return response.data; +} + +export async function getRepositoryStats( + instanceId: string +): Promise { + const response = await api.get(`/instances/${instanceId}/backup/stats`); + return response.data; +} +``` + +**Estimated Effort**: 2 hours + +--- + +### Task 2.5: Configuration UI Components + +Create the following components in `wild-web-app/src/components/backup/`: + 
+**BackupConfigurationCard.tsx**: +- Main configuration form +- Backend type selector +- Conditional credential inputs +- Retention policy inputs +- Test/Save/Cancel buttons + +**BackendSelector.tsx**: +- Dropdown for backend types +- Shows available backends with icons + +**CredentialsForm.tsx**: +- Dynamic form based on selected backend +- Password/key inputs with visibility toggle +- Validation + +**RepositoryStatus.tsx**: +- Display repository health +- Show stats (size, snapshots, last backup) +- Visual indicators + +**RetentionPolicyInputs.tsx**: +- Number inputs for retention periods +- Tooltips explaining each period + +**Estimated Effort**: 8 hours + +--- + +### Task 2.6: Integrate with BackupsPage + +**File**: `wild-web-app/src/router/pages/BackupsPage.tsx` + +**Add configuration section above backup list**: + +```typescript +function BackupsPage() { + const { instanceId } = useParams(); + const [showConfig, setShowConfig] = useState(false); + + const { data: backupConfig } = useQuery({ + queryKey: ['backup-config', instanceId], + queryFn: () => getBackupConfiguration(instanceId), + }); + + return ( +
+    <div>
+      {/* Repository Status Card */}
+      {backupConfig && (
+        <BackupStatusCard
+          config={backupConfig.config}
+          status={backupConfig.status}
+          onEdit={() => setShowConfig(true)}
+        />
+      )}
+
+      {/* Configuration Card (conditional) */}
+      {showConfig && (
+        <BackupConfigurationCard
+          config={backupConfig?.config}
+          onSave={() => setShowConfig(false)}
+          onCancel={() => setShowConfig(false)}
+        />
+      )}
+
+      {/* Existing backup list */}
+      <BackupListSection instanceId={instanceId} />
+    </div>
+ ); +} +``` + +**Estimated Effort**: 3 hours + +--- + +### Task 2.7: Backup Configuration API Handlers + +**File**: `wild-central-api/internal/api/v1/handlers_backup.go` + +**Add new handlers**: + +```go +func (h *Handler) BackupConfigGet(c *gin.Context) { + instanceName := c.Param("name") + + cfg, secrets, err := backup.LoadBackupConfig(instanceName) + if err != nil { + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + // Test repository status + var status backup.RepositoryStatus + if cfg.Repository != "" { + client := backup.NewResticClient(cfg, secrets) + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + status.Initialized = true + status.Reachable = client.TestConnection(ctx) == nil + + if stats, err := client.Stats(ctx); err == nil { + status.SnapshotCount = stats.SnapshotCount + } + } + + c.JSON(http.StatusOK, gin.H{ + "success": true, + "data": gin.H{ + "config": cfg, + "status": status, + }, + }) +} + +func (h *Handler) BackupConfigUpdate(c *gin.Context) { + instanceName := c.Param("name") + + var req backup.BackupConfigurationWithCredentials + if err := c.BindJSON(&req); err != nil { + c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()}) + return + } + + // Validate configuration + if err := backup.ValidateBackupConfig(&req.Config, &req.Secrets); err != nil { + c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()}) + return + } + + // Save to config.yaml and secrets.yaml + if err := config.SaveBackupConfig(instanceName, &req); err != nil { + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + c.JSON(http.StatusOK, gin.H{ + "success": true, + "message": "Backup configuration updated successfully", + }) +} + +func (h *Handler) BackupConnectionTest(c *gin.Context) { + var req backup.BackupConfigurationWithCredentials + if err := c.BindJSON(&req); err != nil { + c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()}) + return + } + + client := backup.NewResticClient(&req.Config, &req.Secrets) + + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + status := backup.RepositoryStatus{ + Reachable: client.TestConnection(ctx) == nil, + } + + if status.Reachable { + if stats, err := client.Stats(ctx); err == nil { + status.Initialized = true + status.SnapshotCount = stats.SnapshotCount + } + } + + c.JSON(http.StatusOK, gin.H{ + "success": true, + "data": status, + }) +} + +func (h *Handler) BackupRepositoryInit(c *gin.Context) { + var req backup.BackupConfigurationWithCredentials + if err := c.BindJSON(&req); err != nil { + c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()}) + return + } + + client := backup.NewResticClient(&req.Config, &req.Secrets) + + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + + if err := client.Init(ctx); err != nil { + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + c.JSON(http.StatusOK, gin.H{ + "success": true, + "message": "Repository initialized successfully", + }) +} + +func (h *Handler) BackupStatsGet(c *gin.Context) { + instanceName := c.Param("name") + + cfg, secrets, err := backup.LoadBackupConfig(instanceName) + if err != nil { + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + client := backup.NewResticClient(cfg, secrets) + + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + stats, err := client.Stats(ctx) + if err != nil 
{ + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + c.JSON(http.StatusOK, gin.H{ + "success": true, + "data": stats, + }) +} +``` + +**Register routes**: +```go +backupGroup := v1.Group("/instances/:name/backup") +{ + backupGroup.GET("/config", h.BackupConfigGet) + backupGroup.PUT("/config", h.BackupConfigUpdate) + backupGroup.POST("/test", h.BackupConnectionTest) + backupGroup.POST("/init", h.BackupRepositoryInit) + backupGroup.GET("/stats", h.BackupStatsGet) +} +``` + +**Estimated Effort**: 4 hours + +--- + +### Task 2.8: End-to-End Testing + +**Test scenarios**: +1. Configure local repository via UI +2. Configure S3 repository via UI +3. Test connection validation +4. Create backup and verify upload +5. Check repository stats +6. Test error handling + +**Estimated Effort**: 4 hours + +--- + +## Phase 3: Restore from Restic + +### Goal +Enable users to restore backups from restic snapshots. + +### Priority +🟢 **MEDIUM PRIORITY** (after Phase 2 complete) + +### Timeline +3-5 days + +### Task 3.1: List Snapshots API + +**File**: `wild-central-api/internal/api/v1/handlers_backup.go` + +**Implementation**: +```go +func (h *Handler) BackupSnapshotsList(c *gin.Context) { + instanceName := c.Param("name") + appName := c.Query("app") + + cfg, secrets, err := backup.LoadBackupConfig(instanceName) + if err != nil { + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + client := backup.NewResticClient(cfg, secrets) + + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + var tags []string + if appName != "" { + tags = append(tags, fmt.Sprintf("app:%s", appName)) + } + + snapshots, err := client.ListSnapshots(ctx, tags) + if err != nil { + c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()}) + return + } + + c.JSON(http.StatusOK, gin.H{ + "success": true, + "data": snapshots, + }) +} +``` + +**Estimated Effort**: 2 hours + +--- + +### Task 3.2: Restore Snapshot Function + +**File**: `wild-central-api/internal/backup/backup.go` + +**Implementation**: +```go +func (m *Manager) RestoreFromSnapshot(instanceName, snapshotID string) error { + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute) + defer cancel() + + // Load restic config + cfg, secrets, err := LoadBackupConfig(instanceName) + if err != nil { + return fmt.Errorf("failed to load config: %w", err) + } + + client := NewResticClient(cfg, secrets) + + // Create temp directory for restore + tempDir := filepath.Join(cfg.Staging, "restore", snapshotID) + if err := os.MkdirAll(tempDir, 0755); err != nil { + return fmt.Errorf("failed to create temp directory: %w", err) + } + defer os.RemoveAll(tempDir) + + // Restore snapshot to temp directory + if err := client.Restore(ctx, snapshotID, tempDir); err != nil { + return fmt.Errorf("restic restore failed: %w", err) + } + + // Parse metadata to determine what to restore + metadataFile := filepath.Join(tempDir, "backup.json") + info, err := m.loadBackupMetadata(metadataFile) + if err != nil { + return fmt.Errorf("failed to load metadata: %w", err) + } + + // Restore databases + for _, file := range info.Files { + if strings.HasSuffix(file, "postgres.sql") { + if err := m.restorePostgres(ctx, info.AppName, filepath.Join(tempDir, "postgres.sql")); err != nil { + return fmt.Errorf("postgres restore failed: %w", err) + } + } else if strings.HasSuffix(file, "mysql.sql") { + if err := m.restoreMySQL(ctx, info.AppName, filepath.Join(tempDir, "mysql.sql")); err != nil { 
+ return fmt.Errorf("mysql restore failed: %w", err) + } + } + } + + // Restore PVCs + for _, file := range info.Files { + if strings.HasSuffix(file, ".tar.gz") { + pvcName := strings.TrimSuffix(filepath.Base(file), ".tar.gz") + if err := m.restorePVC(ctx, info.AppName, pvcName, filepath.Join(tempDir, file)); err != nil { + return fmt.Errorf("pvc restore failed: %w", err) + } + } + } + + return nil +} + +func (m *Manager) restorePostgres(ctx context.Context, appName, dumpFile string) error { + dbName := appName + + podName, err := m.findPodInNamespace(ctx, "postgres", "app=postgres") + if err != nil { + return fmt.Errorf("postgres pod not found: %w", err) + } + + // Drop and recreate database + cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "postgres", + podName, "--", "psql", "-U", "postgres", "-c", + fmt.Sprintf("DROP DATABASE IF EXISTS %s; CREATE DATABASE %s;", dbName, dbName)) + + if err := cmd.Run(); err != nil { + return fmt.Errorf("failed to recreate database: %w", err) + } + + // Restore dump + dumpData, err := os.ReadFile(dumpFile) + if err != nil { + return fmt.Errorf("failed to read dump: %w", err) + } + + cmd = exec.CommandContext(ctx, "kubectl", "exec", "-i", "-n", "postgres", + podName, "--", "psql", "-U", "postgres", dbName) + cmd.Stdin = strings.NewReader(string(dumpData)) + + if err := cmd.Run(); err != nil { + return fmt.Errorf("psql restore failed: %w", err) + } + + return nil +} + +func (m *Manager) restoreMySQL(ctx context.Context, appName, dumpFile string) error { + // Similar implementation to restorePostgres + // Use mysqldump with password from secret + return nil +} + +func (m *Manager) restorePVC(ctx context.Context, namespace, pvcName, tarFile string) error { + podName, err := m.findPodUsingPVC(ctx, namespace, pvcName) + if err != nil { + return fmt.Errorf("no pod found using PVC: %w", err) + } + + mountPath, err := m.getPVCMountPath(ctx, namespace, podName, pvcName) + if err != nil { + return fmt.Errorf("failed to get mount path: %w", err) + } + + // Copy tar file to pod + cmd := exec.CommandContext(ctx, "kubectl", "cp", tarFile, + fmt.Sprintf("%s/%s:/tmp/restore.tar.gz", namespace, podName)) + + if err := cmd.Run(); err != nil { + return fmt.Errorf("kubectl cp failed: %w", err) + } + + // Extract tar file + cmd = exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace, + podName, "--", "tar", "xzf", "/tmp/restore.tar.gz", "-C", mountPath) + + if err := cmd.Run(); err != nil { + return fmt.Errorf("tar extract failed: %w", err) + } + + // Clean up temp file + cmd = exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace, + podName, "--", "rm", "/tmp/restore.tar.gz") + cmd.Run() // Ignore error + + return nil +} +``` + +**Estimated Effort**: 5 hours + +--- + +### Task 3.3: Restore API Handler + +**File**: `wild-central-api/internal/api/v1/handlers_backup.go` + +**Implementation**: +```go +func (h *Handler) BackupSnapshotRestore(c *gin.Context) { + instanceName := c.Param("name") + snapshotID := c.Param("snapshotId") + + // Start restore operation asynchronously + go func() { + if err := h.backupManager.RestoreFromSnapshot(instanceName, snapshotID); err != nil { + log.Printf("Restore failed: %v", err) + } + }() + + c.JSON(http.StatusAccepted, gin.H{ + "success": true, + "message": "Restore operation started", + }) +} +``` + +**Estimated Effort**: 1 hour + +--- + +### Task 3.4: Restore UI + +**File**: `wild-web-app/src/components/backup/RestoreDialog.tsx` + +**Implementation**: +Create dialog that: +- Lists available snapshots +- Shows snapshot 
details (date, size, files) +- Confirmation before restore +- Progress indicator + +**Estimated Effort**: 4 hours + +--- + +### Task 3.5: End-to-End Restore Testing + +**Test scenarios**: +1. List snapshots for app +2. Select snapshot to restore +3. Restore database +4. Restore PVCs +5. Verify application works after restore +6. Test error handling + +**Estimated Effort**: 3 hours + +--- + +## API Specifications + +### Complete API Reference + +``` +# Backup Operations +POST /api/v1/instances/{name}/backups/app/{appName} # Create app backup +POST /api/v1/instances/{name}/backups/cluster # Create cluster backup +GET /api/v1/instances/{name}/backups/app # List app backups +GET /api/v1/instances/{name}/backups/cluster # List cluster backups +DELETE /api/v1/instances/{name}/backups/app/{appName}/{id} # Delete app backup +DELETE /api/v1/instances/{name}/backups/cluster/{id} # Delete cluster backup + +# Backup Configuration (Phase 2) +GET /api/v1/instances/{name}/backup/config # Get backup configuration +PUT /api/v1/instances/{name}/backup/config # Update configuration +POST /api/v1/instances/{name}/backup/test # Test connection +POST /api/v1/instances/{name}/backup/init # Initialize repository +GET /api/v1/instances/{name}/backup/stats # Get repository stats + +# Restore Operations (Phase 3) +GET /api/v1/instances/{name}/backup/snapshots # List snapshots +POST /api/v1/instances/{name}/backup/snapshots/{id}/restore # Restore snapshot +``` + +--- + +## Web UI Design + +### Page Structure + +**BackupsPage Layout**: +``` +┌─────────────────────────────────────────────────┐ +│ Backups │ +├─────────────────────────────────────────────────┤ +│ │ +│ ┌─ Backup Status ─────────────────────────┐ │ +│ │ Repository: Configured ✓ │ │ +│ │ Last Backup: 2 hours ago │ │ +│ │ Total Size: 2.4 GB │ │ +│ │ Snapshots: 24 │ │ +│ │ [Edit Configuration] │ │ +│ └─────────────────────────────────────────┘ │ +│ │ +│ ┌─ Recent Backups ────────────────────────┐ │ +│ │ [Backup cards with restore/delete] │ │ +│ │ ... │ │ +│ └─────────────────────────────────────────┘ │ +│ │ +│ ┌─ Configuration (when editing) ──────────┐ │ +│ │ Backend Type: [S3 ▼] │ │ +│ │ Repository URI: [s3:bucket/path ] │ │ +│ │ Credentials: │ │ +│ │ Access Key ID: [••••••••••• ] │ │ +│ │ Secret Key: [•••••••••••••••• ] │ │ +│ │ Retention Policy: │ │ +│ │ Daily: [7] Weekly: [4] Monthly: [6] │ │ +│ │ [Test Connection] [Save] [Cancel] │ │ +│ └─────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────┘ +``` + +### Component Hierarchy + +``` +BackupsPage +├── BackupStatusCard (read-only) +│ ├── RepositoryStatus +│ ├── Stats (size, snapshots, last backup) +│ └── EditButton +│ +├── BackupListSection +│ └── BackupCard[] (existing) +│ +└── BackupConfigurationCard (conditional) + ├── BackendTypeSelect + ├── RepositoryUriInput + ├── CredentialsSection + │ ├── S3CredentialsForm (conditional) + │ ├── SFTPCredentialsForm (conditional) + │ └── ... + ├── RetentionPolicyInputs + └── ActionButtons + ├── TestConnectionButton + ├── SaveButton + └── CancelButton +``` + +--- + +## Testing Strategy + +### Phase 1 Testing + +**Unit Tests**: +- Manifest parsing +- Helper functions (contains, findPodInNamespace) +- Backup file creation + +**Integration Tests**: +- End-to-end Gitea backup (PostgreSQL + PVC) +- End-to-end Immich backup (PostgreSQL + multiple PVCs) +- Backup with no database +- Backup with no PVCs + +**Manual Tests**: +1. Create backup via web UI +2. Verify `.sql` file exists with actual data +3. 
Verify `.tar.gz` files exist with actual data +4. Check metadata accuracy +5. Test delete functionality + +### Phase 2 Testing + +**Unit Tests**: +- Backend type detection +- Environment variable mapping +- Configuration validation + +**Integration Tests**: +- Repository initialization (local, S3, SFTP) +- Backup upload to restic +- Snapshot listing +- Stats retrieval +- Connection testing + +**Manual Tests**: +1. Configure local repository via UI +2. Configure S3 repository via UI +3. Test connection validation before save +4. Create backup and verify in restic +5. Check repository stats display +6. Test error handling for bad credentials + +### Phase 3 Testing + +**Integration Tests**: +- Restore database from snapshot +- Restore PVC from snapshot +- Full app restore +- Handle missing/corrupted snapshots + +**Manual Tests**: +1. List snapshots in UI +2. Select and restore from snapshot +3. Verify database data after restore +4. Verify PVC data after restore +5. Verify application functions correctly + +--- + +## Deployment Guide + +### Phase 1 Deployment + +**Preparation**: +1. Update wild-central-api code +2. Build and test on development instance +3. Verify backup files created with real data +4. Test manual restore + +**Rollout**: +1. Deploy to staging environment +2. Create test backups for multiple apps +3. Verify all backup files exist +4. Manually restore one backup to verify +5. Deploy to production + +**Rollback Plan**: +- Previous version still creates metadata files +- No breaking changes to backup structure +- Users can manually copy backup files if needed + +### Phase 2 Deployment + +**Preparation**: +1. Install restic on Wild Central devices: `apt install restic` +2. Update wild-central-api with restic code +3. Update wild-web-app with configuration UI +4. Test on development with local repository +5. Test with S3 and SFTP backends + +**Migration**: +- Existing local backups remain accessible +- Users opt-in to restic by configuring repository +- Gradual migration: Phase 1 staging continues working + +**Rollout**: +1. Deploy backend API updates +2. Deploy web UI updates +3. Create user documentation with examples +4. Provide migration guide for existing setups + +**Rollback Plan**: +- Restic is optional: users can continue using local backups +- Configuration in config.yaml: easy to revert +- No data loss: existing backups preserved + +### Phase 3 Deployment + +**Preparation**: +1. Ensure Phase 2 is stable +2. Ensure at least one backup exists in restic +3. Test restore in staging environment + +**Rollout**: +1. Deploy restore functionality +2. Document restore procedures +3. 
Train users on restore process + +--- + +## Task Breakdown + +### Phase 1 Tasks (2-3 days) + +| Task | Description | Effort | Dependencies | +|------|-------------|--------|--------------| +| 1.1 | Manifest-based database detection | 2h | None | +| 1.2 | PostgreSQL backup via kubectl exec | 3h | 1.1 | +| 1.3 | MySQL backup via kubectl exec | 3h | 1.1 | +| 1.4 | PVC discovery and backup | 4h | 1.1 | +| 1.5 | Update BackupApp flow | 4h | 1.2, 1.3, 1.4 | +| 1.6 | Build and test | 4h | 1.5 | + +**Total**: 20 hours (2.5 days) + +### Phase 2 Tasks (5-7 days) + +| Task | Description | Effort | Dependencies | +|------|-------------|--------|--------------| +| 2.1 | Configuration management | 3h | Phase 1 done | +| 2.2 | Restic operations module | 4h | 2.1 | +| 2.3 | Update backup flow for restic | 2h | 2.2 | +| 2.4 | API client updates | 2h | Phase 1 done | +| 2.5 | Configuration UI components | 8h | 2.4 | +| 2.6 | Integrate with BackupsPage | 3h | 2.5 | +| 2.7 | Backup configuration API handlers | 4h | 2.1, 2.2 | +| 2.8 | End-to-end testing | 4h | 2.3, 2.6, 2.7 | + +**Total**: 30 hours (3.75 days) + +### Phase 3 Tasks (3-5 days) + +| Task | Description | Effort | Dependencies | +|------|-------------|--------|--------------| +| 3.1 | List snapshots API | 2h | Phase 2 done | +| 3.2 | Restore snapshot function | 5h | 3.1 | +| 3.3 | Restore API handler | 1h | 3.2 | +| 3.4 | Restore UI | 4h | 3.3 | +| 3.5 | End-to-end restore testing | 3h | 3.4 | + +**Total**: 15 hours (2 days) + +### Grand Total +**65 hours** across 3 phases (8-12 days total) + +--- + +## Success Criteria + +### Phase 1 Success +- ✅ App backups create actual database dumps (`.sql` files) +- ✅ App backups create actual PVC archives (`.tar.gz` files) +- ✅ Backup metadata accurately lists all files +- ✅ Backups organized in timestamped directories +- ✅ In-progress tracking works correctly +- ✅ Delete functionality works for both app and cluster backups +- ✅ No silent failures (clear error messages) +- ✅ Manual restore verified working + +### Phase 2 Success +- ✅ Users can configure restic repository via web UI +- ✅ Configuration persists to config.yaml/secrets.yaml +- ✅ Test connection validates before save +- ✅ Backups automatically upload to restic repository +- ✅ Repository stats display correctly in UI +- ✅ Local, S3, and SFTP backends supported and tested +- ✅ Clear error messages for authentication/connection failures +- ✅ Staging files cleaned after successful upload + +### Phase 3 Success +- ✅ Users can list available snapshots in UI +- ✅ Users can restore from any snapshot via UI +- ✅ Database restoration works correctly +- ✅ PVC restoration works correctly +- ✅ Application functional after restore +- ✅ Error handling for corrupted snapshots + +### Long-Term Metrics +- **Storage Efficiency**: Deduplication achieves 60-80% space savings +- **Reliability**: < 1% backup failures +- **Performance**: Backup TB-scale data in < 4 hours +- **User Satisfaction**: Backup/restore completes without support intervention + +--- + +## Dependencies and Prerequisites + +### External Dependencies + +**Restic** (backup tool): +- Installation: `apt install restic` +- Version: >= 0.16.0 recommended +- License: BSD 2-Clause (compatible) + +**kubectl** (Kubernetes CLI): +- Already required for Wild Cloud operations +- Used for database dumps and PVC backup + +### Infrastructure Prerequisites + +**Storage Requirements**: + +**Staging Directory**: +- Location: `/var/lib/wild-central/backup-staging` (default) +- Space: `max(largest_database, 
+
+**Network Requirements**:
+- Outbound HTTPS (443) for S3/B2/cloud backends
+- Outbound SSH (22 or custom) for SFTP
+- No inbound ports needed
+
+### Security Considerations
+
+**Credentials Storage**:
+- Stored in secrets.yaml
+- Never logged or exposed in API responses
+- Transmitted only via HTTPS to backend APIs
+
+**Encryption**:
+- Restic: AES-256 encryption of all backup data
+- Transport: TLS for cloud backends, SSH for SFTP
+- At rest: Depends on backend (S3 server-side encryption, etc.)
+
+**Access Control**:
+- API endpoints check instance ownership
+- Repository password required for all restic operations
+- Backend credentials validated before save
+
+---
+
+## Philosophy Compliance Review
+
+### KISS (Keep It Simple, Stupid)
+
+✅ **What We're Doing Right**:
+- Restic repository URI as a simple string (restic's native format)
+- Backend type auto-detected from URI prefix
+- Credentials organized by backend type
+- No complex abstraction layers
+
+✅ **What We're Avoiding**:
+- Custom backup format
+- Complex configuration DSL
+- Over-abstracted backend interfaces
+- Scheduling/automation (not needed yet)
+
+### YAGNI (You Aren't Gonna Need It)
+
+✅ **Building Only What's Needed**:
+- Basic configuration (repository, credentials, retention)
+- Test connection before save
+- Upload to restic after staging
+- Display repository stats
+
+❌ **Not Building** (until proven needed):
+- Automated scheduling
+- Multiple repository support
+- Backup verification automation
+- Email notifications
+- Bandwidth limiting
+- Custom encryption options
+
+### No Future-Proofing
+
+✅ **Current Requirements Only**:
+- Support TB-scale data (restic deduplication)
+- Flexible storage destinations (restic backends)
+- Storage constraints (upload to remote, not local-only)
+
+❌ **Not Speculating On**:
+- "What if users want backup versioning rules?"
+- "What if users need bandwidth control?"
+- "What if users want custom encryption?"
+- Build these features WHEN users ask, not before
+
+### Trust in Emergence
+
+✅ **Starting Simple**:
+- Phase 1: Fix core backup (files actually created)
+- Phase 2: Add restic upload (storage flexibility)
+- Phase 3: Add restore from restic
+- Phase 4+: Wait for user feedback
+
+**Let complexity emerge from actual needs**, not speculation.
+
+---
+
+## Conclusion
+
+This guide provides everything needed to implement a production-ready backup system for Wild Cloud across three phases:
+
+1. **Phase 1 (CRITICAL)**: Fix broken app backups by creating actual database dumps and PVC archives using manifest-based detection and kubectl exec
+2. **Phase 2 (HIGH)**: Integrate restic for TB-scale data, flexible storage backends, and configuration via web UI
+3. **Phase 3 (MEDIUM)**: Enable restore from restic snapshots
+
+All phases are designed following Wild Cloud's KISS/YAGNI philosophy: build only what's needed now, let complexity emerge from actual requirements, and trust that good architecture emerges from simplicity.
+
+With all necessary context, specifications, code examples, and guidance provided, a senior engineer can begin Phase 1 immediately.
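+
+For orientation, the three phases wrap ordinary `kubectl` and `restic` invocations. A minimal end-to-end sketch follows; the app name, namespace, pod names, mount path, timestamp, and repository are illustrative placeholders, with real values coming from the app manifest, config.yaml, and secrets.yaml:
+
+```bash
+# Phase 1: stage a database dump and a PVC archive via kubectl exec
+STAGING=/var/lib/wild-central/backup-staging/myapp/2025-11-26T120000
+mkdir -p "$STAGING"
+kubectl exec -n myapp deploy/myapp-postgres -- \
+  pg_dump -U myapp myapp > "$STAGING/database.sql"
+kubectl exec -n myapp myapp-0 -- \
+  tar czf - /data > "$STAGING/data-pvc.tar.gz"
+
+# Phase 2: upload the staged directory to the configured restic repository
+# (S3 backends also need AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY)
+export RESTIC_REPOSITORY="s3:s3.amazonaws.com/my-wild-cloud-backups"
+export RESTIC_PASSWORD="<from secrets.yaml>"
+restic backup "$STAGING" --tag myapp
+
+# Phase 3: restore the latest snapshot for the app (manual equivalent of the restore UI)
+restic restore latest --tag myapp --target /tmp/restore-myapp
+```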
+ +--- + +**Document Version**: 1.0 +**Created**: 2025-11-26 +**Status**: Ready for implementation +**Next Action**: Begin Phase 1, Task 1.1 diff --git a/future/independent-versioning.md b/docs/future/independent-versioning.md similarity index 100% rename from future/independent-versioning.md rename to docs/future/independent-versioning.md