Wild Cloud Backup System - Complete Implementation Guide

Date: 2025-11-26
Status: 📋 READY FOR IMPLEMENTATION
Estimated Effort: Phase 1: 2-3 days | Phase 2: 5-7 days | Phase 3: 3-5 days


Table of Contents

  1. Executive Summary
  2. Background and Context
  3. Problem Analysis
  4. Architecture Overview
  5. Configuration Design
  6. Phase 1: Core Backup Fix
  7. Phase 2: Restic Integration
  8. Phase 3: Restore from Restic
  9. API Specifications
  10. Web UI Design
  11. Testing Strategy
  12. Deployment Guide
  13. Task Breakdown
  14. Success Criteria

Executive Summary

Current State

App backups are completely broken - they create only metadata files (backup.json) without any actual backup data:

  • No database dump files (.sql, .dump)
  • No PVC archive files (.tar.gz)
  • Users cannot restore from these "backups"
  • Cluster backups work correctly (different code path)

Root Cause

Database detection uses pod label-based discovery (app=gitea in postgres namespace), but database pods are shared infrastructure labeled app=postgres. Detection always returns empty, so no backups are created.

Why This Matters

  • Scale: Applications like Immich may host terabyte-scale photo libraries
  • Storage: Wild Central devices may not have sufficient local storage
  • Flexibility: Destinations must be configurable: local, NFS, S3, Backblaze B2, SFTP, etc.
  • Deduplication: Critical for TB-scale data (60-80% space savings)

Solution: Three-Phase Approach

Phase 1 (CRITICAL - 2-3 days): Fix broken app backups

  • Manifest-based database detection (declarative)
  • kubectl exec for database dumps
  • PVC discovery and backup
  • Store files locally in staging directory

Phase 2 (HIGH PRIORITY - 5-7 days): Restic integration

  • Upload staged files to restic repository
  • Configuration via config.yaml and web UI
  • Support multiple backends (local, S3, B2, SFTP)
  • Repository initialization and testing

Phase 3 (MEDIUM PRIORITY - 3-5 days): Restore from restic

  • List available snapshots
  • Restore from any snapshot
  • Database and PVC restoration
  • Web UI for restore operations

Background and Context

Project Philosophy

Wild Cloud follows strict KISS/YAGNI principles:

  • KISS: Keep implementations as simple as possible
  • YAGNI: Build only what's needed now, not speculative features
  • No future-proofing: Let complexity emerge from actual requirements
  • Trust in emergence: Start simple, enhance when requirements proven

Key Design Decisions

  1. Manifest-based detection: Read app dependencies from manifest.yaml (declarative), not runtime pod discovery
  2. kubectl exec approach: Use standard Kubernetes operations for dumps and tar archives
  3. Restic for scale: Use battle-tested restic tool for TB-scale data and flexible backends
  4. Phased implementation: Fix core bugs first, add features incrementally

Why Restic?

Justified by actual requirements (not premature optimization):

  • Scale: Handle TB-scale data (Immich with terabytes of photos)
  • Flexibility: Multiple backends (local, S3, B2, SFTP, Azure, GCS)
  • Efficiency: 60-80% space savings via deduplication
  • Security: Built-in AES-256 encryption
  • Reliability: Battle-tested, widely adopted
  • Incremental: Only backup changed blocks

Problem Analysis

Critical Bug: App Backups Create No Files

Evidence from /home/payne/repos/wild-cloud-dev/.working/in-progress-fix.md:

Backup structure:
apps/
└── gitea/
    └── 20241124T143022Z/
        └── backup.json    ← Only this file exists!

Expected structure:
apps/
└── gitea/
    └── 20241124T143022Z/
        ├── backup.json
        ├── postgres.sql    ← Missing!
        └── data.tar.gz     ← Missing!

Root Cause Analysis

File: wild-central-api/internal/backup/backup.go (lines 544-569)

func (m *Manager) detectDatabaseType(ctx context.Context, namespace, appLabel string) (string, error) {
    // This looks for pods with label "app=gitea" in namespace "postgres"
    // But database pods are labeled "app=postgres" in namespace "postgres"
    // This ALWAYS returns empty result!

    cmd := exec.CommandContext(ctx, "kubectl", "get", "pods",
        "-n", namespace,
        "-l", fmt.Sprintf("app=%s", appLabel),  // ← Wrong label!
        "-o", "jsonpath={.items[0].metadata.name}")

    output, err := cmd.Output()
    if err != nil || len(output) == 0 {
        return "", nil  // ← Returns empty, no backup created
    }
    // ...
}

Why It's Broken:

  1. Gitea backup tries to find pod with label app=gitea in namespace postgres
  2. But PostgreSQL pod is labeled app=postgres in namespace postgres
  3. Detection always fails → no database dump created
  4. Same problem for PVC detection → no PVC archive created
  5. Only backup.json metadata file is written

Why Cluster Backups Work

Cluster backups don't use app-specific detection:

  • Directly use kubectl get to find etcd pods
  • Use hardcoded paths for config files
  • Don't rely on app-based pod labels
  • Actually create .tar.gz files with real data

Architecture Overview

System Components

┌─────────────────────────────────────────────────────────┐
│ Wild Cloud Backup System                                │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─ Web UI (wild-web-app) ─────────────────────┐     │
│  │ - Backup configuration form                  │     │
│  │ - Repository status display                  │     │
│  │ - Backup creation/restore UI                 │     │
│  └──────────────────┬───────────────────────────┘     │
│                     │ REST API                         │
│  ┌─ API Layer (wild-central-api) ───────────────┐     │
│  │ - Backup configuration endpoints              │     │
│  │ - Backup/restore operation handlers           │     │
│  │ - Restic integration layer                    │     │
│  └──────────────────┬───────────────────────────┘     │
│                     │                                   │
│  ┌─ Backup Engine ────────────────────────────┐       │
│  │ - Manifest parser                           │       │
│  │ - Database backup (kubectl exec pg_dump)   │       │
│  │ - PVC backup (kubectl exec tar)            │       │
│  │ - Restic upload (Phase 2)                  │       │
│  └──────────────────┬───────────────────────────┘     │
│                     │                                   │
│  ┌─ Storage Layer ────────────────────────────┐       │
│  │ Phase 1: Local staging directory            │       │
│  │ Phase 2: Restic repository (local/remote)  │       │
│  └─────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────┘

Data Flow

Phase 1 (Local Staging):

User clicks "Backup" → API Handler
                          ↓
                    Read manifest.yaml (detect databases)
                          ↓
                    kubectl exec pg_dump → postgres.sql
                          ↓
                    kubectl exec tar → pvc-data.tar.gz
                          ↓
                    Save to /var/lib/wild-central/backup-staging/
                          ↓
                    Write backup.json metadata

Phase 2 (Restic Upload):

[Same as Phase 1] →  Local staging files created
                          ↓
                    restic backup <staging-dir>
                          ↓
                    Upload to repository (S3/B2/local/etc)
                          ↓
                    Clean staging directory
                          ↓
                    Update metadata with snapshot ID

Phase 3 (Restore):

User selects snapshot → restic restore <snapshot-id>
                          ↓
                    Download to staging directory
                          ↓
                    kubectl exec psql < postgres.sql
                          ↓
                    kubectl cp tar file → pod
                          ↓
                    kubectl exec tar -xzf → restore PVC data

Configuration Design

Schema: config.yaml

cloud:
  domain: "wildcloud.local"
  dns:
    ip: "192.168.8.50"

  backup:
    # Restic repository location (native restic URI format)
    repository: "/mnt/backups/wild-cloud"  # or "s3:bucket" or "sftp:user@host:/path"

    # Local staging directory (always on Wild Central filesystem)
    staging: "/var/lib/wild-central/backup-staging"

    # Retention policy (restic forget flags)
    retention:
      keepDaily: 7
      keepWeekly: 4
      keepMonthly: 6
      keepYearly: 2

    # Backend-specific configuration (optional, backend-dependent)
    backend:
      # For S3-compatible backends (B2, Wasabi, MinIO)
      endpoint: "s3.us-west-002.backblazeb2.com"
      region: "us-west-002"

      # For SFTP
      port: 22

Schema: secrets.yaml

cloud:
  backup:
    # Restic repository encryption password
    password: "strong-encryption-password"

    # Backend credentials (conditional on backend type)
    credentials:
      # For S3/B2/S3-compatible (auto-detected from repository prefix)
      s3:
        accessKeyId: "KEY_ID"
        secretAccessKey: "SECRET_KEY"

      # For SFTP
      sftp:
        password: "ssh-password"
        # OR
        privateKey: |
          -----BEGIN OPENSSH PRIVATE KEY-----
          ...
          -----END OPENSSH PRIVATE KEY-----

      # For Azure
      azure:
        accountName: "account"
        accountKey: "key"

      # For Google Cloud
      gcs:
        projectId: "project-id"
        serviceAccountKey: |
          { "type": "service_account", ... }

Configuration Examples

Example 1: Local Testing

config.yaml:

cloud:
  backup:
    repository: "/mnt/external-drive/wild-cloud-backups"
    staging: "/var/lib/wild-central/backup-staging"
    retention:
      keepDaily: 7
      keepWeekly: 4
      keepMonthly: 6

secrets.yaml:

cloud:
  backup:
    password: "test-backup-password-123"

Example 2: Backblaze B2

config.yaml:

cloud:
  backup:
    repository: "b2:wild-cloud-backups"
    staging: "/var/lib/wild-central/backup-staging"
    retention:
      keepDaily: 7
      keepWeekly: 4
      keepMonthly: 6
    backend:
      endpoint: "s3.us-west-002.backblazeb2.com"
      region: "us-west-002"

secrets.yaml:

cloud:
  backup:
    password: "strong-encryption-password"
    credentials:
      s3:
        accessKeyId: "0020123456789abcdef"
        secretAccessKey: "K002abcdefghijklmnop"

Example 3: AWS S3

config.yaml:

cloud:
  backup:
    repository: "s3:s3.amazonaws.com/my-wild-cloud-backups"
    staging: "/var/lib/wild-central/backup-staging"
    retention:
      keepDaily: 14
      keepWeekly: 8
      keepMonthly: 12
    backend:
      region: "us-east-1"

secrets.yaml:

cloud:
  backup:
    password: "prod-encryption-password"
    credentials:
      s3:
        accessKeyId: "AKIAIOSFODNN7EXAMPLE"
        secretAccessKey: "wJalrXUtnFEMI/K7MDENG/bPxRfiCY"

Example 4: SFTP Remote Server

config.yaml:

cloud:
  backup:
    repository: "sftp:backup-user@backup.example.com:/wild-cloud-backups"
    staging: "/var/lib/wild-central/backup-staging"
    retention:
      keepDaily: 7
      keepWeekly: 4
      keepMonthly: 6
    backend:
      port: 2222

secrets.yaml:

cloud:
  backup:
    password: "restic-repo-password"
    credentials:
      sftp:
        privateKey: |
          -----BEGIN OPENSSH PRIVATE KEY-----
          ...
          -----END OPENSSH PRIVATE KEY-----

Example 5: NFS/SMB Mount (as Local Path)

config.yaml:

cloud:
  backup:
    repository: "/mnt/nas-backups/wild-cloud"  # NFS mounted via OS
    staging: "/var/lib/wild-central/backup-staging"
    retention:
      keepDaily: 7
      keepWeekly: 4
      keepMonthly: 6

secrets.yaml:

cloud:
  backup:
    password: "backup-encryption-password"

Backend Detection Logic

func DetectBackendType(repository string) string {
    if strings.HasPrefix(repository, "/") {
        return "local"
    } else if strings.HasPrefix(repository, "sftp:") {
        return "sftp"
    } else if strings.HasPrefix(repository, "s3:") || strings.HasPrefix(repository, "b2:") {
        return "s3"
    } else if strings.HasPrefix(repository, "azure:") {
        return "azure"
    } else if strings.HasPrefix(repository, "gs:") {
        return "gcs"
    } else if strings.HasPrefix(repository, "rclone:") {
        return "rclone"
    }
    return "unknown"
}
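
The prefix checks can be pinned down with a small table-driven test. A sketch (file placement in the backup package is assumed) using the example repository URIs from this document:

package backup

import "testing"

func TestDetectBackendType(t *testing.T) {
    cases := map[string]string{
        "/mnt/backups/wild-cloud": "local",
        "sftp:backup-user@backup.example.com:/wild-cloud-backups": "sftp",
        "s3:s3.amazonaws.com/my-wild-cloud-backups": "s3",
        "b2:wild-cloud-backups": "s3", // B2 is treated as S3-compatible here
        "rclone:remote:wild-cloud": "rclone",
        "ftp://example.com/backups": "unknown",
    }

    for repo, want := range cases {
        if got := DetectBackendType(repo); got != want {
            t.Errorf("DetectBackendType(%q) = %q, want %q", repo, got, want)
        }
    }
}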

Environment Variable Mapping

func BuildResticEnv(config BackupConfig, secrets BackupSecrets) map[string]string {
    env := map[string]string{
        "RESTIC_REPOSITORY": config.Repository,
        "RESTIC_PASSWORD":   secrets.Password,
    }

    backendType := DetectBackendType(config.Repository)

    switch backendType {
    case "s3":
        env["AWS_ACCESS_KEY_ID"] = secrets.Credentials.S3.AccessKeyID
        env["AWS_SECRET_ACCESS_KEY"] = secrets.Credentials.S3.SecretAccessKey

        if config.Backend.Endpoint != "" {
            env["AWS_S3_ENDPOINT"] = config.Backend.Endpoint
        }
        if config.Backend.Region != "" {
            env["AWS_DEFAULT_REGION"] = config.Backend.Region
        }

    case "sftp":
        if secrets.Credentials.SFTP.Password != "" {
            env["RESTIC_SFTP_PASSWORD"] = secrets.Credentials.SFTP.Password
        }
        // SSH key handling done via temp file

    case "azure":
        env["AZURE_ACCOUNT_NAME"] = secrets.Credentials.Azure.AccountName
        env["AZURE_ACCOUNT_KEY"] = secrets.Credentials.Azure.AccountKey

    case "gcs":
        // Write service account key to temp file, set GOOGLE_APPLICATION_CREDENTIALS
    }

    return env
}

Phase 1: Core Backup Fix

Goal

Fix critical bugs and create actual backup files (no restic yet).

Priority

🔴 CRITICAL - Users cannot restore from current backups

Timeline

2-3 days

Overview

Replace broken pod label-based detection with manifest-based detection. Use kubectl exec to create actual database dumps and PVC archives.

Task 1.1: Implement Manifest-Based Database Detection

File: wild-central-api/internal/backup/backup.go

Add New Structures:

type AppDependencies struct {
    HasPostgres bool
    HasMySQL    bool
    HasRedis    bool
}

Implement Detection Function:

func (m *Manager) getAppDependencies(appName string) (*AppDependencies, error) {
    manifestPath := filepath.Join(m.directoryPath, appName, "manifest.yaml")

    manifest, err := directory.LoadManifest(manifestPath)
    if err != nil {
        return nil, fmt.Errorf("failed to load manifest: %w", err)
    }

    deps := &AppDependencies{
        HasPostgres: contains(manifest.Requires, "postgres"),
        HasMySQL:    contains(manifest.Requires, "mysql"),
        HasRedis:    contains(manifest.Requires, "redis"),
    }

    return deps, nil
}

func contains(slice []string, item string) bool {
    for _, s := range slice {
        if s == item {
            return true
        }
    }
    return false
}

Changes Required:

  • Add import: "github.com/wild-cloud/wild-central/daemon/internal/directory"
  • Remove old detectDatabaseType() function (lines 544-569)

Acceptance Criteria:

  • Reads manifest.yaml for app
  • Correctly identifies postgres dependency
  • Correctly identifies mysql dependency
  • Returns error if manifest not found
  • Unit test: parse manifest with postgres
  • Unit test: parse manifest without databases
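
A minimal unit-test sketch for the mapping logic, building the flags directly from a requires list so it runs without a real manifest file (assumes a _test.go file in the backup package that imports "testing"):

func TestDependencyMapping(t *testing.T) {
    requires := []string{"postgres", "redis"}

    deps := &AppDependencies{
        HasPostgres: contains(requires, "postgres"),
        HasMySQL:    contains(requires, "mysql"),
        HasRedis:    contains(requires, "redis"),
    }
    if !deps.HasPostgres || deps.HasMySQL || !deps.HasRedis {
        t.Fatalf("unexpected dependency mapping: %+v", deps)
    }

    // An app with no database dependencies should yield all-false flags
    if contains(nil, "postgres") || contains(nil, "mysql") {
        t.Fatalf("expected no dependencies for empty requires list")
    }
}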

Estimated Effort: 2 hours


Task 1.2: Implement PostgreSQL Backup via kubectl exec

File: wild-central-api/internal/backup/backup.go

Implementation:

func (m *Manager) backupPostgres(ctx context.Context, instanceName, appName, backupDir string) (string, error) {
    dbName := appName // Database name convention

    // Find postgres pod in postgres namespace
    podName, err := m.findPodInNamespace(ctx, "postgres", "app=postgres")
    if err != nil {
        return "", fmt.Errorf("postgres pod not found: %w", err)
    }

    // Execute pg_dump
    dumpFile := filepath.Join(backupDir, "postgres.sql")
    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "postgres",
        podName, "--", "pg_dump", "-U", "postgres", dbName)

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("pg_dump failed: %w", err)
    }

    // Write dump to file
    if err := os.WriteFile(dumpFile, output, 0600); err != nil {
        return "", fmt.Errorf("failed to write dump: %w", err)
    }

    return dumpFile, nil
}

// Helper function to find pod by label
func (m *Manager) findPodInNamespace(ctx context.Context, namespace, labelSelector string) (string, error) {
    cmd := exec.CommandContext(ctx, "kubectl", "get", "pods",
        "-n", namespace,
        "-l", labelSelector,
        "-o", "jsonpath={.items[0].metadata.name}")

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("kubectl get pods failed: %w", err)
    }

    podName := strings.TrimSpace(string(output))
    if podName == "" {
        return "", fmt.Errorf("no pod found with label %s in namespace %s", labelSelector, namespace)
    }

    return podName, nil
}
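
Note that cmd.Output() buffers the entire dump in memory before it is written to disk, which can be a problem for large databases on a small Wild Central device. A possible variant (sketch only; backupPostgresStreaming is not part of the plan above) streams the dump straight to the file:

// Sketch: stream pg_dump output directly to disk instead of buffering it.
func (m *Manager) backupPostgresStreaming(ctx context.Context, appName, backupDir string) (string, error) {
    dbName := appName

    podName, err := m.findPodInNamespace(ctx, "postgres", "app=postgres")
    if err != nil {
        return "", fmt.Errorf("postgres pod not found: %w", err)
    }

    dumpFile := filepath.Join(backupDir, "postgres.sql")
    f, err := os.OpenFile(dumpFile, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
    if err != nil {
        return "", fmt.Errorf("failed to create dump file: %w", err)
    }
    defer f.Close()

    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "postgres",
        podName, "--", "pg_dump", "-U", "postgres", dbName)
    cmd.Stdout = f // write the dump as it is produced

    if err := cmd.Run(); err != nil {
        return "", fmt.Errorf("pg_dump failed: %w", err)
    }

    return dumpFile, nil
}

The same concern applies to the tar-based PVC archives in Task 1.4; a matching sketch is included there.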

Acceptance Criteria:

  • Finds postgres pod correctly
  • Executes pg_dump successfully
  • Creates .sql file with actual data
  • Handles errors gracefully
  • Integration test: backup Gitea database

Estimated Effort: 3 hours


Task 1.3: Implement MySQL Backup via kubectl exec

File: wild-central-api/internal/backup/backup.go

Implementation:

func (m *Manager) backupMySQL(ctx context.Context, instanceName, appName, backupDir string) (string, error) {
    dbName := appName

    // Find mysql pod
    podName, err := m.findPodInNamespace(ctx, "mysql", "app=mysql")
    if err != nil {
        return "", fmt.Errorf("mysql pod not found: %w", err)
    }

    // Get MySQL root password from secret
    password, err := m.getMySQLPassword(ctx)
    if err != nil {
        return "", fmt.Errorf("failed to get mysql password: %w", err)
    }

    // Execute mysqldump
    dumpFile := filepath.Join(backupDir, "mysql.sql")
    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "mysql",
        podName, "--", "mysqldump",
        "-uroot",
        fmt.Sprintf("-p%s", password),
        "--single-transaction",
        "--routines",
        "--triggers",
        dbName)

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("mysqldump failed: %w", err)
    }

    if err := os.WriteFile(dumpFile, output, 0600); err != nil {
        return "", fmt.Errorf("failed to write dump: %w", err)
    }

    return dumpFile, nil
}

func (m *Manager) getMySQLPassword(ctx context.Context) (string, error) {
    cmd := exec.CommandContext(ctx, "kubectl", "get", "secret",
        "-n", "mysql",
        "mysql-root-password",
        "-o", "jsonpath={.data.password}")

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("failed to get secret: %w", err)
    }

    // Decode base64
    decoded, err := base64.StdEncoding.DecodeString(string(output))
    if err != nil {
        return "", fmt.Errorf("failed to decode password: %w", err)
    }

    return string(decoded), nil
}
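
Passing the root password as a -p argument makes it visible in the pod's process list and in the kubectl arguments on Wild Central. A hedged alternative (sketch; assumes a POSIX shell in the MySQL pod and relies on the standard MYSQL_PWD environment variable) feeds the password over stdin and streams the dump to disk:

// Sketch: avoid putting the password in argv by sending it over stdin and
// exporting it as MYSQL_PWD inside the pod.
func (m *Manager) backupMySQLViaStdin(ctx context.Context, appName, backupDir string) (string, error) {
    podName, err := m.findPodInNamespace(ctx, "mysql", "app=mysql")
    if err != nil {
        return "", fmt.Errorf("mysql pod not found: %w", err)
    }

    password, err := m.getMySQLPassword(ctx)
    if err != nil {
        return "", fmt.Errorf("failed to get mysql password: %w", err)
    }

    dumpFile := filepath.Join(backupDir, "mysql.sql")
    f, err := os.OpenFile(dumpFile, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
    if err != nil {
        return "", fmt.Errorf("failed to create dump file: %w", err)
    }
    defer f.Close()

    // Read the password from stdin inside the pod, then run mysqldump
    script := fmt.Sprintf(
        "IFS= read -r MYSQL_PWD; export MYSQL_PWD; exec mysqldump -uroot --single-transaction --routines --triggers %s",
        appName)

    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-i", "-n", "mysql",
        podName, "--", "sh", "-c", script)
    cmd.Stdin = strings.NewReader(password + "\n")
    cmd.Stdout = f // stream the dump directly to disk

    if err := cmd.Run(); err != nil {
        return "", fmt.Errorf("mysqldump failed: %w", err)
    }

    return dumpFile, nil
}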

Acceptance Criteria:

  • Finds mysql pod correctly
  • Retrieves password from secret
  • Executes mysqldump successfully
  • Creates .sql file with actual data
  • Handles errors gracefully

Estimated Effort: 3 hours


Task 1.4: Implement PVC Discovery and Backup

File: wild-central-api/internal/backup/backup.go

Implementation:

func (m *Manager) findAppPVCs(ctx context.Context, appName string) ([]string, error) {
    // Get namespace for app (convention: app name)
    namespace := appName

    cmd := exec.CommandContext(ctx, "kubectl", "get", "pvc",
        "-n", namespace,
        "-o", "jsonpath={.items[*].metadata.name}")

    output, err := cmd.Output()
    if err != nil {
        return nil, fmt.Errorf("kubectl get pvc failed: %w", err)
    }

    pvcNames := strings.Fields(string(output))
    return pvcNames, nil
}

func (m *Manager) backupPVC(ctx context.Context, namespace, pvcName, backupDir string) (string, error) {
    // Find pod using this PVC
    podName, err := m.findPodUsingPVC(ctx, namespace, pvcName)
    if err != nil {
        return "", fmt.Errorf("no pod found using PVC %s: %w", pvcName, err)
    }

    // Get mount path for PVC
    mountPath, err := m.getPVCMountPath(ctx, namespace, podName, pvcName)
    if err != nil {
        return "", fmt.Errorf("failed to get mount path: %w", err)
    }

    // Create tar archive of PVC data
    tarFile := filepath.Join(backupDir, fmt.Sprintf("%s.tar.gz", pvcName))
    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace,
        podName, "--", "tar", "czf", "-", "-C", mountPath, ".")

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("tar command failed: %w", err)
    }

    if err := os.WriteFile(tarFile, output, 0600); err != nil {
        return "", fmt.Errorf("failed to write tar file: %w", err)
    }

    return tarFile, nil
}

func (m *Manager) findPodUsingPVC(ctx context.Context, namespace, pvcName string) (string, error) {
    cmd := exec.CommandContext(ctx, "kubectl", "get", "pods",
        "-n", namespace,
        "-o", "json")

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("kubectl get pods failed: %w", err)
    }

    // Parse JSON to find pod using this PVC
    var podList struct {
        Items []struct {
            Metadata struct {
                Name string `json:"name"`
            } `json:"metadata"`
            Spec struct {
                Volumes []struct {
                    PersistentVolumeClaim *struct {
                        ClaimName string `json:"claimName"`
                    } `json:"persistentVolumeClaim"`
                } `json:"volumes"`
            } `json:"spec"`
        } `json:"items"`
    }

    if err := json.Unmarshal(output, &podList); err != nil {
        return "", fmt.Errorf("failed to parse pod list: %w", err)
    }

    for _, pod := range podList.Items {
        for _, volume := range pod.Spec.Volumes {
            if volume.PersistentVolumeClaim != nil &&
               volume.PersistentVolumeClaim.ClaimName == pvcName {
                return pod.Metadata.Name, nil
            }
        }
    }

    return "", fmt.Errorf("no pod found using PVC %s", pvcName)
}

func (m *Manager) getPVCMountPath(ctx context.Context, namespace, podName, pvcName string) (string, error) {
    cmd := exec.CommandContext(ctx, "kubectl", "get", "pod",
        "-n", namespace,
        podName,
        "-o", "json")

    output, err := cmd.Output()
    if err != nil {
        return "", fmt.Errorf("kubectl get pod failed: %w", err)
    }

    var pod struct {
        Spec struct {
            Volumes []struct {
                Name                  string `json:"name"`
                PersistentVolumeClaim *struct {
                    ClaimName string `json:"claimName"`
                } `json:"persistentVolumeClaim"`
            } `json:"volumes"`
            Containers []struct {
                VolumeMounts []struct {
                    Name      string `json:"name"`
                    MountPath string `json:"mountPath"`
                } `json:"volumeMounts"`
            } `json:"containers"`
        } `json:"spec"`
    }

    if err := json.Unmarshal(output, &pod); err != nil {
        return "", fmt.Errorf("failed to parse pod: %w", err)
    }

    // Find volume name for PVC
    var volumeName string
    for _, volume := range pod.Spec.Volumes {
        if volume.PersistentVolumeClaim != nil &&
           volume.PersistentVolumeClaim.ClaimName == pvcName {
            volumeName = volume.Name
            break
        }
    }

    if volumeName == "" {
        return "", fmt.Errorf("PVC %s not found in pod volumes", pvcName)
    }

    // Find mount path for volume
    for _, container := range pod.Spec.Containers {
        for _, mount := range container.VolumeMounts {
            if mount.Name == volumeName {
                return mount.MountPath, nil
            }
        }
    }

    return "", fmt.Errorf("mount path not found for volume %s", volumeName)
}
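
As with the database dumps, cmd.Output() holds the entire archive in memory, which will not work for terabyte-scale PVCs. A streaming variant (sketch; requires "errors" and "log" in the import list) writes the tar stream straight to disk and, assuming GNU tar inside the container, treats exit status 1 ("some files differ", e.g. files changed while being read) as a warning rather than a failure:

// Sketch: memory-friendly variant of backupPVC.
func (m *Manager) backupPVCStreaming(ctx context.Context, namespace, pvcName, backupDir string) (string, error) {
    podName, err := m.findPodUsingPVC(ctx, namespace, pvcName)
    if err != nil {
        return "", fmt.Errorf("no pod found using PVC %s: %w", pvcName, err)
    }

    mountPath, err := m.getPVCMountPath(ctx, namespace, podName, pvcName)
    if err != nil {
        return "", fmt.Errorf("failed to get mount path: %w", err)
    }

    tarFile := filepath.Join(backupDir, fmt.Sprintf("%s.tar.gz", pvcName))
    f, err := os.OpenFile(tarFile, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
    if err != nil {
        return "", fmt.Errorf("failed to create archive file: %w", err)
    }
    defer f.Close()

    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace,
        podName, "--", "tar", "czf", "-", "-C", mountPath, ".")
    cmd.Stdout = f // stream the archive directly to disk

    if err := cmd.Run(); err != nil {
        var exitErr *exec.ExitError
        if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
            // GNU tar uses exit status 1 for non-fatal "file changed" warnings
            log.Printf("Warning: tar reported changed files while archiving %s/%s", namespace, pvcName)
        } else {
            return "", fmt.Errorf("tar command failed: %w", err)
        }
    }

    return tarFile, nil
}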

Acceptance Criteria:

  • Discovers PVCs in app namespace
  • Finds pod using PVC
  • Gets correct mount path
  • Creates tar.gz with actual data
  • Handles multiple PVCs
  • Integration test: backup Immich PVCs

Estimated Effort: 4 hours


Task 1.5: Update BackupApp Flow

File: wild-central-api/internal/backup/backup.go

Replace BackupApp function (complete rewrite):

func (m *Manager) BackupApp(instanceName, appName string) (*BackupInfo, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
    defer cancel()

    // Create timestamped backup directory
    timestamp := time.Now().UTC().Format("20060102T150405Z")
    stagingDir := filepath.Join(m.dataDir, "instances", instanceName, "backups", "staging")
    backupDir := filepath.Join(stagingDir, "apps", appName, timestamp)

    if err := os.MkdirAll(backupDir, 0755); err != nil {
        return nil, fmt.Errorf("failed to create backup directory: %w", err)
    }

    // Initialize backup info with in_progress status
    info := &BackupInfo{
        Type:      "app",
        AppName:   appName,
        Status:    "in_progress",
        CreatedAt: time.Now().UTC().Format(time.RFC3339),
        Files:     []string{},
    }

    // Save initial metadata
    if err := m.saveBackupMetadata(backupDir, info); err != nil {
        return nil, fmt.Errorf("failed to save initial metadata: %w", err)
    }

    // Read app dependencies from manifest
    deps, err := m.getAppDependencies(appName)
    if err != nil {
        info.Status = "failed"
        info.Error = fmt.Sprintf("Failed to read manifest: %v", err)
        m.saveBackupMetadata(backupDir, info)
        return info, err
    }

    var backupFiles []string

    // Backup PostgreSQL if required
    if deps.HasPostgres {
        file, err := m.backupPostgres(ctx, instanceName, appName, backupDir)
        if err != nil {
            info.Status = "failed"
            info.Error = fmt.Sprintf("PostgreSQL backup failed: %v", err)
            m.saveBackupMetadata(backupDir, info)
            return info, err
        }
        backupFiles = append(backupFiles, file)
    }

    // Backup MySQL if required
    if deps.HasMySQL {
        file, err := m.backupMySQL(ctx, instanceName, appName, backupDir)
        if err != nil {
            info.Status = "failed"
            info.Error = fmt.Sprintf("MySQL backup failed: %v", err)
            m.saveBackupMetadata(backupDir, info)
            return info, err
        }
        backupFiles = append(backupFiles, file)
    }

    // Discover and backup PVCs
    pvcNames, err := m.findAppPVCs(ctx, appName)
    if err != nil {
        // Log warning but don't fail if no PVCs found
        log.Printf("Warning: failed to find PVCs for %s: %v", appName, err)
    } else {
        for _, pvcName := range pvcNames {
            file, err := m.backupPVC(ctx, appName, pvcName, backupDir)
            if err != nil {
                log.Printf("Warning: failed to backup PVC %s: %v", pvcName, err)
                continue
            }
            backupFiles = append(backupFiles, file)
        }
    }

    // Calculate total backup size
    var totalSize int64
    for _, file := range backupFiles {
        stat, err := os.Stat(file)
        if err == nil {
            totalSize += stat.Size()
        }
    }

    // Update final metadata
    info.Status = "completed"
    info.Files = backupFiles
    info.Size = totalSize
    info.Error = ""

    if err := m.saveBackupMetadata(backupDir, info); err != nil {
        return info, fmt.Errorf("failed to save final metadata: %w", err)
    }

    return info, nil
}

func (m *Manager) saveBackupMetadata(backupDir string, info *BackupInfo) error {
    metadataFile := filepath.Join(backupDir, "backup.json")
    data, err := json.MarshalIndent(info, "", "  ")
    if err != nil {
        return fmt.Errorf("failed to marshal metadata: %w", err)
    }
    return os.WriteFile(metadataFile, data, 0644)
}

Acceptance Criteria:

  • Creates timestamped backup directories
  • Reads manifest to detect dependencies
  • Backs up databases if present
  • Backs up PVCs if present
  • Calculates accurate backup size
  • Saves complete metadata
  • Handles errors gracefully
  • Integration test: Full Gitea backup

Estimated Effort: 4 hours


Task 1.6: Build and Test

Steps:

  1. Build wild-central-api
  2. Deploy to test environment
  3. Test Gitea backup (PostgreSQL + PVC)
  4. Test Immich backup (PostgreSQL + multiple PVCs)
  5. Verify backup files exist and have data
  6. Verify metadata accuracy
  7. Test manual restore

Acceptance Criteria:

  • All builds succeed
  • App backups create actual files
  • Metadata is accurate
  • Manual restore works

Estimated Effort: 4 hours


Phase 2: Restic Integration

Goal

Upload staged backups to restic repository with flexible backends.

Priority

🟡 HIGH PRIORITY (after Phase 1 complete)

Timeline

5-7 days

Prerequisites

  • Phase 1 completed and tested
  • Restic installed on Wild Central device
  • Backup destination configured (S3, B2, local, etc.)

Task 2.1: Configuration Management

File: wild-central-api/internal/backup/config.go (new file)

Implementation:

package backup

import (
    "fmt"
    "strings"

    "github.com/wild-cloud/wild-central/daemon/internal/config"
)

type BackupConfig struct {
    Repository string
    Staging    string
    Retention  RetentionPolicy
    Backend    BackendConfig
}

type RetentionPolicy struct {
    KeepDaily   int
    KeepWeekly  int
    KeepMonthly int
    KeepYearly  int
}

type BackendConfig struct {
    Type     string
    Endpoint string
    Region   string
    Port     int
}

type BackupSecrets struct {
    Password    string
    Credentials BackendCredentials
}

type BackendCredentials struct {
    S3    *S3Credentials
    SFTP  *SFTPCredentials
    Azure *AzureCredentials
    GCS   *GCSCredentials
}

type S3Credentials struct {
    AccessKeyID     string
    SecretAccessKey string
}

type SFTPCredentials struct {
    Password   string
    PrivateKey string
}

type AzureCredentials struct {
    AccountName string
    AccountKey  string
}

type GCSCredentials struct {
    ProjectID         string
    ServiceAccountKey string
}

func LoadBackupConfig(instanceName string) (*BackupConfig, *BackupSecrets, error) {
    cfg, err := config.LoadInstanceConfig(instanceName)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to load config: %w", err)
    }

    secrets, err := config.LoadInstanceSecrets(instanceName)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to load secrets: %w", err)
    }

    backupCfg := &BackupConfig{
        Repository: cfg.Cloud.Backup.Repository,
        Staging:    cfg.Cloud.Backup.Staging,
        Retention: RetentionPolicy{
            KeepDaily:   cfg.Cloud.Backup.Retention.KeepDaily,
            KeepWeekly:  cfg.Cloud.Backup.Retention.KeepWeekly,
            KeepMonthly: cfg.Cloud.Backup.Retention.KeepMonthly,
            KeepYearly:  cfg.Cloud.Backup.Retention.KeepYearly,
        },
        Backend: BackendConfig{
            Type:     DetectBackendType(cfg.Cloud.Backup.Repository),
            Endpoint: cfg.Cloud.Backup.Backend.Endpoint,
            Region:   cfg.Cloud.Backup.Backend.Region,
            Port:     cfg.Cloud.Backup.Backend.Port,
        },
    }

    backupSecrets := &BackupSecrets{
        Password: secrets.Cloud.Backup.Password,
        Credentials: BackendCredentials{
            S3:    secrets.Cloud.Backup.Credentials.S3,
            SFTP:  secrets.Cloud.Backup.Credentials.SFTP,
            Azure: secrets.Cloud.Backup.Credentials.Azure,
            GCS:   secrets.Cloud.Backup.Credentials.GCS,
        },
    }

    return backupCfg, backupSecrets, nil
}

func DetectBackendType(repository string) string {
    if strings.HasPrefix(repository, "/") {
        return "local"
    } else if strings.HasPrefix(repository, "sftp:") {
        return "sftp"
    } else if strings.HasPrefix(repository, "s3:") || strings.HasPrefix(repository, "b2:") {
        return "s3"
    } else if strings.HasPrefix(repository, "azure:") {
        return "azure"
    } else if strings.HasPrefix(repository, "gs:") {
        return "gcs"
    } else if strings.HasPrefix(repository, "rclone:") {
        return "rclone"
    }
    return "unknown"
}

func ValidateBackupConfig(cfg *BackupConfig, secrets *BackupSecrets) error {
    if cfg.Repository == "" {
        return fmt.Errorf("repository is required")
    }

    if secrets.Password == "" {
        return fmt.Errorf("repository password is required")
    }

    // Validate backend-specific credentials
    switch cfg.Backend.Type {
    case "s3":
        if secrets.Credentials.S3 == nil {
            return fmt.Errorf("S3 credentials required for S3 backend")
        }
        if secrets.Credentials.S3.AccessKeyID == "" || secrets.Credentials.S3.SecretAccessKey == "" {
            return fmt.Errorf("S3 access key and secret key required")
        }
    case "sftp":
        if secrets.Credentials.SFTP == nil {
            return fmt.Errorf("SFTP credentials required for SFTP backend")
        }
        if secrets.Credentials.SFTP.Password == "" && secrets.Credentials.SFTP.PrivateKey == "" {
            return fmt.Errorf("SFTP password or private key required")
        }
    case "azure":
        if secrets.Credentials.Azure == nil {
            return fmt.Errorf("Azure credentials required for Azure backend")
        }
        if secrets.Credentials.Azure.AccountName == "" || secrets.Credentials.Azure.AccountKey == "" {
            return fmt.Errorf("Azure account name and key required")
        }
    case "gcs":
        if secrets.Credentials.GCS == nil {
            return fmt.Errorf("GCS credentials required for GCS backend")
        }
        if secrets.Credentials.GCS.ServiceAccountKey == "" {
            return fmt.Errorf("GCS service account key required")
        }
    }

    return nil
}

Estimated Effort: 3 hours


Task 2.2: Restic Operations Module

File: wild-central-api/internal/backup/restic.go (new file)

Implementation:

package backup

import (
    "context"
    "encoding/json"
    "fmt"
    "os"
    "os/exec"
    "strings"
)

type ResticClient struct {
    config  *BackupConfig
    secrets *BackupSecrets
}

func NewResticClient(config *BackupConfig, secrets *BackupSecrets) *ResticClient {
    return &ResticClient{
        config:  config,
        secrets: secrets,
    }
}

func (r *ResticClient) buildEnv() map[string]string {
    env := map[string]string{
        "RESTIC_REPOSITORY": r.config.Repository,
        "RESTIC_PASSWORD":   r.secrets.Password,
    }

    switch r.config.Backend.Type {
    case "s3":
        if r.secrets.Credentials.S3 != nil {
            env["AWS_ACCESS_KEY_ID"] = r.secrets.Credentials.S3.AccessKeyID
            env["AWS_SECRET_ACCESS_KEY"] = r.secrets.Credentials.S3.SecretAccessKey
        }
        if r.config.Backend.Endpoint != "" {
            env["AWS_S3_ENDPOINT"] = r.config.Backend.Endpoint
        }
        if r.config.Backend.Region != "" {
            env["AWS_DEFAULT_REGION"] = r.config.Backend.Region
        }

    case "sftp":
        if r.secrets.Credentials.SFTP != nil && r.secrets.Credentials.SFTP.Password != "" {
            env["RESTIC_SFTP_PASSWORD"] = r.secrets.Credentials.SFTP.Password
        }

    case "azure":
        if r.secrets.Credentials.Azure != nil {
            env["AZURE_ACCOUNT_NAME"] = r.secrets.Credentials.Azure.AccountName
            env["AZURE_ACCOUNT_KEY"] = r.secrets.Credentials.Azure.AccountKey
        }
    }

    return env
}

func (r *ResticClient) Init(ctx context.Context) error {
    cmd := exec.CommandContext(ctx, "restic", "init")

    // Set environment variables
    cmd.Env = os.Environ()
    for k, v := range r.buildEnv() {
        cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v))
    }

    output, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("restic init failed: %w: %s", err, string(output))
    }

    return nil
}

func (r *ResticClient) Backup(ctx context.Context, path string, tags []string) (string, error) {
    args := []string{"backup", path}
    for _, tag := range tags {
        args = append(args, "--tag", tag)
    }

    cmd := exec.CommandContext(ctx, "restic", args...)

    cmd.Env = os.Environ()
    for k, v := range r.buildEnv() {
        cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v))
    }

    output, err := cmd.CombinedOutput()
    if err != nil {
        return "", fmt.Errorf("restic backup failed: %w: %s", err, string(output))
    }

    // Parse snapshot ID from output
    snapshotID := r.parseSnapshotID(string(output))

    return snapshotID, nil
}

func (r *ResticClient) ListSnapshots(ctx context.Context, tags []string) ([]Snapshot, error) {
    args := []string{"snapshots", "--json"}
    for _, tag := range tags {
        args = append(args, "--tag", tag)
    }

    cmd := exec.CommandContext(ctx, "restic", args...)

    cmd.Env = os.Environ()
    for k, v := range r.buildEnv() {
        cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v))
    }

    output, err := cmd.Output()
    if err != nil {
        return nil, fmt.Errorf("restic snapshots failed: %w", err)
    }

    var snapshots []Snapshot
    if err := json.Unmarshal(output, &snapshots); err != nil {
        return nil, fmt.Errorf("failed to parse snapshots: %w", err)
    }

    return snapshots, nil
}

func (r *ResticClient) Restore(ctx context.Context, snapshotID, targetPath string) error {
    cmd := exec.CommandContext(ctx, "restic", "restore", snapshotID, "--target", targetPath)

    cmd.Env = os.Environ()
    for k, v := range r.buildEnv() {
        cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v))
    }

    output, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("restic restore failed: %w: %s", err, string(output))
    }

    return nil
}

func (r *ResticClient) Stats(ctx context.Context) (*RepositoryStats, error) {
    cmd := exec.CommandContext(ctx, "restic", "stats", "--json")

    cmd.Env = os.Environ()
    for k, v := range r.buildEnv() {
        cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v))
    }

    output, err := cmd.Output()
    if err != nil {
        return nil, fmt.Errorf("restic stats failed: %w", err)
    }

    var stats RepositoryStats
    if err := json.Unmarshal(output, &stats); err != nil {
        return nil, fmt.Errorf("failed to parse stats: %w", err)
    }

    return &stats, nil
}

func (r *ResticClient) TestConnection(ctx context.Context) error {
    cmd := exec.CommandContext(ctx, "restic", "cat", "config")

    cmd.Env = os.Environ()
    for k, v := range r.buildEnv() {
        cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%s", k, v))
    }

    _, err := cmd.Output()
    if err != nil {
        return fmt.Errorf("connection test failed: %w", err)
    }

    return nil
}

func (r *ResticClient) parseSnapshotID(output string) string {
    lines := strings.Split(output, "\n")
    for _, line := range lines {
        if strings.Contains(line, "snapshot") && strings.Contains(line, "saved") {
            parts := strings.Fields(line)
            for i, part := range parts {
                if part == "snapshot" && i+1 < len(parts) {
                    return parts[i+1]
                }
            }
        }
    }
    return ""
}

type Snapshot struct {
    ID       string   `json:"id"`
    Time     string   `json:"time"`
    Hostname string   `json:"hostname"`
    Tags     []string `json:"tags"`
    Paths    []string `json:"paths"`
}

type RepositoryStats struct {
    TotalSize      int64  `json:"total_size"`
    TotalFileCount int64  `json:"total_file_count"`
    SnapshotCount  int    `json:"snapshot_count"`
}
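
parseSnapshotID scrapes restic's human-readable output, which may change between versions. If the installed restic supports --json for the backup command (recent releases do), parsing the summary message is more robust. A hedged sketch, which assumes Backup() appends --json to its arguments:

// Sketch: read the snapshot ID from the summary line of `restic backup --json`
// (the command emits one JSON object per line).
func (r *ResticClient) parseSnapshotIDJSON(output string) string {
    for _, line := range strings.Split(output, "\n") {
        var msg struct {
            MessageType string `json:"message_type"`
            SnapshotID  string `json:"snapshot_id"`
        }
        if err := json.Unmarshal([]byte(line), &msg); err != nil {
            continue // skip progress lines and anything that is not JSON
        }
        if msg.MessageType == "summary" && msg.SnapshotID != "" {
            return msg.SnapshotID
        }
    }
    return ""
}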

Estimated Effort: 4 hours


Task 2.3: Update Backup Flow to Upload to Restic

File: wild-central-api/internal/backup/backup.go

Modify BackupApp function to add restic upload after staging:

func (m *Manager) BackupApp(instanceName, appName string) (*BackupInfo, error) {
    // ... existing Phase 1 code to create local backup ...

    // After local backup succeeds, upload to restic if configured
    cfg, secrets, err := LoadBackupConfig(instanceName)
    if err == nil && cfg.Repository != "" {
        // Restic is configured, upload backup
        client := NewResticClient(cfg, secrets)

        tags := []string{
            "type:app",
            fmt.Sprintf("app:%s", appName),
            fmt.Sprintf("instance:%s", instanceName),
        }

        snapshotID, err := client.Backup(ctx, backupDir, tags)
        if err != nil {
            log.Printf("Warning: restic upload failed: %v", err)
            // Don't fail the backup, local files still exist
        } else {
            info.SnapshotID = snapshotID

            // Clean up staged data files after a successful upload, but keep
            // backup.json so the snapshot ID stays recorded locally
            for _, file := range info.Files {
                if err := os.Remove(file); err != nil {
                    log.Printf("Warning: failed to clean staged file %s: %v", file, err)
                }
            }
        }
    }

    // Save final metadata (now including the snapshot ID, if any)
    if err := m.saveBackupMetadata(backupDir, info); err != nil {
        return info, fmt.Errorf("failed to save final metadata: %w", err)
    }

    return info, nil
}

Estimated Effort: 2 hours


Task 2.4: API Client Updates

File: wild-web-app/src/services/api/backups.ts

Add configuration endpoints:

export interface BackupConfiguration {
  repository: string;
  staging: string;
  retention: {
    keepDaily: number;
    keepWeekly: number;
    keepMonthly: number;
    keepYearly: number;
  };
  backend: {
    type: string;
    endpoint?: string;
    region?: string;
    port?: number;
  };
}

export interface BackupConfigurationWithCredentials extends BackupConfiguration {
  password: string;
  credentials?: {
    s3?: {
      accessKeyId: string;
      secretAccessKey: string;
    };
    sftp?: {
      password?: string;
      privateKey?: string;
    };
    azure?: {
      accountName: string;
      accountKey: string;
    };
    gcs?: {
      projectId: string;
      serviceAccountKey: string;
    };
  };
}

export interface RepositoryStatus {
  initialized: boolean;
  reachable: boolean;
  lastBackup?: string;
  snapshotCount: number;
}

export interface RepositoryStats {
  repositorySize: number;
  repositorySizeHuman: string;
  snapshotCount: number;
  fileCount: number;
  uniqueChunks: number;
  compressionRatio: number;
  oldestSnapshot?: string;
  latestSnapshot?: string;
}

export async function getBackupConfiguration(
  instanceId: string
): Promise<{ config: BackupConfiguration; status: RepositoryStatus }> {
  const response = await api.get(`/instances/${instanceId}/backup/config`);
  return response.data;
}

export async function updateBackupConfiguration(
  instanceId: string,
  config: BackupConfigurationWithCredentials
): Promise<void> {
  await api.put(`/instances/${instanceId}/backup/config`, config);
}

export async function testBackupConnection(
  instanceId: string,
  config: BackupConfigurationWithCredentials
): Promise<RepositoryStatus> {
  const response = await api.post(`/instances/${instanceId}/backup/test`, config);
  return response.data;
}

export async function initializeBackupRepository(
  instanceId: string,
  config: BackupConfigurationWithCredentials
): Promise<{ repositoryId: string }> {
  const response = await api.post(`/instances/${instanceId}/backup/init`, config);
  return response.data;
}

export async function getRepositoryStats(
  instanceId: string
): Promise<RepositoryStats> {
  const response = await api.get(`/instances/${instanceId}/backup/stats`);
  return response.data;
}

Estimated Effort: 2 hours


Task 2.5: Configuration UI Components

Create the following components in wild-web-app/src/components/backup/:

BackupConfigurationCard.tsx:

  • Main configuration form
  • Backend type selector
  • Conditional credential inputs
  • Retention policy inputs
  • Test/Save/Cancel buttons

BackendSelector.tsx:

  • Dropdown for backend types
  • Shows available backends with icons

CredentialsForm.tsx:

  • Dynamic form based on selected backend
  • Password/key inputs with visibility toggle
  • Validation

RepositoryStatus.tsx:

  • Display repository health
  • Show stats (size, snapshots, last backup)
  • Visual indicators

RetentionPolicyInputs.tsx:

  • Number inputs for retention periods
  • Tooltips explaining each period

Estimated Effort: 8 hours


Task 2.6: Integrate with BackupsPage

File: wild-web-app/src/router/pages/BackupsPage.tsx

Add configuration section above backup list:

function BackupsPage() {
  const { instanceId } = useParams();
  const [showConfig, setShowConfig] = useState(false);

  const { data: backupConfig } = useQuery({
    queryKey: ['backup-config', instanceId],
    queryFn: () => getBackupConfiguration(instanceId),
  });

  return (
    <div className="space-y-6">
      {/* Repository Status Card */}
      {backupConfig && (
        <RepositoryStatus
          status={backupConfig.status}
          onEditClick={() => setShowConfig(true)}
        />
      )}

      {/* Configuration Card (conditional) */}
      {showConfig && (
        <BackupConfigurationCard
          instanceId={instanceId}
          currentConfig={backupConfig?.config}
          onSave={() => setShowConfig(false)}
          onCancel={() => setShowConfig(false)}
        />
      )}

      {/* Existing backup list */}
      <BackupList instanceId={instanceId} />
    </div>
  );
}

Estimated Effort: 3 hours


Task 2.7: Backup Configuration API Handlers

File: wild-central-api/internal/api/v1/handlers_backup.go

Add new handlers:

func (h *Handler) BackupConfigGet(c *gin.Context) {
    instanceName := c.Param("name")

    cfg, secrets, err := backup.LoadBackupConfig(instanceName)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    // Test repository status
    var status backup.RepositoryStatus
    if cfg.Repository != "" {
        client := backup.NewResticClient(cfg, secrets)
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        status.Initialized = true
        status.Reachable = client.TestConnection(ctx) == nil

        if stats, err := client.Stats(ctx); err == nil {
            status.SnapshotCount = stats.SnapshotCount
        }
    }

    c.JSON(http.StatusOK, gin.H{
        "success": true,
        "data": gin.H{
            "config": cfg,
            "status": status,
        },
    })
}

func (h *Handler) BackupConfigUpdate(c *gin.Context) {
    instanceName := c.Param("name")

    var req backup.BackupConfigurationWithCredentials
    if err := c.BindJSON(&req); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }

    // Validate configuration
    if err := backup.ValidateBackupConfig(&req.Config, &req.Secrets); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }

    // Save to config.yaml and secrets.yaml
    if err := config.SaveBackupConfig(instanceName, &req); err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    c.JSON(http.StatusOK, gin.H{
        "success": true,
        "message": "Backup configuration updated successfully",
    })
}

func (h *Handler) BackupConnectionTest(c *gin.Context) {
    var req backup.BackupConfigurationWithCredentials
    if err := c.BindJSON(&req); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }

    client := backup.NewResticClient(&req.Config, &req.Secrets)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    status := backup.RepositoryStatus{
        Reachable: client.TestConnection(ctx) == nil,
    }

    if status.Reachable {
        if stats, err := client.Stats(ctx); err == nil {
            status.Initialized = true
            status.SnapshotCount = stats.SnapshotCount
        }
    }

    c.JSON(http.StatusOK, gin.H{
        "success": true,
        "data":    status,
    })
}

func (h *Handler) BackupRepositoryInit(c *gin.Context) {
    var req backup.BackupConfigurationWithCredentials
    if err := c.BindJSON(&req); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }

    client := backup.NewResticClient(&req.Config, &req.Secrets)

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := client.Init(ctx); err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    c.JSON(http.StatusOK, gin.H{
        "success": true,
        "message": "Repository initialized successfully",
    })
}

func (h *Handler) BackupStatsGet(c *gin.Context) {
    instanceName := c.Param("name")

    cfg, secrets, err := backup.LoadBackupConfig(instanceName)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    client := backup.NewResticClient(cfg, secrets)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    stats, err := client.Stats(ctx)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    c.JSON(http.StatusOK, gin.H{
        "success": true,
        "data":    stats,
    })
}

Register routes:

backupGroup := v1.Group("/instances/:name/backup")
{
    backupGroup.GET("/config", h.BackupConfigGet)
    backupGroup.PUT("/config", h.BackupConfigUpdate)
    backupGroup.POST("/test", h.BackupConnectionTest)
    backupGroup.POST("/init", h.BackupRepositoryInit)
    backupGroup.GET("/stats", h.BackupStatsGet)
}

Estimated Effort: 4 hours


Task 2.8: End-to-End Testing

Test scenarios:

  1. Configure local repository via UI
  2. Configure S3 repository via UI
  3. Test connection validation
  4. Create backup and verify upload
  5. Check repository stats
  6. Test error handling

Estimated Effort: 4 hours


Phase 3: Restore from Restic

Goal

Enable users to restore backups from restic snapshots.

Priority

🟢 MEDIUM PRIORITY (after Phase 2 complete)

Timeline

3-5 days

Task 3.1: List Snapshots API

File: wild-central-api/internal/api/v1/handlers_backup.go

Implementation:

func (h *Handler) BackupSnapshotsList(c *gin.Context) {
    instanceName := c.Param("name")
    appName := c.Query("app")

    cfg, secrets, err := backup.LoadBackupConfig(instanceName)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    client := backup.NewResticClient(cfg, secrets)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    var tags []string
    if appName != "" {
        tags = append(tags, fmt.Sprintf("app:%s", appName))
    }

    snapshots, err := client.ListSnapshots(ctx, tags)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
        return
    }

    c.JSON(http.StatusOK, gin.H{
        "success": true,
        "data":    snapshots,
    })
}

Estimated Effort: 2 hours


Task 3.2: Restore Snapshot Function

File: wild-central-api/internal/backup/backup.go

Implementation:

func (m *Manager) RestoreFromSnapshot(instanceName, snapshotID string) error {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
    defer cancel()

    // Load restic config
    cfg, secrets, err := LoadBackupConfig(instanceName)
    if err != nil {
        return fmt.Errorf("failed to load config: %w", err)
    }

    client := NewResticClient(cfg, secrets)

    // Create temp directory for restore
    tempDir := filepath.Join(cfg.Staging, "restore", snapshotID)
    if err := os.MkdirAll(tempDir, 0755); err != nil {
        return fmt.Errorf("failed to create temp directory: %w", err)
    }
    defer os.RemoveAll(tempDir)

    // Restore snapshot to temp directory
    if err := client.Restore(ctx, snapshotID, tempDir); err != nil {
        return fmt.Errorf("restic restore failed: %w", err)
    }

    // Parse metadata to determine what to restore
    metadataFile := filepath.Join(tempDir, "backup.json")
    info, err := m.loadBackupMetadata(metadataFile)
    if err != nil {
        return fmt.Errorf("failed to load metadata: %w", err)
    }

    // Restore databases
    for _, file := range info.Files {
        if strings.HasSuffix(file, "postgres.sql") {
            if err := m.restorePostgres(ctx, info.AppName, filepath.Join(tempDir, "postgres.sql")); err != nil {
                return fmt.Errorf("postgres restore failed: %w", err)
            }
        } else if strings.HasSuffix(file, "mysql.sql") {
            if err := m.restoreMySQL(ctx, info.AppName, filepath.Join(tempDir, "mysql.sql")); err != nil {
                return fmt.Errorf("mysql restore failed: %w", err)
            }
        }
    }

    // Restore PVCs
    for _, file := range info.Files {
        if strings.HasSuffix(file, ".tar.gz") {
            pvcName := strings.TrimSuffix(filepath.Base(file), ".tar.gz")
            if err := m.restorePVC(ctx, info.AppName, pvcName, filepath.Join(tempDir, file)); err != nil {
                return fmt.Errorf("pvc restore failed: %w", err)
            }
        }
    }

    return nil
}

func (m *Manager) restorePostgres(ctx context.Context, appName, dumpFile string) error {
    dbName := appName

    podName, err := m.findPodInNamespace(ctx, "postgres", "app=postgres")
    if err != nil {
        return fmt.Errorf("postgres pod not found: %w", err)
    }

    // Drop and recreate database
    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "postgres",
        podName, "--", "psql", "-U", "postgres", "-c",
        fmt.Sprintf("DROP DATABASE IF EXISTS %s; CREATE DATABASE %s;", dbName, dbName))

    if err := cmd.Run(); err != nil {
        return fmt.Errorf("failed to recreate database: %w", err)
    }

    // Restore dump
    dumpData, err := os.ReadFile(dumpFile)
    if err != nil {
        return fmt.Errorf("failed to read dump: %w", err)
    }

    cmd = exec.CommandContext(ctx, "kubectl", "exec", "-i", "-n", "postgres",
        podName, "--", "psql", "-U", "postgres", dbName)
    cmd.Stdin = strings.NewReader(string(dumpData))

    if err := cmd.Run(); err != nil {
        return fmt.Errorf("psql restore failed: %w", err)
    }

    return nil
}

func (m *Manager) restoreMySQL(ctx context.Context, appName, dumpFile string) error {
    // Similar implementation to restorePostgres: recreate the database, then
    // feed the dump to the mysql client using the root password from the
    // secret (see getMySQLPassword). A sketch follows this listing.
    return nil
}

func (m *Manager) restorePVC(ctx context.Context, namespace, pvcName, tarFile string) error {
    podName, err := m.findPodUsingPVC(ctx, namespace, pvcName)
    if err != nil {
        return fmt.Errorf("no pod found using PVC: %w", err)
    }

    mountPath, err := m.getPVCMountPath(ctx, namespace, podName, pvcName)
    if err != nil {
        return fmt.Errorf("failed to get mount path: %w", err)
    }

    // Copy tar file to pod
    cmd := exec.CommandContext(ctx, "kubectl", "cp", tarFile,
        fmt.Sprintf("%s/%s:/tmp/restore.tar.gz", namespace, podName))

    if err := cmd.Run(); err != nil {
        return fmt.Errorf("kubectl cp failed: %w", err)
    }

    // Extract tar file
    cmd = exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace,
        podName, "--", "tar", "xzf", "/tmp/restore.tar.gz", "-C", mountPath)

    if err := cmd.Run(); err != nil {
        return fmt.Errorf("tar extract failed: %w", err)
    }

    // Clean up temp file
    cmd = exec.CommandContext(ctx, "kubectl", "exec", "-n", namespace,
        podName, "--", "rm", "/tmp/restore.tar.gz")
    cmd.Run() // Ignore error

    return nil
}
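
For reference, a minimal sketch of the restoreMySQL stub above, mirroring restorePostgres and reusing getMySQLPassword (assumes the mysql client is available in the pod and that the database name matches the app name; password handling via -p is kept simple here, see the note in Task 1.3):

func (m *Manager) restoreMySQL(ctx context.Context, appName, dumpFile string) error {
    dbName := appName

    podName, err := m.findPodInNamespace(ctx, "mysql", "app=mysql")
    if err != nil {
        return fmt.Errorf("mysql pod not found: %w", err)
    }

    password, err := m.getMySQLPassword(ctx)
    if err != nil {
        return fmt.Errorf("failed to get mysql password: %w", err)
    }

    dumpData, err := os.ReadFile(dumpFile)
    if err != nil {
        return fmt.Errorf("failed to read dump: %w", err)
    }

    // Drop and recreate the database
    recreate := fmt.Sprintf("DROP DATABASE IF EXISTS %s; CREATE DATABASE %s;", dbName, dbName)
    cmd := exec.CommandContext(ctx, "kubectl", "exec", "-n", "mysql",
        podName, "--", "mysql", "-uroot", fmt.Sprintf("-p%s", password), "-e", recreate)
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("failed to recreate database: %w", err)
    }

    // Feed the dump to the mysql client over stdin
    cmd = exec.CommandContext(ctx, "kubectl", "exec", "-i", "-n", "mysql",
        podName, "--", "mysql", "-uroot", fmt.Sprintf("-p%s", password), dbName)
    cmd.Stdin = strings.NewReader(string(dumpData))
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("mysql restore failed: %w", err)
    }

    return nil
}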

Estimated Effort: 5 hours


Task 3.3: Restore API Handler

File: wild-central-api/internal/api/v1/handlers_backup.go

Implementation:

func (h *Handler) BackupSnapshotRestore(c *gin.Context) {
    instanceName := c.Param("name")
    snapshotID := c.Param("snapshotId")

    // Start restore operation asynchronously
    go func() {
        if err := h.backupManager.RestoreFromSnapshot(instanceName, snapshotID); err != nil {
            log.Printf("Restore failed: %v", err)
        }
    }()

    c.JSON(http.StatusAccepted, gin.H{
        "success": true,
        "message": "Restore operation started",
    })
}

Estimated Effort: 1 hour


Task 3.4: Restore UI

File: wild-web-app/src/components/backup/RestoreDialog.tsx

Implementation: Create dialog that:

  • Lists available snapshots
  • Shows snapshot details (date, size, files)
  • Confirmation before restore
  • Progress indicator

Estimated Effort: 4 hours


Task 3.5: End-to-End Restore Testing

Test scenarios:

  1. List snapshots for app
  2. Select snapshot to restore
  3. Restore database
  4. Restore PVCs
  5. Verify application works after restore
  6. Test error handling

Estimated Effort: 3 hours


API Specifications

Complete API Reference

# Backup Operations
POST   /api/v1/instances/{name}/backups/app/{appName}        # Create app backup
POST   /api/v1/instances/{name}/backups/cluster              # Create cluster backup
GET    /api/v1/instances/{name}/backups/app                  # List app backups
GET    /api/v1/instances/{name}/backups/cluster              # List cluster backups
DELETE /api/v1/instances/{name}/backups/app/{appName}/{id}   # Delete app backup
DELETE /api/v1/instances/{name}/backups/cluster/{id}         # Delete cluster backup

# Backup Configuration (Phase 2)
GET    /api/v1/instances/{name}/backup/config                # Get backup configuration
PUT    /api/v1/instances/{name}/backup/config                # Update configuration
POST   /api/v1/instances/{name}/backup/test                  # Test connection
POST   /api/v1/instances/{name}/backup/init                  # Initialize repository
GET    /api/v1/instances/{name}/backup/stats                 # Get repository stats

# Restore Operations (Phase 3)
GET    /api/v1/instances/{name}/backup/snapshots             # List snapshots
POST   /api/v1/instances/{name}/backup/snapshots/{id}/restore # Restore snapshot
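
The guide leaves response shapes open. One option for the snapshot list endpoint is to return fields lifted from restic's own snapshot metadata (restic snapshots --json emits short_id, time, hostname, tags, and paths). A sketch, with Go field names as assumptions:

```go
package v1

import "time"

// SnapshotInfo is a sketch of one item returned by
// GET /api/v1/instances/{name}/backup/snapshots. Field names are
// assumptions; the data mirrors `restic snapshots --json`.
type SnapshotInfo struct {
    ID       string    `json:"id"`       // restic short_id
    Time     time.Time `json:"time"`     // snapshot creation time
    Hostname string    `json:"hostname"` // host that created the backup
    Tags     []string  `json:"tags"`     // e.g. the app name
    Paths    []string  `json:"paths"`    // staged paths included in the snapshot
}
```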

Web UI Design

Page Structure

BackupsPage Layout:

┌─────────────────────────────────────────────────┐
│ Backups                                         │
├─────────────────────────────────────────────────┤
│                                                 │
│ ┌─ Backup Status ─────────────────────────┐   │
│ │ Repository: Configured ✓                 │   │
│ │ Last Backup: 2 hours ago                 │   │
│ │ Total Size: 2.4 GB                       │   │
│ │ Snapshots: 24                            │   │
│ │ [Edit Configuration]                     │   │
│ └─────────────────────────────────────────┘   │
│                                                 │
│ ┌─ Recent Backups ────────────────────────┐   │
│ │ [Backup cards with restore/delete]       │   │
│ │ ...                                      │   │
│ └─────────────────────────────────────────┘   │
│                                                 │
│ ┌─ Configuration (when editing) ──────────┐   │
│ │ Backend Type: [S3 ▼]                     │   │
│ │ Repository URI: [s3:bucket/path      ]   │   │
│ │ Credentials:                             │   │
│ │   Access Key ID: [•••••••••••       ]   │   │
│ │   Secret Key: [••••••••••••••••     ]   │   │
│ │ Retention Policy:                        │   │
│ │   Daily: [7] Weekly: [4] Monthly: [6]   │   │
│ │ [Test Connection] [Save] [Cancel]       │   │
│ └─────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘

Component Hierarchy

BackupsPage
├── BackupStatusCard (read-only)
│   ├── RepositoryStatus
│   ├── Stats (size, snapshots, last backup)
│   └── EditButton
│
├── BackupListSection
│   └── BackupCard[] (existing)
│
└── BackupConfigurationCard (conditional)
    ├── BackendTypeSelect
    ├── RepositoryUriInput
    ├── CredentialsSection
    │   ├── S3CredentialsForm (conditional)
    │   ├── SFTPCredentialsForm (conditional)
    │   └── ...
    ├── RetentionPolicyInputs
    └── ActionButtons
        ├── TestConnectionButton
        ├── SaveButton
        └── CancelButton

Testing Strategy

Phase 1 Testing

Unit Tests:

  • Manifest parsing
  • Helper functions (contains, findPodInNamespace)
  • Backup file creation

Integration Tests:

  • End-to-end Gitea backup (PostgreSQL + PVC)
  • End-to-end Immich backup (PostgreSQL + multiple PVCs)
  • Backup with no database
  • Backup with no PVCs

Manual Tests:

  1. Create backup via web UI
  2. Verify .sql file exists with actual data
  3. Verify .tar.gz files exist with actual data
  4. Check metadata accuracy
  5. Test delete functionality

Phase 2 Testing

Unit Tests:

  • Backend type detection (see the sketch after this list)
  • Environment variable mapping
  • Configuration validation
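
To make the first bullet concrete, a sketch of a table-driven test for URI-prefix detection. The detectBackendType helper (name and return values) is an assumption; the URI prefixes themselves are restic's native repository formats:

```go
package backup

import (
    "strings"
    "testing"
)

// detectBackendType is a sketch of the "backend type auto-detected from
// URI prefix" behavior described in this guide; the name and return
// values are assumptions.
func detectBackendType(uri string) string {
    switch {
    case strings.HasPrefix(uri, "s3:"):
        return "s3"
    case strings.HasPrefix(uri, "b2:"):
        return "b2"
    case strings.HasPrefix(uri, "sftp:"):
        return "sftp"
    default:
        return "local"
    }
}

func TestDetectBackendType(t *testing.T) {
    cases := []struct{ uri, want string }{
        {"/mnt/backups/wild-cloud", "local"},
        {"s3:s3.amazonaws.com/wild-backups/home", "s3"},
        {"b2:wild-backups:home", "b2"},
        {"sftp:backup@nas.local:/srv/restic", "sftp"},
    }
    for _, c := range cases {
        if got := detectBackendType(c.uri); got != c.want {
            t.Errorf("detectBackendType(%q) = %q, want %q", c.uri, got, c.want)
        }
    }
}
```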

Integration Tests:

  • Repository initialization (local, S3, SFTP)
  • Backup upload to restic
  • Snapshot listing
  • Stats retrieval
  • Connection testing

Manual Tests:

  1. Configure local repository via UI
  2. Configure S3 repository via UI
  3. Test connection validation before save
  4. Create backup and verify in restic
  5. Check repository stats display
  6. Test error handling for bad credentials

Phase 3 Testing

Integration Tests:

  • Restore database from snapshot
  • Restore PVC from snapshot
  • Full app restore
  • Handle missing/corrupted snapshots

Manual Tests:

  1. List snapshots in UI
  2. Select and restore from snapshot
  3. Verify database data after restore
  4. Verify PVC data after restore
  5. Verify application functions correctly

Deployment Guide

Phase 1 Deployment

Preparation:

  1. Update wild-central-api code
  2. Build and test on development instance
  3. Verify backup files created with real data
  4. Test manual restore

Rollout:

  1. Deploy to staging environment
  2. Create test backups for multiple apps
  3. Verify all backup files exist
  4. Manually restore one backup to verify
  5. Deploy to production

Rollback Plan:

  • Previous version still creates metadata files
  • No breaking changes to backup structure
  • Users can manually copy backup files if needed

Phase 2 Deployment

Preparation:

  1. Install restic on Wild Central devices: apt install restic
  2. Update wild-central-api with restic code
  3. Update wild-web-app with configuration UI
  4. Test on development with local repository
  5. Test with S3 and SFTP backends

Migration:

  • Existing local backups remain accessible
  • Users opt-in to restic by configuring repository
  • Gradual migration: Phase 1 staging continues working

Rollout:

  1. Deploy backend API updates
  2. Deploy web UI updates
  3. Create user documentation with examples
  4. Provide migration guide for existing setups

Rollback Plan:

  • Restic is optional: users can continue using local backups
  • Configuration in config.yaml: easy to revert
  • No data loss: existing backups preserved

Phase 3 Deployment

Preparation:

  1. Ensure Phase 2 is stable
  2. Ensure at least one backup exists in restic
  3. Test restore in staging environment

Rollout:

  1. Deploy restore functionality
  2. Document restore procedures
  3. Train users on restore process

Task Breakdown

Phase 1 Tasks (2-3 days)

| Task | Description                        | Effort | Dependencies  |
|------|------------------------------------|--------|---------------|
| 1.1  | Manifest-based database detection  | 2h     | None          |
| 1.2  | PostgreSQL backup via kubectl exec | 3h     | 1.1           |
| 1.3  | MySQL backup via kubectl exec      | 3h     | 1.1           |
| 1.4  | PVC discovery and backup           | 4h     | 1.1           |
| 1.5  | Update BackupApp flow              | 4h     | 1.2, 1.3, 1.4 |
| 1.6  | Build and test                     | 4h     | 1.5           |

Total: 20 hours (2.5 days)

Phase 2 Tasks (5-7 days)

| Task | Description                        | Effort | Dependencies  |
|------|------------------------------------|--------|---------------|
| 2.1  | Configuration management           | 3h     | Phase 1 done  |
| 2.2  | Restic operations module           | 4h     | 2.1           |
| 2.3  | Update backup flow for restic      | 2h     | 2.2           |
| 2.4  | API client updates                 | 2h     | Phase 1 done  |
| 2.5  | Configuration UI components        | 8h     | 2.4           |
| 2.6  | Integrate with BackupsPage         | 3h     | 2.5           |
| 2.7  | Backup configuration API handlers  | 4h     | 2.1, 2.2      |
| 2.8  | End-to-end testing                 | 4h     | 2.3, 2.6, 2.7 |

Total: 30 hours (3.75 days)

Phase 3 Tasks (3-5 days)

| Task | Description                 | Effort | Dependencies |
|------|-----------------------------|--------|--------------|
| 3.1  | List snapshots API          | 2h     | Phase 2 done |
| 3.2  | Restore snapshot function   | 5h     | 3.1          |
| 3.3  | Restore API handler         | 1h     | 3.2          |
| 3.4  | Restore UI                  | 4h     | 3.3          |
| 3.5  | End-to-end restore testing  | 3h     | 3.4          |

Total: 15 hours (2 days)

Grand Total

65 hours across 3 phases (8-12 days total)


Success Criteria

Phase 1 Success

  • App backups create actual database dumps (.sql files)
  • App backups create actual PVC archives (.tar.gz files)
  • Backup metadata accurately lists all files
  • Backups organized in timestamped directories
  • In-progress tracking works correctly
  • Delete functionality works for both app and cluster backups
  • No silent failures (clear error messages)
  • Manual restore verified working

Phase 2 Success

  • Users can configure restic repository via web UI
  • Configuration persists to config.yaml/secrets.yaml
  • Test connection validates before save
  • Backups automatically upload to restic repository
  • Repository stats display correctly in UI
  • Local, S3, and SFTP backends supported and tested
  • Clear error messages for authentication/connection failures
  • Staging files cleaned after successful upload

Phase 3 Success

  • Users can list available snapshots in UI
  • Users can restore from any snapshot via UI
  • Database restoration works correctly
  • PVC restoration works correctly
  • Application functional after restore
  • Error handling for corrupted snapshots

Long-Term Metrics

  • Storage Efficiency: Deduplication achieves 60-80% space savings
  • Reliability: < 1% backup failures
  • Performance: Backup TB-scale data in < 4 hours
  • User Satisfaction: Backup/restore completes without support intervention

Dependencies and Prerequisites

External Dependencies

Restic (backup tool):

  • Installation: apt install restic
  • Version: >= 0.16.0 recommended
  • License: BSD 2-Clause (compatible)
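
Because restic is an external binary, the API can fail fast when it is missing. A minimal sketch of such a check (whether and where Wild Central performs it is an assumption):

```go
package backup

import (
    "fmt"
    "log"
    "os/exec"
    "strings"
)

// checkRestic verifies that restic is installed and runnable. This is a
// sketch; the call site is an assumption, not part of this guide.
func checkRestic() error {
    path, err := exec.LookPath("restic")
    if err != nil {
        return fmt.Errorf("restic not found in PATH (install with `apt install restic`): %w", err)
    }
    out, err := exec.Command(path, "version").Output()
    if err != nil {
        return fmt.Errorf("`restic version` failed: %w", err)
    }
    log.Printf("backup: using %s", strings.TrimSpace(string(out)))
    return nil
}
```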

kubectl (Kubernetes CLI):

  • Already required for Wild Cloud operations
  • Used for database dumps and PVC backup

Infrastructure Prerequisites

Storage Requirements:

Staging Directory:

  • Location: /var/lib/wild-central/backup-staging (default)
  • Space: max(largest_database, largest_pvc) + 20% buffer
  • Recommendation: Monitor space, warn if < 50GB free
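
For example, if the largest single item to stage is a 500 GB PVC archive, plan for roughly 600 GB of free staging space (500 GB + 20%).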

Restic Repository:

  • Local: Sufficient disk space on target mount
  • Network: Mounted filesystem (NFS/SMB)
  • Cloud: Typically unlimited, check quota/billing

Network Requirements:

  • Outbound HTTPS (443) for S3/B2/cloud backends
  • Outbound SSH (22 or custom) for SFTP
  • No inbound ports needed

Security Considerations

Credentials Storage:

  • Stored in secrets.yaml
  • Never logged or exposed in API responses
  • Transmitted only via HTTPS to backend APIs
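
To make the first two bullets concrete, a sketch of handing credentials to restic via the process environment so they never appear in command lines or logs. The ResticConfig field names are assumptions; the environment variable names are restic's own (RESTIC_REPOSITORY, RESTIC_PASSWORD, and the standard AWS variables used by the s3 backend):

```go
package backup

import "os"

// ResticConfig is a sketch of the values read from config.yaml and
// secrets.yaml; the field names are assumptions.
type ResticConfig struct {
    Repository string // e.g. "s3:s3.amazonaws.com/bucket/path"
    Password   string // repository encryption password
    AccessKey  string // S3/B2 access key, if applicable
    SecretKey  string // S3/B2 secret key, if applicable
}

// resticEnv builds the environment for a restic subprocess. Passing
// credentials as environment variables keeps them out of process
// listings and logged command strings.
func resticEnv(cfg ResticConfig) []string {
    env := append(os.Environ(),
        "RESTIC_REPOSITORY="+cfg.Repository,
        "RESTIC_PASSWORD="+cfg.Password,
    )
    if cfg.AccessKey != "" {
        env = append(env,
            "AWS_ACCESS_KEY_ID="+cfg.AccessKey,
            "AWS_SECRET_ACCESS_KEY="+cfg.SecretKey,
        )
    }
    return env
}
```

The resulting slice is assigned to cmd.Env before running any restic subcommand.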

Encryption:

  • Restic: AES-256 encryption of all backup data
  • Transport: TLS for cloud backends, SSH for SFTP
  • At rest: Depends on backend (S3 server-side encryption, etc.)

Access Control:

  • API endpoints check instance ownership
  • Repository password required for all restic operations
  • Backend credentials validated before save

Philosophy Compliance Review

KISS (Keep It Simple, Stupid)

What We're Doing Right:

  • Restic repository URI as simple string (native format)
  • Backend type auto-detected from URI prefix
  • Credentials organized by backend type
  • No complex abstraction layers

What We're Avoiding:

  • Custom backup format
  • Complex configuration DSL
  • Over-abstracted backend interfaces
  • Scheduling/automation (not needed yet)

YAGNI (You Aren't Gonna Need It)

Building Only What's Needed:

  • Basic configuration (repository, credentials, retention)
  • Test connection before save
  • Upload to restic after staging
  • Display repository stats

Not Building (until proven needed):

  • Automated scheduling
  • Multiple repository support
  • Backup verification automation
  • Email notifications
  • Bandwidth limiting
  • Custom encryption options

No Future-Proofing

Current Requirements Only:

  • Support TB-scale data (restic deduplication)
  • Flexible storage destinations (restic backends)
  • Storage constraints (upload to remote, not local-only)

Not Speculating On:

  • "What if users want backup versioning rules?"
  • "What if users need bandwidth control?"
  • "What if users want custom encryption?"
  • Build these features WHEN users ask, not before

Trust in Emergence

Starting Simple:

  • Phase 1: Fix core backup (files actually created)
  • Phase 2: Add restic upload (storage flexibility)
  • Phase 3: Add restore from restic
  • Phase 4+: Wait for user feedback

Let complexity emerge from actual needs, not speculation.


Conclusion

This complete implementation guide provides everything needed to implement a production-ready backup system for Wild Cloud across three phases:

  1. Phase 1 (CRITICAL): Fix broken app backups by creating actual database dumps and PVC archives using manifest-based detection and kubectl exec
  2. Phase 2 (HIGH): Integrate restic for TB-scale data, flexible storage backends, and configuration via web UI
  3. Phase 3 (MEDIUM): Enable restore from restic snapshots

All phases are designed following Wild Cloud's KISS/YAGNI philosophy: build only what's needed now, let complexity emerge from actual requirements, and trust that good architecture emerges from simplicity.

A senior engineer can begin Phase 1 immediately; all necessary context, specifications, code examples, and guidance are provided above.


Document Version: 1.0 Created: 2025-11-26 Status: Ready for implementation Next Action: Begin Phase 1, Task 1.1