Compare commits

...

4 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Paul Payne | 44ebbbd42c | Adds docs. | 2025-11-04 16:44:36 +00:00 |
| Paul Payne | f9c7a9e2f4 | Adds docs. | 2025-10-12 00:36:17 +00:00 |
| Paul Payne | 97dc73cc9c | Update Makefile. | 2025-10-11 21:43:12 +00:00 |
| Paul Payne | 09e64acad9 | Adds docs. | 2025-10-11 18:13:37 +00:00 |
19 changed files with 1648 additions and 36 deletions


@@ -107,46 +107,11 @@ clean:
	@go clean
	@echo "✅ Clean complete"
test:
	@echo "🧪 Running tests..."
	@go test -v ./...
run:
	@echo "🚀 Running $(BINARY_NAME)..."
	@go run -ldflags="$(LDFLAGS)" .
dev:
	@echo "🚀 Running $(BINARY_NAME) in development mode..."
	@go run . &
	@echo "Daemon started on http://localhost:5055"
# Code quality targets
fmt:
	@echo "🎨 Formatting code..."
	@go fmt ./...
	@echo "✅ Format complete"
vet:
	@echo "🔍 Running go vet..."
	@go vet ./...
	@echo "✅ Vet complete"
check: fmt vet test
	@echo "✅ All checks passed"
# Dependency management
deps-check:
	@echo "📦 Checking dependencies..."
	@go mod verify
	@go mod tidy
	@echo "✅ Dependencies verified"
# Version information
version:
	@echo "Version: $(VERSION)"
	@echo "Git Commit: $(GIT_COMMIT)"
	@echo "Build Time: $(BUILD_TIME)"
	@echo "Go Version: $(GO_VERSION)"
install: build
	sudo cp $(BUILD_DIR)/$(BINARY_NAME) /usr/bin/
@@ -169,4 +134,4 @@ repo: package-all
	./scripts/build-apt-repository.sh
deploy-repo: repo
	./scripts/deploy-apt-repository.sh
	./scripts/deploy-apt-repository.sh

README.md Normal file

@@ -0,0 +1,103 @@
# Wild Central
## Installation
### APT Repository (Recommended)
```bash
# Download and install GPG key
curl -fsSL https://mywildcloud.org/apt/wild-cloud-central.gpg | sudo tee /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg > /dev/null
# Add repository (modern .sources format)
sudo tee /etc/apt/sources.list.d/wild-cloud-central.sources << 'EOF'
Types: deb
URIs: https://mywildcloud.org/apt
Suites: stable
Components: main
Signed-By: /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg
EOF
# Update and install
sudo apt update
sudo apt install wild-cloud-central
```
### Manual Installation
Download the latest `.deb` package from the [releases page](https://github.com/wildcloud/wild-central/releases) and install:
```bash
sudo dpkg -i wild-cloud-central_*.deb
sudo apt-get install -f # Fix any dependency issues
```
## Quick Start
1. **Configure the service** (optional):
```bash
sudo cp /etc/wild-cloud-central/config.yaml.example /etc/wild-cloud-central/config.yaml
sudo nano /etc/wild-cloud-central/config.yaml
```
2. **Start the service**:
```bash
sudo systemctl enable wild-cloud-central
sudo systemctl start wild-cloud-central
```
3. **Access the web interface**:
Open http://your-server-ip in your browser
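As a quick check that the daemon is responding (assuming the default API port of 5055 from the example config), the health endpoint can be queried directly:
```bash
curl http://your-server-ip:5055/api/v1/health
```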
## Features
- **Web Management Interface** - Browser-based configuration and monitoring
- **REST API** - JSON API for programmatic management
- **DNS/DHCP Services** - Integrated dnsmasq configuration management
- **PXE Boot Support** - Automatic Talos Linux asset downloading and serving
## Basic Configuration
The service uses `/etc/wild-cloud-central/config.yaml` for configuration:
```yaml
cloud:
  domain: "wildcloud.local"
  dns:
    ip: "192.168.8.50"                        # Your server's IP
    dhcpRange: "192.168.8.100,192.168.8.200"
  cluster:
    endpointIp: "192.168.8.60"                # Talos cluster endpoint
    nodes:
      talos:
        version: "v1.8.0"                     # Talos version to use
## Service Management
```bash
# Check status
sudo systemctl status wild-cloud-central
# View logs
sudo journalctl -u wild-cloud-central -f
# Restart service
sudo systemctl restart wild-cloud-central
# Stop service
sudo systemctl stop wild-cloud-central
```
## Support
- **Documentation**: See `docs/` directory for detailed guides
- **Issues**: Report problems on the project issue tracker
- **API Reference**: Available at `/api/v1/` endpoints when service is running
## Documentation
- [Developer Guide](docs/DEVELOPER.md) - Development setup, testing, and API reference
- [Maintainer Guide](docs/MAINTAINER.md) - Package management and repository deployment

docs/DEVELOPER.md Normal file

@@ -0,0 +1,121 @@
### Building Locally
1. **Build the application:**
```bash
make build
```
2. **Run locally:**
```bash
make run
```
3. **Development with auto-reload:**
```bash
make dev
```
### Dependencies
- **gorilla/mux** - HTTP routing
- **gopkg.in/yaml.v3** - YAML configuration parsing
## API Reference
### Endpoints
- `GET /api/v1/health` - Service health check
- `GET /api/v1/config` - Get current configuration
- `PUT /api/v1/config` - Update configuration
- `GET /api/v1/dnsmasq/config` - Generate dnsmasq configuration
- `POST /api/v1/dnsmasq/restart` - Restart dnsmasq service
- `POST /api/v1/pxe/assets` - Download/update PXE boot assets
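For example, a quick smoke test against a locally running instance (assuming the default port 5055 from the configuration below):
```bash
curl http://localhost:5055/api/v1/health   # should return a healthy status
curl http://localhost:5055/api/v1/config   # returns the current configuration as JSON
```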
### Configuration
Edit `config.yaml` to customize your deployment:
```yaml
server:
  port: 5055
  host: "0.0.0.0"
cloud:
  domain: "wildcloud.local"
  dns:
    ip: "192.168.8.50"
    dhcpRange: "192.168.8.100,192.168.8.200"
  cluster:
    endpointIp: "192.168.8.60"
    nodes:
      talos:
        version: "v1.8.0"
## Testing
> ⚠️ **Note**: These Docker scripts test the installation process only. In production, use `sudo apt install wild-cloud-central` and manage via systemd.
Choose the testing approach that fits your needs:
### 1. Automated Verification - `./tests/integration/test-docker.sh`
- **When to use**: Verify the installation works correctly
- **What it does**: Builds .deb package, installs it, tests all endpoints automatically
- **Best for**: CI/CD, quick verification that everything works
### 2. Background Testing - `./tests/integration/start-background.sh` / `./tests/integration/stop-background.sh`
- **When to use**: You want to test APIs while doing other work
- **What it does**: Starts services silently in background, gives you your terminal back
- **Example workflow**: Start services, test in another terminal, stop when done
```bash
./tests/integration/start-background.sh # Services start, terminal returns immediately
curl http://localhost:9081/api/v1/health # Test in same or different terminal
# Continue working while services run...
./tests/integration/stop-background.sh # Clean shutdown when finished
```
### 3. Interactive Development - `./tests/integration/start-interactive.sh`
- **When to use**: You want to see what's happening as you test
- **What it does**: Starts services with live logs, takes over your terminal
- **Example workflow**: Start services, watch logs in real-time, Ctrl+C to stop
```bash
./tests/integration/start-interactive.sh # Services start, shows live logs
# You see all HTTP requests, errors, debug info in real-time
# Press Ctrl+C when done - terminal is "busy" until then
```
### 4. Shell Access - `./tests/integration/debug-container.sh`
- **When to use**: Deep debugging, manual service control, file inspection
- **What it does**: Drops you into the container shell
- **Best for**: Investigating issues, manually starting/stopping services
### Test Access Points
All services bind to localhost (127.0.0.1) on non-standard ports, so they won't interfere with your local services:
- Management UI: http://localhost:9080
- API: http://localhost:9081
- DNS: localhost:9053 (UDP) - test with `dig @localhost -p 9053 wildcloud.local`
- DHCP: localhost:9067 (UDP)
- TFTP: localhost:9069 (UDP)
- Container logs: `docker logs wild-central-bg`
## Architecture
This service replaces the original bash script implementation with:
- Unified configuration management
- Real-time dnsmasq configuration generation
- Integrated Talos factory asset downloading
- Web-based management interface
- Proper systemd service integration
## Make Targets
- `make build` - Build the Go binary
- `make run` - Run the application locally
- `make dev` - Start development server
- `make test` - Run Go tests
- `make clean` - Clean build artifacts
- `make deb` - Create Debian package
- `make repo` - Build APT repository
- `make deploy-repo` - Deploy repository to server

docs/MAINTAINER.md Normal file

@@ -0,0 +1,356 @@
# Maintainer Guide
This guide covers the complete build pipeline, package creation, repository management, and deployment for Wild Cloud Central.
## Build System Overview
Wild Cloud Central uses a modern, multi-stage build system with clear separation of concerns:
1. **Build** - Compile binaries with version information
2. **Package** - Create .deb packages for distribution
3. **Repository** - Build APT repository with GPG signing
4. **Deploy** - Upload to production server
### Quick Reference
```bash
make help # Show all available targets
make version # Show build information
make check # Run quality checks (fmt + vet + test)
make clean # Remove all build artifacts
```
## Development Workflow
### Code Quality Pipeline
Before building, always run quality checks:
```bash
make check
```
This runs:
- `go fmt` - Code formatting
- `go vet` - Static analysis
- `go test` - Unit tests
### Building Binaries
```bash
# Build for current architecture
make build
# Build for specific architecture
make build-amd64
make build-arm64
# Build all architectures
make build-all
```
Binaries include version information from Git and build metadata.
## Package Management
### Creating Debian Packages
```bash
# Create package for current architecture
make package
# Create packages for specific architectures
make package-amd64
make package-arm64
# Create all packages
make package-all
# Legacy alias (deprecated)
make deb
```
This creates `build/wild-cloud-central_0.1.0_amd64.deb` with:
- Binary installed to `/usr/bin/wild-cloud-central`
- Systemd service file
- Configuration template
- Web interface files
- Nginx configuration
### Package Structure
The .deb package includes:
- `/usr/bin/wild-cloud-central` - Main binary
- `/etc/systemd/system/wild-cloud-central.service` - Systemd service
- `/etc/wild-cloud-central/config.yaml.example` - Configuration template
- `/var/www/html/wild-central/` - Web interface files
- `/etc/nginx/sites-available/wild-central` - Nginx configuration
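To double-check what actually ships, the package contents can be listed before publishing (path shown assumes the build output above):
```bash
dpkg -c build/wild-cloud-central_0.1.0_amd64.deb
```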
### Post-installation Setup
The package automatically:
- Creates `wildcloud` system user
- Creates required directories with proper permissions
- Configures nginx
- Enables systemd service
- Sets up file ownership
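A quick post-install sanity check (ordinary system commands, not project tooling) might look like:
```bash
id wildcloud                          # system user created by the package
systemctl status wild-cloud-central   # service enabled and running
ls -l /etc/wild-cloud-central/        # configuration template in place
```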
## APT Repository Management
### Building Repository
```bash
make repo
```
This uses `./scripts/build-apt-repository.sh` with **aptly** to create a professional APT repository in `dist/repositories/apt/`:
- Complete repository metadata with all hash types (MD5, SHA1, SHA256, SHA512)
- Contents files for enhanced package discovery
- Multiple compression formats (.gz, .bz2) for compatibility
- Proper GPG signing with modern InRelease format
- Industry-standard repository structure following Debian conventions
The repository includes:
- `pool/main/w/wild-cloud-central/` - Package files
- `dists/stable/main/binary-amd64/` - Metadata and package lists
- `dists/stable/main/binary-arm64/` - ARM64 package metadata
- `dists/stable/InRelease` - Modern GPG signature (preferred)
- `dists/stable/Release.asc` - Traditional GPG signature compatibility
- `wild-cloud-central.gpg` - GPG public key for users
### Aptly Configuration
The build system automatically configures aptly to:
- Use strong RSA 4096-bit GPG keys
- Generate complete security metadata to prevent "weak security information" warnings
- Create Contents files for better package discovery
- Support multiple architectures (amd64, arm64)
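The exact invocations live in `./scripts/build-apt-repository.sh`; conceptually, the aptly flow resembles the following sketch (repo name, paths, and flags are illustrative, and signing uses the key created by `setup-gpg.sh`):
```bash
# Create a local repo, add the built packages, and publish it with full metadata
aptly repo create -distribution=stable -component=main wild-cloud-central
aptly repo add wild-cloud-central dist/packages/*.deb
aptly publish repo -architectures=amd64,arm64 wild-cloud-central
```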
### GPG Key Management
#### First-time Setup
```bash
./scripts/setup-gpg.sh
```
This creates:
- 4096-bit RSA GPG key pair
- Public key exported as `dist/wild-cloud-central.gpg` (binary format for APT)
- Key configured for 2-year expiration
- Automatic aptly configuration for repository signing
#### Key Renewal
When the key expires, regenerate with:
```bash
gpg --delete-secret-keys "Wild Cloud Central"
gpg --delete-keys "Wild Cloud Central"
make clean # Remove old GPG key and aptly state
./scripts/setup-gpg.sh
```
### Repository Deployment
1. **Configure server details** in `scripts/deploy-apt-repository.sh`:
```bash
SERVER="user@mywildcloud.org"
REMOTE_PATH="/var/www/html/apt"
```
2. **Deploy repository**:
```bash
make deploy-repo
```
This uploads the aptly-generated repository with complete security metadata, eliminating "weak security information" warnings and ensuring compatibility with modern APT security standards. Specifically, the deployment includes:
- Complete repository structure to server
- GPG public key for user verification
- Proper file permissions and structure
### Server Requirements
The target server needs:
- Web server (nginx/apache) serving `/var/www/html/apt`
- HTTPS support for `https://mywildcloud.org/apt`
- SSH access for deployment
### Repository Structure
```
/var/www/html/apt/
├── dists/
│   └── stable/
│       ├── InRelease (modern GPG signature)
│       ├── Release
│       ├── Release.asc
│       └── main/
│           ├── binary-amd64/
│           │   ├── Packages
│           │   ├── Packages.gz
│           │   └── Release
│           └── binary-arm64/
│               ├── Packages
│               ├── Packages.gz
│               └── Release
├── pool/
│   └── main/
│       └── w/
│           └── wild-cloud-central/
│               ├── wild-cloud-central_0.1.0_amd64.deb
│               └── wild-cloud-central_0.1.0_arm64.deb
├── Contents-amd64 (enhanced package discovery)
├── Contents-amd64.gz
└── wild-cloud-central.gpg (binary format for APT)
```
## Release Process
### Standard Release
1. **Update version** in `Makefile`:
```makefile
VERSION := 0.2.0
```
2. **Quality assurance and build**:
```bash
make clean # Clean previous builds
make check # Run quality checks
make build-all # Build all architectures
./tests/integration/test-docker.sh # Integration tests
```
3. **Create packages and repository**:
```bash
make package-all # Create .deb packages
make repo # Build APT repository
```
4. **Deploy**:
```bash
make deploy-repo # Upload to server
```
5. **Verify deployment**:
```bash
curl -I https://mywildcloud.org/apt/dists/stable/Release
curl -I https://mywildcloud.org/apt/wild-cloud-central.gpg
```
### Quick Development Release
For amd64-only development releases:
```bash
make clean && make check && make repo && make deploy-repo
```
### Multi-architecture Release
For production releases with full architecture support:
```bash
make clean && make check && make package-all && make repo && make deploy-repo
```
## User Installation
Users install packages using the modern APT `.sources` format:
```bash
# Download and install GPG key (binary format)
curl -fsSL https://mywildcloud.org/apt/wild-cloud-central.gpg | \
sudo tee /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg > /dev/null
# Add repository using modern .sources format
sudo tee /etc/apt/sources.list.d/wild-cloud-central.sources << 'EOF'
Types: deb
URIs: https://mywildcloud.org/apt
Suites: stable
Components: main
Signed-By: /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg
EOF
# Update and install
sudo apt update
sudo apt install wild-cloud-central
```
### Legacy Installation (Deprecated)
The old `.list` format still works but generates warnings:
```bash
# Download GPG key (requires conversion)
curl -fsSL https://mywildcloud.org/apt/wild-cloud-central.gpg | \
sudo gpg --dearmor -o /usr/share/keyrings/wild-cloud-central.gpg
# Add repository using legacy format (deprecated)
echo 'deb [signed-by=/usr/share/keyrings/wild-cloud-central.gpg] https://mywildcloud.org/apt stable main' | \
sudo tee /etc/apt/sources.list.d/wild-cloud-central.list
```
## Troubleshooting
### GPG Issues
- **"no default secret key"**: Run `./scripts/setup-gpg.sh`
- **Key conflicts**: Delete existing keys before recreating
- **Permission errors**: Ensure `~/.gnupg` has correct permissions (700)
### Repository Issues
- **Package not found**: Verify `dpkg-scanpackages` output
- **Signature verification failed**: Regenerate GPG key and re-sign
- **404 errors**: Check web server configuration and file permissions
- **Legacy format warnings**: Use modern `.sources` format instead of `.list`
- **GPG key mismatch**: Ensure deployed key matches signing key
### Deployment Issues
- **SSH failures**: Verify server credentials in `scripts/deploy-apt-repository.sh`
- **Permission denied**: Ensure target directory is writable
- **rsync errors**: Check network connectivity and paths
## Monitoring
### Service Health
```bash
curl https://mywildcloud.org/apt/dists/stable/Release
curl https://mywildcloud.org/apt/wild-cloud-central.gpg
```
### Package Statistics
Monitor download statistics through web server logs:
```bash
grep "wild-cloud-central.*\.deb" /var/log/nginx/access.log | wc -l
```
### Repository Integrity
Verify signatures regularly:
```bash
gpg --verify Release.asc Release
```

docs/MAINTENANCE.md Normal file

@@ -0,0 +1,23 @@
# Maintenance Guide
Keep your wild cloud running smoothly.
- [Security Best Practices](./guides/security.md)
- [Monitoring](./guides/monitoring.md)
- [Making backups](./guides/making-backups.md)
- [Restoring backups](./guides/restoring-backups.md)
## Upgrade
- [Upgrade applications](./guides/upgrade-applications.md)
- [Upgrade kubernetes](./guides/upgrade-kubernetes.md)
- [Upgrade Talos](./guides/upgrade-talos.md)
- [Upgrade Wild Cloud](./guides/upgrade-wild-cloud.md)
## Troubleshooting
- [Cluster issues](./guides/troubleshoot-cluster.md)
- [DNS issues](./guides/troubleshoot-dns.md)
- [Service connectivity issues](./guides/troubleshoot-service-connectivity.md)
- [TLS certificate issues](./guides/troubleshoot-tls-certificates.md)
- [Visibility issues](./guides/troubleshoot-visibility.md)


@@ -0,0 +1,79 @@
# Packaging Wild Central
## Desired Experience
This is the desired experience for installing Wild Cloud Central on a fresh Debian/Ubuntu system:
### APT Repository (Recommended)
```bash
# Download and install GPG key
curl -fsSL https://mywildcloud.org/apt/wild-cloud-central.gpg | sudo tee /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg > /dev/null
# Add repository (modern .sources format)
sudo tee /etc/apt/sources.list.d/wild-cloud-central.sources << 'EOF'
Types: deb
URIs: https://mywildcloud.org/apt
Suites: stable
Components: main
Signed-By: /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg
EOF
# Update and install
sudo apt update
sudo apt install wild-cloud-central
```
### Manual Installation
Download the latest `.deb` package from the [releases page](https://github.com/wildcloud/wild-central/releases) and install:
```bash
sudo dpkg -i wild-cloud-central_*.deb
sudo apt-get install -f # Fix any dependency issues
```
## Quick Start
1. **Configure the service** (optional):
```bash
sudo cp /etc/wild-cloud-central/config.yaml.example /etc/wild-cloud-central/config.yaml
sudo nano /etc/wild-cloud-central/config.yaml
```
2. **Start the service**:
```bash
sudo systemctl enable wild-cloud-central
sudo systemctl start wild-cloud-central
```
3. **Access the web interface**:
Open http://your-server-ip in your browser
## Developer tooling
Makefile commands for packaging:
Package targets (create .deb packages):
- `make package` - Create .deb package for current arch
- `make package-arm64` - Create arm64 .deb package
- `make package-amd64` - Create amd64 .deb package
- `make package-all` - Create all .deb packages

Repository targets:
- `make repo` - Build APT repository from packages
- `make deploy-repo` - Deploy repository to server

Directory structure:
- `build/` - Intermediate build artifacts
- `dist/bin/` - Final binaries for distribution
- `dist/packages/` - OS packages (.deb files)
- `dist/repositories/` - APT repository for deployment

Example workflows:
- `make clean && make repo` - Full release build


@@ -0,0 +1,265 @@
# Making Backups
This guide covers how to create backups of your wild-cloud infrastructure using the integrated backup system.
## Overview
The wild-cloud backup system creates encrypted, deduplicated snapshots using restic. It backs up three main components:
- **Applications**: Database dumps and persistent volume data
- **Cluster**: Kubernetes resources and etcd state
- **Configuration**: Wild-cloud repository and settings
## Prerequisites
Before making backups, ensure you have:
1. **Environment configured**: Run `source env.sh` to load backup configuration
2. **Restic repository**: Backup repository configured in `config.yaml`
3. **Backup password**: Set in wild-cloud secrets
4. **Staging directory**: Configured path for temporary backup files
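A quick way to confirm these are in place (assuming `env.sh` exports the restic repository and password) is to list existing snapshots:
```bash
source env.sh
restic snapshots --compact   # should list snapshots (or nothing on a fresh repository) without errors
```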
## Backup Components
### Applications (`wild-app-backup`)
Backs up individual applications including:
- **Database dumps**: PostgreSQL/MySQL databases in compressed custom format
- **PVC data**: Application files streamed directly for restic deduplication
- **Auto-discovery**: Finds databases and PVCs based on app manifest.yaml
### Cluster Resources (`wild-backup --cluster-only`)
Backs up cluster-wide resources:
- **Kubernetes resources**: All pods, services, deployments, secrets, configmaps
- **Storage definitions**: PersistentVolumes, PVCs, StorageClasses
- **etcd snapshot**: Complete cluster state for disaster recovery
### Configuration (`wild-backup --home-only`)
Backs up wild-cloud configuration:
- **Repository contents**: All app definitions, manifests, configurations
- **Settings**: Wild-cloud configuration files and customizations
## Making Backups
### Full System Backup (Recommended)
Create a complete backup of everything:
```bash
# Backup all components (apps + cluster + config)
wild-backup
```
This is equivalent to:
```bash
wild-backup --home --apps --cluster
```
### Selective Backups
#### Applications Only
```bash
# All applications
wild-backup --apps-only
# Single application
wild-app-backup discourse
# Multiple applications
wild-app-backup discourse gitea immich
```
#### Cluster Only
```bash
# Kubernetes resources + etcd
wild-backup --cluster-only
```
#### Configuration Only
```bash
# Wild-cloud repository
wild-backup --home-only
```
### Excluding Components
Skip specific components:
```bash
# Skip config, backup apps + cluster
wild-backup --no-home
# Skip applications, backup config + cluster
wild-backup --no-apps
# Skip cluster resources, backup config + apps
wild-backup --no-cluster
```
## Backup Process Details
### Application Backup Process
1. **Discovery**: Parses `manifest.yaml` to find database and PVC dependencies
2. **Database backup**: Creates compressed custom-format dumps
3. **PVC backup**: Streams files directly to staging for restic deduplication
4. **Staging**: Organizes files in clean directory structure
5. **Upload**: Creates individual restic snapshots per application
### Cluster Backup Process
1. **Resource export**: Exports all Kubernetes resources to YAML
2. **etcd snapshot**: Creates point-in-time etcd backup via talosctl
3. **Upload**: Creates single restic snapshot for cluster state
### Restic Snapshots
Each backup creates tagged restic snapshots:
```bash
# View all snapshots
restic snapshots
# Filter by component
restic snapshots --tag discourse # Specific app
restic snapshots --tag cluster # Cluster resources
restic snapshots --tag wc-home # Wild-cloud config
```
## Where Backup Files Are Staged
Before uploading to your restic repository, backup files are organized in a staging directory. This temporary area lets you see exactly what's being backed up and helps with deduplication.
Here's what the staging area looks like:
```
backup-staging/
├── apps/
│   ├── discourse/
│   │   ├── database_20250816T120000Z.dump
│   │   ├── globals_20250816T120000Z.sql
│   │   └── discourse/
│   │       └── data/              # All the actual files
│   ├── gitea/
│   │   ├── database_20250816T120000Z.dump
│   │   └── gitea-data/
│   │       └── data/              # Git repositories, etc.
│   └── immich/
│       ├── database_20250816T120000Z.dump
│       └── immich-data/
│           └── upload/            # Photos and videos
└── cluster/
    ├── all-resources.yaml         # All running services
    ├── secrets.yaml               # Passwords and certificates
    ├── configmaps.yaml            # Configuration data
    └── etcd-snapshot.db           # Complete cluster state
```
This staging approach means you can examine backup contents before they're uploaded, and restic can efficiently deduplicate files that haven't changed.
## Advanced Usage
### Custom Backup Scripts
Applications can provide custom backup logic:
```bash
# Create apps/myapp/backup.sh for custom behavior
chmod +x apps/myapp/backup.sh
# wild-app-backup will use custom script if present
wild-app-backup myapp
```
### Monitoring Backup Status
```bash
# Check recent snapshots
restic snapshots | head -20
# Check specific app backups
restic snapshots --tag discourse
# Verify backup integrity
restic check
```
### Backup Automation
Set up automated backups with cron:
```bash
# Daily full backup at 2 AM
0 2 * * * cd /data/repos/payne-cloud && source env.sh && wild-backup
# Hourly app backups during business hours
0 9-17 * * * cd /data/repos/payne-cloud && source env.sh && wild-backup --apps-only
```
## Performance Considerations
### Large PVCs (like Immich photos)
The streaming backup approach provides:
- **First backup**: Full transfer time (all files processed)
- **Subsequent backups**: Only changed files processed (dramatically faster)
- **Storage efficiency**: Restic deduplication reduces storage usage
### Network Usage
- **Database dumps**: Compressed at source, efficient transfer
- **PVC data**: Uncompressed transfer, but restic handles deduplication
- **etcd snapshots**: Small files, minimal impact
## Troubleshooting
### Common Issues
**"No databases or PVCs found"**
- App has no `manifest.yaml` with database dependencies
- No PVCs with matching labels in app namespace
- Create custom `backup.sh` script for special cases
**"kubectl not found"**
- Ensure kubectl is installed and configured
- Check cluster connectivity with `kubectl get nodes`
**"Staging directory not set"**
- Configure `cloud.backup.staging` in `config.yaml`
- Ensure directory exists and is writable
**"Could not create etcd backup"**
- Ensure `talosctl` is installed for Talos clusters
- Check control plane node connectivity
- Verify etcd pods are accessible in kube-system namespace
### Backup Verification
Always verify backups periodically:
```bash
# Check restic repository integrity
restic check
# List recent snapshots
restic snapshots --compact
# Test restore to different directory
restic restore latest --target /tmp/restore-test
```
## Security Notes
- **Encryption**: All backups are encrypted with your backup password
- **Secrets**: Kubernetes secrets are included in cluster backups
- **Access control**: Secure your backup repository and passwords
- **Network**: Consider bandwidth usage for large initial backups
## Next Steps
- [Restoring Backups](restoring-backups.md) - Learn how to restore from backups
- Configure automated backup schedules
- Set up backup monitoring and alerting
- Test disaster recovery procedures

docs/guides/monitoring.md Normal file

@@ -0,0 +1,50 @@
# System Health Monitoring
## Basic Monitoring
Check system health with:
```bash
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods -A
# Persistent volume claims
kubectl get pvc -A
```
## Advanced Monitoring (Future Implementation)
Consider implementing:
1. **Prometheus + Grafana** for comprehensive monitoring:
```bash
# Placeholder for future implementation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```
2. **Loki** for log aggregation:
```bash
# Placeholder for future implementation
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace logging --create-namespace
```
## Additional Resources
This document will be expanded in the future with:
- Detailed backup and restore procedures
- Monitoring setup instructions
- Comprehensive security hardening guide
- Automated maintenance scripts
For now, refer to the following external resources:
- [K3s Documentation](https://docs.k3s.io/)
- [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug/)
- [Velero Backup Documentation](https://velero.io/docs/latest/)
- [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/)


@@ -0,0 +1,294 @@
# Restoring Backups
This guide will walk you through restoring your applications and cluster from wild-cloud backups. Hopefully you'll never need this, but when you do, it's critical that the process works smoothly.
## Understanding Restore Types
Your wild-cloud backup system can restore different types of data depending on what you need to recover:
**Application restores** bring back individual applications by restoring their database contents and file storage. This is what you'll use most often - maybe you accidentally deleted something in Discourse, or Gitea got corrupted, or you want to roll back Immich to before a bad update.
**Cluster restores** are for disaster recovery scenarios where you need to rebuild your entire Kubernetes cluster from scratch. This includes restoring all the cluster's configuration and even its internal state.
**Configuration restores** bring back your wild-cloud repository and settings, which contain all the "recipes" for how your infrastructure should be set up.
## Before You Start Restoring
Make sure you have everything needed to perform restores. You need to be in your wild-cloud directory with the environment loaded (`source env.sh`). Your backup repository and password should be configured and working - you can test this by running `restic snapshots` to see your available backups.
Most importantly, make sure you have kubectl access to your cluster, since restores involve creating temporary pods and manipulating storage.
## Restoring Applications
### Basic Application Restore
The most common restore scenario is bringing back a single application. To restore the latest backup of an app:
```bash
wild-app-restore discourse
```
This restores both the database and all file storage for the discourse app. The restore system automatically figures out what the app needs based on its manifest file and what was backed up.
If you want to restore from a specific backup instead of the latest:
```bash
wild-app-restore discourse abc123
```
Where `abc123` is the snapshot ID from `restic snapshots --tag discourse`.
### Partial Restores
Sometimes you only need to restore part of an application. Maybe the database is fine but the files got corrupted, or vice versa.
To restore only the database:
```bash
wild-app-restore discourse --db-only
```
To restore only the file storage:
```bash
wild-app-restore discourse --pvc-only
```
To restore without database roles and permissions (if they're causing conflicts):
```bash
wild-app-restore discourse --skip-globals
```
### Finding Available Backups
To see what backups are available for an app:
```bash
wild-app-restore discourse --list
```
This shows recent snapshots with their IDs, timestamps, and what was included.
## How Application Restores Work
Understanding what happens during a restore can help when things don't go as expected.
### Database Restoration
When restoring a database, the system first downloads the backup files from your restic repository. It then prepares the database by creating any needed roles, disconnecting existing users, and dropping/recreating the database to ensure a clean restore.
For PostgreSQL databases, it uses `pg_restore` with parallel processing to speed up large database imports. For MySQL, it uses standard mysql import commands. The system also handles database ownership and permissions automatically.
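For reference, the PostgreSQL path is roughly equivalent to copying the dump into the postgres pod and running `pg_restore` with parallel jobs. `wild-app-restore` handles the exact flags, so treat this as an illustrative sketch (the dump filename and pod name are placeholders):
```bash
# Copy the dump into the postgres pod, then restore with parallel jobs
kubectl cp ./restore/discourse/database_20250816T120000Z.dump \
  postgres/postgres-deployment-xxx:/tmp/
kubectl exec -n postgres deploy/postgres-deployment -- \
  pg_restore -U postgres -d discourse --jobs=4 /tmp/database_20250816T120000Z.dump
```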
### File Storage Restoration
File storage (PVC) restoration is more complex because it involves safely replacing files that might be actively used by running applications.
First, the system creates a safety snapshot using Longhorn. This means if something goes wrong during the restore, you can get back to where you started. Then it scales your application down to zero replicas so no pods are using the storage.
Next, it creates a temporary utility pod with the PVC mounted and copies all the backup files into place, preserving file permissions and structure. Once the data is restored and verified, it removes the utility pod and scales your application back up.
If everything worked correctly, the safety snapshot is automatically deleted. If something went wrong, the safety snapshot is preserved so you can recover manually.
## Cluster Disaster Recovery
Cluster restoration is much less common but critical when you need to rebuild your entire infrastructure.
### Restoring Kubernetes Resources
To restore all cluster resources from a backup:
```bash
# Download cluster backup
restic restore --tag cluster latest --target ./restore/
# Apply all resources
kubectl apply -f restore/cluster/all-resources.yaml
```
You can also restore specific types of resources:
```bash
kubectl apply -f restore/cluster/secrets.yaml
kubectl apply -f restore/cluster/configmaps.yaml
```
### Restoring etcd State
**Warning: This is extremely dangerous and will affect your entire cluster.**
etcd restoration should only be done when rebuilding a cluster from scratch. For Talos clusters:
```bash
talosctl --nodes <control-plane-ip> etcd restore --from ./restore/cluster/etcd-snapshot.db
```
This command stops etcd, replaces its data with the backup, and restarts the cluster. Expect significant downtime while the cluster rebuilds itself.
## Common Disaster Recovery Scenarios
### Complete Application Loss
When an entire application is gone (namespace deleted, pods corrupted, etc.):
```bash
# Make sure the namespace exists
kubectl create namespace discourse --dry-run=client -o yaml | kubectl apply -f -
# Apply the application manifests if needed
kubectl apply -f apps/discourse/
# Restore the application data
wild-app-restore discourse
```
### Complete Cluster Rebuild
When rebuilding a cluster from scratch:
First, build your new cluster infrastructure and install wild-cloud components. Then configure backup access so you can reach your backup repository.
Restore cluster state:
```bash
restic restore --tag cluster latest --target ./restore/
# Apply etcd snapshot using appropriate method for your cluster type
```
Finally, restore all applications:
```bash
# See what applications are backed up
wild-app-restore --list
# Restore each application individually
wild-app-restore discourse
wild-app-restore gitea
wild-app-restore immich
```
### Rolling Back After Bad Changes
Sometimes you need to undo recent changes to an application:
```bash
# See available snapshots
wild-app-restore discourse --list
# Restore from before the problematic changes
wild-app-restore discourse abc123
```
## Cross-Cluster Migration
You can use backups to move applications between clusters:
On the source cluster, create a fresh backup:
```bash
wild-app-backup discourse
```
On the target cluster, deploy the application manifests:
```bash
kubectl apply -f apps/discourse/
```
Then restore the data:
```bash
wild-app-restore discourse
```
## Verifying Successful Restores
After any restore, verify that everything is working correctly.
For databases, check that you can connect and see expected data:
```bash
kubectl exec -n postgres deploy/postgres-deployment -- \
psql -U postgres -d discourse -c "SELECT count(*) FROM posts;"
```
For file storage, check that files exist and applications can start:
```bash
kubectl get pods -n discourse
kubectl logs -n discourse deployment/discourse
```
For web applications, test that you can access them:
```bash
curl -f https://discourse.example.com/latest.json
```
## When Things Go Wrong
### No Snapshots Found
If the restore system can't find backups for an application, check that snapshots exist:
```bash
restic snapshots --tag discourse
```
Make sure you're using the correct app name and that backups were actually created successfully.
### Database Restore Failures
Database restores can fail if the target database isn't accessible or if there are permission issues. Check that your postgres or mysql pods are running and that you can connect to them manually.
Review the restore error messages carefully - they usually indicate whether the problem is with the backup file, database connectivity, or permissions.
### PVC Restore Failures
If PVC restoration fails, check that you have sufficient disk space and that the PVC isn't being used by other pods. The error messages will usually indicate what went wrong.
Most importantly, remember that safety snapshots are preserved when PVC restores fail. You can see them with:
```bash
kubectl get snapshot.longhorn.io -n longhorn-system -l app=wild-app-restore
```
These snapshots let you recover to the pre-restore state if needed.
### Application Won't Start After Restore
If pods fail to start after restoration, check file permissions and ownership. Sometimes the restoration process doesn't perfectly preserve the exact permissions that the application expects.
You can also try scaling the application to zero and back to one, which sometimes resolves transient issues:
```bash
kubectl scale deployment/discourse -n discourse --replicas=0
kubectl scale deployment/discourse -n discourse --replicas=1
```
## Manual Recovery
When automated restore fails, you can always fall back to manual extraction and restoration:
```bash
# Extract backup files to local directory
restic restore --tag discourse latest --target ./manual-restore/
# Manually copy database dump to postgres pod
kubectl cp ./manual-restore/discourse/database_*.dump \
postgres/postgres-deployment-xxx:/tmp/
# Manually restore database
kubectl exec -n postgres deploy/postgres-deployment -- \
pg_restore -U postgres -d discourse /tmp/database_*.dump
```
For file restoration, you'd need to create a utility pod and manually copy files into the PVC.
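A hypothetical sketch of that last step, assuming a PVC named `discourse-data` in the `discourse` namespace (check `kubectl get pvc -n <namespace>` for the real name, and scale the application to zero first, as the automated restore does):
```bash
# Temporary utility pod with the PVC mounted at /data
kubectl apply -n discourse -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restore-util
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: discourse-data
EOF

# Copy the extracted files into the PVC (destination inside the PVC depends on the app's layout)
kubectl cp ./manual-restore/discourse/discourse/data discourse/restore-util:/data

# Clean up the utility pod when finished
kubectl delete pod restore-util -n discourse
```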
## Best Practices
Test your restore procedures regularly in a non-production environment. It's much better to discover issues with your backup system during a planned test than during an actual emergency.
Always communicate with users before performing restores, especially if they involve downtime. Document any manual steps you had to take so you can improve the automated process.
After any significant restore, monitor your applications more closely than usual for a few days. Sometimes problems don't surface immediately.
## Security and Access Control
Restore operations are powerful and can be destructive. Make sure only trusted administrators can perform restores, and consider requiring approval or coordination before major restoration operations.
Be aware that cluster restores include all secrets, so they potentially expose passwords, API keys, and certificates. Ensure your backup repository is properly secured.
Remember that Longhorn safety snapshots are preserved when things go wrong. These snapshots may contain sensitive data, so clean them up appropriately once you've resolved any issues.
## What's Next
The best way to get comfortable with restore operations is to practice them in a safe environment. Set up a test cluster and practice restoring applications and data.
Consider creating runbooks for your most likely disaster scenarios, including the specific commands and verification steps for your infrastructure.
Read the [Making Backups](making-backups.md) guide to ensure you're creating the backups you'll need for successful recovery.


@@ -0,0 +1,19 @@
# Troubleshoot Wild Cloud Cluster issues
## General Troubleshooting Steps
1. **Check Node Status**:
```bash
kubectl get nodes
kubectl describe node <node-name>
```
1. **Check Component Status**:
```bash
# Check all pods across all namespaces
kubectl get pods -A
# Look for pods that aren't Running or Ready
kubectl get pods -A | grep -v "Running\|Completed"
```


@@ -0,0 +1,20 @@
# Troubleshoot DNS
If DNS resolution isn't working properly:
1. Check CoreDNS status:
```bash
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -l k8s-app=kube-dns -n kube-system
```
2. Verify CoreDNS configuration:
```bash
kubectl get configmap -n kube-system coredns -o yaml
```
3. Test DNS resolution from inside the cluster:
```bash
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
```


@@ -0,0 +1,18 @@
# Troubleshoot Service Connectivity
If services can't communicate:
1. Check network policies:
```bash
kubectl get networkpolicies -A
```
2. Verify service endpoints:
```bash
kubectl get endpoints -n <namespace>
```
3. Test connectivity from within the cluster:
```bash
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- wget -O- <service-name>.<namespace>
```


@@ -0,0 +1,24 @@
# Troubleshoot TLS Certificates
If services show invalid certificates:
1. Check certificate status:
```bash
kubectl get certificates -A
```
2. Examine certificate details:
```bash
kubectl describe certificate <cert-name> -n <namespace>
```
3. Check for cert-manager issues:
```bash
kubectl get pods -n cert-manager
kubectl logs -l app=cert-manager -n cert-manager
```
4. Verify the Cloudflare API token is correctly set up:
```bash
kubectl get secret cloudflare-api-token -n internal
```


@@ -0,0 +1,246 @@
# Troubleshoot Service Visibility
This guide covers common issues with accessing services from outside the cluster and how to diagnose and fix them.
## Common Issues
External access to your services might fail for several reasons:
1. **DNS Resolution Issues** - Domain names not resolving to the correct IP address
2. **Network Connectivity Issues** - Traffic can't reach the cluster's external IP
3. **TLS Certificate Issues** - Invalid or missing certificates
4. **Ingress/Service Configuration Issues** - Incorrectly configured routing
## Diagnostic Steps
### 1. Check DNS Resolution
**Symptoms:**
- Browser shows "site cannot be reached" or "server IP address could not be found"
- `ping` or `nslookup` commands fail for your domain
- Your service DNS records don't appear in CloudFlare or your DNS provider
**Checks:**
```bash
# Check if your domain resolves (from outside the cluster)
nslookup yourservice.yourdomain.com
# Check if ExternalDNS is running
kubectl get pods -n externaldns
# Check ExternalDNS logs for errors
kubectl logs -n externaldns -l app=external-dns | grep -i error
kubectl logs -n externaldns -l app=external-dns | grep -i "your-service-name"
# Check if CloudFlare API token is configured correctly
kubectl get secret cloudflare-api-token -n externaldns
```
**Common Issues:**
a) **ExternalDNS Not Running**: The ExternalDNS pod is not running or has errors.
b) **Cloudflare API Token Issues**: The API token is invalid, expired, or doesn't have the right permissions.
c) **Domain Filter Mismatch**: ExternalDNS is configured with a `--domain-filter` that doesn't match your domain.
d) **Annotations Missing**: Service or Ingress is missing the required ExternalDNS annotations.
**Solutions:**
```bash
# 1. Recreate CloudFlare API token secret
kubectl create secret generic cloudflare-api-token \
--namespace externaldns \
--from-literal=api-token="your-api-token" \
--dry-run=client -o yaml | kubectl apply -f -
# 2. Check and set proper annotations on your Ingress:
kubectl annotate ingress your-ingress -n your-namespace \
external-dns.alpha.kubernetes.io/hostname=your-service.your-domain.com
# 3. Restart ExternalDNS
kubectl rollout restart deployment -n externaldns external-dns
```
### 2. Check Network Connectivity
**Symptoms:**
- DNS resolves to the correct IP but the service is still unreachable
- Only some services are unreachable while others work
- Network timeout errors
**Checks:**
```bash
# Check if MetalLB is running
kubectl get pods -n metallb-system
# Check MetalLB IP address pool
kubectl get ipaddresspools.metallb.io -n metallb-system
# Verify the service has an external IP
kubectl get svc -n your-namespace your-service
```
**Common Issues:**
a) **MetalLB Configuration**: The IP pool doesn't match your network or is exhausted.
b) **Firewall Issues**: Firewall is blocking traffic to your cluster's external IP.
c) **Router Configuration**: NAT or port forwarding issues if using a router.
**Solutions:**
```bash
# 1. Check and update MetalLB configuration
kubectl apply -f infrastructure_setup/metallb/metallb-pool.yaml
# 2. Check service external IP assignment
kubectl describe svc -n your-namespace your-service
```
### 3. Check TLS Certificates
**Symptoms:**
- Browser shows certificate errors
- "Your connection is not private" warnings
- Cert-manager logs show errors
**Checks:**
```bash
# Check certificate status
kubectl get certificates -A
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager
# Check if your ingress is using the correct certificate
kubectl get ingress -n your-namespace your-ingress -o yaml
```
**Common Issues:**
a) **Certificate Issuance Failures**: DNS validation or HTTP validation failing.
b) **Wrong Secret Referenced**: Ingress is referencing a non-existent certificate secret.
c) **Expired Certificate**: Certificate has expired and wasn't renewed.
**Solutions:**
```bash
# 1. Check and recreate certificates
kubectl apply -f infrastructure_setup/cert-manager/wildcard-certificate.yaml
# 2. Update ingress to use correct secret
kubectl patch ingress your-ingress -n your-namespace --type=json \
-p='[{"op": "replace", "path": "/spec/tls/0/secretName", "value": "correct-secret-name"}]'
```
### 4. Check Ingress Configuration
**Symptoms:**
- HTTP 404, 503, or other error codes
- Service accessible from inside cluster but not outside
- Traffic routed to wrong service
**Checks:**
```bash
# Check ingress status
kubectl get ingress -n your-namespace
# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
# Check ingress configuration
kubectl describe ingress -n your-namespace your-ingress
```
**Common Issues:**
a) **Incorrect Service Targeting**: Ingress is pointing to wrong service or port.
b) **Traefik Configuration**: IngressClass or middleware issues.
c) **Path Configuration**: Incorrect path prefixes or regex.
**Solutions:**
```bash
# 1. Verify ingress configuration
kubectl edit ingress -n your-namespace your-ingress
# 2. Check that the referenced service exists
kubectl get svc -n your-namespace
# 3. Restart Traefik if needed
kubectl rollout restart deployment -n kube-system traefik
```
## Advanced Diagnostics
For more complex issues, you can use port-forwarding to test services directly:
```bash
# Port-forward the service directly
kubectl port-forward -n your-namespace svc/your-service 8080:80
# Then test locally
curl http://localhost:8080
```
You can also deploy a debug pod to test connectivity from inside the cluster:
```bash
# Start a debug pod
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
# Inside the pod, test DNS and connectivity
nslookup your-service.your-namespace.svc.cluster.local
wget -O- http://your-service.your-namespace.svc.cluster.local
```
## ExternalDNS Specifics
ExternalDNS can be particularly troublesome. Here are specific debugging steps:
1. **Check Log Level**: Set `--log-level=debug` for more detailed logs
2. **Check Domain Filter**: Ensure `--domain-filter` includes your domain
3. **Check Provider**: Ensure `--provider=cloudflare` (or your DNS provider)
4. **Verify API Permissions**: CloudFlare token needs Zone.Zone and Zone.DNS permissions
5. **Check TXT Records**: ExternalDNS uses TXT records for ownership tracking
```bash
# Enable verbose logging by appending --log-level=debug to the container args
kubectl patch deployment external-dns -n externaldns --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--log-level=debug"}]'
# Check for specific domain errors
kubectl logs -n externaldns -l app=external-dns | grep -i yourservice.yourdomain.com
```
## CloudFlare Specific Issues
When using CloudFlare, additional issues may arise:
1. **API Rate Limiting**: CloudFlare may rate limit frequent API calls
2. **DNS Propagation**: Changes may take time to propagate through CloudFlare's CDN
3. **Proxied Records**: The `external-dns.alpha.kubernetes.io/cloudflare-proxied` annotation controls whether CloudFlare proxies traffic
4. **Access Restrictions**: CloudFlare Access or Page Rules may restrict access
5. **API Token Permissions**: The token must have Zone:Zone:Read and Zone:DNS:Edit permissions
6. **Zone Detection**: If using subdomains, ensure the parent domain is included in the domain filter
Check CloudFlare dashboard for:
- DNS record existence
- API access logs
- DNS settings including proxy status
- Any error messages or rate limit warnings


@@ -0,0 +1,3 @@
# Upgrade Applications
TBD


@@ -0,0 +1,3 @@
# Upgrade Kubernetes
TBD


@@ -0,0 +1,3 @@
# Upgrade Talos
TBD


@@ -0,0 +1,3 @@
# Upgrade Wild Cloud
TBD

docs/manual-setup.md Normal file

@@ -0,0 +1,17 @@
# Manual Setup of Wild Cloud Central
If you want to set up from source (not using the debian package).
```bash
# Prerequisites
sudo apt update
sudo apt install dnsmasq
# Disable systemd-resolved
sudo systemctl disable systemd-resolved
sudo systemctl stop systemd-resolved
# Enable dnsmasq
sudo systemctl enable dnsmasq
sudo systemctl start dnsmasq
```
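From here, building and installing the daemon from source presumably follows the normal developer workflow (see the Developer Guide); a minimal sketch, assuming Go and make are installed and the repository is checked out:
```bash
make build     # compile the binary into build/
make install   # the Makefile's install target copies it to /usr/bin/ (uses sudo)
```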