From 1ce4e6d3bd2e7f1907bc590fdb7c27a06b22e9bf Mon Sep 17 00:00:00 2001 From: Paul Payne Date: Tue, 10 Feb 2026 07:29:22 +0000 Subject: [PATCH] chore: Remove deprecated documentation and update .gitignore for development artifacts --- .gitignore | 4 + CLAUDE.md | 197 +- ai/talos-v1.11/README.md | 135 -- ai/talos-v1.11/architecture-and-components.md | 248 --- ai/talos-v1.11/bare-metal-administration.md | 506 ----- ai/talos-v1.11/cli-essentials.md | 382 ---- ai/talos-v1.11/cluster-operations.md | 239 --- ai/talos-v1.11/discovery-and-networking.md | 344 --- ai/talos-v1.11/etcd-management.md | 287 --- ai/talos-v1.11/troubleshooting-guide.md | 480 ----- ai/wildcloud-v.PoC/README.md | 188 -- ai/wildcloud-v.PoC/apps-system.md | 595 ------ ai/wildcloud-v.PoC/bin-scripts.md | 262 --- ai/wildcloud-v.PoC/configuration-system.md | 602 ------ ai/wildcloud-v.PoC/overview.md | 443 ---- ai/wildcloud-v.PoC/project-architecture.md | 446 ---- ai/wildcloud-v.PoC/setup-process.md | 390 ---- docs/app-states.md | 1902 ----------------- wild-cloud | 2 +- 19 files changed, 8 insertions(+), 7644 deletions(-) delete mode 100644 ai/talos-v1.11/README.md delete mode 100644 ai/talos-v1.11/architecture-and-components.md delete mode 100644 ai/talos-v1.11/bare-metal-administration.md delete mode 100644 ai/talos-v1.11/cli-essentials.md delete mode 100644 ai/talos-v1.11/cluster-operations.md delete mode 100644 ai/talos-v1.11/discovery-and-networking.md delete mode 100644 ai/talos-v1.11/etcd-management.md delete mode 100644 ai/talos-v1.11/troubleshooting-guide.md delete mode 100644 ai/wildcloud-v.PoC/README.md delete mode 100644 ai/wildcloud-v.PoC/apps-system.md delete mode 100644 ai/wildcloud-v.PoC/bin-scripts.md delete mode 100644 ai/wildcloud-v.PoC/configuration-system.md delete mode 100644 ai/wildcloud-v.PoC/overview.md delete mode 100644 ai/wildcloud-v.PoC/project-architecture.md delete mode 100644 ai/wildcloud-v.PoC/setup-process.md delete mode 100644 docs/app-states.md diff --git a/.gitignore b/.gitignore index 1acb768..94d3e8c 100644 --- a/.gitignore +++ b/.gitignore @@ -26,6 +26,10 @@ wild-cloud-redmond-data # Development working dir .working/ +# Wild dev tool artifacts +.wild-pids/ +.wild-logs/ + __debug* compact__ .lock diff --git a/CLAUDE.md b/CLAUDE.md index 2b1d58c..fd17527 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,203 +1,12 @@ # CLAUDE.md -## Project Overview +This is a developer directory for Wild Cloud. Wild Cloud is a platform for managing and orchestrating cloud-native applications on local networks using a network appliance called "Wild Central". -This project is called "Wild Cloud". Wild Cloud is a platform for managing and orchestrating cloud-native applications on local networks using a network appliance called "Wild Central". +Wild Central is a lightweight server that runs on a local machine (e.g., a Raspberry Pi) and provides an API for users to manage their Wild Cloud instances. The core Wild Cloud components are in the `wild-cloud` mono-repo: @wild-cloud/CLAUDE.md -Wild Central is a lightweight server that runs on a local machine (e.g., a Raspberry Pi) and provides an API for users to manage their Wild Cloud instances. The core Wild Cloud components are in the `wild-cloud` mono-repo: - -- **API** (`wild-cloud/api`): The Wild Cloud API that runs on Wild Central. @wild-cloud/api/CLAUDE.md -- **CLI** (`wild-cloud/cli`): Command-line interface for managing Wild Cloud instances. @wild-cloud/cli/README.md -- **Web App** (`wild-cloud/web`): Web-based interface for managing Wild Cloud instances. 
@wild-cloud/web/CLAUDE.md
-- **Dist** (`wild-cloud/dist`): Distribution package for setting up Wild Central devices via apt. @wild-cloud/dist/README.md
-
-All data and config for an instance of Wild Central is defined in Central's environment in the WILD_API_DATA_DIR variable. For ease of access, a symlink to our Wild API data dir is made at ./wild-cloud-redmond-data.
-
-A Wild Cloud instance is a kubernetes (k8s) environment that runs Wild Cloud services and applications. Wild Cloud instances can be created, managed, and monitored using the Wild Cloud API running on a Wild Central device. Wild Cloud applications are custom packages designed to be deployed to Wild Cloud instances. They consist of kustomize templates and a Wild Cloud app manifest file that describes the application and how it should be configured and deployed in a Wild Cloud instance. Wild Cloud applications are stored in a "Wild Directory". The directory contained in the wild-directory folder is the official Wild Directory. @wild-directory/CLAUDE.md
-The Wild Cloud API maintains data for each Wild Cloud instance in its configured WILD_API_DATA_DIR. A data directory is intended to be checked into version control (e.g., git) to track changes to the configuration of Wild Cloud instances and their deployed applications over time. These are designed to follow infrastructure-as-code principles, allowing experienced devops users to manage their Wild Cloud instances using familiar tools and workflows.
+The Wild Cloud API maintains data for each Wild Cloud instance in its configured WILD_API_DATA_DIR. A data directory is intended to be checked into version control (e.g., git) to track changes to the configuration of Wild Cloud instances and their deployed applications over time. These directories are designed to follow infrastructure-as-code and GitOps principles, allowing experienced DevOps users to manage their Wild Cloud instances using familiar tools and workflows. For ease of access, a symlink to our Wild API data dir is made at ./wild-cloud-redmond-data.

`payne-cloud` is a PRODUCTION instance. Do not make changes to it without considering the impact and receiving permission. For development, use `test-cloud`.

We have a public website for Wild Cloud (https://mywildcloud.org) with source in `wild-cloud/docs`.
-
-## Additional Documentation
-
-### Info about Talos
-
-- @ai/talos-v1.11/README.md
-- @ai/talos-v1.11/architecture-and-components.md
-- @ai/talos-v1.11/cli-essentials.md
-- @ai/talos-v1.11/cluster-operations.md
-- @ai/talos-v1.11/discovery-and-networking.md
-- @ai/talos-v1.11/etcd-management.md
-- @ai/talos-v1.11/bare-metal-administration.md
-- @ai/talos-v1.11/troubleshooting-guide.md
-
-## Implementation Philosophy
-
-## Core Philosophy
-
-Embodies a Zen-like minimalism that values simplicity and clarity above all. This approach reflects:
-
-- **Wabi-sabi philosophy**: Embracing simplicity and the essential. Each line serves a clear purpose without unnecessary embellishment.
-- **KISS**: The solution should be as simple as possible, but no simpler.
-- **YAGNI**: Avoid building features or abstractions that aren't immediately needed. The code handles what's needed now rather than anticipating every possible future scenario.
-- **Trust in emergence**: Complex systems work best when built from simple, well-defined components that do one thing well.
-- **Pragmatic trust**: The developer trusts external systems enough to interact with them directly, handling failures as they occur rather than assuming they'll happen.
-- **Consistency is key**: Uniform patterns and conventions make the codebase easier to understand and maintain. If you introduce a new pattern, make sure it's consistently applied. There should be one obvious way to do things. - -This development philosophy values clear, concise documentation, readable code, and belief that good architecture emerges from simplicity rather than being imposed through complexity. - -## Core Design Principles - -### 1. Ruthless Simplicity - -- **KISS principle taken to heart**: Keep everything as simple as possible, but no simpler -- **Minimize abstractions**: Every layer of abstraction must justify its existence -- **Start minimal, grow as needed**: Begin with the simplest implementation that meets current needs -- **Avoid future-proofing**: Don't build for hypothetical future requirements -- **Question everything**: Regularly challenge complexity in the codebase - -### 2. Architectural Integrity with Minimal Implementation - -- **Preserve key architectural patterns**: Maintain clear boundaries and responsibilities -- **Simplify implementations**: Maintain pattern benefits with dramatically simpler code -- **Scrappy but structured**: Lightweight implementations of solid architectural foundations -- **End-to-end thinking**: Focus on complete flows rather than perfect components - -### 3. Library vs Custom Code - -Choosing between custom code and external libraries is a judgment call that evolves with your requirements. There's no rigid rule - it's about understanding trade-offs and being willing to revisit decisions as needs change. - -#### The Evolution Pattern - -Your approach might naturally evolve: -- **Start simple**: Custom code for basic needs (20 lines handles it) -- **Growing complexity**: Switch to a library when requirements expand -- **Hitting limits**: Back to custom when you outgrow the library's capabilities - -This isn't failure - it's natural evolution. Each stage was the right choice at that time. - -#### When Custom Code Makes Sense - -Custom code often wins when: -- The need is simple and well-understood -- You want code perfectly tuned to your exact requirements -- Libraries would require significant "hacking" or workarounds -- The problem is unique to your domain -- You need full control over the implementation - -#### When Libraries Make Sense - -Libraries shine when: -- They solve complex problems you'd rather not tackle (auth, crypto, video encoding) -- They align well with your needs without major modifications -- The problem is well-solved with mature, battle-tested solutions -- Configuration alone can adapt them to your requirements -- The complexity they handle far exceeds the integration cost - -#### Making the Judgment Call - -Ask yourself: -- How well does this library align with our actual needs? -- Are we fighting the library or working with it? -- Is the integration clean or does it require workarounds? -- Will our future requirements likely stay within this library's capabilities? -- Is the problem complex enough to justify the dependency? - -#### Recognizing Misalignment - -Watch for signs you're fighting your current approach: -- Spending more time working around the library than using it -- Your simple custom solution has grown complex and fragile -- You're monkey-patching or heavily wrapping a library -- The library's assumptions fundamentally conflict with your needs - -#### Stay Flexible - -Remember that complexity isn't destroyed, only moved. 
Libraries shift complexity from your code to someone else's - that's often a great trade, but recognize what you're doing. - -The key is avoiding lock-in. Keep library integration points minimal and isolated so you can switch approaches when needed. There's no shame in moving from custom to library or library to custom. Requirements change, understanding deepens, and the right answer today might not be the right answer tomorrow. Make the best decision with current information, and be ready to evolve. - -## Technical Implementation Guidelines - -### API Layer - -- Implement only essential endpoints -- Minimal middleware with focused validation -- Clear error responses with useful messages -- Consistent patterns across endpoints - -### Storage - -- Prefer simple file storage -- Simple schema focused on current needs - -## Development Approach - -### Vertical Slices - -- Implement complete end-to-end functionality slices -- Start with core user journeys -- Get data flowing through all layers early -- Add features horizontally only after core flows work - -### Iterative Implementation - -- 80/20 principle: Focus on high-value, low-effort features first -- One working feature > multiple partial features -- Validate with real usage before enhancing -- Be willing to refactor early work as patterns emerge - -### Testing Strategy - -- Focus on critical path testing initially -- Add unit tests for complex logic and edge cases -- Testing pyramid: 60% unit, 30% integration, 10% end-to-end - -### Error Handling - -- Handle common errors robustly -- Log detailed information for debugging -- Provide clear error messages to users -- Fail fast and visibly during development - -## Decision-Making Framework - -When faced with implementation decisions, ask these questions: - -1. **Necessity**: "Do we actually need this right now?" -2. **Simplicity**: "What's the simplest way to solve this problem?" -3. **Directness**: "Can we solve this more directly?" -4. **Value**: "Does the complexity add proportional value?" -5. **Maintenance**: "How easy will this be to understand and change later?" - -## Areas to Embrace Complexity - -Some areas justify additional complexity: - -1. **Security**: Never compromise on security fundamentals -2. **Data integrity**: Ensure data consistency and reliability -3. **Core user experience**: Make the primary user flows smooth and reliable -4. **Error visibility**: Make problems obvious and diagnosable - -## Areas to Aggressively Simplify - -Push for extreme simplicity in these areas: - -1. **Internal abstractions**: Minimize layers between components -2. **Generic "future-proof" code**: Resist solving non-existent problems -3. **Edge case handling**: Handle the common cases well first -4. **Framework usage**: Use only what you need from frameworks -5. **State management**: Keep state simple and explicit - -## Remember - -- It's easier to add complexity later than to remove it -- Code you don't write has no bugs -- Favor clarity over cleverness -- The best code is often the simplest - -This philosophy document serves as the foundational guide for all implementation decisions in the project. 
-
diff --git a/ai/talos-v1.11/README.md b/ai/talos-v1.11/README.md
deleted file mode 100644
index ffc1e92..0000000
--- a/ai/talos-v1.11/README.md
+++ /dev/null
@@ -1,135 +0,0 @@
-# Talos v1.11 Agent Context Documentation
-
-This directory contains comprehensive documentation extracted from the official Talos v1.11 documentation, organized specifically to help AI agents become expert Talos cluster administrators.
-
-## Documentation Structure
-
-### Core Operations
-- **[cluster-operations.md](cluster-operations.md)** - Essential cluster operations including upgrades, node management, and configuration
-- **[cli-essentials.md](cli-essentials.md)** - Key talosctl commands and usage patterns for daily administration
-
-### System Understanding
-- **[architecture-and-components.md](architecture-and-components.md)** - Deep dive into Talos architecture, components, and design principles
-- **[discovery-and-networking.md](discovery-and-networking.md)** - Cluster discovery mechanisms and network configuration
-
-### Specialized Operations
-- **[etcd-management.md](etcd-management.md)** - etcd operations, maintenance, backup, and disaster recovery
-- **[bare-metal-administration.md](bare-metal-administration.md)** - Bare metal specific configurations, security, and hardware management
-- **[troubleshooting-guide.md](troubleshooting-guide.md)** - Systematic approaches to diagnosing and resolving common issues
-
-## Quick Reference
-
-### Essential Commands for New Agents
-```bash
-# Cluster health check
-talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
-
-# Node information
-talosctl get members
-talosctl -n <node> version
-
-# Service status
-talosctl -n <node> services
-talosctl -n <node> service kubelet
-
-# System resources
-talosctl -n <node> memory
-talosctl -n <node> disks
-
-# Logs and events
-talosctl -n <node> dmesg | tail -50
-talosctl -n <node> logs kubelet
-talosctl -n <node> events --since=1h
-```
-
-### Critical Procedures
-- **Bootstrap**: `talosctl bootstrap --nodes <IP>`
-- **Backup etcd**: `talosctl -n <IP> etcd snapshot db.snapshot`
-- **Upgrade OS**: `talosctl upgrade --nodes <node> --image ghcr.io/siderolabs/installer:v1.11.x`
-- **Upgrade K8s**: `talosctl --nodes <node> upgrade-k8s --to v1.34.1`
-
-### Emergency Commands
-- **Node reset**: `talosctl -n <node> reset`
-- **Force reset**: `talosctl -n <node> reset --graceful=false --reboot`
-- **Disaster recovery**: `talosctl -n <IP> bootstrap --recover-from=./db.snapshot`
-- **Rollback**: `talosctl rollback --nodes <node>`
-
-### Bare Metal Specific Commands
-- **Check hardware**: `talosctl -n <node> disks`, `talosctl -n <node> read /proc/cpuinfo`
-- **Network interfaces**: `talosctl -n <node> get addresses`, `talosctl -n <node> get routes`
-- **Extensions**: `talosctl -n <node> get extensions`
-- **Encryption status**: `talosctl -n <node> get encryptionconfig -o yaml`
-- **Hardware monitoring**: `talosctl -n <node> dmesg | grep -i error`
-
-## Key Concepts for Agents
-
-### Architecture Fundamentals
-- **Immutable OS**: Single image, atomic updates, A-B rollback system
-- **API-driven**: All management through gRPC API, no SSH/shell access
-- **Controller pattern**: Kubernetes-style resource controllers for system management
-- **Minimal attack surface**: Only services necessary for Kubernetes
-
-### Control Plane Design
-- **etcd quorum**: Requires majority for operations (3-node=2, 5-node=3)
-- **Bootstrap process**: One-time initialization of etcd cluster
-- **HA considerations**: Odd numbers of nodes, avoid even numbers
-- **Upgrade strategy**: Rolling upgrades with automatic rollback on failure
-
-### Network and Discovery
-- **Service discovery**: Encrypted discovery
service for cluster membership -- **KubeSpan**: Optional WireGuard mesh networking -- **mTLS everywhere**: All Talos API communication secured -- **Discovery registries**: Service (default) and Kubernetes (deprecated) - -### Bare Metal Considerations -- **META configuration**: Network config embedded in disk images -- **Hardware compatibility**: Driver support and firmware requirements -- **Disk encryption**: LUKS2 with TPM, static keys, or node ID -- **SecureBoot**: UKI images with embedded signatures -- **System extensions**: Hardware-specific drivers and tools -- **Performance tuning**: CPU governors, IOMMU, memory management - -## Common Administration Patterns - -### Daily Operations -1. Check cluster health across all nodes -2. Monitor resource usage and capacity -3. Review system events and logs -4. Verify etcd health and backup status -5. Monitor discovery service connectivity - -### Maintenance Windows -1. Plan upgrade sequence (workers first, then control plane) -2. Create etcd backup before major changes -3. Apply configuration changes with dry-run first -4. Monitor upgrade progress and be ready to rollback -5. Verify cluster functionality after changes - -### Troubleshooting Workflow -1. **Gather information**: Health, version, resources, logs -2. **Check connectivity**: Network, discovery, API endpoints -3. **Examine services**: Status of critical services -4. **Review logs**: System events, service logs, kernel messages -5. **Apply fixes**: Configuration patches, service restarts, node resets - -## Best Practices for Agents - -### Configuration Management -- Use reproducible configuration workflow (secrets + patches) -- Always dry-run configuration changes first -- Store machine configurations in version control -- Test configuration changes in non-production first - -### Operational Safety -- Take etcd snapshots before major changes -- Upgrade one node at a time -- Monitor upgrade progress and have rollback ready -- Test disaster recovery procedures regularly - -### Performance Optimization -- Monitor etcd fragmentation and defragment when needed -- Scale vertically before horizontally for control plane -- Use appropriate hardware for etcd (fast storage, low network latency) -- Monitor resource usage trends and capacity planning - -This documentation provides the essential knowledge needed to effectively administer Talos Linux clusters, organized by operational context and complexity level. \ No newline at end of file diff --git a/ai/talos-v1.11/architecture-and-components.md b/ai/talos-v1.11/architecture-and-components.md deleted file mode 100644 index 459e030..0000000 --- a/ai/talos-v1.11/architecture-and-components.md +++ /dev/null @@ -1,248 +0,0 @@ -# Talos Architecture and Components Guide - -This guide provides deep understanding of Talos Linux architecture and system components for effective cluster administration. - -## Core Architecture Principles - -Talos is designed to be: -- **Atomic**: Distributed as a single, versioned, signed, immutable image -- **Modular**: Composed of separate components with defined gRPC interfaces -- **Minimal**: Focused init system that runs only services necessary for Kubernetes - -## File System Architecture - -### Partition Layout -- **EFI**: Stores EFI boot data -- **BIOS**: Used for GRUB's second stage boot -- **BOOT**: Contains boot loader, initramfs, and kernel data -- **META**: Stores node metadata (node IDs, etc.) 
-- **STATE**: Stores machine configuration, node identity, cluster discovery, KubeSpan data -- **EPHEMERAL**: Stores ephemeral state, mounted at `/var` - -### Root File System Structure -Three-layer design: -1. **Base Layer**: Read-only squashfs mounted as loop device (immutable base) -2. **Runtime Layer**: tmpfs filesystems for runtime needs (`/dev`, `/proc`, `/run`, `/sys`, `/tmp`, `/system`) -3. **Overlay Layer**: overlayfs for persistent data backed by XFS at `/var` - -#### Special Directories -- `/system`: Internal files that need to be writable (recreated each boot) - - Example: `/system/etc/hosts` bind-mounted over `/etc/hosts` -- `/var`: Owned by Kubernetes, contains persistent data: - - etcd data (control plane nodes) - - kubelet data - - containerd data - - Survives reboots and upgrades, wiped on reset - -## Core Components - -### machined (PID 1) -**Role**: Talos replacement for traditional init process -**Functions**: -- Machine configuration management -- API handling -- Resource and controller management -- Service lifecycle management - -**Managed Services**: -- containerd -- etcd (control plane nodes) -- kubelet -- networkd -- trustd -- udevd - -**Architecture**: Uses controller-runtime pattern similar to Kubernetes controllers - -### apid (API Gateway) -**Role**: gRPC API endpoint for all Talos interactions -**Functions**: -- Routes requests to appropriate components -- Provides proxy capabilities for multi-node operations -- Handles authentication and authorization - -**Usage Patterns**: -```bash -# Direct node communication -talosctl -e - -# Proxy through endpoint to specific nodes -talosctl -e -n - -# Multi-node operations -talosctl -e -n ,, -``` - -### trustd (Trust Management) -**Role**: Establishes and maintains trust within the system -**Functions**: -- Root of Trust implementation -- PKI data distribution for control plane bootstrap -- Certificate management -- Secure file placement operations - -### containerd (Container Runtime) -**Role**: Industry-standard container runtime -**Namespaces**: -- `system`: Talos services -- `k8s.io`: Kubernetes services - -### udevd (Device Management) -**Role**: Device file manager (eudev implementation) -**Functions**: -- Kernel device notification handling -- Device node management in `/dev` -- Hardware discovery and setup - -## Control Plane Architecture - -### etcd Cluster Design -**Critical Concepts**: -- **Quorum**: Majority of members must agree on leader -- **Membership**: Formal etcd cluster membership required -- **Consensus**: Uses Raft protocol for distributed consensus - -**Quorum Requirements**: -- 3 nodes: Requires 2 for quorum (tolerates 1 failure) -- 5 nodes: Requires 3 for quorum (tolerates 2 failures) -- Even numbers are worse than odd (4 nodes still only tolerates 1 failure) - -### Control Plane Components -**Running as Static Pods on Control Plane Nodes**: - -#### kube-apiserver -- Kubernetes API endpoint -- Connects to local etcd instance -- Handles all API operations - -#### kube-controller-manager -- Runs control loops -- Manages cluster state reconciliation -- Handles node lifecycle, replication, etc. - -#### kube-scheduler -- Pod placement decisions -- Resource-aware scheduling -- Constraint satisfaction - -### Bootstrap Process -1. **etcd Bootstrap**: One node chosen as bootstrap node, initializes etcd cluster -2. **Static Pods**: Control plane components start as static pods via kubelet -3. **API Availability**: Control plane endpoint becomes available -4. 
**Manifest Injection**: Bootstrap manifests (join tokens, RBAC, etc.) injected -5. **Cluster Formation**: Other control plane nodes join etcd cluster -6. **HA Control Plane**: All control plane nodes run full component set - -## Resource System Architecture - -### Controller-Runtime Pattern -Talos uses Kubernetes-style controller pattern: -- **Resources**: Typed configuration and state objects -- **Controllers**: Reconcile desired vs actual state -- **Events**: Reactive architecture for state changes - -### Resource Namespaces -- `config`: Machine configuration resources -- `cluster`: Cluster membership and discovery -- `controlplane`: Control plane component configurations -- `secrets`: Certificate and key management -- `network`: Network configuration and state - -### Key Resources -```bash -# Machine configuration -talosctl get machineconfig -talosctl get machinetype - -# Cluster membership -talosctl get members -talosctl get affiliates -talosctl get identities - -# Control plane -talosctl get apiserverconfig -talosctl get controllermanagerconfig -talosctl get schedulerconfig - -# Network -talosctl get addresses -talosctl get routes -talosctl get nodeaddresses -``` - -## Network Architecture - -### Network Stack -- **CNI**: Container Network Interface for pod networking -- **Host Networking**: Node-to-node communication -- **Service Discovery**: Built-in cluster member discovery -- **KubeSpan**: Optional WireGuard mesh networking - -### Discovery Service Integration -- **Service Registry**: External discovery service (default: discovery.talos.dev) -- **Kubernetes Registry**: Deprecated, uses Kubernetes Node resources -- **Encrypted Communication**: All discovery data encrypted before transmission - -## Security Architecture - -### Immutable Base -- Read-only root filesystem -- Signed and verified boot process -- Atomic updates with rollback capability - -### Process Isolation -- Minimal attack surface -- No shell access -- No arbitrary user services -- Container-based workload isolation - -### Network Security -- Mutual TLS (mTLS) for all API communication -- Certificate-based node authentication -- Optional WireGuard mesh networking (KubeSpan) -- Encrypted service discovery - -### Kernel Hardening -Configured according to Kernel Self Protection Project (KSPP) recommendations: -- Stack protection -- Control flow integrity -- Memory protection features -- Attack surface reduction - -## Extension Points - -### Machine Configuration -- Declarative configuration management -- Patch-based configuration updates -- Runtime configuration validation - -### System Extensions -- Kernel modules -- System services (limited) -- Network configuration -- Storage configuration - -### Kubernetes Integration -- Automatic kubelet configuration -- Bootstrap manifest management -- Certificate lifecycle management -- Node lifecycle automation - -## Performance Characteristics - -### etcd Performance -- Performance decreases with cluster size -- Network latency affects consensus performance -- Storage I/O directly impacts etcd performance - -### Resource Requirements -- **Control Plane Nodes**: Higher memory for etcd, CPU for control plane -- **Worker Nodes**: Resources scale with workload requirements -- **Network**: Low latency crucial for etcd performance - -### Scaling Patterns -- **Horizontal Scaling**: Add worker nodes for capacity -- **Vertical Scaling**: Increase control plane node resources for performance -- **Control Plane Scaling**: Odd numbers (3, 5) for availability - -This architecture enables 
Talos to provide a secure, minimal, and operationally simple platform for running Kubernetes clusters while maintaining the reliability and performance characteristics needed for production workloads. \ No newline at end of file diff --git a/ai/talos-v1.11/bare-metal-administration.md b/ai/talos-v1.11/bare-metal-administration.md deleted file mode 100644 index 005ecfd..0000000 --- a/ai/talos-v1.11/bare-metal-administration.md +++ /dev/null @@ -1,506 +0,0 @@ -# Bare Metal Talos Administration Guide - -This guide covers bare metal specific operations, configurations, and best practices for Talos Linux clusters. - -## META-Based Network Configuration - -Talos supports META-based network configuration for bare metal deployments where configuration is embedded in the disk image. - -### Basic META Configuration -```yaml -# META configuration for bare metal networking -machine: - network: - interfaces: - - interface: eth0 - addresses: - - 192.168.1.100/24 - routes: - - network: 0.0.0.0/0 - gateway: 192.168.1.1 - mtu: 1500 - nameservers: - - 8.8.8.8 - - 1.1.1.1 -``` - -### Advanced Network Configurations - -#### VLAN Configuration -```yaml -machine: - network: - interfaces: - - interface: eth0.100 # VLAN 100 - vlan: - parentDevice: eth0 - vid: 100 - addresses: - - 192.168.100.10/24 - routes: - - network: 192.168.100.0/24 -``` - -#### Interface Bonding -```yaml -machine: - network: - interfaces: - - interface: bond0 - bond: - mode: 802.3ad - lacpRate: fast - xmitHashPolicy: layer3+4 - miimon: 100 - updelay: 200 - downdelay: 200 - interfaces: - - eth0 - - eth1 - addresses: - - 192.168.1.100/24 - routes: - - network: 0.0.0.0/0 - gateway: 192.168.1.1 -``` - -#### Bridge Configuration -```yaml -machine: - network: - interfaces: - - interface: br0 - bridge: - stp: - enabled: false - interfaces: - - eth0 - - eth1 - addresses: - - 192.168.1.100/24 - routes: - - network: 0.0.0.0/0 - gateway: 192.168.1.1 -``` - -### Network Troubleshooting Commands -```bash -# Check interface configuration -talosctl -n get addresses -talosctl -n get routes -talosctl -n get links - -# Check network configuration -talosctl -n get networkconfig -o yaml - -# Test network connectivity -talosctl -n list /sys/class/net -talosctl -n read /proc/net/dev -``` - -## Disk Encryption for Bare Metal - -### LUKS2 Encryption Configuration -```yaml -machine: - systemDiskEncryption: - state: - provider: luks2 - keys: - - slot: 0 - static: - passphrase: "your-secure-passphrase" - ephemeral: - provider: luks2 - keys: - - slot: 0 - nodeID: {} -``` - -### TPM-Based Encryption -```yaml -machine: - systemDiskEncryption: - state: - provider: luks2 - keys: - - slot: 0 - tpm: {} - ephemeral: - provider: luks2 - keys: - - slot: 0 - tpm: {} -``` - -### Key Management Operations -```bash -# Check encryption status -talosctl -n get encryptionconfig -o yaml - -# Rotate encryption keys -talosctl -n apply-config --file updated-config.yaml --mode staged -``` - -## SecureBoot Implementation - -### UKI (Unified Kernel Image) Setup -SecureBoot requires UKI format images with embedded signatures. 
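Before relying on SecureBoot guarantees, it is worth verifying that a node really booted in SecureBoot mode; a minimal check, assuming the `securitystate` resource exposed by recent Talos releases and a placeholder `<node>` IP:

```bash
# Confirm the node actually booted with SecureBoot enforced
talosctl -n <node> get securitystate -o yaml
# Expect `secureBoot: true` in the output; `false` means the firmware
# took the non-SecureBoot boot path and UKI signatures are not enforced.
```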
- -#### Generate SecureBoot Keys -```bash -# Generate platform key (PK) -talosctl gen secureboot uki --platform-key-path platform.key --platform-cert-path platform.crt - -# Generate PCR signing key -talosctl gen secureboot pcr --pcr-key-path pcr.key --pcr-cert-path pcr.crt - -# Generate database entries -talosctl gen secureboot database --enrolled-certificate platform.crt -``` - -#### Machine Configuration for SecureBoot -```yaml -machine: - secureboot: - enabled: true - uklPath: /boot/vmlinuz - systemDiskEncryption: - state: - provider: luks2 - keys: - - slot: 0 - tpm: - pcrTargets: - - 0 - - 1 - - 7 -``` - -### UEFI Configuration -- Enable SecureBoot in UEFI firmware -- Enroll platform keys and certificates -- Configure TPM 2.0 for PCR measurements -- Set boot order for UKI images - -## Hardware-Specific Configurations - -### Performance Tuning for Bare Metal - -#### CPU Governor Configuration -```yaml -machine: - sysfs: - "devices.system.cpu.cpu0.cpufreq.scaling_governor": "performance" - "devices.system.cpu.cpu1.cpufreq.scaling_governor": "performance" -``` - -#### Hardware Vulnerability Mitigations -```yaml -machine: - kernel: - args: - - mitigations=off # For maximum performance (less secure) - # or - - mitigations=auto # Default balanced approach -``` - -#### IOMMU Configuration -```yaml -machine: - kernel: - args: - - intel_iommu=on - - iommu=pt -``` - -### Memory Management -```yaml -machine: - kernel: - args: - - hugepages=1024 # 1GB hugepages - - transparent_hugepage=never -``` - -## Ingress Firewall for Bare Metal - -### Basic Firewall Configuration -```yaml -machine: - network: - firewall: - defaultAction: block - rules: - - name: allow-talos-api - portSelector: - ports: - - 50000 - - 50001 - ingress: - - subnet: 192.168.1.0/24 - - name: allow-kubernetes-api - portSelector: - ports: - - 6443 - ingress: - - subnet: 0.0.0.0/0 - - name: allow-etcd - portSelector: - ports: - - 2379 - - 2380 - ingress: - - subnet: 192.168.1.0/24 -``` - -### Advanced Firewall Rules -```yaml -machine: - network: - firewall: - defaultAction: block - rules: - - name: allow-ssh-management - portSelector: - ports: - - 22 - ingress: - - subnet: 10.0.1.0/24 # Management network only - - name: allow-monitoring - portSelector: - ports: - - 9100 # Node exporter - - 10250 # kubelet metrics - ingress: - - subnet: 192.168.1.0/24 -``` - -## System Extensions for Bare Metal - -### Common Bare Metal Extensions -```yaml -machine: - install: - extensions: - - image: ghcr.io/siderolabs/iscsi-tools:latest - - image: ghcr.io/siderolabs/util-linux-tools:latest - - image: ghcr.io/siderolabs/drbd:latest -``` - -### Storage Extensions -```yaml -machine: - install: - extensions: - - image: ghcr.io/siderolabs/zfs:latest - - image: ghcr.io/siderolabs/nut-client:latest - - image: ghcr.io/siderolabs/smartmontools:latest -``` - -### Checking Extension Status -```bash -# List installed extensions -talosctl -n get extensions - -# Check extension services -talosctl -n get extensionserviceconfigs -``` - -## Static Pod Configuration for Bare Metal - -### Local Storage Static Pods -```yaml -machine: - pods: - - name: local-storage-provisioner - namespace: kube-system - image: rancher/local-path-provisioner:v0.0.24 - args: - - --config-path=/etc/config/config.json - env: - - name: POD_NAMESPACE - value: kube-system - volumeMounts: - - name: config - mountPath: /etc/config - - name: local-storage - mountPath: /opt/local-path-provisioner - volumes: - - name: config - hostPath: - path: /etc/local-storage - - name: local-storage - hostPath: 
- path: /var/lib/local-storage -``` - -### Hardware Monitoring Static Pods -```yaml -machine: - pods: - - name: node-exporter - namespace: monitoring - image: prom/node-exporter:latest - args: - - --path.rootfs=/host - - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) - securityContext: - runAsNonRoot: true - runAsUser: 65534 - volumeMounts: - - name: proc - mountPath: /host/proc - readOnly: true - - name: sys - mountPath: /host/sys - readOnly: true - - name: rootfs - mountPath: /host - readOnly: true - volumes: - - name: proc - hostPath: - path: /proc - - name: sys - hostPath: - path: /sys - - name: rootfs - hostPath: - path: / -``` - -## Bare Metal Boot Asset Management - -### PXE Boot Configuration -For network booting, configure DHCP/TFTP with appropriate boot assets: - -```bash -# Download kernel and initramfs for PXE -curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/vmlinuz-amd64 -curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/initramfs-amd64.xz -``` - -### USB Boot Asset Creation -```bash -# Write installer image to USB -sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress -``` - -### Image Factory Integration -For custom bare metal images: -```bash -# Generate schematic for bare metal with extensions -curl -X POST --data-binary @schematic.yaml \ - https://factory.talos.dev/schematics - -# Download custom installer -curl -LO https://factory.talos.dev/image//v1.11.0/metal-amd64.iso -``` - -## Hardware Compatibility and Drivers - -### Check Hardware Support -```bash -# Check PCI devices -talosctl -n read /proc/bus/pci/devices - -# Check USB devices -talosctl -n read /proc/bus/usb/devices - -# Check loaded kernel modules -talosctl -n read /proc/modules - -# Check hardware information -talosctl -n read /proc/cpuinfo -talosctl -n read /proc/meminfo -``` - -### Common Hardware Issues - -#### Network Interface Issues -```bash -# Check interface status -talosctl -n list /sys/class/net/ - -# Check driver information -talosctl -n read /sys/class/net/eth0/device/driver - -# Check firmware loading -talosctl -n dmesg | grep firmware -``` - -#### Storage Controller Issues -```bash -# Check block devices -talosctl -n disks - -# Check SMART status (if smartmontools extension installed) -talosctl -n list /dev/disk/by-id/ -``` - -## Bare Metal Monitoring and Maintenance - -### Hardware Health Monitoring -```bash -# Check system temperatures (if available) -talosctl -n read /sys/class/thermal/thermal_zone0/temp - -# Check power supply status -talosctl -n read /sys/class/power_supply/*/status - -# Monitor system events for hardware issues -talosctl -n dmesg | grep -i error -talosctl -n dmesg | grep -i "machine check" -``` - -### Performance Monitoring -```bash -# Check CPU performance -talosctl -n read /proc/cpuinfo | grep MHz -talosctl -n cgroups --preset cpu - -# Check memory performance -talosctl -n memory -talosctl -n cgroups --preset memory - -# Check I/O performance -talosctl -n read /proc/diskstats -``` - -## Security Hardening for Bare Metal - -### BIOS/UEFI Security -- Enable SecureBoot -- Disable unused boot devices -- Set administrator passwords -- Enable TPM 2.0 -- Disable legacy boot modes - -### Physical Security -- Secure physical access to servers -- Use chassis intrusion detection -- Implement network port security -- Consider hardware-based attestation - -### Network Security -```yaml -machine: - network: - firewall: - defaultAction: block - rules: - # Only allow necessary services - - name: 
allow-cluster-traffic - portSelector: - ports: - - 6443 # Kubernetes API - - 2379 # etcd client - - 2380 # etcd peer - - 10250 # kubelet API - - 50000 # Talos API - ingress: - - subnet: 192.168.1.0/24 -``` - -This bare metal guide provides comprehensive coverage of hardware-specific configurations, performance optimization, security hardening, and operational practices for Talos Linux on physical servers. \ No newline at end of file diff --git a/ai/talos-v1.11/cli-essentials.md b/ai/talos-v1.11/cli-essentials.md deleted file mode 100644 index 98cae3a..0000000 --- a/ai/talos-v1.11/cli-essentials.md +++ /dev/null @@ -1,382 +0,0 @@ -# Talosctl CLI Essentials - -This guide covers essential talosctl commands and usage patterns for effective Talos cluster administration. - -## Command Structure and Context - -### Basic Command Pattern -```bash -talosctl [global-flags] [command-flags] [arguments] - -# Examples: -talosctl -n get members -talosctl --nodes , service kubelet -talosctl -e -n upgrade --image -``` - -### Global Flags -- `-e, --endpoints`: API endpoints to connect to -- `-n, --nodes`: Target nodes for commands (defaults to first endpoint if omitted) -- `--talosconfig`: Path to Talos configuration file -- `--context`: Configuration context to use - -### Configuration Management -```bash -# Use specific config file -export TALOSCONFIG=/path/to/talosconfig - -# List available contexts -talosctl config contexts - -# Switch context -talosctl config context - -# View current config -talosctl config info -``` - -## Cluster Management Commands - -### Bootstrap and Node Management -```bash -# Bootstrap etcd cluster on first control plane node -talosctl bootstrap --nodes - -# Apply machine configuration -talosctl apply-config --nodes --file -talosctl apply-config --nodes --file --mode reboot -talosctl apply-config --nodes --file --dry-run - -# Reset node (wipe and reboot) -talosctl reset --nodes -talosctl reset --nodes --graceful=false --reboot - -# Reboot node -talosctl reboot --nodes - -# Shutdown node -talosctl shutdown --nodes -``` - -### Configuration Patching -```bash -# Patch machine configuration -talosctl -n patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]' - -# Patch with file -talosctl -n patch mc --patch @patch.yaml --mode reboot - -# Edit machine config interactively -talosctl -n edit mc --mode staged -``` - -## System Information and Monitoring - -### Node Status and Health -```bash -# Cluster member information -talosctl get members -talosctl get affiliates -talosctl get identities - -# Node health check -talosctl -n health -talosctl -n ,, health --control-plane-nodes ,, - -# System information -talosctl -n version -talosctl -n get machineconfig -talosctl -n get machinetype -``` - -### Resource Monitoring -```bash -# CPU and memory usage -talosctl -n cpu -talosctl -n memory - -# Disk usage and information -talosctl -n disks -talosctl -n df - -# Network interfaces -talosctl -n interfaces -talosctl -n get addresses -talosctl -n get routes - -# Process information -talosctl -n processes -talosctl -n cgroups --preset memory -talosctl -n cgroups --preset cpu -``` - -### Service Management -```bash -# List all services -talosctl -n services - -# Check specific service status -talosctl -n service kubelet -talosctl -n service containerd -talosctl -n service etcd - -# Restart service -talosctl -n service kubelet restart - -# Start/stop service -talosctl -n service start -talosctl -n service stop -``` - 
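Building on the service commands above, a quick multi-node scan for unhealthy services can be scripted; a small sketch, assuming placeholder node IPs and that `STATE` is the third column of the `services` output:

```bash
# List any service that is not in the Running state on each node
for node in 10.0.0.2 10.0.0.3 10.0.0.4; do
  echo "== ${node} =="
  # Skip the header row; STATE is assumed to be the third column
  talosctl -n "${node}" services | awk 'NR > 1 && $3 != "Running"'
done
```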
-## Logging and Diagnostics - -### Log Retrieval -```bash -# Kernel logs -talosctl -n dmesg -talosctl -n dmesg -f # Follow mode -talosctl -n dmesg --tail=100 - -# Service logs -talosctl -n logs kubelet -talosctl -n logs containerd -talosctl -n logs etcd -talosctl -n logs machined - -# Follow logs -talosctl -n logs kubelet -f -``` - -### System Events -```bash -# Monitor system events -talosctl -n events -talosctl -n events --tail - -# Filter events -talosctl -n events --since=1h -talosctl -n events --grep=error -``` - -## File System and Container Operations - -### File Operations -```bash -# List files/directories -talosctl -n list /var/log -talosctl -n list /etc/kubernetes - -# Copy files to/from node -talosctl -n copy /local/file /remote/path -talosctl -n cp /var/log/containers/app.log ./app.log - -# Read file contents -talosctl -n read /etc/resolv.conf -talosctl -n cat /var/log/audit/audit.log -``` - -### Container Operations -```bash -# List containers -talosctl -n containers -talosctl -n containers -k # Kubernetes containers - -# Container logs -talosctl -n logs --kubernetes - -# Execute in container -talosctl -n exec --kubernetes -- -``` - -## Kubernetes Integration - -### Kubernetes Cluster Operations -```bash -# Get kubeconfig -talosctl kubeconfig -talosctl kubeconfig --nodes -talosctl kubeconfig --force --nodes - -# Bootstrap manifests -talosctl -n get manifests -talosctl -n get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml - -# Upgrade Kubernetes -talosctl --nodes upgrade-k8s --to v1.34.1 -talosctl --nodes upgrade-k8s --to v1.34.1 --dry-run -``` - -### Resource Inspection -```bash -# Control plane component configs -talosctl -n get apiserverconfig -o yaml -talosctl -n get controllermanagerconfig -o yaml -talosctl -n get schedulerconfig -o yaml - -# etcd configuration -talosctl -n get etcdconfig -o yaml -``` - -## etcd Management - -### etcd Operations -```bash -# etcd cluster status -talosctl -n ,, etcd status - -# etcd members -talosctl -n etcd members - -# etcd snapshots -talosctl -n etcd snapshot db.snapshot - -# etcd maintenance -talosctl -n etcd defrag -talosctl -n etcd alarm list -talosctl -n etcd alarm disarm - -# Leadership management -talosctl -n etcd forfeit-leadership -``` - -### Disaster Recovery -```bash -# Bootstrap from snapshot -talosctl -n bootstrap --recover-from=./db.snapshot -talosctl -n bootstrap --recover-from=./db.snapshot --recover-skip-hash-check -``` - -## Upgrade and Maintenance - -### OS Upgrades -```bash -# Upgrade Talos OS -talosctl upgrade --nodes --image ghcr.io/siderolabs/installer:v1.11.x -talosctl upgrade --nodes --image ghcr.io/siderolabs/installer:v1.11.x --stage - -# Monitor upgrade progress -talosctl upgrade --nodes --image --wait -talosctl upgrade --nodes --image --wait --debug - -# Rollback -talosctl rollback --nodes -``` - -## Resource System Commands - -### Resource Management -```bash -# List resource types -talosctl get rd - -# Get specific resources -talosctl get -talosctl get -o yaml -talosctl get --namespace= - -# Watch resources -talosctl get --watch - -# Common resource types -talosctl get machineconfig -talosctl get members -talosctl get services -talosctl get networkconfig -talosctl get secrets -``` - -## Local Development - -### Local Cluster Management -```bash -# Create local cluster -talosctl cluster create -talosctl cluster create --controlplanes 3 --workers 2 - -# Destroy local cluster -talosctl cluster destroy - -# Show local cluster status -talosctl cluster show -``` - -## Advanced Usage 
Patterns - -### Multi-Node Operations -```bash -# Run command on multiple nodes -talosctl -e -n ,, - -# Different endpoint and target nodes -talosctl -e -n , -``` - -### Output Formatting -```bash -# JSON output -talosctl -n get members -o json - -# YAML output -talosctl -n get machineconfig -o yaml - -# Table output (default) -talosctl -n get members -o table - -# Custom column output -talosctl -n get members -o columns=HOSTNAME,MACHINE\ TYPE,OS -``` - -### Filtering and Selection -```bash -# Filter resources -talosctl get members --search -talosctl get services --search kubelet - -# Namespace filtering -talosctl get secrets --namespace=secrets -talosctl get affiliates --namespace=cluster-raw -``` - -## Common Command Workflows - -### Initial Cluster Setup -```bash -# 1. Generate configurations -talosctl gen config cluster-name https://cluster-endpoint:6443 - -# 2. Apply to nodes -talosctl apply-config --nodes --file controlplane.yaml -talosctl apply-config --nodes --file worker.yaml - -# 3. Bootstrap cluster -talosctl bootstrap --nodes - -# 4. Get kubeconfig -talosctl kubeconfig --nodes -``` - -### Cluster Health Check -```bash -# Check all aspects of cluster health -talosctl -n ,, health --control-plane-nodes ,, -talosctl -n ,, etcd status -talosctl -n ,, service kubelet -kubectl get nodes -kubectl get pods --all-namespaces -``` - -### Node Troubleshooting -```bash -# System diagnostics -talosctl -n dmesg | tail -100 -talosctl -n services | grep -v Running -talosctl -n logs kubelet | tail -50 -talosctl -n events --since=1h - -# Resource usage -talosctl -n memory -talosctl -n df -talosctl -n processes | head -20 -``` - -This CLI reference provides the essential commands and patterns needed for day-to-day Talos cluster administration and troubleshooting. \ No newline at end of file diff --git a/ai/talos-v1.11/cluster-operations.md b/ai/talos-v1.11/cluster-operations.md deleted file mode 100644 index 16f12f6..0000000 --- a/ai/talos-v1.11/cluster-operations.md +++ /dev/null @@ -1,239 +0,0 @@ -# Talos Cluster Operations Guide - -This guide covers essential cluster operations for Talos Linux v1.11 administrators. - -## Upgrading Operations - -### Talos OS Upgrades - -Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image. - -#### Upgrade Process -```bash -# Upgrade a single node -talosctl upgrade --nodes --image ghcr.io/siderolabs/installer:v1.11.x - -# Use --stage flag if upgrade fails due to open files -talosctl upgrade --nodes --image ghcr.io/siderolabs/installer:v1.11.x --stage - -# Monitor upgrade progress -talosctl dmesg -f -talosctl upgrade --wait --debug -``` - -#### Upgrade Sequence -1. Node cordons itself in Kubernetes -2. Node drains existing workloads -3. Internal processes shut down -4. Filesystems unmount -5. Disk verification and image upgrade -6. Bootloader set to boot once with new image -7. Node reboots -8. Node rejoins cluster and uncordons - -#### Rollback -```bash -talosctl rollback --nodes -``` - -### Kubernetes Upgrades - -Kubernetes upgrades are separate from OS upgrades and non-disruptive. 
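Either upgrade path benefits from a scripted pre-flight check along the lines recommended elsewhere in this guide; a minimal sketch, assuming a placeholder control plane IP and a talosctl version that supports `health --wait-timeout`:

```bash
# Snapshot etcd and verify health before starting any upgrade
CP="10.0.0.2"  # placeholder control plane node IP
talosctl -n "${CP}" etcd snapshot "pre-upgrade-$(date +%Y%m%d).snapshot"
talosctl -n "${CP}" health --wait-timeout 2m && echo "cluster healthy, safe to proceed"
```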
- -#### Automated Upgrade (Recommended) -```bash -# Check what will be upgraded -talosctl --nodes upgrade-k8s --to v1.34.1 --dry-run - -# Perform upgrade -talosctl --nodes upgrade-k8s --to v1.34.1 -``` - -#### Manual Component Upgrades -For manual control, patch each component individually: - -**API Server:** -```bash -talosctl -n patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]' -``` - -**Controller Manager:** -```bash -talosctl -n patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]' -``` - -**Scheduler:** -```bash -talosctl -n patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]' -``` - -**Kubelet:** -```bash -talosctl -n patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]' -``` - -## Node Management - -### Adding Control Plane Nodes -1. Apply machine configuration to new node -2. Node automatically joins etcd cluster via control plane endpoint -3. Control plane components start automatically - -### Removing Control Plane Nodes -```bash -# Recommended approach - reset then delete -talosctl -n reset -kubectl delete node -``` - -### Adding Worker Nodes -1. Apply worker machine configuration -2. Node automatically joins via bootstrap token - -### Removing Worker Nodes -```bash -kubectl drain --ignore-daemonsets --delete-emptydir-data -kubectl delete node -talosctl -n reset -``` - -## Configuration Management - -### Applying Configuration Changes -```bash -# Apply config with automatic mode detection -talosctl apply-config --nodes --file - -# Apply with specific modes -talosctl apply-config --nodes --file --mode no-reboot -talosctl apply-config --nodes --file --mode reboot -talosctl apply-config --nodes --file --mode staged - -# Dry run to preview changes -talosctl apply-config --nodes --file --dry-run -``` - -### Configuration Patching -```bash -# Patch machine configuration -talosctl -n patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]' - -# Patch with file -talosctl -n patch mc --patch @patch.yaml -``` - -### Retrieving Current Configuration -```bash -# Get machine configuration -talosctl -n get mc v1alpha1 -o yaml - -# Get effective configuration -talosctl -n get machineconfig -o yaml -``` - -## Cluster Health Monitoring - -### Node Status -```bash -# Check node status -talosctl -n get members -talosctl -n health - -# Check system services -talosctl -n services -talosctl -n service -``` - -### Resource Monitoring -```bash -# System resources -talosctl -n memory -talosctl -n cpu -talosctl -n disks - -# Process information -talosctl -n processes -talosctl -n cgroups --preset memory -``` - -### Log Monitoring -```bash -# Kernel logs -talosctl -n dmesg -talosctl -n dmesg -f # Follow mode - -# Service logs -talosctl -n logs -talosctl -n logs kubelet -``` - -## Control Plane Best Practices - -### Cluster Sizing Recommendations -- **3 nodes**: Sufficient for most use cases, tolerates 1 node failure -- **5 nodes**: Better availability (tolerates 2 node failures), higher resource cost -- **Avoid even numbers**: 2 or 4 nodes provide worse availability than odd numbers - -### Node Replacement Strategy -- **Failed node**: Remove first, then add replacement -- 
**Healthy node**: Add replacement first, then remove old node - -### Performance Considerations -- etcd performance decreases as cluster scales -- 5-node cluster commits ~5% fewer writes than 3-node cluster -- Vertically scale nodes for performance, don't add more nodes - -## Machine Configuration Versioning - -### Reproducible Configuration Workflow -Store only: -- `secrets.yaml` (generated once at cluster creation) -- Patch files (YAML/JSON patches describing differences from defaults) - -Generate configs when needed: -```bash -# Generate fresh configs with existing secrets -talosctl gen config --with-secrets secrets.yaml - -# Apply patches to generated configs -talosctl gen config --with-secrets secrets.yaml --config-patch @patch.yaml -``` - -This prevents configuration drift after automated upgrades. - -## Troubleshooting Common Issues - -### Upgrade Failures -- **Invalid installer image**: Check image reference and network connectivity -- **Filesystem unmount failure**: Use `--stage` flag -- **Boot failure**: System automatically rolls back to previous version -- **Workload issues**: Use `talosctl rollback` to revert - -### Node Join Issues -- Verify network connectivity to control plane endpoint -- Check discovery service configuration -- Validate machine configuration syntax -- Ensure bootstrap process completed on initial control plane node - -### Control Plane Quorum Loss -- Identify healthy nodes with `talosctl etcd status` -- Follow disaster recovery procedures if quorum cannot be restored -- Use etcd snapshots for cluster recovery - -## Security Considerations - -### Certificate Rotation -Talos automatically rotates certificates, but monitor expiration: -```bash -talosctl -n get secrets -``` - -### Pod Security -Control plane nodes are tainted by default to prevent workload scheduling. This protects: -- Control plane from resource starvation -- Credentials from workload exposure - -### Network Security -- All API communication uses mutual TLS (mTLS) -- Discovery service data is encrypted before transmission -- WireGuard (KubeSpan) provides mesh networking security \ No newline at end of file diff --git a/ai/talos-v1.11/discovery-and-networking.md b/ai/talos-v1.11/discovery-and-networking.md deleted file mode 100644 index b814320..0000000 --- a/ai/talos-v1.11/discovery-and-networking.md +++ /dev/null @@ -1,344 +0,0 @@ -# Discovery and Networking Guide - -This guide covers Talos cluster discovery mechanisms, network configuration, and connectivity troubleshooting. - -## Cluster Discovery System - -Talos includes built-in node discovery that allows cluster members to find each other and maintain membership information. 
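For a quick look at what discovery has produced on a live node before examining the registries below, the member and affiliate resources can be listed directly (the node IP is a placeholder):

```bash
# Approved members as this node currently sees them
talosctl -n 10.0.0.2 get members
# Raw discovered peers (affiliates) before they are approved as members
talosctl -n 10.0.0.2 get affiliates
```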
- -### Discovery Registries - -#### Service Registry (Default) -- **External Service**: Uses public discovery service at `https://discovery.talos.dev/` -- **Encryption**: All data encrypted with AES-GCM before transmission -- **Functionality**: Works without dependency on etcd/Kubernetes -- **Advantages**: Available even when control plane is down - -#### Kubernetes Registry (Deprecated) -- **Data Source**: Uses Kubernetes Node resources and annotations -- **Limitation**: Incompatible with Kubernetes 1.32+ due to AuthorizeNodeWithSelectors -- **Status**: Disabled by default, deprecated - -### Discovery Configuration -```yaml -cluster: - discovery: - enabled: true - registries: - service: - disabled: false # Default - kubernetes: - disabled: true # Deprecated, disabled by default -``` - -**To disable service registry**: -```yaml -cluster: - discovery: - enabled: true - registries: - service: - disabled: true -``` - -## Discovery Data Flow - -### Service Registry Process -1. **Data Encryption**: Node encrypts affiliate data with cluster key -2. **Endpoint Encryption**: Endpoints separately encrypted for deduplication -3. **Data Submission**: Node submits own data + observed peer endpoints -4. **Server Processing**: Discovery service aggregates and deduplicates data -5. **Data Distribution**: Encrypted updates sent to all cluster members -6. **Local Processing**: Nodes decrypt data for cluster discovery and KubeSpan - -### Data Protection -- **Cluster Isolation**: Cluster ID used as key selector -- **End-to-End Encryption**: Discovery service cannot decrypt affiliate data -- **Memory-Only Storage**: Data stored in memory with encrypted snapshots -- **No Sensitive Exposure**: Service only sees encrypted blobs and cluster metadata - -## Discovery Resources - -### Node Identity -```bash -# View node's unique identity -talosctl get identities -o yaml -``` - -**Output**: -```yaml -spec: - nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd -``` - -**Identity Characteristics**: -- Base62 encoded random 32 bytes -- URL-safe encoding -- Preserved in STATE partition (`node-identity.yaml`) -- Survives reboots and upgrades -- Regenerated on reset/wipe - -### Affiliates (Proposed Members) -```bash -# View discovered affiliates (proposed cluster members) -talosctl get affiliates -``` - -**Output**: -``` -ID VERSION HOSTNAME MACHINE TYPE ADDRESSES -2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 2 talos-default-controlplane-2 controlplane ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"] -``` - -### Members (Approved Members) -```bash -# View cluster members -talosctl get members -``` - -**Output**: -``` -ID VERSION HOSTNAME MACHINE TYPE OS ADDRESSES -talos-default-controlplane-1 2 talos-default-controlplane-1 controlplane Talos (v1.11.0) ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"] -``` - -### Raw Registry Data -```bash -# View data from specific registries -talosctl get affiliates --namespace=cluster-raw -``` - -**Output shows registry sources**: -``` -ID VERSION HOSTNAME -k8s/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 3 talos-default-controlplane-2 -service/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 23 talos-default-controlplane-2 -``` - -## Network Architecture - -### Network Layers - -#### Host Networking -- **Node-to-Node**: Direct IP connectivity between cluster nodes -- **Control Plane**: API server communication via control plane endpoint -- **Discovery**: HTTPS connection to discovery service (port 443) - -#### Container Networking -- **CNI**: Container Network Interface for pod 
networking -- **Service Mesh**: Optional service mesh implementations -- **Network Policies**: Kubernetes network policy enforcement - -#### Optional: KubeSpan (WireGuard Mesh) -- **Mesh Networking**: Full mesh WireGuard connections -- **Discovery Integration**: Uses discovery service for peer coordination -- **Encryption**: WireGuard public keys distributed via discovery -- **Use Cases**: Multi-cloud, hybrid, NAT traversal - -### Network Configuration Patterns - -#### Basic Network Setup -```yaml -machine: - network: - interfaces: - - interface: eth0 - dhcp: true -``` - -#### Static IP Configuration -```yaml -machine: - network: - interfaces: - - interface: eth0 - addresses: - - 192.168.1.100/24 - routes: - - network: 0.0.0.0/0 - gateway: 192.168.1.1 - mtu: 1500 - nameservers: - - 8.8.8.8 - - 1.1.1.1 -``` - -#### Multiple Interface Configuration -```yaml -machine: - network: - interfaces: - - interface: eth0 # Management interface - dhcp: true - - interface: eth1 # Kubernetes traffic - addresses: - - 10.0.1.100/24 - routes: - - network: 10.0.0.0/16 - gateway: 10.0.1.1 -``` - -## KubeSpan Configuration - -### Basic KubeSpan Setup -```yaml -machine: - network: - kubespan: - enabled: true -``` - -### Advanced KubeSpan Configuration -```yaml -machine: - network: - kubespan: - enabled: true - advertiseKubernetesNetworks: true - allowDownPeerBypass: true - mtu: 1420 # Account for WireGuard overhead - filters: - endpoints: - - 0.0.0.0/0 # Allow all endpoints -``` - -**KubeSpan Features**: -- Automatic peer discovery via discovery service -- NAT traversal capabilities -- Encrypted mesh networking -- Kubernetes network advertisement -- Fault tolerance with peer bypass - -## Network Troubleshooting - -### Discovery Issues - -#### Check Discovery Service Connectivity -```bash -# Test connectivity to discovery service -talosctl get affiliates - -# Check discovery configuration -talosctl get discoveryconfig -o yaml - -# Monitor discovery events -talosctl events --tail -``` - -#### Common Discovery Problems -1. **No Affiliates Discovered**: - - Check discovery service connectivity - - Verify cluster ID matches across nodes - - Confirm discovery is enabled - -2. **Partial Affiliate List**: - - Network connectivity issues between nodes - - Discovery service regional availability - - Firewall blocking discovery traffic - -3. 
**Discovery Service Unreachable**:
   - Network connectivity to discovery.talos.dev:443
   - Corporate firewall/proxy configuration
   - DNS resolution issues

### Network Connectivity Testing

#### Basic Network Tests
```bash
# Test network interfaces
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses

# Check network configuration
talosctl get networkconfig -o yaml

# Test connectivity
talosctl -n <IP> ping
```

#### Inter-Node Connectivity
```bash
# Test control plane endpoint
talosctl health --control-plane-nodes <IP1>,<IP2>,<IP3>

# Check etcd connectivity
talosctl -n <IP> etcd members

# Test Kubernetes API
kubectl get nodes
```

#### KubeSpan Troubleshooting
```bash
# Check KubeSpan status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses

# Monitor WireGuard connections
talosctl -n <IP> interfaces

# Check KubeSpan logs
talosctl -n <IP> logs controller-runtime | grep kubespan
```

### Network Performance Optimization

#### Network Interface Tuning
```yaml
machine:
  network:
    interfaces:
      - interface: eth0
        mtu: 9000 # Jumbo frames if supported
        dhcp: true
```

#### KubeSpan Performance
- Adjust MTU for WireGuard overhead (typically -80 bytes)
- Consider endpoint filters for large clusters
- Monitor WireGuard peer connection stability

## Security Considerations

### Discovery Security
- **Encrypted Communication**: All discovery data encrypted end-to-end
- **Cluster Isolation**: Cluster ID prevents cross-cluster data access
- **No Sensitive Data**: Only encrypted metadata transmitted
- **Network Security**: HTTPS transport with certificate validation

### Network Security
- **mTLS**: All Talos API communication uses mutual TLS
- **Certificate Rotation**: Automatic certificate lifecycle management
- **Network Policies**: Implement Kubernetes network policies for workloads
- **Firewall Rules**: Restrict network access to necessary ports only

### Required Network Ports
- **6443**: Kubernetes API server
- **2379-2380**: etcd client/peer communication
- **10250**: kubelet API
- **50000**: Talos API (apid)
- **443**: Discovery service (outbound)
- **51820**: KubeSpan WireGuard (if enabled)

## Operational Best Practices

### Monitoring
- Monitor discovery service connectivity
- Track cluster member changes
- Alert on network partitions
- Monitor KubeSpan peer status

### Backup and Recovery
- Document network configuration
- Backup discovery service configuration
- Test network recovery procedures
- Plan for discovery service outages

### Scaling Considerations
- Discovery service scales to thousands of nodes
- KubeSpan mesh scales to hundreds of nodes efficiently
- Consider network segmentation for large clusters
- Plan for multi-region deployments

This networking foundation enables Talos clusters to maintain connectivity and membership across various network topologies while providing security and performance optimization options.
\ No newline at end of file
diff --git a/ai/talos-v1.11/etcd-management.md b/ai/talos-v1.11/etcd-management.md
deleted file mode 100644
index a68fed8..0000000
--- a/ai/talos-v1.11/etcd-management.md
+++ /dev/null
@@ -1,287 +0,0 @@
# etcd Management and Disaster Recovery Guide

This guide covers etcd database operations, maintenance, and disaster recovery procedures for Talos Linux clusters.

## etcd Health Monitoring

### Basic Health Checks
```bash
# Check etcd status across all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status

# Check etcd alarms
talosctl -n <IP> etcd alarm list

# Check etcd members
talosctl -n <IP> etcd members

# Check service status
talosctl -n <IP> service etcd
```

### Understanding etcd Status Output
```
NODE         MEMBER             DB SIZE   IN USE            LEADER             RAFT INDEX   RAFT TERM   RAFT APPLIED INDEX   LEARNER   ERRORS
172.20.0.2   a49c021e76e707db   17 MB     4.5 MB (26.10%)   ecebb05b59a776f1   53391        4           53391                false
```

**Key Metrics**:
- **DB SIZE**: Total database size on disk
- **IN USE**: Actual data size (fragmentation = DB SIZE - IN USE)
- **LEADER**: Current etcd cluster leader
- **RAFT INDEX**: Consensus log position
- **LEARNER**: Whether node is still joining cluster

## Space Quota Management

### Default Configuration
- Default space quota: 2 GiB
- Recommended maximum: 8 GiB
- Database locks when quota exceeded

### Quota Exceeded Handling
**Symptoms**:
```bash
talosctl -n <IP> etcd alarm list
# Output: ALARM: NOSPACE
```

**Resolution**:
1. Increase quota in machine configuration:
```yaml
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: 4294967296 # 4 GiB
```

2. Apply configuration and reboot:
```bash
talosctl -n <IP> apply-config --file updated-config.yaml --mode reboot
```

3. Clear the alarm:
```bash
talosctl -n <IP> etcd alarm disarm
```

## Database Defragmentation

### When to Defragment
- In use/DB size ratio < 0.5 (heavily fragmented)
- Database size exceeds quota but actual data is small
- Performance degradation due to fragmentation

### Defragmentation Process
```bash
# Check fragmentation status
talosctl -n <IP1>,<IP2>,<IP3> etcd status

# Defragment single node (resource-intensive operation)
talosctl -n <IP> etcd defrag

# Verify defragmentation results
talosctl -n <IP> etcd status
```

**Important Notes**:
- Defragment one node at a time
- Operation blocks reads/writes during execution
- Can significantly improve performance if heavily fragmented

### Post-Defragmentation Verification
After successful defragmentation, DB size should closely match IN USE size:
```
NODE         MEMBER             DB SIZE   IN USE
172.20.0.2   a49c021e76e707db   4.5 MB    4.5 MB (100.00%)
```

## Backup Operations

### Regular Snapshots
```bash
# Create consistent snapshot
talosctl -n <IP> etcd snapshot db.snapshot
```

**Output Example**:
```
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
```

### Disaster Snapshots
When etcd cluster is unhealthy and normal snapshot fails:
```bash
# Copy database directly (may be inconsistent)
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
```

### Automated Backup Strategy
- Schedule regular snapshots (daily/hourly based on change frequency)
- Store snapshots in multiple locations
- Test restore procedures regularly
- Document recovery procedures

## Disaster Recovery

### Pre-Recovery Assessment
**Check if Recovery is Necessary**:
```bash
# Query etcd health on all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> service etcd

# Check member list consistency
talosctl -n <IP1> etcd members
talosctl -n <IP2> etcd members
talosctl -n <IP3> etcd members
```

**Recovery is needed when**:
- Quorum is lost (majority of nodes down)
- etcd data corruption
- Complete cluster failure

### Recovery Prerequisites
1. **Latest etcd snapshot** (preferably consistent)
2. 
**Machine configuration backup**:
```bash
talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
```
3. **No init-type nodes** (deprecated, incompatible with recovery)

### Recovery Procedure

#### Step 1: Prepare Control Plane Nodes
```bash
# If nodes have hardware issues, replace them with same configuration
# If nodes are running but etcd is corrupted, wipe EPHEMERAL partition:
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
```

#### Step 2: Verify etcd State
All etcd services should be in "Preparing" state:
```bash
talosctl -n <IP> service etcd
# Expected: STATE: Preparing
```

#### Step 3: Bootstrap from Snapshot
```bash
# Bootstrap cluster from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot

# For direct database copies, skip hash check:
talosctl -n <IP> bootstrap --recover-from=./db --recover-skip-hash-check
```

#### Step 4: Verify Recovery
**Monitor kernel logs** for recovery progress:
```bash
talosctl -n <IP> dmesg -f
```

**Expected log entries**:
```
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot"}
```

**Verify cluster health**:
```bash
# etcd should become healthy on bootstrap node
talosctl -n <IP> service etcd

# Kubernetes control plane should start
kubectl get nodes

# Other control plane nodes should join automatically
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```

## etcd Version Management

### Downgrade Process (v3.6 to v3.5)
**Prerequisites**:
- Healthy cluster running v3.6.x
- Recent backup snapshot
- Downgrade only one minor version at a time

#### Step 1: Validate Downgrade
```bash
talosctl -n <IP> etcd downgrade validate 3.5
```

#### Step 2: Enable Downgrade
```bash
talosctl -n <IP> etcd downgrade enable 3.5
```

#### Step 3: Verify Schema Migration
```bash
# Check storage version migrated to 3.5
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Verify STORAGE column shows 3.5.0
```

#### Step 4: Patch Machine Configuration
```bash
# Transfer leadership if node is leader
talosctl -n <IP> etcd forfeit-leadership

# Create patch file (image tag illustrative - pin the 3.5.x release you validated)
cat > etcd-patch.yaml <<EOF
cluster:
  etcd:
    image: gcr.io/etcd-development/etcd:v3.5.21
EOF

# Apply the patch and reboot the node
talosctl -n <IP> patch machineconfig --patch @etcd-patch.yaml --mode reboot
```

#### Step 5: Repeat for All Control Plane Nodes
Continue patching the remaining control plane nodes one by one.
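
Because the leadership transfer and the patch are identical on every node, Steps 4 and 5 can be scripted; a minimal sketch, assuming three control plane nodes and the `etcd-patch.yaml` from Step 4 (the IPs are placeholders):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Control plane node IPs (placeholders - substitute your own)
NODES=("172.20.0.2" "172.20.0.3" "172.20.0.4")

for ip in "${NODES[@]}"; do
  echo "=== Patching ${ip} ==="
  # Hand etcd leadership to another member before rebooting this one
  talosctl -n "${ip}" etcd forfeit-leadership || true
  # Apply the etcd image patch from Step 4 and reboot the node
  talosctl -n "${ip}" patch machineconfig --patch @etcd-patch.yaml --mode reboot
  # Wait for the member to rejoin healthily before touching the next node
  read -rp "Confirm 'talosctl -n ${ip} etcd status' is healthy, then press Enter... "
done
```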
- -## Operational Best Practices - -### Monitoring -- Monitor database size and fragmentation regularly -- Set up alerts for space quota approaching limits -- Track etcd performance metrics (request latency, leader changes) -- Monitor disk I/O and network latency - -### Maintenance Windows -- Schedule defragmentation during low-traffic periods -- Coordinate with application teams for maintenance windows -- Test backup/restore procedures in non-production environments - -### Performance Optimization -- Use fast storage (NVMe SSDs preferred) -- Minimize network latency between control plane nodes -- Monitor and tune etcd configuration based on workload - -### Security -- Encrypt etcd data at rest -- Secure backup storage with appropriate access controls -- Regularly rotate certificates -- Monitor for unauthorized access attempts - -## Troubleshooting Common Issues - -### Split Brain Prevention -- Ensure odd number of control plane nodes -- Monitor network connectivity between nodes -- Use dedicated network for control plane communication when possible - -### Performance Issues -- Check disk I/O latency -- Monitor memory usage -- Consider vertical scaling before adding nodes -- Review etcd request patterns and optimize applications - -### Backup/Restore Issues -- Test restore procedures regularly -- Verify backup integrity -- Ensure consistent network and storage configuration -- Document and practice disaster recovery procedures \ No newline at end of file diff --git a/ai/talos-v1.11/troubleshooting-guide.md b/ai/talos-v1.11/troubleshooting-guide.md deleted file mode 100644 index 7a3f294..0000000 --- a/ai/talos-v1.11/troubleshooting-guide.md +++ /dev/null @@ -1,480 +0,0 @@ -# Talos Troubleshooting Guide - -This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues. - -## General Troubleshooting Methodology - -### 1. Gather Information -```bash -# Node status and health -talosctl -n health -talosctl -n version -talosctl -n get members - -# System resources -talosctl -n memory -talosctl -n disks -talosctl -n processes | head -20 - -# Service status -talosctl -n services -``` - -### 2. Check Logs -```bash -# Kernel logs (system-level issues) -talosctl -n dmesg | tail -100 - -# Service logs -talosctl -n logs machined -talosctl -n logs kubelet -talosctl -n logs containerd - -# System events -talosctl -n events --since=1h -``` - -### 3. Network Connectivity -```bash -# Discovery and membership -talosctl get affiliates -talosctl get members - -# Network interfaces -talosctl -n interfaces -talosctl -n get addresses - -# Control plane connectivity -kubectl get nodes -talosctl -n ,, etcd status -``` - -## Bootstrap and Initial Setup Issues - -### Cluster Bootstrap Failures - -**Symptoms**: Bootstrap command fails or times out -**Diagnosis**: -```bash -# Check etcd service state -talosctl -n service etcd - -# Check if node is trying to join instead of bootstrap -talosctl -n logs etcd | grep -i bootstrap - -# Verify machine configuration -talosctl -n get machineconfig -o yaml -``` - -**Common Causes & Solutions**: -1. **Wrong node type**: Ensure using `controlplane`, not deprecated `init` -2. **Network issues**: Verify control plane endpoint connectivity -3. **Configuration errors**: Check machine configuration validity -4. 
**Previous bootstrap**: etcd data exists from previous attempts

**Resolution**:
```bash
# Reset node if previous bootstrap data exists
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL

# Re-apply configuration and bootstrap
talosctl apply-config --nodes <IP> --file controlplane.yaml
talosctl bootstrap --nodes <IP>
```

### Node Join Issues

**Symptoms**: New nodes don't join cluster
**Diagnosis**:
```bash
# Check discovery
talosctl get affiliates
talosctl get members

# Check bootstrap token
kubectl get secrets -n kube-system | grep bootstrap-token

# Check kubelet logs
talosctl -n <IP> logs kubelet | grep -i certificate
```

**Common Solutions**:
```bash
# Talos nodes join via their machine configuration, not kubeadm;
# if the join token has expired, re-apply a current machine config
talosctl apply-config --nodes <IP> --file worker.yaml

# Verify discovery service connectivity
talosctl -n <IP> get affiliates --namespace=cluster-raw

# Check machine configuration matches cluster
talosctl -n <IP> get machineconfig -o yaml
```

## Control Plane Issues

### etcd Problems

**etcd Won't Start**:
```bash
# Check etcd service status and logs
talosctl -n <IP> service etcd
talosctl -n <IP> logs etcd

# Check etcd data directory
talosctl -n <IP> list /var/lib/etcd

# Check disk space and permissions
talosctl -n <IP> df
```

**etcd Quorum Loss**:
```bash
# Check member status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP> etcd members

# Identify healthy members
for ip in IP1 IP2 IP3; do
  echo "=== Node $ip ==="
  talosctl -n $ip service etcd
done
```

**Solution for Quorum Loss**:
1. If majority available: Remove failed members, add replacements
2. If majority lost: Follow disaster recovery procedure

### API Server Issues

**API Server Not Responding**:
```bash
# Check API server pod status
kubectl get pods -n kube-system | grep apiserver

# Check API server configuration
talosctl -n <IP> get apiserverconfig -o yaml

# Check control plane endpoint
curl -k https://<control-plane-endpoint>:6443/healthz
```

**Common Solutions**:
```bash
# Restart kubelet to reload static pods
talosctl -n <IP> service kubelet restart

# Check for configuration issues
talosctl -n <IP> logs kubelet | grep apiserver

# Verify etcd connectivity
talosctl -n <IP> etcd status
```

## Node-Level Issues

### Kubelet Problems

**Kubelet Service Issues**:
```bash
# Check kubelet status and logs
talosctl -n <IP> service kubelet
talosctl -n <IP> logs kubelet | tail -50

# Check kubelet configuration
talosctl -n <IP> get kubeletconfig -o yaml

# Check container runtime
talosctl -n <IP> service containerd
```

**Common Kubelet Issues**:
1. **Certificate problems**: Check certificate expiration and rotation
2. **Container runtime issues**: Verify containerd health
3. **Resource constraints**: Check memory and disk space
4. **Network connectivity**: Verify API server connectivity

### Container Runtime Issues

**Containerd Problems**:
```bash
# Check containerd service
talosctl -n <IP> service containerd
talosctl -n <IP> logs containerd

# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers

# Check containerd configuration
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
```

**Common Solutions**:
```bash
# Restart containerd
talosctl -n <IP> service containerd restart

# Check disk space for container images
talosctl -n <IP> df

# Clean up unused containers/images
# (This happens automatically via kubelet GC)
```

## Network Issues

### Network Connectivity Problems

**Node-to-Node Connectivity**:
```bash
# Test basic network connectivity
talosctl -n <IP> interfaces
talosctl -n <IP> get routes

# Test specific connectivity
talosctl -n <IP> read /etc/resolv.conf

# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
```

**DNS Resolution Issues**:
```bash
# Check DNS configuration
talosctl -n <IP> read /etc/resolv.conf

# Test DNS resolution from inside the cluster
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```

### Discovery Service Issues

**Discovery Not Working**:
```bash
# Check discovery configuration
talosctl get discoveryconfig -o yaml

# Check affiliate discovery
talosctl get affiliates
talosctl get affiliates --namespace=cluster-raw

# Test discovery service connectivity
curl -v https://discovery.talos.dev/
```

**KubeSpan Issues** (if enabled):
```bash
# Check KubeSpan configuration
talosctl get kubespanconfig -o yaml

# Check peer status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses

# Check WireGuard interface
talosctl -n <IP> interfaces | grep kubespan
```

## Upgrade Issues

### OS Upgrade Problems

**Upgrade Fails or Hangs**:
```bash
# Check upgrade status
talosctl -n <IP> dmesg | grep -i upgrade
talosctl -n <IP> events | grep -i upgrade

# Use staged upgrade for filesystem lock issues
talosctl upgrade --nodes <IP> --image <installer-image> --stage

# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <installer-image> --wait --debug
```

**Boot Issues After Upgrade**:
```bash
# Check boot logs
talosctl -n <IP> dmesg | head -100

# System automatically rolls back on boot failure
# Check current version
talosctl -n <IP> version

# Manual rollback if needed
talosctl rollback --nodes <IP>
```

### Kubernetes Upgrade Issues

**K8s Upgrade Failures**:
```bash
# Check upgrade status
talosctl --nodes <controlplane-IP> upgrade-k8s --to <version> --dry-run

# Check individual component status
kubectl get pods -n kube-system
talosctl -n <IP> get apiserverconfig -o yaml
```

**Version Mismatch Issues**:
```bash
# Check version consistency
kubectl get nodes -o wide
talosctl -n <IP1>,<IP2>,<IP3> version

# Check component versions
kubectl get pods -n kube-system -o wide
```

## Resource and Performance Issues

### Memory and Storage Problems

**Out of Memory**:
```bash
# Check memory usage
talosctl -n <IP> memory
talosctl -n <IP> processes --sort-by=memory | head -20

# Check for memory pressure
kubectl describe node <node-name> | grep -A 10 Conditions

# Check OOM events
talosctl -n <IP> dmesg | grep -i "out of memory"
```

**Disk Space Issues**:
```bash
# Check disk usage
talosctl -n <IP> df
talosctl -n <IP> disks

# Check specific directories
talosctl -n <IP> list /var/lib/containerd
talosctl -n <IP> list /var/lib/etcd

# Clean up if needed (automatic GC usually handles this)
kubectl describe node <node-name> | grep -A 5 "Disk Pressure"
```

### Performance Issues

**Slow Cluster Response**:
```bash
# Check API server response time
time kubectl get nodes

# Check etcd performance
talosctl -n <IP> etcd status
# Look for high DB size vs IN USE ratio (fragmentation)

# Check system load
talosctl -n <IP> cpu
talosctl -n <IP> memory
```

**High CPU/Memory Usage**:
```bash
# Identify resource-heavy processes
talosctl -n <IP> processes --sort-by=cpu | head -10
talosctl -n <IP> processes --sort-by=memory | head -10

# Check cgroup usage
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```

## Configuration Issues

### Machine Configuration Problems

**Invalid Configuration**:
```bash
# Validate configuration before applying
talosctl validate -f machineconfig.yaml

# Check current configuration
talosctl -n <IP> get machineconfig -o yaml

# Compare with expected configuration
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
```

**Configuration Drift**:
```bash
# Check configuration version
talosctl -n <IP> get machineconfig

# Re-apply configuration if needed
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
talosctl apply-config --nodes <IP> --file corrected-config.yaml
```

## Emergency Procedures

### Node Unresponsive

**Complete Node Failure**:
1. **Physical access required**: Power cycle or hardware reset
2. **Check hardware**: Memory, disk, network interface status
3. **Boot issues**: May require bootable recovery media

**Partial Connectivity**:
```bash
# Try different network interfaces if multiple available
talosctl -e <endpoint> -n <IP> health

# Check if specific services are running
talosctl -n <IP> service machined
talosctl -n <IP> service apid
```

### Cluster-Wide Failures

**All Control Plane Nodes Down**:
1. **Assess scope**: Determine if data corruption or hardware failure
2. **Recovery strategy**: Use etcd backup if available
3. **Rebuild process**: May require complete cluster rebuild

**Follow disaster recovery procedures** as documented in etcd-management.md.

### Emergency Reset Procedures

**Single Node Reset**:
```bash
# Graceful reset (preserves some data)
talosctl -n <IP> reset

# Force reset (wipes all data)
talosctl -n <IP> reset --graceful=false --reboot

# Selective wipe (preserve STATE partition)
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
```

**Cluster Reset** (DESTRUCTIVE):
```bash
# Reset all nodes (DANGER: DATA LOSS)
for ip in IP1 IP2 IP3; do
  talosctl -n $ip reset --graceful=false --reboot
done
```

## Monitoring and Alerting

### Key Metrics to Monitor
- Node resource usage (CPU, memory, disk)
- etcd health and performance
- Control plane component status
- Network connectivity
- Certificate expiration
- Discovery service connectivity

### Log Locations for External Monitoring
- Kernel logs: `talosctl dmesg`
- Service logs: `talosctl logs <service>`
- System events: `talosctl events`
- Kubernetes events: `kubectl get events`

This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.
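
The information-gathering steps above lend themselves to a small triage helper that sweeps every node in one pass; a minimal sketch, using only commands covered in this guide (node IPs are placeholders):

```bash
#!/usr/bin/env bash
set -uo pipefail

# Nodes to inspect (placeholders - substitute your control plane and worker IPs)
NODES=("172.20.0.2" "172.20.0.3" "172.20.0.4")

for ip in "${NODES[@]}"; do
  echo "===== ${ip} ====="
  talosctl -n "${ip}" version || true            # reachable? which Talos version?
  talosctl -n "${ip}" services || true           # per-service health at a glance
  talosctl -n "${ip}" dmesg | tail -20 || true   # recent kernel messages
done

# Cluster-level views
talosctl get members || true
kubectl get nodes -o wide || true
```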
\ No newline at end of file diff --git a/ai/wildcloud-v.PoC/README.md b/ai/wildcloud-v.PoC/README.md deleted file mode 100644 index eb605de..0000000 --- a/ai/wildcloud-v.PoC/README.md +++ /dev/null @@ -1,188 +0,0 @@ -# Wild Cloud Agent Context Documentation - -This directory contains comprehensive documentation about the Wild Cloud project, designed to provide AI agents (like Claude Code) with the context needed to effectively help users with Wild Cloud development, deployment, and operations. - -## Documentation Overview - -### 📚 Core Documentation Files - -1. **[overview.md](./overview.md)** - Complete project introduction and getting started guide - - What Wild Cloud is and why it exists - - Technology stack and architecture overview - - Quick start guide and common use cases - - Best practices and troubleshooting - -2. **[bin-scripts.md](./bin-scripts.md)** - Complete CLI reference - - All 34+ `wild-*` commands with usage examples - - Command categories (setup, apps, config, operations) - - Script dependencies and execution order - - Common usage patterns - -3. **[setup-process.md](./setup-process.md)** - Infrastructure deployment deep dive - - Complete setup phases and dependencies - - Talos Linux and Kubernetes cluster deployment - - Core services installation (MetalLB, Traefik, cert-manager, etc.) - - Network configuration and DNS management - -4. **[apps-system.md](./apps-system.md)** - Application management system - - App structure and lifecycle management - - Template system and configuration - - Available applications and their features - - Creating custom applications - -5. **[configuration-system.md](./configuration-system.md)** - Configuration and secrets management - - `config.yaml` and `secrets.yaml` structure - - Template processing with gomplate - - Environment setup and validation - - Security best practices - -6. 
**[project-architecture.md](./project-architecture.md)** - Project structure and organization
   - Wild Cloud repository structure
   - User cloud directory layout
   - File permissions and security model
   - Development and deployment patterns

## Quick Reference Guide

### Essential Commands
```bash
# Setup & Initialization
wild-init                     # Initialize new cloud
wild-setup                    # Complete deployment
wild-health                   # System health check

# Application Management
wild-apps-list                # List available apps
wild-app-add <app>            # Configure app
wild-app-deploy <app>         # Deploy app

# Configuration
wild-config <key>             # Read config
wild-config-set <key> <value> # Set config
wild-secret <key>             # Read secret
```

### Key File Locations

**Wild Cloud Repository** (`WC_ROOT`):
- `bin/` - All CLI commands
- `apps/` - Application templates
- `setup/` - Infrastructure templates
- `docs/` - Documentation

**User Cloud Directory** (`WC_HOME`):
- `config.yaml` - Main configuration
- `secrets.yaml` - Sensitive data
- `apps/` - Deployed app configs
- `.wildcloud/` - Project marker

### Application Categories

- **Content**: Ghost (blog), Discourse (forum)
- **Media**: Immich (photos)
- **Development**: Gitea (Git), Docker Registry
- **Databases**: PostgreSQL, MySQL, Redis
- **AI/ML**: vLLM (LLM inference)

## Technology Stack Summary

### Core Infrastructure
- **Talos Linux** - Immutable Kubernetes OS
- **Kubernetes** - Container orchestration
- **MetalLB** - Load balancing
- **Traefik** - Ingress/reverse proxy
- **Longhorn** - Distributed storage
- **cert-manager** - TLS certificates

### Management Tools
- **gomplate** - Template processing
- **Kustomize** - Configuration management
- **restic** - Backup system
- **kubectl/talosctl** - Cluster management

## Common Agent Tasks

### When Users Ask About...

**"How do I deploy X?"**
- Check apps-system.md for application management
- Look for X in available applications list
- Reference app deployment lifecycle

**"Setup isn't working"**
- Review setup-process.md for troubleshooting
- Check bin-scripts.md for command options
- Verify prerequisites and dependencies

**"How do I configure Y?"**
- Check configuration-system.md for config management
- Look at project-architecture.md for file locations
- Review template processing documentation

**"What does wild-X command do?"**
- Reference bin-scripts.md for complete command documentation
- Check command categories and usage patterns
- Look at dependencies between commands

### Development Tasks

**Creating New Apps**:
1. Review apps-system.md "Creating Custom Apps" section
2. Follow Wild Cloud app structure conventions
3. Use project-architecture.md for file organization
4. Test with standard app deployment workflow

**Modifying Infrastructure**:
1. Check setup-process.md for infrastructure components
2. Review configuration-system.md for template processing
3. Understand project-architecture.md file relationships
4. Test changes carefully in development environment

**Troubleshooting Issues**:
1. Use bin-scripts.md for diagnostic commands
2. Check setup-process.md for component validation
3. Review configuration-system.md for config problems
4. 
Reference apps-system.md for application issues - -## Best Practices for Agents - -### Understanding User Context -- Always check if they're in a Wild Cloud directory (look for `.wildcloud/`) -- Determine if they need setup help vs operational help -- Consider their experience level (beginner vs advanced) -- Check what applications they're trying to deploy - -### Providing Help -- Reference specific documentation sections for detailed info -- Provide exact command syntax from bin-scripts.md -- Explain prerequisites and dependencies -- Offer validation steps to verify success - -### Safety Considerations -- Always recommend testing in development first -- Warn about destructive operations (delete, reset) -- Emphasize backup importance before major changes -- Explain security implications of configuration changes - -### Common Gotchas -- `secrets.yaml` has restricted permissions (600) -- Templates need processing before deployment -- Dependencies between applications must be satisfied -- Node hardware detection requires maintenance mode boot - -## Documentation Maintenance - -This documentation should be updated when: -- New commands are added to `bin/` -- New applications are added to `apps/` -- Infrastructure components change -- Configuration schema evolves -- Best practices are updated - -Each documentation file includes: -- Complete coverage of its topic area -- Practical examples and use cases -- Troubleshooting guidance -- References to related documentation - -This comprehensive context should enable AI agents to provide expert-level assistance with Wild Cloud projects across all aspects of the system. \ No newline at end of file diff --git a/ai/wildcloud-v.PoC/apps-system.md b/ai/wildcloud-v.PoC/apps-system.md deleted file mode 100644 index a7a6f9d..0000000 --- a/ai/wildcloud-v.PoC/apps-system.md +++ /dev/null @@ -1,595 +0,0 @@ -# Wild Cloud Apps System - -The Wild Cloud apps system provides a streamlined way to deploy and manage applications on your Kubernetes cluster. It uses Kustomize for configuration management and follows a standardized structure for consistent deployment patterns. - -## App Structure and Components - -### Directory Structure -Each subdirectory represents a Wild Cloud app. Each app directory contains: - -**Required Files:** -- `manifest.yaml` - App metadata and configuration -- `kustomization.yaml` - Kustomize configuration with Wild Cloud labels - -**Standard Configuration Files (one or more YAML files containing Kubernetes resource definitions):** -``` -apps/myapp/ -├── manifest.yaml # Required: App metadata and configuration -├── kustomization.yaml # Required: Kustomize configuration with Wild Cloud labels -├── namespace.yaml # Kubernetes namespace definition -├── deployment.yaml # Application deployment -├── service.yaml # Kubernetes service definition -├── ingress.yaml # HTTPS ingress with external DNS -├── pvc.yaml # Persistent volume claims (if needed) -├── db-init-job.yaml # Database initialization (if needed) -└── configmap.yaml # Configuration data (if needed) -``` - -### App Manifest (`manifest.yaml`) - -The required `manifest.yaml` file contains metadata about the app. Here's an example `manifest.yaml` file: - -```yaml -name: myapp -description: A brief description of the application and its purpose. 
-version: 1.0.0 -icon: https://example.com/icon.png -requires: - - name: postgres -defaultConfig: - image: myapp/server:1.0.0 - domain: myapp.{{ .cloud.domain }} - timezone: UTC - storage: 10Gi - dbHostname: postgres.postgres.svc.cluster.local - dbUsername: myapp -defaultSecrets: - - apps.myapp.dbPassword - - apps.postgres.password -``` - -**Manifest Fields**: -- `name` - The name of the app, used for identification (must match directory name) -- `description` - A brief description of the app -- `version` - The version of the app (should generally follow the versioning scheme of the app itself) -- `icon` - A URL to an icon representing the app -- `requires` - A list of other apps that this app depends on (each entry should be the name of another app) -- `defaultConfig` - A set of default configuration values for the app (when an app is added using `wild-app-add`, these values will be added to the Wild Cloud `config.yaml` file) -- `defaultSecrets` - A list of secrets that must be set in the Wild Cloud `secrets.yaml` file for the app to function properly (these secrets are typically sensitive information like database passwords or API keys; keys with random values will be generated automatically when the app is added) - -### Kustomization Configuration - -Wild Cloud apps use standard Kustomize with required Wild Cloud labels: - -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: myapp -labels: - - includeSelectors: true - pairs: - app: myapp - managedBy: kustomize - partOf: wild-cloud -resources: - - namespace.yaml - - deployment.yaml - - service.yaml - - ingress.yaml - - pvc.yaml - - db-init-job.yaml -``` - -**Kustomization Requirements**: -- Every Wild Cloud kustomization should include the Wild Cloud labels in its `kustomization.yaml` file (this allows Wild Cloud to identify and manage the app correctly) -- The `app` label and `namespace` should match the app's name/directory -- **includeSelectors: true** - Automatically applies labels to all resources AND their selectors - -#### Standard Wild Cloud Labels - -Wild Cloud uses a consistent labeling strategy across all apps: - -```yaml -labels: - - includeSelectors: true - pairs: - app: myapp # The app name (matches directory) - managedBy: kustomize # Managed by Kustomize - partOf: wild-cloud # Part of Wild Cloud ecosystem -``` - -The `includeSelectors: true` setting automatically applies these labels to all resources AND their selectors, which means: - -1. **Resource labels** - All resources get the standard Wild Cloud labels -2. **Selector labels** - All selectors automatically include these labels for robust selection - -This allows individual resources to use simple, component-specific selectors: - -```yaml -selector: - matchLabels: - component: web -``` - -Which Kustomize automatically expands to: - -```yaml -selector: - matchLabels: - app: myapp - component: web - managedBy: kustomize - partOf: wild-cloud -``` - -### Template System - -Wild Cloud apps are actually **templates** that get compiled with your specific configuration when you run `wild-app-add`. 
This allows for:

- **Dynamic Configuration** - Reference user settings via `{{ .apps.appname.key }}`
- **Gomplate Processing** - Full template capabilities including conditionals and loops
- **Secret Integration** - Automatic secret generation and referencing
- **Domain Management** - Automatic subdomain assignment based on your domain

**Template Variable Examples**:
```yaml
# Configuration references
image: "{{ .apps.myapp.image }}"
domain: "{{ .apps.myapp.domain }}"
namespace: "{{ .apps.myapp.namespace }}"

# Cloud-wide settings
timezone: "{{ .cloud.timezone }}"
domain_suffix: "{{ .cloud.domain }}"

# Conditional logic
{{- if .apps.myapp.enableSSL }}
- name: ENABLE_SSL
  value: "true"
{{- end }}
```

## App Lifecycle Management

### 1. Discovery Phase
**Command**: `wild-apps-list`

Lists all available applications with metadata:
```bash
wild-apps-list --verbose # Detailed view with descriptions
wild-apps-list --json # JSON output for automation
```

Shows:
- App name and description
- Version and dependencies
- Installation status
- Required configuration

### 2. Configuration Phase
**Command**: `wild-app-add <app-name>`

Processes app templates and prepares for deployment:

**What it does**:
1. Reads app manifest directly from Wild Cloud repository
2. Merges default configuration with existing `config.yaml`
3. Generates required secrets automatically
4. Compiles templates with gomplate using your configuration
5. Creates ready-to-deploy Kustomize files in `apps/<app-name>/`

**Generated Files**:
- Compiled Kubernetes manifests (no more template variables)
- Standard Kustomize configuration
- App-specific configuration merged into your `config.yaml`
- Required secrets added to your `secrets.yaml`

### 3. Deployment Phase
**Command**: `wild-app-deploy <app-name>`

Deploys the app to your Kubernetes cluster:

**Deployment Process**:
1. Creates namespace if it doesn't exist
2. Handles app dependencies (deploys required apps first)
3. Creates secrets from your `secrets.yaml`
4. Applies Kustomize configuration to cluster
5. Copies TLS certificates to app namespace
6. Validates deployment success

**Options**:
- `--force` - Overwrite existing resources
- `--dry-run` - Preview changes without applying

### 4. Operations Phase

**Monitoring**: `wild-app-doctor <app-name>`
- Runs app-specific diagnostic tests
- Checks pod status, resource usage, connectivity
- Options: `--keep`, `--follow`, `--timeout`

**Updates**: Re-run `wild-app-add` then `wild-app-deploy`
- Use `--force` flag to overwrite existing configuration
- Updates configuration changes
- Handles image updates
- Preserves persistent data

**Removal**: `wild-app-delete <app-name>`
- Deletes namespace and all resources
- Removes local configuration files
- Options: `--force` for no confirmation

## Configuration System

### Configuration Storage

**Global Configuration** (`config.yaml`):
```yaml
cloud:
  domain: example.com
  timezone: America/New_York
apps:
  myapp:
    domain: app.example.com
    image: myapp:1.0.0
    storage: 20Gi
    timezone: UTC
```

**Secrets Management** (`secrets.yaml`):
```yaml
apps:
  myapp:
    dbPassword: "randomly-generated-password"
    adminPassword: "user-set-password"
  postgres:
    password: "randomly-generated-password"
```

### Secret Generation

When you run `wild-app-add`, required secrets are automatically generated:
- **Random Generation**: 32-character base64 strings for passwords/keys
- **User Prompts**: For secrets that need specific values
- **Preservation**: Existing secrets are never overwritten
- **Permissions**: `secrets.yaml` has 600 permissions (owner-only)

### Configuration Commands
```bash
# Read app configuration
wild-config apps.myapp.domain

# Set app configuration
wild-config-set apps.myapp.storage "50Gi"

# Read app secrets
wild-secret apps.myapp.dbPassword

# Set app secrets
wild-secret-set apps.myapp.adminPassword "my-secure-password"
```

## Networking and DNS

### External DNS Integration

Wild Cloud apps automatically manage DNS records through ingress annotations:

```yaml
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/target: {{ .cloud.domain }}
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
```

**How it works**:
1. App ingress created with external-dns annotations
2. ExternalDNS controller detects new ingress
3. Creates CNAME record: `app.yourdomain.com` → `yourdomain.com`
4. DNS resolves to MetalLB load balancer IP
5. Traefik routes traffic to appropriate service

### HTTPS Certificate Management

Automatic TLS certificates via cert-manager:

```yaml
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
spec:
  tls:
    - hosts:
        - {{ .apps.myapp.domain }}
      secretName: myapp-tls
```

**Certificate Lifecycle**:
1. Ingress created with TLS configuration
2. cert-manager detects certificate requirement
3. Let's Encrypt challenge initiated automatically
4. Certificate issued and stored in Kubernetes secret
5. Traefik uses certificate for TLS termination
6. 
Automatic renewal before expiration

## Database Integration

### Database Initialization Jobs

Apps that require databases use initialization jobs to set up the database before the main application starts:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-db-init
spec:
  template:
    spec:
      containers:
        - name: db-init
          image: postgres:15
          command:
            - /bin/bash
            - -c
            - |
              # Postgres has no CREATE DATABASE IF NOT EXISTS, so run each
              # statement separately and tolerate "already exists" on re-runs
              export PGPASSWORD=$ROOT_PASSWORD
              psql -h $DB_HOST -U postgres -c "CREATE DATABASE $DB_NAME;" || true
              psql -h $DB_HOST -U postgres -c "CREATE USER $DB_USER WITH PASSWORD '$DB_PASSWORD';" || true
              psql -h $DB_HOST -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE $DB_NAME TO $DB_USER;"
          env:
            - name: DB_HOST
              value: {{ .apps.myapp.dbHostname }}
            - name: ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: apps.postgres.password
      restartPolicy: OnFailure
```

**Database URL Secrets**: For apps requiring database URLs with embedded credentials, always use dedicated secrets:

```yaml
# In manifest.yaml
defaultSecrets:
  - apps.myapp.dbUrl

# Generated secret (by wild-app-add)
apps:
  myapp:
    dbUrl: "postgresql://myapp:password123@postgres.postgres.svc.cluster.local/myapp"
```

### Supported Databases

Wild Cloud apps commonly integrate with:
- **PostgreSQL** - Via `postgres` app dependency
- **MySQL** - Via `mysql` app dependency
- **Redis** - Via `redis` app dependency
- **SQLite** - For apps with embedded database needs

## Storage Management

### Persistent Volume Claims

Apps requiring persistent storage define PVCs:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: {{ .apps.myapp.storage }}
```

**Storage Integration**:
- **Longhorn Storage Class** - Distributed, replicated storage
- **Dynamic Provisioning** - Automatic volume creation
- **Backup Support** - Via `wild-app-backup` command
- **Expansion** - Update storage size in configuration

### Backup and Restore

**Application Backup**: `wild-app-backup <app-name>`
- Discovers databases and PVCs automatically
- Creates restic snapshots with deduplication
- Supports PostgreSQL and MySQL database backups
- Streams PVC data for efficient storage

**Application Restore**: `wild-app-restore <app-name>`
- Restores from restic snapshots
- Options: `--db-only`, `--pvc-only`, `--skip-globals`
- Creates safety snapshots before destructive operations

## Security Considerations

### Pod Security Standards

All Wild Cloud apps comply with Pod Security Standards:

```yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
        runAsGroup: 999
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: false # Set to true when possible
```

### Secret Management

- **Kubernetes Secrets** - All sensitive data stored as Kubernetes secrets
- **Secret References** - Apps reference secrets via `secretKeyRef`, never inline
- **Full Dotted Paths** - Always use complete secret paths (e.g., `apps.myapp.dbPassword`)
- **No Plaintext** - Secrets never stored in manifests or config files

### Network Policies

Apps can define network policies for traffic isolation:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: myapp-network-policy
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
- name: traefik -``` - -## Available Applications - -Wild Cloud includes apps for common self-hosted services: - -### Content Management -- **Ghost** - Publishing platform for blogs and websites -- **Discourse** - Community discussion platform - -### Development & Project Management Tools -- **Gitea** - Self-hosted Git service with web interface -- **OpenProject** - Open-source project management software -- **Docker Registry** - Private container image registry - -### Media & File Management -- **Immich** - Self-hosted photo and video backup solution - -### Communication -- **Keila** - Newsletter and email marketing platform -- **Listmonk** - Newsletter and mailing list manager - -### Databases -- **PostgreSQL** - Relational database service -- **MySQL** - Relational database service -- **Redis** - In-memory data structure store -- **Memcached** - Distributed memory caching system - -### AI/ML -- **vLLM** - Fast LLM inference server with OpenAI-compatible API - -### Examples & Templates -- **example-admin** - Example admin interface application -- **example-app** - Template application for development reference - -## Creating Custom Apps - -### App Development Process - -1. **Create Directory**: `apps/myapp/` -2. **Write Manifest**: Define metadata and configuration -3. **Create Resources**: Kubernetes manifests with templates -4. **Test Locally**: Use `wild-app-add` and `wild-app-deploy` -5. **Validate**: Ensure all resources deploy correctly - -### Best Practices - -**Manifest Design**: -- Include comprehensive `defaultConfig` for all configurable values -- List all `defaultSecrets` the app needs -- Specify dependencies in `requires` field -- Use semantic versioning - -**Template Usage**: -- Reference configuration via `{{ .apps.myapp.key }}` -- Use conditionals for optional features -- Include proper gomplate syntax for lists and objects -- Test template compilation - -**Resource Configuration**: -- Always include Wild Cloud standard labels -- Use appropriate security contexts -- Define resource requests and limits -- Include health checks and probes - -**Storage and Networking**: -- Use Longhorn storage class for persistence -- Include external-dns annotations for automatic DNS -- Configure TLS certificates via cert-manager annotations -- Follow database initialization patterns for data apps - -### Converting from Helm Charts - -Wild Cloud provides tooling to convert Helm charts to Wild Cloud apps: - -```bash -# Convert Helm chart to Kustomize base -helm fetch --untar --untardir charts stable/mysql -helm template --output-dir base --namespace mysql mysql charts/mysql -cd base/mysql -kustomize create --autodetect - -# Then customize for Wild Cloud: -# 1. Add manifest.yaml -# 2. Replace hardcoded values with templates -# 3. Update labels to Wild Cloud standard -# 4. 
Configure secrets properly
```

## Troubleshooting Applications

### Common Issues

**App Won't Start**:
- Check pod logs: `kubectl logs -n <namespace> deployment/<app-name>`
- Verify secrets exist: `kubectl get secrets -n <namespace>`
- Check resource constraints: `kubectl describe pod <pod-name> -n <namespace>`

**Database Connection Issues**:
- Verify database is running: `kubectl get pods -n <database-namespace>`
- Check database initialization job: `kubectl logs job/<app-name>-db-init -n <namespace>`
- Validate database credentials in secrets

**DNS/Certificate Issues**:
- Check ingress status: `kubectl get ingress -n <namespace>`
- Verify certificate creation: `kubectl get certificates -n <namespace>`
- Check external-dns logs: `kubectl logs -n external-dns deployment/external-dns`

**Storage Issues**:
- Check PVC status: `kubectl get pvc -n <namespace>`
- Verify Longhorn cluster health: Access Longhorn UI
- Check storage class availability: `kubectl get storageclass`

### Diagnostic Tools

```bash
# App-specific diagnostics
wild-app-doctor <app-name>

# Resource inspection
kubectl get all -n <namespace>
kubectl describe deployment/<app-name> -n <namespace>

# Log analysis
kubectl logs -f deployment/<app-name> -n <namespace>
kubectl logs job/<app-name>-db-init -n <namespace>

# Configuration verification
wild-config apps.<app-name>
wild-secret apps.<app-name>
```

The Wild Cloud apps system provides a powerful, consistent way to deploy and manage self-hosted applications with enterprise-grade features like automatic HTTPS, DNS management, backup/restore, and integrated security.
\ No newline at end of file
diff --git a/ai/wildcloud-v.PoC/bin-scripts.md b/ai/wildcloud-v.PoC/bin-scripts.md
deleted file mode 100644
index 7426fdd..0000000
--- a/ai/wildcloud-v.PoC/bin-scripts.md
+++ /dev/null
@@ -1,262 +0,0 @@
# Wild Cloud CLI Scripts Reference

Wild Cloud provides 34+ command-line tools (all prefixed with `wild-`) for managing your personal Kubernetes cloud infrastructure. These scripts handle everything from initial setup to day-to-day operations.
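
Because every tool ships in the repository's `bin/` directory and shares the `wild-` prefix, you can enumerate what is available straight from the shell; a small sketch, assuming `WC_ROOT` points at your Wild Cloud checkout:

```bash
# List every Wild Cloud command bundled with the repository
ls "$WC_ROOT/bin" | grep '^wild-'

# Or, if the bin directory is already on your PATH, let the shell enumerate them
compgen -c wild- | sort -u
```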
- -## Script Categories - -### 🚀 Initial Setup & Scaffolding - -**`wild-init`** - Initialize new Wild Cloud instance -- Creates `.wildcloud` directory structure -- Copies template files from repository -- Sets up basic configuration (email, domains, cluster name) -- **Usage**: `wild-init` -- **When to use**: First command to run in a new directory - -**`wild-setup`** - Master setup orchestrator -- Runs complete Wild Cloud deployment sequence -- Options: `--skip-cluster`, `--skip-services` -- Executes: cluster setup → services setup -- **Usage**: `wild-setup [options]` -- **When to use**: After `wild-init` for complete automated setup - -**`wild-update-docs`** - Copy documentation to cloud directory -- Options: `--force` to overwrite existing docs -- Copies `/docs` from repository to your cloud home -- **Usage**: `wild-update-docs [--force]` - -### ⚙️ Configuration Management - -**`wild-config`** - Read configuration values -- Access YAML paths from `config.yaml` (e.g., `cluster.name`, `cloud.domain`) -- Option: `--check` to test key existence -- **Usage**: `wild-config ` or `wild-config --check ` - -**`wild-config-set`** - Write configuration values -- Sets values using YAML paths, creates config file if needed -- **Usage**: `wild-config-set ` - -**`wild-secret`** - Read secret values -- Similar to `wild-config` but for sensitive data in `secrets.yaml` -- File has restrictive permissions (600) -- **Usage**: `wild-secret ` or `wild-secret --check ` - -**`wild-secret-set`** - Write secret values -- Generates random values if none provided (32-char base64) -- **Usage**: `wild-secret-set [value]` - -**`wild-compile-template`** - Process gomplate templates -- Uses `config.yaml` and `secrets.yaml` as template context -- **Usage**: `wild-compile-template < template.yaml` - -**`wild-compile-template-dir`** - Process template directories -- Options: `--clean` to remove destination first -- **Usage**: `wild-compile-template-dir ` - -### 🏗️ Cluster Infrastructure Management - -**`wild-setup-cluster`** - Complete cluster setup (Phases 1-3) -- Automated control plane node setup and bootstrapping -- Configures Talos control plane nodes using wild-node-setup -- Options: `--skip-hardware` -- **Usage**: `wild-setup-cluster [options]` -- **Requires**: `wild-init` completed first - -**`wild-cluster-config-generate`** - Generate Talos cluster config -- Creates base `controlplane.yaml` and `worker.yaml` -- Generates cluster secrets using `talosctl gen config` -- **Usage**: `wild-cluster-config-generate` - -**`wild-node-setup`** - Complete node lifecycle management -- Handles detect → configure → patch → deploy for individual nodes -- Automatically detects maintenance mode and handles IP transitions -- Options: `--reconfigure`, `--no-deploy` -- **Usage**: `wild-node-setup [options]` -- **Examples**: - - `wild-node-setup control-1` (complete setup) - - `wild-node-setup worker-1 --reconfigure` (force node reconfiguration) - - `wild-node-setup control-2 --no-deploy` (configuration only) - -**`wild-node-detect`** - Hardware detection utility -- Discovers network interfaces and disks from maintenance mode -- Returns JSON with hardware specifications and maintenance mode status -- **Usage**: `wild-node-detect ` -- **Note**: Primarily used internally by `wild-node-setup` - -**`wild-cluster-node-ip`** - Get node IP addresses -- Sources: config.yaml, kubectl, or talosctl -- Options: `--from-config`, `--from-talosctl` -- **Usage**: `wild-cluster-node-ip [options]` - -### 🔧 Cluster Services Management - 
**`wild-setup-services`** - Set up all cluster services (Phase 4)
- Manages MetalLB, Traefik, cert-manager, etc. in dependency order
- Options: `--fetch` for fresh templates, `--no-deploy` for config-only
- **Usage**: `wild-setup-services [options]`
- **Requires**: Working Kubernetes cluster

**`wild-service-setup`** - Complete service lifecycle management
- Handles fetch → configure → deploy for individual services
- Options: `--fetch` for fresh templates, `--no-deploy` for config-only
- **Usage**: `wild-service-setup <service> [--fetch] [--no-deploy]`
- **Examples**:
  - `wild-service-setup cert-manager` (configure + deploy)
  - `wild-service-setup cert-manager --fetch` (fetch + configure + deploy)
  - `wild-service-setup cert-manager --no-deploy` (configure only)

**`wild-dashboard-token`** - Get Kubernetes dashboard token
- Extracts token for dashboard authentication
- Copies to clipboard if available
- **Usage**: `wild-dashboard-token`

**`wild-cluster-secret-copy`** - Copy secrets between namespaces
- **Usage**: `wild-cluster-secret-copy <source-ns>/<secret> <target-ns> [target-ns2]`

### 📱 Application Management

**`wild-apps-list`** - List available applications
- Shows metadata, installation status, dependencies
- Options: `--verbose`, `--json`, `--yaml`
- **Usage**: `wild-apps-list [options]`

**`wild-app-add`** - Configure app from repository
- Processes manifest.yaml with configuration
- Generates required secrets automatically
- Options: `--force` to overwrite existing app files
- **Usage**: `wild-app-add <app-name> [--force]`

**`wild-app-deploy`** - Deploy application to cluster
- Creates namespaces, handles dependencies
- Options: `--force`, `--dry-run`
- **Usage**: `wild-app-deploy <app-name> [options]`

**`wild-app-delete`** - Remove application
- Deletes namespace and all resources
- Options: `--force`, `--dry-run`
- **Usage**: `wild-app-delete <app-name> [options]`

**`wild-app-doctor`** - Run application diagnostics
- Executes app-specific diagnostic tests
- Options: `--keep`, `--follow`, `--timeout`
- **Usage**: `wild-app-doctor <app-name> [options]`

### 💾 Backup & Restore

**`wild-backup`** - Comprehensive backup system
- Backs up home directory, apps, and cluster resources
- Options: `--home-only`, `--apps-only`, `--cluster-only`
- Uses restic for deduplication
- **Usage**: `wild-backup [options]`

**`wild-app-backup`** - Application-specific backups
- Discovers databases and PVCs automatically
- Supports PostgreSQL and MySQL
- Options: `--all` for all applications
- **Usage**: `wild-app-backup <app-name> [--all]`

**`wild-app-restore`** - Application restore
- Restores databases and/or PVC data
- Options: `--db-only`, `--pvc-only`, `--skip-globals`
- **Usage**: `wild-app-restore <app-name> [options]`

### 🔍 Utilities & Helpers

**`wild-health`** - Comprehensive infrastructure validation
- Validates core components (MetalLB, Traefik, CoreDNS)
- Checks installed services (cert-manager, ExternalDNS, Kubernetes Dashboard)
- Tests DNS resolution, routing, certificates, and storage systems
- **Usage**: `wild-health`

**`wild-talos-schema`** - Talos schema management
- Handles configuration schema operations
- **Usage**: `wild-talos-schema [options]`

**`wild-cluster-node-boot-assets-download`** - Download Talos assets
- Downloads installation images for nodes
- **Usage**: `wild-cluster-node-boot-assets-download`

**`wild-dnsmasq-install`** - Install dnsmasq services
- Sets up DNS and DHCP for cluster networking
- **Usage**: `wild-dnsmasq-install`

## Common Usage Patterns

### Complete 
Setup from Scratch -```bash -wild-init # Initialize cloud directory -wild-setup # Complete automated setup -# or step by step: -wild-setup-cluster # Just cluster infrastructure -wild-setup-services # Just cluster services -``` - -### Individual Service Management -```bash -# Most common - reconfigure and deploy service -wild-service-setup cert-manager - -# Get fresh templates and deploy (for updates) -wild-service-setup cert-manager --fetch - -# Configure only, don't deploy (for planning) -wild-service-setup cert-manager --no-deploy - -# Fix failed service and resume setup -wild-service-setup cert-manager --fetch -wild-setup-services # Resume full setup if needed -``` - -### Application Management -```bash -wild-apps-list # See available apps -wild-app-add ghost # Configure app -wild-app-deploy ghost # Deploy to cluster -wild-app-doctor ghost # Troubleshoot issues -``` - -### Configuration Management -```bash -wild-config cluster.name # Read values -wild-config-set apps.ghost.domain "blog.example.com" # Write values -wild-secret apps.ghost.adminPassword # Read secrets -wild-secret-set apps.ghost.apiKey # Generate random secret -``` - -### Cluster Operations -```bash -wild-cluster-node-ip control-1 # Get node IP -wild-dashboard-token # Get dashboard access -wild-health # Check system health -``` - -## Script Design Principles - -1. **Consistent Interface**: All scripts use `--help` and follow common argument patterns -2. **Error Handling**: All scripts use `set -e` and `set -o pipefail` for robust error handling -3. **Idempotent**: Scripts check existing state before making changes -4. **Template-Driven**: Extensive use of gomplate for configuration flexibility -5. **Environment-Aware**: Scripts source `wild-common.sh` and initialize Wild Cloud environment -6. **Progressive Disclosure**: Complex operations broken into phases with individual controls - -## Dependencies Between Scripts - -### Setup Phase Dependencies -1. `wild-init` → creates basic structure -2. `wild-setup-cluster` → provisions infrastructure -3. `wild-setup-services` → installs cluster services -4. `wild-setup` → orchestrates all phases - -### App Deployment Pipeline -1. `wild-apps-list` → discover applications -2. `wild-app-add` → configure and prepare application -3. `wild-app-deploy` → deploy to cluster - -### Node Management Flow -1. `wild-cluster-config-generate` → base configurations -2. `wild-node-setup ` → atomic node operations (detect → patch → deploy) - - Internally uses `wild-node-detect` for hardware discovery - - Generates node-specific patches and final configurations - - Deploys configuration to target node - -All scripts are designed to work together as a cohesive Infrastructure as Code system for personal Kubernetes deployments. \ No newline at end of file diff --git a/ai/wildcloud-v.PoC/configuration-system.md b/ai/wildcloud-v.PoC/configuration-system.md deleted file mode 100644 index 5960705..0000000 --- a/ai/wildcloud-v.PoC/configuration-system.md +++ /dev/null @@ -1,602 +0,0 @@ -# Wild Cloud Configuration System - -Wild Cloud uses a comprehensive configuration management system that handles both non-sensitive configuration data and sensitive secrets through separate files and commands. The system supports YAML path-based access, template processing, and environment-specific customization. - -## Configuration Architecture - -### Core Components - -1. **`config.yaml`** - Main configuration file for non-sensitive settings -2. **`secrets.yaml`** - Encrypted/protected storage for sensitive data -3. 
**`.wildcloud/`** - Project marker and cache directory -4. **`env.sh`** - Environment setup and path configuration -5. **Template System** - gomplate-based dynamic configuration processing - -### File Structure of a Wild Cloud Project - -``` -your-cloud-directory/ -├── .wildcloud/ # Project marker and cache -│ ├── cache/ # Downloaded templates and temporary files -│ └── logs/ # Operation logs -├── config.yaml # Main configuration (tracked in git) -├── secrets.yaml # Sensitive data (NOT tracked in git, 600 perms) -├── env.sh # Environment setup (auto-generated) -├── apps/ # Deployed application configurations -├── setup/ # Infrastructure setup files -└── docs/ # Project documentation -``` - -## Configuration File (`config.yaml`) - -### Structure and Organization - -The configuration file uses a hierarchical YAML structure for organizing settings: - -```yaml -# Cloud-wide settings -cloud: - domain: "example.com" - email: "admin@example.com" - timezone: "America/New_York" - -# Cluster infrastructure settings -cluster: - name: "wild-cluster" - nodeCount: 3 - network: - subnet: "192.168.1.0/24" - gateway: "192.168.1.1" - dnsServer: "192.168.1.50" - metallbPool: "192.168.1.80-89" - controlPlaneVip: "192.168.1.90" - nodes: - control-1: - ip: "192.168.1.91" - mac: "00:11:22:33:44:55" - interface: "eth0" - disk: "/dev/sda" - control-2: - ip: "192.168.1.92" - mac: "00:11:22:33:44:56" - interface: "eth0" - disk: "/dev/sda" - -# Application-specific settings -apps: - ghost: - domain: "blog.example.com" - image: "ghost:5.0.0" - storage: "10Gi" - timezone: "UTC" - namespace: "ghost" - immich: - domain: "photos.example.com" - serverImage: "ghcr.io/immich-app/immich-server:release" - storage: "250Gi" - namespace: "immich" - -# Service configurations -services: - traefik: - replicas: 2 - dashboard: true - longhorn: - defaultReplicas: 3 - storageClass: "longhorn" -``` - -### Configuration Commands - -**Reading Configuration Values**: -```bash -# Read simple values -wild-config cloud.domain # "example.com" -wild-config cluster.name # "wild-cluster" - -# Read nested values -wild-config apps.ghost.domain # "blog.example.com" -wild-config cluster.nodes.control-1.ip # "192.168.1.91" - -# Check if key exists -wild-config --check apps.newapp.domain # Returns exit code 0/1 -``` - -**Writing Configuration Values**: -```bash -# Set simple values -wild-config-set cloud.domain "newdomain.com" -wild-config-set cluster.nodeCount 5 - -# Set nested values -wild-config-set apps.ghost.storage "20Gi" -wild-config-set cluster.nodes.worker-1.ip "192.168.1.94" - -# Set complex values (JSON format) -wild-config-set apps.ghost '{"domain":"blog.com","storage":"50Gi"}' -``` - -### Configuration Sections - -#### Cloud Settings (`cloud.*`) -Global settings that affect the entire Wild Cloud deployment: - -```yaml -cloud: - domain: "example.com" # Primary domain for services - email: "admin@example.com" # Contact email for certificates - timezone: "America/New_York" # Default timezone for services - backupLocation: "s3://backup" # Backup storage location - monitoring: true # Enable monitoring services -``` - -#### Cluster Settings (`cluster.*`) -Infrastructure and node configuration: - -```yaml -cluster: - name: "production-cluster" - version: "v1.28.0" - network: - subnet: "10.0.0.0/16" # Cluster network range - serviceCIDR: "10.96.0.0/12" # Service network range - podCIDR: "10.244.0.0/16" # Pod network range - nodes: - control-1: - ip: "10.0.0.10" - role: "controlplane" - taints: [] - worker-1: - ip: "10.0.0.20" - role: "worker" - 
labels: - node-type: "compute" -``` - -#### Application Settings (`apps.*`) -Per-application configuration that overrides defaults from app manifests: - -```yaml -apps: - postgresql: - storage: "100Gi" - maxConnections: 200 - sharedBuffers: "256MB" - redis: - memory: "1Gi" - persistence: true - ghost: - domain: "blog.example.com" - theme: "casper" - storage: "10Gi" - replicas: 2 -``` - -## Secrets Management (`secrets.yaml`) - -### Security Model - -The `secrets.yaml` file stores all sensitive data with the following security measures: - -- **File Permissions**: Automatically set to 600 (owner read/write only) -- **Git Exclusion**: Included in `.gitignore` by default -- **Encryption Support**: Can be encrypted at rest using tools like `age` or `gpg` -- **Access Control**: Only Wild Cloud commands can read/write secrets - -### Secret Structure - -```yaml -# Generated cluster secrets -cluster: - talos: - secrets: "base64-encoded-cluster-secrets" - adminKey: "talos-admin-private-key" - kubernetes: - adminToken: "k8s-admin-service-account-token" - -# Application secrets -apps: - postgresql: - rootPassword: "randomly-generated-32-char-string" - replicationPassword: "randomly-generated-32-char-string" - ghost: - dbPassword: "randomly-generated-password" - adminPassword: "user-set-password" - jwtSecret: "randomly-generated-jwt-secret" - immich: - dbPassword: "randomly-generated-password" - dbUrl: "postgresql://immich:password@postgres:5432/immich" - jwtSecret: "jwt-signing-key" - -# External service credentials -external: - cloudflare: - apiToken: "cloudflare-dns-api-token" - letsencrypt: - email: "admin@example.com" - backup: - s3AccessKey: "backup-s3-access-key" - s3SecretKey: "backup-s3-secret-key" -``` - -### Secret Commands - -**Reading Secrets**: -```bash -# Read secret values -wild-secret apps.postgresql.rootPassword -wild-secret cluster.kubernetes.adminToken - -# Check if secret exists -wild-secret --check apps.newapp.apiKey -``` - -**Writing Secrets**: -```bash -# Set specific secret value -wild-secret-set apps.ghost.adminPassword "my-secure-password" - -# Generate random secret (if no value provided) -wild-secret-set apps.newapp.apiKey # Generates 32-char base64 string - -# Set complex secret (JSON format) -wild-secret-set apps.database '{"user":"admin","password":"secret"}' -``` - -### Automatic Secret Generation - -When you run `wild-app-add`, Wild Cloud automatically generates required secrets: - -1. **Reads App Manifest**: Identifies `defaultSecrets` list -2. **Checks Existing Secrets**: Never overwrites existing values -3. **Generates Missing Secrets**: Creates secure random values -4. 
**Updates secrets.yaml**: Adds new secrets with proper structure - -**Example App Manifest**: -```yaml -name: ghost -defaultSecrets: - - apps.ghost.dbPassword # Auto-generated if missing - - apps.ghost.jwtSecret # Auto-generated if missing - - apps.postgresql.password # Auto-generated if missing (dependency) -``` - -**Resulting secrets.yaml**: -```yaml -apps: - ghost: - dbPassword: "aB3kL9mN2pQ7rS8tU1vW4xY5zA6bC0dE" - jwtSecret: "jF2gH5iJ8kL1mN4oP7qR0sT3uV6wX9yZ" - postgresql: - password: "eE8fF1gG4hH7iI0jJ3kK6lL9mM2nN5oO" -``` - -## Template System - -### gomplate Integration - -Wild Cloud uses [gomplate](https://gomplate.ca/) for dynamic configuration processing, allowing templates to access both configuration and secrets: - -```yaml -# Template example (before processing) -apiVersion: v1 -kind: ConfigMap -metadata: - name: ghost-config - namespace: {{ .apps.ghost.namespace }} -data: - url: "https://{{ .apps.ghost.domain }}" - timezone: "{{ .apps.ghost.timezone | default .cloud.timezone }}" - database_host: "{{ .apps.postgresql.hostname }}" - # Conditionals - {{- if .apps.ghost.enableSSL }} - ssl_enabled: "true" - {{- end }} - # Loops - allowed_domains: | - {{- range .apps.ghost.allowedDomains }} - - {{ . }} - {{- end }} -``` - -### Template Processing Commands - -**Process Single Template**: -```bash -# From stdin -cat template.yaml | wild-compile-template > output.yaml - -# With custom context -echo "domain: {{ .cloud.domain }}" | wild-compile-template -``` - -**Process Template Directory**: -```bash -# Recursively process all templates -wild-compile-template-dir source-dir output-dir - -# Clean destination first -wild-compile-template-dir --clean source-dir output-dir -``` - -### Template Context - -Templates have access to the complete configuration and secrets context: - -```go -// Available template variables -.cloud.* // All cloud configuration -.cluster.* // All cluster configuration -.apps.* // All application configuration -.services.* // All service configuration - -// Special functions -.cloud.domain // Primary domain -default "fallback" // Default value if key missing -env "VAR_NAME" // Environment variable -file "path/to/file" // File contents -``` - -**Template Examples**: -```yaml -# Basic variable substitution -domain: {{ .apps.myapp.domain }} - -# Default values -timezone: {{ .apps.myapp.timezone | default .cloud.timezone }} - -# Conditionals -{{- if .apps.myapp.enableFeature }} -feature_enabled: true -{{- else }} -feature_enabled: false -{{- end }} - -# Lists and iteration -allowed_hosts: -{{- range .apps.myapp.allowedHosts }} - - {{ . 
}} -{{- end }} - -# Complex expressions -replicas: {{ if eq .cluster.environment "production" }}3{{ else }}1{{ end }} -``` - -## Environment Setup - -### Environment Detection - -Wild Cloud automatically detects and configures the environment through several mechanisms: - -**Project Detection**: -- Searches for `.wildcloud` directory in current or parent directories -- Sets `WC_HOME` to the directory containing `.wildcloud` -- Fails if no Wild Cloud project found - -**Repository Detection**: -- Locates Wild Cloud repository (source code) -- Sets `WC_ROOT` to repository location -- Used for accessing app templates and setup scripts - -### Environment Variables - -**Key Environment Variables**: -```bash -WC_HOME="/path/to/your-cloud" # Your cloud directory -WC_ROOT="/path/to/wild-cloud-repo" # Wild Cloud repository -PATH="$WC_ROOT/bin:$PATH" # Wild Cloud commands available -KUBECONFIG="$WC_HOME/.kube/config" # Kubernetes configuration -TALOSCONFIG="$WC_HOME/.talos/config" # Talos configuration -``` - -**Environment Setup Script** (`env.sh`): -```bash -#!/bin/bash -# Auto-generated environment setup - -export WC_HOME="/home/user/my-cloud" -export WC_ROOT="/opt/wild-cloud" -export PATH="$WC_ROOT/bin:$PATH" -export KUBECONFIG="$WC_HOME/.kubeconfig" -export TALOSCONFIG="$WC_HOME/setup/cluster-nodes/generated/talosconfig" - -# Source this file to set up Wild Cloud environment -# source env.sh -``` - -### Common Script Pattern - -Most Wild Cloud scripts follow this initialization pattern: - -```bash -#!/bin/bash -set -e -set -o pipefail - -# Initialize Wild Cloud environment -if [ -z "${WC_ROOT}" ]; then - print "WC_ROOT is not set." - exit 1 -else - source "${WC_ROOT}/scripts/common.sh" - init_wild_env -fi - -# Script logic here... -``` - -## Configuration Validation - -### Schema Validation - -Wild Cloud validates configuration against expected schemas: - -**Cluster Configuration Validation**: -- Node IP addresses are valid and unique -- Network ranges don't overlap -- Required fields are present -- Hardware specifications meet minimums - -**Application Configuration Validation**: -- Domain names are valid DNS names -- Storage sizes use valid Kubernetes formats -- Image references are valid container images -- Dependencies are satisfied - -### Validation Commands - -```bash -# Validate current configuration -wild-config --validate - -# Check specific configuration sections -wild-config --validate --section cluster -wild-config --validate --section apps.ghost - -# Test template compilation -wild-compile-template --validate < template.yaml -``` - -## Configuration Best Practices - -### Organization - -**Hierarchical Structure**: -- Group related settings under common prefixes -- Use consistent naming conventions -- Keep application configs under `apps.*` -- Separate infrastructure from application settings - -**Documentation**: -```yaml -# Document complex configurations -cluster: - # Node configuration - update IPs after hardware changes - nodes: - control-1: - ip: "192.168.1.91" # Main control plane node - interface: "eth0" # Primary network interface -``` - -### Security - -**Configuration Security**: -- Never store secrets in `config.yaml` -- Use `wild-secret-set` for all sensitive data -- Regularly rotate generated secrets -- Backup `secrets.yaml` securely - -**Access Control**: -```bash -# Ensure proper permissions -chmod 600 secrets.yaml -chmod 644 config.yaml - -# Restrict directory access -chmod 755 your-cloud-directory -chmod 700 .wildcloud/ -``` - -### Version Control - -**Git 
Integration**: -```gitignore -# .gitignore for Wild Cloud projects -secrets.yaml # Never commit secrets -.wildcloud/cache/ # Temporary files -.wildcloud/logs/ # Operation logs -setup/cluster-nodes/generated/ # Generated cluster configs -.kube/ # Kubernetes configs -.talos/ # Talos configs -``` - -**Configuration Changes**: -- Commit `config.yaml` changes with descriptive messages -- Tag major configuration changes -- Use branches for experimental configurations -- Document configuration changes in commit messages - -### Backup and Recovery - -**Configuration Backup**: -```bash -# Backup configuration and secrets -wild-backup --home-only - -# Export configuration for disaster recovery -cp config.yaml config-backup-$(date +%Y%m%d).yaml -cp secrets.yaml secrets-backup-$(date +%Y%m%d).yaml.gpg # Encrypt first -``` - -**Recovery Process**: -1. Restore `config.yaml` from backup -2. Decrypt and restore `secrets.yaml` -3. Re-run `wild-setup` if needed -4. Validate configuration with `wild-config --validate` - -## Advanced Configuration - -### Multi-Environment Setup - -**Development Environment**: -```yaml -cloud: - domain: "dev.example.com" -cluster: - name: "dev-cluster" - nodeCount: 1 -apps: - ghost: - domain: "blog.dev.example.com" - replicas: 1 -``` - -**Production Environment**: -```yaml -cloud: - domain: "example.com" -cluster: - name: "prod-cluster" - nodeCount: 5 -apps: - ghost: - domain: "blog.example.com" - replicas: 3 -``` - -### Configuration Inheritance - -**Base Configuration**: -```yaml -# config.base.yaml -cloud: - timezone: "UTC" - email: "admin@example.com" -apps: - postgresql: - storage: "10Gi" -``` - -**Environment-Specific Override**: -```yaml -# config.prod.yaml (merged with base) -apps: - postgresql: - storage: "100Gi" # Override for production - replicas: 3 # Additional production setting -``` - -### Dynamic Configuration - -**Runtime Configuration Updates**: -```bash -# Update configuration without restart -wild-config-set apps.ghost.replicas 3 -wild-app-deploy ghost # Apply changes - -# Rolling updates -wild-config-set apps.ghost.image "ghost:5.1.0" -wild-app-deploy ghost --rolling-update -``` - -The Wild Cloud configuration system provides a powerful, secure, and flexible foundation for managing complex infrastructure deployments while maintaining simplicity for common use cases. \ No newline at end of file diff --git a/ai/wildcloud-v.PoC/overview.md b/ai/wildcloud-v.PoC/overview.md deleted file mode 100644 index d3c4e27..0000000 --- a/ai/wildcloud-v.PoC/overview.md +++ /dev/null @@ -1,443 +0,0 @@ -# Wild Cloud Overview - -Wild Cloud is a complete, production-ready Kubernetes infrastructure designed for personal use. It combines enterprise-grade technologies to create a self-hosted cloud platform with automated deployment, HTTPS certificates, and web management interfaces. - -## What is Wild Cloud? 
- -### Vision -In a world where digital lives are increasingly controlled by large corporations, Wild Cloud puts you back in control by providing: - -- **Privacy**: Your data stays on your hardware, under your control -- **Ownership**: No subscription fees or sudden price increases -- **Freedom**: Run the apps you want, the way you want them -- **Learning**: Gain valuable skills in modern cloud technologies -- **Resilience**: Reduce reliance on third-party services that can disappear - -### Core Capabilities - -**Complete Infrastructure Stack**: -- Kubernetes cluster with Talos Linux -- Automatic HTTPS certificates via Let's Encrypt -- Load balancing with MetalLB -- Ingress routing with Traefik -- Distributed storage with Longhorn -- DNS management with CoreDNS and ExternalDNS - -**Application Platform**: -- One-command application deployment -- Pre-built apps for common self-hosted services -- Automatic database setup and configuration -- Integrated backup and restore system -- Web-based management interfaces - -**Enterprise Features**: -- High availability and fault tolerance -- Automated certificate management -- Network policies and security contexts -- Monitoring and observability -- Infrastructure as code principles - -## Technology Stack - -### Core Infrastructure -- **Talos Linux** - Immutable OS designed for Kubernetes -- **Kubernetes** - Container orchestration platform -- **MetalLB** - Load balancer for bare metal deployments -- **Traefik** - HTTP reverse proxy and ingress controller -- **Longhorn** - Distributed block storage system -- **cert-manager** - Automatic TLS certificate management - -### Supporting Services -- **CoreDNS** - DNS server for service discovery -- **ExternalDNS** - Automatic DNS record management -- **Kubernetes Dashboard** - Web UI for cluster management -- **restic** - Backup solution with deduplication -- **gomplate** - Template processor for configurations - -### Development Tools -- **Kustomize** - Kubernetes configuration management -- **kubectl** - Kubernetes command line interface -- **talosctl** - Talos Linux management tool -- **Bats** - Testing framework for bash scripts - -## Architecture Overview - -``` -┌─────────────────────────────────────────────────────────────────┐ -│ Internet │ -└─────────────────┬───────────────────────────────────────────────┘ - │ -┌─────────────────▼───────────────────────────────────────────────┐ -│ DNS Provider │ -│ (Cloudflare, Route53, etc.) │ -└─────────────────┬───────────────────────────────────────────────┘ - │ -┌─────────────────▼───────────────────────────────────────────────┐ -│ Your Network │ -│ ┌─────────────┐ ┌─────────────────────────────────────────┐ │ -│ │ dnsmasq │ │ Kubernetes Cluster │ │ -│ │ Server │ │ ┌─────────────┐ ┌─────────────────┐ │ │ -│ │ │ │ │ MetalLB │ │ Traefik │ │ │ -│ │ DNS + DHCP │ │ │ LoadBalancer│ │ Ingress │ │ │ -│ └─────────────┘ │ └─────────────┘ └─────────────────┘ │ │ -│ │ ┌───────────────────────────────────┐ │ │ -│ │ │ Applications │ │ │ -│ │ │ Ghost, Immich, Gitea, vLLM... │ │ │ -│ │ └───────────────────────────────────┘ │ │ -│ └─────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────────┘ -``` - -### Traffic Flow -1. **External Request** → DNS resolution via provider -2. **DNS Response** → Points to your cluster's external IP -3. **Network Request** → Hits MetalLB load balancer -4. **Load Balancer** → Routes to Traefik ingress controller -5. **Ingress Controller** → Terminates TLS and routes to application -6. 
**Application** → Serves content from Kubernetes pod - -## Getting Started - -### Prerequisites - -**Hardware Requirements**: -- Minimum 3 nodes for high availability -- 8GB RAM per node (16GB+ recommended) -- 100GB+ storage per node -- Gigabit network connectivity -- x86_64 architecture - -**Network Requirements**: -- All nodes on same network segment -- One dedicated machine for dnsmasq (can be lightweight) -- Static IP assignments or DHCP reservations -- Internet connectivity for downloads and certificates - -### Quick Start Guide - -#### 1. Install Dependencies -```bash -# Clone Wild Cloud repository -git clone https://github.com/your-org/wild-cloud -cd wild-cloud - -# Install required tools -scripts/setup-utils.sh -``` - -#### 2. Initialize Your Cloud -```bash -# Create and initialize new cloud directory -mkdir my-cloud && cd my-cloud -wild-init - -# Follow interactive setup prompts for: -# - Domain name configuration -# - Email for certificates -# - Network settings -``` - -#### 3. Deploy Infrastructure -```bash -# Complete automated setup -wild-setup - -# Or step-by-step: -wild-setup-cluster # Deploy Kubernetes cluster -wild-setup-services # Install core services -``` - -#### 4. Deploy Your First App -```bash -# List available applications -wild-apps-list - -# Deploy a blog -wild-app-add ghost -wild-app-deploy ghost - -# Access at https://ghost.yourdomain.com -``` - -#### 5. Verify Deployment -```bash -# Check system health -wild-health - -# Access Kubernetes dashboard -wild-dashboard-token -# Visit https://dashboard.internal.yourdomain.com -``` - -## Key Concepts - -### Configuration Management - -Wild Cloud uses a dual-file configuration system: - -**`config.yaml`** - Non-sensitive settings: -```yaml -cloud: - domain: "example.com" - email: "admin@example.com" -apps: - ghost: - domain: "blog.example.com" - storage: "10Gi" -``` - -**`secrets.yaml`** - Sensitive data (auto-generated): -```yaml -apps: - ghost: - dbPassword: "secure-random-password" - postgresql: - rootPassword: "another-secure-password" -``` - -### Template System - -All configurations are templates processed with gomplate: - -**Before Processing** (in repository): -```yaml -domain: {{ .apps.ghost.domain }} -storage: {{ .apps.ghost.storage | default "5Gi" }} -``` - -**After Processing** (in your cloud): -```yaml -domain: blog.example.com -storage: 10Gi -``` - -### Application Lifecycle - -1. **Discovery**: `wild-apps-list` - Browse available apps -2. **Configuration**: `wild-app-add app-name` - Configure and prepare application -3. **Deployment**: `wild-app-deploy app-name` - Deploy to cluster -4. 
**Operations**: `wild-app-doctor app-name` - Monitor and troubleshoot - -## Available Applications - -### Content Management & Publishing -- **Ghost** - Modern publishing platform for blogs and websites -- **Discourse** - Community discussion platform with modern features - -### Media & File Management -- **Immich** - Self-hosted photo and video backup solution - -### Development Tools -- **Gitea** - Self-hosted Git service with web interface -- **Docker Registry** - Private container image registry - -### Communication & Marketing -- **Keila** - Newsletter and email marketing platform -- **Listmonk** - High-performance newsletter and mailing list manager - -### Databases & Caching -- **PostgreSQL** - Advanced open-source relational database -- **MySQL** - Popular relational database management system -- **Redis** - In-memory data structure store and cache -- **Memcached** - Distributed memory caching system - -### AI & Machine Learning -- **vLLM** - High-performance LLM inference server with OpenAI-compatible API - -## Core Commands Reference - -### Setup & Initialization -```bash -wild-init # Initialize new cloud directory -wild-setup # Complete infrastructure deployment -wild-setup-cluster # Deploy Kubernetes cluster only -wild-setup-services # Deploy cluster services only -``` - -### Application Management -```bash -wild-apps-list # List available applications -wild-app-add # Configure application -wild-app-deploy # Deploy to cluster -wild-app-delete # Remove application -wild-app-doctor # Run diagnostics -``` - -### Configuration Management -```bash -wild-config # Read configuration value -wild-config-set # Set configuration value -wild-secret # Read secret value -wild-secret-set # Set secret value -``` - -### Operations & Monitoring -```bash -wild-health # System health check -wild-dashboard-token # Get dashboard access token -wild-backup # Backup system and apps -wild-app-backup # Backup specific application -``` - -## Best Practices - -### Security -- Never commit `secrets.yaml` to version control -- Use strong, unique passwords for all services -- Regularly update system and application images -- Monitor certificate expiration and renewal -- Implement network policies for production workloads - -### Configuration Management -- Store `config.yaml` in version control with proper .gitignore -- Document configuration changes in commit messages -- Use branches for experimental configurations -- Backup configuration files before major changes -- Test configuration changes in development environment - -### Operations -- Monitor cluster health with `wild-health` -- Set up regular backup schedules with `wild-backup` -- Keep applications updated with latest security patches -- Monitor disk usage and expand storage as needed -- Document custom configurations and procedures - -### Development -- Follow Wild Cloud app structure conventions -- Use proper Kubernetes security contexts -- Include comprehensive health checks and probes -- Test applications thoroughly before deployment -- Document application-specific configuration requirements - -## Common Use Cases - -### Personal Blog/Website -```bash -# Deploy Ghost blog with custom domain -wild-config-set apps.ghost.domain "blog.yourdomain.com" -wild-app-add ghost -wild-app-deploy ghost -``` - -### Photo Management -```bash -# Deploy Immich for photo backup and management -wild-app-add postgresql immich -wild-app-deploy postgresql immich -``` - -### Development Environment -```bash -# Set up Git hosting and container registry -wild-app-add 
gitea docker-registry -wild-app-deploy gitea docker-registry -``` - -### AI/ML Workloads -```bash -# Deploy vLLM for local AI inference -wild-config-set apps.vllm.model "Qwen/Qwen2.5-7B-Instruct" -wild-app-add vllm -wild-app-deploy vllm -``` - -## Troubleshooting - -### Common Issues - -**Cluster Not Responding**: -```bash -# Check node status -kubectl get nodes -talosctl health - -# Verify network connectivity -ping -``` - -**Applications Not Starting**: -```bash -# Check pod status -kubectl get pods -n - -# View logs -kubectl logs deployment/ -n - -# Run diagnostics -wild-app-doctor -``` - -**Certificate Issues**: -```bash -# Check certificate status -kubectl get certificates -A - -# View cert-manager logs -kubectl logs -n cert-manager deployment/cert-manager -``` - -**DNS Problems**: -```bash -# Test DNS resolution -nslookup - -# Check external-dns logs -kubectl logs -n external-dns deployment/external-dns -``` - -### Getting Help - -**Documentation**: -- Check `docs/` directory for detailed guides -- Review application-specific README files -- Consult Kubernetes and Talos documentation - -**Community Support**: -- Report issues on GitHub repository -- Join community forums and discussions -- Share configurations and troubleshooting tips - -**Professional Support**: -- Consider professional services for production deployments -- Engage with cloud infrastructure consultants -- Participate in training and certification programs - -## Advanced Topics - -### Custom Applications - -Create your own Wild Cloud applications: - -1. **Create App Directory**: `apps/myapp/` -2. **Define Manifest**: Include metadata and configuration defaults -3. **Create Templates**: Kubernetes resources with gomplate variables -4. **Test Deployment**: Use standard Wild Cloud workflow -5. **Share**: Contribute back to the community - -### Multi-Environment Deployments - -Manage multiple Wild Cloud instances: - -- **Development**: Single-node cluster for testing -- **Staging**: Multi-node cluster mirroring production -- **Production**: Full HA cluster with monitoring and backups - -### Integration with External Services - -Extend Wild Cloud capabilities: - -- **External DNS Providers**: Cloudflare, Route53, Google DNS -- **Backup Storage**: S3, Google Cloud Storage, Azure Blob -- **Monitoring**: Prometheus, Grafana, AlertManager -- **CI/CD**: GitLab CI, GitHub Actions, Jenkins - -### Performance Optimization - -Optimize for your workloads: - -- **Resource Allocation**: CPU and memory limits/requests -- **Storage Performance**: NVMe SSDs, storage classes -- **Network Optimization**: Network policies, service mesh -- **Scaling**: Horizontal pod autoscaling, cluster autoscaling - -Wild Cloud provides a solid foundation for personal cloud infrastructure while maintaining the flexibility to grow and adapt to changing needs. Whether you're running a simple blog or a complex multi-service application, Wild Cloud's enterprise-grade technologies ensure your infrastructure is reliable, secure, and maintainable. \ No newline at end of file diff --git a/ai/wildcloud-v.PoC/project-architecture.md b/ai/wildcloud-v.PoC/project-architecture.md deleted file mode 100644 index b9668b1..0000000 --- a/ai/wildcloud-v.PoC/project-architecture.md +++ /dev/null @@ -1,446 +0,0 @@ -# Wild Cloud Project Architecture - -Wild Cloud consists of two main directory structures: the **Wild Cloud Repository** (source code and templates) and **User Cloud Directories** (individual deployments). 
Understanding this architecture is essential for working with Wild Cloud effectively. - -## Architecture Overview - -``` -Wild Cloud Repository (/path/to/wild-cloud-repo) ← Source code, templates, scripts - ↓ -User Cloud Directory (/path/to/my-cloud) ← Individual deployment instance - ↓ -Kubernetes Cluster ← Running infrastructure -``` - -## Wild Cloud Repository Structure - -The Wild Cloud repository (`WC_ROOT`) contains the source code, templates, and tools: - -### `/bin/` - Command Line Interface -**Purpose**: All Wild Cloud CLI commands and utilities -``` -bin/ -├── wild-* # All user-facing commands (34+ scripts) -├── wild-common.sh # Common utilities and functions -├── README.md # CLI documentation -└── helm-chart-to-kustomize # Utility for converting Helm charts -``` - -**Key Commands**: -- **Setup**: `wild-init`, `wild-setup`, `wild-setup-cluster`, `wild-setup-services` -- **Apps**: `wild-app-*`, `wild-apps-list` -- **Config**: `wild-config*`, `wild-secret*` -- **Operations**: `wild-backup`, `wild-health`, `wild-dashboard-token` - -### `/apps/` - Application Templates -**Purpose**: Pre-built applications ready for deployment -``` -apps/ -├── README.md # Apps system documentation -├── ghost/ # Blog publishing platform -│ ├── manifest.yaml # App metadata and defaults -│ ├── kustomization.yaml # Kustomize configuration -│ ├── deployment.yaml # Kubernetes deployment -│ ├── service.yaml # Service definition -│ ├── ingress.yaml # HTTPS ingress -│ └── ... -├── immich/ # Photo management -├── gitea/ # Git hosting -├── postgresql/ # Database service -├── vllm/ # AI/LLM inference -└── ... -``` - -**Application Categories**: -- **Content Management**: Ghost, Discourse -- **Media**: Immich -- **Development**: Gitea, Docker Registry -- **Databases**: PostgreSQL, MySQL, Redis -- **AI/ML**: vLLM -- **Infrastructure**: Memcached, NFS - -### `/scripts/` - Utility Scripts -**Purpose**: Installation and utility scripts -``` -scripts/ -├── setup-utils.sh # Install dependencies -└── install-wild-cloud-dependencies.sh -``` - -### `/docs/` - Documentation -**Purpose**: User guides and documentation -``` -docs/ -├── guides/ # Setup and usage guides -├── agent-context/ # Agent documentation -│ └── wildcloud/ # Context files for AI agents -└── *.md # Various documentation files -``` - -### `/test/` - Test Suite -**Purpose**: Automated testing with Bats -``` -test/ -├── bats/ # Bats testing framework -├── fixtures/ # Test data and configurations -├── run_bats_tests.sh # Test runner -└── *.bats # Individual test files -``` - -### Root Files -``` -/ -├── README.md # Project overview -├── CLAUDE.md # AI assistant context -├── LICENSE # GNU AGPLv3 -├── CONTRIBUTING.md # Contribution guidelines -├── env.sh # Environment setup -├── .gitignore # Git exclusions -└── .gitmodules # Git submodules -``` - -## User Cloud Directory Structure - -Each user deployment (`WC_HOME`) is an independent cloud instance: - -### Directory Layout -``` -my-cloud/ # User's cloud directory -├── .wildcloud/ # Project marker and cache -│ ├── cache/ # Downloaded templates -│ │ ├── apps/ # Cached app templates -│ │ └── services/ # Cached service templates -│ └── logs/ # Operation logs -├── config.yaml # Main configuration -├── secrets.yaml # Sensitive data (600 permissions) -├── env.sh # Environment setup (auto-generated) -├── apps/ # Deployed application configs -│ ├── ghost/ # Compiled ghost configuration -│ ├── postgresql/ # Database configuration -│ └── ... 
-├── setup/ # Infrastructure configurations -│ ├── cluster-nodes/ # Node-specific configurations -│ │ └── generated/ # Generated Talos configs -│ └── cluster-services/ # Compiled service configurations -├── docs/ # Project-specific documentation -├── .kube/ # Kubernetes configuration -│ └── config # kubectl configuration -├── .talos/ # Talos configuration -│ └── config # talosctl configuration -└── backups/ # Local backup staging -``` - -### Configuration Files - -**`config.yaml`** - Main configuration (version controlled): -```yaml -cloud: - domain: "example.com" - email: "admin@example.com" -cluster: - name: "my-cluster" - nodeCount: 3 -apps: - ghost: - domain: "blog.example.com" -``` - -**`secrets.yaml`** - Sensitive data (not version controlled): -```yaml -apps: - ghost: - dbPassword: "generated-password" - postgresql: - rootPassword: "generated-password" -cluster: - talos: - secrets: "base64-encoded-secrets" -``` - -**`.wildcloud/`** - Project metadata: -- Marks directory as Wild Cloud project -- Contains cached templates and temporary files -- Used for project detection by scripts - -### Generated Directories - -**`apps/`** - Compiled application configurations: -- Created by `wild-app-add` command -- Contains ready-to-deploy Kubernetes manifests -- Templates processed with user configuration -- Each app in separate subdirectory - -**`setup/cluster-nodes/generated/`** - Talos configurations: -- Base cluster configuration (`controlplane.yaml`, `worker.yaml`) -- Node-specific patches and final configs -- Cluster secrets and certificates -- Generated by `wild-cluster-config-generate` - -**`setup/cluster-services/`** - Kubernetes services: -- Compiled service configurations -- Generated by `wild-cluster-services-configure` -- Ready for deployment to cluster - -## Template Processing Flow - -### From Repository to Deployment - -1. **Template Storage**: Templates stored in repository with placeholder variables -2. **Configuration Merge**: `wild-app-add` reads templates directly from repository and merges app defaults with user config -3. **Template Compilation**: gomplate processes templates with user data -4. **Manifest Generation**: Final Kubernetes manifests created in user directory -5. 
**Deployment**: `wild-app-deploy` applies manifests to cluster - -### Template Variables - -**Repository Templates** (before processing): -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: ghost - namespace: {{ .apps.ghost.namespace }} -spec: - replicas: {{ .apps.ghost.replicas | default 1 }} - template: - spec: - containers: - - name: ghost - image: "{{ .apps.ghost.image }}" - env: - - name: url - value: "https://{{ .apps.ghost.domain }}" -``` - -**User Directory** (after processing): -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: ghost - namespace: ghost -spec: - replicas: 2 - template: - spec: - containers: - - name: ghost - image: "ghost:5.0.0" - env: - - name: url - value: "https://blog.example.com" -``` - -## File Permissions and Security - -### Security Model - -**Configuration Security**: -```bash -config.yaml # 644 (readable by group) -secrets.yaml # 600 (owner only) -.wildcloud/ # 755 (standard directory) -apps/ # 755 (standard directory) -``` - -**Git Integration**: -```gitignore -# Automatically excluded from version control -secrets.yaml # Never commit secrets -.wildcloud/cache/ # Temporary files -.wildcloud/logs/ # Operation logs -setup/cluster-nodes/generated/ # Generated configs -.kube/ # Kubernetes configs -.talos/ # Talos configs -backups/ # Backup files -``` - -### Access Patterns - -**Read Operations**: -- Scripts read config and secrets via `wild-config` and `wild-secret` -- Template processor accesses both files for compilation -- Kubernetes tools read generated manifests - -**Write Operations**: -- Only Wild Cloud commands modify config and secrets -- Manual editing supported but not recommended -- Backup processes create read-only copies - -## Development Workflow - -### Repository Development - -**Setup Development Environment**: -```bash -git clone https://github.com/username/wild-cloud -cd wild-cloud -source env.sh # Set up environment -scripts/setup-utils.sh # Install dependencies -``` - -**Testing Changes**: -```bash -# Test specific functionality -test/run_bats_tests.sh - -# Test with real cloud directory -cd /path/to/test-cloud -wild-app-add myapp # Test app changes -wild-setup-cluster --dry-run # Test cluster changes -``` - -### User Workflow - -**Initial Setup**: -```bash -mkdir my-cloud && cd my-cloud -wild-init # Initialize project -wild-setup # Deploy infrastructure -``` - -**Daily Operations**: -```bash -wild-apps-list # Browse available apps -wild-app-add ghost # Configure app -wild-app-deploy ghost # Deploy to cluster -``` - -**Configuration Management**: -```bash -wild-config apps.ghost.domain # Read configuration -wild-config-set apps.ghost.storage "20Gi" # Update configuration -wild-app-deploy ghost # Apply changes -``` - -## Integration Points - -### External Systems - -**DNS Providers**: -- Cloudflare API for DNS record management -- Route53 support for AWS domains -- Generic webhook support for other providers - -**Certificate Authorities**: -- Let's Encrypt (primary) -- Custom CA support -- Manual certificate import - -**Storage Backends**: -- Local storage via Longhorn -- NFS network storage -- Cloud storage integration (S3, etc.) 
- -**Backup Systems**: -- Restic for deduplication and encryption -- S3-compatible storage backends -- Local and remote backup targets - -### Kubernetes Integration - -**Custom Resources**: -- Traefik IngressRoute and Middleware -- cert-manager Certificate and Issuer -- Longhorn Volume and Engine -- ExternalDNS DNSEndpoint - -**Standard Resources**: -- Deployments, Services, ConfigMaps -- Ingress, PersistentVolumes, Secrets -- NetworkPolicies, ServiceAccounts -- Jobs, CronJobs, DaemonSets - -## Extensibility Points - -### Custom Applications - -**Create New Apps**: -1. Create directory in `apps/` -2. Define `manifest.yaml` with metadata -3. Create Kubernetes resource templates -4. Test with `wild-app-add` and `wild-app-deploy` - -**App Requirements**: -- Follow Wild Cloud labeling standards -- Use gomplate template syntax -- Include external-dns annotations -- Implement proper security contexts - -### Custom Services - -**Add Infrastructure Services**: -1. Create directory in `setup/cluster-services/` -2. Define installation and configuration scripts -3. Create Kubernetes manifests with templates -4. Integrate with service deployment pipeline - -### Script Extensions - -**Extend CLI**: -- Add scripts to `bin/` directory with `wild-` prefix -- Follow common script patterns (error handling, help text) -- Source `wild-common.sh` for utilities -- Use configuration system for customization - -## Deployment Patterns - -### Single-Node Development - -**Configuration**: -```yaml -cluster: - nodeCount: 1 - nodes: - all-in-one: - roles: ["controlplane", "worker"] -``` - -**Suitable For**: -- Development and testing -- Learning Kubernetes concepts -- Small personal deployments -- Resource-constrained environments - -### Multi-Node Production - -**Configuration**: -```yaml -cluster: - nodeCount: 5 - nodes: - control-1: { role: "controlplane" } - control-2: { role: "controlplane" } - control-3: { role: "controlplane" } - worker-1: { role: "worker" } - worker-2: { role: "worker" } -``` - -**Suitable For**: -- Production workloads -- High availability requirements -- Scalable application hosting -- Enterprise-grade deployments - -### Hybrid Deployments - -**Configuration**: -```yaml -cluster: - nodes: - control-1: - role: "controlplane" - taints: [] # Allow workloads on control plane - worker-gpu: - role: "worker" - labels: - nvidia.com/gpu: "true" # GPU-enabled node -``` - -**Use Cases**: -- Mixed workload requirements -- Specialized hardware (GPU, storage) -- Cost optimization -- Gradual scaling - -The Wild Cloud architecture provides a solid foundation for personal cloud infrastructure while maintaining flexibility for customization and extension. \ No newline at end of file diff --git a/ai/wildcloud-v.PoC/setup-process.md b/ai/wildcloud-v.PoC/setup-process.md deleted file mode 100644 index 26f3167..0000000 --- a/ai/wildcloud-v.PoC/setup-process.md +++ /dev/null @@ -1,390 +0,0 @@ -# Wild Cloud Setup Process & Infrastructure - -Wild Cloud provides a complete, production-ready Kubernetes infrastructure designed for personal use. It combines enterprise-grade technologies to create a self-hosted cloud platform with automated deployment, HTTPS certificates, and web management interfaces. - -## Setup Phases Overview - -The Wild Cloud setup follows a sequential, dependency-aware process: - -1. **Environment Setup** - Install required tools and dependencies -2. **DNS/Network Foundation** - Set up dnsmasq for DNS and PXE booting -3. 
**Cluster Infrastructure** - Deploy Talos Linux nodes and Kubernetes cluster -4. **Cluster Services** - Install core services (ingress, storage, certificates, etc.) - -## Phase 1: Environment Setup - -### Dependencies Installation -**Script**: `scripts/setup-utils.sh` - -**Required Tools**: -- `kubectl` - Kubernetes CLI -- `gomplate` - Template processor for configuration -- `kustomize` - Kubernetes configuration management -- `yq` - YAML processor -- `restic` - Backup tool -- `talosctl` - Talos Linux cluster management - -### Project Initialization -**Command**: `wild-init` - -Creates the basic Wild Cloud directory structure: -- `.wildcloud/` - Project marker and cache -- `config.yaml` - Main configuration file -- `secrets.yaml` - Sensitive data storage -- Basic project scaffolding - -## Phase 2: DNS/Network Foundation - -### dnsmasq Infrastructure -**Location**: `setup/dnsmasq/` -**Requirements**: Dedicated Linux machine with static IP - -**Services Provided**: -1. **LAN DNS Server** - - Forwards internal domains (`*.internal.domain.com`) to cluster - - Forwards external domains (`*.domain.com`) to cluster - - Provides DNS resolution for entire network - -2. **PXE Boot Server** - - Enables network booting for cluster node installation - - DHCP/TFTP services for Talos Linux deployment - - Automated node provisioning - -**Network Configuration Example**: -```yaml -network: - subnet: 192.168.1.0/24 - gateway: 192.168.1.1 - dnsmasq_ip: 192.168.1.50 - dhcp_range: 192.168.1.100-200 - metallb_pool: 192.168.1.80-89 - control_plane_vip: 192.168.1.90 - node_ips: 192.168.1.91-93 -``` - -## Phase 3: Cluster Infrastructure Setup - -### Talos Linux Foundation -**Command**: `wild-setup-cluster` - -**Talos Configuration**: -- **Version**: v1.11.0 (configurable) -- **Immutable OS**: Designed specifically for Kubernetes -- **System Extensions**: - - Intel microcode updates - - iSCSI tools for storage - - gVisor container runtime - - NVIDIA GPU support (optional) - - Additional system utilities - -### Cluster Setup Process - -#### 1. Configuration Generation -**Script**: `wild-cluster-config-generate` - -- Generates base Talos configurations (`controlplane.yaml`, `worker.yaml`) -- Creates cluster secrets using `talosctl gen config` -- Establishes foundation for all node configurations - -#### 2. 
Node Setup (Atomic Operations)
-**Script**: `wild-node-setup <node> [options]`
-
-**Complete Node Lifecycle Management**:
-- **Hardware Detection**: Discovers network interfaces and storage devices
-- **Configuration Generation**: Creates node-specific patches and final configs
-- **Deployment**: Applies Talos configuration to the node
-
-**Options**:
-- `--detect`: Force hardware re-detection
-- `--no-deploy`: Generate configuration only, skip deployment
-
-**Integration with Cluster Setup**:
-- `wild-setup-cluster` automatically calls `wild-node-setup` for each node
-- Individual node failures don't break cluster setup
-- Clear retry instructions for failed nodes
-
-### Cluster Architecture
-
-**Control Plane**:
-- 3 nodes for high availability
-- Virtual IP (VIP) for load balancing
-- etcd distributed across all control plane nodes
-
-**Worker Nodes**:
-- Variable count (configured during setup)
-- Dedicated workload execution
-- Storage participation via Longhorn
-
-**Networking**:
-- All nodes on same LAN segment
-- Sequential IP assignment
-- MetalLB integration for load balancing
-
-## Phase 4: Cluster Services Installation
-
-### Services Deployment Process
-**Command**: `wild-setup-services [options]`
-- **`--fetch`**: Fetch fresh templates before setup
-- **`--no-deploy`**: Configure only, skip deployment
-
-**New Architecture**: Per-service atomic operations
-- Uses `wild-service-setup <service>` for each service in dependency order
-- Each service handles complete lifecycle: fetch → configure → deploy
-- Dependency validation before each service deployment
-- Fail-fast with clear recovery instructions
-
-**Individual Service Management**: `wild-service-setup <service> [options]`
-- **Default**: Configure and deploy using existing templates
-- **`--fetch`**: Fetch fresh templates before setup
-- **`--no-deploy`**: Configure only, skip deployment
-
-### Core Services (Installed in Order)
-
-#### 1. MetalLB Load Balancer
-**Location**: `setup/cluster-services/metallb/`
-
-- **Purpose**: Provides load balancing for bare metal clusters
-- **Functionality**: Assigns external IPs to Kubernetes services
-- **Configuration**: IP address pool from local network range
-- **Integration**: Foundation for ingress traffic routing
-
-#### 2. Longhorn Distributed Storage
-**Location**: `setup/cluster-services/longhorn/`
-
-- **Purpose**: Distributed block storage for persistent volumes
-- **Features**:
-  - Cross-node data replication
-  - Snapshot and backup capabilities
-  - Volume expansion and management
-  - Web-based management interface
-- **Storage**: Uses local disks from all cluster nodes
-
-#### 3. Traefik Ingress Controller
-**Location**: `setup/cluster-services/traefik/`
-
-- **Purpose**: HTTP/HTTPS reverse proxy and ingress controller
-- **Features**:
-  - Automatic service discovery
-  - TLS termination
-  - Load balancing and routing
-  - Gateway API support
-- **Integration**: Works with MetalLB for external traffic
-
-#### 4. CoreDNS
-**Location**: `setup/cluster-services/coredns/`
-
-- **Purpose**: DNS resolution for cluster services
-- **Integration**: Connects with external DNS providers
-- **Functionality**: Service discovery and DNS forwarding
-
-#### 5. cert-manager
-**Location**: `setup/cluster-services/cert-manager/`
-
-- **Purpose**: Automatic TLS certificate management
-- **Features**:
-  - Let's Encrypt integration
-  - Automatic certificate issuance and renewal
-  - Multiple certificate authorities support
-  - Certificate lifecycle management
-
-#### 6. 
ExternalDNS -**Location**: `setup/cluster-services/externaldns/` - -- **Purpose**: Automatic DNS record management -- **Functionality**: - - Syncs Kubernetes services with DNS providers - - Automatic A/CNAME record creation - - Supports multiple DNS providers (Cloudflare, Route53, etc.) - -#### 7. Kubernetes Dashboard -**Location**: `setup/cluster-services/kubernetes-dashboard/` - -- **Purpose**: Web UI for cluster management -- **Access**: `https://dashboard.internal.domain.com` -- **Authentication**: Token-based access via `wild-dashboard-token` -- **Features**: Resource management, monitoring, troubleshooting - -#### 8. NFS Storage (Optional) -**Location**: `setup/cluster-services/nfs/` - -- **Purpose**: Network file system for shared storage -- **Use Cases**: Media storage, backups, shared data -- **Integration**: Mounted as persistent volumes in applications - -#### 9. Docker Registry -**Location**: `setup/cluster-services/docker-registry/` - -- **Purpose**: Private container registry -- **Features**: Store custom images locally -- **Integration**: Used by applications and CI/CD pipelines - -## Infrastructure Components Deep Dive - -### DNS and Domain Architecture - -``` -Internet → External DNS → MetalLB LoadBalancer → Traefik → Kubernetes Services - ↑ - Internal DNS (dnsmasq) - ↑ - Internal Network -``` - -**Domain Types**: -- **External**: `app.domain.com` - Public-facing services -- **Internal**: `app.internal.domain.com` - Admin interfaces only -- **Resolution**: dnsmasq forwards all domain traffic to cluster - -### Certificate and TLS Management - -**Automatic Certificate Flow**: -1. Service deployed with ingress annotation -2. cert-manager detects certificate requirement -3. Let's Encrypt challenge initiated -4. Certificate issued and stored in Kubernetes secret -5. Traefik uses certificate for TLS termination -6. Automatic renewal before expiration - -### Storage Architecture - -**Longhorn Distributed Storage**: -- Block-level replication across nodes -- Default 3-replica policy for data durability -- Automatic failover and recovery -- Snapshot and backup capabilities -- Web UI for management and monitoring - -**Storage Classes**: -- `longhorn` - Default replicated storage -- `longhorn-single` - Single replica for non-critical data -- `nfs` - Shared network storage (if configured) - -### Network Traffic Flow - -**External Request Flow**: -1. DNS resolution via dnsmasq → cluster IP -2. Traffic hits MetalLB load balancer -3. MetalLB forwards to Traefik ingress -4. Traefik terminates TLS and routes to service -5. Service forwards to appropriate pod -6. 
Response follows reverse path
-
-### High Availability Features
-
-**Control Plane HA**:
-- 3 control plane nodes with leader election
-- Virtual IP for API server access
-- etcd cluster with automatic failover
-- Distributed workload scheduling
-
-**Storage HA**:
-- Longhorn 3-way replication
-- Automatic replica placement across nodes
-- Node failure recovery
-- Data integrity verification
-
-**Networking HA**:
-- MetalLB speaker pods on all nodes
-- Automatic load balancer failover
-- Multiple ingress controller replicas
-
-## Hardware Requirements
-
-### Minimum Specifications
-- **Nodes**: 3 control plane + optional workers
-- **RAM**: 8GB minimum per node (16GB+ recommended)
-- **CPU**: 4 cores minimum per node
-- **Storage**: 100GB+ local storage per node
-- **Network**: Gigabit ethernet connectivity
-
-### Network Requirements
-- All nodes on same LAN segment
-- Static IP assignments or DHCP reservations
-- dnsmasq server accessible by all nodes
-- Internet connectivity for image pulls and Let's Encrypt
-
-### Recommended Hardware
-- **Control Plane**: 16GB RAM, 8 cores, 200GB NVMe SSD
-- **Workers**: 32GB RAM, 16 cores, 500GB NVMe SSD
-- **Network**: Dedicated VLAN or network segment
-- **Redundancy**: UPS protection, dual network interfaces
-
-## Configuration Management
-
-### Configuration Files
-- `config.yaml` - Main configuration (domains, network, apps)
-- `secrets.yaml` - Sensitive data (passwords, API keys, certificates)
-- `.wildcloud/` - Cache and temporary files
-
-### Template System
-**gomplate Integration**:
-- All configurations processed as templates
-- Access to config and secrets via template variables
-- Dynamic configuration generation
-- Environment-specific customization
-
-### Configuration Commands
-```bash
-# Read configuration values
-wild-config cluster.name
-wild-config apps.ghost.domain
-
-# Set configuration values
-wild-config-set cloud.domain "example.com"
-wild-config-set cluster.nodeCount 5
-
-# Secret management
-wild-secret apps.database.password
-wild-secret-set apps.api.key "secret-value"
-```
-
-## Setup Commands Reference
-
-### Complete Setup
-```bash
-wild-init   # Initialize project
-wild-setup  # Complete automated setup
-```
-
-### Phase-by-Phase Setup
-```bash
-wild-setup-cluster   # Cluster infrastructure only
-wild-setup-services  # Cluster services only
-```
-
-### Individual Operations
-```bash
-wild-cluster-config-generate       # Generate base configs
-wild-node-setup <node>             # Complete node setup (detect → configure → deploy)
-wild-node-setup <node> --detect    # Force hardware re-detection
-wild-node-setup <node> --no-deploy # Configuration only
-wild-dashboard-token               # Get dashboard access
-wild-health                        # System health check
-```
-
-## Troubleshooting and Validation
-
-### Health Checks
-```bash
-wild-health          # Overall system status
-kubectl get nodes    # Node status
-kubectl get pods -A  # All pod status
-talosctl health      # Talos cluster health
-```
-
-### Service Validation
-```bash
-kubectl get svc -n metallb-system    # MetalLB status
-kubectl get pods -n longhorn-system  # Storage status
-kubectl get pods -n traefik          # Ingress status
-kubectl get certificates -A          # Certificate status
-```
-
-### Log Analysis
-```bash
-talosctl logs -f machined                             # Talos system logs
-kubectl logs -n traefik deployment/traefik            # Ingress logs
-kubectl logs -n cert-manager deployment/cert-manager  # Certificate logs
-```
-
-This comprehensive setup process creates a production-ready personal cloud infrastructure with enterprise-grade reliability, security, and management capabilities.
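-
-For a quick post-setup sweep, the validation commands above can be chained into a single script. A minimal sketch, assuming the default service namespaces used throughout this guide (`metallb-system`, `longhorn-system`, `traefik`):
-
-```bash
-#!/bin/bash
-# Convenience wrapper over the documented checks; set -e stops at the first
-# failing command so the output points straight at the broken layer.
-set -e
-set -o pipefail
-
-wild-health                          # overall system status
-kubectl get nodes                    # node status
-talosctl health                      # Talos cluster health
-kubectl get svc -n metallb-system    # MetalLB status
-kubectl get pods -n longhorn-system  # storage status
-kubectl get pods -n traefik          # ingress status
-kubectl get certificates -A          # certificate status
-```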
\ No newline at end of file diff --git a/docs/app-states.md b/docs/app-states.md deleted file mode 100644 index 7fcea7f..0000000 --- a/docs/app-states.md +++ /dev/null @@ -1,1902 +0,0 @@ -# Wild Cloud App Lifecycle: State and Operations - -## Overview - -Wild Cloud manages applications across multiple independent systems with different consistency guarantees. Understanding these systems, their interactions, and how app packages are structured is critical for reliable app lifecycle management. - -This document covers: -- **System architecture**: The three independent systems managing app state -- **User workflows**: Two distinct approaches (git-based vs Web UI) -- **App package structure**: How apps are defined in Wild Directory -- **State lifecycle**: Complete state transitions from add to delete -- **Operations**: How each lifecycle operation works across systems -- **Edge cases**: Common failure modes and automatic recovery - -## User Workflows - -Wild Cloud supports two fundamentally different workflows for managing app lifecycle: - -### Advanced Users: Git-Based Infrastructure-as-Code - -**Target Audience**: DevOps engineers, systems administrators, users comfortable with git and command-line tools. - -**Key Characteristics**: -- Instance data directory is a git repository -- Wild Directory tracked as upstream remote -- Manual edits tracked in git with commit messages -- Wild Directory updates merged using standard git workflows -- Full version control and audit trail -- SSH/command-line access to Wild Central device - -**Typical Workflow**: -```bash -# Clone instance repository -git clone user@wild-central:/var/lib/wild-central/instances/my-cloud - -# Make custom changes -vim apps/myapp/deployment.yaml -git commit -m "Increase CPU limits for production" - -# Merge upstream Wild Directory updates -git remote add wild-directory https://github.com/wildcloud/wild-directory.git -git fetch wild-directory -git merge wild-directory/main -# Resolve conflicts if needed - -# Deploy changes -wild app deploy myapp -``` - -**Philosophy**: Treat cluster configuration like application code - version controlled, reviewed, tested, and deployed through established git workflows. - -### Regular Users: Web UI-Based Management - -**Target Audience**: Non-technical users, small teams, users who prefer graphical interfaces. - -**Key Characteristics**: -- All management through Web UI or CLI (no SSH access) -- Configuration changes via forms (config.yaml, secrets.yaml) -- Wild Directory updates applied automatically with config merging -- Cannot directly edit manifest files (prevents divergence) -- Simplified workflow with automatic safety checks - -**Typical Workflow**: -1. Browse available apps in Web UI -2. Click "Add" to add app to instance -3. Configure via form fields (port, storage, domain, etc.) -4. Click "Deploy" to deploy to cluster -5. System notifies when Wild Directory updates available -6. Click "Update" to merge changes (config preserved) -7. Review changes in diff view -8. Click "Deploy" to apply updates - -**Philosophy**: Abstract away complexity - users manage apps like installing software, not like managing infrastructure code. 
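-
-The same flow can be driven from the CLI. A sketch only: `wild app deploy` appears in the git-based workflow above, while the `add` and `update` subcommands (and the `ghost` app) are assumed here to mirror the Web UI actions:
-
-```bash
-# Add, configure defaults, and deploy (Web UI steps 1-4):
-wild app add ghost
-wild app deploy ghost
-
-# Later, when a Wild Directory update is available (Web UI steps 5-8):
-wild app update ghost   # merge upstream changes; config values preserved
-wild app deploy ghost   # apply the reviewed update to the cluster
-```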
- -### Key Differences - -| Aspect | Advanced Users (Git) | Regular Users (Web UI) | -|--------|----------------------|------------------------| -| **Access** | SSH + command line | Web UI + CLI | -| **Manifest Editing** | Direct file editing | Via config forms only | -| **Version Control** | Git (full history) | System managed | -| **Wild Directory Updates** | Manual git merge | Automatic merge with review | -| **Customization** | Unlimited | Configuration-based only | -| **Drift** | Intentional (git-tracked) | Unintentional (reconcile) | -| **Collaboration** | Git branches/PRs | Shared Web UI access | -| **Rollback** | `git revert` | Re-deploy previous state | - -The rest of this document covers both workflows, with sections clearly marked for each user type where behavior differs. - -## System Architecture - -### The Multi-System Challenge - -Wild Cloud app state spans **three independent systems**: - -1. **Wild Directory** (Source of Truth) - - Location: `/path/to/wild-directory/{app-name}/` - - Consistency: Immutable, version controlled - - Purpose: Template definitions shared across all instances - -2. **Instance Data** (Local State) - - Location: `/path/to/data-dir/instances/{instance}/` - - Consistency: Immediately consistent, file-system based - - Purpose: Instance-specific configuration and compiled manifests - -3. **Kubernetes Cluster** (Runtime State) - - Location: Kubernetes API and etcd - - Consistency: Eventually consistent - - Purpose: Running application workloads - -**Critical Insight**: These systems have fundamentally different consistency models, creating inherent challenges for atomic operations across system boundaries. - -## State Components - -### 1. Wild Directory (Immutable Source) - -``` -wild-directory/ -└── {app-name}/ - ├── manifest.yaml # App metadata, dependencies, defaults - ├── kustomization.yaml # Kustomize configuration - ├── deployment.yaml # Kubernetes workload (template) - ├── service.yaml # Kubernetes service (template) - ├── ingress.yaml # Kubernetes ingress (template) - ├── namespace.yaml # Namespace definition (template) - ├── pvc.yaml # Storage claims (template) - ├── db-init-job.yaml # Database initialization (optional) - └── README.md # Documentation -``` - -**Characteristics**: -- Read-only during operations -- Contains gomplate template variables: `{{ .cloud.domain }}`, `{{ .app.port }}` -- Shared across all Wild Cloud instances -- Version controlled (git) - -#### App Manifest Structure - -The `manifest.yaml` file defines everything about an app: - -```yaml -name: myapp # App identifier (matches directory name) -is: myapp # Unique app type identifier -description: Brief description -version: 1.0.0 # Follow upstream versioning -icon: https://example.com/icon.svg - -requires: # Dependencies (optional) - - name: postgres # Dependency app type (matches 'is' field) - alias: db # Optional: reference name in templates - - name: redis # No alias = use 'redis' as reference - -defaultConfig: # Merged into instance config.yaml - namespace: myapp - image: myapp/myapp:latest - port: "8080" - storage: 10Gi - domain: myapp.{{ .cloud.domain }} - # Can reference dependencies: - dbHost: "{{ .apps.db.host }}" - redisHost: "{{ .apps.redis.host }}" - -defaultSecrets: # App's own secrets - - key: apiKey # Auto-generated random if no default - - key: dbUrl # Can use template with config/secrets - default: "postgresql://{{ .app.dbUser }}:{{ .secrets.dbPassword }}@{{ .app.dbHost }}/{{ .app.dbName }}" - -requiredSecrets: # Secrets from dependencies - - db.password # 
-
-**Template Variable Resolution**:
-
-In `manifest.yaml` only:
-- `{{ .cloud.* }}` - Infrastructure config (domain, smtp, etc.)
-- `{{ .cluster.* }}` - Cluster config (IPs, versions, etc.)
-- `{{ .operator.* }}` - Operator info (email)
-- `{{ .app.* }}` - This app's config from defaultConfig
-- `{{ .apps.<alias>.* }}` - Dependency app's config (via requires mapping)
-- `{{ .secrets.* }}` - This app's secrets (in defaultSecrets default only)
-
-In `*.yaml` resource templates:
-- `{{ .* }}` - Only this app's config (all from defaultConfig)
-- No access to secrets, cluster config, or other apps
-
-**Dependency Resolution**:
-1. `requires` lists app types needed (matches `is` field)
-2. At add time, user maps to actual installed apps
-3. System stores mapping in `installedAs` field in instance manifest
-4. Templates resolve `{{ .apps.db.* }}` using this mapping
-
-### 2. Instance Data (Local State)
-
-```
-data-dir/instances/{instance}/
-├── config.yaml              # App configuration (user-editable)
-├── secrets.yaml             # App secrets (generated + user-editable)
-├── kubeconfig               # Cluster access credentials
-├── apps/
-│   └── {app-name}/
-│       ├── manifest.yaml    # Copy with installedAs mappings
-│       ├── deployment.yaml  # Compiled (variables resolved)
-│       ├── service.yaml     # Compiled
-│       ├── ingress.yaml     # Compiled
-│       └── ...              # All manifests compiled
-└── operations/
-    └── op_{action}_app_{app-name}_{timestamp}.json
-```
-
-#### config.yaml Structure
-
-```yaml
-apps:
-  postgres:
-    namespace: postgres
-    image: pgvector/pgvector:pg15
-    port: "5432"
-    storage: 10Gi
-    host: postgres.postgres.svc.cluster.local
-    # ... all defaultConfig values from manifest
-```
-
-#### secrets.yaml Structure
-
-```yaml
-apps:
-  postgres:
-    password: <generated>
-  ghost:
-    dbPassword: <generated>
-    adminPassword: <generated>
-    smtpPassword: <generated>
-    # defaultSecrets + requiredSecrets
-```
-
-**Characteristics**:
-- Immediately consistent (filesystem)
-- File-locked during updates (`config.yaml.lock`, `secrets.yaml.lock`)
-- Version controlled (recommended but optional)
-- User-editable (advanced users can SSH and modify)
-
-### 3. 
Kubernetes Cluster (Runtime State) - -``` -Kubernetes Cluster -└── Namespace: {app-name} - ├── Deployment: {app-name}-* - ├── ReplicaSet: {app-name}-* - ├── Pod: {app-name}-* - ├── Service: {app-name} - ├── Ingress: {app-name} - ├── PVC: {app-name}-pvc - ├── Secret: {app-name}-secrets - ├── ConfigMap: {app-name}-* (optional) - └── Job: {app-name}-db-init (optional) -``` - -**Namespace Lifecycle**: -- `Active`: Normal operating state -- `Terminating`: Deletion in progress (may take time) -- Finalizers: `[kubernetes]` prevents deletion until resources cleaned - -**Characteristics**: -- Eventually consistent (distributed system) -- Cascade deletion: Deleting namespace deletes all child resources -- Finalizers block deletion until cleared -- May enter stuck states requiring automatic intervention - -#### Kubernetes Resource Labeling - -All Wild Cloud apps use standard labels automatically applied via Kustomize: - -```yaml -# In kustomization.yaml -labels: - - includeSelectors: true # Apply to resources AND selectors - pairs: - app: myapp # App name - managedBy: kustomize - partOf: wild-cloud -``` - -This auto-expands selectors: -```yaml -# You write: -selector: - component: web - -# Kustomize expands to: -selector: - app: myapp - managedBy: kustomize - partOf: wild-cloud - component: web -``` - -**Important**: Use simple component labels (`component: web`), not Helm-style labels (`app.kubernetes.io/name`). - -### 4. External System State (Kubernetes Controller-Managed) - -These systems are not directly controlled by Wild Cloud API but are integral to app lifecycle: - -#### External DNS (via external-dns controller) - -**Location**: External DNS provider (Cloudflare, Route53, etc.) - -**Trigger**: Ingress with external-dns annotations -```yaml -annotations: - external-dns.alpha.kubernetes.io/target: {{ .domain }} -``` - -**State Flow**: -``` -1. Deploy creates Ingress with annotations -2. external-dns controller watches Ingress resources -3. Controller creates DNS records at provider -4. DNS propagates (eventual consistency, 30-300 seconds) -5. Domain resolves to cluster load balancer IP -``` - -**Lifecycle**: -- **Create**: Automatic when Ingress deployed -- **Update**: Automatic when Ingress annotations change -- **Delete**: Automatic when Ingress deleted (DNS records cleaned up) - -**Eventual Consistency**: DNS changes take 30s-5min to propagate globally. - -**Edge Cases**: -- DNS propagation delay (app deployed but domain not resolving yet) -- Provider rate limits (too many updates) -- Stale records if external-dns controller down during deletion -- Multiple ingresses with same hostname (last write wins) - -**Debugging**: -```bash -# View external-dns logs -kubectl logs -n external-dns deployment/external-dns - -# Check what DNS records external-dns is managing -kubectl get ingress -A -o yaml | grep external-dns -``` - -#### TLS Certificates (via cert-manager) - -**Location**: Both cluster (Kubernetes Secret) and external CA (Let's Encrypt) - -**Trigger**: Ingress with cert-manager annotations -```yaml -annotations: - cert-manager.io/cluster-issuer: letsencrypt-prod -spec: - tls: - - hosts: - - myapp.cloud.example.com - secretName: myapp-tls -``` - -**State Flow**: -``` -1. Deploy creates Ingress with TLS config -2. cert-manager creates Certificate resource -3. cert-manager creates Order with ACME DNS-01 challenge -4. cert-manager updates DNS via provider (for challenge) -5. Let's Encrypt validates domain ownership via DNS -6. cert-manager receives certificate and stores in Secret -7. 
Ingress controller uses Secret for TLS termination -``` - -**Lifecycle**: -- **Create**: Automatic when Ingress with cert-manager annotation deployed -- **Renew**: Automatic (starts 30 days before expiry) -- **Delete**: Secret deleted with namespace, CA record persists - -**Eventual Consistency**: Certificate issuance takes 30s-2min (DNS challenge + CA validation). - -**Edge Cases**: -- DNS-01 challenge timeout (DNS not propagated yet) -- Rate limits (Let's Encrypt: 50 certs/domain/week, 5 failed validations/hour) -- Expired certificates (cert-manager should auto-renew but may fail) -- Namespace stuck terminating (cert-manager challenges may block finalizers) - -**Debugging**: -```bash -# View certificates and their status -kubectl get certificate -n myapp -kubectl describe certificate myapp-tls -n myapp - -# View ACME challenge progress -kubectl get certificaterequest -n myapp -kubectl get order -n myapp -kubectl get challenge -n myapp - -# Check cert-manager logs -kubectl logs -n cert-manager deployment/cert-manager -``` - -#### Wildcard Certificates (Shared Resource Pattern) - -Wild Cloud uses **two shared wildcard certificates** to avoid rate limits: - -**1. Public Wildcard Certificate** (`wildcard-wild-cloud-tls`) -```yaml -# Created once in cert-manager namespace -apiVersion: cert-manager.io/v1 -kind: Certificate -metadata: - name: wildcard-wild-cloud-tls - namespace: cert-manager -spec: - secretName: wildcard-wild-cloud-tls - dnsNames: - - "*.cloud.example.com" -``` - -**2. Internal Wildcard Certificate** (`wildcard-internal-wild-cloud-tls`) -```yaml -# For internal-only apps not exposed via external-dns -apiVersion: cert-manager.io/v1 -kind: Certificate -metadata: - name: wildcard-internal-wild-cloud-tls - namespace: cert-manager -spec: - secretName: wildcard-internal-wild-cloud-tls - dnsNames: - - "*.internal.cloud.example.com" -``` - -**Usage Pattern**: -- **Public apps** (exposed externally): Use `wildcard-wild-cloud-tls` - - Domain: `myapp.cloud.example.com` - - Has external-dns annotation (creates public DNS record) - -- **Internal apps** (cluster-only): Use `wildcard-internal-wild-cloud-tls` - - Domain: `myapp.internal.cloud.example.com` - - No external-dns annotation (only accessible within cluster/LAN) - - Examples: Docker registry, internal dashboards - -**Shared Pattern**: -1. One wildcard cert per domain covers all subdomains -2. Apps reference via `tlsSecretName: wildcard-wild-cloud-tls` (or `wildcard-internal-wild-cloud-tls`) -3. Deploy operation copies secret from cert-manager namespace to app namespace -4. 
All apps on same domain share the certificate - -**Advantages**: -- Avoids Let's Encrypt rate limits (50 certs/domain/week) -- Faster deployment (no ACME challenge per app) -- Survives app delete/redeploy (cert persists in cert-manager namespace) - -**Trade-offs**: -- All apps on same domain share same cert (if compromised, affects all apps) -- Cert must be copied to each app namespace (handled by Deploy operation) - -**Copy Operation**: -```go -// In apps.Deploy() -// Copies both wildcard certs if referenced by ingress -wildcardSecrets := []string{"wildcard-wild-cloud-tls", "wildcard-internal-wild-cloud-tls"} -for _, secretName := range wildcardSecrets { - if bytes.Contains(ingressContent, []byte(secretName)) { - utilities.CopySecretBetweenNamespaces(kubeconfigPath, secretName, "cert-manager", appName) - } -} -``` - -#### Load Balancer IPs (via MetalLB) - -**Location**: MetalLB controller state + cluster network - -**Trigger**: Service with `type: LoadBalancer` -```yaml -apiVersion: v1 -kind: Service -metadata: - name: traefik - namespace: traefik -spec: - type: LoadBalancer - loadBalancerIP: 192.168.8.80 # Optional: request specific IP -``` - -**State Flow**: -``` -1. Service created with type: LoadBalancer -2. MetalLB controller assigns IP from configured pool -3. MetalLB announces IP via ARP (Layer 2) or BGP (Layer 3) -4. Network routes traffic to assigned IP -5. kube-proxy on nodes routes to service endpoints -``` - -**Lifecycle**: -- **Create**: Automatic when LoadBalancer Service deployed -- **Persist**: IP sticky (same IP across pod restarts) -- **Delete**: IP returned to pool when Service deleted - -**Eventual Consistency**: ARP cache clearing takes 0-60 seconds. - -**Edge Cases**: -- IP pool exhaustion (no IPs available from MetalLB pool) -- IP conflicts (pool overlaps with DHCP or static assignments) -- ARP cache issues (old MAC address cached, traffic fails until cleared) -- Split-brain scenarios (multiple nodes announce same IP) - -**Debugging**: -```bash -# View services with assigned IPs -kubectl get svc -A --field-selector spec.type=LoadBalancer - -# Check MetalLB IP pools -kubectl get ipaddresspool -n metallb-system - -# View MetalLB controller state -kubectl logs -n metallb-system deployment/controller -kubectl logs -n metallb-system daemonset/speaker -``` - -### Cross-System Dependency Chain - -A complete app deployment triggers this cascade across systems: - -``` -Wild Cloud API (Deploy) - ↓ -Kubernetes (kubectl apply) - ↓ -Namespace + Resources Created - ↓ -┌─────────────────┬──────────────────┬─────────────────┐ -│ │ │ │ -external-dns cert-manager MetalLB -watches Ingress watches Ingress watches Service - ↓ ↓ ↓ -DNS Provider Let's Encrypt Network ARP/BGP -(Cloudflare) (ACME CA) (Local Network) - ↓ ↓ ↓ -CNAME Record TLS Certificate IP Address -Created Issued Announced -(30s-5min) (30s-2min) (0-60s) - ↓ ↓ ↓ -Domain Resolves + HTTPS Works + Traffic Routes -``` - -**Total Time to Fully Operational**: -- Kubernetes resources: 5-30 seconds (image pull + pod start) -- DNS propagation: 30 seconds - 5 minutes -- TLS certificate: 30 seconds - 2 minutes -- Network ARP: 0-60 seconds - -**Worst case**: 5-7 minutes from deploy command to app fully accessible via HTTPS. 
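-
-There is no built-in command for watching this cascade end to end, but it can be followed from a shell. A minimal sketch, assuming the app serves HTTP on its root path (the domain is a placeholder):
-
-```bash
-DOMAIN=myapp.cloud.example.com
-
-# Poll every 10s, up to the ~7 minute worst case (42 attempts).
-# A successful HTTPS request implies DNS resolves, the certificate
-# validates, and traffic is routed to a ready pod.
-for i in $(seq 1 42); do
-  if curl -fsS --max-time 5 "https://${DOMAIN}/" > /dev/null 2>&1; then
-    echo "fully operational"
-    exit 0
-  fi
-  sleep 10
-done
-echo "timed out: check pods, DNS records, and certificate status" >&2
-exit 1
-```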
- -## App Lifecycle States - -### State 0: NOT_ADDED - -``` -Wild Directory: {app-name}/ exists (templates) -Instance Apps: (does not exist) -config.yaml: (no apps.{app-name} entry) -secrets.yaml: (no apps.{app-name} entry) -Cluster: (no namespace) -``` - -**Invariants**: -- App can be added from Wild Directory -- No local or cluster state exists - ---- - -### State 1: ADDED - -**After**: `wild app add {app-name}` - -``` -Wild Directory: {app-name}/ (unchanged) -Instance Apps: {app-name}/ created with compiled manifests - manifest.yaml has installedAs dependency mappings -config.yaml: apps.{app-name} populated from defaultConfig -secrets.yaml: apps.{app-name} populated with generated secrets -Cluster: (no namespace yet) -``` - -**Operations**: -1. Read `wild-directory/{app-name}/manifest.yaml` -2. Resolve gomplate variables using instance config -3. Generate random secrets for `defaultSecrets` (if no default provided) -4. Copy secrets from dependencies for `requiredSecrets` -5. Compile templates → write to `instance/apps/{app-name}/` -6. Append to `config.yaml` (file-locked) -7. Append to `secrets.yaml` (file-locked) - -**Invariants**: -- Local state consistent: config, secrets, and compiled manifests all exist -- Cluster state empty: nothing deployed yet -- Idempotent: Can re-add without side effects (overwrites local state) - ---- - -### State 2: DEPLOYING - -**During**: `wild app deploy {app-name}` - -``` -Wild Directory: (unchanged) -Instance Apps: (unchanged) -config.yaml: (unchanged) -secrets.yaml: (unchanged) -Cluster: namespace: Active (or being created) - resources: Creating/Pending - secret/{app-name}-secrets: Created -``` - -**Operations**: -1. Check namespace status (pre-flight check) -2. Create/update namespace (idempotent) -3. Create Kubernetes secret from `secrets.yaml` (overwrite if exists) -4. Copy dependency secrets (e.g., postgres-secrets) -5. Copy TLS certificates (e.g., wildcard certs from cert-manager) -6. Apply manifests: `kubectl apply -k instance/apps/{app-name}/` - -**Invariants Being Established**: -- Namespace must be `Active` or `NotFound` (not `Terminating`) -- Kubernetes secret created before workloads -- All dependencies deployed first - ---- - -### State 3: DEPLOYED - -**After successful deploy**: - -``` -Wild Directory: (unchanged) -Instance Apps: (unchanged) -config.yaml: (unchanged) -secrets.yaml: (unchanged) -Cluster: namespace: Active - deployment: Ready (replicas running) - pods: Running - service: Available (endpoints ready) - ingress: Ready (external-dns created DNS) - pvc: Bound (storage provisioned) - secret: Exists -``` - -**Invariants**: -- **Strong consistency**: Local state matches cluster intent -- All pods healthy and running -- Services have endpoints -- DNS records created (via external-dns) -- TLS certificates valid (via cert-manager) - -**Health Checks**: -```bash -kubectl get pods -n {app-name} -kubectl get ingress -n {app-name} -kubectl get pvc -n {app-name} -``` - ---- - -### State 3a: UPDATING (Configuration/Secret Changes) - -**Scenario**: User modifies config.yaml or secrets.yaml and redeploys. - -**Operations**: - -#### Update Configuration Only -``` -1. User edits config.yaml (e.g., change port, storage size) -2. User runs: wild app deploy {app-name} -3. System re-compiles templates with new config -4. System applies updated manifests: kubectl apply -k -5. 
Kubernetes performs rolling update (if applicable) -``` - -**State Flow**: -``` -config.yaml: Modified (new values) -Instance Apps: Templates re-compiled with new config -secrets.yaml: (unchanged) -Cluster: Rolling update (pods recreated with new config) -``` - -**Important**: Config changes trigger template recompilation. The `.package` directory preserves original templates, but deployed manifests are regenerated. - -#### Update Secrets Only -``` -1. User edits secrets.yaml (e.g., change password) -2. User runs: wild app deploy {app-name} -3. System deletes old Kubernetes secret -4. System creates new Kubernetes secret with updated values -5. Pods must be restarted to pick up new secrets -``` - -**State Flow**: -``` -config.yaml: (unchanged) -Instance Apps: (unchanged - no template changes) -secrets.yaml: Modified (new secrets) -Cluster: Secret updated, pods may need manual restart -``` - -**Critical**: Most apps don't auto-reload secrets. May require manual pod restart: -```bash -kubectl rollout restart deployment/{app-name} -n {app-name} -``` - -#### Update Both Config and Secrets -``` -1. User edits both config.yaml and secrets.yaml -2. User runs: wild app deploy {app-name} -3. System re-compiles templates + updates secrets -4. System applies manifests (rolling update) -5. Pods restart with new config and secrets -``` - ---- - -### State 3b: UPDATING (Manifest/Template Changes) - -**Scenario**: User directly edits Kustomize files in instance apps directory. - -This workflow differs significantly for **advanced users** (git-based) vs **regular users** (Web UI/CLI). - -#### Advanced User Workflow (Git-Based) - -**Instance directory as git repository**: -```bash -# Instance data directory is a git repo -cd /var/lib/wild-central/instances/my-cloud -git status -git log -``` - -**Operations**: -``` -1. User SSHs to Wild Central device (or uses VSCode Remote SSH) -2. User edits: apps/{app-name}/deployment.yaml -3. User commits changes: git add . && git commit -m "Custom resource limits" -4. User runs: wild app deploy {app-name} OR kubectl apply -k apps/{app-name}/ -5. Changes applied to cluster -``` - -**State Flow**: -``` -Wild Directory: (unchanged - original templates intact) -Instance Apps: Modified and git-tracked (intentional divergence) -config.yaml: (unchanged) -secrets.yaml: (unchanged) -Cluster: Updated with manual changes -.package/: (unchanged - preserves original templates) -Git History: Tracks all manual edits with commit messages -``` - -**Benefits**: -- **Version Control**: Full audit trail of all changes -- **Rollback**: `git revert` to undo changes -- **Infrastructure as Code**: Instance config managed like application code -- **Collaboration**: Multiple admins can work on same cluster config -- **Merge Workflow**: Wild Directory updates handled as upstream merges - -**Example Git Workflow**: -```bash -# Make custom changes -vim apps/myapp/deployment.yaml -git add apps/myapp/deployment.yaml -git commit -m "Increase CPU limit for production load" - -# Deploy changes -wild app deploy myapp - -# Later, merge upstream Wild Directory updates -git pull upstream main # Pull Wild Directory changes -git merge upstream/main # Merge with local customizations -# Resolve any conflicts -git push origin main -``` - -#### Regular User Workflow (Web UI/CLI) - -**Operations**: -``` -1. User cannot directly edit manifests (no SSH access) -2. User modifies config.yaml or secrets.yaml via Web UI -3. System re-compiles templates automatically -4. 
User deploys via Web UI -``` - -**State Flow**: -``` -Wild Directory: (unchanged) -Instance Apps: Re-compiled from templates (stays in sync) -config.yaml: Modified via Web UI -secrets.yaml: Modified via Web UI -Cluster: Updated via Web UI deploy -``` - -**Protection**: -- No manual manifest editing (prevents divergence) -- All changes through config/secrets (stays synchronized) -- Wild Directory updates apply cleanly (no merge conflicts) - ---- - -### State 3c: UPDATING (Wild Directory Version Update) - -**Scenario**: Wild Directory app updated (bug fix, new version, new features). - -This workflow differs significantly for **advanced users** (git-based) vs **regular users** (Web UI/CLI). - -#### Advanced User Workflow (Git Merge) - -**Wild Directory as upstream remote**: -```bash -# Add Wild Directory as upstream remote (one-time setup) -git remote add wild-directory https://github.com/wildcloud/wild-directory.git -git fetch wild-directory -``` - -**Detection**: -```bash -# Check for upstream updates -git fetch wild-directory -git log HEAD..wild-directory/main --oneline - -# See what changed in specific app -git diff HEAD wild-directory/main -- apps/myapp/ -``` - -**Merge Operations**: -```bash -# 1. Fetch latest Wild Directory changes -git fetch wild-directory - -# 2. Merge upstream changes with local customizations -git merge wild-directory/main - -# 3. Resolve any conflicts -# Git will show conflicts in manifest files, config, etc. -# User resolves conflicts preserving their custom changes - -# 4. Test changes -wild app deploy myapp --dry-run - -# 5. Deploy updated app -wild app deploy myapp - -# 6. Commit merge -git push origin main -``` - -**Conflict Resolution Example**: -```yaml -# Conflict in apps/myapp/deployment.yaml -<<<<<<< HEAD - resources: - limits: - cpu: "2000m" # Local customization - memory: "4Gi" # Local customization -======= - resources: - limits: - cpu: "1000m" # Wild Directory default - memory: "2Gi" # Wild Directory default ->>>>>>> wild-directory/main -``` - -User resolves by keeping their custom values or adopting new defaults. - -**Benefits**: -- **Full Control**: User decides what to merge and when -- **Conflict Resolution**: Git's standard merge tools handle conflicts -- **Audit Trail**: Git history shows what changed and why -- **Selective Updates**: Can cherry-pick specific app updates - -**State Flow**: -``` -Wild Directory: (tracked as remote, fetched regularly) -Instance Apps: Merged with git (custom + upstream changes) -config.yaml: Manually merged (conflicts resolved by user) -secrets.yaml: Preserved (not in Wild Directory) -.package/: Updated after merge -Git History: Shows merge commits and conflict resolutions -Cluster: Updated when user deploys after merge -``` - -#### Regular User Workflow (Automated Merge) - -**Detection Methods**: - -**Method 1: Compare .package with Wild Directory** -```bash -# System compares checksums/timestamps -diff -r instance/apps/{app-name}/.package/ wild-directory/{app-name}/ -``` - -If differences exist: New version available in Wild Directory. - -**Method 2: Check manifest version field** -```yaml -# wild-directory/{app-name}/manifest.yaml -version: 2.0.0 - -# instance/apps/{app-name}/manifest.yaml -version: 1.0.0 # Older version -``` - -**Safe Update (Preserves Local Config)** -``` -1. System detects Wild Directory changes -2. User initiates update (via UI or CLI) -3. 
System backs up current instance state: - - Saves current config.yaml section - - Saves current secrets.yaml section - - Saves current manifest.yaml (with installedAs mappings) -4. System re-adds app from Wild Directory: - - Copies new templates to instance/apps/{app-name}/ - - Updates .package/ with new source files - - Merges new defaultConfig with existing config - - Preserves existing secrets (doesn't regenerate) -5. System re-compiles templates with preserved config -6. User reviews changes (diff shown in UI) -7. User deploys updated app -``` - -**State Flow**: -``` -Wild Directory: (unchanged - new version available) -Instance Apps: Updated templates + recompiled manifests -config.yaml: Merged (new fields added, existing preserved) -secrets.yaml: (unchanged - existing secrets preserved) -.package/: Updated with new source files -Cluster: (not changed until user deploys) -``` - -**Merge Strategy for Config**: -```yaml -# Old config.yaml (version 1.0.0) -apps: - myapp: - port: "8080" - storage: 10Gi - -# New Wild Directory manifest (version 2.0.0) adds "replicas" field -defaultConfig: - port: "8080" - storage: 10Gi - replicas: "3" # New field - -# Merged config.yaml (after update) -apps: - myapp: - port: "8080" # Preserved - storage: 10Gi # Preserved - replicas: "3" # Added -``` - -**Breaking Changes**: -If Wild Directory update has breaking changes (renamed fields, removed features): -- System cannot auto-merge -- User must manually reconcile -- UI shows conflicts and requires resolution - -#### Destructive Update (Fresh Install) -``` -1. User deletes app: wild app delete {app-name} -2. User re-adds app: wild app add {app-name} -3. Config and secrets regenerated (loses customizations) -4. User must manually reconfigure -``` - -**Use When**: -- Major version upgrade with breaking changes -- Significant manifest restructuring -- User wants clean slate - ---- - -### State 3d: DEPLOYED with Drift - -**Scenario**: Cluster state diverged from instance state. - -This state has different meanings for **advanced users** vs **regular users**. - -#### Advanced Users: Intentional Drift (Git-Tracked) - -**Scenario**: User made direct cluster changes and committed them to git. - -**Example**: -```bash -# User edits deployment directly -kubectl edit deployment myapp -n myapp - -# User documents change in git -vim apps/myapp/deployment.yaml # Update manifest to match -git add apps/myapp/deployment.yaml -git commit -m "Emergency CPU limit increase for production incident" -``` - -**State Flow**: -``` -Instance Apps: Updated and git-tracked (intentional) -Git History: Documents why change was made -Cluster: Matches updated instance state -``` - -**This is NOT drift** - it's infrastructure-as-code in action. The instance directory reflects the true desired state, tracked in git. - -**Reconciliation**: Not needed (intentional state). - -#### Regular Users: Unintentional Drift - -**Scenario**: Cluster state diverged from instance state (unexpected). 
-
-**Causes**:
-- User ran `kubectl edit` directly (shouldn't happen; regular users have no SSH access)
-- Another admin modified cluster resources
-- Partial deployment failure (some resources applied, others failed)
-- Kubernetes controller modified resources (e.g., HPA changed replicas)
-
-**Detection**:
-```bash
-# Compare desired vs actual state
-kubectl diff -k instance/apps/{app-name}/
-
-# Or use declarative check
-kubectl apply -k instance/apps/{app-name}/ --dry-run=server
-```
-
-**State Flow**:
-```
-Instance Apps: Unchanged (desired state)
-Cluster: Diverged (actual state differs)
-```
-
-**Reconciliation**:
-```
-1. User runs: wild app deploy {app-name}
-2. kubectl apply re-applies desired state
-3. Kubernetes reconciles differences (three-way merge)
-4. Cluster returns to matching instance state
-```
-
-**Important**: `kubectl apply` is idempotent and safe for reconciliation.
-
-#### Distinguishing Intentional vs Unintentional Drift
-
-**Advanced users (git-based)**:
-- Check git status: `git status` shows no uncommitted changes → intentional
-- Check git log: `git log -- apps/myapp/` shows recent commits → intentional
-- Cluster state matches git-tracked files → intentional
-
-**Regular users (Web UI)**:
-- Any divergence is unintentional (no way to edit manifests directly)
-- Reconcile immediately by redeploying
-
----
-
-### State 4: DELETING
-
-**During**: `wild app delete {app-name}`
-
-```
-Wild Directory: (unchanged)
-Instance Apps: Being removed
-config.yaml: apps.{app-name} being removed
-secrets.yaml: apps.{app-name} being removed
-Cluster: namespace: Active → Terminating
-         resources: Deleting (cascade)
-```
-
-**Operations** (Two-Phase):
-
-**Phase 1: Cluster Cleanup (Best Effort)**
-```bash
-# Try graceful deletion
-kubectl delete namespace {app-name} --timeout=30s --wait=true
-
-# If stuck, force cleanup
-kubectl patch namespace {app-name} --type=merge -p '{"metadata":{"finalizers":null}}'
-```
-
-**Phase 2: Local Cleanup (Always Succeeds)**
-```bash
-rm -rf instance/apps/{app-name}/
-yq eval -i 'del(.apps.{app-name})' config.yaml
-yq eval -i 'del(.apps.{app-name})' secrets.yaml
-```
-
-**Critical Design Decision**:
-- **Don't wait indefinitely for cluster cleanup**
-- Local state is immediately consistent after Phase 2
-- Cluster cleanup is eventually consistent
-
----
-
-### State 5: DELETED
-
-**After successful delete**:
-
-```
-Wild Directory: (unchanged - still available for re-add)
-Instance Apps: (removed)
-config.yaml: (no apps.{app-name} entry)
-secrets.yaml: (no apps.{app-name} entry)
-Cluster: namespace: NotFound
-         all resources: (removed)
-```
-
-**Invariants**:
-- Local state has no trace of app
-- Cluster has no namespace or resources
-- App can be re-added cleanly
-
----
-
-### State X: STUCK_TERMINATING (Edge Case)
-
-**Problematic state when namespace won't delete**:
-
-```
-Wild Directory: (unchanged)
-Instance Apps: May or may not exist (depends on delete progress)
-config.yaml: May or may not have entry
-secrets.yaml: May or may not have entry
-Cluster: namespace: Terminating (STUCK!)
-         finalizers: Blocking deletion
-         resources: Some exist, some terminating
-```
-
-**Why This Happens**:
-1. Resources with custom finalizers
-2. Webhooks or admission controllers blocking deletion
-3. Network issues during deletion
-4. 
StatefulSet with orphaned PVCs - -**Resolution**: -- Handled automatically by Deploy pre-flight checks -- Force cleanup finalizers after retries -- User never needs manual intervention - -## System Boundaries and Consistency - -### Consistency Guarantees by System - -| System | Consistency Model | Synchronization | -|--------|------------------|-----------------| -| Wild Directory | Immutable | Read-only | -| Instance Data | Immediately Consistent | File locks | -| Kubernetes | Eventually Consistent | Reconciliation loops | - -### Cross-System Operations - -#### Delete Operation (Spans 2 Systems) - -``` -┌─────────────────────────────────────────────────┐ -│ Delete Operation Timeline │ -├─────────────────────────────────────────────────┤ -│ │ -│ T=0s: kubectl delete namespace (initiated) │ -│ └─ Cluster enters eventual consistency │ -│ │ -│ T=1s: rm apps/{app-name}/ (completes) │ -│ yq delete config.yaml (completes) │ -│ yq delete secrets.yaml (completes) │ -│ └─ Local state immediately consistent │ -│ │ -│ T=2s: Return success to user │ -│ │ -│ T=30s: Namespace still terminating in cluster │ -│ └─ This is OK! Eventually consistent │ -│ │ -│ T=60s: Cluster cleanup completes │ -│ └─ Both systems now consistent │ -└─────────────────────────────────────────────────┘ -``` - -**Key Insight**: We accept temporary inconsistency at the system boundary. - -#### Deploy Operation (Spans 2 Systems) - -``` -┌─────────────────────────────────────────────────┐ -│ Deploy Operation Timeline │ -├─────────────────────────────────────────────────┤ -│ │ -│ T=0s: Check namespace status (pre-flight) │ -│ If Terminating: Force cleanup + retry │ -│ │ -│ T=5s: Create namespace (idempotent) │ -│ Create secrets │ -│ Apply manifests │ -│ └─ Cluster enters reconciliation │ -│ │ -│ T=30s: Pods starting, images pulling │ -│ │ -│ T=60s: All pods Running, services ready │ -│ └─ Deployment successful │ -└─────────────────────────────────────────────────┘ -``` - -**Key Insight**: Deploy owns making cluster match local state. - -## Idempotency and Safety - -### Idempotent Operations - -| Operation | Idempotent? | Why | -|-----------|-------------|-----| -| `app add` | ✅ Yes | Overwrites local state | -| `app deploy` | ✅ Yes | `kubectl apply` is idempotent | -| `app delete` | ✅ Yes | `kubectl delete --ignore-not-found` | - -### Non-Idempotent Danger Zones - -1. **Secret Generation**: Regenerating secrets breaks running apps - - Solution: Only generate if key doesn't exist - -2. **Database Initialization**: Running twice can cause conflicts - - Solution: Job uses `CREATE IF NOT EXISTS`, `ALTER IF EXISTS` - -3. **Finalizer Removal**: Skips cleanup logic - - Solution: Only as last resort after graceful attempts - -## Edge Cases and Error Handling - -### Edge Case 1: Namespace Stuck Terminating - -**Scenario**: Previous delete left namespace in Terminating state. - -**Detection**: -```bash -kubectl get namespace {app-name} -o jsonpath='{.status.phase}' -# Returns: "Terminating" -``` - -**Resolution** (Automatic): -1. Deploy pre-flight check detects Terminating state -2. Attempts force cleanup: removes finalizers -3. Waits 5 seconds -4. Retries up to 3 times -5. If still stuck, returns clear error message - -**Code**: -```go -if status == "Terminating" { - forceNamespaceCleanup(kubeconfigPath, appName) - time.Sleep(5 * time.Second) - // Retry deploy -} -``` - -### Edge Case 2: Concurrent Delete + Deploy - -**Scenario**: User deletes app, then immediately redeploys. 
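-
-The race is easy to reproduce from a shell (`myapp` is a placeholder), and the timeline below traces why it is safe:
-
-```bash
-# Delete and immediately redeploy without waiting for namespace cleanup
-wild app delete myapp && wild app deploy myapp
-```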
- -**Timeline**: -``` -T=0s: Delete initiated -T=1s: Local state cleaned up -T=2s: User clicks "Deploy" -T=3s: Deploy detects Terminating namespace -T=4s: Deploy force cleanups and retries -T=10s: Deploy succeeds -``` - -**Why This Works**: -- Delete doesn't block on cluster cleanup -- Deploy handles any namespace state -- Eventual consistency at system boundary - -### Edge Case 3: Dependency Not Deployed - -**Scenario**: User tries to deploy app requiring postgres, but postgres isn't deployed. - -**Current Behavior**: Deployment succeeds but pods crash (CrashLoopBackOff). - -**Detection**: -```bash -kubectl get pods -n {app-name} -# Shows: CrashLoopBackOff -kubectl logs {pod-name} -n {app-name} -# Shows: "Connection refused to postgres.postgres.svc.cluster.local" -``` - -**Future Enhancement**: Pre-flight dependency check in Deploy operation. - -### Edge Case 4: Secrets Out of Sync - -**Scenario**: User manually updates password in Kubernetes but not in `secrets.yaml`. - -**Impact**: -- Next deploy overwrites Kubernetes secret -- App may lose access if password changed elsewhere - -**Best Practice**: Always update `secrets.yaml` as source of truth. - -### Edge Case 5: PVC Retention - -**Scenario**: Delete removes namespace but PVCs may persist (depends on reclaim policy). - -**Behavior**: -- PVC with `ReclaimPolicy: Retain` stays after delete -- Redeploy creates new PVC (data orphaned) - -**Resolution**: Document PVC backup/restore procedures. - -## App Package Development Best Practices - -### Security Requirements - -**All pods must include security contexts**: - -```yaml -spec: - template: - spec: - securityContext: - runAsNonRoot: true - runAsUser: 999 # Use appropriate non-root UID - runAsGroup: 999 - seccompProfile: - type: RuntimeDefault - containers: - - name: app - securityContext: - allowPrivilegeEscalation: false - capabilities: - drop: [ALL] - readOnlyRootFilesystem: false # true when possible -``` - -Common UIDs: PostgreSQL/Redis use 999. - -### Database Initialization Pattern - -Apps requiring databases should include `db-init-job.yaml`: - -```yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: myapp-db-init -spec: - template: - spec: - restartPolicy: OnFailure - containers: - - name: db-init - image: postgres:15 - command: - - /bin/bash - - -c - - | - # Create database if doesn't exist - # Create/update user with password - # Grant permissions -``` - -**Critical**: Use idempotent SQL: -- `CREATE DATABASE IF NOT EXISTS` -- `CREATE USER IF NOT EXISTS ... ELSE ALTER USER ... WITH PASSWORD` -- Jobs retry on failure until success - -### Database URL Secrets - -Never use runtime variable substitution - it doesn't work with Kustomize: - -```yaml -# ❌ Wrong: -- name: DB_URL - value: "postgres://user:$(DB_PASSWORD)@host/db" - -# ✅ Correct: -- name: DB_URL - valueFrom: - secretKeyRef: - name: myapp-secrets - key: dbUrl -``` - -Define `dbUrl` in manifest's `defaultSecrets` with template: -```yaml -defaultSecrets: - - key: dbUrl - default: "postgres://{{ .app.dbUser }}:{{ .secrets.dbPassword }}@{{ .app.dbHost }}/{{ .app.dbName }}" -``` - -### External DNS Integration - -Ingresses should include external-dns annotations: - -```yaml -metadata: - annotations: - external-dns.alpha.kubernetes.io/target: {{ .domain }} - external-dns.alpha.kubernetes.io/cloudflare-proxied: "false" -``` - -This creates: `myapp.cloud.example.com` → `cloud.example.com` (CNAME) - -### Converting from Helm Charts - -1. 
Extract and render Helm chart: - ```bash - helm fetch --untar --untardir charts repo/chart-name - helm template --output-dir base --namespace myapp myapp charts/chart-name - ``` - -2. Create Wild Cloud structure: - - Add `namespace.yaml` - - Run `kustomize create --autodetect` - - Create `manifest.yaml` - - Replace values with gomplate variables - - Update labels (remove Helm-style, add Wild Cloud standard) - - Add security contexts - - Add external-dns annotations - -## Testing Strategies - -### Unit Tests - -Test individual operations in isolation: - -```go -func TestDelete_NamespaceNotFound(t *testing.T) { - // Test delete when namespace doesn't exist - // Should succeed without error -} - -func TestDelete_NamespaceTerminating(t *testing.T) { - // Test delete when namespace stuck terminating - // Should force cleanup and succeed -} - -func TestDeploy_NamespaceTerminating(t *testing.T) { - // Test deploy when namespace terminating - // Should retry and eventually succeed -} -``` - -### Integration Tests - -Test cross-system operations: - -```go -func TestDeleteThenDeploy(t *testing.T) { - // 1. Deploy app - // 2. Delete app - // 3. Immediately redeploy - // Should succeed without manual intervention -} - -func TestConcurrentOperations(t *testing.T) { - // Test multiple operations on same app - // File locks should prevent corruption -} -``` - -### Chaos Tests - -Test resilience to failures: - -```go -func TestDeleteWithNetworkPartition(t *testing.T) { - // Simulate network failure during delete - // Local state should still be cleaned up -} - -func TestDeployWithStuckFinalizer(t *testing.T) { - // Manually add finalizer to namespace - // Deploy should detect and force cleanup -} -``` - -## Operational Procedures - -### Manual Inspection - -**Check all state locations**: -```bash -# 1. Local state -ls instance/apps/{app-name}/ -yq eval '.apps.{app-name}' config.yaml -yq eval '.apps.{app-name}' secrets.yaml - -# 2. Cluster state -kubectl get namespace {app-name} -kubectl get all -n {app-name} -kubectl get pvc -n {app-name} -kubectl get secrets -n {app-name} -kubectl get ingress -n {app-name} -``` - -**Check operation status**: -```bash -ls -lt instance/operations/ | head -5 -cat instance/operations/op_deploy_app_{app-name}_*.json -``` - -### Manual Recovery - -**If namespace stuck terminating**: -```bash -# This should never be needed - Deploy handles automatically -# But for understanding: -kubectl get namespace {app-name} -o json | \ - jq '.spec.finalizers = []' | \ - kubectl replace --raw /api/v1/namespaces/{app-name}/finalize -f - -``` - -**If local state corrupted**: -```bash -# Re-add from Wild Directory -wild app add {app-name} -# This regenerates local state from source -``` - -**If secrets lost**: -```bash -# Secrets are auto-generated on add -# If lost, must re-add app (regenerates new secrets) -# Apps will need reconfiguration with new credentials -``` - -## Design Principles - -### 1. Eventual Consistency at Boundaries - -Accept that cluster state and local state may temporarily diverge. Design operations to handle any state. - -### 2. Local State as Source of Truth - -Instance data (config.yaml, secrets.yaml) is authoritative for intended state. Cluster reflects current state. - -### 3. Idempotent Everything - -Every operation should be safely repeatable. Use: -- `kubectl apply` (not `create`) -- `kubectl delete --ignore-not-found` -- `CREATE IF NOT EXISTS` in SQL - -### 4. 
Fail Forward, Not Backward
-
-If an operation partially completes, a retry should make progress (not start over).
-
-### 5. No Indefinite Waits
-
-Operations time out and fail explicitly rather than hanging forever.
-
-### 6. User Never Needs Manual Intervention
-
-Automated recovery from all known edge cases (stuck namespaces, etc.).
-
-## Future Enhancements
-
-### 1. Dependency Validation
-
-Pre-flight check that required apps are deployed:
-```go
-if manifest.Requires != nil {
-    for _, dep := range manifest.Requires {
-        if !isAppDeployed(dep.Name) {
-            return fmt.Errorf("dependency %s not deployed", dep.Name)
-        }
-    }
-}
-```
-
-### 2. State Reconciliation
-
-Periodic background job to ensure consistency:
-```go
-func ReconcileAppState(appName string) {
-    localState := readLocalState(appName)
-    clusterState := readClusterState(appName)
-
-    if !statesMatch(localState, clusterState) {
-        // Alert or auto-correct
-    }
-}
-```
-
-### 3. Backup/Restore Workflows
-
-Built-in PVC backup before delete:
-```bash
-wild app backup {app-name}
-wild app restore {app-name} --from-backup {timestamp}
-```
-
-### 4. Dry-Run Mode
-
-Preview changes without applying:
-```bash
-wild app deploy {app-name} --dry-run
-# Shows: resources that would be created/updated
-```
-
-## Git Workflow Best Practices (Advanced Users)
-
-This section provides operational guidance for advanced users managing Wild Cloud instances as git repositories.
-
-### Initial Repository Setup
-
-```bash
-# Initialize instance directory as git repo
-cd /var/lib/wild-central/instances/my-cloud
-git init
-git add .
-git commit -m "Initial Wild Cloud instance configuration"
-
-# Add Wild Directory as upstream remote
-git remote add wild-directory https://github.com/wildcloud/wild-directory.git
-git fetch wild-directory
-
-# Add origin for your team's instance repo
-git remote add origin git@github.com:myorg/wild-cloud-instances.git
-git push -u origin main
-```
-
-### .gitignore Configuration
-
-```bash
-# Create .gitignore for instance directory
-cat > .gitignore <<'EOF'
-# Never commit secrets or cluster credentials
-secrets.yaml
-kubeconfig
-
-# Runtime artifacts
-operations/
-EOF
-```
-
-### Handling Merge Conflicts
-
-When merging Wild Directory updates (State 3c), conflicts in customized manifests are resolved with standard git tools. A typical case keeps local production values over new upstream defaults:
-
-```yaml
-# Conflict in apps/myapp/deployment.yaml
-<<<<<<< HEAD
-resources:
-  limits:
-    cpu: "4000m"     # Our production values
-    memory: "16Gi"
-  requests:
-    cpu: "2000m"
-    memory: "8Gi"
-=======
-resources:
-  limits:
-    cpu: "1000m"     # New Wild Directory defaults
-    memory: "2Gi"
->>>>>>> wild-directory/main
-
-# Resolution: Keep our production values
-resources:
-  limits:
-    cpu: "4000m"
-    memory: "16Gi"
-  requests:
-    cpu: "2000m"
-    memory: "8Gi"
-```
-
-### Commit Message Conventions
-
-**Format**: `<type>(<scope>): <description>`
-
-**Types**:
-- `feat`: New app or feature
-- `fix`: Bug fix or correction
-- `config`: Configuration change
-- `scale`: Resource scaling
-- `upgrade`: Version upgrade
-- `security`: Security-related change
-- `docs`: Documentation change
-
-**Examples**:
-```bash
-git commit -m "feat(redis): Add Redis cache for session storage"
-git commit -m "scale(postgres): Increase CPU limits for production load"
-git commit -m "fix(ghost): Correct domain configuration for SSL"
-git commit -m "upgrade(immich): Update to v1.2.0 with new ML features"
-git commit -m "security(all): Rotate database passwords"
-git commit -m "config(mastodon): Enable SMTP for email notifications"
-```
-
-### Rollback Procedures
-
-**Rollback entire app configuration**:
-```bash
-# Find commit to rollback to
-git log --oneline -- apps/myapp/
-
-# Revert specific commit
-git revert abc123
-
-# Or rollback to specific point
-git checkout abc123 -- apps/myapp/
-git commit -m "rollback(myapp): Revert to stable configuration"
-
-# Deploy reverted state
-wild app deploy myapp
-```
-
-**Emergency rollback (production incident)**:
-```bash
-# Immediately revert to last known good state
-git log --oneline -5
-git reset --hard abc123   # Last working commit
-wild app deploy myapp
-
-# Document the incident
-git commit --allow-empty -m "emergency: Rolled back myapp due to production incident"
-git push --force origin main   # Coordinate with other admins before force-pushing
-```
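-
-A lightweight guard against the most damaging mistake in this workflow is to reject commits that stage secrets. The hook below is a sketch, not part of Wild Cloud; it assumes secrets live in `secrets.yaml` at the repository root, as in the layout above:
-
-```bash
-# Refuse any commit that stages secrets.yaml
-cat > .git/hooks/pre-commit <<'EOF'
-#!/bin/sh
-if git diff --cached --name-only | grep -q '^secrets\.yaml$'; then
-  echo "refusing to commit secrets.yaml (keep secrets out of git)" >&2
-  exit 1
-fi
-EOF
-chmod +x .git/hooks/pre-commit
-```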
-
-### Collaboration Patterns
-
-**Multiple admins working on same cluster**:
-
-```bash
-# Always pull before making changes
-git pull origin main
-
-# Use descriptive branch names
-git checkout -b alice/add-monitoring
-git checkout -b bob/upgrade-postgres
-
-# Push branches for review
-git push origin alice/add-monitoring
-
-# Use PRs/MRs for review before merging to main
-# This prevents conflicts and ensures peer review
-```
-
-**Code review checklist**:
-- [ ] Changes tested in non-production environment
-- [ ] Resource limits appropriate for workload
-- [ ] Secrets not committed
-- [ ] Dependencies deployed (if new app)
-- [ ] Commit message follows conventions
-- [ ] Breaking changes documented
-
-### Backup and Disaster Recovery
-
-**Regular backups**:
-```bash
-# Create tagged backup of current state
-git tag -a backup-$(date +%Y%m%d) -m "Daily backup"
-git push origin backup-$(date +%Y%m%d)
-
-# Automated daily backup (cron)
-0 2 * * * cd /var/lib/wild-central/instances/my-cloud && git tag backup-$(date +%Y%m%d-%H%M) && git push origin --tags
-```
-
-**Disaster recovery**:
-```bash
-# Clone instance repository to new Wild Central device
-git clone git@github.com:myorg/wild-cloud-instances.git /var/lib/wild-central/instances/my-cloud
-
-# Restore secrets from secure backup (NOT in git)
-# (From password manager, Vault, encrypted backup, etc.)
-cp ~/secure-backup/secrets.yaml /var/lib/wild-central/instances/my-cloud/
-
-# Deploy all apps
-cd /var/lib/wild-central/instances/my-cloud
-for app in apps/*/; do
-  wild app deploy $(basename $app)
-done
-```
-
-### Git Workflow vs Web UI
-
-**When git is better**:
-- Complex changes requiring review
-- Multi-app updates
-- Compliance/audit requirements
-- Team collaboration
-- Emergency rollbacks
-
-**When Web UI is better**:
-- Quick configuration tweaks
-- Adding single app
-- Viewing current state
-- Non-technical team members
-
-**Hybrid approach**: Advanced users can use git for complex changes and the Web UI for quick operations. The two workflows coexist peacefully since both modify the same instance directory.
-
-## Conclusion
-
-Wild Cloud's app lifecycle management spans three independent systems with different consistency guarantees. By understanding these systems and their boundaries, we can design operations that are:
-
-- **Reliable**: Handle edge cases automatically
-- **Simple**: Two-phase operations (cluster + local)
-- **Safe**: Idempotent and recoverable
-- **Fast**: Don't wait unnecessarily for eventual consistency
-
-Additionally, for advanced users, the git-based workflow provides:
-- **Auditable**: Full version control history
-- **Collaborative**: Standard git workflows for team management
-- **Recoverable**: Git revert/rollback capabilities
-- **Professional**: Infrastructure-as-code best practices
-
-The key insight is accepting eventual consistency at system boundaries while maintaining immediate consistency within each system. This allows operations to complete quickly for users while ensuring the system eventually reaches a consistent state.
diff --git a/wild-cloud b/wild-cloud
index c200061..96dfaaf 160000
--- a/wild-cloud
+++ b/wild-cloud
@@ -1 +1 @@
-Subproject commit c20006192eb9e9a5b18e33df11c1013a83a22502
+Subproject commit 96dfaaf07c35606007870354a80c745f031ac7ad