Update docs.

2025-08-31 14:30:09 -07:00
parent 3b8b6de338
commit 1aa9f1050d
22 changed files with 230 additions and 1083 deletions
--- a/docs/guides/app-workflow.md
+++ b/docs/guides/app-workflow.md
@@ -43,4 +43,4 @@ wild-app-deploy <app>  # Deploys to Kubernetes

 ## App Directory Structure

-Your wild-cloud apps are stored in the `apps/` directory. You can change them however you like. You should keep them all in git and make commits anytime you change something. Some `wild` commands will overwrite files in your app directory (like when you are updating apps, or updating your configuration) so you'll want to review any changes made to your files after using them using `git`.
+Your wild-cloud apps are stored in the `apps/` directory. You can change them however you like. Keep them all in git and commit whenever you change something. Some `wild` commands overwrite files in your app directory (for example, when you update apps or your configuration), so use `git` to review any changes to your files after running them.
--- a/docs/guides/backup-and-restore.md
+++ b/docs/guides/backup-and-restore.md
@@ -0,0 +1,3 @@
+# Backup and Restore
+
+TBD
--- a/docs/guides/monitoring.md
+++ b/docs/guides/monitoring.md
@@ -0,0 +1,50 @@
+# System Health Monitoring
+
+## Basic Monitoring
+
+Check system health with:
+
+```bash
+# Node resource usage
+kubectl top nodes
+
+# Pod resource usage
+kubectl top pods -A
+
+# Persistent volume claims
+kubectl get pvc -A
+```
+
+## Advanced Monitoring (Future Implementation)
+
+Consider implementing:
+
+1. **Prometheus + Grafana** for comprehensive monitoring:
+   ```bash
+   # Placeholder for future implementation
+   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+   helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
+   ```
+
+2. **Loki** for log aggregation:
+   ```bash
+   # Placeholder for future implementation
+   helm repo add grafana https://grafana.github.io/helm-charts
+   helm install loki grafana/loki-stack --namespace logging --create-namespace
+   ```
+
+## Additional Resources
+
+This document will be expanded in the future with:
+
+- Detailed backup and restore procedures
+- Monitoring setup instructions
+- Comprehensive security hardening guide
+- Automated maintenance scripts
+
+For now, refer to the following external resources:
+
+- [K3s Documentation](https://docs.k3s.io/)
+- [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug/)
+- [Velero Backup Documentation](https://velero.io/docs/latest/)
+- [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/)
--- a/docs/guides/node-setup.md
+++ b/docs/guides/node-setup.md
@@ -1,246 +0,0 @@
-# Node Setup Guide
-
-This guide covers setting up Talos Linux nodes for your Kubernetes cluster using USB boot.
-
-## Overview
-
-There are two main approaches for booting Talos nodes:
-
-1. **USB Boot** (covered here) - Boot from a custom USB drive with system extensions
-2. **PXE Boot** - Network boot using dnsmasq setup (see `setup/dnsmasq/README.md`)
-
-## USB Boot Setup
-
-### Prerequisites
-
- Target hardware for Kubernetes nodes
- USB drive (8GB+ recommended)
- Admin access to create bootable USB drives
-
-### Step 1: Upload Schematic and Download Custom Talos ISO
-
-First, upload the system extensions schematic to Talos Image Factory, then download the custom ISO.
-
-```bash
-# Upload schematic configuration to get schematic ID
-wild-talos-schema
-
-# Download custom ISO with system extensions
-wild-talos-iso
-```
-
-The custom ISO includes system extensions (iscsi-tools, util-linux-tools, intel-ucode, gvisor) needed for the cluster and is saved to `.wildcloud/iso/talos-v1.10.3-metal-amd64.iso`.
-
-### Step 2: Create Bootable USB Drive
-
-#### Linux (Recommended)
-
-```bash
-# Find your USB device (be careful to select the right device!)
-lsblk
-sudo dmesg | tail  # Check for recently connected USB devices
-
-# Create bootable USB (replace /dev/sdX with your USB device)
-sudo dd if=.wildcloud/iso/talos-v1.10.3-metal-amd64.iso of=/dev/sdX bs=4M status=progress sync
-
-# Verify the write completed
-sync
-```
-
-**⚠️ Warning**: Double-check the device path (`/dev/sdX`). Writing to the wrong device will destroy data!
-
-#### macOS
-
-```bash
-# Find your USB device
-diskutil list
-
-# Unmount the USB drive (replace diskX with your USB device)
-diskutil unmountDisk /dev/diskX
-
-# Create bootable USB
-sudo dd if=.wildcloud/iso/talos-v1.10.3-metal-amd64.iso of=/dev/rdiskX bs=4m
-
-# Eject when complete
-diskutil eject /dev/diskX
-```
-
-#### Windows
-
-Use one of these tools:
-
-1. **Rufus** (Recommended)
-
-   - Download from https://rufus.ie/
-   - Select the Talos ISO file
-   - Choose your USB drive
-   - Use "DD Image" mode
-   - Click "START"
-
-2. **Balena Etcher**
-
-   - Download from https://www.balena.io/etcher/
-   - Flash from file → Select Talos ISO
-   - Select target USB drive
-   - Flash!
-
-3. **Command Line** (Windows 10/11)
-
-   ```cmd
-   # List disks to find USB drive number
-   diskpart
-   list disk
-   exit
-
-   # Write ISO (replace X with your USB disk number)
-   dd if=.wildcloud\iso\talos-v1.10.3-metal-amd64.iso of=\\.\PhysicalDriveX bs=4M --progress
-   ```
-
-### Step 3: Boot Target Machine
-
-1. **Insert USB** into target machine
-2. **Boot from USB**:
-   - Restart machine and enter BIOS/UEFI (usually F2, F12, DEL, or ESC during startup)
-   - Change boot order to prioritize USB drive
-   - Or use one-time boot menu (usually F12)
-3. **Talos will boot** in maintenance mode with a DHCP IP
-
-### Step 4: Hardware Detection and Configuration
-
-Once the machine boots, it will be in maintenance mode with a DHCP IP address.
-
-```bash
-# Find the node's maintenance IP (check your router/DHCP server)
-# Then detect hardware and register the node
-cd setup/cluster-nodes
-./detect-node-hardware.sh <maintenance-ip> <node-number>
-
-# Example: Node got DHCP IP 192.168.8.150, registering as node 1
-./detect-node-hardware.sh 192.168.8.150 1
-```
-
-This script will:
-
- Discover network interface names (e.g., `enp4s0`)
- List available disks for installation
- Update `config.yaml` with node-specific hardware settings
-
-### Step 5: Generate and Apply Configuration
-
-```bash
-# Generate machine configurations with detected hardware
-./generate-machine-configs.sh
-
-# Apply configuration (node will reboot with static IP)
-talosctl apply-config --insecure -n <maintenance-ip> --file final/controlplane-node-<number>.yaml
-
-# Example:
-talosctl apply-config --insecure -n 192.168.8.150 --file final/controlplane-node-1.yaml
-```
-
-### Step 6: Verify Installation
-
-After reboot, the node should come up with its assigned static IP:
-
-```bash
-# Check connectivity (node 1 should be at 192.168.8.31)
-ping 192.168.8.31
-
-# Verify system extensions are installed
-talosctl -e 192.168.8.31 -n 192.168.8.31 get extensions
-
-# Check for iscsi tools
-talosctl -e 192.168.8.31 -n 192.168.8.31 list /usr/local/bin/ | grep iscsi
-```
-
-## Repeat for Additional Nodes
-
-For each additional control plane node:
-
-1. Boot with the same USB drive
-2. Run hardware detection with the new maintenance IP and node number
-3. Generate and apply configurations
-4. Verify the node comes up at its static IP
-
-Example for node 2:
-
-```bash
-./detect-node-hardware.sh 192.168.8.151 2
-./generate-machine-configs.sh
-talosctl apply-config --insecure -n 192.168.8.151 --file final/controlplane-node-2.yaml
-```
-
-## Cluster Bootstrap
-
-Once all control plane nodes are configured:
-
-```bash
-# Bootstrap the cluster using the VIP
-talosctl bootstrap -n 192.168.8.30
-
-# Get kubeconfig
-talosctl kubeconfig
-
-# Verify cluster
-kubectl get nodes
-```
-
-## Troubleshooting
-
-### USB Boot Issues
-
- **Machine won't boot from USB**: Check BIOS boot order, disable Secure Boot if needed
- **Talos doesn't start**: Verify ISO was written correctly, try re-creating USB
- **Network issues**: Ensure DHCP is available on your network
-
-### Hardware Detection Issues
-
- **Node not accessible**: Check IP assignment, firewall settings
- **Wrong interface detected**: Manual override in `config.yaml` if needed
- **Disk not found**: Verify disk size (must be >10GB), check disk health
-
-### Installation Issues
-
- **Static IP not assigned**: Check network configuration in machine config
- **Extensions not installed**: Verify ISO includes extensions, check upgrade logs
- **Node won't join cluster**: Check certificates, network connectivity to VIP
-
-### Checking Logs
-
-```bash
-# View system logs
-talosctl -e <node-ip> -n <node-ip> logs machined
-
-# Check kernel messages
-talosctl -e <node-ip> -n <node-ip> dmesg
-
-# Monitor services
-talosctl -e <node-ip> -n <node-ip> get services
-```
-
-## System Extensions Included
-
-The custom ISO includes these extensions:
-
- **siderolabs/iscsi-tools**: iSCSI initiator tools for persistent storage
- **siderolabs/util-linux-tools**: Utility tools including fstrim for storage
- **siderolabs/intel-ucode**: Intel CPU microcode updates (harmless on AMD)
- **siderolabs/gvisor**: Container runtime sandbox (optional security enhancement)
-
-These extensions enable:
-
- Longhorn distributed storage
- Improved security isolation
- CPU microcode updates
- Storage optimization tools
-
-## Next Steps
-
-After all nodes are configured:
-
-1. **Install CNI**: Deploy a Container Network Interface (Cilium, Calico, etc.)
-2. **Install CSI**: Deploy Container Storage Interface (Longhorn for persistent storage)
-3. **Deploy workloads**: Your applications and services
-4. **Monitor cluster**: Set up monitoring and logging
-
-See the main project documentation for application deployment guides.
--- a/docs/guides/security.md
+++ b/docs/guides/security.md
@@ -0,0 +1,46 @@
+# Security
+
+## Best Practices
+
+1. **Keep Everything Updated**:
+   - Regularly update K3s
+   - Update all infrastructure components
+   - Keep application images up to date
+
+2. **Network Security**:
+   - Use internal services whenever possible
+   - Limit exposed services to only what's necessary
+   - Configure your home router's firewall properly
+
+3. **Access Control**:
+   - Use strong passwords for all services
+   - Implement a secrets management strategy
+   - Rotate API tokens and keys regularly
+
+4. **Regular Audits**:
+   - Review running services periodically
+   - Check for unused or outdated deployments
+   - Monitor resource usage for anomalies
+
+## Security Scanning (Future Implementation)
+
+Tools to consider implementing:
+
+1. **Trivy** for image scanning:
+   ```bash
+   # Example Trivy usage (placeholder)
+   trivy image <your-image>
+   ```
+
+2. **kube-bench** for Kubernetes security checks:
+   ```bash
+   # Example kube-bench usage (placeholder)
+   kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
+   ```
+
+3. **Falco** for runtime security monitoring:
+   ```bash
+   # Example Falco installation (placeholder)
+   helm repo add falcosecurity https://falcosecurity.github.io/charts
+   helm install falco falcosecurity/falco --namespace falco --create-namespace
+   ```
--- a/docs/guides/taslos.md
+++ b/docs/guides/taslos.md
@@ -0,0 +1,18 @@
+# Talos
+
+
+## System Extensions Included
+
+The custom ISO includes these extensions:
+
+- **siderolabs/iscsi-tools**: iSCSI initiator tools for persistent storage
+- **siderolabs/util-linux-tools**: Utility tools including fstrim for storage
+- **siderolabs/intel-ucode**: Intel CPU microcode updates (harmless on AMD)
+- **siderolabs/gvisor**: Container runtime sandbox (optional security enhancement)
+
+These extensions enable:
+
+- Longhorn distributed storage
+- Improved security isolation
+- CPU microcode updates
+- Storage optimization tools
--- a/docs/guides/troubleshoot-cluster.md
+++ b/docs/guides/troubleshoot-cluster.md
@@ -0,0 +1,19 @@
+# Troubleshoot Wild Cloud Cluster Issues
+
+## General Troubleshooting Steps
+
+1. **Check Node Status**:
+   ```bash
+   kubectl get nodes
+   kubectl describe node <node-name>
+   ```
+
+2. **Check Component Status**:
+   ```bash
+   # Check all pods across all namespaces
+   kubectl get pods -A
+
+   # Look for pods that aren't Running or Completed
+   kubectl get pods -A | grep -v "Running\|Completed"
+   ```
+
--- a/docs/guides/troubleshoot-dns.md
+++ b/docs/guides/troubleshoot-dns.md
@@ -0,0 +1,20 @@
+# Troubleshoot DNS
+
+If DNS resolution isn't working properly:
+
+1. Check CoreDNS status:
+   ```bash
+   kubectl get pods -n kube-system -l k8s-app=kube-dns
+   kubectl logs -l k8s-app=kube-dns -n kube-system
+   ```
+
+2. Verify CoreDNS configuration:
+   ```bash
+   kubectl get configmap -n kube-system coredns -o yaml
+   ```
+
+3. Test DNS resolution from inside the cluster:
+   ```bash
+   kubectl run -i --tty --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
+   ```
+
--- a/docs/guides/troubleshoot-service-connectivity.md
+++ b/docs/guides/troubleshoot-service-connectivity.md
@@ -0,0 +1,18 @@
+# Troubleshoot Service Connectivity
+
+If services can't communicate:
+
+1. Check network policies:
+   ```bash
+   kubectl get networkpolicies -A
+   ```
+
+2. Verify service endpoints:
+   ```bash
+   kubectl get endpoints -n <namespace>
+   ```
+
+3. Test connectivity from within the cluster:
+   ```bash
+   kubectl run -i --tty --rm debug --image=busybox --restart=Never -- wget -O- <service-name>.<namespace>
+   ```
--- a/docs/guides/troubleshoot-tls-certificates.md
+++ b/docs/guides/troubleshoot-tls-certificates.md
@@ -0,0 +1,24 @@
+# Troubleshoot TLS Certificates
+
+If services show invalid certificates:
+
+1. Check certificate status:
+   ```bash
+   kubectl get certificates -A
+   ```
+
+2. Examine certificate details:
+   ```bash
+   kubectl describe certificate <cert-name> -n <namespace>
+   ```
+
+3. Check for cert-manager issues:
+   ```bash
+   kubectl get pods -n cert-manager
+   kubectl logs -l app=cert-manager -n cert-manager
+   ```
+
+4. Verify the Cloudflare API token is correctly set up:
+   ```bash
+   kubectl get secret cloudflare-api-token -n internal
+   ```
--- a/docs/guides/troubleshoot-visibility.md
+++ b/docs/guides/troubleshoot-visibility.md
@@ -0,0 +1,246 @@
+# Troubleshoot Service Visibility
+
+This guide covers common issues with accessing services from outside the cluster and how to diagnose and fix them.
+
+## Common Issues
+
+External access to your services might fail for several reasons:
+
+1. **DNS Resolution Issues** - Domain names not resolving to the correct IP address
+2. **Network Connectivity Issues** - Traffic can't reach the cluster's external IP
+3. **TLS Certificate Issues** - Invalid or missing certificates
+4. **Ingress/Service Configuration Issues** - Incorrectly configured routing
+
+## Diagnostic Steps
+
+### 1. Check DNS Resolution
+
+**Symptoms:**
+
+- Browser shows "site cannot be reached" or "server IP address could not be found"
+- `ping` or `nslookup` commands fail for your domain
+- Your service DNS records don't appear in Cloudflare (or your DNS provider)
+
+**Checks:**
+
+```bash
+# Check if your domain resolves (from outside the cluster)
+nslookup yourservice.yourdomain.com
+
+# Check if ExternalDNS is running
+kubectl get pods -n externaldns
+
+# Check ExternalDNS logs for errors
+kubectl logs -n externaldns -l app=external-dns | grep -i error
+kubectl logs -n externaldns -l app=external-dns | grep -i "your-service-name"
+
+# Check that the Cloudflare API token secret exists
+kubectl get secret cloudflare-api-token -n externaldns
+```
+
+**Common Issues:**
+
+a) **ExternalDNS Not Running**: The ExternalDNS pod is not running or has errors.
+
+b) **Cloudflare API Token Issues**: The API token is invalid, expired, or doesn't have the right permissions.
+
+c) **Domain Filter Mismatch**: ExternalDNS is configured with a `--domain-filter` that doesn't match your domain.
+
+d) **Annotations Missing**: Service or Ingress is missing the required ExternalDNS annotations.
+
+**Solutions:**
+
+```bash
+# 1. Recreate the Cloudflare API token secret
+kubectl create secret generic cloudflare-api-token \
+  --namespace externaldns \
+  --from-literal=api-token="your-api-token" \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# 2. Check and set proper annotations on your Ingress:
+kubectl annotate ingress your-ingress -n your-namespace \
+  external-dns.alpha.kubernetes.io/hostname=your-service.your-domain.com
+
+# 3. Restart ExternalDNS
+kubectl rollout restart deployment -n externaldns external-dns
+```
+
+### 2. Check Network Connectivity
+
+**Symptoms:**
+
+- DNS resolves to the correct IP but the service is still unreachable
+- Only some services are unreachable while others work
+- Network timeout errors
+
+**Checks:**
+
+```bash
+# Check if MetalLB is running
+kubectl get pods -n metallb-system
+
+# Check MetalLB IP address pool
+kubectl get ipaddresspools.metallb.io -n metallb-system
+
+# Verify the service has an external IP
+kubectl get svc -n your-namespace your-service
+```
+
+**Common Issues:**
+
+a) **MetalLB Configuration**: The IP pool doesn't match your network or is exhausted.
+
+b) **Firewall Issues**: Firewall is blocking traffic to your cluster's external IP.
+
+c) **Router Configuration**: NAT or port forwarding issues if using a router.
+
+**Solutions:**
+
+```bash
+# 1. Check and update MetalLB configuration
+kubectl apply -f infrastructure_setup/metallb/metallb-pool.yaml
+
+# 2. Check service external IP assignment
+kubectl describe svc -n your-namespace your-service
+```
+
+### 3. Check TLS Certificates
+
+**Symptoms:**
+
+- Browser shows certificate errors
+- "Your connection is not private" warnings
+- Cert-manager logs show errors
+
+**Checks:**
+
+```bash
+# Check certificate status
+kubectl get certificates -A
+
+# Check cert-manager logs
+kubectl logs -n cert-manager -l app=cert-manager
+
+# Check if your ingress is using the correct certificate
+kubectl get ingress -n your-namespace your-ingress -o yaml
+```
+
+**Common Issues:**
+
+a) **Certificate Issuance Failures**: DNS validation or HTTP validation failing.
+
+b) **Wrong Secret Referenced**: Ingress is referencing a non-existent certificate secret.
+
+c) **Expired Certificate**: Certificate has expired and wasn't renewed.
+
+**Solutions:**
+
+```bash
+# 1. Check and recreate certificates
+kubectl apply -f infrastructure_setup/cert-manager/wildcard-certificate.yaml
+
+# 2. Update ingress to use correct secret
+kubectl patch ingress your-ingress -n your-namespace --type=json \
+  -p='[{"op": "replace", "path": "/spec/tls/0/secretName", "value": "correct-secret-name"}]'
+```
+
+### 4. Check Ingress Configuration
+
+**Symptoms:**
+
+- HTTP 404, 503, or other error codes
+- Service accessible from inside cluster but not outside
+- Traffic routed to wrong service
+
+**Checks:**
+
+```bash
+# Check ingress status
+kubectl get ingress -n your-namespace
+
+# Check Traefik logs
+kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
+
+# Check ingress configuration
+kubectl describe ingress -n your-namespace your-ingress
+```
+
+**Common Issues:**
+
+a) **Incorrect Service Targeting**: Ingress is pointing to wrong service or port.
+
+b) **Traefik Configuration**: IngressClass or middleware issues.
+
+c) **Path Configuration**: Incorrect path prefixes or regex.
+
+**Solutions:**
+
+```bash
+# 1. Verify ingress configuration
+kubectl edit ingress -n your-namespace your-ingress
+
+# 2. Check that the referenced service exists
+kubectl get svc -n your-namespace
+
+# 3. Restart Traefik if needed
+kubectl rollout restart deployment -n kube-system traefik
+```
+
+## Advanced Diagnostics
+
+For more complex issues, you can use port-forwarding to test services directly:
+
+```bash
+# Port-forward the service directly
+kubectl port-forward -n your-namespace svc/your-service 8080:80
+
+# Then test locally
+curl http://localhost:8080
+```
+
+You can also deploy a debug pod to test connectivity from inside the cluster:
+
+```bash
+# Start a debug pod
+kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
+
+# Inside the pod, test DNS and connectivity
+nslookup your-service.your-namespace.svc.cluster.local
+wget -O- http://your-service.your-namespace.svc.cluster.local
+```
+
+## ExternalDNS Specifics
+
+ExternalDNS can be particularly troublesome. Here are specific debugging steps:
+
+1. **Check Log Level**: Set `--log-level=debug` for more detailed logs
+2. **Check Domain Filter**: Ensure `--domain-filter` includes your domain
+3. **Check Provider**: Ensure `--provider=cloudflare` (or your DNS provider)
+4. **Verify API Permissions**: The Cloudflare token needs Zone:Zone:Read and Zone:DNS:Edit permissions
+5. **Check TXT Records**: ExternalDNS uses TXT records for ownership tracking
+
+```bash
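+# Restart with verbose logging (assumes the first container already defines an `args` list)
+kubectl patch deployment external-dns -n externaldns --type=json -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--log-level=debug"}]'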
+
+# Check for specific domain errors
+kubectl logs -n externaldns -l app=external-dns | grep -i yourservice.yourdomain.com
+```
+
+## Cloudflare-Specific Issues
+
+When using Cloudflare, additional issues may arise:
+
+1. **API Rate Limiting**: Cloudflare may rate limit frequent API calls
+2. **DNS Propagation**: Changes may take time to propagate through Cloudflare's network
+3. **Proxied Records**: The `external-dns.alpha.kubernetes.io/cloudflare-proxied` annotation controls whether Cloudflare proxies traffic
+4. **Access Restrictions**: Cloudflare Access or Page Rules may restrict access
+5. **API Token Permissions**: The token must have Zone:Zone:Read and Zone:DNS:Edit permissions
+6. **Zone Detection**: If using subdomains, ensure the parent domain is included in the domain filter
+
+Check the Cloudflare dashboard for:
+
+- DNS record existence
+- API access logs
+- DNS settings including proxy status
+- Any error messages or rate limit warnings
--- a/docs/guides/upgrade-applications.md
+++ b/docs/guides/upgrade-applications.md
@@ -0,0 +1,3 @@
+# Upgrade Applications
+
+TBD
--- a/docs/guides/upgrade-kubernetes.md
+++ b/docs/guides/upgrade-kubernetes.md
@@ -0,0 +1,3 @@
+# Upgrade Kubernetes
+
+TBD
--- a/docs/guides/upgrade-talos.md
+++ b/docs/guides/upgrade-talos.md
@@ -0,0 +1,3 @@
+# Upgrade Talos
+
+TBD
--- a/docs/guides/upgrade-wild-cloud.md
+++ b/docs/guides/upgrade-wild-cloud.md
@@ -0,0 +1,3 @@
+# Upgrade Wild Cloud
+
+TBD