Files

Paul Payne 3f97dce86a docs: Update all guides to reflect current CLI, API, and web app

Rewrote backup/restore guides to document current system (native
pg_dump/Longhorn/tar.gz tools, blue-green restore, scheduling) and
remove outdated restic references. Rewrote monitoring guide to replace
K3s/Helm/Velero placeholders with actual capabilities. Filled in all
four upgrade guides (Talos, Kubernetes, applications, Wild Cloud) that
were previously TBD stubs. Expanded troubleshooting guides with correct
namespaces, Wild Cloud CLI commands, and Talos-specific diagnostics.
Added verification commands to cluster networking health checklist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-24 21:54:11 +00:00

3.2 KiB

Raw Blame History

Cluster Networking Health Checklist

Verifying every item on this list confirms the full networking stack is functioning correctly, from node-level overlay through DNS to external ingress.

Node Layer

All nodes Ready — no cordons, no taints (e.g., maintenance:NoExecute)
```
kubectl get nodes
wild node list
```
Flannel pods running on every node — stale VXLAN tunnels break cross-node pod traffic
```
kubectl get pods -n kube-system -l app=flannel -o wide
```
Cross-node pod connectivity — pods on each worker can reach pods on every other node

Service Routing

kube-proxy pods running on every node — nftables rules route ClusterIP traffic to pod endpoints
```
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
```
CoreDNS pods running and resolving — both cluster-internal names (*.svc.cluster.local) and external names
```
kubectl get pods -n kube-system -l k8s-app=kube-dns
```
CoreDNS upstream reachability — Talos DNS proxy at 169.254.116.108 responding from all nodes

Load Balancing

MetalLB speakers running on all nodes — L2 ARP announcements for LoadBalancer IPs
```
kubectl get pods -n metallb-system -l component=speaker -o wide
```
MetalLB ServiceL2Status resources valid — status.node matches actual pod placement (stale entries block announcements)
```
kubectl get servicel2statuses.metallb.io -n metallb-system
```
LoadBalancer IPs reachable — Traefik LB IP responds from LAN
```
kubectl get svc -n traefik
curl -k https://<traefik-lb-ip>
```

Ingress & Security

Traefik ingress routing — forwards to backend services, TLS termination working

kubectl get pods -n traefik
kubectl logs -n traefik -l app=traefik | tail -20

CrowdSec LAPI running — can reach api.crowdsec.net (depends on CoreDNS external resolution)
```
kubectl get pods -n crowdsec
```
CrowdSec bouncer registered with LAPI — unregistered bouncer blocks all forwardAuth requests
```
wild service logs crowdsec | grep bouncer
```

Storage

Longhorn managers running on all workers — enables volume replica scheduling and rebuilds
```
kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide
```
Longhorn volume replicas healthy — all volumes at target replica count across nodes
```
kubectl get volumes.longhorn.io -n longhorn-system
```

External DNS & Certificates

ExternalDNS pod running — creating and updating DNS records at Cloudflare
```
kubectl get pods -n externaldns
```

cert-manager pods running — issuing and renewing TLS certificates

kubectl get pods -n cert-manager
kubectl get certificates -n cert-manager

LAN DNS

dnsmasq on Wild Central — resolves LAN-local domains to correct LoadBalancer IPs (hairpin NAT)
```
wild dns status
```

Quick Full Check

Run wild cluster health for an automated check of the most critical items. For a comprehensive check, walk through each item above.

3.2 KiB Raw Blame History