Cluster Networking Health Checklist
Verifying every item on this list confirms the full networking stack is functioning correctly, from node-level overlay through DNS to external ingress.
Node Layer
- All nodes Ready: no cordons, no taints (e.g., `maintenance:NoExecute`)
    kubectl get nodes
    wild node list
- Flannel pods running on every node: stale VXLAN tunnels break cross-node pod traffic
    kubectl get pods -n kube-system -l app=flannel -o wide
- Cross-node pod connectivity: pods on each worker can reach pods on every other node
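The "running on every node" items in this checklist can be checked mechanically by comparing the node list with the nodes that host a matching pod. The following is a minimal sketch, not part of the Wild Cloud tooling: the function name `nodes_without_pod` is hypothetical, and it assumes `kubectl` is already pointed at the cluster.

```shell
# Print every node that has no Running pod matching a label selector.
# Hypothetical helper. Usage: nodes_without_pod <namespace> <label-selector>
nodes_without_pod() {
  ns=$1
  selector=$2
  # Names of nodes hosting a Running pod that matches the selector, one per line.
  covered=$(kubectl get pods -n "$ns" -l "$selector" \
    --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}')
  # Walk all node names and print the ones missing from $covered.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' |
    while read -r node; do
      [ -n "$node" ] || continue
      printf '%s\n' "$covered" | grep -Fxq -- "$node" || printf '%s\n' "$node"
    done
}
```

For example, `nodes_without_pod kube-system app=flannel` prints nothing when every node runs a Flannel pod. Empty output only shows pod placement; it does not prove that traffic actually crosses nodes.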
Service Routing
- kube-proxy pods running on every node: nftables rules route ClusterIP traffic to pod endpoints
    kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
- CoreDNS pods running and resolving: both cluster-internal names (`*.svc.cluster.local`) and external names
    kubectl get pods -n kube-system -l k8s-app=kube-dns
- CoreDNS upstream reachability: Talos DNS proxy at `169.254.116.108` responding from all nodes
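Running CoreDNS pods do not by themselves show that names resolve. One way to test resolution is to run `nslookup` from a throwaway pod, sketched below under several assumptions: the function name `dns_probe` and the pod name are hypothetical, the `busybox:1.28` image is assumed to be pullable, and whether `169.254.116.108` is reachable from inside a pod depends on how Talos host DNS is configured.

```shell
# Resolve a name from inside the cluster using a short-lived busybox pod.
# Hypothetical helper. Usage: dns_probe <name> [dns-server]
dns_probe() {
  name=$1
  server=${2:-}
  label=$name
  if [ -n "$server" ]; then
    label="$name via $server"
  fi
  pod="dns-probe-$$"   # assumed pod name, unique per shell process
  # $server is left unquoted on purpose so an empty value adds no argument.
  if kubectl run "$pod" --rm -i --restart=Never --quiet \
      --image=busybox:1.28 -- nslookup "$name" $server >/dev/null 2>&1; then
    echo "ok $label"
  else
    echo "FAIL $label"
    return 1
  fi
}
```

A typical pass would probe one internal name (`dns_probe kubernetes.default.svc.cluster.local`), one external name, and then an external name with `169.254.116.108` as the second argument. The probe runs on whichever node the scheduler picks, so it does not cover every node on its own.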
Load Balancing
- MetalLB speakers running on all nodes: L2 ARP announcements for LoadBalancer IPs
    kubectl get pods -n metallb-system -l component=speaker -o wide
- MetalLB ServiceL2Status resources valid: `status.node` matches actual pod placement (stale entries block announcements)
    kubectl get servicel2statuses.metallb.io -n metallb-system
- LoadBalancer IPs reachable: Traefik LB IP responds from LAN
    kubectl get svc -n traefik
    curl -k https://<traefik-lb-ip>
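The reachability check can be given a timeout and a clear pass/fail result. This is a sketch only: the function name `lb_reachable` is hypothetical, and it treats any HTTP status as proof that the LoadBalancer IP answers, since Traefik may return 404 for a request with no matching route.

```shell
# Report whether an HTTPS endpoint answers within a timeout.
# Hypothetical helper. Usage: lb_reachable <ip-or-host> [timeout-seconds]
lb_reachable() {
  target=$1
  timeout=${2:-5}
  # -k skips certificate verification, as in the checklist command;
  # -w prints only the HTTP status code, which curl reports as 000 on connection failure.
  code=$(curl -k -s -o /dev/null --max-time "$timeout" \
    -w '%{http_code}' "https://$target") || code=000
  if [ "$code" != "000" ]; then
    echo "reachable: $target (HTTP $code)"
  else
    echo "unreachable: $target"
    return 1
  fi
}
```

Run it from a machine on the LAN rather than from a cluster node, so that the request depends on the L2 announcement.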
Ingress & Security
- Traefik ingress routing: forwards to backend services, TLS termination working
    kubectl get pods -n traefik
    kubectl logs -n traefik -l app=traefik | tail -20
- CrowdSec LAPI running: can reach `api.crowdsec.net` (depends on CoreDNS external resolution)
    kubectl get pods -n crowdsec
- CrowdSec bouncer registered with LAPI: unregistered bouncer blocks all forwardAuth requests
    wild service logs crowdsec | grep bouncer
Storage
- Longhorn managers running on all workers: enables volume replica scheduling and rebuilds
    kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide
- Longhorn volume replicas healthy: all volumes at target replica count across nodes
    kubectl get volumes.longhorn.io -n longhorn-system
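With many volumes, the table is easier to scan if only the problem rows are shown. The sketch below filters on the `status.state` and `status.robustness` fields of the Longhorn Volume resource. The function name is hypothetical, and skipping detached volumes is an assumption made here because they have no running replicas to evaluate.

```shell
# Print attached Longhorn volumes whose robustness is not "healthy".
# Hypothetical helper; takes no arguments.
unhealthy_longhorn_volumes() {
  # Emit one tab-separated line per volume: name, state, robustness.
  kubectl get volumes.longhorn.io -n longhorn-system \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\t"}{.status.robustness}{"\n"}{end}' |
    # Keep attached volumes only and report any that are degraded, faulted, or unknown.
    awk -F '\t' '$2 == "attached" && $3 != "healthy" { print $1 ": " $3 }'
}
```

Empty output means every attached volume reports `healthy`.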
External DNS & Certificates
- ExternalDNS pod running: creating and updating DNS records at Cloudflare
    kubectl get pods -n externaldns
- cert-manager pods running: issuing and renewing TLS certificates
    kubectl get pods -n cert-manager
    kubectl get certificates -n cert-manager
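Certificate health can also be read from the `Ready` condition on each Certificate resource. The sketch below lists certificates that are not ready. The function name is hypothetical, and querying all namespaces instead of only `cert-manager` is an assumption, made because Certificate resources often live next to the applications that use them.

```shell
# Print namespace/name for every Certificate whose Ready condition is not "True".
# Hypothetical helper; takes no arguments.
not_ready_certificates() {
  # Emit one tab-separated line per certificate: namespace/name, Ready status.
  kubectl get certificates --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    # A missing status (certificate not yet processed) is reported as not ready too.
    awk -F '\t' '$1 != "" && $2 != "True" { print $1 }'
}
```

Empty output means every certificate in the cluster reports `Ready`.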
LAN DNS
- dnsmasq on Wild Central: resolves LAN-local domains to correct LoadBalancer IPs (hairpin NAT)
    wild dns status
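To confirm that a LAN-local name resolves to the intended LoadBalancer IP, the DNS server can be queried directly. This is a sketch under assumptions: the function name `lan_dns_matches` is hypothetical, `dig` must be installed on the machine running the check, and the hostname, expected IP, and Wild Central address are supplied by the caller.

```shell
# Check that a DNS server returns the expected A record for a hostname.
# Hypothetical helper. Usage: lan_dns_matches <hostname> <expected-ip> <dns-server>
lan_dns_matches() {
  host=$1
  expected=$2
  server=$3
  # +short prints one answer per line; match the expected IP against any of them
  # so that a CNAME line ahead of the address does not cause a false failure.
  if dig +short "$host" A "@$server" | grep -Fxq -- "$expected"; then
    echo "ok $host -> $expected"
  else
    echo "MISMATCH $host (expected $expected)"
    return 1
  fi
}
```

The third argument would be the LAN address of Wild Central, and the second the Traefik LoadBalancer IP shown by `kubectl get svc -n traefik`.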
Quick Full Check
Run `wild cluster health` for an automated check of the most critical items. For a comprehensive check, walk through each item above.
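When walking the list by hand, a small wrapper can keep a tally of which commands succeeded. The sketch below is illustrative and not part of the Wild Cloud CLI: `run_check` and `summary` are hypothetical names, and a check counts as passed only if its command exits with status 0, which says nothing about what the command printed.

```shell
# Counters shared by run_check and summary.
pass=0
fail=0

# Run a command quietly and record whether it exited with status 0.
# Usage: run_check <label> <command> [args...]
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    pass=$((pass + 1))
    echo "PASS $label"
  else
    fail=$((fail + 1))
    echo "FAIL $label"
  fi
}

# Print the totals; the return status is non-zero if any check failed.
summary() {
  echo "$pass passed, $fail failed"
  [ "$fail" -eq 0 ]
}

# Example calls (commented out so that loading this file runs nothing):
# run_check "nodes listed" kubectl get nodes
# run_check "traefik service present" kubectl get svc -n traefik
# summary
```

Exit status alone is a weak signal for commands like `kubectl get pods`, which succeed even when pods are crashing, so the wrapper suits checks that fail loudly.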