wild-cloud/docs/guides/cluster-networking-health.md
Paul Payne 3f97dce86a docs: Update all guides to reflect current CLI, API, and web app
Rewrote backup/restore guides to document current system (native
pg_dump/Longhorn/tar.gz tools, blue-green restore, scheduling) and
remove outdated restic references. Rewrote monitoring guide to replace
K3s/Helm/Velero placeholders with actual capabilities. Filled in all
four upgrade guides (Talos, Kubernetes, applications, Wild Cloud) that
were previously TBD stubs. Expanded troubleshooting guides with correct
namespaces, Wild Cloud CLI commands, and Talos-specific diagnostics.
Added verification commands to cluster networking health checklist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-24 21:54:11 +00:00

Cluster Networking Health Checklist

Verifying every item on this list confirms the full networking stack is functioning correctly, from node-level overlay through DNS to external ingress.

Node Layer

  1. All nodes Ready — no cordons, no taints (e.g., maintenance:NoExecute)

    kubectl get nodes
    wild node list
    
  2. Flannel pods running on every node — stale VXLAN tunnels break cross-node pod traffic

    kubectl get pods -n kube-system -l app=flannel -o wide
    
  3. Cross-node pod connectivity — pods on each worker can reach pods on every other node
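
Cross-node connectivity has no single built-in command. One way to spot-check it is to pick one overlay pod IP per node and ping each from a throwaway pod. The sketch below makes several assumptions: `pod_ip_per_node` and the `net-probe` pod name are hypothetical helpers (not part of the Wild Cloud CLI), and `busybox:1.36` is simply a convenient image that ships `ping`.

```shell
#!/usr/bin/env bash
# Reads "NODE IP PHASE HOSTNETWORK" rows on stdin and prints one
# "NODE IP" pair per node. Host-network pods are skipped because they
# use the node IP and would not exercise the overlay network.
pod_ip_per_node() {
  awk '$3 == "Running" && $4 != "true" && $2 != "<none>" && !seen[$1]++ { print $1, $2 }'
}

# Example use against a live cluster (commented out so that sourcing
# this file has no side effects):
#
# kubectl get pods -A --no-headers \
#   -o custom-columns=NODE:.spec.nodeName,IP:.status.podIP,PHASE:.status.phase,HOSTNET:.spec.hostNetwork \
#   | pod_ip_per_node \
#   | while read -r node ip; do
#       if kubectl run net-probe --rm -i --restart=Never --quiet \
#            --image=busybox:1.36 -- ping -c 2 -W 2 "$ip" </dev/null >/dev/null 2>&1
#       then echo "OK   $node ($ip)"
#       else echo "FAIL $node ($ip)"
#       fi
#     done
```

The probe pod lands wherever the scheduler places it, so one pass only tests reachability from that node. To cover every pair, repeat the loop with the probe pinned to each node in turn, for example with `--overrides='{"spec":{"nodeName":"<node>"}}'`.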

Service Routing

  1. kube-proxy pods running on every node — nftables rules route ClusterIP traffic to pod endpoints

    kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
    
  2. CoreDNS pods running and resolving — both cluster-internal names (*.svc.cluster.local) and external names

    kubectl get pods -n kube-system -l k8s-app=kube-dns
    
  3. CoreDNS upstream reachability — Talos DNS proxy at 169.254.116.108 responding from all nodes
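
Running pods do not prove that names resolve. A minimal sketch of a resolution check follows, assuming a throwaway `busybox:1.36` pod is allowed in the current namespace and that the pod's exit status reflects whether `nslookup` succeeded. `lookup`, `lookup_upstream`, and `check_names` are hypothetical helper names, and `example.com` stands in for any external name.

```shell
#!/usr/bin/env bash
# Resolve a name from inside the cluster. An optional second argument
# is passed to nslookup as the DNS server to query.
lookup() {
  kubectl run "dns-probe-$RANDOM" --rm -i --restart=Never --quiet \
    --image=busybox:1.36 -- nslookup "$@" </dev/null
}

# Query the Talos DNS proxy directly (assumes the address is reachable
# from the pod network, as it is for CoreDNS).
lookup_upstream() {
  lookup "$1" 169.254.116.108
}

# check_names RESOLVER NAME...
# Runs RESOLVER on each NAME, prints OK/FAIL per name, and returns
# non-zero if any lookup failed.
check_names() {
  local resolver=$1 failed=0 name
  shift
  for name in "$@"; do
    if "$resolver" "$name" >/dev/null 2>&1; then
      printf 'OK   %s\n' "$name"
    else
      printf 'FAIL %s\n' "$name"
      failed=1
    fi
  done
  return "$failed"
}

# Example use against a live cluster:
#
# check_names lookup kubernetes.default.svc.cluster.local example.com
# check_names lookup_upstream example.com
```

Passing the resolver as an argument keeps the reporting loop independent of how the lookup is performed, so the same loop covers both the CoreDNS path and the upstream proxy.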

Load Balancing

  1. MetalLB speakers running on all nodes — L2 ARP announcements for LoadBalancer IPs

    kubectl get pods -n metallb-system -l component=speaker -o wide
    
  2. MetalLB ServiceL2Status resources valid: status.node matches actual pod placement (stale entries block announcements)

    kubectl get servicel2statuses.metallb.io -n metallb-system
    
  3. LoadBalancer IPs reachable — Traefik LB IP responds from LAN

    kubectl get svc -n traefik
    curl -k https://<traefik-lb-ip>
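
Comparing `status.node` against speaker placement by eye gets tedious on larger clusters. The sketch below flags ServiceL2Status entries whose node has no running speaker. `stale_l2_statuses` is a hypothetical helper name, and the sketch assumes the `component=speaker` label used above.

```shell
#!/usr/bin/env bash
# stale_l2_statuses NODE... < rows
# Arguments: names of nodes that currently run a MetalLB speaker.
# Stdin:     "NAME NODE" rows, one per ServiceL2Status resource.
# Prints the rows whose NODE is not in the speaker list.
stale_l2_statuses() {
  awk -v nodes="$*" '
    BEGIN { n = split(nodes, a, " "); for (i = 1; i <= n; i++) speaker[a[i]] = 1 }
    !($2 in speaker) { print $1, $2 }
  '
}

# Example use against a live cluster ($speakers is left unquoted on
# purpose so that each node name becomes its own argument):
#
# speakers=$(kubectl get pods -n metallb-system -l component=speaker \
#   --field-selector=status.phase=Running \
#   -o jsonpath='{.items[*].spec.nodeName}')
# kubectl get servicel2statuses.metallb.io -n metallb-system --no-headers \
#   -o custom-columns=NAME:.metadata.name,NODE:.status.node \
#   | stale_l2_statuses $speakers
```

Empty output means every entry points at a node with a running speaker.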
    

Ingress & Security

  1. Traefik ingress routing — forwards to backend services, TLS termination working

    kubectl get pods -n traefik
    kubectl logs -n traefik -l app=traefik --tail=20
    
  2. CrowdSec LAPI running — can reach api.crowdsec.net (depends on CoreDNS external resolution)

    kubectl get pods -n crowdsec
    
  3. CrowdSec bouncer registered with LAPI — unregistered bouncer blocks all forwardAuth requests

    wild service logs crowdsec | grep bouncer
    

Storage

  1. Longhorn managers running on all workers — enables volume replica scheduling and rebuilds

    kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide
    
  2. Longhorn volume replicas healthy — all volumes at target replica count across nodes

    kubectl get volumes.longhorn.io -n longhorn-system
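
To reduce the volume list to the entries that need attention, the output can be filtered on the `status.state` and `status.robustness` fields of the Longhorn Volume resource. This is a sketch rather than an official check; `unhealthy_volumes` is a hypothetical helper name, and detached volumes are ignored on the assumption that they report no meaningful robustness.

```shell
#!/usr/bin/env bash
# Reads "NAME STATE ROBUSTNESS" rows on stdin and prints attached
# volumes whose robustness is anything other than "healthy".
unhealthy_volumes() {
  awk '$2 == "attached" && $3 != "healthy" { print $1, $3 }'
}

# Example use against a live cluster:
#
# kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
#   -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness \
#   | unhealthy_volumes
```

Empty output means every attached volume reports as healthy.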
    

External DNS & Certificates

  1. ExternalDNS pod running — creating and updating DNS records at Cloudflare

    kubectl get pods -n externaldns
    
  2. cert-manager pods running — issuing and renewing TLS certificates

    kubectl get pods -n cert-manager
    kubectl get certificates -n cert-manager
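
Certificates may also live in application namespaces, so a cluster-wide pass can be useful. The sketch below assumes the default table output of `kubectl get certificates -A --no-headers`, where the third column is READY. `certs_not_ready` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Reads "NAMESPACE NAME READY ..." rows on stdin and prints
# namespace/name for every certificate whose READY column is not "True".
certs_not_ready() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# Example use against a live cluster:
#
# kubectl get certificates -A --no-headers | certs_not_ready
```

Any certificate it prints can then be inspected with `kubectl describe certificate <name> -n <namespace>`.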
    

LAN DNS

  1. dnsmasq on Wild Central — resolves LAN-local domains to correct LoadBalancer IPs (hairpin NAT)

    wild dns status
    

Quick Full Check

Run wild cluster health for an automated check of the most critical items. For a comprehensive check, walk through each item above.