wild-cloud/docs/guides/troubleshoot-visibility.md
2025-08-31 14:30:09 -07:00

Troubleshoot Service Visibility

This guide covers common issues with accessing services from outside the cluster and how to diagnose and fix them.

Common Issues

External access to your services might fail for several reasons:

  1. DNS Resolution Issues - Domain names not resolving to the correct IP address
  2. Network Connectivity Issues - Traffic can't reach the cluster's external IP
  3. TLS Certificate Issues - Invalid or missing certificates
  4. Ingress/Service Configuration Issues - Incorrectly configured routing

Diagnostic Steps

1. Check DNS Resolution

Symptoms:

  • Browser shows "site cannot be reached" or "server IP address could not be found"
  • ping or nslookup commands fail for your domain
  • Your service's DNS records don't appear in Cloudflare or your DNS provider

Checks:

# Check if your domain resolves (from outside the cluster)
nslookup yourservice.yourdomain.com

# Check if ExternalDNS is running
kubectl get pods -n externaldns

# Check ExternalDNS logs for errors
kubectl logs -n externaldns -l app=external-dns | grep -i error
kubectl logs -n externaldns -l app=external-dns | grep -i "your-service-name"

# Check that the Cloudflare API token secret exists (this does not validate the token itself)
kubectl get secret cloudflare-api-token -n externaldns
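
A single nslookup only shows what your local resolver has cached. As a rough sketch, the check can be repeated against specific public resolvers and compared with the IP you expect. HOST, EXPECTED_IP, and both function names are hypothetical placeholders, and the sketch assumes dig is installed.

```shell
# Hypothetical placeholders: substitute your own hostname and the external IP
# you expect the record to point at.
HOST="yourservice.yourdomain.com"
EXPECTED_IP="203.0.113.10"

# Succeeds if the expected IP ($1) appears as a full line in the answer ($2).
ip_in_answer() {
  printf '%s\n' "$2" | grep -qxF "$1"
}

# Queries one resolver ($1) with dig and reports whether it returns EXPECTED_IP.
check_resolver() {
  answer="$(dig +short "$HOST" "@$1")"
  if ip_in_answer "$EXPECTED_IP" "$answer"; then
    echo "$1: OK"
  else
    echo "$1: got '${answer:-no answer}', expected $EXPECTED_IP"
  fi
}
```

Calling `check_resolver 1.1.1.1` and `check_resolver 8.8.8.8` shows whether public resolvers agree. If the record is proxied through Cloudflare, expect Cloudflare addresses in the answer rather than your cluster's IP.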

Common Issues:

a) ExternalDNS Not Running: The ExternalDNS pod is not running or has errors.

b) Cloudflare API Token Issues: The API token is invalid, expired, or doesn't have the right permissions.

c) Domain Filter Mismatch: ExternalDNS is configured with a --domain-filter that doesn't match your domain.

d) Annotations Missing: Service or Ingress is missing the required ExternalDNS annotations.
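
For the domain filter case, it can help to print the arguments ExternalDNS is actually running with and compare them against your hostname. The sketch below assumes the Deployment is named external-dns in the externaldns namespace, as elsewhere in this guide. Both function names are hypothetical, and the matching logic only approximates suffix matching; it is not ExternalDNS's own code.

```shell
# Succeeds if the hostname ($1) equals the filter ($2) or is a subdomain of it.
# Approximation only: ExternalDNS has additional options (exclusions, regex).
matches_domain_filter() {
  case "$1" in
    "$2" | *."$2") return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the arguments of the first container in the ExternalDNS Deployment,
# which is where --domain-filter is normally set.
externaldns_args() {
  kubectl get deployment external-dns -n externaldns \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}
```

For example, `matches_domain_filter your-service.your-domain.com your-domain.com && echo match` prints `match`, while a hostname under a different zone prints nothing.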

Solutions:

# 1. Recreate the Cloudflare API token secret
kubectl create secret generic cloudflare-api-token \
  --namespace externaldns \
  --from-literal=api-token="your-api-token" \
  --dry-run=client -o yaml | kubectl apply -f -

# 2. Set the ExternalDNS hostname annotation on your Ingress:
kubectl annotate ingress your-ingress -n your-namespace \
  external-dns.alpha.kubernetes.io/hostname=your-service.your-domain.com

# 3. Restart ExternalDNS
kubectl rollout restart deployment -n externaldns external-dns

2. Check Network Connectivity

Symptoms:

  • DNS resolves to the correct IP but the service is still unreachable
  • Only some services are unreachable while others work
  • Network timeout errors

Checks:

# Check if MetalLB is running
kubectl get pods -n metallb-system

# Check MetalLB IP address pool
kubectl get ipaddresspools.metallb.io -n metallb-system

# Verify the service has an external IP
kubectl get svc -n your-namespace your-service

Common Issues:

a) MetalLB Configuration: The IP pool doesn't match your network or is exhausted.

b) Firewall Issues: Firewall is blocking traffic to your cluster's external IP.

c) Router Configuration: NAT or port forwarding issues if using a router.
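
To check the first case, a small helper can test whether an address falls inside a MetalLB pool written in start-end form. This is a sketch under assumptions: the function names are hypothetical, only IPv4 start-end ranges are handled (not CIDR notation), and octets are assumed to have no leading zeros.

```shell
# Converts a dotted IPv4 address to an integer. The body runs in a subshell
# so the IFS change does not leak into the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( (($1 * 256 + $2) * 256 + $3) * 256 + $4 ))
)

# Succeeds if the IP ($1) lies inside the range ($2), e.g.
# "192.168.1.240-192.168.1.250".
ip_in_range() {
  ip="$(ip_to_int "$1")"
  start="$(ip_to_int "${2%-*}")"
  end="$(ip_to_int "${2#*-}")"
  [ "$ip" -ge "$start" ] && [ "$ip" -le "$end" ]
}
```

The configured ranges can be listed with `kubectl get ipaddresspools.metallb.io -n metallb-system -o jsonpath='{.items[*].spec.addresses}'` and then compared against your LAN's subnet and DHCP range.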

Solutions:

# 1. Review the pool definition, then re-apply the MetalLB configuration
kubectl apply -f infrastructure_setup/metallb/metallb-pool.yaml

# 2. Check service external IP assignment
kubectl describe svc -n your-namespace your-service
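
To separate DNS problems from network problems, curl can be pointed straight at the external IP while still sending the correct hostname. The function name below is hypothetical, and the sketch assumes the service is exposed over HTTPS on port 443.

```shell
# Requests https://HOST/ but connects to IP instead of whatever DNS returns.
# $1 = hostname, $2 = external IP. Prints only the HTTP status code.
probe_ip() {
  curl --silent --show-error --output /dev/null --max-time 10 \
    --write-out '%{http_code}\n' \
    --resolve "$1:443:$2" "https://$1/"
}
```

A call such as `probe_ip yourservice.yourdomain.com 203.0.113.10` (placeholder values) that returns a status code while normal access fails suggests the problem is DNS. A timeout here points at the network path, MetalLB, or a firewall instead.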

3. Check TLS Certificates

Symptoms:

  • Browser shows certificate errors
  • "Your connection is not private" warnings
  • Cert-manager logs show errors

Checks:

# Check certificate status
kubectl get certificates -A

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager

# Check if your ingress is using the correct certificate
kubectl get ingress -n your-namespace your-ingress -o yaml
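
It can also help to look at the certificate the ingress actually serves rather than the one you expect it to serve. A minimal sketch, assuming openssl is installed and the service listens on port 443; the function name is hypothetical.

```shell
# Prints subject, issuer, and validity dates of the certificate served for
# the hostname in $1. -servername sends SNI so the right certificate is chosen.
served_cert() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -subject -issuer -dates
}
```

If the subject is a self-signed default such as Traefik's built-in certificate, the Ingress is most likely not picking up the intended TLS secret.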

Common Issues:

a) Certificate Issuance Failures: The ACME DNS-01 or HTTP-01 validation challenge is failing.

b) Wrong Secret Referenced: Ingress is referencing a non-existent certificate secret.

c) Expired Certificate: Certificate has expired and wasn't renewed.
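
To tell these cases apart, the Certificate's Ready condition and the ACME resources cert-manager creates during issuance can be inspected. This sketch assumes an ACME issuer (for example Let's Encrypt); both function names are hypothetical.

```shell
# Prints the status of the Ready condition ("True", "False", or empty).
# $1 = namespace, $2 = certificate name.
cert_ready() {
  kubectl get certificate "$2" -n "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Lists the intermediate resources cert-manager creates while issuing.
# Challenges that linger here usually carry the failure reason.
issuance_chain() {
  kubectl get \
    certificaterequests.cert-manager.io,orders.acme.cert-manager.io,challenges.acme.cert-manager.io \
    -n "$1"
}
```

When `cert_ready` does not print `True`, running `kubectl describe` on any Challenge listed by `issuance_chain` typically shows why validation is stuck.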

Solutions:

# 1. Re-apply the wildcard certificate manifest
kubectl apply -f infrastructure_setup/cert-manager/wildcard-certificate.yaml

# 2. Update ingress to use correct secret
kubectl patch ingress your-ingress -n your-namespace --type=json \
  -p='[{"op": "replace", "path": "/spec/tls/0/secretName", "value": "correct-secret-name"}]'

4. Check Ingress Configuration

Symptoms:

  • HTTP 404, 503, or other error codes
  • Service accessible from inside cluster but not outside
  • Traffic routed to wrong service

Checks:

# Check ingress status
kubectl get ingress -n your-namespace

# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik

# Check ingress configuration
kubectl describe ingress -n your-namespace your-ingress

Common Issues:

a) Incorrect Service Targeting: Ingress is pointing to wrong service or port.

b) Traefik Configuration: IngressClass or middleware issues.

c) Path Configuration: Incorrect path prefixes or regex.
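
For incorrect service targeting, a quick test is whether the Service the Ingress points at has any ready endpoints. A Service with none typically produces 503 responses. The function names are hypothetical, and the sketch uses the Endpoints API, which newer Kubernetes releases are replacing with EndpointSlices.

```shell
# Prints the IPs of ready pods behind a Service. $1 = namespace, $2 = service.
backend_ips() {
  kubectl get endpoints "$2" -n "$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Reports whether the Service has at least one ready endpoint.
check_backend() {
  ips="$(backend_ips "$1" "$2")"
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
  else
    echo "no ready endpoints for $1/$2"
    return 1
  fi
}
```

An empty result usually means the Service selector does not match the pod labels, or the pods are failing their readiness probes.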

Solutions:

# 1. Review and correct the ingress configuration
kubectl edit ingress -n your-namespace your-ingress

# 2. Check that the referenced service exists
kubectl get svc -n your-namespace

# 3. Restart Traefik if needed
kubectl rollout restart deployment -n kube-system traefik

Advanced Diagnostics

For more complex issues, you can use port-forwarding to test services directly:

# Port-forward the service directly
kubectl port-forward -n your-namespace svc/your-service 8080:80

# Then test locally
curl http://localhost:8080

You can also deploy a debug pod to test connectivity from inside the cluster:

# Start a debug pod
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh

# Inside the pod, test DNS and connectivity
nslookup your-service.your-namespace.svc.cluster.local
wget -O- http://your-service.your-namespace.svc.cluster.local

ExternalDNS Specifics

ExternalDNS can be particularly troublesome. Here are specific debugging steps:

  1. Check Log Level: Set --log-level=debug for more detailed logs
  2. Check Domain Filter: Ensure --domain-filter includes your domain
  3. Check Provider: Ensure --provider=cloudflare (or your DNS provider)
  4. Verify API Permissions: The Cloudflare token needs Zone:Zone:Read and Zone:DNS:Edit permissions
  5. Check TXT Records: ExternalDNS uses TXT records for ownership tracking

# Enable verbose logging: add --log-level=debug to the container args, then save
kubectl edit deployment external-dns -n externaldns

# Check for specific domain errors
kubectl logs -n externaldns -l app=external-dns | grep -i yourservice.yourdomain.com
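
The ownership TXT records can be inspected directly. ExternalDNS only updates records whose owner ID matches its own --txt-owner-id, so a mismatch explains records that never change. This sketch assumes dig is installed. The function names are hypothetical, and depending on the ExternalDNS version and any --txt-prefix setting, the TXT record may live under a prefixed name rather than the hostname itself.

```shell
# Fetches the TXT records for the name in $1.
registry_txt() {
  dig +short TXT "$1"
}

# Extracts the owner ID from an ExternalDNS registry TXT value such as
# "heritage=external-dns,external-dns/owner=default,external-dns/resource=...".
txt_owner() {
  printf '%s\n' "$1" | sed -n 's/.*external-dns\/owner=\([^,"]*\).*/\1/p'
}
```

For example, `txt_owner "$(registry_txt yourservice.yourdomain.com)"` prints the owner ID if a registry record exists at that name, and nothing otherwise.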

Cloudflare-Specific Issues

When using Cloudflare, additional issues may arise:

  1. API Rate Limiting: Cloudflare may rate limit frequent API calls
  2. DNS Propagation: Changes may take time to become visible because resolvers cache records until their TTL expires
  3. Proxied Records: The external-dns.alpha.kubernetes.io/cloudflare-proxied annotation controls whether Cloudflare proxies traffic
  4. Access Restrictions: Cloudflare Access or Page Rules may restrict access
  5. API Token Permissions: The token must have Zone:Zone:Read and Zone:DNS:Edit permissions
  6. Zone Detection: If using subdomains, ensure the parent domain is included in the domain filter

Check the Cloudflare dashboard for:

  • DNS record existence
  • API access logs
  • DNS settings including proxy status
  • Any error messages or rate limit warnings
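
Before recreating the secret, the stored token can be checked against Cloudflare's token verification endpoint. A sketch, assuming the secret is named cloudflare-api-token with an api-token key in the externaldns namespace, as created earlier in this guide; the function names are hypothetical.

```shell
# Reads and decodes the API token from the Kubernetes secret.
# Avoid echoing the result into shared terminals or logs.
cf_token() {
  kubectl get secret cloudflare-api-token -n externaldns \
    -o jsonpath='{.data.api-token}' | base64 --decode
}

# Asks Cloudflare whether the token is valid; prints the JSON response.
verify_cf_token() {
  curl --silent --show-error \
    -H "Authorization: Bearer $(cf_token)" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify"
}
```

A working token is expected to produce a response containing `"success": true`. This confirms only that the token is active, not that it carries the zone permissions listed above.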