Initial commit.

2025-04-27 14:57:00 -07:00
commit 84376fb3d5
63 changed files with 5645 additions and 0 deletions

docs/APPS.md

@@ -0,0 +1,165 @@
# Deploying Applications
Once you have your personal cloud infrastructure up and running, you'll want to start deploying applications. This guide explains how to deploy and manage applications on your infrastructure.
## Application Charts
The `/charts` directory contains curated Helm charts for common applications that are ready to deploy on your personal cloud.
### Available Charts
| Chart | Description | Internal/Public |
|-------|-------------|----------------|
| mariadb | MariaDB database for applications | Internal |
| postgres | PostgreSQL database for applications | Internal |
### Installing Charts
Use the `bin/helm-install` script to easily deploy charts with the right configuration:
```bash
# Install PostgreSQL
./bin/helm-install postgres
# Install MariaDB
./bin/helm-install mariadb
```
The script automatically:
- Uses values from your environment variables
- Creates the necessary namespace
- Configures storage and networking
- Sets up appropriate secrets
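The steps above can be sketched as a tiny wrapper that composes the `helm` command from environment variables. This is a hypothetical illustration, not the real `bin/helm-install` script: the chart path layout and the `DOMAIN` variable come from this guide, while the `--set` keys are assumptions. The sketch only prints the command it would run.

```shell
# Hedged sketch of a helm-install wrapper: build the helm command from
# environment variables. Chart values keys (domain=...) are illustrative.
build_helm_cmd() {
  chart="$1"
  ns="${2:-$chart}"   # default the namespace to the chart name
  printf 'helm upgrade --install %s ./charts/%s --namespace %s --create-namespace --set domain=%s\n' \
    "$chart" "$chart" "$ns" "${DOMAIN:-example.com}"
}

DOMAIN=cloud.example.com
build_helm_cmd postgres
# helm upgrade --install postgres ./charts/postgres --namespace postgres --create-namespace --set domain=cloud.example.com
```

The real script additionally wires in storage and secrets; the point here is only the env-driven composition.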
### Customizing Chart Values
Each chart can be customized by:
1. Editing environment variables in your `.env` file
2. Creating a custom values file:
```bash
# Create a custom values file
cp charts/postgres/values.yaml my-postgres-values.yaml
nano my-postgres-values.yaml
# Install with custom values
./bin/helm-install postgres --values my-postgres-values.yaml
```
### Creating Your Own Charts
You can add your own applications to the charts directory:
1. Create a new directory: `mkdir -p charts/my-application`
2. Add the necessary templates and values
3. Document any required environment variables
## Deploying Custom Services
For simpler applications or services without existing charts, use the `deploy-service` script to quickly deploy from templates.
### Service Types
The system supports four types of services:
1. **Public** - Accessible from the internet
2. **Internal** - Only accessible within your local network
3. **Database** - Internal database services
4. **Microservice** - Services that are only accessible by other services
### Deployment Examples
```bash
# Deploy a public blog using Ghost
./bin/deploy-service --type public --name blog --image ghost:4.12 --port 2368
# Deploy an internal admin dashboard
./bin/deploy-service --type internal --name admin --image my-admin:v1 --port 8080
# Deploy a database service
./bin/deploy-service --type database --name postgres --image postgres:15 --port 5432
# Deploy a microservice
./bin/deploy-service --type microservice --name auth --image auth-service:v1 --port 9000
```
### Service Structure
When you deploy a service, a directory is created at `services/[service-name]/` containing:
- `service.yaml` - The Kubernetes manifest for your service
You can modify this file directly and reapply it with `kubectl apply -f services/[service-name]/service.yaml` to update your service.
## Accessing Services
Services are automatically configured with proper URLs and TLS certificates.
### URL Patterns
- **Public services**: `https://[service-name].[domain]`
- **Internal services**: `https://[service-name].internal.[domain]`
- **Microservices**: `https://[service-name].svc.[domain]`
- **Databases**: `[service-name].[namespace].svc.cluster.local:[port]`
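The HTTP(S) patterns above are mechanical, so they can be expressed as a small shell helper. The type names and URL shapes mirror this document; the domain in the example is a placeholder.

```shell
# Map a service name and type to its URL, following the patterns above.
service_url() {
  name="$1"; type="$2"; domain="$3"
  case "$type" in
    public)       echo "https://${name}.${domain}" ;;
    internal)     echo "https://${name}.internal.${domain}" ;;
    microservice) echo "https://${name}.svc.${domain}" ;;
    *)            echo "unknown type: $type" >&2; return 1 ;;
  esac
}

service_url blog public example.com
# https://blog.example.com
```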
### Dashboard Access
Access the Kubernetes Dashboard at `https://dashboard.internal.[domain]`:
```bash
# Get the dashboard token
./bin/dashboard-token
```
### Service Management
Monitor your running services with:
```bash
# List all services
kubectl get services -A
# View detailed information about a service
kubectl describe service [service-name] -n [namespace]
# Check pods for a service
kubectl get pods -n [namespace] -l app=[service-name]
# View logs for a service
kubectl logs -n [namespace] -l app=[service-name]
```
## Advanced Configuration
### Scaling Services
Scale your services by editing the deployment:
```bash
kubectl scale deployment [service-name] --replicas=3 -n [namespace]
```
### Adding Environment Variables
Add environment variables to your service by editing the service YAML file and adding entries to the `env` section:
```yaml
env:
- name: DATABASE_URL
value: "postgres://user:password@postgres:5432/db"
```
### Persistent Storage
For services that need persistent storage, add a PersistentVolumeClaim to your service YAML.
## Troubleshooting
If a service isn't working correctly:
1. Check pod status: `kubectl get pods -n [namespace]`
2. View logs: `kubectl logs [pod-name] -n [namespace]`
3. Describe the pod: `kubectl describe pod [pod-name] -n [namespace]`
4. Verify the service: `kubectl get svc [service-name] -n [namespace]`
5. Check the ingress: `kubectl get ingress [service-name] -n [namespace]`
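The five checks above can be bundled into a helper that prints the exact commands for a given service, so you can paste them one at a time. This is a dry-run sketch (it never contacts the cluster), and it assumes the `app=[service-name]` label convention used throughout this guide.

```shell
# Print the troubleshooting checklist commands for a service (dry run).
print_diagnostics() {
  name="$1"; ns="$2"
  echo "kubectl get pods -n ${ns}"
  echo "kubectl logs -n ${ns} -l app=${name}"
  echo "kubectl describe pod -n ${ns} -l app=${name}"
  echo "kubectl get svc ${name} -n ${ns}"
  echo "kubectl get ingress ${name} -n ${ns}"
}

print_diagnostics blog default
```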

docs/MAINTENANCE.md

@@ -0,0 +1,328 @@
# Maintenance Guide
This guide covers essential maintenance tasks for your personal cloud infrastructure, including troubleshooting, backups, updates, and security best practices.
## Troubleshooting
### General Troubleshooting Steps
1. **Check Component Status**:
```bash
# Check all pods across all namespaces
kubectl get pods -A
# Look for pods that aren't Running or Completed
kubectl get pods -A | grep -v "Running\|Completed"
```
2. **View Detailed Pod Information**:
```bash
# Get detailed info about problematic pods
kubectl describe pod <pod-name> -n <namespace>
# Check pod logs
kubectl logs <pod-name> -n <namespace>
```
3. **Run Validation Script**:
```bash
./infrastructure_setup/validate_setup.sh
```
4. **Check Node Status**:
```bash
kubectl get nodes
kubectl describe node <node-name>
```
### Common Issues
#### Certificate Problems
If services show invalid certificates:
1. Check certificate status:
```bash
kubectl get certificates -A
```
2. Examine certificate details:
```bash
kubectl describe certificate <cert-name> -n <namespace>
```
3. Check for cert-manager issues:
```bash
kubectl get pods -n cert-manager
kubectl logs -l app=cert-manager -n cert-manager
```
4. Verify the Cloudflare API token is correctly set up:
```bash
kubectl get secret cloudflare-api-token -n internal
```
#### DNS Issues
If DNS resolution isn't working properly:
1. Check CoreDNS status:
```bash
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -l k8s-app=kube-dns -n kube-system
```
2. Verify CoreDNS configuration:
```bash
kubectl get configmap -n kube-system coredns -o yaml
```
3. Test DNS resolution from inside the cluster:
```bash
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
```
#### Service Connectivity
If services can't communicate:
1. Check network policies:
```bash
kubectl get networkpolicies -A
```
2. Verify service endpoints:
```bash
kubectl get endpoints -n <namespace>
```
3. Test connectivity from within the cluster:
```bash
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- wget -O- <service-name>.<namespace>
```
## Backup and Restore
### What to Back Up
1. **Persistent Data**:
- Database volumes
- Application storage
- Configuration files
2. **Kubernetes Resources**:
- Custom Resource Definitions (CRDs)
- Deployments, Services, Ingresses
- Secrets and ConfigMaps
### Backup Methods
#### Simple Backup Script
Create a backup script at `bin/backup.sh` (to be implemented):
```bash
#!/bin/bash
# Simple backup script for your personal cloud
# This is a placeholder for future implementation
BACKUP_DIR="/path/to/backups/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"
# Back up Kubernetes resources
kubectl get all -A -o yaml > "$BACKUP_DIR/all-resources.yaml"
kubectl get secrets -A -o yaml > "$BACKUP_DIR/secrets.yaml"
kubectl get configmaps -A -o yaml > "$BACKUP_DIR/configmaps.yaml"
# Back up persistent volumes
# TODO: Add logic to back up persistent volume data
echo "Backup completed: $BACKUP_DIR"
```
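One thing the placeholder script above leaves open is retention: date-stamped backup directories accumulate forever. A minimal, hedged sketch of a pruning step that keeps only the newest N directories (the `%Y-%m-%d` naming from the script sorts chronologically, so a plain lexical sort suffices):

```shell
# Delete all but the newest $keep date-stamped backup directories under $root.
prune_backups() {
  root="$1"; keep="$2"
  total=$(ls -1 "$root" | wc -l)
  excess=$((total - keep))
  [ "$excess" -le 0 ] && return 0
  # Oldest directories sort first with %Y-%m-%d names.
  ls -1 "$root" | sort | head -n "$excess" | while read -r old; do
    rm -rf "${root:?}/${old}"
  done
}
```

Call it at the end of the backup script, e.g. `prune_backups /path/to/backups 14`.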
#### Using Velero (Recommended for Future)
[Velero](https://velero.io/) is a powerful backup solution for Kubernetes:
```bash
# Install Velero (future implementation)
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero --namespace velero --create-namespace
# Create a backup
velero backup create my-backup --include-namespaces default,internal
# Restore from backup
velero restore create --from-backup my-backup
```
### Database Backups
For database services, set up regular dumps:
```bash
# PostgreSQL backup (placeholder)
kubectl exec <postgres-pod> -n <namespace> -- pg_dump -U <username> <database> > backup.sql
# MariaDB/MySQL backup (placeholder)
kubectl exec <mariadb-pod> -n <namespace> -- mysqldump -u root -p<password> <database> > backup.sql
```
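For cron use, the one-off dumps above want date-stamped filenames. A hedged sketch that only composes the command (pod, namespace, and user names are placeholders, and nothing is executed against a cluster):

```shell
# Build a date-stamped pg_dump command for a given pod/namespace/database.
dump_cmd() {
  pod="$1"; ns="$2"; db="$3"; stamp="$4"
  echo "kubectl exec ${pod} -n ${ns} -- pg_dump -U postgres ${db} > ${db}-${stamp}.sql"
}

dump_cmd postgres-0 database app "$(date +%Y-%m-%d)"
```

In a cron job you would `eval` or directly run the composed command instead of echoing it.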
## Updates
### Updating Kubernetes (K3s)
1. Check current version:
```bash
k3s --version
```
2. Update K3s:
```bash
curl -sfL https://get.k3s.io | sh -
```
3. Verify the update:
```bash
k3s --version
kubectl get nodes
```
### Updating Infrastructure Components
1. Update the repository:
```bash
git pull
```
2. Re-run the setup script:
```bash
./infrastructure_setup/setup-all.sh
```
3. Or update specific components:
```bash
./infrastructure_setup/setup-cert-manager.sh
./infrastructure_setup/setup-dashboard.sh
# etc.
```
### Updating Applications
For Helm chart applications:
```bash
# Update Helm repositories
helm repo update
# Upgrade a specific application
./bin/helm-install <chart-name> --upgrade
```
For services deployed with `deploy-service`:
```bash
# Edit the service YAML
nano services/<service-name>/service.yaml
# Apply changes
kubectl apply -f services/<service-name>/service.yaml
```
## Security
### Best Practices
1. **Keep Everything Updated**:
- Regularly update K3s
- Update all infrastructure components
- Keep application images up to date
2. **Network Security**:
- Use internal services whenever possible
- Limit exposed services to only what's necessary
- Configure your home router's firewall properly
3. **Access Control**:
- Use strong passwords for all services
- Implement a secrets management strategy
- Rotate API tokens and keys regularly
4. **Regular Audits**:
- Review running services periodically
- Check for unused or outdated deployments
- Monitor resource usage for anomalies
### Security Scanning (Future Implementation)
Tools to consider implementing:
1. **Trivy** for image scanning:
```bash
# Example Trivy usage (placeholder)
trivy image <your-image>
```
2. **kube-bench** for Kubernetes security checks:
```bash
# Example kube-bench usage (placeholder)
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
```
3. **Falco** for runtime security monitoring:
```bash
# Example Falco installation (placeholder)
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco --namespace falco --create-namespace
```
## System Health Monitoring
### Basic Monitoring
Check system health with:
```bash
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods -A
# Persistent volume claims
kubectl get pvc -A
```
### Advanced Monitoring (Future Implementation)
Consider implementing:
1. **Prometheus + Grafana** for comprehensive monitoring:
```bash
# Placeholder for future implementation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```
2. **Loki** for log aggregation:
```bash
# Placeholder for future implementation
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace logging --create-namespace
```
## Additional Resources
This document will be expanded in the future with:
- Detailed backup and restore procedures
- Monitoring setup instructions
- Comprehensive security hardening guide
- Automated maintenance scripts
For now, refer to the following external resources:
- [K3s Documentation](https://docs.k3s.io/)
- [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug/)
- [Velero Backup Documentation](https://velero.io/docs/latest/)
- [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/)

docs/SETUP.md

@@ -0,0 +1,112 @@
# Setting Up Your Personal Cloud
Welcome to your journey toward digital independence! This guide will walk you through setting up your own personal cloud infrastructure using Kubernetes, providing you with privacy, control, and flexibility.
## Hardware Recommendations
For a pleasant experience, we recommend:
- A dedicated mini PC, NUC, or old laptop with at least:
- 4 CPU cores
- 8GB RAM (16GB recommended)
- 128GB SSD (256GB or more recommended)
- A stable internet connection
- Optional: additional nodes for high availability
## Initial Setup
### 1. Prepare Environment Variables
First, create your environment configuration:
```bash
# Copy the example file and edit with your details
cp .env.example .env
nano .env
# Then load the environment variables
source load-env.sh
```
Important variables to set in your `.env` file:
- `DOMAIN`: Your domain name (e.g., `cloud.example.com`)
- `EMAIL`: Your email for Let's Encrypt certificates
- `CLOUDFLARE_API_TOKEN`: If using Cloudflare for DNS
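A missing variable here tends to surface later as a confusing setup failure, so it's worth checking up front. A small sketch (the variable names are the ones listed above; a check like this could run at the top of `load-env.sh` or the setup scripts):

```shell
# Fail early if any required environment variable is unset or empty.
check_env() {
  missing=0
  for var in DOMAIN EMAIL CLOUDFLARE_API_TOKEN; do
    eval "val=\${$var:-}"   # indirect lookup of $var's value
    if [ -z "$val" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

DOMAIN=cloud.example.com EMAIL=you@example.com CLOUDFLARE_API_TOKEN=xxxx \
  check_env && echo "environment OK"
# environment OK
```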
### 2. Install K3s (Lightweight Kubernetes)
K3s provides a fully-compliant Kubernetes distribution in a small footprint:
```bash
# Install K3s without the default load balancer (we'll use MetalLB)
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode=644 --disable servicelb
# Set up kubectl configuration
mkdir -p ~/.kube
sudo cat /etc/rancher/k3s/k3s.yaml > ~/.kube/config
chmod 600 ~/.kube/config
```
### 3. Install Infrastructure Components
One command sets up your entire cloud infrastructure:
```bash
./infrastructure_setup/setup-all.sh
```
This installs and configures:
- **MetalLB**: Provides IP addresses for services
- **Traefik**: Handles ingress (routing) with automatic HTTPS
- **cert-manager**: Manages TLS certificates automatically
- **CoreDNS**: Provides internal DNS resolution
- **ExternalDNS**: Updates DNS records automatically
- **Kubernetes Dashboard**: Web UI for managing your cluster
## Adding Additional Nodes (Optional)
For larger workloads or high availability, you can add more nodes:
```bash
# On your master node, get the node token
sudo cat /var/lib/rancher/k3s/server/node-token
# On each new node, join the cluster
curl -sfL https://get.k3s.io | K3S_URL=https://MASTER_IP:6443 K3S_TOKEN=NODE_TOKEN sh -
```
## Next Steps
Now that your infrastructure is set up, you can:
1. **Deploy Applications**: See [Applications Guide](./APPS.md) for deploying services and applications
2. **Access Dashboard**: Visit `https://dashboard.internal.yourdomain.com` and use the token from `./bin/dashboard-token`
3. **Validate Setup**: Run `./infrastructure_setup/validate_setup.sh` to ensure everything is working
## Validation and Troubleshooting
Run the validation script to ensure everything is working correctly:
```bash
./infrastructure_setup/validate_setup.sh
```
This script checks:
- All infrastructure components
- DNS resolution
- Service connectivity
- Certificate issuance
- Network configuration
If issues are found, the script provides specific remediation steps.
## What's Next?
Now that your personal cloud is running, consider:
- Setting up backups with [Velero](https://velero.io/)
- Adding monitoring with Prometheus and Grafana
- Deploying applications like Nextcloud, Home Assistant, or Gitea
- Exploring the Kubernetes Dashboard to monitor your services
Welcome to your personal cloud journey! You now have the foundation for hosting your own services and taking control of your digital life.

docs/learning/visibility.md

@@ -0,0 +1,331 @@
# Understanding Network Visibility in Kubernetes
This guide explains how applications deployed on our Kubernetes cluster become accessible from both internal and external networks. Whether you're deploying a public-facing website or an internal admin panel, this document will help you understand the journey from deployment to accessibility.
## The Visibility Pipeline
When you deploy an application to the cluster, making it accessible involves several coordinated components working together:
1. **Kubernetes Services** - Direct traffic to your application pods
2. **Ingress Controllers** - Route external HTTP/HTTPS traffic to services
3. **Load Balancers** - Assign external IPs to services
4. **DNS Management** - Map domain names to IPs
5. **TLS Certificates** - Secure connections with HTTPS
Let's walk through how each part works and how they interconnect.
## From Deployment to Visibility
### 1. Application Deployment
Your journey begins with deploying your application on Kubernetes. This typically involves:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
namespace: my-namespace
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: my-app
image: myapp:latest
ports:
- containerPort: 80
```
This creates pods running your application, but they're not yet accessible outside their namespace.
### 2. Kubernetes Service: Internal Connectivity
A Kubernetes Service provides a stable endpoint to access your pods:
```yaml
apiVersion: v1
kind: Service
metadata:
name: my-app
namespace: my-namespace
spec:
selector:
app: my-app
ports:
- port: 80
targetPort: 80
type: ClusterIP
```
With this `ClusterIP` service, your application is accessible within the cluster at `my-app.my-namespace.svc.cluster.local`, but not from outside.
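That in-cluster DNS name follows a fixed pattern, `[service].[namespace].svc.cluster.local`, which is easy to express as a helper:

```shell
# Compose the cluster-internal DNS name for a service.
cluster_dns() {
  echo "$1.$2.svc.cluster.local"
}

cluster_dns my-app my-namespace
# my-app.my-namespace.svc.cluster.local
```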
### 3. Ingress: Defining HTTP Routes
For HTTP/HTTPS traffic, an Ingress resource defines routing rules:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: my-app
namespace: my-namespace
annotations:
kubernetes.io/ingress.class: "traefik"
external-dns.alpha.kubernetes.io/target: "CLOUD_DOMAIN"
external-dns.alpha.kubernetes.io/ttl: "60"
spec:
rules:
- host: my-app.CLOUD_DOMAIN
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: my-app
port:
number: 80
tls:
- hosts:
- my-app.CLOUD_DOMAIN
secretName: wildcard-sovereign-cloud-tls
```
This Ingress tells the cluster to route requests for `my-app.CLOUD_DOMAIN` to your service. The annotations provide hints to other systems like ExternalDNS.
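The `CLOUD_DOMAIN` token in the manifest above is a placeholder resolved at deploy time. A hedged sketch of that substitution step using `sed` (the actual deploy tooling may use a different mechanism, and this assumes the domain contains no `/` characters):

```shell
# Replace the CLOUD_DOMAIN placeholder in a manifest read from stdin.
render_manifest() {
  domain="$1"
  sed "s/CLOUD_DOMAIN/${domain}/g"
}

printf 'host: my-app.CLOUD_DOMAIN\n' | render_manifest example.com
# host: my-app.example.com
```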
### 4. Traefik: The Ingress Controller
Our cluster uses Traefik as the ingress controller. Traefik watches for Ingress resources and configures itself to handle the routing rules. It acts as a reverse proxy and edge router, handling:
- HTTP/HTTPS routing
- TLS termination
- Load balancing
- Path-based routing
- Host-based routing
Traefik runs as a service in the cluster with its own external IP (provided by MetalLB).
### 5. MetalLB: Assigning External IPs
Since we're running on-premises (not in a cloud that provides load balancers), we use MetalLB to assign external IPs to services. MetalLB manages a pool of IP addresses from our local network:
```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: default
namespace: metallb-system
spec:
addresses:
- 192.168.8.240-192.168.8.250
```
This allows Traefik and any other LoadBalancer services to receive a real IP address from our network.
### 6. ExternalDNS: Automated DNS Management
ExternalDNS automatically creates and updates DNS records in our CloudFlare DNS zone:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-dns
namespace: externaldns
spec:
# ...
template:
spec:
containers:
- name: external-dns
image: registry.k8s.io/external-dns/external-dns
args:
- --source=service
- --source=ingress
- --provider=cloudflare
- --txt-owner-id=sovereign-cloud
```
ExternalDNS watches Kubernetes Services and Ingresses with appropriate annotations, then creates corresponding DNS records in CloudFlare, making your applications discoverable by domain name.
### 7. Cert-Manager: TLS Certificate Automation
To secure connections with HTTPS, we use cert-manager to automatically obtain and renew TLS certificates:
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: wildcard-sovereign-cloud-io
namespace: default
spec:
secretName: wildcard-sovereign-cloud-tls
dnsNames:
- "*.CLOUD_DOMAIN"
- "CLOUD_DOMAIN"
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
```
Cert-manager handles:
- Certificate request and issuance
- DNS validation (for wildcard certificates)
- Automatic renewal
- Secret storage of certificates
## The Two Visibility Paths
In our infrastructure, we support two primary visibility paths:
### Public Services (External Access)
Public services are those meant to be accessible from the public internet:
1. **Service**: Kubernetes ClusterIP service (internal)
2. **Ingress**: Defines routing with hostname like `service-name.CLOUD_DOMAIN`
3. **DNS**: ExternalDNS creates a CNAME record pointing to `CLOUD_DOMAIN`
4. **TLS**: Uses wildcard certificate for `*.CLOUD_DOMAIN`
5. **IP Addressing**: Traffic reaches the MetalLB-assigned IP for Traefik
6. **Network**: Traffic flows from external internet → router → MetalLB IP → Traefik → Kubernetes Service → Application Pods
**Deploy a public service with:**
```bash
./bin/deploy-service --type public --name myservice
```
### Internal Services (Private Access)
Internal services are restricted to the internal network:
1. **Service**: Kubernetes ClusterIP service (internal)
2. **Ingress**: Defines routing with hostname like `service-name.internal.CLOUD_DOMAIN`
3. **DNS**: ExternalDNS creates an A record pointing to the internal load balancer IP
4. **TLS**: Uses wildcard certificate for `*.internal.CLOUD_DOMAIN`
5. **IP Addressing**: Traffic reaches the MetalLB-assigned IP for Traefik
6. **Network**: Traffic flows from internal network → MetalLB IP → Traefik → Service → Pods
7. **Security**: Traefik middleware restricts access to internal network IPs
**Deploy an internal service with:**
```bash
./bin/deploy-service --type internal --name adminpanel
```
## How It All Works Together
1. **You deploy** an application using our deploy-service script
2. **Kubernetes** schedules and runs your application pods
3. **Services** provide a stable endpoint for your pods
4. **Traefik** configures routing based on Ingress definitions
5. **MetalLB** assigns real network IPs to LoadBalancer services
6. **ExternalDNS** creates DNS records for your services
7. **Cert-Manager** ensures valid TLS certificates for HTTPS
### Network Flow Diagram
```mermaid
flowchart TD
subgraph Internet["Internet"]
User("User Browser")
CloudDNS("CloudFlare DNS")
end
subgraph Cluster["Cluster"]
Router("Router")
MetalLB("MetalLB")
Traefik("Traefik Ingress")
IngSvc("Service")
IngPods("Application Pods")
Ingress("Ingress")
CertManager("cert-manager")
WildcardCert("Wildcard Certificate")
ExtDNS("ExternalDNS")
end
User -- "1\. DNS Query" --> CloudDNS
CloudDNS -- "2\. IP Address" --> User
User -- "3\. HTTPS Request" --> Router
Router -- "4\. Forward" --> MetalLB
MetalLB -- "5\. Route" --> Traefik
Traefik -- "6\. Route" --> Ingress
Ingress -- "7\. Forward" --> IngSvc
IngSvc -- "8\. Balance" --> IngPods
ExtDNS -- "A. Update DNS" --> CloudDNS
Ingress -- "B. Configure" --> ExtDNS
CertManager -- "C. Issue Cert" --> WildcardCert
Ingress -- "D. Use" --> WildcardCert
User:::internet
CloudDNS:::internet
Router:::cluster
MetalLB:::cluster
Traefik:::cluster
IngSvc:::cluster
IngPods:::cluster
Ingress:::cluster
CertManager:::cluster
WildcardCert:::cluster
ExtDNS:::cluster
classDef internet fill:#fcfcfc,stroke:#333
classDef cluster fill:#a6f3ff,stroke:#333
style User fill:#C8E6C9
style CloudDNS fill:#C8E6C9
style Router fill:#C8E6C9
style MetalLB fill:#C8E6C9
style Traefik fill:#C8E6C9
style IngSvc fill:#C8E6C9
style IngPods fill:#C8E6C9
style Ingress fill:#C8E6C9
style CertManager fill:#C8E6C9
style WildcardCert fill:#C8E6C9
style ExtDNS fill:#C8E6C9
```
A successful deployment creates a chain of connections:
```
Internet → DNS (domain name) → External IP → Traefik → Kubernetes Service → Application Pod
```
## Behind the Scenes: The Technical Magic
When you use our `deploy-service` script, several things happen:
1. **Template Processing**: The script processes a YAML template for your service type, using environment variables to customize it
2. **Namespace Management**: Creates or uses your service's namespace
3. **Resource Application**: Applies the generated YAML to create/update all Kubernetes resources
4. **DNS Configuration**: ExternalDNS detects the new resources and creates DNS records
5. **Certificate Management**: Cert-manager ensures TLS certificates exist or creates new ones
6. **Secret Distribution**: For internal services, certificates are copied to the appropriate namespaces
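The first three steps can be pictured as a dry-run plan. This sketch only prints what a deployment would do; the template and service paths are illustrative, while the `kubectl create namespace ... --dry-run=client | kubectl apply` idiom is a standard way to make namespace creation idempotent.

```shell
# Print the plan for deploying a service of a given type (dry run only).
plan_deploy() {
  type="$1"; name="$2"
  echo "render template: templates/${type}.yaml -> services/${name}/service.yaml"
  echo "ensure namespace: kubectl create namespace ${name} --dry-run=client -o yaml | kubectl apply -f -"
  echo "apply resources: kubectl apply -f services/${name}/service.yaml"
}

plan_deploy public blog
```

ExternalDNS and cert-manager (steps 4-6) then react to the applied resources on their own; no explicit step is needed for them.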
## Troubleshooting Visibility Issues
When services aren't accessible, the issue usually lies in one of these areas:
1. **DNS Resolution**: Domain not resolving to the correct IP
2. **Certificate Problems**: Invalid, expired, or missing TLS certificates
3. **Ingress Configuration**: Incorrect routing rules or annotations
4. **Network Issues**: Firewall rules or internal/external network segregation
Our [Visibility Troubleshooting Guide](/docs/troubleshooting/VISIBILITY.md) provides detailed steps for diagnosing these issues.
## Conclusion
The visibility layer in our infrastructure represents a sophisticated interplay of multiple systems working together. While complex under the hood, it provides a streamlined experience for developers to deploy applications with proper networking, DNS, and security.
By understanding these components and their relationships, you'll be better equipped to deploy applications and diagnose any visibility issues that arise.
## Further Reading
- [Traefik Documentation](https://doc.traefik.io/traefik/)
- [ExternalDNS Project](https://github.com/kubernetes-sigs/external-dns)
- [Cert-Manager Documentation](https://cert-manager.io/docs/)
- [MetalLB Project](https://metallb.universe.tf/)


@@ -0,0 +1,246 @@
# Troubleshooting Service Visibility
This guide covers common issues with accessing services from outside the cluster and how to diagnose and fix them.
## Common Issues
External access to your services might fail for several reasons:
1. **DNS Resolution Issues** - Domain names not resolving to the correct IP address
2. **Network Connectivity Issues** - Traffic can't reach the cluster's external IP
3. **TLS Certificate Issues** - Invalid or missing certificates
4. **Ingress/Service Configuration Issues** - Incorrectly configured routing
## Diagnostic Steps
### 1. Check DNS Resolution
**Symptoms:**
- Browser shows "site cannot be reached" or "server IP address could not be found"
- `ping` or `nslookup` commands fail for your domain
- Your service DNS records don't appear in CloudFlare or your DNS provider
**Checks:**
```bash
# Check if your domain resolves (from outside the cluster)
nslookup yourservice.yourdomain.com
# Check if ExternalDNS is running
kubectl get pods -n externaldns
# Check ExternalDNS logs for errors
kubectl logs -n externaldns -l app=external-dns | grep -i error
kubectl logs -n externaldns -l app=external-dns | grep -i "your-service-name"
# Check if CloudFlare API token is configured correctly
kubectl get secret cloudflare-api-token -n externaldns
```
**Common Issues:**
a) **ExternalDNS Not Running**: The ExternalDNS pod is not running or has errors.
b) **Cloudflare API Token Issues**: The API token is invalid, expired, or doesn't have the right permissions.
c) **Domain Filter Mismatch**: ExternalDNS is configured with a `--domain-filter` that doesn't match your domain.
d) **Annotations Missing**: Service or Ingress is missing the required ExternalDNS annotations.
**Solutions:**
```bash
# 1. Recreate CloudFlare API token secret
kubectl create secret generic cloudflare-api-token \
--namespace externaldns \
--from-literal=api-token="your-api-token" \
--dry-run=client -o yaml | kubectl apply -f -
# 2. Check and set proper annotations on your Ingress:
kubectl annotate ingress your-ingress -n your-namespace \
external-dns.alpha.kubernetes.io/hostname=your-service.your-domain.com
# 3. Restart ExternalDNS
kubectl rollout restart deployment -n externaldns external-dns
```
### 2. Check Network Connectivity
**Symptoms:**
- DNS resolves to the correct IP but the service is still unreachable
- Only some services are unreachable while others work
- Network timeout errors
**Checks:**
```bash
# Check if MetalLB is running
kubectl get pods -n metallb-system
# Check MetalLB IP address pool
kubectl get ipaddresspools.metallb.io -n metallb-system
# Verify the service has an external IP
kubectl get svc -n your-namespace your-service
```
**Common Issues:**
a) **MetalLB Configuration**: The IP pool doesn't match your network or is exhausted.
b) **Firewall Issues**: Firewall is blocking traffic to your cluster's external IP.
c) **Router Configuration**: NAT or port forwarding issues if using a router.
**Solutions:**
```bash
# 1. Check and update MetalLB configuration
kubectl apply -f infrastructure_setup/metallb/metallb-pool.yaml
# 2. Check service external IP assignment
kubectl describe svc -n your-namespace your-service
```
### 3. Check TLS Certificates
**Symptoms:**
- Browser shows certificate errors
- "Your connection is not private" warnings
- Cert-manager logs show errors
**Checks:**
```bash
# Check certificate status
kubectl get certificates -A
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager
# Check if your ingress is using the correct certificate
kubectl get ingress -n your-namespace your-ingress -o yaml
```
**Common Issues:**
a) **Certificate Issuance Failures**: DNS validation or HTTP validation failing.
b) **Wrong Secret Referenced**: Ingress is referencing a non-existent certificate secret.
c) **Expired Certificate**: Certificate has expired and wasn't renewed.
**Solutions:**
```bash
# 1. Check and recreate certificates
kubectl apply -f infrastructure_setup/cert-manager/wildcard-certificate.yaml
# 2. Update ingress to use correct secret
kubectl patch ingress your-ingress -n your-namespace --type=json \
-p='[{"op": "replace", "path": "/spec/tls/0/secretName", "value": "correct-secret-name"}]'
```
### 4. Check Ingress Configuration
**Symptoms:**
- HTTP 404, 503, or other error codes
- Service accessible from inside cluster but not outside
- Traffic routed to wrong service
**Checks:**
```bash
# Check ingress status
kubectl get ingress -n your-namespace
# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
# Check ingress configuration
kubectl describe ingress -n your-namespace your-ingress
```
**Common Issues:**
a) **Incorrect Service Targeting**: Ingress is pointing to wrong service or port.
b) **Traefik Configuration**: IngressClass or middleware issues.
c) **Path Configuration**: Incorrect path prefixes or regex.
**Solutions:**
```bash
# 1. Verify ingress configuration
kubectl edit ingress -n your-namespace your-ingress
# 2. Check that the referenced service exists
kubectl get svc -n your-namespace
# 3. Restart Traefik if needed
kubectl rollout restart deployment -n kube-system traefik
```
## Advanced Diagnostics
For more complex issues, you can use port-forwarding to test services directly:
```bash
# Port-forward the service directly
kubectl port-forward -n your-namespace svc/your-service 8080:80
# Then test locally
curl http://localhost:8080
```
You can also deploy a debug pod to test connectivity from inside the cluster:
```bash
# Start a debug pod
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
# Inside the pod, test DNS and connectivity
nslookup your-service.your-namespace.svc.cluster.local
wget -O- http://your-service.your-namespace.svc.cluster.local
```
## ExternalDNS Specifics
ExternalDNS can be particularly troublesome. Here are specific debugging steps:
1. **Check Log Level**: Set `--log-level=debug` for more detailed logs
2. **Check Domain Filter**: Ensure `--domain-filter` includes your domain
3. **Check Provider**: Ensure `--provider=cloudflare` (or your DNS provider)
4. **Verify API Permissions**: CloudFlare token needs Zone.Zone and Zone.DNS permissions
5. **Check TXT Records**: ExternalDNS uses TXT records for ownership tracking
```bash
# Enable verbose logging by adding --log-level=debug to the container args
# (note: `kubectl set env` only changes environment variables, not args)
kubectl edit deployment external-dns -n externaldns
# Check for specific domain errors
kubectl logs -n externaldns -l app=external-dns | grep -i yourservice.yourdomain.com
```
## CloudFlare Specific Issues
When using CloudFlare, additional issues may arise:
1. **API Rate Limiting**: CloudFlare may rate limit frequent API calls
2. **DNS Propagation**: Changes may take time to propagate through CloudFlare's CDN
3. **Proxied Records**: The `external-dns.alpha.kubernetes.io/cloudflare-proxied` annotation controls whether CloudFlare proxies traffic
4. **Access Restrictions**: CloudFlare Access or Page Rules may restrict access
5. **API Token Permissions**: The token must have Zone:Zone:Read and Zone:DNS:Edit permissions
6. **Zone Detection**: If using subdomains, ensure the parent domain is included in the domain filter
Check CloudFlare dashboard for:
- DNS record existence
- API access logs
- DNS settings including proxy status
- Any error messages or rate limit warnings