Maintenance Guide
This guide covers essential maintenance tasks for your personal cloud infrastructure, including troubleshooting, backups, updates, and security best practices.
Troubleshooting
General Troubleshooting Steps
- Check Component Status:
# Check all pods across all namespaces
kubectl get pods -A
# Look for pods that aren't Running or Completed (-E enables "|" alternation portably)
kubectl get pods -A | grep -Ev "Running|Completed"
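- View Detailed Pod Information:
# Get detailed info about problematic pods
kubectl describe pod <pod-name> -n <namespace>
# Check pod logs
kubectl logs <pod-name> -n <namespace>
# Logs from the previous container instance (useful when a pod keeps restarting)
kubectl logs <pod-name> -n <namespace> --previous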
- Run Validation Script:
./infrastructure_setup/validate_setup.sh
- Check Node Status:
kubectl get nodes
kubectl describe node <node-name>
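The `grep` filter above only looks at pod status. A pod that is `Running` but not Ready (for example `0/1`) therefore slips through. Below is a minimal sketch of a stricter check. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...) with the header line present. The function name `unhealthy_pods` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -A` output on stdin (header
# included) and prints pods that are neither Completed nor Running and fully Ready.
# Usage: kubectl get pods -A | unhealthy_pods
unhealthy_pods() {
  awk 'NR > 1 {                          # skip the header line
    split($3, ready, "/")                # READY column, e.g. "0/1"
    if ($4 == "Completed") next          # finished Job pods are expected
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2 " " $4 " " $3
  }'
}
```

Each printed `namespace/name` can then be inspected with `kubectl describe pod` and `kubectl logs`.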
Common Issues
Certificate Problems
If services show invalid certificates:
- Check certificate status:
kubectl get certificates -A
- Examine certificate details:
kubectl describe certificate <cert-name> -n <namespace>
- Check for cert-manager issues:
kubectl get pods -n cert-manager
kubectl logs -l app=cert-manager -n cert-manager
- Verify the Cloudflare API token is correctly set up:
kubectl get secret cloudflare-api-token -n internal
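When there are many certificates, a filter that prints only those whose READY column is not `True` narrows down which ones to describe. This sketch assumes the default `kubectl get certificates -A` columns (NAMESPACE, NAME, READY, SECRET, AGE). `not_ready_certificates` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get certificates -A` output on stdin
# (header included) and prints certificates whose READY column is not True.
# Usage: kubectl get certificates -A | not_ready_certificates
not_ready_certificates() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2 " READY=" $3 }'
}
```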
DNS Issues
If DNS resolution isn't working properly:
- Check CoreDNS status:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -l k8s-app=kube-dns -n kube-system
- Verify CoreDNS configuration:
kubectl get configmap -n kube-system coredns -o yaml
- Test DNS resolution from inside the cluster:
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
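To test several names at once, the same throwaway-pod approach can be wrapped in a loop. This is a minimal sketch. It assumes that `kubectl run` with `--restart=Never` and an attached session returns the container's exit status, and that busybox `nslookup` exits non-zero when a lookup fails; this behaviour can differ between busybox versions. `check_cluster_dns` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check_cluster_dns NAME...
# Runs one short-lived busybox pod per name and reports OK/FAIL for each.
# Returns non-zero if any lookup failed.
check_cluster_dns() {
  local name failed=0 i=0
  for name in "$@"; do
    i=$((i + 1))
    # Unique pod name per lookup so a pod that is still terminating cannot collide
    if kubectl run "dns-test-$$-$i" -i --rm --restart=Never --image=busybox \
        -- nslookup "$name" >/dev/null 2>&1; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}
# Usage: check_cluster_dns kubernetes.default example.com
```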
Service Connectivity
If services can't communicate:
- Check network policies:
kubectl get networkpolicies -A
- Verify service endpoints:
kubectl get endpoints -n <namespace>
- Test connectivity from within the cluster:
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- wget -qO- http://<service-name>.<namespace>:<port>
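An empty endpoint list usually means the Service selector matches no Ready pods. Below is a sketch of a check for a single Service. Like the `kubectl get endpoints` command above, it assumes the cluster still serves the classic Endpoints API. `service_has_endpoints` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: service_has_endpoints SERVICE NAMESPACE
# Prints the ready endpoint IPs, or returns 1 if the Service has none.
service_has_endpoints() {
  local svc="$1" ns="$2" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -n "$ips" ]; then
    echo "$ns/$svc -> $ips"
  else
    echo "$ns/$svc has no ready endpoints (check the Service selector and pod readiness)" >&2
    return 1
  fi
}
# Usage: service_has_endpoints <service-name> <namespace>
```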
Backup and Restore
What to Back Up
- Persistent Data:
  - Database volumes
  - Application storage
  - Configuration files
- Kubernetes Resources:
  - Custom Resource Definitions (CRDs)
  - Deployments, Services, Ingresses
  - Secrets and ConfigMaps
Backup Methods
Simple Backup Script
Create a backup script at bin/backup.sh (to be implemented):
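#!/bin/bash
# Simple backup script for your personal cloud
# This is a placeholder for future implementation
set -euo pipefail
# Secrets are only base64-encoded, so make the backup files readable by the owner only
umask 077
BACKUP_DIR="/path/to/backups/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"
# Back up Kubernetes resources
# Note: "get all" covers a fixed set of workload and Service types only, not Ingresses, PVCs, Secrets, ConfigMaps, or CRDs
kubectl get all -A -o yaml > "$BACKUP_DIR/all-resources.yaml"
kubectl get ingresses -A -o yaml > "$BACKUP_DIR/ingresses.yaml"
kubectl get pvc -A -o yaml > "$BACKUP_DIR/pvcs.yaml"
kubectl get crds -o yaml > "$BACKUP_DIR/crds.yaml"
kubectl get secrets -A -o yaml > "$BACKUP_DIR/secrets.yaml"
kubectl get configmaps -A -o yaml > "$BACKUP_DIR/configmaps.yaml"
# Back up persistent volumes
# TODO: Add logic to back up persistent volume data
echo "Backup completed: $BACKUP_DIR"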
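The script above creates one dated directory per run, so old backups accumulate. Below is a minimal retention sketch. It assumes backups live in `YYYY-MM-DD` subdirectories of a single parent directory, as `BACKUP_DIR` above produces. `prune_backups` and the keep count are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prune_backups ROOT KEEP
# Deletes all but the KEEP newest YYYY-MM-DD subdirectories of ROOT.
# Fixed-width ISO dates sort chronologically, so glob order is oldest first.
prune_backups() {
  local root="$1" keep="$2" d i
  local -a dirs=()
  # Collect only directories whose name matches the date pattern
  for d in "$root"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
    if [ -d "$d" ]; then
      dirs+=("$d")
    fi
  done
  local remove=$(( ${#dirs[@]} - keep ))
  for (( i = 0; i < remove; i++ )); do
    rm -rf -- "${dirs[i]}"
    echo "Removed ${dirs[i]}"
  done
}
# Usage: prune_backups /path/to/backups 7
```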
Using Velero (Recommended for Future)
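Velero backs up Kubernetes resources and persistent volume data (through snapshots or file-system backup) to object storage, and can restore them later:
# Install Velero (future implementation)
# A working install also needs an object-storage provider plugin, a bucket, and credentials set through the chart values, plus the velero CLI on your workstation
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero --namespace velero --create-namespace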
# Create a backup
velero backup create my-backup --include-namespaces default,internal
# Restore from backup
velero restore create --from-backup my-backup
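Velero can also run backups on a cron schedule instead of on demand. Below is a sketch wrapping `velero schedule create`. The 03:00 cron expression (evaluated in the Velero server's time zone) and the 30-day `--ttl` are illustrative assumptions. `create_daily_velero_schedule` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create_daily_velero_schedule NAME NAMESPACES
# Creates a Velero schedule that backs up the given comma-separated
# namespaces every day at 03:00 and keeps each backup for 30 days.
create_daily_velero_schedule() {
  local name="$1" namespaces="$2"
  velero schedule create "$name" \
    --schedule="0 3 * * *" \
    --include-namespaces "$namespaces" \
    --ttl 720h0m0s
}
# Usage: create_daily_velero_schedule nightly default,internal
```

`velero backup get` lists the backups that each scheduled run produces.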
Database Backups
For database services, set up regular dumps:
# PostgreSQL backup (placeholder)
kubectl exec <postgres-pod> -n <namespace> -- pg_dump -U <username> <database> > backup.sql
# MariaDB/MySQL backup (placeholder)
# Note: -p<password> on the command line is visible in the container's process list
kubectl exec <mariadb-pod> -n <namespace> -- mysqldump -u root -p<password> <database> > backup.sql
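For regular dumps it helps to compress the output and put the date in the file name. Below is a sketch for PostgreSQL that follows the same `kubectl exec ... pg_dump` pattern. `dump_postgres` and the file-naming scheme are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: dump_postgres POD NAMESPACE USER DATABASE OUTDIR
# Streams pg_dump output out of the pod, gzips it locally into
# OUTDIR/<database>-YYYY-MM-DD.sql.gz, and prints the file path.
dump_postgres() {
  local pod="$1" ns="$2" user="$3" db="$4" outdir="$5"
  local out
  out="$outdir/${db}-$(date +%Y-%m-%d).sql.gz"
  mkdir -p "$outdir"
  # pipefail (in a subshell, so the caller's options are untouched) makes a
  # failing pg_dump fail the whole pipeline even though gzip succeeds
  if ( set -o pipefail
       kubectl exec "$pod" -n "$ns" -- pg_dump -U "$user" "$db" | gzip > "$out" ); then
    echo "$out"
  else
    rm -f "$out"   # don't leave a truncated dump behind
    return 1
  fi
}
# Usage: dump_postgres <postgres-pod> <namespace> <username> <database> /path/to/backups/db
```

To restore such a dump, stream it back with stdin attached (note the `-i`): `gunzip -c <file>.sql.gz | kubectl exec -i <postgres-pod> -n <namespace> -- psql -U <username> <database>`.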
Updates
Updating Kubernetes (K3s)
- Check current version:
k3s --version
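- Update K3s:
# Installs the latest stable release; re-run it with the same options you used for the original install
# (set INSTALL_K3S_VERSION to pin a specific version)
curl -sfL https://get.k3s.io | sh -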
- Verify the update:
k3s --version
kubectl get nodes
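Below is a minimal sketch of a pinned upgrade that skips nodes already on the target version. It assumes `k3s --version` prints a first line of the form `k3s version v1.28.5+k3s1 (...)`, and it uses the install script's `INSTALL_K3S_VERSION` variable. The helper names and the example version string are hypothetical.

```bash
#!/usr/bin/env bash
# Print the running K3s version, e.g. "v1.28.5+k3s1" (third field of line 1)
k3s_current_version() {
  k3s --version | awk 'NR == 1 { print $3 }'
}

# Hypothetical helper: upgrade_k3s_to VERSION
# Re-runs the official install script pinned to VERSION; skips if already there.
upgrade_k3s_to() {
  local target="$1" current
  current="$(k3s_current_version)"
  if [ "$current" = "$target" ]; then
    echo "K3s already at $target"
    return 0
  fi
  echo "Upgrading K3s from $current to $target"
  # Also pass any options used for the original install (e.g. INSTALL_K3S_EXEC)
  curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="$target" sh -
}
# Usage: upgrade_k3s_to v1.29.0+k3s1
```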
Updating Infrastructure Components
- Update the repository:
git pull
- Re-run the setup script:
./infrastructure_setup/setup-all.sh
- Or update specific components:
./infrastructure_setup/setup-cert-manager.sh
./infrastructure_setup/setup-dashboard.sh
# etc.
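When updating several components by hand, it is safer to stop at the first failing script than to carry on with a half-updated cluster. This sketch assumes the setup scripts are bash scripts that exit non-zero on failure. `run_setup_scripts` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run_setup_scripts DIR SCRIPT...
# Runs each named script from DIR in order and stops at the first failure.
run_setup_scripts() {
  local dir="$1" script
  shift
  for script in "$@"; do
    echo "Running $script"
    # Assumes each setup script is a bash script that exits non-zero on failure
    if ! bash "$dir/$script"; then
      echo "Setup failed in $script; stopping" >&2
      return 1
    fi
  done
}
# Usage: run_setup_scripts ./infrastructure_setup setup-cert-manager.sh setup-dashboard.sh
```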
Updating Applications
For Helm chart applications:
# Update Helm repositories
helm repo update
# Upgrade a specific application
./bin/helm-install <chart-name> --upgrade
For services deployed with deploy-service:
# Edit the service YAML
nano services/<service-name>/service.yaml
# Apply changes
kubectl apply -f services/<service-name>/service.yaml
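Previewing a change before applying it avoids surprises. `kubectl diff` exits 0 when there are no differences, 1 when there are, and a value greater than 1 on error. The sketch below relies on those exit codes. `apply_with_diff` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply_with_diff FILE
# Shows what would change, then applies only if there is a difference.
apply_with_diff() {
  local file="$1" rc=0
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "No changes for $file"; return 0 ;;
    1) ;;  # differences found and printed by kubectl diff
    *) echo "kubectl diff failed for $file" >&2; return "$rc" ;;
  esac
  kubectl apply -f "$file"
}
# Usage: apply_with_diff services/<service-name>/service.yaml
```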
Security
Best Practices
- Keep Everything Updated:
  - Regularly update K3s
  - Update all infrastructure components
  - Keep application images up to date
- Network Security:
  - Use internal services whenever possible
  - Limit exposed services to only what's necessary
  - Configure your home router's firewall to forward only the ports you deliberately expose
- Access Control:
  - Use strong passwords for all services
  - Implement a secrets management strategy
  - Rotate API tokens and keys regularly
- Regular Audits:
  - Review running services periodically
  - Check for unused or outdated deployments
  - Monitor resource usage for anomalies
Security Scanning (Future Implementation)
Tools to consider implementing:
- Trivy for image scanning:
# Example Trivy usage (placeholder)
trivy image <your-image>
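- kube-bench for Kubernetes security checks:
# Example kube-bench usage (placeholder)
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
# Wait for the job to finish, then read its report
kubectl wait --for=condition=complete job/kube-bench --timeout=300s
kubectl logs job/kube-bench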
- Falco for runtime security monitoring:
# Example Falco installation (placeholder)
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco --namespace falco --create-namespace
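Once Trivy is installed, you can scan every image currently running in the cluster instead of one image at a time. This sketch assumes a local `trivy` binary. It only looks at regular containers, not init containers. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers built on `kubectl` and a locally installed `trivy`.

# Print each distinct container image used by pods in all namespaces
list_cluster_images() {
  kubectl get pods -A \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' |
    sort -u
}

# Scan every running image, reporting only HIGH and CRITICAL findings
scan_cluster_images() {
  local img
  list_cluster_images | while IFS= read -r img; do
    [ -n "$img" ] || continue
    echo "== $img"
    trivy image --severity HIGH,CRITICAL "$img" </dev/null
  done
}
# Usage: scan_cluster_images
```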
System Health Monitoring
Basic Monitoring
Check system health with:
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods -A
# Persistent volume claims
kubectl get pvc -A
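`kubectl top` depends on metrics-server, which K3s deploys by default. Its output can be filtered for nodes near capacity, and the PVC listing for claims that never bound. This sketch assumes the default column layouts: CPU percentage in column 3 and memory percentage in column 5 for `kubectl top nodes`, and STATUS in column 3 for `kubectl get pvc -A`. The function names and the threshold are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for filtering the output of the commands above.

# Reads `kubectl top nodes` output (with header) and prints nodes whose CPU or
# memory percentage is at or above LIMIT.
# Usage: kubectl top nodes | nodes_over_threshold 85
nodes_over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%$/, "", cpu); sub(/%$/, "", mem)   # strip the trailing percent sign
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
      print $1 " cpu=" $3 " mem=" $5
  }'
}

# Reads `kubectl get pvc -A` output (with header) and prints claims not Bound.
# Usage: kubectl get pvc -A | unbound_pvcs
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2 " " $3 }'
}
```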
Advanced Monitoring (Future Implementation)
Consider implementing:
- Prometheus + Grafana for comprehensive monitoring:
# Placeholder for future implementation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
- Loki for log aggregation:
# Placeholder for future implementation
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace logging --create-namespace
Additional Resources
This document will be expanded in the future with:
- Detailed backup and restore procedures
- Monitoring setup instructions
- Comprehensive security hardening guide
- Automated maintenance scripts
For now, refer to the following external resources: