Paul Payne 84376fb3d5 Initial commit.

2025-04-27 14:57:00 -07:00

Maintenance Guide

This guide covers essential maintenance tasks for your personal cloud infrastructure, including troubleshooting, backups, updates, and security best practices.

Troubleshooting

General Troubleshooting Steps

Check Component Status:

# Check all pods across all namespaces
kubectl get pods -A

# List pods whose STATUS is neither Running nor Completed (the header line is kept)
kubectl get pods -A | grep -v "Running\|Completed"

View Detailed Pod Information:

# Get detailed info about problematic pods
kubectl describe pod <pod-name> -n <namespace>

# Check pod logs
kubectl logs <pod-name> -n <namespace>

Run Validation Script:

./infrastructure_setup/validate_setup.sh

Check Node Status:

kubectl get nodes
kubectl describe node <node-name>
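
The `grep` filter in the first step matches on the STATUS column only, so a pod that is `Running` but not fully ready (for example `0/1`) slips through. A minimal sketch of a stricter filter follows. The function name `unhealthy_pods` is made up for this guide, and it assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin
# and prints pods that are not fully ready.
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }            # finished Jobs are fine
    {
      split($3, ready, "/")               # READY column, e.g. "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) {
        print $1 "/" $2, $3, $4
      }
    }'
}

# Usage (needs cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

An empty result means every pod is either fully ready or completed.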

Common Issues

Certificate Problems

If services show invalid certificates:

Check certificate status:
```
kubectl get certificates -A
```

Examine certificate details:

kubectl describe certificate <cert-name> -n <namespace>

Check for cert-manager issues:

kubectl get pods -n cert-manager
kubectl logs -l app=cert-manager -n cert-manager

Verify the Cloudflare API token is correctly set up:

kubectl get secret cloudflare-api-token -n internal
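
On a cluster with many certificates it can help to list only those that cert-manager has not marked ready. A small sketch, assuming the default columns of `kubectl get certificates -A --no-headers` (NAMESPACE, NAME, READY, SECRET, AGE); `not_ready_certs` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get certificates -A --no-headers`
# on stdin and prints certificates whose READY column is not "True".
not_ready_certs() {
  awk '$3 != "True" { print $1 "/" $2, "ready=" $3 }'
}

# Usage (needs cluster access):
#   kubectl get certificates -A --no-headers | not_ready_certs
```

Each certificate it prints is a candidate for the `kubectl describe certificate` step.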

DNS Issues

If DNS resolution isn't working properly:

Check CoreDNS status:

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -l k8s-app=kube-dns -n kube-system

Verify CoreDNS configuration:

kubectl get configmap -n kube-system coredns -o yaml

Test DNS resolution from inside the cluster:

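# busybox is pinned to 1.28 because nslookup in later busybox images is known to give misleading results with cluster DNS
kubectl run -i --tty --rm debug --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default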

Service Connectivity

If services can't communicate:

Check network policies:
```
kubectl get networkpolicies -A
```
Verify service endpoints:
```
kubectl get endpoints -n <namespace>
```

Test connectivity from within the cluster:

kubectl run -i --tty --rm debug --image=busybox --restart=Never -- wget -O- <service-name>.<namespace>
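
The short name `<service-name>.<namespace>` resolves through the pod's DNS search path. When that fails, retrying with the fully qualified name helps separate search-path problems from a missing Service. A sketch of building that name, assuming the default cluster domain `cluster.local`; `svc_fqdn` is a hypothetical helper.

```shell
# Hypothetical helper: print the cluster-internal DNS name of a Service.
# Arguments: service name, namespace, optional cluster domain
# (assumed to be "cluster.local" when omitted).
svc_fqdn() {
  local service="$1" namespace="$2" domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}

# Usage (needs cluster access):
#   kubectl run -i --tty --rm debug --image=busybox --restart=Never -- \
#     wget -O- "$(svc_fqdn <service-name> <namespace>)"
```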

Backup and Restore

What to Back Up

Persistent Data:
- Database volumes
- Application storage
- Configuration files

Kubernetes Resources:
- Custom Resource Definitions (CRDs)
- Deployments, Services, Ingresses
- Secrets and ConfigMaps

Backup Methods

Simple Backup Script

Create a backup script at bin/backup.sh (to be implemented):

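#!/bin/bash
# Simple backup script for your personal cloud
# This is a placeholder for future implementation
set -euo pipefail

# Exported Secrets are only base64-encoded, so keep the backup files private
umask 077

BACKUP_DIR="/path/to/backups/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"

# Back up Kubernetes resources
# Note: "get all" covers workloads and Services only, so the other
# resource types listed above are exported explicitly.
kubectl get all -A -o yaml > "$BACKUP_DIR/all-resources.yaml"
kubectl get ingresses -A -o yaml > "$BACKUP_DIR/ingresses.yaml"
kubectl get crds -o yaml > "$BACKUP_DIR/crds.yaml"
kubectl get secrets -A -o yaml > "$BACKUP_DIR/secrets.yaml"
kubectl get configmaps -A -o yaml > "$BACKUP_DIR/configmaps.yaml"

# Back up persistent volumes
# TODO: Add logic to back up persistent volume data

echo "Backup completed: $BACKUP_DIR"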
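
Because the script writes one directory per day, old backups accumulate. The following is a sketch of a retention step, not part of the repository: `prune_backups` is a hypothetical helper, and it assumes backup directories are named `YYYY-MM-DD` directly under one root, as `BACKUP_DIR` is above.

```shell
# Hypothetical helper: keep the newest N dated backup directories.
# Arguments: backup root directory, number of backups to keep.
prune_backups() {
  local backup_root="$1" keep="$2" dir
  # YYYY-MM-DD names sort chronologically as plain text, so a reverse
  # sort puts the newest first; everything after line N is removed.
  printf '%s\n' "${backup_root:?}"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] |
    sort -r |
    tail -n "+$((keep + 1))" |
    while IFS= read -r dir; do
      [ -d "$dir" ] || continue          # skips the unexpanded pattern
      rm -rf -- "$dir"
      echo "removed $dir"
    done
}

# Usage: keep the seven most recent daily backups
#   prune_backups /path/to/backups 7
```

Only directories matching the date pattern are touched; anything else under the root is left alone.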

Using Velero (Recommended for Future)

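Velero backs up and restores Kubernetes resources and persistent volumes. The commands below are a starting point only; a working installation also needs a backup storage location and a matching provider plugin configured in the chart values: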

# Install Velero (future implementation)
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero --namespace velero --create-namespace

# Create a backup
velero backup create my-backup --include-namespaces default,internal

# Restore from backup
velero restore create --from-backup my-backup

Database Backups

For database services, set up regular dumps:

# PostgreSQL backup (placeholder)
kubectl exec <postgres-pod> -n <namespace> -- pg_dump -U <username> <database> > backup.sql

# MariaDB/MySQL backup (placeholder)
kubectl exec <mariadb-pod> -n <namespace> -- mysqldump -u root -p<password> <database> > backup.sql
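
A plain redirect leaves a truncated or empty `backup.sql` behind when the dump fails. One way to guard against that, sketched here for the PostgreSQL case, is to write to a temporary file and keep it only if the command succeeded and produced output. `dump_postgres` and its argument order are assumptions for illustration.

```shell
# Hypothetical helper: dump one PostgreSQL database to a dated file.
# Arguments: pod, namespace, database user, database name, output directory.
dump_postgres() {
  local pod="$1" namespace="$2" user="$3" database="$4" out_dir="$5"
  local out_file="$out_dir/$database-$(date +%Y-%m-%d).sql"

  # Write to a temporary file first so a failed dump never replaces a good one.
  if kubectl exec "$pod" -n "$namespace" -- pg_dump -U "$user" "$database" > "$out_file.tmp" \
     && [ -s "$out_file.tmp" ]; then
    mv "$out_file.tmp" "$out_file"
    echo "$out_file"
  else
    rm -f "$out_file.tmp"
    echo "dump of $database failed" >&2
    return 1
  fi
}

# Usage (needs cluster access):
#   dump_postgres <postgres-pod> <namespace> <username> <database> /path/to/backups
```

On success the function prints the path of the finished dump, which makes it easy to chain into a copy or upload step.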

Updates

Updating Kubernetes (K3s)

Check current version:
```
k3s --version
```
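Update K3s by re-running the install script with the same options and environment variables used for the original installation: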
```
curl -sfL https://get.k3s.io | sh -
```
Verify the update:
```
k3s --version
kubectl get nodes
```
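
On a multi-node cluster, the verification step can be narrowed to nodes that are not ready or still report an older version. A sketch, assuming the default columns of `kubectl get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION); `nodes_not_on_version` is a hypothetical helper and the version string in the usage line is only an example.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` on stdin and
# prints nodes that are not Ready or do not report the expected version.
# Argument: expected kubelet version string.
nodes_not_on_version() {
  local want="$1"
  awk -v want="$want" '$2 != "Ready" || $5 != want { print $1, $2, $5 }'
}

# Usage (needs cluster access; example version only):
#   kubectl get nodes --no-headers | nodes_not_on_version 'v1.32.3+k3s1'
```

A cordoned node shows `Ready,SchedulingDisabled` and is therefore listed as well, which is usually what you want to notice after an upgrade.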

Updating Infrastructure Components

Update the repository:
```
git pull
```
Re-run the setup script:
```
./infrastructure_setup/setup-all.sh
```

Or update specific components:

./infrastructure_setup/setup-cert-manager.sh
./infrastructure_setup/setup-dashboard.sh
# etc.

Updating Applications

For Helm chart applications:

# Update Helm repositories
helm repo update

# Upgrade a specific application
./bin/helm-install <chart-name> --upgrade

For services deployed with deploy-service:

# Edit the service YAML
nano services/<service-name>/service.yaml

# Apply changes
kubectl apply -f services/<service-name>/service.yaml

Security

Best Practices

Keep Everything Updated:
- Regularly update K3s
- Update all infrastructure components
- Keep application images up to date

Network Security:
- Use internal services whenever possible
- Limit exposed services to only what's necessary
- Configure your home router's firewall properly

Access Control:
- Use strong passwords for all services
- Implement a secrets management strategy
- Rotate API tokens and keys regularly

Regular Audits:
- Review running services periodically
- Check for unused or outdated deployments
- Monitor resource usage for anomalies
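
For the image and audit items above, a useful first check is which running images use a floating tag, since those cannot be told apart from an outdated copy. A sketch under the assumption that image references arrive one per line; `unpinned_images` is a hypothetical helper.

```shell
# Hypothetical helper: reads image references (one per line) on stdin and
# prints those with no tag or the "latest" tag. Digest-pinned images pass.
unpinned_images() {
  sort -u | awk '
    NF == 0     { next }                  # skip blank lines
    /@sha256:/  { next }                  # pinned by digest
    {
      n = split($0, parts, "/")
      last = parts[n]                     # ignore a registry port such as :5000
      if (last !~ /:/ || last ~ /:latest$/) {
        print $0
      }
    }'
}

# Usage (needs cluster access):
#   kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' \
#     | tr -s '[:space:]' '\n' | unpinned_images
```

Only the last path component is inspected for a tag, so a registry address with a port is not mistaken for one.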

Security Scanning (Future Implementation)

Tools to consider implementing:

Trivy for image scanning:

# Example Trivy usage (placeholder)
trivy image <your-image>

kube-bench for Kubernetes security checks:

# Example kube-bench usage (placeholder)
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml

Falco for runtime security monitoring:

# Example Falco installation (placeholder)
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco --namespace falco --create-namespace

System Health Monitoring

Basic Monitoring

Check system health with:

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -A

# Persistent volume claims
kubectl get pvc -A
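
To turn the node usage output into a quick pass/fail check, the percentages can be compared against a threshold. A sketch, assuming the default columns of `kubectl top nodes --no-headers` (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); `nodes_over_threshold` and the 80% default are assumptions for illustration.

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` on stdin and
# prints nodes whose CPU% or MEMORY% exceeds the limit.
# Argument: percentage limit (assumed default: 80).
nodes_over_threshold() {
  local limit="${1:-80}"
  awk -v limit="$limit" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)  # strip the percent sign
    if (cpu + 0 > limit || mem + 0 > limit) {
      print $1, $3, $5
    }
  }'
}

# Usage (needs cluster access and a metrics server):
#   kubectl top nodes --no-headers | nodes_over_threshold 80
```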

Advanced Monitoring (Future Implementation)

Consider implementing:

Prometheus + Grafana for comprehensive monitoring:

# Placeholder for future implementation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

Loki for log aggregation:

# Placeholder for future implementation
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace logging --create-namespace

Additional Resources

This document will be expanded in the future with:

- Detailed backup and restore procedures
- Monitoring setup instructions
- Comprehensive security hardening guide
- Automated maintenance scripts

For now, refer to the following external resources:
