andrewlb notes

TuringPi

Tools

Claude Code, Ansible, K3s, FluxCD, Traefik, MetalLB, cert-manager, SOPS, Age, Longhorn, Prometheus, Grafana, Loki, Velero, Tailscale

What worked

91% complete (Phases 1-7 done, Phase 8 pending): 23 plans executed in ~0.92 hours of total execution time. Claude Code produced 40 idempotent Ansible roles using FQCN module names and serial execution for etcd safety, and wrote PITFALLS.md up front, identifying 7 critical risks with prevention strategies before a single role was written. K3s HA with 3 servers + 1 agent and embedded etcd delivered true survival of a 1-node failure on the first full deploy. The 3-tier storage split (local-path for speed, Longhorn distributed with 2 replicas, NFS to a Synology NAS) handled 7 production apps: Home Assistant, AdGuard, Immich, Paperless, Penpot, Calibre, and Sonarr with VPN routing. FluxCD was introduced gradually, starting with prune: false.
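
A minimal sketch of what "idempotent roles with FQCN" means in practice: fully qualified module names and state-based tasks that are safe to re-run. The task names, template, and handler below are illustrative, not the repo's actual content:

```yaml
# Illustrative tasks only; names and files are assumptions.
- name: Ensure NTP daemon is installed
  ansible.builtin.package:        # FQCN, not bare `package`
    name: chrony
    state: present                # declarative, so re-runs are no-ops

- name: Render the K3s config file
  ansible.builtin.template:
    src: config.yaml.j2           # hypothetical template
    dest: /etc/rancher/k3s/config.yaml
    mode: "0600"
  notify: restart k3s             # hypothetical handler
```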

What broke

32GB total RAM is a hard ceiling: memory-heavy apps aren't viable, and every deploy requires resource tuning. Node 3's SATA sits on PCIe Gen 2 x1 (~500MB/s), making it both a bottleneck and a single point of failure for write-intensive workloads. HAOS migration was complex: add-ons don't exist as standalone containers, so running Home Assistant in K3s instead of HAOS meant rebuilding the integration story. ARM64 image compatibility remains constant friction, since many images lack ARM64 variants. eMMC/SD write wear is mitigated with tmpfs and log rotation but remains a risk.

Roles

I set the 'never SSH to production' constraint: everything flows through Ansible or FluxCD, period. Hardware choices (4x CM4 8GB; Longhorn at 2 replicas rather than 3 to save storage; MetalLB in Layer 2 mode because there's no BGP router at home) were mine. Claude Code wrote all 40 Ansible roles, the FluxCD kustomizations, the Kyverno policies, and the backup infrastructure (Velero + Backblaze B2 + etcd snapshots + K3s token backup). Choosing SOPS + Age over Sealed Secrets was Claude's proposal, which I accepted for the simpler key management.

TuringPi (Homelab Kubernetes Cluster)

Overview

TuringPi is a production-grade, self-hosted Kubernetes homelab cluster on Turing Pi 2 hardware (4x Raspberry Pi CM4 modules, 8GB each). It runs containerized applications with full automation, high availability, and disaster recovery — entirely managed through configuration, no manual SSH operations.

Target user: a homelab enthusiast (sole operator) seeking self-hosted alternatives to cloud services.

Key Features

  • K3s Kubernetes Cluster — 3 server nodes (HA with embedded etcd) + 1 agent node
  • 7 Applications: Home Assistant Core, AdGuard Home, Immich, Paperless-ngx, Penpot, Calibre-web, and Sonarr (with VPN routing)
  • GitOps Pipeline — FluxCD watches GitHub repo, auto-reconciles cluster state
  • 3-Tier Storage: Local-path (fast), Longhorn (distributed, 2-replica), NFS to Synology NAS (shared)
  • Networking: MetalLB load balancer, Traefik ingress, cert-manager TLS, Tailscale remote access
  • Monitoring: kube-prometheus-stack (Prometheus/Grafana), Loki log aggregation, Alertmanager
  • Backup: Velero with Backblaze B2 cloud backend
  • Security: SOPS/Age encryption, NetworkPolicies, Kyverno policy enforcement
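
As a sketch of how the storage tiers are consumed, a workload selects its tier via storageClassName. The claim name and class names below are assumptions based on each provisioner's defaults, not the repo's actual manifests:

```yaml
# Hypothetical claim pinning a workload to the Longhorn tier.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: immich-data              # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn     # vs. local-path (fast) or nfs-csi (shared)
  resources:
    requests:
      storage: 20Gi
```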

Architecture

Tech Stack

  • Kubernetes: K3s v1.34.3+k3s3 (ARM64, embedded etcd)
  • OS: Ubuntu Server 24.04 LTS (arm64)
  • GitOps: FluxCD v2.5+
  • Ingress: Traefik v3.6+
  • Load Balancer: MetalLB v0.14+ (Layer 2)
  • TLS: cert-manager v1.16+ (Let's Encrypt)
  • Secrets: SOPS v3.9+ with Age v1.2+
  • VPN: Tailscale + WireGuard
  • Storage: Longhorn v1.7+, NFS CSI, local-path
  • Monitoring: kube-prometheus-stack, Loki, Grafana Alloy
  • Provisioning: Ansible 2.15+ (40 roles)

Hardware

  • Node 1: K3s init server; GPIO, HDMI, mini PCIe, DSI
  • Node 2: K3s server (HA); mini PCIe, M.2
  • Node 3: K3s server + SATA; 2x SATA III, M.2 (only node with SATA)
  • Node 4: K3s agent (worker); 4x USB 3.0, M.2
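
Node 1's "init server" role maps to K3s's embedded-etcd bootstrap: the first server initializes the cluster and the other servers join it. A hedged sketch of the two /etc/rancher/k3s/config.yaml shapes (the SAN, address, and token are placeholders, not real values):

```yaml
# Node 1: initialise the embedded etcd cluster
cluster-init: true
tls-san:
  - turingpi.local              # hypothetical API endpoint SAN
---
# Nodes 2-3: join the existing cluster instead of initialising a new one
server: https://<node1-ip>:6443
token: <shared-cluster-token>
```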

Ansible Automation

40 roles organized into 7 phases:

ansible/
├── site.yml              # Master playbook
├── roles/
│   ├── common/           # Hostname, timezone, packages, NTP, swap
│   ├── networking/       # Static IPs, /etc/hosts
│   ├── security/         # Users, SSH hardening (port 2222), firewall
│   ├── tailscale/        # VPN mesh network
│   ├── k3s-server/       # HA cluster with embedded etcd
│   ├── k3s-agent/        # Worker node join
│   ├── longhorn-install/ # Distributed storage
│   ├── nfs-csi-install/  # Synology NAS integration
│   ├── velero-install/   # Backup controller
│   ├── metallb-install/  # LoadBalancer
│   ├── traefik-install/  # Ingress
│   ├── cert-manager/     # Auto HTTPS
│   ├── fluxcd-bootstrap/ # GitOps + SOPS/Age
│   ├── fluxcd-kyverno/   # Policy enforcement
│   ├── core-services-*/  # Monitoring, DNS, dashboards
│   └── apps-*/           # Application deployments
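
A hypothetical shape for site.yml, using role names from the tree above; the host groups and play split are assumptions:

```yaml
# Illustrative only; the real site.yml may group phases differently.
- name: Phase 1 foundation on all nodes
  hosts: all
  roles: [common, networking, security, tailscale]

- name: Phase 2 control plane, one node at a time for etcd quorum safety
  hosts: k3s_servers            # hypothetical inventory group
  serial: 1
  roles: [k3s-server]

- name: Phase 2 worker join
  hosts: k3s_agents             # hypothetical inventory group
  roles: [k3s-agent]
```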

Development History

91% complete (Phases 1-7 done, Phase 8 pending), 23 plans in ~0.92 hours:

  • Phase 1 (Foundation): Ansible scaffold; common, networking, security, tailscale
  • Phase 2 (K3s cluster): HA with 3 servers + 1 agent, embedded etcd
  • Phase 3 (Storage): Longhorn, NFS CSI, Velero backups
  • Phase 4 (Networking): MetalLB, Traefik, cert-manager, NetworkPolicies
  • Phase 5 (GitOps): FluxCD, SOPS/Age, infrastructure HelmReleases, Kyverno
  • Phase 6 (Core services): Prometheus/Grafana, AdGuard DNS, Portainer, Homepage
  • Phase 7 (Applications): CloudNativePG, HA Core, Immich, Paperless, Penpot, Calibre, Sonarr
  • Phase 8 (Pending): MCP integration, disaster recovery docs, hardware swap guides
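
Phase 5's gradual GitOps adoption can be sketched as a Flux Kustomization with pruning off and SOPS/Age decryption on; the names and repo path below are assumptions:

```yaml
# Illustrative Flux Kustomization; path and names are hypothetical.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/turingpi/apps
  prune: false                  # flipped on later, once drift is trusted
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age            # Age private key stored as a Secret
```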

Architectural Decisions

  • K3s over full K8s: lightweight (~100MB), ARM64 support, embedded etcd HA
  • FluxCD over ArgoCD: lower resource footprint (critical for 8GB nodes)
  • Container HA over HAOS: HAOS takes the entire machine; containers enable cluster integration
  • 3 servers + 1 agent: odd server count for etcd quorum; survives 1 failure
  • Longhorn 2 replicas (not 3): saves storage on a 4-node cluster
  • SOPS + Age over Sealed Secrets: simpler key management, no Kubernetes controller dependency
  • MetalLB Layer 2: home network lacks a BGP router
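
The MetalLB Layer 2 choice comes down to two small resources; the pool name and address range below are assumptions, not the cluster's actual values:

```yaml
# Illustrative MetalLB L2 configuration.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # hypothetical range on the home LAN
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool
```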

Strengths

  • Exceptional planning — PITFALLS.md identifies 7 critical risks with prevention strategies before building
  • Hardware-agnostic — Variables can swap CM4 -> CM5 without code changes
  • True HA — Survives 1 node failure with automatic failover
  • Production-grade security — SSH hardening, NetworkPolicies, Tailscale, SOPS
  • Ansible excellence — 40 idempotent roles, FQCN, serial execution for etcd safety
  • Pragmatic GitOps — FluxCD introduced gradually, prune: false initially
  • Comprehensive backup — Multiple layers including etcd snapshots, Velero, K3s token backup
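
The Velero layer of that backup stack is typically driven by a Schedule resource; the cadence and retention below are illustrative, not the repo's actual values:

```yaml
# Hypothetical nightly backup to the Backblaze B2 backend.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"         # 02:00 daily (cron syntax)
  template:
    includedNamespaces: ["*"]
    ttl: 720h                   # keep backups 30 days
```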

Weaknesses & Risks

  • 32GB total RAM — Memory-heavy apps not viable; requires careful resource tuning
  • Node 3 SATA bottleneck — PCIe Gen 2 x1 (~500MB/s); a single point of failure for write-intensive workloads
  • HAOS migration complexity — Addons don't exist as standalone containers
  • ARM64 image compatibility — Many images lack ARM64 variants
  • eMMC/SD write wear — Mitigated with tmpfs and log rotation but still a risk
  • Resource exhaustion — Single app without limits can cascade-fail the cluster
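
One guard against the cascade-failure risk is a per-namespace LimitRange, so containers that omit resources still get a cap; the namespace and values below are illustrative for 8GB nodes:

```yaml
# Illustrative defaults applied to containers with no explicit resources.
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: apps               # hypothetical namespace
spec:
  limits:
    - type: Container
      default:                  # applied when limits are omitted
        memory: 512Mi
        cpu: 500m
      defaultRequest:           # applied when requests are omitted
        memory: 128Mi
        cpu: 100m
```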

Connection to Other Projects

  • GoVejle — Could host GoVejle infrastructure components
  • CNC — Could monitor the cluster's applications
  • Roughneck — Potential deployment target for worker infrastructure