andrewlb notes

TuringPi

Tools

Claude Code, Ansible, K3s, FluxCD, Traefik, MetalLB, cert-manager, SOPS, Tailscale, Prometheus, Grafana, Loki, Kyverno, kube-router

What worked

The v1.2 architecture simplification was the pivotal decision: switching from 3-server etcd HA to 1-server + 3-agent and replacing Longhorn with local-path-provisioner freed ~1GB RAM and eliminated distributed storage complexity that was overkill for a homelab. The 'never SSH to production' constraint forced everything through Ansible or FluxCD, which paid off when recovery bootstrap scripts could rebuild the cluster from scratch. The Wyoming voice pipeline (Whisper STT -> OpenClaw -> Piper TTS) working end-to-end through Home Assistant validated voice as a viable interface for household AI. The v1.3 observability addition — Prometheus metrics on OpenClaw scraped via Tailscale Funnel into a 19-panel Grafana dashboard on the CNC VPS — gave the voice pipeline production-grade visibility without adding load to the cluster itself.

What broke

32GB total RAM is a hard ceiling — the simplification helped but every new service still requires careful resource tuning. ARM64 image compatibility remains persistent friction. The voice pipeline depends on the Mac Mini node being available with no HA fallback, which is exactly the kind of single point of failure I'd criticize in someone else's architecture. The Twilio bridge for Android Auto phone calls is still blocked on Docker image build and E2E verification (Wave 2 pending). MCP server integrations (Home Assistant, Penpot) add Claude Code access to cluster services but expand the attack surface.

Roles

I set the 'never SSH to production' constraint and made the v1.2 architecture simplification call based on real-world experience — etcd quorum was solving a problem I didn't have. I made the call to add OpenClaw observability as Phase 20.1 before closing v1.3. Claude Code wrote the recovery bootstrap scripts, kube-router CNI migration, 51 Ansible roles, per-service deployment playbooks, Twilio bridge server, Wyoming container configs, the custom HA conversation agent, the CNC heartbeat CronJob, and the Grafana dashboard.

TuringPi (Homelab Kubernetes Cluster)

Overview

TuringPi is a self-hosted Kubernetes homelab cluster on Turing Pi 2 hardware (4x Raspberry Pi CM4 modules, 8GB each) plus an Intel Mac Mini as an external compute node. It runs containerized applications with full automation and disaster recovery — entirely managed through GitOps and Ansible with zero manual SSH operations. Four milestones shipped (v1.0–v1.3), 21 phases completed.

Target users: Homelab enthusiast (sole operator) seeking self-hosted alternatives to cloud services.

What It Does

  • K3s v1.34 cluster (1 server + 3 agents + 1 external x86_64 agent) with FluxCD GitOps reconciliation
  • kube-router CNI (replaced Flannel in v1.2) with Tailscale mesh networking via K3s's vpn-auth integration
  • 10+ applications: Home Assistant Core, AdGuard Home, Immich, Paperless-ngx, Penpot, Calibre-web, Sonarr with VPN routing, OpenClaw AI agent, Vikunja, Mosquitto MQTT
  • Wyoming voice pipeline: Whisper STT, Piper TTS, openWakeWord on Mac Mini, proxied through a custom HA conversation agent to OpenClaw, accessible via Siri Shortcut on iOS
  • OpenClaw observability: diagnostics-prometheus plugin, CNC heartbeat CronJob, 19-panel Grafana dashboard on CNC VPS, Prometheus scraping via Tailscale Funnel
  • MCP servers: Home Assistant API and Penpot API exposed to Claude Code via Tailscale
  • Twilio bridge for hands-free phone call interaction via Android Auto (Wave 1 complete, Wave 2 pending)
  • Full networking stack: MetalLB load balancer, Traefik ingress, cert-manager TLS (self-signed CA on *.homelab.local)
  • Monitoring: kube-prometheus-stack, Loki log aggregation, Alloy, Alertmanager
  • Backup: Velero with Backblaze B2 cloud backend
  • Security: SOPS/Age encryption, NetworkPolicies, Kyverno policy enforcement
  • 51 Ansible roles for idempotent provisioning with pre-flight validation

How It Fits Together

K3s runs on Ubuntu Server 24.04 (ARM64) across 4 CM4 nodes, with the Mac Mini joining as an external x86_64 agent via secondary Ethernet and Tailscale. FluxCD watches a GitHub repo and auto-reconciles cluster state. Ansible handles initial provisioning and recovery (51 roles across 7 phases). Secrets are encrypted with SOPS/Age and never stored in plaintext. Storage uses local-path-provisioner backed by a 500GB SATA drive on Node 3 (replaced Longhorn in v1.2). containerd image GC thresholds (high=70%, low=50%) conserve space on the 14GB eMMC. A sequential deployment pattern prevents out-of-memory (OOM) kills caused by simultaneous image pulls.
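To make the GC thresholds concrete, here is a minimal sketch of high/low watermark arithmetic. It assumes the thresholds follow the usual watermark semantics (collection starts once disk usage reaches the high mark and aims to bring usage down to the low mark). The function name and the example sizes are illustrative, not taken from the cluster's configuration.

```python
def bytes_to_free(capacity: int, used: int, high_pct: int = 70, low_pct: int = 50) -> int:
    """Return how much space a watermark-style image GC would try to reclaim.

    Assumption: GC triggers when usage reaches high_pct of capacity and
    then targets low_pct of capacity. Units are whatever the caller uses
    (bytes, MB, ...), as long as capacity and used agree.
    """
    if not 0 <= low_pct <= high_pct <= 100:
        raise ValueError("expected 0 <= low_pct <= high_pct <= 100")
    if used * 100 < capacity * high_pct:
        return 0  # below the high watermark: nothing to do
    target = capacity * low_pct // 100
    return max(0, used - target)


# Example with a 14,000 MB volume (hypothetical usage figures):
# at 10,000 MB used (about 71%), GC would aim to reclaim 3,000 MB.
reclaim = bytes_to_free(14_000, 10_000)
```

With high=70% and low=50% on a 14GB volume, each GC pass that triggers tries to clear at least about 2.8GB, which leaves room for several image pulls before the next pass.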

Architecture Decisions

  • 1 server + 3 agents over 3-server etcd HA — Quorum was overkill for a homelab; simplification freed ~1GB RAM across the cluster
  • local-path + SATA over Longhorn — Distributed storage's memory footprint was unjustifiable on 8GB ARM nodes
  • K3s over full K8s — 100MB footprint, native ARM64 support
  • kube-router over Flannel — Better NetworkPolicy support, critical for K3s v1.34+
  • FluxCD over ArgoCD — Lower resource footprint, critical for memory-constrained nodes
  • Container HA over HAOS — HAOS takes an entire machine; containers enable proper cluster integration
  • SOPS + Age over Sealed Secrets — Simpler key management, no Kubernetes controller dependency
  • MetalLB Layer 2 — Home network lacks a BGP router
  • Observability off-cluster — OpenClaw metrics scraped via Tailscale Funnel to the CNC VPS, keeping Grafana dashboard load off the cluster's limited RAM

What Changed After Dogfooding

The biggest lesson was admitting that etcd HA was premature engineering. I was building for a failure mode (server quorum loss) that simply doesn't matter in a homelab where the whole board shares a single power supply. Dropping it freed real resources and reduced operational complexity. Similarly, Longhorn's distributed storage was solving for multi-node persistence I didn't need — a single SATA drive on one node works fine when Velero handles disaster recovery to B2 cloud.
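The quorum arithmetic behind that call is easy to check. The sketch below uses the standard majority rule for Raft-style stores such as etcd (quorum = floor(n/2) + 1). The function names are illustrative and not part of the cluster's code.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting servers."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return members - quorum(members)


# Compare the two control-plane layouts discussed above.
for n in (1, 3):
    print(f"{n} server(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Three servers tolerate one independent failure and a single server tolerates none, but neither layout survives losing the shared power supply, which takes out every CM4 node at once.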

The voice pipeline (v1.3) was an exercise in accepting dependency chains. The Whisper -> OpenClaw -> Piper chain depends on Wyoming containers that all run on the Mac Mini, so a single node failure kills voice. I chose to ship it anyway because the alternative was not shipping it at all, but it's a conscious debt. Adding observability (Phase 20.1) was the responsible follow-up: if the pipeline is going to be fragile, at least make it visible. The 19-panel Grafana dashboard and CNC heartbeat CronJob provide production-grade monitoring without adding memory pressure to the cluster.

Weaknesses & Open Questions

  • 32GB total RAM — Memory-heavy apps not viable; every new service requires resource tuning
  • Node 3 SATA bottleneck — PCIe Gen 2 x1 (500MB/s); single point for write-intensive workloads
  • ARM64 image compatibility — Many container images lack ARM64 variants
  • eMMC/SD write wear — Mitigated with tmpfs and log rotation but still a long-term risk
  • Voice pipeline has no HA — Mac Mini going offline kills Wyoming containers with no fallback
  • Twilio bridge incomplete — Wave 1 (server code) done, Wave 2 (Docker build + Twilio number verification) still pending
  • MCP attack surface — Exposing the Home Assistant and Penpot APIs to Claude Code via Tailscale expands what an AI agent can touch
  • No v2.0 roadmap — v1.3 milestone is complete; next direction undefined
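The resource tuning that the RAM ceiling forces can be sketched as a per-node headroom check. This is a minimal illustration, not the project's actual tooling: the `Service` type, the 1024 MiB system reservation, and the request figures in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    request_mib: int  # memory request in MiB (hypothetical values below)


def headroom_mib(services, capacity_mib: int = 8192, reserved_mib: int = 1024) -> int:
    """Memory left on one 8GB node after system reservation and service requests.

    A negative result means the node is over-committed.
    """
    allocatable = capacity_mib - reserved_mib
    return allocatable - sum(s.request_mib for s in services)


# Hypothetical placement on a single CM4 node.
node = [Service("photo-library", 2048), Service("document-archive", 1024)]
remaining = headroom_mib(node)  # MiB still available for new services
```

Running a check like this before adding a service makes the trade-off explicit: anything that pushes a node's headroom below zero has to be tuned down, moved to another node, or dropped.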

Ecosystem Role

TuringPi is the deployment target for OpenClaw (household AI agent running as a K3s pod with Prometheus metrics) and provides the infrastructure backbone for Home Assistant integration. CNC monitors OpenClaw via Prometheus scrape through Tailscale Funnel and a dedicated 19-panel Grafana dashboard. The cluster could eventually host other portfolio services (GoVejle, Roughneck workloads) if RAM constraints ease with a hardware upgrade.