CNC (Command and Control)
Tools
What worked
All infrastructure is git-managed from day one — Grafana dashboards, alert rules, Ansible playbooks, docker-compose all versioned, no manual UI steps. The fire-and-forget HTTP client pattern means monitored apps never block on the hub, so a CNC outage is invisible to production. The stack-normalization dedup logic (normalize frames, SHA-256 hash, unique index) handles code changes and source maps without manual tuning. Tests with per-test DB migration/cleanup emerged naturally and caught real integration issues.
What broke
BullMQ and ioredis had a version mismatch requiring an `as any` type cast — the library ecosystem wasn't ready. Admin auth coupling to Grafana is elegant when Grafana is up and broken when it isn't; accepted as a v1 trade-off. The Job Gateway phase has loose ends because I moved to Roughneck before finishing the integration — a pattern of widening scope before deepening.
Roles
I defined the taxonomy (error patterns, webhook event types, the three-app v1 scope) and drove the decision to use Ollama + Qwen3 32B locally instead of SaaS because zero-cloud-cost was a hard constraint. Claude Code built the Fastify routes, stack normalizer, and Grafana provisioning.
Overview
CNC is a centralized monitoring, error aggregation, and improvement-capture system for the portfolio of self-hosted applications. It acts as an "attention multiplier" for a solo developer — automatically watching production systems, detecting outages, classifying errors with local LLM inference, and capturing improvement ideas without requiring active surveillance.
Core value: Automatically know when things break across all apps without watching dashboards all day.
Target users: Solo developers and small teams running multiple production applications.
Currently monitoring: Etyde, GoVejle, and OpenClaw (a household AI agent on a TuringPi K3s cluster).
What It Does
- Health monitoring — Real-time per-app status with heartbeat staleness detection
- Error aggregation — Deduplication by normalized stack signature (SHA-256), occurrence tracking, cross-app pattern detection
- LLM-powered error taxonomy — Ollama (Qwen3 32B) classifies errors into categories, running locally at zero cloud cost
- Centralized logging — Loki-backed logs queryable by app
- Git-managed Grafana dashboards — Health overview, error timelines, uptime trends, all provisioned from YAML/JSON in version control
- Webhook system — HMAC-signed event delivery for app status changes, error patterns, and job failures
- CLI tool — Status, errors, logs; exports error context as Claude Code prompts
- Job gateway — Submission endpoint for Roughneck workers via BullMQ
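The error aggregation item above can be illustrated with a minimal sketch. Everything here is hypothetical rather than the hub's actual code: `normalizeStack`, `stackSignature`, and `ErrorStore` are illustrative names, the regex assumes V8-style `at ...` frames, and the in-memory `Map` stands in for a PostgreSQL table with a unique index on app and signature. Whether the real normalizer keeps the message line in the hash is not stated in the source; this sketch keeps it.

```typescript
import { createHash } from "node:crypto";

// Strip ":line:col" from V8-style frames ("at ..." lines) so that edits
// which only shift code do not change the signature. Other lines, such as
// the error message, are kept as-is (an assumption of this sketch).
function normalizeStack(stack: string): string {
  return stack
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) =>
      line.startsWith("at ") ? line.replace(/:\d+:\d+(?=\)?$)/, "") : line,
    )
    .join("\n");
}

// SHA-256 over the normalized stack, hex-encoded (64 characters).
function stackSignature(stack: string): string {
  return createHash("sha256").update(normalizeStack(stack)).digest("hex");
}

// In-memory stand-in for a table with a unique index on (app, signature).
class ErrorStore {
  private readonly counts = new Map<string, number>();

  // Records one occurrence and returns the running count for that signature.
  record(app: string, stack: string): number {
    const key = `${app}:${stackSignature(stack)}`;
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }

  // Number of distinct (app, signature) pairs seen so far.
  get distinct(): number {
    return this.counts.size;
  }
}
```

Two stacks that differ only in line and column numbers collapse into one entry with an incremented count, while the same stack reported by a different app is tracked separately.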
How It Fits Together
A pnpm monorepo with three packages: the Fastify hub API (PostgreSQL + BullMQ), an npm client library (@lovettbarron/cnc) that monitored apps import for fire-and-forget heartbeats and error reporting, and a CLI for manual inspection. Infrastructure runs on a Hetzner VPS provisioned by Ansible, with Grafana/Loki/Prometheus for observability and Ollama on an M4 Mac Mini (connected via WireGuard) for LLM classification.
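As a rough illustration of the fire-and-forget behaviour described above, the sketch below sends a heartbeat without ever returning a promise or throwing to the caller. It is not the actual `@lovettbarron/cnc` API: the function name `sendHeartbeat`, the `/heartbeat` path, the `x-api-key` header, the payload shape, and the 2-second timeout are all assumptions made for the example. It relies on the global `fetch` and `AbortController` available in Node 18 and later.

```typescript
// Minimal transport signature so a fake can be injected in tests.
type Transport = (
  url: string,
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
    signal: AbortSignal;
  },
) => Promise<unknown>;

// Default transport: the global fetch shipped with Node 18+.
const fetchTransport: Transport = (url, init) => fetch(url, init);

// Hypothetical client call. Returns void immediately; failures are dropped.
function sendHeartbeat(
  hubUrl: string,
  apiKey: string,
  app: string,
  transport: Transport = fetchTransport,
  timeoutMs = 2000,
): void {
  const controller = new AbortController();
  // Abort slow requests so a hung hub cannot hold sockets open indefinitely.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    transport(`${hubUrl}/heartbeat`, {
      method: "POST",
      headers: { "content-type": "application/json", "x-api-key": apiKey },
      body: JSON.stringify({ app, sentAt: new Date().toISOString() }),
      signal: controller.signal,
    })
      .catch(() => undefined) // hub down or slow: swallow the error
      .finally(() => clearTimeout(timer));
  } catch {
    // A transport that throws synchronously is also ignored.
    clearTimeout(timer);
  }
}
```

The caller never awaits anything, so a monitored app's request path is unaffected whether the hub responds, times out, or refuses the connection. The trade-off is that dropped heartbeats are silent on the client side and only show up as staleness on the hub.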
Architecture Decisions
- Fire-and-forget HTTP client — Monitored apps never block on the hub; CNC being down doesn't affect production services.
- Dual auth model — API keys for monitored apps, Bearer tokens for Roughneck jobs, admin auth validated against Grafana. The Grafana coupling is a known weakness (admin endpoints fail when Grafana is down).
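- Error dedup by stack signature — Normalize stack frames (strip line/col), SHA-256 hash the result, and enforce a unique index per app. Because positions are stripped, code edits or source-map changes that only shift line and column numbers still hash to the same signature, so no manual grouping rules are needed.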
- Grafana provisioning via file — Zero manual UI steps. Dashboards, datasources, alert rules all live in git.
- Local LLM over SaaS — Ollama + Qwen3 32B for error classification. Zero cloud cost was a hard constraint; the M4 Mac Mini handles inference adequately.
- HMAC-SHA256 callback verification — Per-job secrets stored at enqueue time for webhook integrity.
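The HMAC-SHA256 callback verification in the last item above might look roughly like the following. This is a sketch under assumptions, not the hub's implementation: `registerJob`, `sign`, and `verifyCallback` are hypothetical names, the in-memory `Map` stands in for wherever per-job secrets are actually stored, and hex encoding of the signature is an assumed wire format. The `node:crypto` calls (`createHmac`, `randomBytes`, `timingSafeEqual`) are standard Node.js APIs.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Stand-in for persistent storage of per-job secrets.
const jobSecrets = new Map<string, string>();

// Called at enqueue time: generate and remember a secret for this job.
function registerJob(jobId: string): string {
  const secret = randomBytes(32).toString("hex");
  jobSecrets.set(jobId, secret);
  return secret;
}

// HMAC-SHA256 over the raw request body, hex-encoded.
function sign(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Called when a callback arrives: recompute the HMAC and compare.
function verifyCallback(jobId: string, body: string, signatureHex: string): boolean {
  const secret = jobSecrets.get(jobId);
  if (secret === undefined) return false; // unknown job
  const expected = Buffer.from(sign(secret, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Comparing with `timingSafeEqual` rather than `===` avoids leaking how many leading bytes matched. The signature must be computed over the exact bytes received, so any JSON re-serialization before verification would break it.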
How It Evolved
v1.0 shipped the core loop: hub API, client library, Grafana dashboards, error aggregation, LLM classification, and CLI. v2 migrated from pg-boss to Redis + BullMQ, added the job gateway for Roughneck integration, and introduced the webhook event system.
The most telling iteration: OpenClaw (a K3s-hosted household AI agent) was added as a monitored app in May 2026 with Prometheus scraping via Tailscale Funnel and a dedicated Grafana dashboard. Adding a new monitored app exercised the full integration path and required no changes to the core hub — which validated the extensibility model.
The Job Gateway phase has loose ends (GATE-02/03/04) because I moved to Roughneck before completing the integration. This is a recurring pattern — widening to the next project before deepening the current one.
Weaknesses and Open Questions
- Job Gateway incomplete — GATE-02/03/04 integration not finished; scope shifted to Roughneck.
- BullMQ/ioredis version mismatch — Worked around with an `as any` type cast; library ecosystem issue.
- Admin auth couples to Grafana — Grafana down means admin endpoints unreachable. Accepted trade-off, but fragile.
- Staleness check uses JS filter — Fine for 3-5 apps, won't scale to 100+.
- Secret rotation not implemented — Callback secret lifecycle management is on the operational backlog.
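To make the staleness item in the list above concrete, here is a minimal sketch of what a JS-side filter could look like. The `AppHeartbeat` shape, the `findStaleApps` name, and the threshold value used in the example are assumptions; the source does not show the real check.

```typescript
// Hypothetical row shape: one record per monitored app.
interface AppHeartbeat {
  app: string;
  lastSeen: Date;
}

// Returns the names of apps whose last heartbeat is older than thresholdMs.
// Every row is loaded and scanned in memory on each check.
function findStaleApps(
  apps: AppHeartbeat[],
  now: Date,
  thresholdMs: number,
): string[] {
  return apps
    .filter((a) => now.getTime() - a.lastSeen.getTime() > thresholdMs)
    .map((a) => a.app);
}

// Example: with a 90-second threshold, only the app silent for 5 minutes is stale.
const now = new Date("2026-01-01T12:00:00Z");
const stale = findStaleApps(
  [
    { app: "etyde", lastSeen: new Date("2026-01-01T11:59:30Z") },
    { app: "govejle", lastSeen: new Date("2026-01-01T11:55:00Z") },
  ],
  now,
  90_000,
);
console.log(stale); // ["govejle"]
```

A linear scan like this is cheap for a handful of apps. At larger scale, one likely alternative would be to push the comparison into the database as a `WHERE` clause over an indexed timestamp column, so that only stale rows are returned.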
Ecosystem Role
CNC is the connective tissue for the portfolio — Etyde, GoVejle, and OpenClaw report into it, Roughneck consumes jobs from it, and TuringPi cluster observability flows through it. It's the project most affected by scope creep because every new project wants monitoring.