andrewlb notes

CNC (Command and Control)

Tools

Claude Code, TypeScript, Fastify, PostgreSQL, Drizzle ORM, BullMQ, Redis, Ollama, Qwen3 32B, Grafana, Loki, Ansible

What worked

v1.0 shipped in 8 phases with all infrastructure git-managed — Grafana dashboards, alert rules, Ansible playbooks, and docker-compose all versioned from day one. Claude Code handled the fire-and-forget HTTP client pattern cleanly (monitored apps never block on the hub) and produced the stack-normalization dedup logic without hand-holding. 17 test files with per-test DB migration/cleanup emerged naturally from the prompts.
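The fire-and-forget client pattern can be sketched in a few lines. This is an illustrative assumption, not the real client code: the endpoint path, env var names, and payload shape are made up for the example.

```typescript
// Hypothetical sketch of the fire-and-forget error reporter.
// CNC_HUB_URL, CNC_API_KEY, and the /v1/errors path are assumptions.
const HUB_URL = process.env.CNC_HUB_URL ?? "http://localhost:3000";

export function reportError(appId: string, err: Error): void {
  // Intentionally NOT awaited: a down or slow hub must never block the app.
  fetch(`${HUB_URL}/v1/errors`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": process.env.CNC_API_KEY ?? "",
    },
    body: JSON.stringify({ appId, message: err.message, stack: err.stack }),
  }).catch(() => {
    // Swallow the failure: monitoring errors are invisible to the monitored app.
  });
}
```

The whole point is the `.catch(() => {})`: the promise is deliberately dropped, so the hub being unreachable costs the caller nothing.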

What broke

BullMQ and ioredis had a version mismatch that required an `as any` type cast — Claude surfaced the conflict but the library ecosystem wasn't ready. The admin auth coupling to Grafana is elegant when Grafana is up and broken when it isn't; I accepted it as a v1 tradeoff. Phase 10 (Job Gateway) still has GATE-02/03/04 loose ends because I moved to Roughneck before finishing the integration.

Roles

I defined the taxonomy — what counts as an error pattern, what the webhook event types should be, the three-app v1 scope (Etyde, GoVejle, plus the self-referential hub). Claude Code built the Fastify routes, the stack normalizer, and the Grafana provisioning. I drove the decision to use Ollama + Qwen3 32B locally instead of a SaaS classifier because zero-cloud-cost was a hard constraint.

CNC (Command and Control)

Overview

CNC is a centralized monitoring, error aggregation, and improvement-capture system for the portfolio of AI-generated applications. It acts as an "attention multiplier" for a solo developer — automatically watching production systems, detecting outages, classifying errors with local LLM inference, and capturing improvement ideas without requiring active surveillance.

Core value: Automatically know when things break across all apps without watching dashboards all day.

Target users: Solo/small developer teams running multiple production applications.

Currently monitoring: Etyde and GoVejle.

Key Features

  • Health monitoring — Real-time per-app status (up/down/unknown) with heartbeat staleness detection (180s threshold)
  • Error aggregation — Deduplication by stack signature (normalized SHA-256), occurrence tracking
  • Cross-app pattern detection — Same error category in 2+ apps signals structural problems
  • LLM-powered taxonomy — Ollama (Qwen3 32B) classifies errors into categories
  • Log aggregation — Loki-backed centralized logs queryable by app
  • Grafana dashboards — Git-managed (zero manual UI), health overview, error timelines, uptime trends
  • CLI tool — Status, errors, logs; exports error context as Claude Code prompts
  • Improvement notes queue — Capture and track improvement ideas as Grafana annotations
  • Job gateway API — Submission endpoint for Roughneck; enqueues to BullMQ
  • Webhook system — HMAC-signed event delivery for app.down, app.up, error.pattern, note.created, job.failed
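The 180s staleness rule above reduces to a small pure function. A minimal sketch, assuming hypothetical names (`appStatus`, `Status`) that are not from the real codebase:

```typescript
// Staleness rule from the feature list: "up" within 180s of the last
// heartbeat, "down" past the threshold, "unknown" if never reported.
const STALE_MS = 180_000;

type Status = "up" | "down" | "unknown";

export function appStatus(lastHeartbeat: Date | null, now: Date = new Date()): Status {
  if (!lastHeartbeat) return "unknown";
  return now.getTime() - lastHeartbeat.getTime() <= STALE_MS ? "up" : "down";
}
```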

Architecture

Tech Stack

| Layer         | Technology                                   |
|---------------|----------------------------------------------|
| Monorepo      | pnpm 9.15.0 workspaces                       |
| API           | Fastify 5 + Node.js 22                       |
| Database      | PostgreSQL 17 + Drizzle ORM                  |
| Queue         | BullMQ 5.71 + Redis 7 (v2; v1 used pg-boss)  |
| Proxy         | Caddy 2.11                                   |
| Observability | Grafana 12.4 + Loki 3.6 + Prometheus         |
| LLM           | Ollama + Qwen3 32B (on M4 Mac Mini)          |
| Provisioning  | Ansible + Docker Compose                     |
| CI/CD         | GitHub Actions                               |

Package Structure

packages/
  hub/       # Fastify API (10 DB tables, 9 route modules, BullMQ workers)
  client/    # npm package @lovettbarron/cnc (heartbeat loop, error reporting)
  cli/       # CLI tool (status, errors, logs, init)

infra/
  ansible/   # VPS provisioning playbook
  docker/    # Dockerfiles, init SQL
  grafana/   # Provisioned datasources, dashboards, alert rules
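The client package's heartbeat loop could look something like the sketch below. The function name, endpoint, and payload are assumptions for illustration, not the actual @lovettbarron/cnc API:

```typescript
// Hypothetical heartbeat loop: POST a timestamp on an interval,
// swallow failures, and hand back a stop function.
export function startHeartbeat(
  hubUrl: string,
  appId: string,
  intervalMs = 60_000,
): () => void {
  const timer = setInterval(() => {
    // Fire-and-forget, same as error reporting: failures are invisible.
    fetch(`${hubUrl}/v1/heartbeat`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ appId, ts: Date.now() }),
    }).catch(() => {});
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Returning a stop function keeps the loop testable and lets the host app shut down cleanly.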

Key Patterns

  1. Fire-and-forget HTTP — Client never blocks on monitoring; hub being down doesn't affect monitored apps
  2. Dual auth model — x-api-key for monitored apps, Bearer + hashed lookup for Roughneck jobs, admin Bearer validated against Grafana
  3. Error dedup by stack signature — Normalize stack frames (remove :line:col), SHA-256 hash, unique index on (app_id, stack_signature)
  4. HMAC-SHA256 callback verification — Per-job secrets stored at enqueue time
  5. Grafana provisioning via file — Dashboards, datasources, alert rules all YAML/JSON in git; zero manual UI steps
  6. Heartbeat metadata snapshots — Optional JSON metadata with heartbeats for historical trending
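Pattern 3 can be sketched concretely. A minimal version, assuming a simple `:line:col` regex (the real normalizer may handle more cases, e.g. source-map paths):

```typescript
import { createHash } from "node:crypto";

// Strip :line:col from each stack frame so recompiles and minor edits
// don't mint new signatures, then hash the normalized stack.
export function stackSignature(stack: string): string {
  const normalized = stack
    .split("\n")
    .map((frame) => frame.trim().replace(/:\d+:\d+\)?$/, ""))
    .join("\n");
  return createHash("sha256").update(normalized).digest("hex");
}
```

Two stacks that differ only in line/column numbers hash identically, and the unique index on (app_id, stack_signature) turns a duplicate insert into an occurrence-count update.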

Development History

v1.0 shipped March 16, 2026 (8 phases):

  • Hub infrastructure, Fastify API, Grafana dashboards, client library, error aggregation, improvement notes, LLM worker, CLI tool

v2.0 (~95% complete, Phases 9-13):

  • Redis + BullMQ migration (from pg-boss), job gateway API, worker cutover to Roughneck, historical trend dashboards, webhook event system

Strengths

  • Zero cloud cost — Local LLM, Hetzner VPS ~€4/month, no SaaS monitoring fees
  • Fire-and-forget client — Apps never block on monitoring; hub outage is invisible
  • Fully git-manageable — Grafana dashboards, alert rules, Ansible playbooks, docker-compose all versioned
  • Comprehensive test coverage — 17 test files with DB migration/cleanup per test
  • Intelligent dedup — Stack normalization handles code changes and source maps
  • Webhook extensibility — HMAC-signed, exponential backoff, dead-letter table
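The HMAC-SHA256 signing behind the webhook system can be sketched with node:crypto. The hex encoding and function names here are assumptions; only the sign-then-constant-time-compare shape is the point:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign a raw webhook body with a per-subscriber secret.
export function sign(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Verify a received signature without leaking timing information.
export function verify(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(sign(secret, body), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so guard first.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

Receivers must verify against the raw request body, not a re-serialized JSON object, or signatures will fail on key-order differences.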

Weaknesses & Risks

  • Phase 10 (Job Gateway) incomplete — GATE-02/03/04 not fully integrated
  • BullMQ ioredis version mismatch — Worked around with an `as any` type cast
  • Admin auth couples to Grafana — If Grafana down, admin endpoints unreachable
  • Staleness check uses JS filter — Fine for 3-5 apps, inefficient at 100+
  • Secret rotation not yet implemented — Callback secret lifecycle management is on the operational backlog
  • Prometheus included but unused — Adds 200-500MB RAM overhead for zero value

Connection to Other Projects

  • Etyde — Monitored app; sends heartbeats, errors, logs
  • GoVejle — Monitored app; sends heartbeats, errors, logs
  • Roughneck — CNC acts as job gateway; Roughneck workers consume jobs and callback with results