andrewlb notes

Roughneck

Tools

Claude Code · TypeScript · Node.js · BullMQ · ioredis · Fastify · pino · Ollama · Turborepo · Vitest · Ansible · WireGuard

What worked

29/29 plans across 6 phases shipped in ~1-2 days. Claude Code built the plugin-based architecture cleanly — manifest-driven dependency injection with auto-discovery at boot. The three resource-class queue split (ollama=1, io=10, cpu=4) gave cross-job priority without per-job-type topology. BullMQ Flows composed GoVejle's translate → enrich → newsletter as parent-child pipelines. HMAC-signed webhook callbacks with exponential backoff retry handled delivery failures. The Ansible playbooks produced both Mac Mini launchd service and VPS Docker deployment from one role set.
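The GoVejle pipeline composition can be sketched as a BullMQ-style flow tree. In BullMQ flows, children complete before their parent, so the tree is rooted at the final step: translate → enrich → newsletter becomes newsletter(enrich(translate)). The names, queue assignments, and data fields here are illustrative, not the project's actual definitions:

```typescript
// Minimal local shape mirroring BullMQ's FlowJob (illustrative, not the library type).
interface FlowNode {
  name: string;
  queueName: string; // one of the three resource-class queues
  data?: Record<string, unknown>;
  children?: FlowNode[];
}

// Children run first, so the newsletter job is the root of the tree.
const govejlePipeline: FlowNode = {
  name: 'govejle-newsletter',
  queueName: 'io',
  children: [
    {
      name: 'govejle-enrichment',
      queueName: 'io',
      children: [
        { name: 'govejle-translation', queueName: 'ollama', data: { eventId: 'evt-123' } },
      ],
    },
  ],
};
```

The real system would hand a tree like this to BullMQ's `FlowProducer.add(...)`; each step lands on the queue implied by its plugin's declared resourceClass.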

What broke

The VPN link between the VPS and the Mac Mini is a single point of failure: one tunnel failure blocks async processing for all three apps. BullMQ jobs can get stuck in 'active' after a reconnection (mitigated with stalledInterval, but still a sharp edge). The Redis eviction policy is deliberately pinned to noeviction, trading memory elasticity for queue durability; that means extended outages are bounded by the available queue buffer, a lesson for any hybrid local/remote setup. Callback delivery failures are handled via the DLQ.
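The stalledInterval mitigation amounts to tuning BullMQ's stall detection in the worker options. The specific values below are illustrative defaults, not the project's actual settings:

```typescript
// Illustrative subset of BullMQ's WorkerOptions relevant to reconnection stalls.
interface StallSettings {
  stalledInterval: number; // ms between checks for jobs stuck in 'active'
  maxStalledCount: number; // times a job may stall before it is failed outright
  lockDuration: number;    // ms a worker holds a job lock before it must renew
}

const stallSettings: StallSettings = {
  stalledInterval: 30_000, // reclaim stuck jobs within ~30s of a tunnel flap
  maxStalledCount: 2,      // after two stalls, fail the job instead of looping
  lockDuration: 60_000,    // renewals happen automatically while the processor runs
};
```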

Roles

I set the consolidation bet — one job platform is better than three because model loading dominates cost when Ollama runs locally, and a unified queue gives cross-app priority. I defined the plugin contract (manifest, resourceClass, dependencies, retry config). Claude Code wrote the core worker engine, all 9 plugins (echo, etyde-session, etyde-set, govejle-* x4, cnc-* x2), the Ansible roles, and the CLI. The global Ollama concurrency=1 constraint was my hard physical constraint (M4 GPU can't run two 32B models without thrashing) that shaped the entire queue topology.
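The plugin contract described above might be expressed as a TypeScript interface along these lines; field names beyond the four listed in the text (manifest, resourceClass, dependencies, retry config) are assumptions:

```typescript
type ResourceClass = 'ollama' | 'io' | 'cpu';

// The manifest every plugin declares; core reads this to route, inject, and retry.
interface PluginManifest {
  name: string;
  version: string;
  resourceClass: ResourceClass; // which of the three queues jobs land on
  dependencies: Array<'ollama' | 'http' | 'logger'>; // injected via PluginContext
  retry: {
    attempts: number;
    backoff: { type: 'exponential'; delay: number }; // base delay in ms
  };
}

// Example: a plugin that needs the local model declares the serialized ollama queue.
const etydeSessionManifest: PluginManifest = {
  name: 'etyde-session',
  version: '1.0.0',
  resourceClass: 'ollama',
  dependencies: ['ollama', 'logger'],
  retry: { attempts: 3, backoff: { type: 'exponential', delay: 5_000 } },
};
```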

Roughneck (Unified Job Execution Platform)

Overview

Roughneck is a unified, plugin-based job execution platform that consolidates async background processing across three applications (Etyde, GoVejle, CNC) into a single deployable system running on an M4 Mac Mini.

**Core purpose:** Single API and queue for all async work (AI inference, data enrichment, scheduled tasks), replacing three independent worker implementations with one extensible architecture.

Key Features

  • Plugin-based architecture with manifest-driven dependency injection
  • Three resource-class queues: ollama (concurrency=1), io (concurrency=10), cpu (concurrency=4)
  • BullMQ-based queueing with priority levels, retries, dead-letter queue
  • HMAC-signed webhook callbacks with exponential backoff retry
  • Health/metrics endpoints with Prometheus export for Grafana
  • Scheduled job support (cron) via BullMQ repeatables
  • Job flow composition (parent-child pipelines) via FlowProducer
  • 9 plugins: echo, etyde-session, etyde-set, govejle-translation, govejle-enrichment, govejle-newsletter, govejle-scheduler, cnc-error-classify, cnc-pattern-detection
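The three-queue split can be captured as a small concurrency table that both the workers and the enqueue API consult; the constant and function names here are illustrative:

```typescript
// One Worker per resource class; concurrency encodes the physical constraints:
// the M4 GPU serializes model inference, while io and cpu work can fan out.
const QUEUE_CONCURRENCY: Record<'ollama' | 'io' | 'cpu', number> = {
  ollama: 1, // one 32B model in GPU memory at a time
  io: 10,    // HTTP calls, webhooks, enrichment fetches
  cpu: 4,    // classification, pattern detection
};

// Priority is per-job within a queue (lower number = higher priority in BullMQ),
// which is what gives cross-job priority without per-job-type queues.
function enqueueOptions(priority: number, attempts = 3) {
  return {
    priority,
    attempts,
    backoff: { type: 'exponential' as const, delay: 5_000 },
  };
}
```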

Architecture

Tech Stack

| Layer | Technology |
| --- | --- |
| Runtime | Node.js 22+, TypeScript 5.7+ |
| Queue | BullMQ v5.71, ioredis v5.10 |
| Server | Fastify v5.8, pino v10.3 |
| LLM | Ollama (Qwen3 32B) |
| Monorepo | Turborepo v2.8 |
| Testing | Vitest v4.1 |
| CLI | Commander v13 |
| Deployment | Ansible + launchd (Mac), Docker (VPS) |
| Network | WireGuard VPN (10.0.0.0/24) |

Structure

packages/
  shared/     # Types, Redis client, Ollama client, logger, constants
  core/       # Worker engine, plugin registry, callbacks, health server
  cli/        # ask, status, jobs, health commands
  plugins/    # 9 plugins (echo, etyde-*, govejle-*, cnc-*)
deploy/
  ansible/    # Playbooks, roles (wireguard, ollama, roughneck, vps)
docs/         # Architecture, plugin guide, ops runbook, cutover plans

Deployment Topology

  • VPS (Hetzner): Docker Compose with Redis, CNC Hub, Roughneck container, Grafana, Prometheus, Loki
  • Mac Mini: Ollama, Roughneck worker (launchd service), Grafana Alloy for log shipping
  • Network: WireGuard VPN connecting VPS to Mac Mini; Redis bound to WireGuard interface only

Plugin Architecture

Plugins declare a manifest (name, version, resourceClass, dependencies, retry config). Core uses this to:

  • Route jobs to correct queue (ollama/io/cpu)
  • Inject dependencies via PluginContext (Ollama, logger, HTTP client)
  • Apply retry and stall detection settings
  • Auto-discover at boot (no core changes needed for new plugins)
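Auto-discovery with fail-fast startup could be sketched as a validation pass over each discovered manifest. The specific checks below are assumptions about what "fail-fast" covers:

```typescript
const RESOURCE_CLASSES = ['ollama', 'io', 'cpu'] as const;

// Throw at boot if a discovered plugin's manifest is malformed, so a bad
// plugin stops deployment instead of failing silently at job time.
function validateManifest(m: Record<string, unknown>): void {
  for (const field of ['name', 'version', 'resourceClass']) {
    if (typeof m[field] !== 'string' || m[field] === '') {
      throw new Error(`plugin manifest missing required field: ${field}`);
    }
  }
  if (!RESOURCE_CLASSES.includes(m.resourceClass as (typeof RESOURCE_CLASSES)[number])) {
    throw new Error(`unknown resourceClass: ${String(m.resourceClass)}`);
  }
}
```

At boot, core would run this over every package found under `packages/plugins/` before registering any workers.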

Development History

100% complete — 29/29 plans across 6 phases, built March 20-21, 2026:

| Phase | Plans | Focus |
| --- | --- | --- |
| 1 | 6 | Core platform (monorepo, worker engine, callback delivery, echo plugin) |
| 2 | 3 | CNC job gateway (Redis service, enqueue API, client library) |
| 3 | 4 | Etyde migration (session + set generation plugins, shadow testing) |
| 4 | 8 | GoVejle migration (translation, enrichment, newsletter, scheduler) |
| 5 | 4 | CNC migration (error classification, pattern detection, dashboards) |
| 6 | 4 | Operations (Ansible, CLI, model monitoring, docs) |

Architectural Decisions

| Decision | Rationale |
| --- | --- |
| Three resource-class queues (not per-job-type) | Cross-job priority, simpler topology |
| CNC Hub as job gateway | Single auth point, no Redis exposure to apps |
| Manifest-driven dependency injection | Explicit resource declarations, fail-fast at startup |
| BullMQ Flows for pipelines | GoVejle's translate → enrich → newsletter as composable steps |
| Global Ollama concurrency=1 | M4 GPU handles one 32B model; two simultaneous = thrashing |
| Hard cutover per app (not parallel) | Simpler rollback, 1-week soak per app |

Strengths

  • Clean separation — Core is pure infrastructure; plugins are pure domain logic
  • Scalable plugin system — New job type = create package + implement interface + auto-discovered
  • Comprehensive failure handling — Retry with exponential backoff, stalled detection, DLQ for both jobs and callbacks
  • Production-ready observability — Health endpoint, Prometheus metrics, structured logging, heartbeat
  • Infrastructure as code — Ansible playbooks for Mac Mini and VPS, auto-deployment

Weaknesses & Risks

  • VPN link is a single point of failure — One tunnel failure blocks all three apps' async processing
  • BullMQ jobs stuck after reconnection — Stale jobs can sit in "active"; mitigated with stalledInterval
  • Redis eviction policy deliberately pinned to noeviction — Trade-off: queue durability over memory elasticity
  • Callback delivery failures — Results could be lost if callback fails; mitigated with DLQ
  • Queue buffer is finite during extended outages — Jobs accumulate if the upstream link is down for long periods
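Signing callbacks so receivers can reject forged or replayed-with-tampering deliveries is a few lines with `node:crypto`; the payload shape and function names here are illustrative:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sender side: sign the raw callback body so receivers can authenticate it.
function signCallback(body: string, secret: string): string {
  return createHmac('sha256', secret).update(body).digest('hex');
}

// Receiver side: recompute and compare in constant time to avoid timing leaks.
function verifyCallback(body: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signCallback(body, secret), 'hex');
  const given = Buffer.from(signature, 'hex');
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

A delivery that fails verification (or exhausts its exponential-backoff retries) is what ultimately lands in the callback DLQ described above.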

Connection to Other Projects

  • Etyde — etyde-session and etyde-set plugins generate AI practice sessions
  • GoVejle — govejle-translation, enrichment, newsletter, scheduler plugins handle event pipeline
  • CNC — cnc-error-classify and cnc-pattern-detection plugins; CNC acts as job gateway