andrewlb notes

Roughneck

Tools

Claude Code, TypeScript, Node.js, BullMQ, ioredis, Fastify, pino, Ollama, Turborepo, Vitest, Ansible, WireGuard

What worked

The consolidation bet paid off: one job platform with a unified queue replaced three independent worker implementations, and the plugin architecture made migration straightforward (adding a job type means creating a package and implementing the interface, after which the core auto-discovers it). A Proxy wrapper on the Ollama client captured LLM metrics for Prometheus transparently, with zero plugin code changes. The model upgrade from qwen3:32b to qwen3.6:27b brought tool calling support and 256K context at a smaller parameter count, a free improvement because the infrastructure was already instrumented.

What broke

The WireGuard VPN link is a genuine single point of failure: when it goes down, all async processing for three apps stops. The watchdog now includes worker health checks, but how long an outage can be absorbed is limited only by the available queue buffer. Prometheus metrics collection hung repeatedly (prom-client's collectDefaultMetrics blocked the event loop) and took multiple commits across two days to resolve. FlowProducer Redis disconnect events crashed the worker process until we added error listeners.
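The FlowProducer crash follows from standard Node.js behavior: an event emitter that emits "error" with no listener attached throws, and an uncaught throw ends the process. The sketch below reproduces that with a plain EventEmitter. `FakeFlowProducer` and `simulateDisconnect` are hypothetical stand-ins, not BullMQ code.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for a Redis-backed producer; not the BullMQ class.
class FakeFlowProducer extends EventEmitter {
  simulateDisconnect(): void {
    this.emit("error", new Error("Redis connection lost"));
  }
}

// Without a listener, emit("error") throws. Left uncaught, that ends the process.
function crashesWithoutListener(): boolean {
  const producer = new FakeFlowProducer();
  try {
    producer.simulateDisconnect();
    return false;
  } catch {
    return true;
  }
}

// With a listener attached, the error is recorded and execution continues.
function survivesWithListener(): string[] {
  const seen: string[] = [];
  const producer = new FakeFlowProducer();
  producer.on("error", (err: Error) => seen.push(err.message));
  producer.simulateDisconnect();
  return seen;
}
```

The practical takeaway is to attach an "error" listener to every long-lived emitter at construction time, before the first connection attempt.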

Roles

I set the consolidation bet — one platform beats three because model loading dominates cost when Ollama runs locally, and a unified queue gives cross-app priority. I defined the plugin contract and the global Ollama concurrency=1 constraint (M4 GPU can't run two large models without thrashing). Claude Code wrote the core worker engine, all plugins, Ansible roles, and the CLI.

Roughneck (Unified Job Execution Platform)

Overview

Roughneck is a plugin-based job execution platform that consolidates async background processing across three applications (Etyde, GoVejle, CNC) into a single deployable system running on an M4 Mac Mini.

Core purpose: Single API and queue for all async work (AI inference, data enrichment, scheduled tasks), replacing three independent worker implementations with one extensible architecture.

What It Does

  • Plugin-based job execution — 10 plugins across four apps, auto-discovered at boot via manifest declarations
  • Three resource-class queues — ollama (concurrency=1), io (concurrency=10), cpu (concurrency=4) for cross-job priority
  • BullMQ queueing with priority levels, retries, stalled detection, dead-letter queue, and job flow composition (parent-child pipelines)
  • HMAC-signed webhook callbacks with exponential backoff retry
  • LLM observability — Prometheus metrics (tokens, tokens/sec, load duration, response time) with Grafana dashboards
  • Scheduled jobs via BullMQ repeatables (cron)
  • Infrastructure as code — Ansible playbooks for both Mac Mini and VPS deployment
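As an illustration of the webhook callback item above, here is a minimal sketch of HMAC signing and an exponential backoff schedule using only Node's built-in crypto module. The hash algorithm (SHA-256), the hex encoding, the function names, and the delay values are all assumptions; the notes do not specify them.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sender side: sign the raw request body with a shared secret.
// SHA-256 and hex encoding are assumptions for this sketch.
function signPayload(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Receiver side: recompute the signature and compare in constant time.
function verifySignature(body: string, secret: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(body, secret), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... capped at maxMs.
// `attempt` is zero-based; the defaults are illustrative, not Roughneck's values.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

The receiver should verify against the raw body bytes before parsing JSON, since re-serializing a parsed object can change the bytes and invalidate the signature.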

How It Fits Together

The codebase is a Turborepo monorepo with shared types, a core worker engine, a CLI, and 10 plugin packages. The VPS (Hetzner) runs Docker Compose with Redis, Grafana, Prometheus, and the job gateway. The Mac Mini runs Ollama and the Roughneck worker as a launchd service. A WireGuard VPN connects the two hosts, and Redis is bound to the WireGuard interface only. Each plugin declares a manifest (name, version, resourceClass, dependencies, retry config); the core uses it to route jobs to the correct queue, inject dependencies, and apply retry settings.
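A rough sketch of what manifest-driven routing could look like. The manifest field names come from the description above, but the exact types, the `PluginRegistry` class, and the shape of the retry config are assumptions made for illustration, not Roughneck's actual contract.

```typescript
type ResourceClass = "ollama" | "io" | "cpu";

// Field names follow the prose; the exact shape is an assumption.
interface PluginManifest {
  name: string;
  version: string;
  resourceClass: ResourceClass;
  dependencies: string[];
  retry: { attempts: number; backoffMs: number };
}

// Concurrency per resource class, as listed under "What It Does".
const QUEUE_CONCURRENCY: Record<ResourceClass, number> = { ollama: 1, io: 10, cpu: 4 };

// Hypothetical registry: plugins register once at boot, jobs are routed by name.
class PluginRegistry {
  private readonly plugins = new Map<string, PluginManifest>();

  register(manifest: PluginManifest): void {
    if (this.plugins.has(manifest.name)) {
      throw new Error(`duplicate plugin: ${manifest.name}`);
    }
    this.plugins.set(manifest.name, manifest);
  }

  // Resolve a job type to its queue, that queue's concurrency, and retry attempts.
  route(jobType: string): { queue: ResourceClass; concurrency: number; attempts: number } {
    const manifest = this.plugins.get(jobType);
    if (!manifest) throw new Error(`unknown job type: ${jobType}`);
    return {
      queue: manifest.resourceClass,
      concurrency: QUEUE_CONCURRENCY[manifest.resourceClass],
      attempts: manifest.retry.attempts,
    };
  }
}
```

Because the queue is derived from the manifest, adding a plugin never requires touching the routing code, which is the property the plugin contract is meant to protect.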

Architecture Decisions

  • Three resource-class queues, not per-job-type — Cross-job priority and simpler topology. An Ollama inference job from any app competes fairly in one queue.
  • Global Ollama concurrency=1 — Hard physical constraint: the M4 GPU can run one large model at a time; two simultaneous requests cause thrashing. This single constraint shaped the entire queue topology.
  • CNC Hub as job gateway — Single auth point so apps never touch Redis directly. Adds a hop but centralizes access control.
  • Manifest-driven plugin system — Plugins declare their resource needs; core auto-discovers at boot. No core changes needed for new plugins. The tradeoff is that the manifest schema is a contract that's hard to evolve.
  • Hard cutover per app — Migrated apps one at a time with a one-week soak period each, rather than running old and new workers in parallel. Simpler rollback at the cost of slower migration.
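To make the first decision concrete, here is a small in-memory sketch of how one shared queue orders jobs from different apps by priority. It is not BullMQ code. `SharedQueue`, the job type names, and the convention that a lower number means more urgent are assumptions for illustration.

```typescript
interface QueuedJob {
  app: string;
  type: string;
  priority: number; // assumption: lower number = more urgent
  seq: number; // insertion order, used as the tie-breaker
}

// Hypothetical single queue shared by every app in one resource class.
class SharedQueue {
  private jobs: QueuedJob[] = [];
  private nextSeq = 0;

  add(app: string, type: string, priority: number): void {
    this.jobs.push({ app, type, priority, seq: this.nextSeq++ });
  }

  // Take the most urgent job; equal priorities are served first-in, first-out.
  take(): QueuedJob | undefined {
    this.jobs.sort((a, b) => a.priority - b.priority || a.seq - b.seq);
    return this.jobs.shift();
  }
}

const queue = new SharedQueue();
queue.add("Etyde", "feedback", 5);
queue.add("GoVejle", "enrich", 1);
queue.add("CNC", "classify", 5);
const order = [queue.take(), queue.take(), queue.take()].map((job) => job?.app);
// order is ["GoVejle", "Etyde", "CNC"]
```

With per-job-type queues, the urgent GoVejle job would have no way to jump ahead of work queued by other apps, because each queue would be drained independently.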

Iteration and Lessons

The initial v1.0 shipped quickly, but the real work was in post-launch hardening. Three themes emerged:

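Infrastructure fragility: The WireGuard link went down and took all async processing with it. The watchdog evolved from a simple ping to include worker health checks, plus Homebrew PATH fixes for launchd (a macOS-specific gotcha: launchd services do not inherit the interactive shell's PATH, so Homebrew-installed binaries are not found unless the PATH is set explicitly). This is the core tension of self-hosting: you own the uptime.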

Observability surprises: Adding Prometheus metrics seemed straightforward, but collectDefaultMetrics blocked the event loop and hung the /metrics endpoint. It took six commits over two days to resolve — disabling default collection, adding timeouts, switching to individual metric collection. The lesson: observability tooling can degrade the system it's observing.
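One of the fixes mentioned is adding timeouts. A minimal sketch of that idea follows, with hypothetical names (`withTimeout`, `renderMetrics`) and an assumed 2-second default. One caveat applies: a timer only fires when the event loop is free, so this guards against slow asynchronous collection, not against a collector that blocks synchronously.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical /metrics handler body: respond 503 instead of hanging.
async function renderMetrics(
  collect: () => Promise<string>,
  ms = 2000,
): Promise<{ status: number; body: string }> {
  try {
    return { status: 200, body: await withTimeout(collect(), ms, "metrics collection") };
  } catch (err) {
    return { status: 503, body: (err as Error).message };
  }
}
```

A scrape that fails fast with a 503 shows up as a gap in the dashboard, which is easier to diagnose than an endpoint that never answers.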

Model upgrades as infrastructure wins: Upgrading Ollama from qwen3:32b to qwen3.6:27b was painless because the Proxy wrapper pattern isolated the model from plugin code. Tool calling support and 256K context came for free. This validated the abstraction layer.
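The wrapper pattern can be sketched with a standard JavaScript Proxy that times every method call and reports a metric, so callers need no changes. The `instrument` function, the `CallMetric` shape, and the fake client below are hypothetical. The `eval_count` field is modeled on the statistics Ollama returns with a response, but treat the exact shape as an assumption.

```typescript
interface CallMetric {
  method: string;
  durationMs: number;
  tokens: number;
}

// Wrap any client object; function-valued properties are timed and reported.
// Note: wrapped methods always return a promise, even if the original was sync.
function instrument<T extends object>(client: T, record: (m: CallMetric) => void): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value !== "function") return value;
      return async (...args: unknown[]) => {
        const start = Date.now();
        const result = await value.apply(target, args);
        record({
          method: String(prop),
          durationMs: Date.now() - start,
          tokens: result?.eval_count ?? 0, // assumed response field
        });
        return result;
      };
    },
  });
}

// Hypothetical client used only to demonstrate the wrapper.
const fakeClient = {
  model: "example-model",
  async chat(prompt: string) {
    return { text: prompt.toUpperCase(), eval_count: 7 };
  },
};
```

A plugin would call `instrument(fakeClient, recordFn).chat("hi")` exactly as it would call the unwrapped client. In a real setup the `record` callback would feed Prometheus histograms and counters.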

The newsletter segment isolation fix was a good example of production learning — one bad segment was blocking the entire newsletter pipeline. The fix (isolate segments so failures are independent) only surfaced from real production traffic.
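The isolation idea can be sketched with `Promise.allSettled`, which waits for every segment and reports failures individually instead of rejecting on the first one. The `processSegments` function and the segment names are hypothetical; the actual fix in the pipeline may look different.

```typescript
interface SegmentResult {
  segment: string;
  ok: boolean;
  error?: string;
}

// Run every segment independently; one failure does not stop the others.
async function processSegments(
  segments: string[],
  handle: (segment: string) => Promise<void>,
): Promise<SegmentResult[]> {
  // The async arrow turns synchronous throws inside `handle` into rejections too.
  const settled = await Promise.allSettled(segments.map(async (s) => handle(s)));
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? { segment: segments[i], ok: true }
      : { segment: segments[i], ok: false, error: String(outcome.reason) },
  );
}
```

The caller can then send the segments that succeeded and retry or dead-letter only the ones marked `ok: false`.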

Weaknesses & Open Questions

  • WireGuard VPN is a single point of failure — One tunnel failure blocks all async processing for three apps
  • Queue buffer is finite during extended outages — Jobs accumulate with no backpressure mechanism
  • BullMQ stale jobs after reconnection — Active jobs can get stuck; mitigated with stalledInterval but not eliminated
  • Redis pinned to noeviction — Chose queue durability over memory elasticity; a long outage could exhaust memory
  • Callback delivery failures — Results can be lost if the callback target is down; mitigated with DLQ but not guaranteed
  • Open question: Should OpenClaw route through Roughneck for all LLM queries, or use Ollama directly for simple ones? The planned hybrid approach (direct for simple, Roughneck for batch) adds routing complexity.

Ecosystem Role

Roughneck is the shared async processing layer for Etyde (practice sessions), GoVejle (event pipeline), CNC (error classification), and Edytor (writing analysis). OpenClaw integration is planned for batch LLM workloads. The Mac Mini + WireGuard + VPS topology is shared infrastructure that other projects depend on.