Jano • a[7]t

I have one Mac Studio behind a small herd of agents that all want to talk to a language model, often to different models at the same time.

Giving each of them a direct line to llama-server puts the box into a permanent storm of loading and unloading weights, with the fans never quiet. Pre-loading a model per process avoids that and then runs out of RAM. Jano is the small thing in front that turns ten of those requests into one model swap.

The shape of the problem

Most local-LLM serving stories assume one client. Production setups handle many clients with autoscaling and a fleet of GPUs. I have an indie-engineer setup: one box, one developer, lots of small agents that fire in bursts.

The bursts are the interesting part. A scheduler wakes up to draft a tweet and makes one request. pi reviews a diff and makes maybe three. A build kicks off and fires fifteen. The load is spiky, often clustered and homogeneous (same model, ten times in two seconds).

Jano exploits that.

How it works

Jano is an OpenAI-compatible HTTP server listening on a configurable port. Anything that already speaks the OpenAI Chat Completions API can point at it and not notice the difference.

When a call comes in, Jano picks the model from the request and drops the task into a single internal queue. The dispatcher reads that queue greedily. If the model currently loaded still has tasks waiting, it serves those first instead of swapping. A different model only gets activated when nothing is left for the one that is hot. Activation means calling out to a swap script you provide, polling the backend’s /health until it is ready, then streaming the response straight back to the client. The model stays loaded until the queue forces a switch.
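A minimal sketch of that greedy pick, to make the policy concrete. The Task shape, the countSwaps helper, and the six-request sequence are made up for illustration and are not copied from src/queue.ts or its unit tests; the sketch also assumes the coder model is already hot when the burst arrives.

```typescript
// Hypothetical task shape; the real one in src/queue.ts may carry more fields.
interface Task {
  id: number;
  model: string;
}

// Greedy pick: the oldest task for the hot model if there is one,
// otherwise the oldest task overall (which will force a swap).
function takeNext(queue: Task[], hotModel: string | null): Task | undefined {
  const i = queue.findIndex((t) => t.model === hotModel);
  return queue.splice(i >= 0 ? i : 0, 1)[0];
}

// Drain a copy of the queue and count how many model activations a policy needs.
function countSwaps(tasks: Task[], startModel: string | null, greedy: boolean): number {
  const queue = [...tasks];
  let hot = startModel;
  let swaps = 0;
  while (queue.length > 0) {
    const next = greedy ? takeNext(queue, hot)! : queue.shift()!;
    if (next.model !== hot) {
      swaps += 1;
      hot = next.model;
    }
  }
  return swaps;
}

// Illustrative interleaved burst, with "coder" assumed to be loaded already.
const burst: Task[] = ["coder", "chat", "coder", "coder", "chat", "chat"].map(
  (model, id) => ({ id, model }),
);

const fifoSwaps = countSwaps(burst, "coder", false);
const greedySwaps = countSwaps(burst, "coder", true);
console.log({ fifoSwaps, greedySwaps });
```

On this particular sequence strict FIFO activates a model three times and the greedy pick once. The price is that requests for the cold model are served out of arrival order.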

The win is that a burst of N calls to the same model costs one model swap, not N. On my box that’s the difference between a noisy three-minute storm and a single sub-twenty-second load followed by ten fast streamed responses.

The other win: a chat call can go to a small fast model and a coding call can go to a slow heavy one from the same client, because the router demuxes by model field. The client never notices.

Animation: the same canonical six-request scenario the unit tests cover, end to end. Strict FIFO would force three model swaps here. Greedy collapses them to one.

What’s in scope

OpenAI-compatible Chat Completions for both llama.cpp and MLX backends. Streaming and tool-call payloads pass straight through, so the caller’s experience is identical to talking to the backend directly. The actual loading work is handed to a swap script you provide, which means “load this model” can mean launchctl, systemctl --user, Docker, or anything you want. Swap is idempotent and gated on a health poll, so half-loaded states never get traffic. A consecutive-swap-failure budget fails fast after a few misses, so a dead backend does not eat a swap timeout per request. A /health endpoint sits there for other infra to poll, which the live demos this site eventually hosts will rely on.
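The swap contract described above (idempotent, gated on a health poll, with a consecutive-failure budget) could look roughly like this. Everything here is an assumption for illustration: createSwapper, SwapDeps, SwapOptions, and the option names are hypothetical and not taken from src/swap.ts. The script runner, health check, and sleep are injected so the sketch runs without a real backend.

```typescript
// Hypothetical dependencies. In a real setup runSwapScript would spawn your
// swap script and isHealthy would GET the backend's /health.
interface SwapDeps {
  runSwapScript: (model: string) => Promise<void>;
  isHealthy: () => Promise<boolean>;
  sleep: (ms: number) => Promise<void>;
}

// Hypothetical tuning knobs; the names and values are assumptions.
interface SwapOptions {
  maxFailures: number;
  pollIntervalMs: number;
  timeoutMs: number;
}

function createSwapper(deps: SwapDeps, opts: SwapOptions) {
  let current: string | null = null;
  let consecutiveFailures = 0;

  async function waitForHealth(): Promise<void> {
    // Counts only time spent sleeping between polls, which is close enough for a sketch.
    for (let waited = 0; waited < opts.timeoutMs; waited += opts.pollIntervalMs) {
      if (await deps.isHealthy()) return;
      await deps.sleep(opts.pollIntervalMs);
    }
    throw new Error("backend did not become healthy in time");
  }

  async function activate(model: string): Promise<void> {
    if (current === model) return; // idempotent: the model is already hot
    if (consecutiveFailures >= opts.maxFailures) {
      // Fail fast instead of paying a full swap timeout on every request.
      throw new Error("swap failure budget exhausted");
    }
    try {
      await deps.runSwapScript(model);
      await waitForHealth();
      current = model;
      consecutiveFailures = 0;
    } catch (err) {
      current = null; // unknown state: never route traffic to a half-loaded model
      consecutiveFailures += 1;
      throw err;
    }
  }

  return { activate, currentModel: () => current };
}
```

This sketch never resets an exhausted budget. A real router needs some recovery path (a cool-down timer, a manual reset), and which one Jano uses is not something this example claims to know.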

What it reports

A backend knows its own last response time. It does not know how deep the queue is, how often the box is thrashing between models, or what every other caller has been getting lately. Jano sits at the one place where all of that is visible, because it owns the queue and invokes every swap. As of v0.2.0 it reports it.

Three endpoints carry it, and every change is additive to the existing /health and /status:

GET /status grew. On top of queue depth and current model it now carries inFlight and oldestWaitingMs (the real “am I backed up?” signal), swap economics (currentModelLoadedAt, lastSwapDurationMs, swapsLastHour, which is the thrash detector), rolling tokens-per-second across every caller, and cumulative counters since start.
GET /metrics speaks Prometheus, so a swap-duration histogram and the token counters drop straight into Grafana.
GET /usage?limit=N is a ring of the last N requests with their token counts and timings, for answering “what did my requests actually cost.”
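A ring of recent requests is a small structure. Here is one way to sketch it, assuming a fixed capacity and a limit applied at read time. createRing and its method names are hypothetical, and the numbers stand in for whatever per-request record the real endpoint returns.

```typescript
// Fixed-capacity ring: once full, each push overwrites the oldest entry.
function createRing<T>(capacity: number) {
  const buf: T[] = [];
  let next = 0; // slot the next push writes to once the ring is full

  return {
    push(item: T): void {
      if (buf.length < capacity) buf.push(item);
      else buf[next] = item;
      next = (next + 1) % capacity;
    },
    // At most `limit` entries, oldest first, newest last.
    recent(limit: number): T[] {
      const ordered =
        buf.length < capacity ? buf.slice() : [...buf.slice(next), ...buf.slice(0, next)];
      return limit > 0 ? ordered.slice(-limit) : [];
    },
  };
}

const ring = createRing<number>(3);
[1, 2, 3, 4].forEach((n) => ring.push(n));
console.log(ring.recent(2));
```

Memory stays bounded however long the router runs, which is the point of a ring over an ever-growing log.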

Token counts and tokens-per-second are read from whatever the backend already returns and normalized across shapes (llama.cpp timings, OpenAI usage, Ollama’s nanosecond fields). When a backend reports nothing usable, the number is null rather than a guess.
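As a sketch of what that normalization can look like: the field names below are the ones the backends emit as far as I know (timings.predicted_n and timings.predicted_per_second from llama.cpp, usage.completion_tokens from OpenAI-style responses, eval_count and eval_duration in nanoseconds from Ollama). The Usage type and normalizeUsage are hypothetical names, not Jano's.

```typescript
interface Usage {
  promptTokens: number | null;
  completionTokens: number | null;
  tokensPerSecond: number | null;
}

// Accept only finite numbers; anything else becomes null rather than a guess.
const num = (v: unknown): number | null =>
  typeof v === "number" && Number.isFinite(v) ? v : null;

function normalizeUsage(body: Record<string, any>): Usage {
  // llama.cpp: checked first, since its responses can carry both timings and usage.
  const t = body.timings;
  if (t && typeof t === "object") {
    return {
      promptTokens: num(t.prompt_n),
      completionTokens: num(t.predicted_n),
      tokensPerSecond: num(t.predicted_per_second),
    };
  }
  // Ollama native shape: durations are reported in nanoseconds.
  const evalCount = num(body.eval_count);
  if (evalCount !== null) {
    const evalNs = num(body.eval_duration);
    return {
      promptTokens: num(body.prompt_eval_count),
      completionTokens: evalCount,
      tokensPerSecond: evalNs !== null && evalNs > 0 ? evalCount / (evalNs / 1e9) : null,
    };
  }
  // OpenAI usage: token counts only, no timing, so the rate stays null in this sketch.
  const u = body.usage;
  if (u && typeof u === "object") {
    return {
      promptTokens: num(u.prompt_tokens),
      completionTokens: num(u.completion_tokens),
      tokensPerSecond: null,
    };
  }
  return { promptTokens: null, completionTokens: null, tokensPerSecond: null };
}

const fromLlama = normalizeUsage({ timings: { prompt_n: 12, predicted_n: 40, predicted_per_second: 20 } });
const fromOllama = normalizeUsage({ prompt_eval_count: 8, eval_count: 50, eval_duration: 2_000_000_000 });
const fromNothing = normalizeUsage({});
console.log(fromLlama, fromOllama, fromNothing);
```

For streamed responses these fields typically arrive in the final chunk, so a router would have to inspect the stream as it passes through. This sketch only covers the parsed body.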

The line I will not cross is hardware. GPU temperature, fan speed, power draw: a router cannot see those and should not pretend to. They belong to a host-side helper. Jano reports what it owns, which is the inference path, and nothing it would have to invent.

What’s deliberately out of scope

Multi-GPU scheduling: one box, one accelerator. Tenancy and quotas: single user. Anything resembling a “marketplace” or “broker”.

This is intentional. The router serves my infrastructure, where I am the only writer. If you have a similar setup, Jano is a few hundred lines you can borrow. If you have a fleet, you want vLLM or Triton, not this.

A reasonable question at this point is “why not just use Ollama?” Ollama already serves multiple models from a single Go binary with OpenAI-compatible endpoints and on-demand loading, and if you have no backend preference yet, that is the right answer. Jano exists for the case where you have already committed to llama-server or mlx_lm.server or vLLM directly, and you want to keep that backend rather than wrap it in another runtime. The router is a thin layer in front of whatever you already chose, small enough to read in one sitting, and the swap policy is three lines you can change.

What I’d do differently

I would have started with the greedy queue. The first cut was strict FIFO and the result was exactly the thrashing I built Jano to avoid. The actual batching policy is three lines (takeNext in src/queue.ts) and it should have been there from commit one, not retrofitted later.

The greedy policy has a pathological case I have not hit yet but will. If one model gets a steady inbound stream and the other gets the occasional request, the occasional one can starve. Concretely: fifty code requests arriving faster than they can be served, one chat request sitting in the queue, and the chat request waits until all fifty code requests drain. The fix is a max-batch counter that forces a swap after N consecutive same-model serves. I will add it when I see the pattern for real, not before.
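For reference, a max-batch guard of that kind could be sketched as below. This is not in Jano today; makePicker, the streak counter, and the example queue are all hypothetical, and the caller is assumed to treat the picked task's model as the new hot model.

```typescript
// Hypothetical task shape, mirroring the greedy picker.
interface Task {
  id: number;
  model: string;
}

// Greedy picker with a starvation guard: after `maxBatch` consecutive serves
// of the same model, the oldest task for any other model goes first.
function makePicker(maxBatch: number) {
  let streak = 0;
  let lastModel: string | null = null;

  return function takeNext(queue: Task[], hotModel: string | null): Task | undefined {
    if (queue.length === 0) return undefined;
    const hotIdx = queue.findIndex((t) => t.model === hotModel);
    const otherIdx = queue.findIndex((t) => t.model !== hotModel);
    // Stay on the hot model while under budget, or when nothing else is waiting.
    const stay = hotIdx >= 0 && (streak < maxBatch || otherIdx < 0);
    const task = queue.splice(stay ? hotIdx : otherIdx, 1)[0]!;
    streak = task.model === lastModel ? streak + 1 : 1;
    lastModel = task.model;
    return task;
  };
}

// One chat request stuck behind a run of coder requests, with "coder" hot.
const queue: Task[] = ["coder", "coder", "chat", "coder", "coder", "coder"].map(
  (model, i) => ({ id: i + 1, model }),
);
const pick = makePicker(3);
const served: number[] = [];
let hot: string | null = "coder";
while (queue.length > 0) {
  const task = pick(queue, hot)!;
  hot = task.model;
  served.push(task.id);
}
console.log(served);
```

With a budget of three, the chat request (id 3) is served fourth instead of last, at the cost of two extra swaps in this example. That is the trade the counter makes: bounded wait for the minority model in exchange for some of the thrash the greedy policy was avoiding.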

If I were starting today I would look harder at llama.cpp’s --models-preset flag, which lets one llama-server host both models with internal LRU swap and a single front-door URL. The brew formula did not ship that flag at the time I wrote Jano, and maintaining a source build was a cost I did not want to take on. Jano is the bridging solution. It is also a few hundred lines you can audit in an evening, which is its own kind of value.

Try it

Repo: https://github.com/a7t-ai/jano.

Most of the value is in two files: the greedy picker in src/queue.ts and the model-activation contract in src/swap.ts for whichever backend you point it at. If you want the same setup on your own box, the README has the launchd plist and the env layout. brain-llm coder|chat switches which model is hot, and Jano is the thing in front of that.

What I’d do for you

If you’ve got a workstation or a developer cluster behind a single model server and you’re bumping up against thrashing, the same router fits in front of your llama-server (or vLLM, or Ollama) and gives you that batch-once property without any changes to the model server itself. Tell me what your traffic looks like and we’ll scope the deployment.