The Router Sees Everything: Adding Telemetry to Jano
A model server knows its own last response time. It cannot tell you how deep the queue is, how often the box is thrashing between models, or what every other caller has been getting lately. The router can, because it sits above the backends and owns every swap. So I taught it to.
Jano is the small router I run in front of my local model server. A burst of mixed requests used to cost one model swap each; Jano reorders them so the whole burst costs one swap. Last week I wrote about one of the models it swaps between. This week I gave the router a memory.
The reason is a gap I kept hitting. When I wanted to know how the box was actually doing, I had nothing to ask. The model server can tell the one client it just answered how many tokens that response took. It cannot tell me how many requests are waiting, whether it just spent two minutes thrashing between models, or what the throughput has looked like across every caller in the last minute. That information exists. It is just not anywhere I can reach.
Why the router is the right place to look
Ollama, the default local stack for good reason, exposes per-response timing and which model is currently loaded. That is genuinely useful, and it is also the whole list. There is no endpoint for queue depth, no record of how often it swapped, no aggregate of what every caller has been getting. Not because the authors forgot, but because a single model server is the wrong altitude to see those things. It answers one request at a time and has no concept of the others.
A router does. Jano sits above the backends. Every request passes through it, every response streams back through it, and it is the thing that decides when to swap models and calls the script that does the swap. That position is the entire point. Queue depth is a number it already holds. The swap economics are events it already triggers. The aggregate throughput is the stream it already forwards. None of it needs a sidecar tailing logs or a second process scraping anything. It is counters and small ring buffers in code paths that already run.
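To make "counters in code paths that already run" concrete, here is a minimal sketch in Python. It is not Jano's code: the class, the hook names, and the `oldestWaitMs` key are all hypothetical, and it assumes the router can call three hooks at the points where it already enqueues, dispatches, and finishes a request.

```python
import time


class QueueCounters:
    """Hypothetical in-process counters; none of these names come from Jano."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._waiting = {}   # request id -> time it entered the queue
        self.in_flight = 0

    def on_enqueue(self, req_id):
        # Call where the router accepts a request into its queue.
        self._waiting[req_id] = self._clock()

    def on_dispatch(self, req_id):
        # Call where a request is handed to a backend. Keyed by id rather
        # than popped from the front, because a reordering router does not
        # dispatch in arrival order.
        self._waiting.pop(req_id, None)
        self.in_flight += 1

    def on_complete(self):
        # Call once the response has finished streaming back.
        self.in_flight -= 1

    def snapshot(self):
        now = self._clock()
        oldest = min(self._waiting.values(), default=None)
        return {
            "queueDepth": len(self._waiting),
            "inFlight": self.in_flight,
            "oldestWaitMs": None if oldest is None else (now - oldest) * 1000,
        }
```

Nothing here runs on a timer or reads a log. Each hook is a dictionary write or an integer bump on a path the router was already executing, and `snapshot()` only does work when someone asks.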
So the wedge is simple: the router can natively report exactly the things a backend structurally cannot.
What it reports now
Three endpoints, all additive on top of the /health and /status that were already there.
GET /status grew the most. It already knew the current model and the queue depth. Now it also carries how many requests are in flight, how long the oldest waiting one has sat there, when the current model was loaded and how long that last swap took, how many swaps happened in the last hour (the thrash detector), the rolling tokens per second across every caller, and cumulative counters since start. A trimmed view from my actual box, moments after a real request:
{
  "currentModel": "qwen-chat",
  "queueDepth": 0,
  "inFlight": 0,
  "lastSwapDurationMs": null,
  "swapsLastHour": 0,
  "recentGenTokS": 66.3,
  "requestsServedTotal": 1,
  "tokensGeneratedTotal": 48,
  "backendHealth": { "qwen-chat": "up" }
}
That 66.3 is the real generation speed of the model I wrote about last week, reported by the router, not measured by me with a stopwatch.
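Two of those fields, swapsLastHour and recentGenTokS, are rolling values rather than plain counters. One way to keep them is a time-bounded window, sketched below in Python. The class and function names are hypothetical, the 60-second throughput window is taken from the "last minute" mentioned earlier, and summing tokens over summed generation time is my assumption about how to aggregate across callers, not a description of what Jano does.

```python
import time
from collections import deque


class RollingWindow:
    """Keeps (timestamp, value) pairs no older than `span_s` seconds."""

    def __init__(self, span_s, clock=time.monotonic):
        self._span = span_s
        self._clock = clock
        self._items = deque()

    def _prune(self):
        cutoff = self._clock() - self._span
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def add(self, value):
        self._prune()  # prune on write too, so the deque cannot grow unbounded
        self._items.append((self._clock(), value))

    def values(self):
        self._prune()
        return [value for _, value in self._items]


def swaps_last_hour(swaps):
    # `swaps` holds one entry per swap (for example its duration in ms).
    return len(swaps.values())


def recent_gen_tok_s(samples):
    # `samples` holds (tokens_generated, generation_seconds) per response.
    pairs = samples.values()
    total_s = sum(seconds for _, seconds in pairs)
    if total_s <= 0:
        return None  # nothing observed in the window: report null, not 0
    return sum(tokens for tokens, _ in pairs) / total_s
```

Returning `None` for an empty window keeps "no data" distinct from "zero tokens per second", which matters for the thrash case: a box that spent the whole minute swapping has no throughput to report.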
GET /metrics speaks the Prometheus text format, so the counters and a swap-duration histogram drop straight into a Prometheus scrape, and from there into Grafana, with no translation. The homelab crowd already scrapes this format; now Jano is one more thing it can scrape.
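For readers who have not written the text format by hand, this Python sketch renders a set of counters and one histogram. The metric names and bucket bounds are invented for illustration, and HELP lines are omitted for brevity. The part worth noticing is that histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound, and the `+Inf` bucket equals the total count.

```python
import bisect


class Histogram:
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound >= value, matching "le" semantics.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


def render(counters, hist_name, hist):
    """Render counters and one histogram in the Prometheus text format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    lines.append(f"# TYPE {hist_name} histogram")
    running = 0
    for bound, count in zip(hist.bounds, hist.counts):
        running += count  # buckets are cumulative
        lines.append(f'{hist_name}_bucket{{le="{bound}"}} {running}')
    running += hist.counts[-1]
    lines.append(f'{hist_name}_bucket{{le="+Inf"}} {running}')
    lines.append(f"{hist_name}_sum {hist.total}")
    lines.append(f"{hist_name}_count {running}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names; swap durations observed in seconds.
swap_seconds = Histogram([1, 5, 30])
for duration in (0.5, 5, 12, 90):
    swap_seconds.observe(duration)
text = render(
    {"jano_requests_served_total": 1},
    "jano_swap_duration_seconds",
    swap_seconds,
)
```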
GET /usage?limit=N is a ring of the last N requests, each with its token counts and timings. It answers the question every local-LLM tinkerer eventually asks: what did my requests actually cost in compute?
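A ring like this is only a few lines in most languages. The Python sketch below assumes a fixed capacity and a newest-first ordering; the record fields and the default capacity are hypothetical, not Jano's actual schema.

```python
from collections import deque
from itertools import islice


class UsageRing:
    """Fixed-size ring of recent requests; old entries fall off the far end."""

    def __init__(self, capacity=256):
        self._ring = deque(maxlen=capacity)

    def record(self, model, prompt_tokens, completion_tokens, duration_ms):
        self._ring.append({
            "model": model,
            "promptTokens": prompt_tokens,
            "completionTokens": completion_tokens,
            "durationMs": duration_ms,
        })

    def last(self, limit):
        # Newest first; a limit larger than the ring just returns everything.
        return list(islice(reversed(self._ring), max(0, limit)))
```

Because `deque(maxlen=...)` discards from the opposite end on append, memory stays bounded no matter how long the router runs, and `?limit=N` can never return more than the capacity.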
The one part that took thought
The values are easy to report once you have them. Getting them without breaking anything is where the care went.
Jano’s defining promise is that it is transparent. It forwards your request body untouched and streams the response straight back, so talking to the router is identical to talking to the backend directly. Telemetry must not change that. I cannot buffer the whole response to count tokens, because that would break streaming and pin memory on long generations.
The answer is to tap the stream as it passes rather than hold it. For a streamed response the token totals live in the final chunk, so a small tail window catches them while the bytes fly past to the client unchanged. For a one-shot reply the body is small enough to read whole. Then the counts get pulled from whatever shape the backend speaks: llama.cpp reports a timings block with tokens per second already computed, the OpenAI format reports a usage object, Ollama reports raw durations in nanoseconds. Jano normalizes all three into one number. When a backend reports nothing usable, the field is null. The router never invents a throughput it did not actually observe.
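Here is a rough Python sketch of both halves: the tail tap and the normalization. The function names and the 4096-byte window are mine. The backend field names are the ones those APIs document: `timings.predicted_per_second` for llama.cpp, `eval_count` and `eval_duration` for Ollama, and `usage.completion_tokens` for the OpenAI format. The OpenAI usage object carries counts but no timing, so the sketch assumes the router supplies its own measured generation time for that case and returns null without it.

```python
import json


def tap(chunks, tail, keep=4096):
    """Yield every chunk unchanged; keep only the last `keep` bytes in `tail`."""
    for chunk in chunks:
        tail.extend(chunk)
        del tail[:-keep]  # no-op until the window is full
        yield chunk


def last_json(tail):
    """Return the last parseable JSON object in the tail window, if any."""
    for line in reversed(bytes(tail).splitlines()):
        line = line.strip()
        if line.startswith(b"data:"):  # SSE framing used by OpenAI-style streams
            line = line[5:].strip()
        try:
            obj = json.loads(line)
        except ValueError:  # truncated first line, blank line, or "[DONE]"
            continue
        if isinstance(obj, dict):
            return obj
    return None


def gen_tok_s(payload, measured_s=None):
    """Normalize three backend shapes to tokens per second, or None."""
    timings = payload.get("timings")
    if isinstance(timings, dict) and timings.get("predicted_per_second"):
        return float(timings["predicted_per_second"])        # llama.cpp
    if payload.get("eval_count") and payload.get("eval_duration"):
        seconds = payload["eval_duration"] / 1e9              # Ollama: nanoseconds
        return payload["eval_count"] / seconds
    usage = payload.get("usage")
    if isinstance(usage, dict) and usage.get("completion_tokens") and measured_s:
        return usage["completion_tokens"] / measured_s        # OpenAI: counts only
    return None  # nothing usable: report null rather than guess
```

In use, the router would iterate `tap(backend_chunks, tail)` while writing each chunk to the client, then call `gen_tok_s(last_json(tail))` once the stream ends. The client sees identical bytes, and the router never holds more than `keep` bytes of any response.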
The line I will not cross
One thing I deliberately left out: hardware. GPU temperature, fan speed, power draw: a router cannot see any of that, and a number you cannot measure is worse than one you do not report. Those belong to a host-side helper; Jano reports the inference path it owns, and nothing it would have to fake.
What it means
This is the local-first dividend showing up again, in a quieter form than raw speed. The box you own can tell you what it is doing, in detail, for free, if you put the observer at the layer that can actually see. No per-request billing to reconstruct from an invoice, no provider dashboard showing you their version of your usage, no agent to install. The data was always there in the request stream. It only needed something standing in the right place to keep count.
The router was already the best seat in the house. It just was not taking notes. Now it does.