The MoE Speedup, Measured: 50 Prompts, Two Local Qwen Models, One Mac Studio
A dense 27B at 8-bit runs my local chat work at 15 tokens a second. Its 35B Mixture-of-Experts sibling runs the same prompts at 70. I ran 50 prompts through both to find out what the extra speed actually costs.
My local chat model is a dense 27B at 8-bit. It is accurate, it is multilingual, and it generates at about 15 tokens per second on a 48GB Mac Studio. That is faster than I read, so for drafting and rewriting it has never been the bottleneck. But 15 tokens per second is the kind of number you stop noticing only because you got used to it.
Then the same model family shipped a Mixture-of-Experts sibling: 35B total parameters, but only about 3B active per token. On paper that should generate far faster, because each token touches a fraction of the weights. The question that actually matters is not whether it is faster. It is whether it is faster for free, or whether the speed quietly costs you quality.
So I ran 50 prompts through both and looked at every answer.
The setup
Both models run under the same local stack on the same machine, swapped in and out by a small router so only one sits in memory at a time. Same llama.cpp build, full GPU offload, 8-bit KV cache, 64k context, identical sampling, thinking mode off. The only variables are the model and its quantization.
- Dense: Qwen3.6 27B, Q8_0, about 27GB. The high-fidelity option.
- MoE: Qwen3.6 35B-A3B, Q6_K, about 27GB. Sparse activation, same footprint as the dense model. (I benchmarked it first at Q5, then moved up to Q6 for a reason the quality section makes clear.)
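The matching footprints can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch, not a measurement: the bits-per-weight figures are the nominal values for llama.cpp's Q8_0 and Q6_K block formats, the parameter counts are rounded to 27B and 35B, and real GGUF files mix tensor types, so actual sizes will differ a little. The function name is mine.

```python
# Nominal bits per weight for two llama.cpp quant formats (assumed values;
# real GGUF files mix tensor types, so actual file sizes differ slightly).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.5625}


def file_size_gib(total_params: float, quant: str) -> float:
    """Approximate weight-file size in GiB for a rounded parameter count."""
    return total_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


dense = file_size_gib(27e9, "Q8_0")  # dense model: 27B parameters, rounded
moe = file_size_gib(35e9, "Q6_K")    # MoE model: 35B total parameters, rounded

print(f"dense 27B Q8_0: {dense:.1f} GiB")
print(f"MoE 35B Q6_K:   {moe:.1f} GiB")
```

Under these assumptions both estimates land within a fraction of a GiB of each other, which is consistent with the "about 27GB" figure for each file.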
The 50 prompts span ten categories: technical explanation, reasoning, code, strict formatting, summarization, creative writing, German prose, Portuguese prose, factual gotchas, and rewriting. The full prompt-by-prompt catalog, both outputs side by side, lives alongside this post.
A note on honesty before the numbers: this is one run, not an averaged sweep; the judging is qualitative; and a few of the reasoning prompts hit my token cap and got cut off mid-solution. Treat it as a careful field test, not a leaderboard.
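For anyone who wants to repeat the test, the measurement loop is simple enough to sketch. This is an illustration, not the harness behind the numbers in this post. The `generate` callable is a hypothetical stand-in for whatever sends a prompt to your local server and returns the number of completion tokens. The 1024-token cap is an assumed value. Timing the whole call also folds prompt processing into the rate.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical signature: generate(prompt, max_tokens) -> completion token count.
Generate = Callable[[str, int], int]


def run_suite(
    prompts: Dict[str, List[str]],
    generate: Generate,
    max_tokens: int = 1024,  # assumed cap, not the one used in this post
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Dict[str, float]]:
    """Return mean tokens/second and a truncation count per category."""
    results = {}
    for category, items in prompts.items():
        rates = []
        truncated = 0
        for prompt in items:
            start = clock()
            n_tokens = generate(prompt, max_tokens)
            elapsed = clock() - start
            rates.append(n_tokens / elapsed)
            # An answer that used the full budget was probably cut off.
            if n_tokens >= max_tokens:
                truncated += 1
        results[category] = {"tok_per_s": mean(rates), "truncated": truncated}
    return results
```

Counting truncations alongside the rate makes it easy to see which categories ran into the token cap, as some of the reasoning prompts did here.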
Speed: 4.5 times, flat across the board
| Category | Dense 27B (Q8), tok/s | 35B MoE (Q6), tok/s | Speedup |
|---|---|---|---|
| All 50 (mean) | 15.6 | 70.4 | 4.5x |
| Technical | 14.8 | 69.4 | 4.7x |
| Reasoning | 14.9 | 68.7 | 4.6x |
| Code | 15.1 | 69.6 | 4.6x |
| Formatting | 16.2 | 74.8 | 4.6x |
| German | 15.2 | 70.3 | 4.6x |
| Portuguese | 15.2 | 70.0 | 4.6x |
The interesting thing about this table is how boring it is. The speedup barely depends on the task: every category shown lands at 4.6x or 4.7x. What drives it is the architecture and the quant, not the workload. A sparse model reading about 3B of weights per token instead of 27B is simply moving less data, and on Apple Silicon, where token generation is bound by memory bandwidth rather than compute, moving less data is the whole game.
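The bandwidth argument can be put into rough numbers. The sketch below estimates how many bytes of weights each model reads per generated token and compares that ratio with the measured speedup from the table. The bits-per-weight values are the nominal figures for llama.cpp's Q8_0 and Q6_K formats, the 3B active-parameter count is the approximate one quoted above, and the function name is mine. Treat all of them as assumptions.

```python
# Nominal bits per weight (assumed; real files mix tensor types).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.5625}


def weight_bytes_per_token(active_params: float, quant: str) -> float:
    """Bytes of weights read to generate one token, counting weights only."""
    return active_params * BITS_PER_WEIGHT[quant] / 8


dense_bytes = weight_bytes_per_token(27e9, "Q8_0")  # dense: every weight is read
moe_bytes = weight_bytes_per_token(3e9, "Q6_K")     # MoE: about 3B active per token

ceiling = dense_bytes / moe_bytes  # speedup if weight reads were the only cost
measured = 70.4 / 15.6             # mean tok/s from the table

print(f"weight-traffic ratio: {ceiling:.1f}x")
print(f"measured speedup:     {measured:.1f}x")
```

The weight-traffic ratio comes out well above the measured 4.5x, so it is best read as a ceiling. The estimate counts only weight reads and leaves out costs such as KV-cache reads and fixed per-token overhead, which sparsity presumably does not reduce.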
Quality: near parity, with a pattern
If the MoE were 4.5 times faster and clearly worse, this would be an easy and uninteresting post. It is not. Across the 50 prompts the two models traded wins, with most answers a genuine tie. A few observations that held up across the set:
Both are factually solid. Every classic trap landed correctly for both: 9.9 is larger than 9.11, the capital of Australia is Canberra, the Great Wall is not visible from the Moon, and Strasbourg was a Free Imperial City in 1518, not French. No hallucinations in either column.
The MoE is crisper and more direct. Asked to “be concise,” the dense model offered three rewrites; the MoE gave one tight sentence. Asked a trick rate problem, the MoE led with the answer while the dense model was still enumerating machines when it ran out of room. Its taglines were wittier and its bash one-liner was cleaner. Lower latency plus a tendency to answer first is a real ergonomic win.
The dense 8-bit model looked more precise, until I checked the quant. At Q5 the MoE made two Portuguese spelling slips the 8-bit model did not, and dropped a stray em-dash into an error-message rewrite. Those clustered suspiciously in one place: lower-resource-language orthography, exactly the corner a lighter quant rounds off. So I re-ran the entire suite at Q6. The Portuguese came back clean, the em-dash was gone, and generation speed was unchanged, because the bottleneck is bandwidth and Q6 still moves far less data per token than the dense model. What Q6 did not fix: the MoE still read “garbage collection” as municipal trash trucks in a haiku, where the dense model saw the programming concept. That one is the model’s literal-mindedness, not the quant, and no number of bits will touch it. So the real trade-off, once you stop being cheap about the quant, shrinks to a single niche of literal wordplay.
More bits is not more smart
There is a tidy irony in the result worth sitting with. The model I deployed is a 6-bit quant. The one it replaced was 8-bit, nominally the higher-fidelity, less-compressed file. On the naive view that more bits means more quality, the dense 27B at Q8 should have won outright. It did not. A 6-bit MoE matched or beat an 8-bit dense model on nearly every prompt, and even the 5-bit version I tested first held parity outside Portuguese spelling. What you lose to coarser quantization turned out smaller than what you gain from a larger, sparser network with more total parameters to draw on. Bit-width is a knob, not the scoreboard: architecture and parameter count set the ceiling, and the quant only decides how close to it you land. The practical lesson for choosing a local model is to compare the actual outputs, not the precision label on the file.
What it means
For most of what a local model does day to day in English and German (drafting, summarizing, rewriting, explaining, coding), the 35B MoE is a straight upgrade: roughly the same quality at 4.5 times the speed and the same memory footprint. That is not a marginal tuning win. That is the difference between waiting and not waiting.
The caveat was specific enough to act on, so I did. My work leans on Portuguese, so I moved up to Q6 and re-ran the suite; the orthography slips vanished at no cost to speed. Q6 is what now serves my chat work, under the same name the dense model used, so everything downstream just got faster without knowing anything changed.
The broader point is the one local-first keeps making. None of this cost a cent or left the machine. No per-token bill, no rate limit, no provider deciding what the model will and will not say. The experiment was 50 prompts, two model files, and a router call to swap between them. The hardware you already own will tell you which model is right for your work, if you bother to measure instead of guess.