Good engineers make decisions based on data, and most businesses assumed that the more data, the better the decision. Then several factors put a halt to the hoarding of ever more data: GDPR and its localized counterparts, and the cost of storage. However, the Datensparsamkeit approach already existed before the GDPR came into effect.
Datensparsamkeit is a German word that’s difficult to translate properly into English. It’s an attitude to how we capture and store data, saying that we should only handle data that we really need.
I don’t agree with Martin Fowler’s claim that it’s difficult to translate: Sparsamkeit translates to frugality. By analogy, in the context of coding assistants, Tokensparsamkeit means using only the tokens we really need. Token frugality is a good thing!
Today, the critical resource isn’t CPU, RAM, or storage: it’s tokens. Tokens are finite and expensive. My opinion is that developers will soon be measured on their token usage: the better developer will be the one who uses the fewest tokens to achieve similar results.
Imagine two engineers finishing the same job with the same quality in the same timeframe. If the organization needs to let go of one, it will be the more expensive one; in the era of AI, that means the one who consumes more tokens.
In this post, I’d like to describe a couple of methods to limit token usage.
Compression
One of the first steps toward Tokensparsamkeit is to compress the tokens sent to the underlying LLM while keeping the same data. But what are tokens? It’s a gross oversimplification, but for the sake of explanation, let’s consider that a word is a token. Read this deep dive if you want more details.
If we consider tokens to be words, we could remove articles and similar words from the payload to decrease the number of tokens. "Find the distance between the Earth and the moon" becomes "Find distance between Earth and moon". For all intents and purposes, the data received is the same, with fewer words.
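As a naive illustration of the idea, here’s a toy Python sketch that strips English articles. Real tools do smarter, command-aware filtering; the function and word list here are mine, purely for demonstration:

```python
# A toy "compressor": drop articles before sending the prompt.
# Real proxies filter entire command outputs, not just stopwords.
ARTICLES = {"the", "a", "an"}

def compress(prompt: str) -> str:
    return " ".join(word for word in prompt.split()
                    if word.lower() not in ARTICLES)

print(compress("Find the distance between the Earth and the moon"))
# -> Find distance between Earth and moon
```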
The trick is to set a proxy between the client and the LLM backend.
I’m using rtk myself:
CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies
The tool works across file commands, git, gh, test runners, build/lint commands, aws, docker, kubectl, etc.
Note that it’s not a magical recipe, as rtk itself mentions:
This only applies to Bash tool calls. Claude Code built-in tools such as Read, Grep, and Glob bypass the hook, so use shell commands or explicit rtk commands when you want RTK filtering there.
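For example, assuming the prefix pattern from rtk’s README, you route a verbose command through the filter by prepending it with rtk:

```shell
# illustrative invocation: rtk runs the command and compacts its output
rtk git log
```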
Context optimization
The second step toward Tokensparsamkeit is to avoid stuffing the context with irrelevant data.
Most people who start using coding assistants assume the context consists only of the system prompt and user prompts. There is actually a lot more. Anthropic’s Effective context engineering for AI agents article mentions:
- System prompt
- User prompt
- Message history
- Tool definitions
- Tool results
- MCP servers
- RAG
- Agent memory if applicable
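To get an intuition for the proportions, here’s a back-of-the-envelope accounting in Python. All numbers are made up for illustration; real figures depend entirely on your setup:

```python
# Hypothetical per-request token budget: the user prompt is usually
# a small fraction of what actually reaches the model.
context = {
    "system prompt": 3_000,
    "tool & MCP definitions": 15_000,   # grows with every server you enable
    "message history & tool results": 25_000,
    "RAG chunks": 4_000,
    "agent memory": 1_000,
    "user prompt": 300,
}
total = sum(context.values())
for part, tokens in context.items():
    print(f"{part:>32}: {tokens:>6,} ({tokens / total:.0%})")
print(f"{'total':>32}: {total:>6,}")
```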
Claude Code introduced the option to compact (or clear) the context before each interaction, explicitly asking each time whether to do it. I liked it, but they removed it a week or so later; perhaps too many people didn’t understand what it entailed. In any case, make good use of the /compact command that most assistants provide: it condenses the conversation history to lower its token usage, while keeping the relevant bits and pieces.
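In Claude Code, /compact additionally accepts optional instructions to steer what the summary keeps, for example:

```shell
/compact keep the failing tests and the planned next steps
```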
Also, notice that tools and MCP servers use tokens: the more you configure, the more tokens are used. Some MCP servers are so easy to set up that it’s tempting to stuff your assistant with them. Don’t. Instead, enable them on a case-by-case basis or at the project level: why enable the Vaadin MCP on a Rust project?
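For instance, Claude Code reads project-level MCP configuration from a .mcp.json file at the project root. The exact schema may evolve, and the server name and command below are placeholders:

```json
{
  "mcpServers": {
    "example-server": {
      "command": "npx",
      "args": ["-y", "example-mcp-server"]
    }
  }
}
```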
The same goes for tools, although I don’t think people use them as much as MCP servers.
Local models
Token usage matters most for cloud-based models, because usage is billed per million tokens. We care about it less with a local model: more tokens still mean more memory and slower inference, but the cost is decoupled from usage. There are several ways to use local models, including AI gateways; in the scope of this article, I’ll keep it simple.
I want to keep Claude Code as the client, because it’s really good. At the same time, I want to use my own hardware with a local model. The hardware cost is paid upfront, with no recurring costs except power consumption.
If you just want the solution, How to Run Local LLMs with Claude Code is where I found it. Keep reading this section if you want to learn about the issues I faced.
I initially tried to run Qwen3 32B via Ollama in Docker. Docker containers cannot access Apple’s Metal GPU framework, so the model ran entirely on the CPU. It loaded successfully but crashed during inference with a 500 error; CPU-only inference on a 32B model is simply too slow to be usable.
I had been using Ollama as the default because others did. Then I stumbled upon Friends Don’t Let Friends Use Ollama and switched from Ollama to llama.cpp, which enables low-level configuration.
The biggest hurdle was the context window size. Claude Code sends lots of tokens to the backend: on the OpenTelemetry tracing demo, it’s around 35k tokens per request.
I started with Qwen3 models, but their default context window wasn’t big enough. When a request exceeds the model’s maximum, llama-server immediately rejects it. I tried to increase the limit with the --ctx-size option, to no avail: Qwen3 models are trained with a 32,768-token context window. It’s a hard limit baked into the GGUF file metadata, and llama-server abides by it.
llama-server is meant to serve multiple requests simultaneously, and it turns out that the available tokens are shared equally across all possible requests: if the maximum number of tokens is T and the server can handle x requests in parallel, each request only has T/x tokens available. For this reason, I set the parallelism with --parallel 1.
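A quick sanity check with the numbers from this post (32,768-token hard limit, roughly 35k tokens per Claude Code request) shows the problem; this little Python sketch is mine, not part of the setup:

```python
# Per-slot context = model limit divided by the number of parallel slots.
model_ctx = 32_768       # hard limit from the GGUF metadata
request = 35_000         # approx. tokens Claude Code sends (OpenTelemetry demo)

for parallel in (4, 2, 1):
    per_slot = model_ctx // parallel
    verdict = "fits" if request <= per_slot else "rejected"
    print(f"--parallel {parallel}: {per_slot:>6} tokens per slot -> {verdict}")
```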
Despite all of the above, it still didn’t work: even a single slot’s 32,768 tokens falls short of what Claude Code sends.
Mixture of Experts vs. dense models
I was using a dense model, which is the usual kind: a dense model uses all of its parameters for every token. The alternative is a Mixture of Experts (MoE) model.
In the context of transformer models, a MoE consists of two main elements:
- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router, that determines which tokens are sent to which expert. For example, the token “More” might be sent to the second expert, and the token “Parameters” to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs - the router is composed of learned parameters and is pretrained at the same time as the rest of the network.
In layman’s terms, a MoE segments its weights/parameters into separate specialized submodels called experts, and a routing layer activates only the necessary experts depending on the request. Compared to a regular dense model, instead of computing across the entire parameter count T, only a small subset of experts is activated per request: their combined size t is much smaller than T, even though T itself can be larger than a comparable dense model.
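To make the routing idea concrete, here’s a toy top-k router in NumPy. It’s a didactic sketch, not how llama.cpp or any real MoE implements it:

```python
# Toy top-k MoE routing: the router scores the experts per input, and
# only the k best experts run. Compute scales with k, not with the
# total number of experts.
import numpy as np

rng = np.random.default_rng(42)
d, n_experts, k = 8, 8, 2

# Each "expert" is just a linear layer here.
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router                        # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                         # softmax over selected experts only
    # Only k of the n_experts weight matrices are touched for this input.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gate, top))

print(moe_layer(rng.normal(size=d)))
```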
The Qwen3.5-35B-A3B model is a MoE that works perfectly on my machine. Unlike the dense Qwen3 models capped at 32k, it supports up to 262k tokens, well above what Claude Code uses.
Putting it all together
We still lack a couple of elements to reach the goal of Claude Code using a local backend. To interact properly with Claude Code, the model should return structured content: that’s what the --jinja flag is for. It applies the chat template embedded in the model, which enables tool calls and structured output.

For better performance, you should also use Flash Attention, an optimized algorithm for computing the attention mechanism in Transformer models: it’s faster, more memory-efficient, and more scalable than standard attention. Activate it via --flash-attn on. The last configuration parameter offloads as many layers as possible to the GPU with --n-gpu-layers 99.
The final server command line is:
llama-server \
--model ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
--n-gpu-layers 99 \
--ctx-size 65536 \
--parallel 1 \
--flash-attn on \
--jinja \
--port 8080
On the Claude Code side, we need to set several environment variables:
| Environment variable | Meaning | Example |
|---|---|---|
| `ANTHROPIC_BASE_URL` | URL to the local llama-server endpoint | `http://127.0.0.1:8080` |
| `ANTHROPIC_API_KEY` | Anything non-empty; the local server doesn’t check it | `dummy` |
| `ANTHROPIC_AUTH_TOKEN` | Anything non-empty; the local server doesn’t check it | `dummy` |
| `CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC` | Self-explicit | `1` |
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_AUTH_TOKEN=dummy
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude
At this point, you can use Claude Code, and it will query your local model. For reference, here’s a sample of the server output for a single query.
llama-server console output
srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.788 (> 0.100 thold), f_keep = 0.789
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 1464 | processing task, is_child = 0
slot update_slots: id 0 | task 1464 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 56401
slot update_slots: id 0 | task 1464 | n_past = 44456, slot.prompt.tokens.size() = 56378, seq_id = 0, pos_min = 56377, n_swa = 0
slot update_slots: id 0 | task 1464 | Checking checkpoint with [56141, 56141] against 44456...
slot update_slots: id 0 | task 1464 | Checking checkpoint with [55629, 55629] against 44456...
slot update_slots: id 0 | task 1464 | Checking checkpoint with [49151, 49151] against 44456...
slot update_slots: id 0 | task 1464 | Checking checkpoint with [40959, 40959] against 44456...
slot update_slots: id 0 | task 1464 | restored context checkpoint (pos_min = 40959, pos_max = 40959, n_tokens = 40960, n_past = 40960, size = 62.813 MiB)
slot update_slots: id 0 | task 1464 | erased invalidated context checkpoint (pos_min = 49151, pos_max = 49151, n_tokens = 49152, n_swa = 0, pos_next = 40960, size = 62.813 MiB)
slot update_slots: id 0 | task 1464 | erased invalidated context checkpoint (pos_min = 55629, pos_max = 55629, n_tokens = 55630, n_swa = 0, pos_next = 40960, size = 62.813 MiB)
slot update_slots: id 0 | task 1464 | erased invalidated context checkpoint (pos_min = 56141, pos_max = 56141, n_tokens = 56142, n_swa = 0, pos_next = 40960, size = 62.813 MiB)
slot update_slots: id 0 | task 1464 | n_tokens = 40960, memory_seq_rm [40960, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 43008, batch.n_tokens = 2048, progress = 0.762540
slot update_slots: id 0 | task 1464 | n_tokens = 43008, memory_seq_rm [43008, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 45056, batch.n_tokens = 2048, progress = 0.798851
slot update_slots: id 0 | task 1464 | n_tokens = 45056, memory_seq_rm [45056, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 47104, batch.n_tokens = 2048, progress = 0.835163
slot update_slots: id 0 | task 1464 | n_tokens = 47104, memory_seq_rm [47104, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 49152, batch.n_tokens = 2048, progress = 0.871474
slot update_slots: id 0 | task 1464 | n_tokens = 49152, memory_seq_rm [49152, end)
slot update_slots: id 0 | task 1464 | 8192 tokens since last checkpoint at 40960, creating new checkpoint during processing at position 51200
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 51200, batch.n_tokens = 2048, progress = 0.907785
slot update_slots: id 0 | task 1464 | created context checkpoint 6 of 32 (pos_min = 49151, pos_max = 49151, n_tokens = 49152, size = 62.813 MiB)
slot update_slots: id 0 | task 1464 | n_tokens = 51200, memory_seq_rm [51200, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 53248, batch.n_tokens = 2048, progress = 0.944097
slot update_slots: id 0 | task 1464 | n_tokens = 53248, memory_seq_rm [53248, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 55296, batch.n_tokens = 2048, progress = 0.980408
slot update_slots: id 0 | task 1464 | n_tokens = 55296, memory_seq_rm [55296, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 55885, batch.n_tokens = 589, progress = 0.990851
slot update_slots: id 0 | task 1464 | n_tokens = 55885, memory_seq_rm [55885, end)
slot update_slots: id 0 | task 1464 | prompt processing progress, n_tokens = 56397, batch.n_tokens = 512, progress = 0.999929
slot update_slots: id 0 | task 1464 | created context checkpoint 7 of 32 (pos_min = 55884, pos_max = 55884, n_tokens = 55885, size = 62.813 MiB)
slot update_slots: id 0 | task 1464 | n_tokens = 56397, memory_seq_rm [56397, end)
reasoning-budget: activated, budget=2147483647 tokens
slot init_sampler: id 0 | task 1464 | init sampler, took 4.37 ms, tokens: text = 56401, total = 56401
slot update_slots: id 0 | task 1464 | prompt processing done, n_tokens = 56401, batch.n_tokens = 4
slot update_slots: id 0 | task 1464 | created context checkpoint 8 of 32 (pos_min = 56396, pos_max = 56396, n_tokens = 56397, size = 62.813 MiB)
srv log_server_r: done request: POST /v1/messages 127.0.0.1 200
reasoning-budget: deactivated (natural end)
slot print_timing: id 0 | task 1464 |
prompt eval time = 65949.79 ms / 15441 tokens ( 4.27 ms per token, 234.13 tokens per second)
eval time = 3639.91 ms / 87 tokens ( 41.84 ms per token, 23.90 tokens per second)
total time = 69589.71 ms / 15528 tokens
slot release: id 0 | task 1464 | stop processing: n_tokens = 56487, truncated = 0
Discussion
While the underlying model is important, most people undervalue the client. I used both Claude Code and Copilot CLI with the same underlying model, Claude Sonnet 4.6, and found Claude Code far superior across several sessions.
Keeping Claude Code while hosting the model locally is a great cost-saving alternative: you only pay for the hardware once. Granted, it’s slower, but it’s a trade-off worth considering on a limited budget, like at home. If you have well-designed, working autonomous agents, you can run them overnight anyway.