From a software developer point of view, there’s a lot in what happens nowadays around generative AI that sounds familiar.
Once again, aren’t we reinventing the wheel? This leads to some questions:
- Aren’t we reinventing the wheel?
- What remains to be reinvented?
- What won’t it be possible to reinvent?
- Why hasn’t everything that could be reinvented been reinvented yet?
- How generative AI may help to speed up the process?
But as rightly pointed out by Art of the Problem :
In a world where the cost of answers is dropping to zero, the value of the question becomes everything.
So, as long as we keep an eye on hallucinations – even though we wanted it to be creative -, an interesting chat with Claude…
This conversation occurred as I was trying to learn more about how to spare tokens, and it quickly led me to the subject.
Optimizing the consumption of tokens looks very similar to the ways we optimize memory usage in a computer, the usage of a network connection, and so on. Is there something really new in the way we build systems based on an LLM compared to the way we build systems based on traditional computers?
That’s a profound observation – and honestly, the answer is mostly no, but with a few genuinely new things.
What’s clearly recycled
The parallels are striking and deliberate:
- Virtual context window → virtual memory & paging
- Semantic caching → CPU cache hierarchies (L1/L2/L3)
- RAG → demand paging / lazy loading
- Token routing to cheaper models → CPU frequency scaling, QoS traffic shaping
- Prompt compression → data compression, network packet optimization
- Conversation summarization → lossy compression, delta encoding
- Agent tool calls → syscalls, interrupt handlers
- Memory tiers (working / episodic / semantic) → registers → RAM → disk hierarchy
Systems engineers from the 1970s would recognize almost all of these patterns immediately. We’re largely rediscovering the same solutions because we’re facing the same fundamental constraints: scarce expensive fast memory vs. abundant cheap slow memory, finite bandwidth, latency tradeoffs.
What is genuinely new
There are a handful of things that don’t have clean analogies in classical systems:
1. The resource itself is a reasoning engine, not a processor
A CPU executes instructions deterministically. An LLM interprets inputs and produces outputs probabilistically. This means:
- The “quality” of a computation is fuzzy and context-dependent
- You can trade correctness for cost in ways classical systems can’t
- The system can be wrong in semantic ways that have no equivalent in CPU architecture


