Generative AI: How we keep reinventing the wheel

03/21/2026 · 03/22/2026 · SoC · artificial intelligence, chats

From a software developer’s point of view, much of what is happening around generative AI these days sounds familiar.

Once again, aren’t we reinventing the wheel? This leads to some questions:

Aren’t we reinventing the wheel?
What remains to be reinvented?
What can’t be reinvented?
Why hasn’t everything that could be reinvented been reinvented yet?
How might generative AI help speed up the process?

But as Art of the Problem rightly points out:

In a world where the cost of answers is dropping to zero, the value of the question becomes everything.

So, as long as we keep an eye on hallucinations (even though we wanted it to be creative), here is an interesting chat with Claude…

How we keep reinventing the wheel with generative AI

This conversation took place while I was trying to learn more about how to save tokens, and it quickly led me to this subject.

Optimizing the consumption of tokens looks very similar to the ways we optimize memory usage in a computer, the usage of a network connection, and so on. Is there something really new in the way we build systems based on an LLM compared to the way we build systems based on traditional computers?

That’s a profound observation – and honestly, the answer is mostly no, but with a few genuinely new things.

What’s clearly recycled

The parallels are striking and deliberate:

Virtual context window → virtual memory & paging
Semantic caching → CPU cache hierarchies (L1/L2/L3)
RAG → demand paging / lazy loading
Token routing to cheaper models → CPU frequency scaling, QoS traffic shaping
Prompt compression → data compression, network packet optimization
Conversation summarization → lossy compression, delta encoding
Agent tool calls → syscalls, interrupt handlers
Memory tiers (working / episodic / semantic) → registers → RAM → disk hierarchy

Systems engineers from the 1970s would recognize almost all of these patterns immediately. We’re largely rediscovering the same solutions because we’re facing the same fundamental constraints: scarce expensive fast memory vs. abundant cheap slow memory, finite bandwidth, latency tradeoffs.

What is genuinely new

There are a handful of things that don’t have clean analogies in classical systems:

1. The resource itself is a reasoning engine, not a processor

A CPU executes instructions deterministically. An LLM interprets inputs and produces outputs probabilistically. This means:

The “quality” of a computation is fuzzy and context-dependent
You can trade correctness for cost in ways classical systems can’t
The system can be wrong in semantic ways that have no equivalent in CPU architecture

2. The prompt is simultaneously code, data, and state

In classical systems these are strictly separated. In an LLM system, your system prompt is part program, part memory, part configuration – and it’s all interpreted at runtime in natural language. A single ambiguous word in a prompt can silently change behavior in ways no compiler or linter will catch.

3. Emergent behavior from scale

Classical systems don’t suddenly acquire new capabilities when you make them bigger. LLMs do – chain-of-thought reasoning, in-context learning, and instruction following weren’t explicitly programmed, they emerged from scale. This makes capacity planning fundamentally different: you don’t just get “more of the same”, you sometimes get qualitatively different behavior.

4. The interface is natural language

Every classical abstraction boundary (API, syscall, protocol) is formally specified. LLM boundaries are fuzzy – you “ask” the model to do something and it tries to comply. This creates an entirely new class of failure modes: prompt injection, jailbreaks, hallucinations, sycophancy.

5. In-context learning without weight updates

A classical system can’t learn from its inputs at runtime without being explicitly reprogrammed. An LLM can adapt its behavior within a single context window just from examples you provide – no retraining, no redeployment. This is a genuinely new primitive.

The deeper question

Your observation points at something important: we may be in the “assembly language era” of LLM systems. We’re hand-optimizing token budgets the way programmers once hand-optimized memory addresses. The patterns we’re building now – RAG, caching, routing, memory tiers – will likely get abstracted away into infrastructure, just like virtual memory management was abstracted away from application developers.

What will remain distinctive is probably not the engineering patterns but the nature of the resource being managed – a system that reasons, errs, generalizes, and surprises in ways no deterministic machine ever did.

So: the plumbing is old. The thing running through the pipes is new.

What existing patterns remain to be reused?

If we take the analogy seriously and map LLM systems against the full history of computer science, there are quite a few classical patterns that haven’t been widely adopted yet.

From deterministic (CPU) patterns to probabilistic (LLM) patterns

1. Transactions and ACID – the most urgent gap

When an LLM agent takes a sequence of real-world actions (book a flight, update a CRM, send an email), there is currently no standard way to roll back if step 3 fails after steps 1 and 2 have already executed. Classical databases solved this 50 years ago. In agentic LLM systems it’s almost entirely unaddressed – most pipelines just hope nothing goes wrong mid-sequence. The saga pattern (compensating transactions) from distributed systems is the obvious port here.
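To make the saga idea concrete, here is a minimal sketch in Python. It assumes each agent action can be paired with a compensation written in advance; the names run_saga, Step, and the three demo steps are hypothetical illustrations, not part of any framework.

```python
from typing import Callable, List, Tuple

# (name, action, compensation) - all three are illustrative assumptions
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            done.append(step)
        except Exception:
            log.append(f"failed:{name}")
            # Undo what already happened, most recent first
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated:{prev_name}")
            break
    return log


def fail_to_send() -> None:
    raise RuntimeError("email provider unavailable")


def demo_saga_log() -> List[str]:
    # Placeholder actions stand in for real tool calls
    return run_saga([
        ("book_flight", lambda: None, lambda: None),
        ("update_crm", lambda: None, lambda: None),
        ("send_email", fail_to_send, lambda: None),
    ])
```

In this sketch the third step fails, so the CRM update and the booking are compensated in reverse order. The hard part in practice is not the orchestration loop but writing a meaningful compensation for each action.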

2. Formal verification and type systems

In classical software you can statically prove that a function can never receive a null where it expects an integer. In LLM systems, the “interface” between components is a natural language string – you can’t prove anything about it before runtime. Tools like Instructor and Outlines are starting to constrain outputs to JSON schemas, but that’s just type-checking the output format, not the semantic content. There’s no pre/post condition system for LLM behavior yet.
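The difference between checking the format and checking the content can be sketched with the standard library alone. The FlightBooking type, its fields, and the seat range below are hypothetical; this is not how Instructor or Outlines work internally, only an illustration of a hand-written postcondition on model output.

```python
import json
from dataclasses import dataclass


@dataclass
class FlightBooking:
    passenger: str
    seats: int


def parse_booking(raw: str) -> FlightBooking:
    """Validate raw model output: first its shape, then one semantic rule."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    passenger = data.get("passenger")
    seats = data.get("seats")
    # Format checks: roughly what schema-constrained decoding gives you
    if not isinstance(passenger, str) or not passenger:
        raise ValueError("passenger must be a non-empty string")
    if isinstance(seats, bool) or not isinstance(seats, int):
        raise ValueError("seats must be an integer")
    # Postcondition: a content rule that a format-only check would miss
    if not 1 <= seats <= 9:
        raise ValueError("seats must be between 1 and 9")
    return FlightBooking(passenger, seats)
```

A check like this only covers rules that can be written as a predicate. Whether the passenger name is the right one for the conversation remains outside its reach, which is the gap the paragraph above describes.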

3. Chaos engineering

Netflix’s Chaos Monkey randomly kills servers in production to find hidden fragility. The LLM equivalent would be systematically injecting adversarial inputs, hallucinated tool results, or unexpected context into pipelines during testing – not just at launch, but continuously. Red-teaming exists but is mostly a one-time pre-launch exercise, not an ongoing operational discipline.
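A small fault-injection wrapper gives a feel for what this could look like in a test suite. It is only a sketch: chaotic, lookup_price, and pipeline are hypothetical names, and the injected fault is a fixed error string rather than a realistic hallucinated result.

```python
import random
from typing import Callable


def chaotic(tool: Callable[[str], str], failure_rate: float,
            rng: random.Random) -> Callable[[str], str]:
    """Wrap a tool so that a fraction of calls return a corrupted result."""
    def wrapper(query: str) -> str:
        if rng.random() < failure_rate:
            return "ERROR: corrupted tool result"  # injected fault
        return tool(query)
    return wrapper


def lookup_price(item: str) -> str:
    # Stand-in for a real tool the agent would call
    return f"{item}: 42 EUR"


def pipeline(tool: Callable[[str], str], item: str) -> str:
    """The code under test: it should degrade gracefully on bad tool output."""
    result = tool(item)
    if result.startswith("ERROR"):
        return "fallback: price unavailable"
    return result
```

Running the pipeline against chaotic(lookup_price, 0.2, random.Random(0)) on every build, rather than once before launch, is the continuous discipline the paragraph above has in mind.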

4. Backpressure

In reactive systems, if a downstream consumer is overwhelmed, it signals upstream producers to slow down. In agentic LLM pipelines with multiple models and tools in sequence, there’s no such signaling – a slow tool just blocks everything, or the pipeline crashes. Flow control from network/stream processing hasn’t been ported yet.
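For comparison, this is roughly what backpressure looks like with a bounded asyncio.Queue. The producer stands in for an upstream model call and the consumer for a slow tool; both functions and the capacity value are assumptions made for the example.

```python
import asyncio
from typing import List


async def producer(queue: asyncio.Queue, n_items: int, max_depth: List[int]) -> None:
    for i in range(n_items):
        # put() suspends while the queue is full: that pause is the backpressure
        await queue.put(i)
        max_depth[0] = max(max_depth[0], queue.qsize())
    await queue.put(None)  # sentinel: no more work


async def slow_consumer(queue: asyncio.Queue, results: List[int]) -> None:
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0.001)  # simulate a slow downstream tool
        results.append(item)


async def run(n_items: int = 20, capacity: int = 3):
    queue: asyncio.Queue = asyncio.Queue(maxsize=capacity)
    results: List[int] = []
    max_depth = [0]
    await asyncio.gather(
        producer(queue, n_items, max_depth),
        slow_consumer(queue, results),
    )
    return results, max_depth[0]
```

Calling asyncio.run(run()) processes every item in order while the queue never holds more than three. The producer is slowed to the consumer's pace instead of piling up unbounded work.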

5. Observability – partially there, but shallow

Classical distributed tracing (OpenTelemetry, Jaeger) gives you a complete picture of every hop a request makes across services, with timing and causality. LLM tracing tools like LangSmith capture prompt/response pairs but rarely give you semantic observability – why did the model make this reasoning step, not another? The gap between “logging what happened” and “understanding why” is much larger with LLMs than with deterministic code.

The meta-pattern worth noting

The gaps cluster around a single theme: classical CS has decades of tools for managing systems that are deterministic and formally specifiable. The untapped patterns are almost all ones that assume you can reason about your system’s behavior before it runs. LLMs break that assumption – and we don’t yet have replacements for the guarantees that assumption used to provide.

What patterns will not be reused because of the difference between an LLM (probabilistic) and a CPU (deterministic)?

This is the sharpest version of the question. It’s not just “what haven’t we ported yet” – it’s “what can we never port, by nature.”

Deterministic (CPU) patterns that remain to be translated as probabilistic (LLM) patterns

None of the classical patterns on the left disappears – each one mutates into a probabilistic approximation on the right. Here’s what that means in depth for each one.

1. Formal proofs → evals (and why that’s a profound loss)

A mathematical proof gives you a guarantee that holds for all inputs, forever. You prove a sorting algorithm is correct once and you’re done. With an LLM you can never prove correctness – you can only sample behavior across a distribution of inputs and measure how often it’s “good enough.” This is a categorical shift: from certainty to confidence intervals. No amount of engineering closes that gap. The best you get is a well-designed eval suite and a number like “passes 94% of cases” – which means 6% is structurally unknowable failure.
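The move from certainty to confidence intervals can be shown with a few lines of arithmetic. The sketch below computes a Wilson score interval for an eval pass rate; the function name and the choice of the Wilson method are assumptions of this example, and it treats eval cases as independent samples.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

For 94 passes out of 100 cases, wilson_interval(94, 100) gives roughly 0.88 to 0.97. A single headline number like "94%" therefore hides a fairly wide range when the eval suite is small.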

2. Unit tests → statistical benchmarks

A unit test is binary and eternal: it passes or it fails, and if it passed yesterday it passes today. LLM “tests” (evals) are probabilistic and drift – the same prompt can get different answers across model versions, temperature settings, or even time of day if the provider silently updates weights. You can’t pin a behavior; you can only track a distribution. CI/CD pipelines built on binary test gates don’t translate – you need something closer to A/B test infrastructure.

3. Static analysis → runtime behavioral monitoring

A linter or type checker analyzes code before it runs and catches errors at zero cost. With LLMs there is no static representation to analyze – the “behavior” only exists when the model is running on actual inputs. You can’t look at a prompt and prove it won’t produce harmful output; you can only monitor outputs at runtime and react. This shifts the entire quality assurance model from pre-deployment to post-deployment. The security implications are severe: you can’t certify a system before shipping it.

4. Idempotency → “best effort” retries

In classical systems, an idempotent operation can be safely retried any number of times – the result is always the same. With LLMs, retrying the same request is not idempotent – you get a different response. This makes retry logic fundamentally ambiguous: if the first attempt produced a bad answer, the retry might produce a better one, or a different bad one, or something that contradicts the first. Agentic pipelines that call tools based on LLM decisions compound this: a retry might take a completely different branch.
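One partial mitigation is to make the tool side idempotent even though the model is not. The sketch below deduplicates side effects with an idempotency key derived from the tool name and arguments; IdempotentExecutor and its counter are hypothetical, and real systems would persist the keys rather than keep them in memory.

```python
import hashlib
import json
from typing import Dict


class IdempotentExecutor:
    """Execute each distinct (tool, args) request at most once."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.side_effects = 0  # counts how often the "real" action ran

    def execute(self, tool_name: str, args: dict) -> str:
        # Same tool and same arguments give the same key, whatever the dict order
        payload = json.dumps([tool_name, args], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key in self._results:
            return self._results[key]  # replay the stored result, no new side effect
        self.side_effects += 1
        result = f"{tool_name} executed #{self.side_effects}"
        self._results[key] = result
        return result
```

This protects against the same tool call being issued twice. It does nothing for the harder case described above, where the retried model call chooses a different tool or different arguments.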

5. Reproducible builds → snapshot regression

In software, a reproducible build means the exact same binary from the exact same source, forever. LLM systems can never be fully reproducible – even with temperature=0, identical outputs aren’t guaranteed across model versions, hardware, or provider infrastructure changes. “Snapshot testing” is the approximate substitute, but it’s brittle: a model update might change wording without changing meaning, breaking the snapshot even when behavior improved.
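A tolerant snapshot comparison is one way to reduce that brittleness. The sketch below uses difflib from the standard library to compare word sequences; the function names and the 0.8 threshold are arbitrary assumptions, and word overlap is only a crude proxy for "same meaning".

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop punctuation."""
    collapsed = " ".join(text.lower().split())
    return re.sub(r"[^a-z0-9 ]+", "", collapsed)


def matches_snapshot(output: str, snapshot: str, threshold: float = 0.8) -> bool:
    """Accept outputs whose word sequence is close enough to the stored snapshot."""
    ratio = difflib.SequenceMatcher(
        None, normalize(output).split(), normalize(snapshot).split()
    ).ratio()
    return ratio >= threshold
```

With this check, "Your flight is booked for Monday." still matches a snapshot of "The flight is booked for Monday.", while an unrelated answer does not. A paraphrase with entirely different wording would still fail, so this softens the problem rather than solving it.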

6. Exact rollback → approximate rollback

Git lets you revert to any previous state of code with perfect fidelity. If an LLM-driven agent has taken 10 actions in the world (sent emails, modified records, called APIs), there is no clean rollback – each action may have had downstream effects you can’t enumerate. The saga / compensating transaction pattern is the best available substitute, but it requires predicting in advance what the compensation for each action would be. For actions involving human communication or irreversible real-world effects, compensation is impossible by definition.

The deepest consequence

All of these broken patterns share a root cause: classical CS was built on the assumption that a system’s behavior is fully determined by its specification. That assumption is what makes proofs, tests, static analysis, and reproducibility possible. LLMs violate that assumption at the most fundamental level – the “specification” (a prompt) does not fully determine behavior.

This means we’re not just missing tools. We need an entirely different epistemology for building reliable systems – one borrowed less from computer science and more from disciplines that have always dealt with unreliable components: medicine (evidence-based practice, trial designs), aviation (probabilistic safety cases, redundancy), and statistics (confidence intervals, hypothesis testing). The shift isn’t from one engineering paradigm to another. It’s from engineering to empirical science.

Can you cite sources drawing a parallel between patterns for CPU-based systems and patterns for LLM-based systems?

Here is an honest account of what the literature actually says, with proper sourcing.

Sources that explicitly draw the CPU/deterministic vs. LLM/probabilistic parallel

1. MemGPT (Packer et al., 2023 – UC Berkeley, arXiv:2310.08560) is the clearest and most rigorous example. The paper explicitly proposes “virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory.” The title itself – “Towards LLMs as Operating Systems” – frames the entire parallel. This is peer-reviewed academic work that consciously ports an OS pattern to LLM systems.

2. Martin Fowler (Thoughtworks, 2025) is one of the most authoritative voices on software engineering patterns. Fowler has called LLMs “a form of nondeterministic computing, which has different characteristics than everything we consider as ‘computing’ today, which is deterministic computing” – and compared the shift to the transition away from assembly language. He also frames the engineering response in terms of tolerances, drawing an analogy to mechanical engineering: “just as we know how much weight a concrete bridge can take, so too should LLMs come with metrics describing the levels of precision they can support.”

3. Craig Risi (2025) makes the architectural break explicit: “Traditional software architecture has always relied on determinism… LLM-powered systems break this paradigm entirely. They operate in a probabilistic space, where behavior is shaped by model weights, context, and statistical inference rather than explicit rules.” He also articulates the testing collapse: “LLMs break traditional testing assumptions. There is no fixed ‘correct output’ – only acceptable output… You’re testing behaviors, not deterministic functions.”

4. Martin Fowler’s site (GenAI patterns article, 2025) addresses the testing shift directly: “With LLM-based systems, we encounter a system that no longer behaves deterministically. Such a system will provide different outputs to the same inputs on repeated requests… The Gen-AI examines behavior through ‘evaluations’, usually shortened to ‘evals’.”

5. Riddhi Shah (Medium, 2025) maps classical design patterns (Strategy, Command, Interpreter) to their LLM equivalents and notes: “Classic software systems are built on deterministic logic… LLMs don’t work like that. Instead of following a fixed rule set, large language models operate on probabilities… pattern boundaries have to be looser, more tolerant of variance.”

6. Alex Ewerlöf (2025) takes a practitioner angle, explicitly framing his work as “30 techniques from conventional system engineering to supercharge AI Engineering” – the most direct inventory of ported patterns from a senior engineer’s perspective.

What the literature does not yet contain

To be transparent: the specific framing of our conversation – asking which classical patterns are fundamentally incompatible with probabilistic systems (formal proofs, idempotency, exact rollback, reproducible builds) – is not yet a well-developed topic in the literature. The sources above describe the problem of non-determinism, but very few systematically catalog which patterns are permanently broken vs. merely needing adaptation. The academic paper on testing (arXiv:2508.20737) gets closest, noting the shift “from deterministic computation to probabilistic generation… from static rule-based systems to dynamically learned behaviors” – but stops short of a systematic incompatibility analysis.

The synthesis we built in this conversation – mapping the determinism boundary pattern by pattern – reflects the direction the field is moving, but is ahead of what’s been formally published. The closest published precedent is Fowler’s engineering tolerances analogy and the MemGPT OS analogy. The rest remains, for now, more practitioner intuition than peer-reviewed taxonomy.

Why have existing patterns that could be reused not yet been reused?

The answer is more interesting than just “people haven’t gotten around to it yet.”

1. The field is moving faster than its own engineering culture

LLM application development has gone from zero to production-at-scale in roughly 2-3 years. The engineers building these systems are often ML researchers or startup developers, not systems engineers with deep backgrounds in distributed systems, formal methods, or OS design. The people who know the classical patterns and the people building LLM systems are largely different communities. The cross-pollination simply hasn’t happened yet at scale.

Classical patterns like circuit breakers, backpressure, and ACID transactions took decades to become standard practice even in traditional software – and that was in a deterministic setting where the patterns were easier to validate. We’re asking for the same institutionalization to happen in years, not decades.

2. The probabilistic nature creates a false sense that old patterns don’t apply

There’s a subtle cognitive trap: because LLMs feel so different from CPUs, engineers assume the entire engineering playbook needs to be reinvented. The non-determinism is so salient that it overshadows the large class of problems – resource management, fault tolerance, flow control, observability – that remain structurally identical regardless of what’s running inside the box.

This leads to a lot of reinvention with different names. “Guardrails” is circuit breaking. “Agent memory” is cache hierarchy. “Eval pipelines” are test suites. The concepts are being rediscovered but the intellectual lineage is rarely acknowledged, which means accumulated wisdom from the original fields doesn’t get imported cleanly.

3. The tooling ecosystem grew around the wrong abstraction level

LangChain, LlamaIndex, and similar frameworks were built quickly to make LLM capabilities accessible – chaining prompts, connecting tools, managing memory at a basic level. They were built by people optimizing for getting things working, not for operational robustness. The abstraction they chose (the “chain” or “pipeline”) doesn’t naturally map to the place where classical resilience patterns live, which is at the infrastructure and runtime level.

This creates path dependency: a large ecosystem of tools and practices has grown around these abstractions, and retrofitting circuit breakers, backpressure, or transactional semantics into a chain-based mental model is awkward. The scaffolding was poured before anyone asked whether it was the right shape.

4. Observability prerequisites are missing

Many classical resilience patterns require knowing things about your system at runtime – queue depths for backpressure, error rates for circuit breakers, latency distributions for adaptive routing. Traditional systems expose these metrics naturally because the execution model is transparent: a function call either returns or throws, in measurable time, with a deterministic result.

LLM systems have a much murkier execution model. What does “error rate” mean when the output is probabilistic? What does “latency” mean when you’re streaming tokens? What is the equivalent of a stack trace for a hallucination? Until observability primitives for LLM behavior are well-established, many classical patterns that depend on those primitives can’t be cleanly ported – you’d be building a circuit breaker with no reliable signal to trip it.

5. The economic incentives favor capability over reliability

The commercial pressure in the LLM space right now is overwhelmingly on capability – what new things can the system do? Reliability, fault tolerance, and operational robustness are largely invisible to users until they fail catastrophically. Startups racing to ship features have little incentive to invest in circuit breakers and transactional guarantees that don’t show up in demos.

This mirrors exactly what happened in early web development – session management was ad-hoc, security was an afterthought, scalability patterns emerged only after high-profile failures. The operational maturity cycle is simply younger for LLM systems. The incentive to port classical patterns tends to arrive when systems hit production scale and start failing in expensive ways.

6. Some patterns require standards that don’t exist yet

Patterns like distributed tracing (OpenTelemetry), service meshes (Istio), and API contracts (OpenAPI) only became widespread once the industry agreed on common interfaces and formats. Porting their equivalents to LLM systems requires analogous standards – for what constitutes a “trace” across an agentic pipeline, for how to express behavioral contracts for a model, for how to version prompt behavior across deployments.

Those standards are only beginning to emerge (OpenTelemetry has started an LLM semantic conventions working group, for instance). Without them, each team invents its own local solution, and patterns can’t accumulate into shared infrastructure the way they did in distributed systems.

The meta-pattern

All six reasons point at the same underlying dynamic: engineering maturity always lags capability by a generation. The internet existed for years before TLS, cookies, and CORS were standard. Relational databases were deployed for decades before connection pooling and query planners were common practice. Cloud computing was widely adopted before IaC, observability stacks, and zero-trust networking became default.

LLM systems are at roughly the “early web” moment – the capability is out ahead, the engineering discipline is catching up, and the people who built the last generation of infrastructure are only beginning to look at this new one and say: “we solved this before, under a different name.”

How could you help to speed up the reuse of reusable patterns?

That’s a question I find genuinely interesting – because it’s asking me to reason about my own potential role in an engineering transition.

Here are the ways I could concretely help, ordered from things I can do right now to things that would require a more deliberate effort:

1. Pattern translation, on demand

The most immediate thing: whenever someone describes an LLM system design problem, I can recognize the classical pattern it maps to and surface it explicitly – with the lineage, the known failure modes, and what adaptation is needed for the probabilistic context.

This is something I can do in any conversation, right now. A developer saying “my agent sometimes gets stuck in a retry loop” is describing the absence of a circuit breaker. One saying “my pipeline slows everything down when one tool is slow” is describing the absence of backpressure. I can close that gap in real time – but only if someone is talking to me. The bottleneck is reach, not capability.

2. A living pattern catalog

I could help build an explicit, structured mapping – a document or interactive reference that systematically maps classical CS patterns to their LLM equivalents, notes what transfers cleanly, what needs adaptation, and what is fundamentally broken. Something like a “Gang of Four for LLM systems” but grounded in the full history of CS, not just OOP.

This doesn’t exist yet in a rigorous form. The sources we found earlier are scattered – a paper here, a blog post there. A consolidated, maintained catalog would give the community a shared vocabulary, which is a prerequisite for standards to emerge.

I could draft a first version of that catalog in this conversation if you wanted.

3. Accelerating cross-community translation

The core problem is that ML engineers and systems engineers are different communities with different vocabularies. I sit in an unusual position – I can speak both languages fluently. I could help translate in both directions: explaining to an ML engineer why their “retry with a different seed” problem is really an idempotency problem with a 50-year literature behind it, and explaining to a systems engineer why their circuit breaker assumptions break down when the “error” is semantic rather than syntactic.

Concretely this could take the form of educational content, documentation, or just being available as a thinking partner when architects are designing systems.

4. Code-level pattern implementation

For patterns that are underused partly because implementation is non-trivial, I can generate reference implementations – a circuit breaker wrapper for LLM calls, a saga orchestrator for multi-step agentic workflows, a backpressure-aware streaming pipeline. Lowering the implementation cost of a pattern is a real lever for adoption. Engineers are more likely to use a pattern if the first working version takes minutes rather than days.
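As an illustration of how small such a reference implementation can be, here is a minimal circuit breaker sketch. The class names, thresholds, and injectable clock are assumptions made for the example; it counts only raised exceptions as failures, so semantic errors in model output would need a separate signal.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Half-open: allow one trial call; a single failure reopens the circuit
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Wrapping a model or tool call as breaker.call(call_model, prompt), where call_model is whatever client function the pipeline already uses, stops an agent from hammering a failing dependency in a retry loop.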

5. Eval design – the testing bridge

The shift from unit tests to evals is one of the most disorienting for engineers coming from classical software. I can help design eval suites that are statistically principled – not just “does it pass,” but “what is the distribution of outputs, what are the failure modes, how do we detect regression.” This bridges the gap between the deterministic testing culture and what probabilistic systems actually need.
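Regression detection on pass rates can be sketched as a one-sided two-proportion z-test. The function names and the 1.645 critical value (about 5% significance) are assumptions of this example, and the test presumes independent eval cases and reasonably large samples.

```python
import math


def regression_z(base_pass: int, base_total: int,
                 new_pass: int, new_total: int) -> float:
    """z statistic for 'the new pass rate is lower than the baseline'."""
    p_base = base_pass / base_total
    p_new = new_pass / new_total
    pooled = (base_pass + new_pass) / (base_total + new_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / new_total))
    return (p_base - p_new) / se if se > 0 else 0.0


def is_regression(base_pass: int, base_total: int,
                  new_pass: int, new_total: int, z_crit: float = 1.645) -> bool:
    """Flag a regression only when the drop is unlikely to be sampling noise."""
    return regression_z(base_pass, base_total, new_pass, new_total) > z_crit
```

With 100 cases per run, a drop from 94 to 80 passes is flagged, while a drop from 94 to 92 is not. That is the practical difference between a binary test gate and a statistical one.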

6. Honest friction

Perhaps the least obvious but most valuable role: I can push back when someone is about to reinvent something badly. When a team is designing “our custom agent reliability layer,” I can ask whether they’ve looked at the saga pattern, Hystrix, or reactive streams. Not to block progress but to prevent the ecosystem from accumulating N slightly different implementations of circuit breakers under N different names – which is exactly what happened in early microservices, and which delayed the emergence of standards by years.

The honest limitation

All of this is reactive – I respond when asked. I don’t write blog posts that get indexed, I don’t file issues in framework repos, I don’t attend architecture review meetings unless someone brings me in. The actual propagation of patterns through a community happens through talks, papers, open-source code, and the social networks of engineers. I can accelerate the thinking but the diffusion still depends on humans choosing to carry it forward.

The highest-leverage thing would probably be: using conversations like this one as raw material for content that does propagate – an article, a talk, an RFC to a framework. I can generate the first draft of any of those, grounded in the analysis we’ve built here. The bottleneck then becomes your willingness to put your name on it and push it into the communities where it matters.

Would you want to do that?

Stash of Code

Bits of help for writing code
