Most teams integrating LLMs into their products spend a disproportionate amount of time choosing the model. GPT-4o vs Claude vs Gemini. Benchmarks, pricing spreadsheets, vibes from Twitter. Then they spend another few weeks wiring up the API. And then they ship, and the results are disappointing, the LLM confidently gives wrong answers, it forgets things users told it two messages ago, it ignores the docs they carefully loaded in.
The diagnosis they reach is usually “we need a better model.” The real diagnosis is almost always “we need better context.”
The model is not the hard part. A modern frontier model is extraordinarily capable at the task you’re giving it. What it cannot do is reason well over bad inputs. Garbage in, garbage out is not a cliché here, it’s the central engineering problem. And the teams shipping good AI features are not shipping better models. They’re shipping better context pipelines.
What context actually means
The context window is everything the model can see when it generates a response. That includes your system prompt, any documents or data you’ve injected, the conversation history, and the user’s current message.
Most teams treat the context window like a bucket. They throw in what seems relevant and let the model sort it out. This works fine in demos. In production, it’s the source of almost every quality problem.
The model doesn’t have opinions about what’s important in its context. It doesn’t downweight the stale parts or flag the contradictions. If you put noise in, the model will reason over the noise as if it were signal. If you put in too much, the model will start losing track of things buried in the middle. If you put in nothing beyond the user’s message, the model will hallucinate to fill the gap.
These are not model failures. They are context failures.
The three ways teams get context wrong
They treat retrieval as a checkbox.
Retrieval-augmented generation, pulling relevant documents and injecting them into the prompt has become the standard answer to “how do we give the LLM our data.” The problem is that most teams implement retrieval once, find that it sort of works, and move on. They don’t treat it as a precision problem.
Bad retrieval is worse than no retrieval. If your retrieval pipeline surfaces the wrong chunks, the model will reason over them with full confidence. It won’t say “I couldn’t find anything relevant.” It will take the nearest document it was handed and construct an answer from it. The result is a hallucination that looks exactly like a grounded answer.
The failure mode here isn’t a missing feature. It’s insufficient investment in the plumbing. Chunking strategy, metadata filtering, result reranking, these are the things that determine whether your retrieval is actually giving the model useful signal. Teams that build RAG in an afternoon and declare it done are almost always in this camp.
They let the conversation history grow without managing it.
The naive implementation of a chat interface is to append every message to history and pass the whole thing to the model. This works until it doesn’t. Eventually the history gets long enough to push important context out of the effective attention window, and the model starts behaving as if it forgot earlier parts of the conversation. Users notice immediately.
The more subtle failure happens earlier: even before you hit the context limit, a long unmanaged history introduces noise. If the user changed their mind three messages ago, the earlier messages are now contradictory context the model has to resolve. It usually doesn’t resolve it cleanly.
Managing conversation history is not a feature teams usually plan for. It shows up as a bug later. Keeping a sliding window of recent turns, periodically summarising older segments, or selectively retrieving only the turns semantically relevant to the current query, these all work, but they require deliberately treating history as something you engineer, not something you accumulate.
They write the system prompt once and never touch it.
The system prompt is where you tell the model who it is, what it knows, and how it should behave. Most teams write one at the start of the project and treat it as configuration rather than code. This is a mistake.
The system prompt is the highest-leverage thing you can change to improve output quality. More than the model version. More than temperature settings. A well-engineered system prompt that gives the model precise constraints, what to do when it doesn’t know, what format to use, what topics to refuse, will outperform a vague one regardless of what model is underneath it.
The reason teams don’t iterate on system prompts is that they don’t have evals. If you have no way to measure whether a change to the prompt made things better or worse, you’ll stop experimenting after the first version. Building even a small set of test cases against which you can score prompt changes will immediately make your iteration tighter. This is the single most underrated investment in LLM quality.
The model question is not unimportant
None of this means model choice doesn’t matter. There are real differences between frontier models in how they follow instructions, how they handle long contexts, how they behave with structured outputs. For some applications, extended reasoning, tool use, code generation, the model is genuinely a significant variable.
But for most web app integrations, the model is already more than capable enough. The marginal quality gain from switching models is usually smaller than the gain from fixing a retrieval pipeline that’s returning the wrong documents, or from tightening up a system prompt that’s giving the model contradictory instructions about how to respond.
Teams reach for model upgrades because they’re easy to frame as a decision and they feel like they’re doing something. Context engineering is less legible. There’s no button to press. It requires instrumenting what’s going into your prompts, reading them, and thinking carefully about whether they’re set up to produce good outputs. It’s closer to debugging than to procurement.
What to actually do
If your LLM integration is producing bad outputs, the first thing to do is read your prompts. Not the code that builds them, the actual strings that get sent to the model. Log them. Look at them. Most teams have never done this.
What you’ll usually find is that the context is a mess: contradictory instructions, retrieved chunks with no relation to the user’s question, conversation history that’s longer than it should be, and a system prompt that’s accumulated three months of patches without any coherent structure.
Fix the context before you change the model. In almost every case, that’s where the problem is.
You may also like : Smaller Programmes, Better ROI: The Case for Change
