The bottleneck in enterprise AI isn't the model. It's the context layer underneath.
Every team building AI inside a company hits the same wall: the model doesn't know who you are. That's a context problem, not a model problem, and it's the bottleneck holding most enterprise AI back from being genuinely useful rather than just technically impressive.
Every team building AI inside a company arrives at the same conversation eventually. The model is impressive. The demos are great. The internal pilot looked promising. But when we try to make it useful for our actual work, with our actual data, it falls apart. The model doesn’t know who our customers are. It doesn’t know what we discussed in last week’s call. It doesn’t know which version of the contract is current. We’ve built a beautiful brain that has no memory of who we are.
This is not a model problem. The models are remarkably capable now, and getting more so. This is a context problem, and it’s the bottleneck holding most enterprise AI back from being genuinely useful rather than just technically impressive.
I’ve been thinking about this layer carefully over the last few months, partly because I’ve shipped a small piece of it myself, mostly because I think the next decade of how AI gets used inside companies will be determined by who solves it well. This post is what I’ve learned. It’s not a vendor pitch. I don’t work for any of the companies I’ll mention. It’s an engineer’s view of what the actual problem looks like, why the dominant solution today is structurally limited, and what the field is converging toward instead.
The amnesia problem and why we need a context layer
The fundamental thing to understand about large language models is that they have no memory between calls. Every time you send a message to Claude, GPT, Gemini, or any frontier model, the model processes the entire input completely fresh. Nothing persists from previous calls. The model knows nothing about you, your company, your previous conversation, your data, your intentions, unless that information is included in the current request.
The technical name for everything you send the model in a single request is the context window. It has a hard upper limit, measured in tokens. Two years ago that limit was 4K tokens, roughly four pages of text. Today the frontier is pushing past two million tokens, thousands of pages. The limit keeps rising, and the cost per token keeps dropping. But two facts remain stubbornly true. First, the model is still amnesiac outside that window. Second, the information that actually exists in your company (in your CRM, your inbox, your docs, your code, your meetings) is vastly larger than any context window will ever be.
The whole engineering challenge of building useful AI on top of these models reduces to a single question. What do you put in the context window for any given query?
This question is the gateway to everything else. Whether you’re building a customer support agent, a coding assistant, an enterprise search tool, or a multi-agent system, you are answering this question, explicitly or implicitly. The field of what people now call “context engineering” exists because the naive answer doesn’t work.
The naive approach, and why it fails
The naive answer is to stuff everything in. Got documentation? Paste it. Got customer data? Inject it. Got code? Include all the files. Modern context windows are huge; just use them.
This breaks at scale for three reasons that anyone who has tried it has lived through.
First, cost is roughly linear in input length. A 100K context costs about thirty-three times what a 3K context costs. Multiply that by every user query and you’ve built a product that bleeds money on every interaction.
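The arithmetic behind that ratio is easy to sketch. The price below is a made-up placeholder, not any provider’s actual rate, and the function assumes strictly linear input pricing; only the ratio matters.

```python
# Hypothetical price per million input tokens; an assumption for illustration only.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost of the input side of one request, assuming strictly linear pricing."""
    return tokens / 1_000_000 * price_per_million


small = input_cost(3_000)
large = input_cost(100_000)
ratio = large / small

print(f"3K context:   ${small:.4f} per query")
print(f"100K context: ${large:.4f} per query")
print(f"ratio:        {ratio:.1f}x")
```

Whatever the real price is, it cancels out of the ratio, which is why the multiplier holds regardless of provider.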
Second, latency scales with input length. A query that should feel instant takes four seconds. Users notice. Conversion drops.
Third, and most subtle, attention degrades with context size. The research paper that named this phenomenon, “Lost in the Middle” by Liu et al., showed that models use information best when it sits near the beginning or end of the input, and noticeably worse when it is buried in the middle of a long context. Even when you can technically fit 200K tokens, the facts in the middle stretch are the ones most likely to be missed. You’re paying for tokens that may not help.
So you can’t dump everything. You have to retrieve only the most relevant slice for each query. The dominant pattern for doing that, the one almost every LLM application uses today, is Retrieval-Augmented Generation. RAG.
RAG: the pattern that built modern LLM applications
The RAG pipeline, in its simplest form:
- Take your source content. Split it into chunks (typically 200-1000 tokens each).
- For each chunk, generate an embedding using an embedding model. An embedding is a dense vector, typically hundreds to a few thousand dimensions (1536 for OpenAI’s text-embedding-3-small, for example), that captures the semantic content of that chunk so that similar meanings produce similar vectors.
- Store the embeddings in a vector database (Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, dozens of others).
- When the user asks a question, embed the question with the same model.
- Search the vector database for the K most similar chunks to the question. Cosine similarity is the standard metric.
- Inject those top K chunks into the model’s context alongside the question.
- The model now has relevant information to answer.
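The steps above can be sketched end to end in a few dozen lines. This is a toy under stated assumptions, not a production pipeline: `embed` is a bag-of-words stand-in for a real embedding model, `ToyVectorStore` is an in-memory stand-in for a vector database, chunking is by words rather than tokens, and the sample documents are invented.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into fixed-size word windows (real systems usually count tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.rows.append((embed(text), text))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)  # the query is embedded the same way as the chunks
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject the retrieved chunks into the context alongside the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Invented sample content.
documents = [
    "Pricing for the enterprise plan is negotiated per seat each year.",
    "The onboarding guide covers single sign-on and user provisioning.",
    "Support tickets are triaged within one business day.",
]
store = ToyVectorStore()
for doc in documents:
    for piece in chunk(doc):
        store.add(piece)

question = "How is enterprise pricing negotiated?"
retrieved = store.top_k(question, k=1)
prompt = build_prompt(question, retrieved)
print(prompt)
```

Swapping `embed` for a real embedding model and `ToyVectorStore` for a real vector database changes the quality of the results, not the shape of the pipeline.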
This pattern is the foundation of most modern LLM applications. Every enterprise AI startup, every “chat with your PDF” tool, every customer support bot, every coding assistant uses some variant. I’ve used it myself. I shipped a movie recommendation app called Reellette last year that ran exactly this stack: chunks of metadata, OpenAI embeddings, ChromaDB for storage, top-K retrieval at query time. For “find me something similar to Inception” the pattern worked beautifully. For a class of problems, vector RAG is genuinely the right tool.
And then there’s the much larger class of problems where it isn’t. Those are the problems that matter most in enterprise software, and they’re the ones quietly killing the AI initiatives inside most companies.
Where vector RAG breaks down for real-world data
Three structural limitations show up consistently once you push vector RAG past the demo stage.
Similarity is not the same as relationship
The kind of query vector RAG handles well is the textbook case. “Find me documents about pricing.” Embed the question, find chunks that talk about pricing, return them. Easy.
The kind that breaks is relational. “What did Acme Corp say about pricing in our last call?” “Tell me about Acme Corp’s recent positioning shifts.” “Which of our customers have churned after raising similar concerns to Acme?” These look superficially like search queries but they’re structurally different. They require understanding how entities relate to each other, how facts connect, how events sequence. Vector similarity flattens all of those dimensions into a single scalar of textual closeness.
I noticed a small version of this when I was building Reellette. The system was great at “movies similar to Inception” but broke on “movies like Inception but darker” or “movies that use the same writer’s storytelling style.” Vector similarity couldn’t capture the dimensions users actually cared about. At consumer scale, the consequence was a slightly disappointing recommendation. At enterprise scale, the equivalent failure means the AI can find documents that mention Acme but cannot answer who at your company has worked with them, when the relationship started, or how it’s evolved.
The same entity wears different names everywhere
In real enterprise data, the same person, customer, product, or contract appears in different systems with different identifiers. “John Smith” in Salesforce, an email address in Gmail, “Smith” in Slack, “JS” in a meeting transcript. A naive context system treats these as four different entities. A correct context system recognizes them as one.
The problem has a name: entity resolution. It’s the hardest unsolved problem in the context layer. The reason it’s hard is that you almost never have a deterministic key shared across systems. If every tool tagged John with the same global UUID, the problem would be trivial. In the real world, it’s almost never trivial. So you have to infer identity from circumstantial evidence: similar names, overlapping domains, shared context, the fact that “Smith” was mentioned in a thread that also referenced Acme. Every clue is probabilistic. The system has to weigh them and decide whether the confidence is high enough to merge.
This is a precision-versus-recall tradeoff with no free lunch. Merge too aggressively and you’ll combine two genuinely different John Smiths into one corrupted record. Merge too cautiously and you fragment one person across multiple entries, missing every connection.
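A minimal sketch of that tradeoff, assuming a deliberately crude scorer. The `Mention` fields, the point weights, and the thresholds are all hypothetical; a real resolver would use many more signals and calibrated probabilities.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    source: str
    name: str
    company: Optional[str] = None
    email_domain: Optional[str] = None


def last_name(name: str) -> str:
    return name.lower().split()[-1]


def match_score(a: Mention, b: Mention) -> int:
    """Add up weak signals as points out of 100. The weights are arbitrary assumptions."""
    score = 0
    if a.name.lower() == b.name.lower():
        score += 50
    elif last_name(a.name) == last_name(b.name):
        score += 30
    if a.email_domain and a.email_domain == b.email_domain:
        score += 30
    if a.company and a.company == b.company:
        score += 20
    return min(score, 100)


def should_merge(a: Mention, b: Mention, threshold: int) -> bool:
    return match_score(a, b) >= threshold


crm = Mention("Salesforce", "John Smith", company="Acme")
slack = Mention("Slack", "Smith", company="Acme")            # same person, partial name
other = Mention("Salesforce", "John Smith", company="Globex")  # a different John Smith

for threshold in (40, 60):
    print(
        threshold,
        "same person merged:", should_merge(crm, slack, threshold),
        "different people merged:", should_merge(crm, other, threshold),
    )
```

With these toy weights both pairs score the same, so the low threshold merges the two different John Smiths and the high threshold fragments the real one. No threshold fixes that; only more evidence does.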
I ran into a small version of this on Reellette too. The same film often appeared under slightly different titles or IDs across metadata sources, and deduplication was a constant low-grade headache. At my scale it was a UX papercut. At enterprise scale, the same structural problem becomes an access control leak when two real people get merged, or a corrupted account picture when one person gets fragmented. The difficulty is identical. The stakes differ by orders of magnitude.
Vector search structurally cannot answer relational questions
This is the deepest limitation, and it’s the one that motivates the move beyond RAG entirely. Suppose you ask: “Find me the person at our company who’s connected to Acme through a shared board member.” There is no document in your data that is textually similar to that question. The answer isn’t in the words. The answer is in the structure of the connections themselves. You’d find the answer by walking from Acme to its board, from the board to the people on it, from those people to the companies they’re affiliated with, and from those companies back to your team.
Vector search can’t do that walk. It can find text similar to “board member” or “Acme” but it cannot follow the chain. The information exists in your data, but it lives in the relationships between entities, not in any individual chunk of text. Asking a vector store to answer this query is asking the wrong tool the wrong question.
The graph turn
The response to these limitations, increasingly visible in research and in product across the field, is to move from flat vector representations to structured graph representations.
A knowledge graph represents information as a network of nodes (entities) connected by edges (relationships). Both can carry properties: a person node might have a name, email, and role; a “signed contract” edge might have a date, value, and renewal status. The graph isn’t just a different storage format. It’s a different way of thinking about what your data actually is.
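As a data structure, that description is small. The sketch below is an in-memory property graph for illustration; the class names, the `kind` and `rel` labels, and the sample values are assumptions, not the schema of any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    id: str
    kind: str
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    rel: str
    dst: str
    props: dict = field(default_factory=dict)


class KnowledgeGraph:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, edge: Edge) -> None:
        # Refuse edges that point at entities the graph does not know about.
        if edge.src not in self.nodes or edge.dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(edge)

    def edges_from(self, node_id: str, rel: Optional[str] = None) -> list[Edge]:
        """Outgoing edges of a node, optionally filtered by relationship type."""
        return [e for e in self.edges if e.src == node_id and (rel is None or e.rel == rel)]


graph = KnowledgeGraph()
graph.add_node(Node("john", "person", {"name": "John Smith", "role": "primary contact"}))
graph.add_node(Node("acme", "company", {"name": "Acme Corp"}))
graph.add_node(Node("us", "company", {"name": "Our company"}))

# Properties live on the relationship itself, not only on the entities.
graph.add_edge(Edge("john", "CONTACT_AT", "acme"))
graph.add_edge(Edge("acme", "SIGNED_CONTRACT", "us", {"signed": "March", "renewal": "September"}))

signed = graph.edges_from("acme", "SIGNED_CONTRACT")
print(signed[0].props["renewal"])
```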
The intuition is simple. A vector store treats each piece of information as an isolated point in semantic space. You find points near other points by similarity. But the real value in most enterprise data isn’t in the points; it’s in the relationships between them. Acme is a customer. John is the contact at Acme. They signed in March. Renewal is in September. They’ve been increasingly active in support. They mentioned a competitor in last week’s call. Each of those facts is a relationship between entities. None of them is captured by vector similarity. All of them are captured by a graph.
GraphRAG, in its various forms, combines this graph structure with the retrieval logic of traditional RAG. Instead of just finding the K most similar chunks to a query, the system does both. It uses vector similarity to find semantically relevant entities, then traverses the graph from those entities to gather connected context. The result is a structured slice of company state that’s coherent and connected, not a pile of text blobs that happen to be textually close.
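A rough sketch of that two-step retrieval, under heavy simplification. Word-overlap (Jaccard) similarity stands in for embedding similarity, the entity descriptions and edges are invented, and this is not how Microsoft’s GraphRAG or any named product implements it.

```python
import re

# Invented entities with short text descriptions.
descriptions = {
    "acme": "Acme Corp, enterprise customer with pricing concerns",
    "john": "John Smith, primary contact",
    "call_1": "Call transcript: competitor mentioned, pricing pushback",
    "globex": "Globex, logistics prospect",
}
# Invented relationships as (source, relation, destination).
edges = [
    ("john", "CONTACT_AT", "acme"),
    ("call_1", "CALL_WITH", "acme"),
    ("call_1", "ATTENDED_BY", "john"),
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap, a crude stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def seed_entities(query: str, k: int = 1) -> list[str]:
    """Step 1: find the entities most similar to the query."""
    ranked = sorted(descriptions, key=lambda e: similarity(query, descriptions[e]), reverse=True)
    return ranked[:k]


def expand(seeds: list[str], hops: int = 1) -> set[str]:
    """Step 2: walk the graph outward from the seeds to gather connected context."""
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        neighbors = set()
        for src, _rel, dst in edges:
            if src in frontier:
                neighbors.add(dst)
            if dst in frontier:
                neighbors.add(src)
        frontier = neighbors - selected
        selected |= frontier
    return selected


seeds = seed_entities("What did Acme say about pricing?")
context = expand(seeds, hops=1)
print(seeds, sorted(context))
```

Similarity alone ranks the call transcript below Acme and never surfaces John; the one-hop expansion pulls in both because they are connected to Acme, not because they are textually close to the question.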
A landscape of products is now betting on different versions of this. Microsoft published a research framework called GraphRAG that put the academic stake in the ground. Companies like Zep with Graphiti, Cognee in the open-source space, and Mem0 in agent memory specifically are building public variants. Some are graph-first. Some are hybrid vector-plus-graph. Some target agent memory; others target enterprise context broadly. The shared bet is the same: the graph is where the value lives.
Graph traversal, the operation that vector search structurally cannot do
The thing that makes graphs powerful is the operation called traversal. Once your data is structured as a graph, you can answer questions by walking the relationships rather than searching for textual matches.
Back to the question I posed earlier. “Find me the person at our company connected to Acme through a shared board member.” Put your finger on the Acme node. Walk outward along edges marked “board member” or “investor” to find shared connections. From those, walk back along “works at” edges to find your colleagues. The people at the end of that walk are your answer. You found them not by matching words but by following the structure of relationships.
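That walk can be written out literally. The sketch below assumes a tiny invented set of triples and hypothetical relationship names (`BOARD_MEMBER`, `AFFILIATED_WITH`, `WORKS_AT`); a real system would run the same pattern as a query in its graph database.

```python
# Invented (subject, relation, object) triples.
triples = [
    ("dana", "BOARD_MEMBER", "acme"),
    ("dana", "BOARD_MEMBER", "initech"),
    ("lee", "BOARD_MEMBER", "acme"),
    ("priya", "AFFILIATED_WITH", "initech"),
    ("priya", "WORKS_AT", "ourco"),
    ("sam", "WORKS_AT", "ourco"),
]


def subjects(rel: str, obj: str) -> set[str]:
    """Everything that points at obj through rel."""
    return {s for s, r, o in triples if r == rel and o == obj}


def objects(subj: str, rel: str) -> set[str]:
    """Everything subj points at through rel."""
    return {o for s, r, o in triples if s == subj and r == rel}


def colleagues_via_shared_board_member(target: str, our_company: str) -> dict[str, list[str]]:
    """Return each matching colleague together with the path that connects them."""
    found = {}
    for member in subjects("BOARD_MEMBER", target):                   # target -> its board
        for company in objects(member, "BOARD_MEMBER") - {target}:    # board member -> other companies
            for person in subjects("AFFILIATED_WITH", company):       # company -> affiliated people
                if our_company in objects(person, "WORKS_AT"):        # keep only our colleagues
                    found[person] = [target, member, company, person]
    return found


paths = colleagues_via_shared_board_member("acme", "ourco")
print(paths)
```

Nothing in this lookup compares text. The answer comes back with the path attached, which is also what makes it explainable.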
This is the central thing graphs unlock, and it’s why I’m increasingly convinced that the future of AI context isn’t bigger context windows. It’s structured context. A two-million-token context window does not help you answer the Acme question. A graph with the right entities and edges does.
The depth of the walk matters and is itself a piece of engineering judgment. One hop finds direct connections. Two hops find friends of friends. Three or four hops find increasingly distant connections, but the further you walk, the more results you get and the noisier they become. Real systems have to choose sensible traversal depths and weighting functions, both of which are active research areas.
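One simple way to express depth and weighting together is a breadth-first walk with a hop limit and a per-hop decay. The graph below is invented, and the decay of 0.5 per hop is an arbitrary assumption, one of many possible weighting functions.

```python
from collections import deque

# Invented undirected adjacency list.
adjacency = {
    "acme": ["john", "dana"],
    "john": ["acme", "call_1"],
    "dana": ["acme", "initech"],
    "call_1": ["john"],
    "initech": ["dana", "priya"],
    "priya": ["initech"],
}


def traverse(start: str, max_hops: int, decay: float = 0.5) -> dict[str, float]:
    """Breadth-first walk from start; each node is scored decay ** (hops from start)."""
    scores = {start: 1.0}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not walk past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in scores:  # first visit is the shortest path in BFS
                scores[neighbor] = decay ** (hops + 1)
                queue.append((neighbor, hops + 1))
    return scores


for depth in (1, 2, 3):
    result = traverse("acme", depth)
    print(depth, len(result), result)
```

Even on six nodes the result set grows with every extra hop while the scores of the new arrivals shrink, which is the noise-versus-reach tradeoff in miniature.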
The protocol layer: MCP
Even with a clean graph and rigorous entity resolution, there’s still a question of how AI tools actually consume the context you’ve built. Bespoke integrations have been the default until recently: every AI assistant gets its own custom connector to every data source, and the integration matrix explodes.
The Model Context Protocol, MCP, which Anthropic released in late 2024, is the most credible attempt at solving that. MCP is roughly USB for LLMs: a standardized way for any AI client (Claude, Cursor, n8n, custom agents) to talk to any data source (databases, SaaS APIs, custom stores) without bespoke integration. You build the context once, expose it via MCP, and any AI tool that speaks the protocol can consume it. This is a major shift from the prior world where every AI-to-data integration was a snowflake.
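To make “speaks the protocol” concrete: MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with a `tools/call` request. The sketch below only builds that request shape. The tool name `get_account_context` and its arguments are hypothetical, and the current spec should be treated as the authority on required fields.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request in the shape MCP uses to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "get_account_context" is a hypothetical tool, not part of any real server.
request = make_tool_call(1, "get_account_context", {"account": "Acme Corp"})
wire = json.dumps(request)
print(wire)
```

The point of the standard is that every client sends this same envelope, so the context layer behind the tool only has to be built once.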
The protocol itself is open, the spec is public, and the early adoption from both major model providers and major data sources has made it the dominant interoperability layer. If you’re building anything in this space, MCP is worth understanding deeply.
The caveat nobody talks about
If I were a vendor pitching this stuff, I’d stop here on the upbeat note. But I’m not pitching anything, so one caveat is worth stating plainly because it gets quietly skipped in most material about this layer.
More context doesn’t fix bad reasoning.
If the model can’t think its way through a problem, a richer context just lets it hallucinate with more confidence. A perfectly resolved graph with bitemporal accuracy (tracking both when a fact was true and when the system learned it) and clean access-control enforcement is necessary but not sufficient. The graph helps an AI that already knows how to think. It doesn’t make a weak reasoner into a strong one.
This matters because the limits of what current models can reason through are real, and they don’t move just because you give the model more data. Some questions models still answer poorly even with perfect context: open-ended multi-step planning, problems that require genuine numerical reasoning, anything where the answer depends on understanding intent rather than retrieving information. For those, the context layer isn’t the bottleneck. The model is.
Knowing when you have a context problem versus a reasoning problem is itself a skill. Context problems get solved with better retrieval and structure. Reasoning problems get solved by waiting for better models, by decomposing the task differently, or by accepting that the AI isn’t the right tool for this specific job. Confusing the two leads to enormous wasted engineering effort.
The honest framing is that context is a necessary substrate. Without it, even the best reasoning model is useless on your data. With it, a capable reasoning model becomes substantially more useful. But the substrate alone isn’t the answer. What you build on top, and what the underlying model can actually reason through, is the rest of the answer.
Where I’m taking this
The reason I’ve been thinking carefully about all of this is that I’m building a personal AI agent system under a7t.ai. The vision underneath that project has always been a small set of specialized sub-agents working from shared context: my calendar, my mail, my finance, my code. The earliest sketches assumed flat retrieval, and the more I built, the more obvious it became that flat retrieval wouldn’t get me where I wanted. The next architecture will treat context as a structured graph, with explicit entity resolution between my sources and traversal as the primary retrieval operation.
The shipped piece I have so far, Reellette, was the smallest possible version of the broader pattern. It taught me where vector similarity hits its ceiling and made the case for graph structure concrete in my own hands. The system I’m building next is a few steps further along the same arc, and I’ll write up the architecture in a separate post once it’s running cleanly.
The bigger picture is that I think we’re in early innings on context infrastructure. Vector RAG is the assembly language of LLM applications: useful, ubiquitous, structurally limited. The next layer (graphs, entity resolution, MCP, traversal logic, multi-tenant permissions) is being built right now, in public, by a few dozen companies and a few hundred engineers worldwide. Some of those bets will work and some won’t. The ones that work will define how AI gets used inside companies for the next decade.
That’s a problem worth understanding deeply, even if you’re not building the infrastructure layer yourself. Especially if you’re using AI to build anything that matters.
Resources I’ve found useful
- “Lost in the Middle: How Language Models Use Long Contexts” by Liu et al., 2023. The paper that named the attention degradation problem.
- “Unlocking Data with Generative AI and RAG” by Keith Bourne, Packt 2024. The most practical hands-on book I’ve found on production RAG patterns.
- Microsoft GraphRAG. Research framework and accompanying paper from Microsoft Research, the academic grounding for the graph-based approach.
- Anthropic Model Context Protocol. The open standard for AI-to-data connections.
- “Context Engineering” essays. Search this term to find the moving frontier. Substack and engineering blogs throughout 2025 and 2026 have surfaced the most current thinking.