Vikram Srinivasan
18 May 2026
8 mins
Information Intelligence

Context Is Becoming the Real Infrastructure Layer for Enterprise AI

Summary
Enterprise AI is moving fast from single-turn assistants to multi-step autonomous agents, and that shift makes the context layer the most critical piece of infrastructure in the stack. Agents do not just answer questions. They take sequences of retrieval-dependent actions where each step's output becomes the next step's input, so context is not a one-time input. It is required at every step. If an agent is fed the wrong context at any single step, the failure is not contained. It cascades downstream, corrupting every subsequent step, until the final output is fluent, well-cited and wrong. Single-turn AI failures were never catastrophic in this way.

The math makes it concrete. A 95% accurate retriever delivers a correct full chain only 60% of the time across ten steps. An 82% retriever falls to just 14%. Needl addresses this through lexical-first hybrid retrieval, query-time chunking, schema-only structured ingestion and runtime permission enforcement.

FinanceBench results depend heavily on how the benchmark is run. Some systems run it on only the specific SEC 10-K filings relevant to the questions being asked. When the answer is in a small pre-selected pile, retrieval is trivial. Needl runs it against the full SEC corpus with no pre-selection, which is the actual enterprise retrieval problem. Needl scores 97.24%. A leading AI search competitor scores 82.76%. In a 5-step agentic workflow, that gap translates to an 86% success rate versus 37%. That is the difference between a system that can be deployed and one that cannot.


The scenario nobody is talking about

An agent is running a 10-step credit workflow. At step 3, it retrieves plausible-looking context from a SharePoint folder, a Slack thread and a redlined PDF. The SharePoint memo is ranked low because the file uses a deal codename. The Slack thread surfaces the wrong conversation. The PDF was never indexed. The agent reasons precisely about what it has and moves forward. By step 10, it produces a fluent, well-cited, incorrect recommendation. Nobody flags it, because the answer looks right.

This is not a reasoning failure. The model reasoned correctly. It is a context failure, and it is the central unsolved problem in enterprise AI. In a multi-step agentic workflow, a single bad retrieval does not produce a single wrong answer. It poisons every step that follows.

Why agentic workflows change everything

The stakes could not be simpler to state: agents require context. In a multi-step agentic workflow, if the agent is fed the wrong context at any step, there is catastrophic downstream failure. A reliable, accurate context layer is not a feature of a well-built agent. It is the critical intermediary that makes the agentic world function at all.

Agents are not better chatbots. They take sequences of retrieval-dependent actions where each step's output becomes the next step's input. In a single-turn interaction, a retrieval error produces a wrong answer and the user retries. In a multi-step workflow, a retrieval error at step 2 corrupts step 3, which corrupts step 4, and so on. The error does not announce itself. It propagates.


The math is unforgiving. A retrieval step correct 95% of the time looks reliable in isolation. Across five sequential steps, end-to-end accuracy falls to 77%. Across ten steps it falls to 60%. For an 82% per-step retriever, ten steps leaves you at 14%. The agent is not failing because it cannot reason. It is failing because it was given the wrong inputs, repeatedly, with no way to detect it.
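
The arithmetic behind these figures can be reproduced in a few lines. The sketch below assumes each retrieval step succeeds or fails independently of the others, which is the simplification the percentages above imply; the function name is illustrative only.

```python
def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a sequential chain retrieves correctly.

    Assumes each step succeeds independently (a simplification of this sketch).
    """
    return per_step_accuracy ** steps


for accuracy, steps in [(0.95, 5), (0.95, 10), (0.82, 10)]:
    rate = chain_success_rate(accuracy, steps)
    print(f"{accuracy:.0%} per step over {steps} steps -> {rate:.0%} end to end")

# 95% per step over 5 steps -> 77% end to end
# 95% per step over 10 steps -> 60% end to end
# 82% per step over 10 steps -> 14% end to end
```

Because the per-step accuracy is raised to the power of the chain length, small differences in retrieval quality compound quickly as workflows get longer.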

These failures are frequently invisible to the people consuming the outputs. That makes context failure a governance issue, not just an engineering one.

Reliable reasoning without reliable context produces fluent errors at scale. The harder problem is the context layer itself.


Why common architectures fall short

MCP servers wired to LLMs

Exposing each SaaS application through an MCP server feels composable and needs minimal infrastructure. It does not work at scale. Slack ranks recent conversational matches. SharePoint handles keyword and metadata lookups. Salesforce handles record retrieval by object type. None was designed to feed an analytical agent operating across multiple steps. When a query crosses three or four such systems, the answer that reaches the model is whatever each underlying engine returns, glued together, weighted by nothing and normalized against nothing. Accuracy is bounded by the weakest indexer in the stack.

A concrete failure. The query is "What did our team decide about the Acme covenant amendment?" Slack returns the relevant thread ranked third. SharePoint returns the final memo ranked low because the title uses a deal codename. The legal system returns nothing because the redline lives inside a PDF that was never indexed. The agent produces a confident, fluent, wrong summary and carries it forward into subsequent steps.

Vector-only RAG

A 2025 Google DeepMind paper (Theoretical Limitations of Embedding-Based Retrieval, Weller et al.) established a hard ceiling: with 512-dimensional embeddings, retrieval degrades meaningfully around 500K documents. At 4096 dimensions you reach roughly 250 million. Enterprise corpora routinely exceed both, while BM25-style lexical retrieval scales through that ceiling without degrading. Vector RAG can work on a small curated knowledge base. It cannot deliver the accuracy a credit committee requires on a hundred-terabyte corpus. Full paper at AlphaXiv.

Vector retrieval is also opaque. There is no scoring trace to read, no term-level signal to tune, and no way to bridge "this chunk matched" with "here is why it matched." For a system whose first commitment is auditability, that is disqualifying.

Context windows are not the answer either

As McKinsey's explainer on context windows makes clear, the context window is working memory for a single interaction, not a storage layer. A typical enterprise has hundreds of terabytes of unstructured content, and putting the model in charge of locating relevant fragments inside an enormous undifferentiated input is a strictly harder version of the problem retrieval is built to solve, with worse latency, higher cost and lower accuracy.

| Dimension | Common patterns | What actually works |
| --- | --- | --- |
| Scoring | Bounded by weakest SaaS search | Uniform retrieval plane the operator controls |
| Scale | Vector RAG hits theoretical ceiling | Lexical-first scoring scales without ceiling |
| Chunking | Boundaries drawn before the query exists | Query-time chunking shaped to the question |
| Audit | No inspectable scoring trace | Term-level scoring trace for audit |
| Permissions | Index can drift from source ACL | Runtime enforcement, never precomputed |

How Needl engineers the context layer

Needl is organized in four layers: three that index, retrieve and route, plus a permission layer that mediates all of them and never drifts from source-system ACL state.

Indexing plane: Crawls, parses, normalizes and indexes content from heterogeneous sources under explicit freshness, fairness and observability SLAs. Unstructured content goes into a lexical index and an object store with full provenance (user view, ACL snapshot, version, checksum). Structured systems contribute only schema and metadata, never rows.

Query-time retrieval and ranking: Returns ranked, explainable results via lexical-first hybrid retrieval over the unstructured index and runtime SQL generation against structured sources. A query understanding layer adds semantic awareness without depending on per-customer behavioral data.

Agentic routing and verification: Decomposes mixed queries into structured and unstructured sub-queries, dispatches each to a specialized sub-agent under an inner ReAct loop, and verifies completeness through an outer ReAct loop before returning to the caller.

Permission layer: Every result is joined at runtime against the source system's access controls. No precomputed denormalization. Access revoked at 9:01 AM is gone at 9:02 AM.
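
To make the permission layer concrete, here is a minimal sketch of runtime enforcement. It is not Needl's implementation: `SOURCE_ACL`, `can_read` and `enforce_permissions` are hypothetical names, and the in-memory dictionary stands in for a live call to the source system's access-control API. The point it illustrates is that nothing about access is stored with the index, so a revocation takes effect on the next query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float


# Hypothetical stand-in for the source system's live ACL state
# (in a real deployment this would be a call to the source system).
SOURCE_ACL: dict[str, set[str]] = {
    "memo-17": {"alice", "bob"},
    "redline-4": {"alice"},
}


def can_read(user: str, doc_id: str) -> bool:
    """Ask the source of truth at query time; nothing is cached in the index."""
    return user in SOURCE_ACL.get(doc_id, set())


def enforce_permissions(user: str, hits: list[Hit]) -> list[Hit]:
    """Drop every hit the user cannot read right now, preserving rank order."""
    return [hit for hit in hits if can_read(user, hit.doc_id)]


hits = [Hit("memo-17", 9.1), Hit("redline-4", 7.4)]
before = enforce_permissions("bob", hits)   # bob can see memo-17 only
SOURCE_ACL["memo-17"].discard("bob")        # access revoked at the source
after = enforce_permissions("bob", hits)    # bob now sees nothing
```

The trade-off in this design is an extra lookup per result at query time in exchange for never serving a stale permission.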

Stable scoring across a changing corpus

Needl's proprietary scoring does not key off corpus-level statistics like term frequency, document frequency or document length. This matters because BM25 drifts: in March, "covenant" appears in 4,000 documents and scores as moderately rare. By September, 35,000 new documents containing that term collapse its IDF score and the same query silently surfaces the wrong documents. Needl's scoring stays stable as the corpus grows, with firm-specific business logic layered on top to personalize ranking to each enterprise's vocabulary and authority signals.
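
The drift described above is easy to see in the IDF term itself. The sketch below uses the IDF form found in Lucene-style BM25 implementations and says nothing about Needl's own scoring. The total corpus size is an assumption made purely for illustration, since only the term counts are given here, and the sketch further assumes the corpus grows only by the 35,000 new documents.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """IDF term in the form used by Lucene-style BM25 implementations."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Assumed corpus size: the term counts come from the text, the total does not.
CORPUS_IN_MARCH = 1_000_000
NEW_DOCS_WITH_TERM = 35_000

idf_march = bm25_idf(CORPUS_IN_MARCH, 4_000)
idf_september = bm25_idf(
    CORPUS_IN_MARCH + NEW_DOCS_WITH_TERM,
    4_000 + NEW_DOCS_WITH_TERM,
)

print(f"IDF('covenant') in March:     {idf_march:.2f}")
print(f"IDF('covenant') in September: {idf_september:.2f}")
```

Under these assumptions the weight of "covenant" falls substantially between the two dates even though neither the query nor the relevant documents changed, which is why rankings can shift without anyone touching the system.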

Query-time chunking

Documents are stored intact and indexed whole. At query time, a dynamic chunking module assembles a span shaped by the query, not by an arbitrary token budget chosen at ingest. Tables come back with their headers. Management commentary comes back with the data it references.
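
One way to picture query-time chunking is as a span grown around the best-matching passage once the query is known. The sketch below is a deliberately simple illustration using term overlap between the query and each paragraph; `query_time_chunk` and its `max_paragraphs` parameter are hypothetical, and a production module would use far richer signals (table structure, headers, references).

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_time_chunk(document: str, query: str, max_paragraphs: int = 3) -> str:
    """Return a contiguous span of paragraphs grown around the best match."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    terms = tokenize(query)
    overlap = [len(terms & tokenize(p)) for p in paragraphs]
    if not paragraphs or max(overlap) == 0:
        return ""
    start = end = overlap.index(max(overlap))
    # Grow the span toward whichever neighbour matches the query better.
    while end - start + 1 < max_paragraphs:
        left = overlap[start - 1] if start > 0 else 0
        right = overlap[end + 1] if end + 1 < len(paragraphs) else 0
        if left == 0 and right == 0:
            break
        if left >= right:
            start -= 1
        else:
            end += 1
    return "\n\n".join(paragraphs[start:end + 1])


document = (
    "Company overview and history.\n\n"
    "Leverage covenant: net debt to EBITDA must stay below 3.5x.\n\n"
    "Management expects covenant headroom to narrow in Q3 as EBITDA softens.\n\n"
    "Office locations and contact details."
)
chunk = query_time_chunk(document, "covenant EBITDA headroom")
```

Here the returned span pairs the management commentary with the covenant definition it refers to and leaves out the unrelated paragraphs, because the boundaries are drawn after the question is known rather than at ingest.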

Query understanding without behavioral data

Needl substitutes two inputs for the behavioral data enterprise environments lack: pretrained world knowledge from frontier language models and a customer-specific ontology covering internal product codes, deal codenames and sector taxonomies. Both are fully traceable.

Schema-only structured retrieval

Needl never ingests rows from structured sources. Queries are answered by generating SQL at query time and executing it directly against the source under the requesting user's authenticated identity. This avoids freshness lag, duplicate storage of sensitive data, and the burden of keeping two systems consistent. Database-native security primitives apply automatically.
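
A small sketch of the schema-only pattern, using Python's built-in `sqlite3` module as a stand-in for a customer database. Everything here is illustrative: `ingest_schema` and `generate_sql` are hypothetical names, and `generate_sql` returns a fixed statement where a real system would generate SQL from the question. SQLite also has no user accounts, so the "requesting user's identity" step is assumed rather than shown; in practice the connection would be opened with that user's credentials.

```python
import sqlite3

# Stand-in for a customer's source database; the rows stay here and are
# never copied into a retrieval index.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE loans (borrower TEXT, outstanding_usd REAL)")
source.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [("Acme", 12_500_000.0), ("Globex", 4_000_000.0)],
)


def ingest_schema(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Read table and column names only; no row is fetched here."""
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    return {
        table: [col[1] for col in conn.execute(f"PRAGMA table_info({table})")]
        for table in tables
    }


def generate_sql(question: str, schema: dict[str, list[str]]) -> tuple[str, tuple]:
    """Hypothetical placeholder for SQL generation driven by the schema."""
    assert "outstanding_usd" in schema["loans"]
    return "SELECT outstanding_usd FROM loans WHERE borrower = ?", ("Acme",)


schema = ingest_schema(source)
sql, params = generate_sql("What is Acme's outstanding balance?", schema)
answer = source.execute(sql, params).fetchone()[0]  # read live from the source
```

The only thing held outside the database is the schema dictionary, so there is no second copy of the data to secure or keep fresh.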

Inner and outer ReAct loops

Each sub-agent runs an inner ReAct loop, issuing a query, inspecting the result and refining if needed. The outer ReAct loop verifies completeness across the combined answer and dispatches follow-up sub-queries if gaps remain. Only when the outer loop is satisfied does the engine return to the caller, with citations to source content for every claim.
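
The two-level control flow can be sketched as a pair of nested loops. This is a toy illustration, not Needl's engine: `inner_loop`, `outer_loop` and `toy_search` are hypothetical, the "refinement" is just dropping the last word of the query, and the completeness check is simply "every sub-query has at least one result".

```python
from typing import Callable

Search = Callable[[str], list[str]]


def inner_loop(sub_query: str, search: Search, max_attempts: int = 3) -> list[str]:
    """Issue a query, inspect the result, and refine until something comes back."""
    query = sub_query
    for _ in range(max_attempts):
        results = search(query)
        if results:
            return results
        # Hypothetical refinement: drop the last word to broaden the query.
        query = " ".join(query.split()[:-1]) or sub_query
    return []


def outer_loop(sub_queries: list[str], search: Search, max_rounds: int = 2) -> dict[str, list[str]]:
    """Re-dispatch unanswered sub-queries until the combined answer is complete."""
    answers: dict[str, list[str]] = {}
    pending = list(sub_queries)
    for _ in range(max_rounds):
        for sub_query in pending:
            answers[sub_query] = inner_loop(sub_query, search)
        pending = [q for q in sub_queries if not answers[q]]
        if not pending:
            break
    return answers


INDEX = {
    "acme covenant": ["memo-17"],
    "acme loan balance": ["loans.outstanding_usd"],
}


def toy_search(query: str) -> list[str]:
    return INDEX.get(query, [])


answers = outer_loop(["acme covenant amendment", "acme loan balance"], toy_search)
```

In this run the first sub-query misses, is broadened once by the inner loop, and then resolves, while the outer loop returns only after both sub-queries carry a cited source.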

Explainability and deduplication

Every retrieval decision is composed of primitives a human can read and modify, producing scoring traces, per-claim citations and full audit reconstruction for any historical query. The same document often lives across drives, channels and inboxes with different permissions on each copy. Needl deduplicates at search time using near-duplicate hashing, surfacing one canonical result while retaining the full provenance graph beneath.
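
As an illustration of search-time deduplication, the sketch below groups copies whose hashed word shingles overlap heavily. Jaccard similarity over shingle hashes is a simple stand-in chosen for readability; it is not a description of the near-duplicate hashing Needl actually uses, and the function names, file locations and `threshold` value are all hypothetical.

```python
import hashlib


def shingle_hashes(text: str, size: int = 3) -> set[str]:
    """Hash every run of `size` consecutive words in the lowercased text."""
    words = text.lower().split()
    count = max(1, len(words) - size + 1)
    shingles = [" ".join(words[i:i + size]) for i in range(count)]
    return {hashlib.sha1(s.encode()).hexdigest() for s in shingles}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0


def deduplicate(copies: dict[str, str], threshold: float = 0.8) -> dict[str, list[str]]:
    """Map each canonical location to the locations of all its near-duplicates."""
    canonical: dict[str, set[str]] = {}
    provenance: dict[str, list[str]] = {}
    for location, text in copies.items():
        hashes = shingle_hashes(text)
        match = next(
            (c for c, h in canonical.items() if jaccard(hashes, h) >= threshold),
            None,
        )
        if match is None:
            canonical[location] = hashes
            provenance[location] = [location]
        else:
            provenance[match].append(location)
    return provenance


copies = {
    "sharepoint/memo.docx": "The committee approved the Acme covenant amendment on revised terms",
    "slack/upload.docx": "the committee approved the acme covenant amendment on revised terms final",
    "drive/report.docx": "Quarterly headcount report for the sales organisation",
}
groups = deduplicate(copies)
```

The result surfaces one canonical entry per document while keeping every location it was found in, which is the shape a provenance graph needs.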

Indexing pipelines and operational SLAs

The pipeline targets three SLAs: freshness (new content searchable within 15 to 30 minutes), burst tolerance (no user's indexing is starved when another's content surges), and immediate utility (a new user finds value within minutes). Needl runs a two-crawler pattern for new users: a recent crawler that pulls the last sixty days immediately, and a backfill crawler that walks back through history in the background. A fair scheduler routes objects to complexity-tiered parsers based on document complexity, not file extension, keeping costs bounded at scale.
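
The burst-tolerance goal can be illustrated with the simplest possible fair scheduler: round-robin across per-user queues. This is only a sketch of the idea, with a hypothetical `fair_schedule` function; it ignores the complexity-tiered parsers and the recent/backfill crawler split described above.

```python
from collections import deque


def fair_schedule(queues: dict[str, list[str]]) -> list[str]:
    """Interleave per-user indexing queues round-robin so that a burst from
    one user cannot starve the others."""
    pending = {user: deque(items) for user, items in queues.items() if items}
    order: list[str] = []
    while pending:
        for user in list(pending):
            order.append(pending[user].popleft())
            if not pending[user]:
                del pending[user]
    return order


order = fair_schedule({
    "alice": ["a1", "a2", "a3", "a4"],  # a surge of new content
    "bob": ["b1"],                      # a single new document
})
# order == ["a1", "b1", "a2", "a3", "a4"]
```

Bob's single document is processed second rather than fifth, even though Alice's surge arrived first.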

Benchmarks: a 15-point gap driven by architecture, not model choice

Needl ranks first on the S&P Global Kensho Long-Document QA leaderboard, operated independently by S&P Global over long, complex financial documents.

FinanceBench tests AI systems on complex financial question answering, but results vary significantly depending on how it is run. Some systems run it on a curated document set, downloading only the specific SEC 10-K filings relevant to the questions being asked. When the answer is almost certainly in the small pile of pre-selected documents, retrieval becomes trivial. That is not enterprise retrieval. That is a controlled test.


Needl runs FinanceBench against the full SEC filings corpus with no pre-selection. The model has to find the right answer across all of SEC data, which is the actual retrieval problem enterprises face every day. That is a fundamentally harder and more realistic test.

Needl (full SEC corpus, no pre-selection): 97.24%
Leading AI search competitor: 82.76%

Same underlying language model. The gap is purely retrieval architecture, and it is worth noting that the competitor's setup was the easier one. On a level playing field the gap would only be wider.

Translated into agentic terms: at 97% per-step accuracy across a 5-step workflow, the full chain succeeds 86% of the time. At 82%, it succeeds 37% of the time. That gap widens as the corpus grows, because vector retrieval's ceiling tightens with corpus size while lexical retrieval does not. An 86% success rate is deployable. A 37% success rate is not.

Enterprise search and retrieval is already among the most-deployed generative AI use cases at 28% adoption (Menlo Ventures, 2024). As that adoption extends into agentic workflows, the accuracy requirements only increase.


Conclusion: context is the infrastructure agents depend on

Better reasoning models are a genuine advance. But as enterprises hand more consequential decisions to autonomous agents, the context layer becomes the single point of failure that nobody is watching. The reasoning model gets the credit when things go right. The retrieval layer absorbs the blame, silently, when things go wrong.

Accuracy at scale, permission awareness without drift, explainability for engineers and auditors, and indexing pipelines that meet real freshness and burst constraints: none of these come from a vector database, an MCP wrapper, or a larger context window. They come from an architecture built specifically for them. The context layer is not a feature inside an agent. It is the infrastructure agents depend on, and like all infrastructure, it either holds or it does not.

FAQs

Won't larger context windows eventually eliminate the retrieval problem?
No. Context windows are working memory for a single interaction, not a storage layer. Hundreds of terabytes of enterprise content will not fit, and the retrieval problem does not disappear. It just gets more expensive.

Why not just use hybrid search (BM25 plus vectors)?
Needl does use hybrid retrieval. The distinction is that raw BM25 statistics drift as new documents arrive, silently degrading ranking. Needl's proprietary scoring stays stable at scale, and adds query-time chunking, agentic verification and runtime ACL enforcement on top.

What does schema-only ingestion mean for data privacy?
Needl never copies rows from databases. Queries are answered by generating SQL executed against the source under the requesting user's identity. The database's own access controls apply automatically. No second copy of sensitive data exists.

How does Needl handle queries that span structured and unstructured data?
An agentic routing layer decomposes mixed queries into sub-queries targeting the right retrieval mode. Each sub-agent refines its result in an inner ReAct loop. An outer verification loop checks the combined answer and dispatches follow-ups if gaps remain.
