Engineering

From Chatbot to Context Engine: Building AskNeedl’s Remote MCP Server

May 20, 2025

by Sushanth Kummitha

Introduction

When we built AskNeedl, our flagship Q&A chatbot at Needl.ai, the early results were promising. It could dig deep into user-linked data sources — annual reports, presentations, exchange filings — and produce insightful answers using RAG and structured prompts.

But we quickly hit a wall: prompt quality was everything.

If the prompt wasn’t perfectly framed, or if we didn’t guide the model through a series of intermediate reasoning steps, the output would drop in quality fast. What we needed was a system that could break down, reframe, and enrich user queries dynamically.

That’s when we started exploring Model Context Protocol (MCP). By exposing AskNeedl’s core capabilities (tools, prompts, data access) via a fully async, SSE-capable MCP server, we enabled external clients like Claude to:

  • Discover and invoke Needl’s tools dynamically
  • Inject structured prompts and intermediate queries
  • Construct context-aware answers programmatically

What is MCP?

Model Context Protocol (MCP) is a specification for exposing prompts, tools, and resources over a network interface.

In our case, MCP became the way we let outside models like Claude interact with AskNeedl’s engine — treating it less like a chatbot, and more like an extensible Q&A backend.

From the official MCP intro:

“MCP enables agents and runtime environments to fetch model input context dynamically, from structured remote sources.”

Through MCP, we turned our LLM expertise into remote-callable capabilities. This includes:

  • Custom summarizers for feeds and documents defined on Needl
  • Structured document analysis
  • Prompt templates for various report-generation use cases
  • and so on…

Why Remote MCP?

AskNeedl Before MCP:
  • Prompt logic was bundled in our app backend
  • Context generation wasn’t composable
  • External models couldn’t leverage AskNeedl capabilities

AskNeedl After MCP:

MCP gave us a structured way to expose:

  • Prompt templates: such as our `summarize_annual_report` prompt
  • Tools: such as `ask_needl_web`, `ask_us_capital_markets`, `ask_india_capital_markets`, `ask_document`, and `ask_feed`
  • User-context-aware APIs: endpoints that return structured data, not plain strings

This turned AskNeedl from a monolith into a context provider, capable of powering external agents through composable APIs.

With MCP, AskNeedl became a backend for anyone building smart model workflows — not just our UI.

The async-first, SSE-powered implementation means external clients can stream responses, invoke tools concurrently, and get partial answers immediately.
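Concretely, everything a client does here is plain MCP JSON-RPC. As a rough illustration (request shapes simplified, IDs arbitrary, feed ID a placeholder), discovering the available tools and invoking `ask_feed` look something like this:

{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

{"jsonrpc": "2.0", "id": 2, "method": "tools/call",
 "params": {"name": "ask_feed",
            "arguments": {"prompt": "Summarize this week's updates", "feed_id": "<your-feed-id>"}}}

The server replies with the tool's structured result (streamed over SSE when that transport is used), which the client folds back into its own reasoning.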

📊 Architecture Overview

⚡ The Async-First Philosophy

In a system where every request might:

  • Hit a retrieval engine
  • Call external APIs
  • Access secured user data
  • Return streamed responses

async isn't optional; it's essential.

We designed our server so that every layer, from request parsing to final tool execution, is async and runs on `asyncio`. This ensures:

  • High concurrency
  • Non-blocking I/O
  • Smooth streaming over SSE
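A tiny, self-contained sketch of what that buys us (the coroutine names are hypothetical stand-ins for a retrieval call and an external API call): independent awaits fan out instead of running back-to-back.

import asyncio

async def fetch_filings(query: str) -> str:
    await asyncio.sleep(0.5)  # stand-in for a non-blocking retrieval call
    return f"filings for {query}"

async def fetch_news(query: str) -> str:
    await asyncio.sleep(0.5)  # stand-in for an external API call
    return f"news for {query}"

async def build_context(query: str) -> list[str]:
    # Both I/O-bound calls run concurrently, so latency is ~0.5s instead of ~1s
    return list(await asyncio.gather(fetch_filings(query), fetch_news(query)))

print(asyncio.run(build_context("quarterly results")))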

Token Context with `contextvars`

We use Python’s `contextvars` to isolate auth context across coroutines:

import contextvars
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")

This allows us to:

  • Store the user token at request entry
  • Access it later from deep inside a tool or prompt
  • Prevent accidental leakage between concurrent requests
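A minimal demonstration of that isolation (standalone, not our production middleware): two concurrent "requests" each read back only the token they set, even though they share one event loop.

import asyncio
import contextvars

token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")

async def handle_request(token: str) -> str:
    token_ctx.set(token)      # set at request entry
    await asyncio.sleep(0.1)  # other requests are being served in the meantime
    return token_ctx.get()    # still this request's token

async def main() -> None:
    results = await asyncio.gather(handle_request("user-a"), handle_request("user-b"))
    assert results == ["user-a", "user-b"]  # no cross-request leakage

asyncio.run(main())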

🔧 Implementation Highlights

Authentication via Headers

We use a `Bearer` token in the `Authorization` header as the auth layer to secure tool access.

Example header:

Authorization: Bearer my-api-key

Our middleware extracts the token, stores it in an async-safe `contextvar`, and validates it. This design ensures:

  • Auth is enforced before tool execution
  • Token is available across async layers
  • Invalid requests fail early with a clear error

Sample logic:

token = request.headers.get("Authorization", "").replace("Bearer ", "")
token_ctx.set(token)
if not token:
    return JSONResponse({"error": "Unauthorized"}, status_code=401)

Tool Registration

@mcp.tool(name="ask_feed")
async def ask_feed(prompt: str, feed_id: str) -> AskNeedlResponse:
    token = token_ctx.get()
    return await _ask_needl_request({"prompt": prompt, "feed_id": feed_id}, token)

Features:

  • Param validation via `pydantic`
  • Auto-generated OpenAPI-compatible metadata
  • Token propagated via async-local storage
  • Error handling with decorators (`_tool_error_handler`)
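The `_tool_error_handler` decorator itself isn't shown above; as a hedged sketch (the name comes from the list above, but the body here is illustrative rather than our exact implementation), it wraps each tool so failures surface as clear errors instead of crashing the server:

import functools
import logging

logger = logging.getLogger(__name__)

def _tool_error_handler(func):
    """Wrap a tool coroutine so failures become structured, logged errors."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:
            logger.exception("Tool %s failed", func.__name__)
            # Propagate a clear failure state to the MCP client instead of a bare 500
            raise RuntimeError(f"{func.__name__} failed: {exc}") from exc
    return wrapper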

Prompt Registration

@mcp.prompt(name="summarize_annual_report")
async def summarize_annual_report(message: str) -> str:
    return f"{ANNUAL_REPORT_SUMMARY_PROMPT}\n\nUser Input:\n{message}"

This lets external agents construct well-scaffolded input prompts by simply passing a message string. The structure, tone, and format are handled remotely.
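The heavy lifting lives in the template constant itself. A purely illustrative example of the kind of scaffolding something like `ANNUAL_REPORT_SUMMARY_PROMPT` carries (not the production template):

# Illustrative only; not the production template
ANNUAL_REPORT_SUMMARY_PROMPT = (
    "You are an equity research assistant. Summarize the annual report excerpt below.\n"
    "Cover revenue and margin trends, segment performance, key risks, and management outlook.\n"
    "Respond in concise bullet points and cite the sections you relied on."
)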

SSE Transport

def create_sse_server(mcp: FastMCP):
    app = mcp.http_app(path="/sse", transport="sse", middleware=[…])
    return Starlette(routes=[Mount("/", app=app)])

SSE streams tokens or structured payloads as they’re generated. Perfect for:

  • Live inference feedback
  • Long-running tools
  • Reducing user wait time
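On the wire this is a standard `text/event-stream`. A minimal client-side sketch (not a full MCP client; the URL and key are placeholders) that reads events as they arrive with `httpx`:

import asyncio
import httpx

async def consume_sse(url: str, api_key: str) -> None:
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "text/event-stream"}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", url, headers=headers) as response:
            async for line in response.aiter_lines():
                if line.startswith("data:"):
                    print(line.removeprefix("data:").strip())  # partial payloads as they arrive

asyncio.run(consume_sse("http://localhost:8000/sse", "my-api-key"))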

Middleware + Auth

class AuthenticateRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        token = extract_token(request)
        token_ctx.set(token)
        if not validate_token(token):
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        return await call_next(request)

  • Protects all routes
  • Token stays local to the coroutine
  • Logging integrated with `contextvars` for traceability

Logging

We use structured logs throughout:

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)
logger.info(f"Calling tool {tool_name} with params: {params}")

Why it matters:

  • Tool invocations are auditable
  • Performance bottlenecks are visible
  • Failures can be correlated with user sessions
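To make those logs traceable per request, one simple approach (a sketch; in production we'd pair it with a JSON formatter) is a logging filter that pulls an identifier out of the same `contextvars` machinery:

import contextvars
import logging

# In the real server this would be the shared token_ctx defined earlier
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")

class RequestContextFilter(logging.Filter):
    """Attach an abbreviated request token to every log record for traceability."""
    def filter(self, record: logging.LogRecord) -> bool:
        token = token_ctx.get(None)
        record.request_id = (token[:8] + "...") if token else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s"))
handler.addFilter(RequestContextFilter())
logging.getLogger().addHandler(handler)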

Best Practices Followed

  • `httpx.AsyncClient` for outbound requests
  • `@wraps` decorators for consistent exception handling
  • Timeout tuning for long RAG flows
  • All routes mounted under `/sse` for clean transport separation
  • Gunicorn config uses `UvicornWorker` + thread tuning
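Putting a few of these together, here's a hedged sketch of what an outbound helper like the `_ask_needl_request` used in `ask_feed` might look like; the endpoint URL, payload shape, and timeout values below are assumptions for illustration, not the real AskNeedl API:

import httpx

ASKNEEDL_API_URL = "https://api.example.com/askneedl"  # placeholder, not the real endpoint

async def _ask_needl_request(payload: dict, token: str) -> dict:
    """Forward a tool invocation to the backend, with generous read timeouts for long RAG flows."""
    timeout = httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.post(
            ASKNEEDL_API_URL,
            json=payload,
            headers={"Authorization": f"Bearer {token}"},
        )
        response.raise_for_status()
        return response.json()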

🤔 Challenges & Learnings

  • Sparse Documentation: Very little online content or community knowledge around implementing remote MCP. Even LLMs struggled to assist.
  • Hard to Debug: Without clear references, diagnosing transport issues or tool behavior required deep dives.
  • Async Everything: Reliability came from fully embracing async at all layers — transport, tools, HTTP clients, middleware.
  • Error Handling: Carefully wrapped all tool logic to prevent server crashes and propagate clear failure states.
  • Frequent Library Updates: Had to track and adopt updates to `fastmcp` to improve SSE reliability and reduce connection drops.
  • Token Management: Ensuring per-request isolation and secure propagation via `contextvars` was essential for correctness.

👨‍💻 Developer Guide: Building Your Own MCP Server

This guide is for developers who want to build and deploy their own Remote MCP server. Whether you’re working on internal LLM orchestration, agent frameworks, or enterprise tooling — here’s a clear path.

1. Set Up Your Project

mkdir my-mcp-server && cd my-mcp-server
python -m venv venv && source venv/bin/activate
pip install fastapi fastmcp httpx uvicorn starlette

2. Define Token Context

Use `contextvars` to store per-request token safely.

# context.py
import contextvars
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")

3. Auth Middleware

# middlewares/auth.py
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware
from context import token_ctx

class AuthenticateRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        token = request.headers.get("Authorization", "").replace("Bearer ", "")
        token_ctx.set(token)
        if not token:
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        return await call_next(request)

4. Initialize FastMCP

# server.py
from fastapi import FastAPI
from fastmcp import FastMCP
from starlette.routing import Mount
from starlette.applications import Starlette
from starlette.middleware import Middleware
from middlewares.auth import AuthenticateRequestMiddleware
from context import token_ctx

mcp = FastMCP("My Remote MCP")

# Register tools or prompts here, or import the module that defines them (see step 5)

def create_sse_app():
    mcp_app = mcp.http_app(
        path="/sse",
        transport="sse",
        middleware=[Middleware(AuthenticateRequestMiddleware)],
    )
    return Starlette(routes=[Mount("/", app=mcp_app)], lifespan=mcp_app.router.lifespan_context)

app = FastAPI()

app.mount("/", create_sse_app())

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "server:app",
        host="0.0.0.0",
        port=8000,  # or read the port from your settings/env
        loop="asyncio",
    )

5. Register Your Tools

# tools.py
from context import token_ctx
from pydantic import BaseModel
from server import mcp  # the FastMCP instance created in step 4
import httpx  # useful once your tools call external APIs

class ResponseModel(BaseModel):
    result: str

@mcp.tool(name="echo_tool", description="Echoes the input message.")
async def echo_tool(message: str) -> ResponseModel:
    token = token_ctx.get()
    return ResponseModel(result=f"Echo: {message}, Token: {token}")

6. Run the Server (Production)

#!/bin/bash
set -e
TIMEOUT=180
WORKERS=1
THREADS=8
PORT=${PORT:-8000}
gunicorn -b 0.0.0.0:$PORT \
 -k uvicorn.workers.UvicornWorker \
 src.askneedl_mcp_server.server:app \
 --timeout $TIMEOUT \
 --workers $WORKERS \
 --threads $THREADS \
 --max-requests 1000 \
 --max-requests-jitter 100 \
 --log-level "info"

7. Local Testing

One option is the official MCP Inspector tool, which offers an interactive interface to test and debug your remote MCP server's liveness, resources, prompts, and tools, all without writing extra code. It supports direct inspection of locally developed or packaged servers via npx, making it ideal for rapid iteration and validation during development.

Another way to test your MCP server locally, especially with LLM agents like Claude, is the `mcp-remote` npm package. This acts as a shim/proxy that:

  • Speaks MCP over `stdio` to the local LLM client
  • Connects to your MCP server over HTTP + SSE

Even if your MCP server is hosted remotely, the Claude desktop app does not allow remote integrations unless you're on the Claude Max plan. Using `mcp-remote`, however, you can bridge that gap on any plan.

Here’s how:

{
 "mcpServers": {
  "my_mcp_server": {
   "command": "npx",
   "args": [
    "-y",
    "mcp-remote@latest",
    "http://0.0.0.0:8000/sse",
    "8000",
    "--allow-http",
    "--transport",
    "sse-only",
    "--header",
    "Authorization: Bearer ${API_KEY}"
   ],
   "env": {
    "API_KEY": "<your_api_key>"
   }
  }
 }
}

How this works:

  • `mcp-remote` launches a local stdio listener
  • Forwards every MCP request to your actual FastAPI+SSE MCP server
  • Handles headers (like `Authorization`) automatically
  • Makes your remote tools/prompts callable from Claude

This is a great way to test your tools locally before deploying them.

Learn more: npmjs.com/package/mcp-remote

Tips for Production

  • Use `gunicorn` with `UvicornWorker`
  • Monitor memory usage during long SSE sessions
  • Use pydantic models for input validation
  • Integrate structured logging (JSON format)
  • Automate tool discovery with decorators

Bonus: Dev Tips

  • Organize by responsibility: keep tools, prompts, auth, config in separate modules
  • Start with one working tool: prove end-to-end SSE flow early
  • Log everything: especially inputs/outputs at the tool level
  • Stay close to the `fastmcp` changelog: fixes and features ship fast
  • Write client simulators: test from both LLM and raw HTTP/SSE

Bonus: Tips for Scaling Further

  • ✅ Add observability: log every tool invocation with user/session ID
  • ✅ Use typed payloads everywhere with `pydantic`
  • ✅ Validate request tokens against a central auth service
  • ✅ Add rate-limiting per API key for multi-tenant safety (see the sketch after this list)
  • ✅ Provide tool discovery via OpenAPI schema or `/metadata` endpoint

These help ensure the MCP server is secure, stable, and extensible for broader use.
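As one concrete example of the rate-limiting point above, here's a minimal in-memory per-key limiter; it's a sketch only (single-process, no persistence), and a production setup would typically back this with Redis or an API gateway:

import time
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

class RateLimitMiddleware(BaseHTTPMiddleware):
    """Allow at most `limit` requests per API key per `window` seconds (in-memory sketch)."""
    def __init__(self, app, limit: int = 60, window: float = 60.0):
        super().__init__(app)
        self.limit, self.window = limit, window
        self._hits: dict[str, list[float]] = {}

    async def dispatch(self, request, call_next):
        key = request.headers.get("Authorization", "anonymous")
        now = time.monotonic()
        recent = [t for t in self._hits.get(key, []) if now - t < self.window]
        if len(recent) >= self.limit:
            return JSONResponse({"error": "Rate limit exceeded"}, status_code=429)
        recent.append(now)
        self._hits[key] = recent
        return await call_next(request)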

🤖 Conclusion

MCP transformed AskNeedl from a chatbot into a platform.

  • Now, external agents can discover and invoke our tools.
  • Prompt logic lives where it should — in a managed service.
  • Streaming outputs feel fast and usable.
  • All of it is async, robust, and production-tested.

And most importantly: it’s not just about sending better prompts. It’s about unlocking the full value of your LLM system by separating concerns, adding structure, and thinking like a protocol designer.

If you’re building LLM infra, it might be time to ship your own MCP server.

You can also check this post on Medium.

