by Sushanth Kummitha
Introduction
When we built AskNeedl, our flagship Q&A chatbot at needl.ai, the early results were promising. It could dig deep into user-linked data sources — annual reports, presentations, exchange filings — and produce insightful answers using RAG and structured prompts.
But we quickly hit a wall: prompt quality was everything.
If the prompt wasn’t perfectly framed, or if we didn’t guide the model through a series of intermediate reasoning steps, the output would drop in quality fast. What we needed was a system that could break down, reframe, and enrich user queries dynamically.
That’s when we started exploring Model Context Protocol (MCP). By exposing AskNeedl’s core capabilities (tools, prompts, data access) via a fully async, SSE-capable MCP server, we enabled external clients like Claude to:
- Discover and invoke Needl’s tools dynamically
- Inject structured prompts and intermediate queries
- Construct context-aware answers programmatically
What is MCP?
Model Context Protocol (MCP) is a specification for exposing prompts, tools, and resources over a network interface.
In our case, MCP became the way we let outside models like Claude interact with AskNeedl’s engine — treating it less like a chatbot, and more like an extensible Q&A backend.
From the official MCP intro:
“MCP enables agents and runtime environments to fetch model input context dynamically, from structured remote sources.”
Through MCP, we turned our LLM expertise into remote-callable capabilities. This includes:
- Custom summarizers for feeds and documents defined on Needl
- Structured document analysis
- Prompt templates for various report generation use cases
- and so on…
Why Remote MCP?
AskNeedl Before MCP:
- Prompt logic was bundled in our app backend
- Context generation wasn’t composable
- External models couldn’t leverage AskNeedl capabilities
AskNeedl After MCP:
MCP gave us a structured way to expose:
- Prompt Templates: like our `summarize_annual_report` prompt
- Tools: like `ask_needl_web`, `ask_us_capital_markets`, `ask_india_capital_markets`, `ask_document`, `ask_feed`, and more
- User-context-aware APIs that return structured data, not plain strings
This turned AskNeedl from a monolith into a context provider, capable of powering external agents through composable APIs.
With MCP, AskNeedl became a backend for anyone building smart model workflows — not just our UI.
The async-first, SSE-powered implementation means external clients can stream responses, invoke tools concurrently, and get partial answers immediately.
📊 Architecture Overview
⚡ The Async-First Philosophy
In a system where every request might:
- Hit a retrieval engine
- Call external APIs
- Access secured user data
- Return streamed responses
async is not optional.
We designed our server so that every layer — from request parsing to final tool execution — runs on `asyncio`. This ensures:
- High concurrency
- Non-blocking I/O
- Smooth streaming over SSE
Token Context with `contextvars`
We use Python’s `contextvars` to isolate auth context across coroutines:
import contextvars
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")
This allows us to:
- Store the user token at request entry
- Access it later from deep inside a tool or prompt
- Prevent accidental leakage between concurrent requests (see the isolation sketch below)
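Here is a minimal, illustrative sketch of that isolation (not production code): two concurrent requests each set their own token, and each reads back only its own value, even though both run on the same event loop.
import asyncio
import contextvars

# Mirrors the contextvar defined above
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")

async def handle_request(token: str) -> str:
    token_ctx.set(token)        # set at "request entry"
    await asyncio.sleep(0.01)   # simulate awaits deep inside a tool call
    return token_ctx.get()      # still this request's token, not a neighbour's

async def main():
    results = await asyncio.gather(handle_request("token-A"), handle_request("token-B"))
    print(results)              # ['token-A', 'token-B']: no cross-request leakage

asyncio.run(main())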
🔧 Implementation Highlights
Authentication via Headers
We use a `Bearer` token in the `Authorization` header to secure tool access.
Example header:
Authorization: Bearer my-api-key
Our middleware extracts the token, stores it in an async-safe `contextvar`, and validates it. This design ensures:
- Auth is enforced before tool execution
- Token is available across async layers
- Invalid requests fail early with a clear error
Sample logic:
token = request.headers.get("Authorization", "").replace("Bearer ", "")
token_ctx.set(token)
if not token:
    return JSONResponse({"error": "Unauthorized"}, status_code=401)
Tool Registration
@mcp.tool(name="ask_feed")
async def ask_feed(prompt: str, feed_id: str) -> AskNeedlResponse:
    token = token_ctx.get()
    return await _ask_needl_request({"prompt": prompt, "feed_id": feed_id}, token)
Features:
- Param validation via `pydantic`
- Auto-generated OpenAPI-compatible metadata
- Token propagated via async-local storage
- Error handling with decorators (`_tool_error_handler`; a possible shape is sketched below)
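The `_tool_error_handler` decorator itself isn't shown here; a decorator in that spirit, built on `functools.wraps`, might look like the following sketch rather than our exact implementation.
import functools
import logging

logger = logging.getLogger(__name__)

def _tool_error_handler(func):
    """Wrap a tool coroutine so unexpected failures become clear, structured errors."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:
            logger.exception("Tool %s failed", func.__name__)
            # Surface a clear failure state instead of crashing the server
            raise RuntimeError(f"{func.__name__} failed: {exc}") from exc
    return wrapper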
Prompt Registration
@mcp.prompt(name="summarize_annual_report")
async def summarize_annual_report(message: str) -> str:
    return f"{ANNUAL_REPORT_SUMMARY_PROMPT}\n\nUser Input:\n{message}"
This lets external agents construct well-scaffolded input prompts by simply passing a message string. The structure, tone, and format are handled remotely.
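Purely as an illustration (the real template is internal to Needl), a scaffold like `ANNUAL_REPORT_SUMMARY_PROMPT` could look like:
# Hypothetical example of a remotely managed prompt scaffold
ANNUAL_REPORT_SUMMARY_PROMPT = """You are a financial analyst assistant.
Summarize the annual report the user refers to, covering:
1. Revenue and margin trends
2. Key risks and management commentary
3. Notable year-over-year changes
Keep the tone neutral and cite the sections you relied on."""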
SSE Transport
def create_sse_server(mcp: FastMCP):
    app = mcp.http_app(path="/sse", transport="sse", middleware=[…])
    return Starlette(routes=[Mount("/", app=app)])
SSE streams tokens or structured payloads as they’re generated. Perfect for:
- Live inference feedback
- Long-running tools
- Reducing user wait time
Middleware + Auth
class AuthenticateRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        token = extract_token(request)
        token_ctx.set(token)
        if not validate_token(token):
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        return await call_next(request)
- Protects all routes
- Token stays local to coroutine
- Logging integrated with `contextvars` for traceability
Logging
We use structured logs throughout:
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger.info(f"Calling tool {tool_name} with params: {params}")
Why it matters:
- Tool invocations are auditable
- Performance bottlenecks are visible
- Failures can be correlated with user sessions (see the `contextvars` logging sketch below)
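One way to wire that correlation in — a sketch, assuming the `token_ctx` contextvar lives in a `context` module as in the guide further down — is a `logging.Filter` that stamps each record with an abbreviated request token:
import logging
from context import token_ctx  # the contextvar defined earlier

class RequestContextFilter(logging.Filter):
    """Attach the current request's token (abbreviated) to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        token = token_ctx.get(None)
        record.request_id = (token or "anonymous")[:8]
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestContextFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s"))
logging.getLogger().addHandler(handler)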
Best Practices Followed
- `httpx.AsyncClient` for outbound requests (a shared-client sketch follows this list)
- `@wraps` decorators for consistent exception handling
- Timeout tuning for long RAG flows
- All routes mounted under `/sse` for clean transport separation
- Gunicorn config uses `UvicornWorker` + thread tuning
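For the first and third items, a single shared `httpx.AsyncClient` with explicit timeouts keeps connections pooled and gives long RAG flows room to finish. The helper below is only a sketch of what something like the `_ask_needl_request` referenced earlier could look like; the URL and timeout values are placeholders, not our production settings.
import httpx

# One client per process: connection pooling + a consistent timeout policy
http_client = httpx.AsyncClient(
    timeout=httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0),
)

async def _ask_needl_request(payload: dict, token: str) -> dict:
    response = await http_client.post(
        "https://api.needl.ai/ask",  # placeholder URL for illustration
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    return response.json()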
🤔 Challenges & Learnings
- Sparse Documentation: Very little online content or community knowledge around implementing remote MCP. Even LLMs struggled to assist.
- Hard to Debug: Without clear references, diagnosing transport issues or tool behavior required deep dives.
- Async Everything: Reliability came from fully embracing async at all layers — transport, tools, HTTP clients, middleware.
- Error Handling: Carefully wrapped all tool logic to prevent server crashes and propagate clear failure states.
- Frequent Library Updates: Had to track and adopt updates to `fastmcp` to improve SSE reliability and reduce connection drops.
- Token Management: Ensuring per-request isolation and secure propagation via `contextvars` was essential for correctness.
👨‍💻 Developer Guide: Building Your Own MCP Server
This guide is for developers who want to build and deploy their own Remote MCP server. Whether you’re working on internal LLM orchestration, agent frameworks, or enterprise tooling — here’s a clear path.
1. Set Up Your Project
mkdir my-mcp-server && cd my-mcp-server
python -m venv venv && source venv/bin/activate
pip install fastapi fastmcp httpx uvicorn starlette
2. Define Token Context
Use `contextvars` to store per-request token safely.
# context.py
import contextvars
token_ctx = contextvars.ContextVar("token")
3. Auth Middleware
# middlewares/auth.py
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware
from context import token_ctx
class AuthenticateRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        token = request.headers.get("Authorization", "").replace("Bearer ", "")
        token_ctx.set(token)
        if not token:
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        return await call_next(request)
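This middleware only checks that a token is present. If you validate tokens against a central auth service (as recommended in the scaling tips later), a hedged sketch of such a helper, with a hypothetical endpoint, could be:
# auth.py (sketch)
import httpx

AUTH_SERVICE_URL = "https://auth.example.com/validate"  # hypothetical endpoint

async def validate_token(token: str) -> bool:
    """Return True if the central auth service accepts this token."""
    if not token:
        return False
    async with httpx.AsyncClient(timeout=5.0) as client:
        resp = await client.get(
            AUTH_SERVICE_URL,
            headers={"Authorization": f"Bearer {token}"},
        )
    return resp.status_code == 200
The middleware's `dispatch` would then `await validate_token(token)` and reject the request with a 401 when it returns False.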
4. Initialize FastMCP
# server.py
from fastapi import FastAPI
from fastmcp import FastMCP
from starlette.routing import Mount
from starlette.applications import Starlette
from starlette.middleware import Middleware
from middlewares.auth import AuthenticateRequestMiddleware
from context import token_ctx
mcp = FastMCP("My Remote MCP")
# Register tools or prompts here (see next step)
def create_sse_app():
    mcp_app = mcp.http_app(
        path="/sse",
        transport="sse",
        middleware=[Middleware(AuthenticateRequestMiddleware)],
    )
    return Starlette(routes=[Mount("/", app=mcp_app)], lifespan=mcp_app.router.lifespan_context)
app = FastAPI()
app.mount("/", create_sse_app())
if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "server:app",
        host="0.0.0.0",
        port=8000,  # replace with your configured port
        loop="asyncio",
    )
5. Register Your Tools
# tools.py
from context import token_ctx
from pydantic import BaseModel
import httpx  # used by real tools that make outbound calls
from server import mcp  # the FastMCP instance created in server.py

class ResponseModel(BaseModel):
    result: str

@mcp.tool(name="echo_tool", description="Echoes the input message.")
async def echo_tool(message: str) -> ResponseModel:
    token = token_ctx.get()
    return ResponseModel(result=f"Echo: {message}, Token: {token}")
6. Run the Server (Production)
#!/bin/bash
set -e
TIMEOUT=180
WORKERS=1
THREADS=8
PORT=${PORT:-8000}
gunicorn -b 0.0.0.0:$PORT \
    -k uvicorn.workers.UvicornWorker \
    src.askneedl_mcp_server.server:app \
    --timeout $TIMEOUT \
    --workers $WORKERS \
    --threads $THREADS \
    --max-requests 1000 \
    --max-requests-jitter 100 \
    --log-level "info"
7. Local Testing
One way is to use the official MCP Inspector tool, which offers an interactive interface to test and debug your remote MCP server's liveness, resources, prompts, and tools — all without writing extra code. It supports direct inspection of locally developed or packaged servers via npx, making it ideal for rapid iteration and validation during development.
Another way to test your MCP server locally, especially with LLM agents like Claude, is the `mcp-remote` npm package. This acts as a shim/proxy that:
- Speaks MCP over `stdio` to the local LLM client
- Connects to your MCP server over HTTP + SSE
Even if your MCP server is hosted remotely, the Claude desktop app does not allow remote integrations unless you're on the Claude Max plan. Using `mcp-remote`, however, you can bridge that gap on any plan.
Here’s how:
{
  "mcpServers": {
    "my_mcp_server": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote@latest",
        "http://0.0.0.0:8000/sse",
        "8000",
        "--allow-http",
        "--transport",
        "sse-only",
        "--header",
        "Authorization: Bearer ${API_KEY}"
      ],
      "env": {
        "API_KEY": "<your_api_key>"
      }
    }
  }
}
How this works:
- `mcp-remote` launches a local stdio listener
- Forwards every MCP request to your actual FastAPI+SSE MCP server
- Handles headers (like `Authorization`) automatically
- Makes your remote tools/prompts callable from Claude
This is a great way to test your tools locally before deploying them.
Learn more: npmjs.com/package/mcp-remote
Tips for Production
- Use `gunicorn` with `UvicornWorker`
- Monitor memory usage during long SSE sessions
- Use pydantic models for input validation
- Integrate structured logging (JSON format)
- Automate tool discovery with decorators
Bonus: Dev Tips
- Organize by responsibility: keep tools, prompts, auth, config in separate modules
- Start with one working tool: prove end-to-end SSE flow early
- Log everything: especially inputs/outputs at the tool level
- Stay close to the `fastmcp` changelog: fixes and features ship fast
- Write client simulators: test from both LLM and raw HTTP/SSE (a minimal simulator is sketched below)
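For that last point, a minimal raw HTTP/SSE probe built on `httpx` (the URL, port, and key here are placeholders) is enough to confirm the transport and auth header work before wiring up an LLM client:
import asyncio
import httpx

async def probe_sse(url: str, api_key: str) -> None:
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "text/event-stream"}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", url, headers=headers) as response:
            print("status:", response.status_code)
            async for line in response.aiter_lines():
                print(line)           # raw SSE frames: "event: ..." / "data: ..."
                if line.startswith("data:"):
                    break             # first event received; the transport works

asyncio.run(probe_sse("http://localhost:8000/sse", "<your_api_key>"))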
Bonus: Tips for Scaling Further
- ✅ Add observability: log every tool invocation with user/session ID
- ✅ Use typed payloads everywhere with `pydantic`
- ✅ Validate request tokens against a central auth service
- ✅ Add rate-limiting per API key for multi-tenant safety (a minimal sketch follows this list)
- ✅ Provide tool discovery via OpenAPI schema or `/metadata` endpoint
These help ensure the MCP server is secure, stable, and extensible for broader use.
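As a sketch of the rate-limiting idea — in-memory, so it only covers a single worker; a shared store such as Redis would be needed across workers — a Starlette middleware could look like:
import time
from collections import defaultdict, deque

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

class RateLimitMiddleware(BaseHTTPMiddleware):
    """Allow at most `limit` requests per API key per `window` seconds."""
    def __init__(self, app, limit: int = 60, window: float = 60.0):
        super().__init__(app)
        self.limit, self.window = limit, window
        self.hits: dict[str, deque] = defaultdict(deque)

    async def dispatch(self, request, call_next):
        key = request.headers.get("Authorization", "anonymous")
        now = time.monotonic()
        hits = self.hits[key]
        while hits and now - hits[0] > self.window:
            hits.popleft()            # drop requests older than the window
        if len(hits) >= self.limit:
            return JSONResponse({"error": "Rate limit exceeded"}, status_code=429)
        hits.append(now)
        return await call_next(request)
You would register it alongside the auth middleware, e.g. `app.add_middleware(RateLimitMiddleware, limit=60, window=60.0)`.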
🤖 Conclusion
MCP transformed AskNeedl from a chatbot into a platform.
- Now, external agents can discover and invoke our tools.
- Prompt logic lives where it should — in a managed service.
- Streaming outputs feel fast and usable.
- All of it is async, robust, and production-tested.
And most importantly: it’s not just about sending better prompts. It’s about unlocking the full value of your LLM system by separating concerns, adding structure, and thinking like a protocol designer.
If you’re building LLM infra, it might be time to ship your own MCP server.
You can also check this post on Medium.