by Sushanth Kummitha
Introduction
When we built AskNeedl, our flagship Q&A chatbot at needl.ai, the early results were promising. It could dig deep into user-linked data sources — annual reports, presentations, exchange filings — and produce insightful answers using RAG and structured prompts.
But we quickly hit a wall: prompt quality was everything.
If the prompt wasn’t perfectly framed, or if we didn’t guide the model through a series of intermediate reasoning steps, the output would drop in quality fast. What we needed was a system that could break down, reframe, and enrich user queries dynamically.
That’s when we started exploring Model Context Protocol (MCP). By exposing AskNeedl’s core capabilities (tools, prompts, data access) via a fully async, SSE-capable MCP server, we enabled external clients like Claude to:
- Discover and invoke Needl’s tools dynamically
- Inject structured prompts and intermediate queries
- Construct context-aware answers programmatically
What is MCP?
Model Context Protocol (MCP) is a specification for exposing prompts, tools, and resources over a network interface.
In our case, MCP became the way we let outside models like Claude interact with AskNeedl’s engine — treating it less like a chatbot, and more like an extensible Q&A backend.
From the official MCP intro:
“MCP enables agents and runtime environments to fetch model input context dynamically, from structured remote sources.”
Through MCP, we turned our LLM expertise into remote-callable capabilities. This includes:
- Custom summarizers for feeds and documents defined on Needl
- Structured document analysis
- Prompt templates for various report generation use cases
- and so on…
Why Remote MCP?
AskNeedl Before MCP:
- Prompt logic was bundled in our app backend
- Context generation wasn’t composable
- External models couldn’t leverage AskNeedl capabilities
AskNeedl After MCP:
MCP gave us a structured way to expose:
- Prompt Templates: like our `summarize_annual_report` prompt
- Tools: like `ask_needl_web`, `ask_us_capital_markets`, `ask_india_capital_markets`, `ask_document`, `ask_feed`, and more
- User-context-aware APIs that return structured data, not plain strings
This turned AskNeedl from a monolith into a context provider, capable of powering external agents through composable APIs.
With MCP, AskNeedl became a backend for anyone building smart model workflows — not just our UI.
The async-first, SSE-powered implementation means external clients can stream responses, invoke tools concurrently, and get partial answers immediately.
📊 Architecture Overview
⚡ The Async-First Philosophy
In a system where every request might:
- Hit a retrieval engine
- Call external APIs
- Access secured user data
- Return streamed responses
async is not optional.
We designed our server so that every layer — from request parsing to final tool execution — runs on `asyncio`. This ensures:
- High concurrency
- Non-blocking I/O
- Smooth streaming over SSE
Token Context with `contextvars`
We use Python’s `contextvars` to isolate auth context across coroutines:
import contextvars
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")
This allows us to:
- Store the user token at request entry
- Access it later from deep inside a tool or prompt
- Prevent accidental leakage between concurrent requests (see the isolation sketch below)
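Here is a minimal, illustrative sketch of that isolation (not production code): two concurrent requests each set their own token, and each reads back only its own value, even though both run on the same event loop.
import asyncio
import contextvars

# Mirrors the contextvar defined above
token_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("token")

async def handle_request(token: str) -> str:
    token_ctx.set(token)        # set at "request entry"
    await asyncio.sleep(0.01)   # simulate awaits deep inside a tool call
    return token_ctx.get()      # still this request's token, not a neighbour's

async def main():
    results = await asyncio.gather(handle_request("token-A"), handle_request("token-B"))
    print(results)              # ['token-A', 'token-B']: no cross-request leakage

asyncio.run(main())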
🔧 Implementation Highlights
Authentication via Headers
We use a `Bearer` token in the `Authorization` header to secure tool access.
Example header:
Authorization: Bearer my-api-key
Our middleware extracts the token, stores it in an async-safe `contextvar`, and validates it. This design ensures:
- Auth is enforced before tool execution
- Token is available across async layers
- Invalid requests fail early with a clear error
Sample logic:
token = request.headers.get("Authorization", "").replace("Bearer ", "")
token_ctx.set(token)
if not token:
    return JSONResponse({"error": "Unauthorized"}, status_code=401)
Tool Registration
@mcp.tool(name="ask_feed")
async def ask_feed(prompt: str, feed_id: str) -> AskNeedlResponse:
    token = token_ctx.get()
    return await _ask_needl_request({"prompt": prompt, "feed_id": feed_id}, token)
Features:
- Param validation via `pydantic`
- Auto-generated OpenAPI-compatible metadata
- Token propagated via async-local storage
- Error handling with decorators (`_tool_error_handler`; a possible shape is sketched below)
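The `_tool_error_handler` decorator itself isn't shown here; a decorator in that spirit, built on `functools.wraps`, might look like the following sketch rather than our exact implementation.
import functools
import logging

logger = logging.getLogger(__name__)

def _tool_error_handler(func):
    """Wrap a tool coroutine so unexpected failures become clear, structured errors."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:
            logger.exception("Tool %s failed", func.__name__)
            # Surface a clear failure state instead of crashing the server
            raise RuntimeError(f"{func.__name__} failed: {exc}") from exc
    return wrapper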
Prompt Registration
@mcp.prompt(name="summarize_annual_report")
async def summarize_annual_report(message: str) -> str:
    return f"{ANNUAL_REPORT_SUMMARY_PROMPT}\n\nUser Input:\n{message}"
This lets external agents construct well-scaffolded input prompts by simply passing a message string. The structure, tone, and format are handled remotely.
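Purely as an illustration (the real template is internal to Needl), a scaffold like `ANNUAL_REPORT_SUMMARY_PROMPT` could look like:
# Hypothetical example of a remotely managed prompt scaffold
ANNUAL_REPORT_SUMMARY_PROMPT = """You are a financial analyst assistant.
Summarize the annual report the user refers to, covering:
1. Revenue and margin trends
2. Key risks and management commentary
3. Notable year-over-year changes
Keep the tone neutral and cite the sections you relied on."""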
SSE Transport
def create_sse_server(mcp: FastMCP):
    app = mcp.http_app(path="/sse", transport="sse", middleware=[…])
    return Starlette(routes=[Mount("/", app=app)])
SSE streams tokens or structured payloads as they’re generated. Perfect for:
- Live inference feedback
- Long-running tools
- Reducing user wait time
Middleware + Auth
class AuthenticateRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        token = extract_token(request)
        token_ctx.set(token)
        if not validate_token(token):
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        return await call_next(request)
- Protects all routes
- Token stays local to coroutine
- Logging integrated with `contextvars` for traceability
Logging
We use structured logs throughout:
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger.info(f"Calling tool {tool_name} with params: {params}")
Why it matters:
- Tool invocations are auditable
- Performance bottlenecks are visible
- Failures can be correlated with user sessions (see the `contextvars` logging sketch below)
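One way to wire that correlation in — a sketch, assuming the `token_ctx` contextvar lives in a `context` module as in the guide further down — is a `logging.Filter` that stamps each record with an abbreviated request token:
import logging
from context import token_ctx  # the contextvar defined earlier

class RequestContextFilter(logging.Filter):
    """Attach the current request's token (abbreviated) to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        token = token_ctx.get(None)
        record.request_id = (token or "anonymous")[:8]
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestContextFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s"))
logging.getLogger().addHandler(handler)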
Best Practices Followed
- `httpx.AsyncClient` for outbound requests (a shared-client sketch follows this list)
- `@wraps` decorators for consistent exception handling
- Timeout tuning for long RAG flows
- All routes mounted under `/sse` for clean transport separation
- Gunicorn config uses `UvicornWorker` + thread tuning
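For the first and third items, a single shared `httpx.AsyncClient` with explicit timeouts keeps connections pooled and gives long RAG flows room to finish. The helper below is only a sketch of what something like the `_ask_needl_request` referenced earlier could look like; the URL and timeout values are placeholders, not our production settings.
import httpx

# One client per process: connection pooling + a consistent timeout policy
http_client = httpx.AsyncClient(
    timeout=httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0),
)

async def _ask_needl_request(payload: dict, token: str) -> dict:
    response = await http_client.post(
        "https://api.needl.ai/ask",  # placeholder URL for illustration
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    return response.json()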
🤔 Challenges & Learnings
- Sparse Documentation: Very little online content or community knowledge around implementing remote MCP. Even LLMs struggled to assist.
- Hard to Debug: Without clear references, diagnosing transport issues or tool behavior required deep dives.
- Async Everything: Reliability came from fully embracing async at all layers — transport, tools, HTTP clients, middleware.
- Error Handling: Carefully wrapped all tool logic to prevent server crashes and propagate clear failure states.
- Frequent Library Updates: Had to track and adopt updates to `fastmcp` to improve SSE reliability and reduce connection drops.
- Token Management: Ensuring per-request isolation and secure propagation via `contextvars` was essential for correctness.
👨‍💻 Developer Guide: Building Your Own MCP Server
This guide is for developers who want to build and deploy their own Remote MCP server. Whether you’re working on internal LLM orchestration, agent frameworks, or enterprise tooling — here’s a clear path.
1. Set Up Your Project
mkdir my-mcp-server && cd my-mcp-server
python -m venv venv && source venv/bin/activate
pip install fastapi fastmcp httpx uvicorn starlette
2. Define Token Context
Use `contextvars` to store per-request token safely.
# context.py
import contextvars
token_ctx = contextvars.ContextVar("token")
3. Auth Middleware
# middlewares/auth.py
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware
from context import token_ctx
class AuthenticateRequestMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        token = request.headers.get("Authorization", "").replace("Bearer ", "")
        token_ctx.set(token)
        if not token:
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        return await call_next(request)
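This middleware only checks that a token is present. If you validate tokens against a central auth service (as recommended in the scaling tips later), a hedged sketch of such a helper, with a hypothetical endpoint, could be:
# auth.py (sketch)
import httpx

AUTH_SERVICE_URL = "https://auth.example.com/validate"  # hypothetical endpoint

async def validate_token(token: str) -> bool:
    """Return True if the central auth service accepts this token."""
    if not token:
        return False
    async with httpx.AsyncClient(timeout=5.0) as client:
        resp = await client.get(
            AUTH_SERVICE_URL,
            headers={"Authorization": f"Bearer {token}"},
        )
    return resp.status_code == 200
The middleware's `dispatch` would then `await validate_token(token)` and reject the request with a 401 when it returns False.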
4. Initialize FastMCP
# server.py
from fastapi import FastAPI
from fastmcp import FastMCP
from starlette.routing import Mount
from starlette.applications import Starlette
from starlette.middleware import Middleware
from middlewares.auth import AuthenticateRequestMiddleware
from context import token_ctx
mcp = FastMCP("My Remote MCP")
# Register tools or prompts here (see next step)
def create_sse_app():
    mcp_app = mcp.http_app(
        path="/sse",
        transport="sse",
        middleware=[Middleware(AuthenticateRequestMiddleware)],
    )
    return Starlette(routes=[Mount("/", app=mcp_app)], lifespan=mcp_app.router.lifespan_context)
app = FastAPI()
app.mount("/", create_sse_app())
if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "server:app",
        host="0.0.0.0",
        port=8000,  # replace with your configured port
        loop="asyncio",
    )
5. Register Your Tools
# tools.py
from context import token_ctx
from pydantic import BaseModel
import httpx  # used by real tools that make outbound calls
from server import mcp  # the FastMCP instance created in server.py

class ResponseModel(BaseModel):
    result: str

@mcp.tool(name="echo_tool", description="Echoes the input message.")
async def echo_tool(message: str) -> ResponseModel:
    token = token_ctx.get()
    return ResponseModel(result=f"Echo: {message}, Token: {token}")
6. Run the Server (Production)
#!/bin/bash
set -e
TIMEOUT=180
WORKERS=1
THREADS=8
PORT=${PORT:-8000}
gunicorn -b 0.0.0.0:$PORT \
    -k uvicorn.workers.UvicornWorker \
    src.askneedl_mcp_server.server:app \
    --timeout $TIMEOUT \
    --workers $WORKERS \
    --threads $THREADS \
    --max-requests 1000 \
    --max-requests-jitter 100 \
    --log-level "info"
7. Local Testing
One way is to use the official MCP Inspector tool, which offers an interactive interface to test and debug your remote MCP server's liveness, resources, prompts, and tools — all without writing extra code. It supports direct inspection of locally developed or packaged servers via npx, making it ideal for rapid iteration and validation during development.
Another way to test your MCP server locally, especially with LLM agents like Claude, is the `mcp-remote` npm package. This acts as a shim/proxy that:
- Speaks MCP over `stdio` to the local LLM client
- Connects to your MCP server over HTTP + SSE
Even if your MCP server is hosted remotely, the Claude desktop app does not allow remote integrations unless you're on the Claude Max plan. Using `mcp-remote`, however, you can bridge that gap on any plan.
Here’s how:
{
  "mcpServers": {
    "my_mcp_server": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote@latest",
        "http://0.0.0.0:8000/sse",
        "8000",
        "--allow-http",
        "--transport",
        "sse-only",
        "--header",
        "Authorization: Bearer ${API_KEY}"
      ],
      "env": {
        "API_KEY": "<your_api_key>"
      }
    }
  }
}
How this works:
- `mcp-remote` launches a local stdio listener
- Forwards every MCP request to your actual FastAPI+SSE MCP server
- Handles headers (like `Authorization`) automatically
- Makes your remote tools/prompts callable from Claude
This is a great way to test your tools locally before deploying them.
Learn more: npmjs.com/package/mcp-remote
Tips for Production
- Use `gunicorn` with `UvicornWorker`
- Monitor memory usage during long SSE sessions
- Use pydantic models for input validation
- Integrate structured logging (JSON format)
- Automate tool discovery with decorators
Bonus: Dev Tips
- Organize by responsibility: keep tools, prompts, auth, config in separate modules
- Start with one working tool: prove end-to-end SSE flow early
- Log everything: especially inputs/outputs at the tool level
- Stay close to the `fastmcp` changelog: fixes and features ship fast
- Write client simulators: test from both LLM and raw HTTP/SSE (a minimal simulator is sketched below)
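For that last point, a minimal raw HTTP/SSE probe built on `httpx` (the URL, port, and key here are placeholders) is enough to confirm the transport and auth header work before wiring up an LLM client:
import asyncio
import httpx

async def probe_sse(url: str, api_key: str) -> None:
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "text/event-stream"}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", url, headers=headers) as response:
            print("status:", response.status_code)
            async for line in response.aiter_lines():
                print(line)           # raw SSE frames: "event: ..." / "data: ..."
                if line.startswith("data:"):
                    break             # first event received; the transport works

asyncio.run(probe_sse("http://localhost:8000/sse", "<your_api_key>"))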
Bonus: Tips for Scaling Further
- ✅ Add observability: log every tool invocation with user/session ID
- ✅ Use typed payloads everywhere with `pydantic`
- ✅ Validate request tokens against a central auth service
- ✅ Add rate-limiting per API key for multi-tenant safety (a minimal sketch follows this list)
- ✅ Provide tool discovery via OpenAPI schema or `/metadata` endpoint
These help ensure the MCP server is secure, stable, and extensible for broader use.
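As a sketch of the rate-limiting idea — in-memory, so it only covers a single worker; a shared store such as Redis would be needed across workers — a Starlette middleware could look like:
import time
from collections import defaultdict, deque

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

class RateLimitMiddleware(BaseHTTPMiddleware):
    """Allow at most `limit` requests per API key per `window` seconds."""
    def __init__(self, app, limit: int = 60, window: float = 60.0):
        super().__init__(app)
        self.limit, self.window = limit, window
        self.hits: dict[str, deque] = defaultdict(deque)

    async def dispatch(self, request, call_next):
        key = request.headers.get("Authorization", "anonymous")
        now = time.monotonic()
        hits = self.hits[key]
        while hits and now - hits[0] > self.window:
            hits.popleft()            # drop requests older than the window
        if len(hits) >= self.limit:
            return JSONResponse({"error": "Rate limit exceeded"}, status_code=429)
        hits.append(now)
        return await call_next(request)
You would register it alongside the auth middleware, e.g. `app.add_middleware(RateLimitMiddleware, limit=60, window=60.0)`.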
🤖 Conclusion
MCP transformed AskNeedl from a chatbot into a platform.
- Now, external agents can discover and invoke our tools.
- Prompt logic lives where it should — in a managed service.
- Streaming outputs feel fast and usable.
- All of it is async, robust, and production-tested.
And most importantly: it’s not just about sending better prompts. It’s about unlocking the full value of your LLM system by separating concerns, adding structure, and thinking like a protocol designer.
If you’re building LLM infra, it might be time to ship your own MCP server.
You can also check this post on Medium.