by Ruchika Chourey
In Part 1, I outlined common patterns of perceived hallucination in our RAG system — like missing citations or vague answers that broke user trust.
This post continues that journey, focusing on how we acted on those signals — using real user feedback, lightweight QA loops, and product–ML collaboration, all without retraining the model.
Turning Patterns into Progress — The Feedback Loop
Spotting patterns is only the first half of the job. The real impact comes from translating those insights into structured, actionable feedback that product and ML teams can rally around.
As we cataloged hallucination-adjacent failures — missing citations, incomplete answers, wrong references — we realized that the system wasn’t broken. It was behaving exactly as designed. But it wasn’t meeting user expectations.
And that gap had to be bridged — not just through model updates, but through product rigor and close collaboration with our users.
🛠️ What We Built: A Lightweight, High-Signal QA Loop
We didn’t have an automated hallucination evaluator or gold-standard benchmark suite. But what we did have was usage data, internal testing, and a clear understanding of what our enterprise users considered “trustworthy.”
To turn observations into progress, we built a high-signal manual feedback loop that included:
- Logging all queries across internal and pilot sessions, then focusing on failed or suspicious queries
- Annotating each case: “hallucination,” “citation missing,” “partial answer,” or “retrieval gap”
- Creating themed QA sets — e.g., “stake change over 6 months,” “revenue drivers,” “SEBI circular compliance” — based on recurring pain points
- Sharing these batches with the ML team for model and retrieval iteration
This turned scattered frustration into structured insight. I set up a simple shared sheet to log issues, link the correct documents, tag themes, and review them regularly. No fancy tools — just discipline and iteration.
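For anyone who wants something slightly more structured than a sheet, here is a minimal sketch of what each logged entry looked like in spirit. The field names and tags are illustrative, not our exact schema:

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative tags; our real sheet used similar labels.
TAGS = {"hallucination", "citation missing", "partial answer", "retrieval gap"}

@dataclass
class QAIssue:
    query: str                      # the user query that failed
    observed_answer: str            # what the system returned
    expected_docs: list[str]        # documents that should have grounded the answer
    tags: set[str] = field(default_factory=set)
    theme: str = ""                 # e.g., "stake change over 6 months"
    logged_on: date = field(default_factory=date.today)

    def __post_init__(self):
        unknown = self.tags - TAGS
        if unknown:
            raise ValueError(f"Unknown tags: {unknown}")

# Example entry
issue = QAIssue(
    query="What changed in promoter stake over the last 6 months?",
    observed_answer="Promoter stake changed recently.",
    expected_docs=["shareholding_pattern_Q4.pdf"],
    tags={"partial answer", "citation missing"},
    theme="stake change over 6 months",
)
```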
We’ve recently integrated our MCP setup to partially automate response evaluation — adding structure to what was previously a manual loop. While we’re using it internally for QA, the same system powers how AskNeedl routes insights into reports, dashboards, and decision systems — turning grounded answers into actual outcomes.
It’s not a fully autonomous QA system — but it’s no longer just spreadsheets and gut feel either.
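As an illustration of what that partial automation can look like (a generic sketch, not our exact MCP integration), even a rule-based check can flag answers that cite nothing, or cite documents the retriever never returned, before a human reviews them:

```python
def flag_answer(answer_citations: list[str], retrieved_doc_ids: list[str]) -> list[str]:
    """Return a list of red flags for a single answer.

    answer_citations: document IDs the answer claims to cite (assumed extracted upstream).
    retrieved_doc_ids: document IDs the retriever actually returned.
    """
    flags = []
    if not answer_citations:
        flags.append("citation missing")
    unknown = [c for c in answer_citations if c not in retrieved_doc_ids]
    if unknown:
        flags.append(f"cites documents not in retrieval set: {unknown}")
    return flags

# Example: the answer cites a document the retriever never surfaced.
print(flag_answer(["annual_report_2023.pdf"], ["q3_results.pdf", "investor_call_transcript.pdf"]))
```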
Bringing Users into the Loop: Early Adopters as Ground Truth
We actively worked with a few early adopter teams — compliance officers, market analysts, and documentation experts — who were already using AskNeedl in production or pilot settings. Their feedback became essential in shaping what we tagged as:
- Acceptable vs unacceptable paraphrasing
- Sufficient vs insufficient citation coverage
- Answers that “felt right” vs answers that sounded confident but were incomplete
In many cases, they weren’t pointing out factual errors — they were flagging breaks in trust. And that distinction shaped how we evaluated and prioritized issues internally.
These early users didn’t just report bugs — they taught us what “truthful” means in context.
Collaboration: PM x ML x Users = Targeted Optimization
By narrowing the problem space and attaching clear, user-validated examples, we enabled the ML team to:
- Adjust retrieval strategies (e.g., prioritizing recent disclosures)
- Tune prompts for clarity in date range and entity matching
- Improve citation formatting and fallback logic
Instead of “model is wrong,” the message became:
“Here’s what this user expected, why this output felt unreliable, and what could have made it better.”
I often found myself translating between what users said — “this feels vague” — and what the ML team needed to hear, like “retrieval precision dropped due to a fuzzy match between HDFC Bank and HDFC Securities.” That translation layer became part of the product muscle.
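To make the retrieval point concrete, here is a minimal sketch of one such adjustment: re-ranking retrieved passages with a recency boost so that newer disclosures outrank older ones when relevance scores are close. The half-life and weights are illustrative knobs, not production values.

```python
from datetime import date, timedelta

def rerank_with_recency(passages: list[dict], half_life_days: float = 180.0) -> list[dict]:
    """Re-rank retrieved passages, boosting recent disclosures.

    Each passage carries a base relevance 'score' in [0, 1] and a 'published' date.
    The boost decays exponentially with age; half_life_days and the 0.8/0.2 weights
    are assumptions for illustration.
    """
    today = date.today()
    for p in passages:
        age_days = (today - p["published"]).days
        recency = 0.5 ** (age_days / half_life_days)      # 1.0 today, 0.5 after one half-life
        p["combined"] = 0.8 * p["score"] + 0.2 * recency  # assumed weighting
    return sorted(passages, key=lambda p: p["combined"], reverse=True)

# Example: a slightly less relevant but much newer disclosure outranks an older one.
passages = [
    {"doc": "annual report (two years old)", "score": 0.82,
     "published": date.today() - timedelta(days=730)},
    {"doc": "latest quarterly disclosure", "score": 0.78,
     "published": date.today() - timedelta(days=10)},
]
for p in rerank_with_recency(passages):
    print(p["doc"], round(p["combined"], 3))
```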
Prioritization: Not All Failures Are Equal
As a PM, your job isn’t just to surface what’s broken — it’s to decide what’s worth fixing now. We focused on:
- High-frequency queries with repeat issues
- High-trust personas (e.g., compliance teams, investor relations) who needed grounded outputs
- High-sensitivity topics like stock movement, regulatory changes, or stakeholder action
This let us direct ML and engineering effort where it mattered most: not toward model elegance, but toward restoring business-critical trust.
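None of this requires heavy tooling. A rough triage score, with admittedly made-up weights, is enough to keep the queue honest and the conversation objective:

```python
def triage_score(frequency: int, persona_trust: int, sensitivity: int) -> int:
    """Rough priority score for a logged issue.

    frequency:     how many times the failing query pattern appeared (count)
    persona_trust: 1-3, higher for high-trust personas (compliance, investor relations)
    sensitivity:   1-3, higher for sensitive topics (stock movement, regulatory changes)
    The weighting is illustrative; the point is a consistent ordering, not a precise number.
    """
    return frequency * persona_trust * sensitivity

issues = [
    ("vague stake-change summary", triage_score(frequency=14, persona_trust=3, sensitivity=3)),
    ("missing citation on FAQ-style query", triage_score(frequency=30, persona_trust=1, sensitivity=1)),
]
for name, score in sorted(issues, key=lambda x: x[1], reverse=True):
    print(score, name)
```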
Measuring Quality When There’s No Ground Truth
One of the most complex challenges in AI product development — especially with large language models and RAG systems — is evaluating output quality when there’s no clear right answer.
In traditional software, correctness is binary. In AI, it’s often a matter of:
- Completeness (“Did it cover all the points?”)
- Factuality (“Was this pulled from a real source?”)
- Relevance (“Was this the right data to show?”)
- Trustworthiness (“Does this feel reliable to the user?”)
We faced all of these — and often, at once.
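One lightweight way to make these dimensions operational is to score each of them separately instead of collapsing everything into a single pass/fail. A sketch, where the 0-2 scale is an assumption rather than an industry standard:

```python
from dataclasses import dataclass

@dataclass
class QualityRubric:
    """Each dimension scored 0 (fails), 1 (partial), 2 (meets expectations)."""
    completeness: int     # did it cover all the points?
    factuality: int       # was it pulled from a real source?
    relevance: int        # was this the right data to show?
    trustworthiness: int  # does it feel reliable to the user?

    def weakest_dimension(self) -> str:
        scores = vars(self)
        return min(scores, key=scores.get)

# Example: a fluent answer that skipped half the filing.
r = QualityRubric(completeness=0, factuality=2, relevance=2, trustworthiness=1)
print(r.weakest_dimension())  # -> "completeness"
```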
The Reality: You Can’t Always Define “Correct”
Many user queries — especially in enterprise search — were broad or open-ended:
“What are the key risk disclosures in the latest filings?”
“Has the company responded to the recent allegation?”
“What’s the company’s outlook for next quarter?”
There wasn’t one perfect answer, and no benchmark dataset to score against. What mattered was whether the answer:
- Was grounded in actual documents
- Contained relevant detail
- Could be traced back to a known source
In other words, quality wasn’t just factuality — it was auditability.
Our Evaluation Strategy
In the absence of automated benchmarks, we built a multi-layered, semi-manual evaluation loop:
1. Test Sets Built Around Real Queries
We created a bank of ~200 task-specific queries, each tagged with:
- The expected behavior (e.g., what documents to retrieve)
- What would count as a “red flag” (e.g., no date, wrong company, vague summary)
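As a sketch, each entry in that bank can be represented as a structured record like the one below; the fields and the sample query are illustrative, not our exact format:

```python
from dataclasses import dataclass, field

@dataclass
class QATestCase:
    query: str
    expected_docs: list[str]          # documents retrieval should surface
    expected_behavior: str            # what a good answer must do
    red_flags: list[str] = field(default_factory=list)

cases = [
    QATestCase(
        query="How has promoter stake changed over the last 6 months?",
        expected_docs=["shareholding_pattern_Q3.pdf", "shareholding_pattern_Q4.pdf"],
        expected_behavior="Quote both quarters, state the date range, cite both filings.",
        red_flags=["no date range", "wrong company", "vague summary", "single-quarter answer"],
    ),
    # ...~200 such cases, grouped by theme
]
```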
2. Live Usage Reviews
We routinely sampled real user sessions:
- Did they reformulate the same question?
- Did they stop after an answer or click the source?
- Were citations shown and clicked?
This usage behavior became a proxy for satisfaction and, indirectly, trust.
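These behavioral signals are straightforward to compute from session logs. A sketch, assuming a simple event log with hypothetical event names:

```python
from collections import Counter

def usage_proxies(events: list[dict]) -> dict:
    """Compute rough trust proxies from a session's event log.

    Each event is a dict like {"type": "query", "text": ...} or
    {"type": "citation_click"}; the event names are assumptions for illustration.
    """
    queries = [e["text"].strip().lower() for e in events if e["type"] == "query"]
    repeats = sum(c - 1 for c in Counter(queries).values())  # repeated identical queries, a crude reformulation proxy
    answers = sum(1 for e in events if e["type"] == "answer_shown")
    clicks = sum(1 for e in events if e["type"] == "citation_click")
    return {
        "reformulation_count": repeats,
        "citation_click_rate": clicks / answers if answers else 0.0,
    }

session = [
    {"type": "query", "text": "key risk disclosures in latest filings"},
    {"type": "answer_shown"},
    {"type": "query", "text": "key risk disclosures in latest filings"},  # asked again: a trust signal
    {"type": "answer_shown"},
    {"type": "citation_click"},
]
print(usage_proxies(session))  # {'reformulation_count': 1, 'citation_click_rate': 0.5}
```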
3. Early User Feedback as Truth Proxy
We looped in early users and asked:
“Would you trust this answer in a report?”
“Is anything missing?”
“Does this feel inferred or grounded?”
“Does it have all the documents you wanted?”
Their feedback formed the human layer of our quality assurance process.
Evaluation in the Industry: What Leading AI Tools Reveal
While our setup was manual and grounded in real-world QA, we also drew perspective from how leading AI products — like OpenAI’s ChatGPT, Perplexity, and Claude — approach answer quality and trust. Observing their strengths and failure patterns gave us valuable signals for where RAG systems tend to succeed — and where they often struggle.
Some common practices and themes we noted across these systems:
1. Human Evaluation is Still Central
Even the most advanced tools rely heavily on human judgment for evaluating output quality. These tools often use:
- Side-by-side comparison of model outputs
- Human scores for factuality, completeness, and usefulness
- Internal red-teaming to stress-test edge cases
This reaffirmed our decision to center real-user feedback in our own QA loops, rather than depend on automation too early.
2. Multi-Dimensional Metrics Over One-Liner Scores
Rather than relying on metrics like BLEU or accuracy alone, these tools evaluate based on:
- Faithfulness: Is the answer grounded in retrieved or source content?
- Coverage: Does the output include all relevant aspects of the user query?
- Confidence calibration: Does the tone match the certainty of the source?
This inspired us to tag answers not just as “correct/incorrect” but as:
- Fully grounded with citation
- Partially retrieved
- Fluent but unverifiable
- Factually incorrect or hallucinated
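A minimal, rule-based sketch of how such a tag could be assigned automatically from an answer's citations and the documents reviewers expected; in practice the final call still came from human reviewers, and the rules below are simplifications:

```python
def grounding_label(cited_docs: set[str], expected_docs: set[str], human_flagged_error: bool) -> str:
    """Assign one of the four tags above (simplified, illustrative rules)."""
    if human_flagged_error:
        return "factually incorrect or hallucinated"
    if not cited_docs:
        return "fluent but unverifiable"
    if expected_docs <= cited_docs:          # every document reviewers expected is cited
        return "fully grounded with citation"
    return "partially retrieved"

print(grounding_label({"q4_filing.pdf"}, {"q4_filing.pdf", "q3_filing.pdf"}, human_flagged_error=False))
# -> "partially retrieved": only one of the two expected filings is cited
```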
3. Adversarial Prompting and “Stress Tests”
Models like GPT-4 and Claude are often tested on ambiguous, multi-intent, or underspecified queries to evaluate reasoning boundaries.
This reflected our own discovery: the more ambiguous or summary-based the user query, the more fragile the RAG output became, especially when citations were missing or retrieval was partial.
Conclusion: Building Trust in AI — A PM’s Role
In traditional product management, we focus on usability, conversion, and retention. In AI product management, we add something deeper: credibility.
AI systems — especially ones built on RAG — don’t just need to answer questions. They need to do so with transparency, humility, and traceability. Because when they don’t, even correct answers can feel wrong. And trust, once lost, is hard to win back.
As product managers, we may not train the model or write the prompt, but we shape the environment in which trust is earned. That environment includes:
- What we surface and when (like citations)
- How we handle ambiguity and partial answers
- The tools we give users to verify, trace, and decide
In our journey, we learned that:
- Citation isn’t a UI element — it’s a trust contract.
- Not all hallucinations are obvious — many are silent, subtle, and pattern-driven.
- Working with users early is the fastest way to define what “truth” means in context.
- You don’t need full automation to drive quality — structure, collaboration, and prioritization go a long way.
In AI products, quality is a shared responsibility. And as PMs, we’re responsible not just for what the system says, but for what the user believes about it.
And if there’s one thing I’ve learned along the way, it’s that for ML engineers, it’s never a “bug” or an “issue.” It’s always an “optimization” 😉
You can also check this post on Medium.