Retrieval Augmented Generation for Domain-specific Question Answering (2404.14760v2)
Abstract: Question answering (QA) has become an important application in the advanced development of LLMs. General pre-trained LLMs for question-answering are not trained to properly understand the knowledge or terminology for a specific domain, such as finance, healthcare, education, and customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop an approach for retrieval-aware fine-tuning of an LLM. We showcase that fine-tuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping in context the latest retrieval information for contextual grounding.
Summary
- The paper presents a comprehensive retrieval-augmented generation (RAG) framework for domain-specific question answering, integrating data-centric retriever optimization and retrieval-aware LLM fine-tuning.
- A key contribution is the high-precision in-domain retriever fine-tuned using user behavioral data and a novel weighted cross-entropy loss, achieving significantly better performance than larger generic models.
- The retrieval-aware LLM tuning and overall architecture demonstrate a practical approach for industrial QA deployment, delivering accurate, grounded, and domain-relevant answers while addressing privacy concerns.
Retrieval Augmented Generation for Domain-specific Question Answering: A Technical Overview
"Retrieval Augmented Generation for Domain-specific Question Answering" (2404.14760) presents a comprehensive framework for constructing domain-specific QA systems, with a production deployment targeting Adobe products. The approach integrates advanced retrieval-augmented generation (RAG) techniques, data-centric retriever optimization, and retrieval-aware LLM fine-tuning to deliver accurate, up-to-date, and grounded responses within a highly dynamic product ecosystem. This summary emphasizes the system's methodological contributions, experimental results, and broader implications for domain-adapted language technologies.
System Architecture and Methodology
The framework consists of two principal components: a high-precision, in-domain document retriever and a purpose-built LLM-based generator.
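At a high level, the end-to-end flow reads as a retrieve-then-generate pipeline. The interface below is a minimal sketch only; the component names and signatures are assumptions, not the paper's code:

```python
# Illustrative two-stage pipeline: in-domain retrieval followed by
# retrieval-conditioned generation. Names and signatures are placeholders.
def answer(query: str, retriever, generator, top_k: int = 4) -> str:
    # 1. Retrieve the top-k most relevant in-domain passages / QA pairs.
    context = retriever.search(query, top_k=top_k)
    # 2. Generate an answer grounded strictly in the retrieved context.
    return generator.generate(query=query, context=context)
```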
Retriever Design and Training
A core insight is the critical impact of domain-adapted retrieval on overall QA performance. The retriever is modeled using a transformer backbone (MPNet) and is fine-tuned using user behavioral data—specifically, click logs from Adobe's Helpx support portal. The training is cast as a contrastive learning problem, leveraging a bespoke relevance metric:
- Relevance scoring uses the log of click ratio to weight query-document pairs based on real user interactions.
- Weighted cross-entropy loss incorporates these relevance scores, ensuring the model emphasizes highly relevant matches in its learned representation space.
Both queries and documents are embedded using the same encoder with mean pooling, optimizing for semantic similarity. Notably, the use of product-intent modeling for disambiguation enables the system to resolve vague or ambiguous queries, mitigating common errors caused by product name confusion.
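A minimal sketch of this training objective is given below, assuming in-batch negatives, a shared mean-pooled encoder, and click-derived weights; the exact weighting formula and negative-sampling scheme here are simplifications of the paper's description, not its implementation:

```python
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings, attention_mask):
    """Mean-pool token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def weighted_contrastive_loss(query_emb, doc_emb, click_weights, temperature=0.05):
    """In-batch contrastive cross-entropy; each query's loss is scaled by a
    click-derived relevance weight (illustrative stand-in for the paper's
    weighted cross-entropy)."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # diagonal = clicked document
    per_query = F.cross_entropy(logits, targets, reduction="none")
    return (click_weights * per_query).mean()
```

Here `query_emb` and `doc_emb` would both come from the same MPNet-style encoder via `mean_pool`, consistent with the shared-encoder design described above.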
Retrieval Index and Data Engineering
The retriever operates over a composite index, aggregating multiple data sources:
- Primary: Helpx documents, Adobe community QA.
- Secondary (derived): LLM-generated QAs from Helpx/YouTube transcripts.
A dedicated QA generation module operates in a few-shot regime, producing granular QA pairs optimized for step-wise, user-preferred instructional style. A privacy-preserving NER/regex module ensures PII sanitization at the ingestion stage, supporting deployment in privacy-sensitive settings.
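To make these two ingestion steps concrete, the sketch below shows (a) a plausible few-shot QA-generation prompt and (b) a regex-only PII scrubbing pass. Both are illustrative assumptions: the production prompt wording is not published, and the deployed module also applies NER rather than regex alone.

```python
import re

# (a) Hypothetical few-shot prompt for turning documentation/transcript
#     excerpts into step-wise QA pairs; the real exemplars are not published.
QA_GENERATION_PROMPT = """Generate a question-answer pair from the excerpt.
The answer must be step-by-step and self-contained.

Excerpt: "To rotate a page, open the Organize Pages tool and click the rotate icon."
Q: How do I rotate a page?
A: 1. Open the Organize Pages tool. 2. Select the page. 3. Click the rotate icon.

Excerpt: "{excerpt}"
Q:"""

# (b) Regex-only PII scrubbing (patterns and placeholder labels are assumptions).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def sanitize(text: str) -> str:
    """Replace matched PII spans with typed placeholders before indexing."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(sanitize("Reach me at jane.doe@example.com or (415) 555-0123."))
# -> "Reach me at [EMAIL] or [PHONE]."
```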
Generator Fine-tuning Protocol
LLM fine-tuning is conducted in a retrieval-aware manner, with the following key design choices:
- Ground-truth augmentation: Positive and negative document pairs are combined with query-answer samples, challenging the model to learn not just factual recall but contextual grounding and abstention ("This question cannot be answered at the moment") when no relevant evidence exists.
- Answer informativeness: Short, uninformative QA pairs are filtered from the training set to promote substantial, complete answer generation.
- Deduplication and ranking: Candidate QA contexts are curated with deduplication (Levenshtein/semantic similarity), source prioritization, and elimination of redundant or less credible information.
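A minimal sketch of the Levenshtein-based deduplication mentioned in the last point, using the `python-Levenshtein` package; the 0.9 threshold is an assumption, and the complementary semantic-similarity check is omitted here:

```python
import Levenshtein  # pip install python-Levenshtein

def deduplicate(candidates: list[str], threshold: float = 0.9) -> list[str]:
    """Greedily keep a candidate only if it is not near-identical to any
    already-kept context (normalized edit similarity below the threshold)."""
    kept: list[str] = []
    for text in candidates:
        if all(Levenshtein.ratio(text, prev) < threshold for prev in kept):
            kept.append(text)
    return kept
```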
The prompt engineering explicitly conditions the LLM to use only the retrieved QA context, directly addressing the tendency of foundation models to hallucinate.
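This grounding constraint can be pictured with a prompt of roughly the following shape; the wording is illustrative, not the production prompt, and the abstention string mirrors the one quoted above:

```python
PROMPT_TEMPLATE = """You are a support assistant for Adobe products.
Answer the question using ONLY the retrieved context below.
If the context does not contain the answer, reply exactly:
"This question cannot be answered at the moment."

Retrieved context:
{context}

Question: {question}
Answer:"""

prompt = PROMPT_TEMPLATE.format(
    context="...retrieved QA pairs and passages...",
    question="How do I export a PDF to Word?",
)
```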
Empirical Evaluation
Retriever Benchmarking
The in-domain retriever demonstrates strong empirical performance over industry-standard baselines, achieving nDCG@4 scores substantially above larger generic retrievers:
- 0.7281 (Acrobat), 0.6921 (Photoshop), 0.7249 (Lightroom), 0.8221 (Express) with a 109M parameter model.
- These scores outperform alternatives such as BGE-large, UAE-Large, SimCSE, and SFR-Embedding-Mistral—many of which are 3x to 10x larger.
The improvements are pronounced for both common (head) and rare (tail) queries, especially with short query formulations, which are prevalent in real user logs.
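For reference, the reported metric can be computed per query as below; this is the standard linear-gain nDCG@k, and it assumes graded relevance labels (the paper's exact gain and labeling scheme are not restated here):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gains and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, all_rels, k=4):
    """ranked_rels: relevance of docs in the order the retriever returned them.
    all_rels: relevance labels of all candidates, used for the ideal ranking."""
    ideal = dcg(sorted(all_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

print(round(ndcg_at_k([3, 0, 2, 1], [3, 3, 2, 1, 0, 0]), 2))  # 0.7
```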
Generation and End-to-End QA
Quantitative evaluation on 137 Acrobat QA pairs, with human-authored gold responses, uses GPT-4-based answer assessment as an automatic proxy. Key results:
- Finetuned Retriever + GPT-4 Generation: 0.7242 mean relevance score.
- Finetuned Retriever + Finetuned LLM: 0.5150.
- Generic GPT-4: 0.1705.
This demonstrates that the bottleneck is primarily the relevance and coverage of retrieval, and that retriever fine-tuning yields substantial downstream gains. The fine-tuned in-house generator scores below GPT-4 as a generator, highlighting the ongoing challenge of aligning smaller fine-tuned LLM outputs with human stylistic nuance.
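The GPT-4-based assessment can be imagined as an LLM-judge rubric of roughly the following shape; the actual rubric and scoring scale are not given in this summary, so this is purely a hypothetical illustration:

```python
# Hypothetical judge prompt; per-example scores would be averaged over the
# 137 evaluation pairs to produce a mean relevance score (scale assumed 0-1).
JUDGE_PROMPT = """Given a question, a human-written gold answer, and a candidate
answer, rate how relevant and faithful the candidate is to the gold answer.
Return a single score between 0 and 1.

Question: {question}
Gold answer: {gold}
Candidate answer: {candidate}
Score:"""
```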
Qualitative Findings
Expert comparison with ChatGPT underscores the system's ability to:
- Accurately resolve domain-specific product ambiguities (e.g., Adobe Firefly).
- Deliver up-to-date, product-specific answers to questions about features or pricing, where ChatGPT’s generalist nature and fixed knowledge cutoff limit its utility.
- Embed actionable content (hyperlinks, feature callouts) that is not possible with externally hosted LLMs.
Implications and Future Directions
The proposed RAG-based QA framework establishes a reusable architecture for vertical-specific assistants where knowledge volatility, privacy, and task grounding are essential. Key practical implications include:
- Industrial QA deployment: Demonstrates that high-fidelity retrieval, combined with supervision grounded in real behavioral data, is achievable with moderate model sizes and without full retraining of large LLMs.
- Privacy and security alignment: Data sanitization and in-house infrastructure facilitate compliance and control—critical in enterprise scenarios.
- Continuous domain adaptation: The modular QA pair generation allows rapid ingestion of new content, adapting to shifting product releases and documentation.
Future research may generalize this protocol to other domains (e.g., healthcare, finance), further automate and personalize query augmentation, and exploit larger-scale feedback loops for iterative improvement. The impact of advanced negative sampling regimes and the integration of retrieval-augmented confidence estimation merit deeper exploration.
Conclusion
This work presents a robust, quantitatively validated approach to domain-specific, retrieval-augmented QA that achieves both practical deployment goals and state-of-the-art domain relevance. Its technical contributions around retriever fine-tuning, behavioral-weighted supervision, and generative grounding make it a reference architecture for organizations seeking to build accurate, context-sensitive, and privacy-conscious automated assistants for proprietary knowledge domains.
Related Papers
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)
- Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases (2024)
- RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation (2024)
- QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance (2025)
- Multi-task retriever fine-tuning for domain-specific and efficient RAG (2025)