RAG and Fine-Tuning

This blog post is also available in German

TL;DR

On their own, LLMs don’t provide reliable answers for knowledge queries and they hallucinate information.
At runtime, RAG retrieves wiki pages including sources and reduces hallucinations.
Errors can stem from how the model uses the retrieved context – not from retrieval itself.
Production logs and user feedback become fine-tuning examples with context and citations.
Fine-tuning with Direct Preference Optimization (DPO) teaches the model to prioritize sources; run as a continuous loop, it keeps the RAG system stable over time.

Quick overview: RAG and fine-tuning

RAG supplies an LLM with external knowledge at runtime – for example, wiki pages, manuals, or product documentation. The system searches for relevant documents and passes them as context. This reduces hallucinations and makes answers traceable, because sources are cited and can be looked up.

Fine-tuning shapes a model’s behavior through examples. It doesn’t necessarily add new knowledge, but it can train the model to use domain-specific language, produce answers in a particular format, and follow instructions more reliably – for example, in data extraction or document classification.

On their own, each approach optimizes toward a different goal. Together, their effects compound – assuming the use case and available data are a good fit.

In this article, “fine-tuning” means supervised fine-tuning unless stated otherwise, since other variants exist. “Supervised” here means the model is given one or more expected answers for each prompt during training.

The use case: knowledge queries at an insurance company

An insurance company uses an LLM to make its internal wiki more accessible to employees. Typical questions include: Which policy applies to plan X? Where can I find information about claim class Y? Summarize the key points from guideline Z.

Here’s how it works technically: the wiki is indexed and stored in a data source. That data source can be anything – a vector database, an MCP server, or a relational database. A RAG pipeline finds the right pages for each query and passes them as context to the LLM, which generates answers or summaries from that context.
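To make the flow concrete, here is a minimal sketch of the retrieve-then-prompt step. It assumes a toy in-memory list of pages and plain word overlap as the relevance score; a real setup would query whichever data source is in use. The names WikiPage, retrieve, and build_prompt, as well as the page ids and texts, are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiPage:
    doc_id: str  # hypothetical identifier, reused for citations
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return {word.strip(".,:;?!()").lower() for word in text.split()} - {""}


def retrieve(query: str, pages: list[WikiPage], top_k: int = 2) -> list[WikiPage]:
    """Rank pages by the number of words they share with the query (toy scoring)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p.title + " " + p.text)), p) for p in pages]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [page for score, page in ranked[:top_k] if score > 0]


def build_prompt(query: str, context: list[WikiPage]) -> str:
    """Place every retrieved page in the prompt, labeled with its doc_id."""
    blocks = [f"[{p.doc_id}] {p.title}\n{p.text}" for p in context]
    return (
        "Answer using only the documents below and cite the doc id you used.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {query}"
    )


# Invented sample pages standing in for the indexed wiki.
PAGES = [
    WikiPage("wiki-101", "Household plan", "Coverage rules for the household plan."),
    WikiPage("wiki-102", "Claim classes", "Overview of claim classes and the handling rules."),
    WikiPage("wiki-103", "Travel policy", "Internal travel expense guideline."),
]

question = "Which rules apply to the household plan?"
hits = retrieve(question, PAGES)
prompt = build_prompt(question, hits)
```

Note that two pages end up in the context here, although only one of them actually answers the question. That is exactly the situation the rest of this article deals with.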

The problem

On paper, the system works. The right pages are in the context. But in practice, the LLM overlooks relevant documents, overweights irrelevant passages, and answers end up incomplete or drawn from the wrong sections.

The issue isn’t retrieval – it’s how the model uses context during generation. Put differently: the model receives the right documents but doesn’t reliably know which one matters for the question at hand. The information isn’t missing – what’s missing is the ability to prioritize, weight, and correctly cite within the given context.
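One way to check this diagnosis is to separate retrieval misses from context misuse on a small labeled sample. The sketch below assumes that for each sampled query someone has marked the page that should have been used, and that cited ids can be read from the answer; classify_failure and the sample data are hypothetical.

```python
from collections import Counter


def classify_failure(expected_doc_id: str, retrieved_ids: list[str], cited_ids: list[str]) -> str:
    """Tell apart 'page never retrieved' from 'page retrieved but not used'."""
    if expected_doc_id not in retrieved_ids:
        return "retrieval_miss"
    if expected_doc_id not in cited_ids:
        return "context_misuse"
    return "ok"


# Invented sample: (expected page, retrieved pages, pages cited in the answer)
sample = [
    ("wiki-101", ["wiki-101", "wiki-102"], ["wiki-101"]),
    ("wiki-101", ["wiki-101", "wiki-102"], ["wiki-102"]),
    ("wiki-101", ["wiki-102", "wiki-103"], ["wiki-102"]),
    ("wiki-104", ["wiki-104", "wiki-101"], []),
]

counts = Counter(classify_failure(*row) for row in sample)
```

If "context_misuse" dominates such a count, improving the retriever alone is unlikely to help much.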

From RAG outputs to fine-tuning data

In production, every query the system answers already contains the building blocks for fine-tuning. For each query, the system logs the prompt, the retrieved documents along with their source references, the final model response, and optional user feedback – whether the answer was accepted, rejected, corrected, or annotated with free text.
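A log record along these lines could be modeled as follows. This is a sketch, not the company's actual schema: the class names, field names, and the JSON Lines serialization are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RetrievedDoc:
    doc_id: str      # hypothetical page identifier
    source_ref: str  # where a reader can look the page up
    text: str


@dataclass
class Feedback:
    verdict: str  # "accepted", "rejected", or "corrected"
    corrected_answer: Optional[str] = None
    comment: Optional[str] = None  # free-text annotation


@dataclass
class QueryLog:
    prompt: str
    retrieved: list[RetrievedDoc]
    response: str
    feedback: Optional[Feedback] = None  # feedback is optional


def to_jsonl_line(log: QueryLog) -> str:
    """Serialize one record as a single JSON line (nested dataclasses included)."""
    return json.dumps(asdict(log), ensure_ascii=False)


# Invented example record.
log = QueryLog(
    prompt="Which policy applies to plan X?",
    retrieved=[
        RetrievedDoc("wiki-101", "Wiki > Plans > Plan X", "Plan X is governed by policy P-7."),
        RetrievedDoc("wiki-102", "Wiki > Plans > Plan Y", "Plan Y is governed by policy P-9."),
    ],
    response="Plan X is governed by policy P-9 [wiki-102].",
    feedback=Feedback(verdict="rejected", comment="Cites the wrong page."),
)
line = to_jsonl_line(log)
```

Keeping the retrieved documents in the record matters: without them, a later training example could not show the model which context it had to choose from.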

These elements are assembled into fine-tuning examples that teach the model how to use context – not what the correct fact is.

The key: feedback from production

The insurance company systematically collects feedback. Users rate answers as helpful or unhelpful, flag incorrect or incomplete results, and provide free-text comments to improve responses. This creates a dataset of prompts, context documents, and preferred versus rejected answers.
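As a rough sketch, preferred-versus-rejected pairs could be derived from such feedback like this. The input is assumed to be a list of plain dicts with the keys shown; the pairing rule (same prompt and same context documents) and the output field names prompt, chosen, and rejected are assumptions, not the company's documented pipeline.

```python
from collections import defaultdict


def build_preference_pairs(logs: list[dict]) -> list[dict]:
    """Pair good and bad answers that were given for the same prompt and context."""
    chosen = defaultdict(list)
    rejected = defaultdict(list)
    for log in logs:
        key = (log["prompt"], tuple(log["doc_ids"]))
        verdict = log["verdict"]
        if verdict == "accepted":
            chosen[key].append(log["response"])
        elif verdict == "rejected":
            rejected[key].append(log["response"])
        elif verdict == "corrected":
            # A user correction yields a pair on its own.
            chosen[key].append(log["corrected_answer"])
            rejected[key].append(log["response"])

    pairs = []
    for key, good_answers in chosen.items():
        prompt, doc_ids = key
        for good in good_answers:
            for bad in rejected.get(key, []):
                if good != bad:
                    pairs.append(
                        {"prompt": prompt, "doc_ids": list(doc_ids), "chosen": good, "rejected": bad}
                    )
    return pairs


# Invented feedback records.
logs = [
    {"prompt": "Which policy applies to plan X?", "doc_ids": ["wiki-101", "wiki-102"],
     "response": "Policy P-9 applies [wiki-102].", "verdict": "rejected"},
    {"prompt": "Which policy applies to plan X?", "doc_ids": ["wiki-101", "wiki-102"],
     "response": "Policy P-7 applies [wiki-101].", "verdict": "accepted"},
    {"prompt": "Where is claim class Y described?", "doc_ids": ["wiki-103"],
     "response": "It is not documented.", "verdict": "corrected",
     "corrected_answer": "Claim class Y is described in [wiki-103]."},
    {"prompt": "Summarize guideline Z.", "doc_ids": ["wiki-104"],
     "response": "Guideline Z covers travel [wiki-104].", "verdict": "accepted"},
]

pairs = build_preference_pairs(logs)
```

The last record produces no pair because nobody rejected an answer to that prompt. Accepted answers without a rejected counterpart are still useful for supervised examples, just not for preference training.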

This feedback loop is essential for stabilizing the system over time. Crucially, users need to be actively encouraged to give feedback. Clear messaging about why feedback matters, low-friction ways to provide it directly in the interface, and structured feedback formats all help ensure the loop actually delivers value.

If it’s unclear why and in what form feedback helps – or if providing it feels like too much effort – users will quickly lose motivation, and the feedback loop will break down.

The fine-tuning dataset (from the field)

A suitable fine-tuning data point consists of the prompt, the multiple documents loaded into the context, and an expected answer – with explicit attribution of which document was actually used.
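For illustration, such a data point might look like the following. All ids, texts, and field names are invented, and the bracketed citation format is an assumption; the small validator only checks that the attribution is consistent with the answer and the context.

```python
import re

# Hypothetical data point: three documents in the context, one of them used.
example = {
    "prompt": "Which policy applies to plan X?",
    "documents": [
        {"doc_id": "wiki-101", "text": "Plan X is governed by policy P-7."},
        {"doc_id": "wiki-102", "text": "Plan Y is governed by policy P-9."},
        {"doc_id": "wiki-103", "text": "Claim class Y covers water damage."},
    ],
    "expected_answer": "Plan X is governed by policy P-7 [wiki-101].",
    "used_doc_ids": ["wiki-101"],
}


def cited_doc_ids(answer: str) -> list[str]:
    """Extract ids written as [doc-id] from an answer."""
    return re.findall(r"\[([\w-]+)\]", answer)


def validate(data_point: dict) -> None:
    """Raise ValueError if the attribution does not match answer and context."""
    context_ids = {doc["doc_id"] for doc in data_point["documents"]}
    cited = set(cited_doc_ids(data_point["expected_answer"]))
    if cited != set(data_point["used_doc_ids"]):
        raise ValueError("used_doc_ids does not match the citations in the answer")
    if not cited <= context_ids:
        raise ValueError("answer cites a document that is not in the context")


validate(example)  # passes silently for a consistent data point
```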

Here’s what matters about this data point: multiple documents are present in the context, but only one is referenced in the answer. The model learns to select, prioritize, and cite correctly – not to memorize documentation, but to distinguish relevant from irrelevant context under RAG conditions. This directly addresses the observed failure mode: relevant documents are retrieved but used incorrectly during generation.

Fine-tuning with Direct Preference Optimization

Instead of classic supervised fine-tuning, the insurance company uses Direct Preference Optimization (DPO). DPO is a fine-tuning variant where the model doesn’t just learn what a good answer looks like – it simultaneously learns what makes an answer poor. For each prompt, the model sees a preferred answer paired with a rejected one during training; a prompt with several rejected answers simply yields several such pairs. This doesn’t just reinforce good answers – it also makes poor ones less likely.
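To show what this means numerically, here is the standard DPO loss for a single (preferred, rejected) pair in plain Python. The objective is defined per pair, so several rejected answers for one prompt translate into several pairs. The inputs are summed log-probabilities of each full answer under the model being trained and under a frozen reference model; the example values and beta=0.1 are made up, and in practice a training library would compute this over batches.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_shift - rejected_shift)))."""
    # How much more likely each answer became compared to the reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities for illustration.
unchanged = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # model identical to reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)    # preferred up, rejected down
worsened = dpo_loss(-12.0, -8.0, -10.0, -10.0)    # the opposite
```

The loss falls only when the preferred answer gains probability relative to the rejected one, which is what pushes poor answers down rather than merely pulling good ones up.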

For this use case, that’s a decisive advantage: the focus is on decisions, not facts. The model learns which documents are relevant for which prompts, how strongly to weight them, and when to ignore information entirely. It doesn’t memorize the wiki – it learns to use the provided context correctly.

Why RAG and fine-tuning work well together

RAG and fine-tuning solve different problems: RAG delivers context and traceable source citations, because it draws directly on current data that we provide. Fine-tuning with DPO controls how the model handles that context, stabilizing its behavior based on real user preferences. The two complement each other, so the company knowledge provided is reflected more reliably in answers.

Conclusion

If you plan to run LLMs long-term in a similar use case, think of RAG and fine-tuning not as competing alternatives but as approaches that combine naturally. And don’t treat fine-tuning as a one-off training run. It’s a continuous process, tightly coupled to real usage. The fine-tuning dataset needs regular updates, and the human feedback loop has to be maintained. That’s how the system grows more reliable over time and makes better decisions, even as new documents are added.