13 May 2025
How Does a RAG Work?
9-minute read

Recent years have seen the democratisation of LLMs (Large Language Models), language models that generate text based on general-purpose data. Interest in them now extends far beyond the boundaries of tech and mathematics. Their applications are numerous and varied, ranging from text generation to natural language understanding and translation. In short, they are powerful tools that keep developing and improving, but they still suffer from certain limitations that we will address in this article.
Introduction
With the democratisation of these tools, many people have been able to experiment and use 'generic' language models. I specify 'generic' because these LLMs are trained on general data and are therefore not specialised in a specific field. This can quickly become a barrier when one wishes to generate precise and contextual responses. Organisations wishing to use LLMs for their needs have had to find ways to improve the quality of generated responses. Two solutions have emerged:
- Fine-Tuning on a generic model
- RAG (Retrieval-Augmented Generation)
In this article, we will briefly cover the drawbacks of the first solution, Fine-Tuning, then concentrate on the second, RAG: how it can be used to improve the quality of an LLM's responses and how to implement it. We will explain in detail how a RAG works and how to set one up in general terms. There are, however, many variations of this system, depending on the language models and/or platforms used. The idea is to understand the general principles and apply them to specific cases.
TL;DR
A RAG is a system that enriches our questions to an LLM with contextual information. To do this, it uses embeddings (or vector embeddings) of documents and a vector database to store these vectors. This involves setting up and filling the vector database with our documents beforehand. Then, when a question is sent to the LLM, it receives not only the original question but also the context of the documents closest to it, thanks to vector search (or similarity search).
Fine-Tuning
LLMs are language models trained on general data, which allows them to generate highly effective responses. However, it is possible to train them further on domain-specific data, enabling them to generate more precise and contextual responses. This can be seen as additional training of the model on specific data. It can be an effective solution, but it comes with downsides:
- Training can be time-consuming and resource-intensive
- Training is done on static data, so the model needs to be updated regularly to keep knowledge up to date
- Training data management can be complex and tedious
Clearly, this solution has its uses, as it allows for highly effective responses in certain applications. But here we will consider a Question/Answer-focused usage, which should answer a question in a precise and contextual manner. To do this, we will delve into and explore another solution: the RAG.
For specific needs, it is possible to combine both solutions by using a fine-tuned model within a RAG.
Retrieval-Augmented Generation
A RAG, or Retrieval-Augmented Generation, is a system that enriches our questions to an LLM with contextual information. To do this, it uses embeddings (or vector embeddings) of documents and a vector database to store these vectors. This involves setting up and filling the vector database with our documents beforehand. Then, when a question is sent to the LLM, it receives not only the original question but also the context of the documents closest to it, thanks to vector search (or similarity search).
How does a RAG work?
As mentioned earlier, an important part of a RAG is setting up and filling the vector database with our data. This step is crucial for the proper functioning of the RAG. We will return in detail to the different ways of creating these embeddings.
Once these embeddings are created, when a question is sent to the LLM, the question is vectorised and compared to the vectors of the documents in the vector database. This comparison retrieves the most relevant pieces of documents, which are then sent to the LLM along with the original question.
The LLM then generates a response based on the provided context.
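Put together, the whole flow can be sketched in a few lines of TypeScript. The helper names below (embedQuestion, searchSimilarChunks, callLLM) are hypothetical placeholders; concrete versions of these steps are implemented later in this article.

// Hypothetical placeholder signatures; concrete implementations follow later in the article.
declare function embedQuestion(question: string): Promise<number[]>;
declare function searchSimilarChunks(vector: number[], opts: { limit: number }): Promise<{ content: string }[]>;
declare function callLLM(input: { system: string; context: string; question: string }): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 1. Vectorise the question with the same embedding model used for the documents
  const questionVector = await embedQuestion(question);

  // 2. Retrieve the document chunks whose vectors are closest to the question vector
  const relevantChunks = await searchSimilarChunks(questionVector, { limit: 5 });

  // 3. Send the original question together with the retrieved context to the LLM
  return callLLM({
    system: "Answer using only the provided context.",
    context: relevantChunks.map((chunk) => chunk.content).join("\n---\n"),
    question,
  });
}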
Preliminary Work
Input Data
An important part of a RAG is the quality of the data used to create the embeddings. Indeed, the more relevant and contextual the data, the more precise the response generated by the LLM will be.
The origin and format of the input data can vary greatly: textual documents, images, PDFs, spreadsheets, or even data extracted directly from a database. Therefore, it's important to be able to process them uniformly and convert them into embeddings. The only constraint is that the data must be convertible into text.
In certain cases, it may be necessary to convert the data into text. For instance, if the data is images, it's possible to convert them into text using computer vision models, allowing for a textual description of the image's content. For PDFs, OCR (Optical Character Recognition) models can be used to extract the text, such as Mistral OCR.
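For illustration, and assuming you use the Vercel AI SDK (which we rely on later in this article) with a vision-capable model, turning an image into a textual description could look like the sketch below. The model name is only an example; any multimodal model can be plugged in the same way.

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { readFile } from "node:fs/promises";

// Ask a vision-capable model for a textual description of an image so that
// the result can be chunked and embedded like any other text.
export async function describeImage(path: string): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"), // example model, swap for the multimodal model of your choice
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Describe the content of this image in detail." },
          { type: "image", image: await readFile(path) },
        ],
      },
    ],
  });
  return text;
}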
Chunking
Chunking is the process of dividing our documents into smaller pieces for storage in our vector database. This step is crucial as it directly influences the performance of our RAG.
Size of Chunks
Chunk size is an important parameter to consider when implementing a RAG. One often has to experiment with various sizes and overlap levels to find the optimal configuration. Smaller chunks tend to allow for a more precise search as they contain less text padding that could dilute the semantic representation. This helps the RAG system to identify and extract relevant information more efficiently. However, this precision comes at a cost: smaller chunks increase processing time and resources needed.
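As a minimal sketch, a character-based splitter with a configurable overlap could look like this. The default sizes are arbitrary starting points to tune against your own data.

// Split a text into fixed-size chunks, each sharing `overlap` characters
// with the previous one so that context is preserved at the boundaries.
export function chunkByCharacters(text: string, chunkSize = 1000, overlap = 200): string[] {
  // Guard against configurations that would never advance
  const step = Math.max(chunkSize - overlap, 1);
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}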
Cutting Methods
Although the simplest method is to cut the text by character, other options exist depending on the use case and the document's structure:
- By tokens: to avoid exceeding token limits in API calls
- By sentences or paragraphs: to maintain chunk coherence
- By HTML headers or JSON properties: to respect document structure
- By meaningful code chunks: if you're working with code, it's often recommended to use an abstract syntax tree parser (AST)
Embeddings
Now that our data is chunked, we can convert it into embeddings, i.e. vectors that represent our data and capture its semantic meaning. To do this, we will use an embedding model, such as mistral-embed from Mistral or text-embedding-3-small from OpenAI.
There are indeed many programmatic ways to generate embeddings, but using a pre-trained embedding model allows us to generate better quality embeddings, as the model better grasps the meaning, intent, and context of a chunk. The generated vector will, therefore, be more relevant.
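As an example, here is what a generateEmbedding helper (used again in the vector search code further down) could look like with the Vercel AI SDK's embed and embedMany functions. The model is an assumption; mistral-embed or any other embedding model can be used in the same way.

import { embed, embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

// text-embedding-3-small is used here as an example embedding model
const embeddingModel = openai.embedding("text-embedding-3-small");

// Embed a single piece of text (e.g. the user's question)
export async function generateEmbedding(value: string): Promise<number[]> {
  const { embedding } = await embed({ model: embeddingModel, value });
  return embedding;
}

// Embed a batch of chunks when filling the vector database
export async function generateEmbeddings(chunks: string[]): Promise<number[][]> {
  const { embeddings } = await embedMany({ model: embeddingModel, values: chunks });
  return embeddings;
}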
Vector Database
The vector database is an essential tool for storing embeddings and retrieving them quickly. There are several solutions for implementing one: dedicated vector databases, or an existing piece of your software stack if it supports vectors. Among the most common solutions are Pinecone, Faiss, Annoy, HNSW and Milvus.
If your project already uses a PostgreSQL database, it's possible to use the pgvector extension to store and retrieve embeddings. This allows for a vector database natively integrated into your project, enabling the creation of links between your data and embeddings. This is very useful when wanting to retrieve information related to documents.
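As a sketch of the setup this implies, and assuming a Prisma client named db, the schema expected by the similarity query shown later could be created as follows. The vector dimension (1536) matches text-embedding-3-small and must match your own embedding model.

import { PrismaClient } from "@prisma/client";

const db = new PrismaClient();

// Enable the pgvector extension (one-off operation)
await db.$executeRaw`CREATE EXTENSION IF NOT EXISTS vector`;

// Store each chunk with its vector and a link back to the source document
await db.$executeRaw`
  CREATE TABLE IF NOT EXISTS embeddings (
    id          bigserial PRIMARY KEY,
    document_id bigint REFERENCES documents(id),
    content     text NOT NULL,
    embedding   vector(1536) NOT NULL
  )
`;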
Now we have our chunked data, embeddings and our vector database. We can now move on to interpreting questions and generating responses by creating a RAG Assistant.
Creating a RAG Assistant
This phase of using the RAG will depend heavily on the platform and/or framework you use. Some platforms provide a ready-made RAG assistant. Otherwise, you will need to create a RAG assistant in standalone mode by explicitly defining the necessary tools and how to use them. In either case, you will need to define the tools required to generate precise and contextual responses.
Defining Tools
By 'tool', we mean the different functions that will be used to retrieve data in the vector database and use it to generate precise and contextual responses.
Vector Search
Vector search is the function that will allow data retrieval in the vector database based on the question.
First, the question must be transformed into a vector using the same embedding model that we used to create our embeddings earlier. In this way, we can compare the question vector to all the document vectors present in the vector database using a vector search function, such as similarity with the PostgreSQL pgvector extension.
This vector search will allow us to obtain the documents closest to the question, i.e., the documents that most resemble the question.
For the example, we will use the similarity function with the pgvector extension of PostgreSQL.
export const findRelevantContent = async (userQuery: string) => {
  // Vectorise the user's question with the same embedding model as the documents
  const embedding = await generateEmbedding(userQuery);
  const vectorQuery = `[${embedding.join(",")}]`;

  // Similarity search with pgvector: `<=>` is the cosine distance operator,
  // so `1 - distance` gives a similarity score between 0 and 1
  const embeddings = await db.$queryRaw`
    SELECT
      embeddings.id,
      embeddings.content,
      documents.name,
      1 - (embedding <=> ${vectorQuery}::vector) AS similarity
    FROM embeddings
    INNER JOIN documents ON embeddings.document_id = documents.id
    WHERE 1 - (embedding <=> ${vectorQuery}::vector) > 0.5
    ORDER BY similarity DESC
    LIMIT 5;
  `;

  return embeddings;
};
Here we limit the search to 5 chunks with a similarity greater than 0.5. This value is arbitrary and can be adjusted based on the relevance of the results.
Prompt
The prompt is the text that will be sent to the LLM to generate a response. It will contain instructions for the LLM to generate a relevant response by telling it to use the vector search tool to retrieve the documents closest to the question.
This is where you will also need to tell the assistant how it should use the document's context.
For the example, we will use Vercel's AI SDK to generate a response, telling it to use the vector search tool to retrieve the documents closest to the question.
import { streamText, tool, convertToCoreMessages } from "ai";
import { z } from "zod";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = await streamText({
    // `model` is the chat model configured for your project (OpenAI, Mistral, etc.)
    model: model,
    messages: convertToCoreMessages(messages),
    system: `You are a helpful assistant. Check your knowledge base before answering any questions.
    Only respond to questions using information from tool calls.
    If no relevant information is found in the tool calls, respond, "Sorry, I don't know."`,
    tools: {
      getInformation: tool({
        description: `get information from your knowledge base to answer questions.`,
        parameters: z.object({
          question: z.string().describe("the users question"),
        }),
        // Delegate the retrieval to the vector search function defined earlier
        execute: async ({ question }) => findRelevantContent(question),
      }),
    },
    toolChoice: "required",
  });

  return result.toAIStreamResponse();
}
How you force the assistant to use the vector search tool will depend on the SDK used.
Conclusion
As you've seen, RAG is a powerful tool that can be used to generate precise and contextual responses and avoid issues like hallucinations, which are common when using LLMs on complex topics. By guiding and contextualising the questions sent to the LLM, it's possible to get responses that truly rely on your data.
There are many use cases. At Premier Octet, we have been able to implement RAGs on chatbot projects and response generation for organisations wishing to use the power of LLMs but adapt it to their own needs and data.
If you have any feedback on these topics, or if you would like to know more about RAG, feel free to contact us.
References
- Mistral OCR - Optical Character Recognition Model
- OpenAI - Embedding Model
- Pinecone - Vector Database
- pgvector - Vector Extension for PostgreSQL
- Vercel AI SDK - AI SDK
- LangChain - AI Framework