13 May 2025
How Does a RAG Work?
9-minute read

Recent years have seen the democratisation of LLMs (Large Language Models), language models that generate text based on general-purpose data. Interest in them now extends far beyond the boundaries of tech and mathematics. Their applications are numerous and varied, ranging from text generation to natural language understanding and translation. In short, they are powerful tools that keep developing and improving, but they still suffer from certain limitations that we will address in this article.
Introduction
With the democratisation of these tools, many people have been able to experiment and use 'generic' language models. I specify 'generic' because these LLMs are trained on general data and are therefore not specialised in a specific field. This can quickly become a barrier when one wishes to generate precise and contextual responses. Organisations wishing to use LLMs for their needs have had to find ways to improve the quality of generated responses. Two solutions have emerged:
- Fine-Tuning on a generic model
- RAG (Retrieval-Augmented Generation)
In this article, we will briefly cover the drawbacks of the first solution, Fine-Tuning, then concentrate on the second, RAG: how it can be used to improve the quality of an LLM's responses and how to implement it. We will explain in detail how a RAG works and how to set one up in general terms. There are, however, many variations of this system, depending on the language models and/or platforms used. The idea is to understand the general principles and apply them to specific cases.
TL;DR
A RAG is a system that enriches our questions to an LLM with contextual information. To do this, it uses embeddings (or vector embeddings) of documents and a vector database to store these vectors. This involves setting up and filling the vector database with our documents beforehand. Then, when a question is sent to the LLM, it receives not only the original question but also the context of the documents closest to it, thanks to vector search (or similarity search).
Fine-Tuning
LLMs are language models trained on general data, which allows them to generate highly effective responses. However, it is possible to train them further on domain-specific data, enabling them to generate more precise and contextual responses. This can be seen as additional training of the model on specific data. It can be an effective solution, but it comes with downsides:
- Training can be time-consuming and resource-intensive
- Training is done on static data, so the model needs to be updated regularly to keep knowledge up to date
- Training data management can be complex and tedious
Clearly, this solution has its uses, as it allows for highly effective responses in certain applications. But here we will consider a Question/Answer-focused usage, which should answer a question in a precise and contextual manner. To do this, we will delve into and explore another solution: the RAG.
For specific needs, it is possible to combine both solutions by using a fine-tuned model within a RAG.
Retrieval-Augmented Generation
A RAG, or Retrieval-Augmented Generation, is a system that enriches our questions to an LLM with contextual information. To do this, it uses embeddings (or vector embeddings) of documents and a vector database to store these vectors. This involves setting up and filling the vector database with our documents beforehand. Then, when a question is sent to the LLM, it receives not only the original question but also the context of the documents closest to it, thanks to vector search (or similarity search).
How does a RAG work?
As mentioned earlier, an important part of a RAG is setting up and filling the vector database with our data. This step is crucial for the proper functioning of the RAG. We will return in detail to the different ways of creating these embeddings.
Once these embeddings are created, when a question is sent to the LLM, the question is vectorised and compared to the vectors of the documents in the vector database. This comparison retrieves the most relevant pieces of documents, which are then sent to the LLM along with the original question.
The LLM then generates a response based on the provided context.
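Put together, the whole flow can be sketched in a few lines of TypeScript. The helper names below (embedQuestion, searchSimilarChunks, callLLM) are hypothetical placeholders; concrete versions of these steps are implemented later in this article.

// Hypothetical placeholder signatures; concrete implementations follow later in the article.
declare function embedQuestion(question: string): Promise<number[]>;
declare function searchSimilarChunks(vector: number[], opts: { limit: number }): Promise<{ content: string }[]>;
declare function callLLM(input: { system: string; context: string; question: string }): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 1. Vectorise the question with the same embedding model used for the documents
  const questionVector = await embedQuestion(question);

  // 2. Retrieve the document chunks whose vectors are closest to the question vector
  const relevantChunks = await searchSimilarChunks(questionVector, { limit: 5 });

  // 3. Send the original question together with the retrieved context to the LLM
  return callLLM({
    system: "Answer using only the provided context.",
    context: relevantChunks.map((chunk) => chunk.content).join("\n---\n"),
    question,
  });
}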
Preliminary Work
Input Data
An important part of a RAG is the quality of the data used to create the embeddings. Indeed, the more relevant and contextual the data, the more precise the response generated by the LLM will be.
The origin and format of the input data can vary greatly: textual documents, images, PDFs, spreadsheets, or even data extracted directly from a database. Therefore, it's important to be able to process them uniformly and convert them into embeddings. The only constraint is that the data must be convertible into text.
In certain cases, it may be necessary to convert the data into text. For instance, if the data is images, it's possible to convert them into text using computer vision models, allowing for a textual description of the image's content. For PDFs, OCR (Optical Character Recognition) models can be used to extract the text, such as Mistral OCR.
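For illustration, and assuming you use the Vercel AI SDK (which we rely on later in this article) with a vision-capable model, turning an image into a textual description could look like the sketch below. The model name is only an example; any multimodal model can be plugged in the same way.

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { readFile } from "node:fs/promises";

// Ask a vision-capable model for a textual description of an image so that
// the result can be chunked and embedded like any other text.
export async function describeImage(path: string): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"), // example model, swap for the multimodal model of your choice
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Describe the content of this image in detail." },
          { type: "image", image: await readFile(path) },
        ],
      },
    ],
  });
  return text;
}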
Chunking
Chunking is the process of dividing our documents into smaller pieces for storage in our vector database. This step is crucial as it directly influences the performance of our RAG.
Size of Chunks
Chunk size is an important parameter to consider when implementing a RAG. One often has to experiment with various sizes and overlap levels to find the optimal configuration. Smaller chunks tend to allow for a more precise search as they contain less text padding that could dilute the semantic representation. This helps the RAG system to identify and extract relevant information more efficiently. However, this precision comes at a cost: smaller chunks increase processing time and resources needed.
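As a minimal sketch, a character-based splitter with a configurable overlap could look like this. The default sizes are arbitrary starting points to tune against your own data.

// Split a text into fixed-size chunks, each sharing `overlap` characters
// with the previous one so that context is preserved at the boundaries.
export function chunkByCharacters(text: string, chunkSize = 1000, overlap = 200): string[] {
  // Guard against configurations that would never advance
  const step = Math.max(chunkSize - overlap, 1);
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}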
Cutting Methods
Although the simplest method is to cut the text by character, other options exist depending on the use case and the document's structure:
- By tokens: to avoid exceeding token limits in API calls
- By sentences or paragraphs: to maintain chunk coherence
- By HTML headers or JSON properties: to respect document structure
- By meaningful code chunks: if you're working with code, it's often recommended to use an abstract syntax tree parser (AST)
Embeddings
Now that our data is chunked, we can convert it into embeddings, i.e. vectors that represent our data and capture its semantic meaning. To do this, we will use an embedding model, such as mistral-embed from Mistral or text-embedding-3-small from OpenAI.
There are indeed many programmatic ways to generate embeddings, but using a pre-trained embedding model allows us to generate better quality embeddings, as the model better grasps the meaning, intent, and context of a chunk. The generated vector will, therefore, be more relevant.
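As an example, here is what a generateEmbedding helper (used again in the vector search code further down) could look like with the Vercel AI SDK's embed and embedMany functions. The model is an assumption; mistral-embed or any other embedding model can be used in the same way.

import { embed, embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

// text-embedding-3-small is used here as an example embedding model
const embeddingModel = openai.embedding("text-embedding-3-small");

// Embed a single piece of text (e.g. the user's question)
export async function generateEmbedding(value: string): Promise<number[]> {
  const { embedding } = await embed({ model: embeddingModel, value });
  return embedding;
}

// Embed a batch of chunks when filling the vector database
export async function generateEmbeddings(chunks: string[]): Promise<number[][]> {
  const { embeddings } = await embedMany({ model: embeddingModel, values: chunks });
  return embeddings;
}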
Vector Database
The vector database is an essential tool for storing embeddings and retrieving them quickly. There are several solutions for implementing one: dedicated vector databases, or an existing piece of your software stack if it supports vectors. Among the most common solutions are Pinecone, Faiss, Annoy, HNSW and Milvus.
If your project already uses a PostgreSQL database, it's possible to use the pgvector extension to store and retrieve embeddings. This allows for a vector database natively integrated into your project, enabling the creation of links between your data and embeddings. This is very useful when wanting to retrieve information related to documents.
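As a sketch of the setup this implies, and assuming a Prisma client named db, the schema expected by the similarity query shown later could be created as follows. The vector dimension (1536) matches text-embedding-3-small and must match your own embedding model.

import { PrismaClient } from "@prisma/client";

const db = new PrismaClient();

// Enable the pgvector extension (one-off operation)
await db.$executeRaw`CREATE EXTENSION IF NOT EXISTS vector`;

// Store each chunk with its vector and a link back to the source document
await db.$executeRaw`
  CREATE TABLE IF NOT EXISTS embeddings (
    id          bigserial PRIMARY KEY,
    document_id bigint REFERENCES documents(id),
    content     text NOT NULL,
    embedding   vector(1536) NOT NULL
  )
`;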
Now we have our chunked data, embeddings and our vector database. We can now move on to interpreting questions and generating responses by creating a RAG Assistant.
Creating a RAG Assistant
This phase of using the RAG will depend heavily on the platform and/or framework you use. Some platforms provide a ready-made RAG assistant. Otherwise, you will need to create a RAG assistant in standalone mode by explicitly defining the necessary tools and how to use them. In either case, you will need to define the tools required to generate precise and contextual responses.
Defining Tools
By 'tool', we mean the different functions that will be used to retrieve data in the vector database and use it to generate precise and contextual responses.
Vector Search
Vector search is the function that will allow data retrieval in the vector database based on the question.
First, the question must be transformed into a vector using the same embedding model that we used to create our embeddings earlier. In this way, we can compare the question vector to all the document vectors present in the vector database using a vector search function, such as similarity with the PostgreSQL pgvector extension.
This vector search will allow us to obtain the documents closest to the question, i.e., the documents that most resemble the question.
For the example, we will use the similarity function with the pgvector extension of PostgreSQL.
export const findRelevantContent = async (userQuery: string) => {
  // Vectorise the user's question with the same embedding model as the documents
  const embedding = await generateEmbedding(userQuery);
  const vectorQuery = `[${embedding.join(",")}]`;

  // Similarity search with pgvector: `<=>` is the cosine distance operator,
  // so `1 - distance` gives a similarity score between 0 and 1
  const embeddings = await db.$queryRaw`
    SELECT
      embeddings.id,
      embeddings.content,
      documents.name,
      1 - (embedding <=> ${vectorQuery}::vector) AS similarity
    FROM embeddings
    INNER JOIN documents ON embeddings.document_id = documents.id
    WHERE 1 - (embedding <=> ${vectorQuery}::vector) > 0.5
    ORDER BY similarity DESC
    LIMIT 5;
  `;

  return embeddings;
};
Here we limit the search to 5 chunks with a similarity greater than 0.5. This value is arbitrary and can be adjusted based on the relevance of the results.
Prompt
The prompt is the text that will be sent to the LLM to generate a response. It will contain instructions for the LLM to generate a relevant response by telling it to use the vector search tool to retrieve the documents closest to the question.
This is where you will also need to tell the assistant how it should use the document's context.
For the example, we will use Vercel's AI SDK to generate a response, telling it to use the vector search tool to retrieve the documents closest to the question.
import { streamText, tool, convertToCoreMessages } from "ai";
import { z } from "zod";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = await streamText({
    // `model` is the chat model configured for your project (OpenAI, Mistral, etc.)
    model: model,
    messages: convertToCoreMessages(messages),
    system: `You are a helpful assistant. Check your knowledge base before answering any questions.
    Only respond to questions using information from tool calls.
    If no relevant information is found in the tool calls, respond, "Sorry, I don't know."`,
    tools: {
      getInformation: tool({
        description: `get information from your knowledge base to answer questions.`,
        parameters: z.object({
          question: z.string().describe("the users question"),
        }),
        // Delegate the retrieval to the vector search function defined earlier
        execute: async ({ question }) => findRelevantContent(question),
      }),
    },
    toolChoice: "required",
  });

  return result.toAIStreamResponse();
}
How you force the assistant to use the vector search tool will depend on the SDK used.
Conclusion
As you've seen, RAG is a powerful tool that can be used to generate precise and contextual responses and avoid issues like hallucinations, which are common when using LLMs on complex topics. By guiding and contextualising the questions sent to the LLM, it's possible to get responses that truly rely on your data.
There are many use cases. At Premier Octet, we have been able to implement RAGs on chatbot projects and response generation for organisations wishing to use the power of LLMs but adapt it to their own needs and data.
If you have any feedback on these topics, or if you would like to know more about RAG, feel free to contact us.
References
- Mistral OCR - Optical Character Recognition Model
- OpenAI - Embedding Model
- Pinecone - Vector Database
- pgvector - Vector Extension for PostgreSQL
- Vercel AI SDK - AI SDK
- LangChain - AI Framework