Retrieval Augmented Generation (RAG): what exactly is that?

Before we dive in, here are some terms that are useful to know.

A model is a piece of software trained on a set of data to recognize certain patterns. A model can generate text or images as output, or predict a simple numerical value.

Some examples of AI models:

DALL-E: model that generates images.
Whisper: model that converts audio into text.
Embedding: model that converts text into numerical form.
GPT: set of models that can generate natural language.

A token is a piece of text. For example, the text "Lorem ipsum dolor sit amet" can be split into several tokens, such as "sit", "dol", "or", "orem" and so on. The exact split depends on the tokenizer the model uses.
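To make the idea of tokens a bit more concrete, here is a minimal sketch of a toy tokenizer. It is an illustration only: the vocabulary `TOY_VOCAB` and the function `tokenize` are made up for this example, and real models use tokenizers with vocabularies learned from large amounts of data, so their splits will differ.

```python
# Toy sub-word tokenizer: greedy longest match against a tiny, made-up vocabulary.
# Assumption: this vocabulary is hypothetical; real tokenizers learn theirs from data.
TOY_VOCAB = {"L", "orem", " ", "ip", "sum", "dol", "or", "sit", "amet"}


def tokenize(text, vocab):
    """Split text into the longest pieces found in vocab, scanning left to right."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


text = "Lorem ipsum dolor sit amet"
tokens = tokenize(text, TOY_VOCAB)
print(tokens)
```

Joining the tokens back together gives the original text, which is the property every tokenizer has to preserve.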

A prompt is an instruction given to an AI model to generate some type of output. A prompt can consist of several components. Common ones are: "format" (how the output should be returned), "reference" (whether the AI may rely on existing works), "request" (what exactly should happen) and "framing" (background information).

Did you know?

Did you know that creating or fine-tuning prompts is called "prompt engineering"?
Did you also know that there are plenty of cheat sheets for writing smart prompts? A prompt engineering cheat sheet example.

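The temperature is a setting that controls how random the output of a model is, typically on a scale from 0 (nearly deterministic) to 1 (more random); some providers also accept higher values. We usually set it on the low side in our RAG applications because we are not very fond of randomness.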
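The effect of temperature can be illustrated with a small sketch. It assumes the common approach of dividing the model's scores by the temperature before turning them into probabilities (a softmax); the scores below are hypothetical, and a temperature of exactly 0 is treated by providers as "always pick the most likely token", which this function does not cover.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities. Assumption: temperature > 0."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(scores, 0.1)
high = softmax_with_temperature(scores, 1.0)

print("temperature 0.1:", [round(p, 3) for p in low])
print("temperature 1.0:", [round(p, 3) for p in high])
```

At the low temperature almost all probability goes to the top candidate, so the output is close to deterministic. At the higher temperature the other candidates get a real chance of being picked.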

embedding visualisation (distances between words)

Some AI models take text as input and return a vector of numbers as output. We call these vectors embeddings, and the models that produce them embedding models. Because this numeric representation captures meaning rather than exact wording, a search no longer depends on specific words. In other words, this eliminates the need to use explicit search terms.

We then store the created embeddings in a vector database. These databases are designed to store and search vectors rather than plain text. All the content of the site can be indexed in such a vector database, so that a later query can find related articles.
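How "distance between words" works can be shown with a short sketch. The three-dimensional vectors below are hand-made for illustration and are not the output of any real embedding model (real embeddings have hundreds or thousands of dimensions). Cosine similarity is one common way to compare such vectors; it is an assumption here, not necessarily the measure used in our setup.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made embeddings for illustration only.
toy_embeddings = {
    "bicycle": [0.9, 0.1, 0.0],
    "bike": [0.85, 0.15, 0.05],
    "invoice": [0.05, 0.1, 0.95],
}

close = cosine_similarity(toy_embeddings["bicycle"], toy_embeddings["bike"])
far = cosine_similarity(toy_embeddings["bicycle"], toy_embeddings["invoice"])
print(f"bicycle vs bike:    {close:.3f}")
print(f"bicycle vs invoice: {far:.3f}")
```

Words with a similar meaning end up with a high similarity even though they are spelled differently, which is what makes searching without exact search terms possible.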

A large language model (LLM) is a type of AI model that can recognize and generate text. LLMs are trained on very large datasets (hence "Large").
Some examples of LLMs:

ChatGPT
Claude
Ernie
Llama
StableLM

You can compare the best LLMs for yourself here: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

We want to query our own content. That way, any modification the editors make to the content is immediately used to generate correct answers.

The moment an editor makes a change in the content, we send the text (this can be text from the CMS or text from a PDF file) to an embedding model.
For example, we use the Mistral or OpenAI embedding model.
This model turns the text into a vector, which is then stored in the vector database.
This way, every piece of content we want to query has a vector in our vector database almost immediately.
In the diagram, this is steps 1 to 4.
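The indexing flow can be sketched as follows. This is a simplified illustration under clear assumptions: `toy_embed` is a stand-in that hashes words into a small vector (it is not the Mistral or OpenAI embedding model and does not capture meaning), and `ToyVectorStore` is a hypothetical in-memory replacement for a real vector database.

```python
import hashlib
import math

DIMENSIONS = 16  # assumption: tiny vector size, chosen only for this example


def toy_embed(text):
    """Stand-in for an embedding model: hashed bag of words, scaled to length 1."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % DIMENSIONS] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.records = {}

    def upsert(self, article_id, text):
        # Insert a new vector, or overwrite the existing one for this article.
        self.records[article_id] = {"vector": toy_embed(text), "text": text}


store = ToyVectorStore()
store.upsert("article-1", "How to reset your password")
# An editor saves a change: the same article is embedded and stored again.
store.upsert("article-1", "How to reset or change your password")
print(len(store.records), "article(s) indexed")
```

Because the store is keyed on the article, saving a change replaces the old vector instead of adding a second one, so the index always reflects the latest version of the content.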

Suppose I am visiting the site as an anonymous user and I have a pressing question. I can't find the answer on the site, or at least not immediately without a complex combination of search terms.

In the question field, I enter my question.
The embedding model translates this question into a vector.

Then, using this vector, we ask the vector database: "which vectors are close to this one?" We choose a vector database because it is built to answer exactly this kind of nearest-neighbour query efficiently.
We also give this query a limit, which we call K. K should be neither too low nor too high. Setting K higher does not automatically improve the results: beyond a certain point, less relevant articles are included and the quality of the answer tends to drop. This seems counterintuitive, yet it is so. We'll tune K for your situation on a case-by-case basis.

The vector database returns a list of articles whose content is related to your question.
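The "which vectors are around here" query with a limit K can be sketched like this. The article ids and vectors are hand-made for illustration, the query vector stands in for the embedded user question, and the brute-force scan in `top_k` is an assumption made for clarity: real vector databases use specialised indexes to do this quickly.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k):
    """Return the k (similarity, article_id) pairs closest to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), article_id)
        for article_id, vector in records.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical stored vectors, one per article.
records = {
    "reset-password": [0.9, 0.1, 0.0],
    "change-email": [0.7, 0.3, 0.1],
    "opening-hours": [0.0, 0.2, 0.9],
    "pricing": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "I forgot my password"

for similarity, article_id in top_k(query_vector, records, k=2):
    print(f"{article_id}: {similarity:.3f}")
```

With K set to 2, only the two account-related articles come back. With K set to 4, the unrelated "opening-hours" article would be passed along as well, which shows how a K that is too high can water down the material the answer is based on.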

In the diagram, this is steps 5 & 6.

Now we create a prompt behind the scenes so that the large language model (LLM) can generate a response.

We'll create a prompt that looks something like this:
1. Given these K articles,
2. Given this user question,
3. Given our tone of voice and house rules,
Generate an answer to the question.
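Assembling such a prompt can be sketched as below. The function name `build_prompt`, the exact wording of the prompt and the example texts are assumptions for illustration; the real prompt is tuned per project.

```python
def build_prompt(articles, question, house_rules):
    """Combine retrieved articles, the user question and house rules into one prompt."""
    numbered_articles = "\n\n".join(
        f"[Article {number}]\n{text}"
        for number, text in enumerate(articles, start=1)
    )
    return (
        f"Use only the following {len(articles)} articles to answer.\n\n"
        f"{numbered_articles}\n\n"
        f"User question: {question}\n\n"
        f"Tone of voice and house rules: {house_rules}\n\n"
        "Generate an answer to the question."
    )


# Hypothetical example input.
articles = [
    "To reset your password, click 'Forgot password' on the login page.",
    "You can change your email address in your account settings.",
]
prompt = build_prompt(
    articles,
    question="I forgot my password, what now?",
    house_rules="Friendly, concise, no jargon.",
)
print(prompt)
```

The resulting string is what gets sent to the language model, so the model answers from the retrieved articles rather than from its general training data alone.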

End result: The user gets an answer to their question, based on the content from our knowledge base.

Different language models have different prices. Fortunately, people have already put them side by side. You can compare the prices here.

We can also estimate how much using the AI would cost, based on the number of articles on your site. We have made an Excel sheet for this purpose. Please contact Frederik Wouters to have it forwarded to you. Be sure to include "AI RAG price excell" in your email!
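As a rough idea of how such an estimate can be built up, here is a small sketch of the arithmetic. It is not the Excel sheet mentioned above, and every number in the example is a placeholder, not a real price or a real measurement: fill in the prices of the model you choose and the figures for your own site.

```python
def indexing_cost(article_count, tokens_per_article, embedding_price_per_million):
    """One-off cost of embedding all articles."""
    total_tokens = article_count * tokens_per_article
    return total_tokens / 1_000_000 * embedding_price_per_million


def query_cost(k, tokens_per_article, question_tokens, answer_tokens,
               input_price_per_million, output_price_per_million):
    """Cost of one answered question: K articles plus the question in, an answer out."""
    input_tokens = k * tokens_per_article + question_tokens
    return (
        input_tokens * input_price_per_million
        + answer_tokens * output_price_per_million
    ) / 1_000_000


# Placeholder values for illustration only (not real prices).
one_off = indexing_cost(
    article_count=1000, tokens_per_article=800, embedding_price_per_million=0.10
)
per_question = query_cost(
    k=5, tokens_per_article=800, question_tokens=50, answer_tokens=300,
    input_price_per_million=1.0, output_price_per_million=2.0,
)
print(f"indexing: {one_off:.2f}, per question: {per_question:.5f}")
```

Note that K appears directly in the cost per question: every extra article passed to the language model adds input tokens, so a higher K is also a more expensive K.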

