Imagine you have a magic box. It looks ordinary, but you can put anything into it, ask questions about it, and get good answers back. Put a watch in it, and it will tell you the exact type of watch, its value, and even the occasions when wearing such a watch would be appropriate. Pretty amazing, right? Well, RAG AI (which stands for Retrieval-Augmented Generation Artificial Intelligence) is the same thing, just for your documents or data.

There are a few limitations: the box is pretty small. Big enough for a watch, but not anything bigger than, say, a soda can. Also, the answers magically appear inside the box, so you have to leave enough room for them. The less room you leave, the shorter the answer you can get. RAG AI, at its core, is a way to get answers from the box even when the item you care about is too big to fit inside it.

Too Small a Box?

There are a number of clever solutions to get around the small size of the box¹:

1. Find a Bigger Box

2. Slice things into pieces and have the box look at the slices

3. Describe the item in written detail and put the detail in the box

4. Find the most relevant pieces and put just those pieces into the box (This is most analogous to RAG AI)

The Other Limiting Factor: Capability of the Underlying AI

Before explaining those four potential solutions further, there’s another important limitation to discuss: the capabilities of the underlying AI. In short, how much information can you get back from the box, how accurate is it and at what cost?

There are many AI models out there, ranging from free open-source models such as Llama 2 to state-of-the-art models like Claude 3.5 Sonnet and OpenAI’s GPT-4o. Hugging Face hosts over 400,000 models, most of which are specialized in one particular field such as object detection, image classification, or translation.

GPT-4o has been likened to the level of a “smart high schooler” with broad general knowledge. Other models are often less capable, except in their particular specialty area where they excel.

Even the best current model has limitations about what it can tell you about what you put in its Magic Box. We will talk more about this later.

Find a Bigger Box

One solution to the size of the box would simply be to find a bigger box. GPT-4o currently has a maximum size of 8,192² tokens, which is approximately equal to 8,192 words³. This is about the size of 25 pages from a standard-sized paperback novel. In contrast, Claude 3.5 Sonnet, a highly capable state-of-the-art model, has a maximum size more than 12 times as large, close to the equivalent of 300 pages. Gemini 1.5 promises a maximum size equivalent to some 3,000 pages.

This is pretty big. You could pass the entirety of the Fitzgerald novel The Great Gatsby into Claude 3.5 Sonnet and get answers back. You could put all of the Harry Potter novels into Gemini 1.5 and get good answers back. Moreover, as new models come out, they tend to have larger context windows and maximum token lengths. In short, they have a bigger box to work with.
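As a rough back-of-the-envelope check, you can estimate whether a document fits a given box using the roughly 1.3-tokens-per-word ratio mentioned in the footnotes. This is only a sketch; the window sizes and the answer-room reserve are illustrative assumptions, and a real tokenizer (such as OpenAI's tiktoken) would give exact counts:

```python
# Rough check of whether a document fits a model's "box" (context window).
# The 1.3 tokens-per-word ratio comes from footnote 3; the window sizes
# passed in below are illustrative, not authoritative per-model figures.

TOKENS_PER_WORD = 1.3  # approximation; real tokenizers give exact counts

def estimated_tokens(text: str) -> int:
    """Approximate the token count from the whitespace word count."""
    return int(len(text.split()) * TOKENS_PER_WORD)

def fits_in_window(text: str, window_tokens: int,
                   reserved_for_answer: int = 1024) -> bool:
    """Leave enough room in the box for the answer to appear."""
    return estimated_tokens(text) + reserved_for_answer <= window_tokens

doc = "word " * 50_000  # ~50,000 words, roughly a 200-page novel
print(fits_in_window(doc, window_tokens=8_192))    # False: too big for an 8K box
print(fits_in_window(doc, window_tokens=200_000))  # True: fits a far bigger box
```

Note the `reserved_for_answer` argument: since the answer materializes inside the same box, you must budget for it up front.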

A bigger box is a great answer, assuming the box is good enough to give you the type of answers you want and not too expensive. Unfortunately, this is often not the case. For instance, GPT-4o was great at understanding how I wanted my Beauty Insights finder to format the output, embedding video links in the appropriate part of the narrative. To do this with other models, I would have needed to write extensive and error-prone code; GPT-4o just handles it naturally (most of the time).

The ideal solution remains: get a magic box that can handle your task and that’s big enough, put everything you want into it, and let the magic box figure out what’s there and what to do with it. It’s quite possible that, in time, models will have progressed in size, capability, and cost to the point that this is no longer an issue. For now, however, other methods are often needed.

Slice things into pieces and have the box look at the slices

Another option is to slice things into pieces and have the magic box examine each piece. So, instead of putting your car, which is too big to fit, in the box, you’d take the car apart and have the box examine each of the pieces.

This can be especially useful for things where each of the pieces matters. Sometimes this is even preferable to putting the entire thing in the box. In the car example, it might be useful to know about the wheels, the bumper, and the windshield separately.

The problem with slicing things up to put in the box is how big to make the slices⁴. You could make them all equal-sized, say six inches square. Or you could divide them by function, the steering wheel here, the brake pads there, etc.
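The two slicing strategies can be sketched in a few lines. This is a minimal illustration, not a production chunker; real systems typically slice by tokens rather than words, often overlap adjacent slices, and the chapter marker below is just a stand-in for whatever natural boundaries your data has:

```python
# A minimal sketch of the two slicing strategies described above:
# equal-sized slices versus slicing along natural boundaries (here, chapters).

def slice_fixed(text: str, words_per_slice: int = 200) -> list[str]:
    """Equal-sized slices: every slice gets the same word budget."""
    words = text.split()
    return [" ".join(words[i:i + words_per_slice])
            for i in range(0, len(words), words_per_slice)]

def slice_by_chapter(text: str, marker: str = "Chapter") -> list[str]:
    """Slices along natural boundaries, like taking the car apart by function."""
    parts = [p.strip() for p in text.split(marker) if p.strip()]
    return [f"{marker} {p}" for p in parts]

book = ("Chapter 1 Nick moves east. "
        "Chapter 2 The valley of ashes. "
        "Chapter 3 Gatsby's party.")
print(len(slice_by_chapter(book)))  # 3 slices, one per chapter
```

As the next paragraphs note, the two strategies can yield different downstream answers, because the box sees each slice in isolation.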

Moreover, all of this slicing adds complexity. You have to keep track of all the slices and instead of one answer, you now get one answer per slice. Then you have to take those answers and decide what they mean. You could take those individual answers, put them in the box, and ask for a summary.

Don’t underestimate the complexity this adds. For instance, if you sliced up The Great Gatsby and put the slices in the box, you might get different results if you put them in a chapter at a time versus a set number of pages at a time. Hopefully, the difference wouldn’t be too much, but since it looks at each slice separately and without knowledge of the others, there’s a real potential for the summaries to be different.

The other factor is cost: because every slice must be processed, slicing up the entire thing adds up quickly. As a consequence, I am not a fan of simple slicing except in rare cases.

Describe the item in written detail and put the detail in the box

Another alternative is to dispense with the slicing altogether and instead create a detailed description of the item and put that in the box. For instance, instead of putting the entire text of The Great Gatsby in the box, you could create a detailed summary and put that in instead.

If you already have a good summary together, this is often a great alternative as it is quick and much less expensive. The results are only as good as the detailed summary is, which, by definition, is never as complete as the real item but is usually good enough for most purposes.

For instance, putting a summary of The Great Gatsby in the box is fine if you have general questions, say about themes or characters. It is insufficient if you have more detailed questions about interactions between characters or scenes.

In short, if the summary is detailed enough to cover the sorts of questions you want to ask, it’s a good way to go.

Find the most relevant pieces and put just those pieces into the box

We are finally getting to the definition of RAG AI. Instead of slicing up all the pieces, you can instead first decide what the most relevant pieces are based on the query and then take those and put them in the box.

In the car example, if someone asked about brake pads, you would first do a quick look over the car, gather those things that are related to brake pads, and just put those into the box. If the next person asked about the steering wheel, then you’d put just the steering wheel in the box.

The advantage is that this is much quicker and less expensive than slicing up the entire car. It’s also more likely to get the correct answer as only the items in question are put in the box⁵. Quicker, less expensive, and more accurate is a winning combination!
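A toy version of this "find the relevant pieces" step can be sketched with simple word overlap standing in for the relevance score. Real RAG systems use vector embeddings and semantic search for this step (which is exactly what makes them hard to get right), so treat this purely as an illustration of the flow:

```python
# A minimal sketch of the retrieval step in RAG: score each piece against
# the query, then put only the top-scoring pieces into the box.
# Word overlap is a deliberately crude stand-in for semantic relevance.

def relevance(query: str, piece: str) -> int:
    """Count how many query words appear in the piece."""
    return len(set(query.lower().split()) & set(piece.lower().split()))

def retrieve(query: str, pieces: list[str], top_k: int = 2) -> list[str]:
    """Pick only the top_k most relevant pieces to put in the box."""
    ranked = sorted(pieces, key=lambda p: relevance(query, p), reverse=True)
    return ranked[:top_k]

car_parts = [
    "brake pads worn on the front axle",
    "steering wheel with leather trim",
    "windshield has a small chip",
]
print(retrieve("how do I replace the brake pads", car_parts, top_k=1))
# ['brake pads worn on the front axle']
```

Notice that word overlap would fail the "dress for a camping trip" example below: it cannot tell a verb from a noun, which is precisely why real systems reach for semantic search.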

The disadvantage is that you have to accurately determine which parts are most relevant. This can be a challenging task. For instance, with my AI app Beauty Insights, if you say “How should I dress for a camping trip,” it has to know that in this case ‘dress’ is a verb and not the noun, lest it give you advice for picking out dresses!

Most things you hear in regards to RAG AI, vector indexes, semantic search, etc., are all attempts to tackle this problem in the quickest, least expensive, most accurate way possible.

If your needs are straightforward, or the people you work with are expert enough in RAG, then it is an excellent way to go: quick, inexpensive, and accurate. However, it all depends on the ability to accurately determine which parts are most relevant and should go into the box. This is complex, nuanced work, and many large and expensive implementations have failed because they underestimated the challenges. If you are going to go this route, make sure to use an expert. Don’t take their word for it: use systems they’ve built before and test them. Understand how yours might be different and what unique challenges you might face. When it works, it’s amazing, but it often doesn’t. I like to tell people that “RAG AI is as difficult as rocket science”; it’s unfortunate how few believe me until their projects fall apart.

Conclusion

RAG AI, or Retrieval-Augmented Generation AI, is a way to combine the vast knowledge of the “magic box” of AI large language models with your own documents or data. While the magic box has limitations, such as size and the capability of the underlying AI, there are innovative ways to maximize its potential. From finding bigger boxes and slicing items into manageable pieces to creating detailed summaries and focusing on relevant parts, these strategies help overcome the constraints.

As this technology continues to evolve, we can expect to see even more innovative applications across various industries, from customer service and content creation to data analysis and decision-making support. The future of AI is not just about having access to vast amounts of information, but about being able to quickly and accurately retrieve and apply the most relevant pieces of that information to solve real-world problems.

It might soon be that the magic box is so big that RAG AI is irrelevant, but in the meantime, if you are undertaking a RAG AI implementation, make sure to carefully scrutinize the work the builder has done before. These are highly complex implementations which software engineers often underestimate, and a 75%+ failure rate is common. The potential benefits are life-changing for any company, but don’t underestimate the complexities of RAG AI.

Footnotes:

1. There are also a number of more advanced solutions such as multi-step reasoning and query expansion. There’s also fine-tuning a model, which is preferable in high-volume tasks.

2. While GPT-4o has a context window of 128K tokens, it has a maximum token size of 8,192, meaning that each interaction can only be 8,192 tokens total, but it will ‘remember’ past interactions up to 128K.

3. It’s actually closer to 1.3 tokens per word on average, but for this discussion 1-to-1 is close enough.

4. The great thing about the computer world is that anything can be copied and sliced up without disrupting the original. Therefore, using a car is not a perfect analogy, but you get the idea.

5. Signal to noise ratio is a big deal in AI queries. The more noise that can be removed from the system the better. Anything not related to the given query is noise.


Discover more from Lowry On Leadership
