While starting to build My Virtual College Advisor, I violated one of my main rules, which I call Danny’s Law. Danny’s Law states that resilience is built by facing adversity, not by avoiding it. In building computer systems, this means that building observability into the system is better than attempting to bulletproof it. For a RAG AI system, this means building in evaluation systems. In my last post, I noted my lack of a fully evolved evaluation system to ensure that system accuracy and performance were well known. This prompted me to work on that implementation, which I detail here.

Having this evaluation system in place will allow me to determine the optimal cost vs. time vs. accuracy of the current system by giving me objective measures of results. Equally important, having this built in will allow me to later try different vector store indexes, rerankers, and LLMs and get objective comparisons of their results. Using an open-source model instead of OpenAI could save me a lot of money in running costs if I can prove the results are still accurate after the switch.

Note: There’s a temptation when writing these posts to just show the steps that led to the final results. While useful, that creates a survivorship-bias problem by hiding the paths that led nowhere. I’ve included a few of those dead ends because I think there’s value in seeing them. My approach is extremely Agile, meaning I have a general plan and build modular components. I expect to constantly revise all the components over time, and, as such, my first concern is whether they do the job correctly. I can always go back and refactor them for speed and cost later.

Metrics to Measure

My research on measuring RAG AI models shows how much thought has been put into this area. That said, universal standards don’t seem to exist. Some of the most common metrics I see are:

  • Precision: The fraction of relevant instances among the retrieved instances.
  • Recall: The fraction of relevant instances that were retrieved out of the total number of relevant instances.
  • F1 Score: The harmonic mean of precision and recall.
  • Mean Reciprocal Rank (MRR): The average of the reciprocal ranks of results for a sample of queries.
  • Normalized Discounted Cumulative Gain (NDCG): Measures the ranking quality of the results.

My implementation is a bit different because My Virtual College Advisor often has a large number of ‘correct’ answers for any query, but returning more than a few will overwhelm the user. For instance, if someone asks which schools have baseball scholarships, that list might number into the hundreds. Since I limit results to no more than five, the question is really whether the five chosen are good, not whether they are the very best five.

As such, precision is what I’m focused on with this analysis. I may well add others later.

In the future, I will add in filtered cases; for instance, if someone says they wish to go to a Division 1 school, I can pre-filter the results to only those cases. These pre-filters often involve agents, essentially functions I’ve set up to return specific data that the LLM calls as needed. The performance of these filters will be paramount, and I will tackle it after the basic searches have been evaluated fully.

Setup

At present, my RAG AI is running on MongoDB using an Atlas Vector Store of about a million documents. As with most vector stores, I indicate a number of candidates for it to select as well as a limit. To use an analogy, this is like going into a library with a million books in it and asking the librarian to pick out 100 for you that most fit the topic you are interested in. You then ask the librarian to look through those 100 and find the 20 best. This would be a number of candidates of 100 and a limit of 20.
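In Atlas Vector Search terms, the librarian analogy maps directly onto the `numCandidates` and `limit` fields of a `$vectorSearch` aggregation stage. Here is a minimal sketch; the index name, field names, and vector size are placeholder assumptions, not my actual schema:

```python
# Sketch of an Atlas Vector Search pipeline. "vector_index" and "embedding"
# are assumed names; substitute your own index and embedding field.
def build_vector_search_pipeline(query_vector, num_candidates=100, limit=20):
    """Ask the 'librarian' to skim num_candidates documents and hand back
    the best `limit` of them, with their similarity scores."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",          # assumed index name
                "path": "embedding",              # assumed embedding field
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # books the librarian skims
                "limit": limit,                   # books you walk out with
            }
        },
        {"$project": {"text": 1, "url": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

pipeline = build_vector_search_pipeline([0.1] * 1536, num_candidates=100, limit=20)
# results = collection.aggregate(pipeline)  # with a pymongo collection handle
```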

The MongoDB vector search is based on strict semantic similarity, which may or may not be good enough at retrieving the desired results. Without a good evaluation system, there’s no way to be sure.

Reranking: Sometimes with RAG AIs, after the 20 candidates are returned, they are resorted, or reranked, so the most relevant are on top. This would be like having the librarian go through your 20 books page by page and ordering them by which are most relevant for you. My hope is always that the vector store will give me results that are scored correctly, and reranking, along with the overhead it entails, won’t be needed. The way to know for sure is to test it!

The reranker I’m starting with is an LLM-based reranker leveraging GPT-4o. It is very slow, but it has the big advantages of being more flexible and of letting me observe why the LLM chooses what it chooses. In short, it’s great for the evaluation phase.

My plan is to test rerankers and find one that works as well but with faster speed and lower cost. However, as usual, my approach is to get things working and then refactor later. I don’t try to build the perfect beast from the beginning; evolution does a better job of it by focusing energy where it is most needed.
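The LLM-based reranking step can be sketched roughly as follows. This is a minimal illustration, not my production code: `call_llm` stands in for whatever GPT-4o chat-completion call you use, and the prompt wording is my own example.

```python
# A minimal sketch of an LLM-based reranker. `call_llm` is a placeholder
# for a real chat-completion call (e.g. GPT-4o).
def rerank(query, docs, top_k=5, call_llm=None):
    """Ask the LLM to order candidate documents by relevance to the query
    and keep the top_k. Slow, but easy to observe and debug."""
    numbered = "\n".join(f"{i}: {d}" for i, d in enumerate(docs))
    prompt = (
        f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
        f"Return the indices of the {top_k} most relevant documents, "
        "most relevant first, as a comma-separated list."
    )
    reply = call_llm(prompt)  # e.g. one GPT-4o round trip
    indices = [int(tok) for tok in reply.split(",")][:top_k]
    return [docs[i] for i in indices]

# Usage with a stand-in LLM that always answers "2,0,1":
fake_llm = lambda prompt: "2,0,1"
print(rerank("best rugby programs", ["a", "b", "c"], top_k=3, call_llm=fake_llm))
# → ['c', 'a', 'b']
```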

First Automated Run

I made several manual test runs where I had GPT-4 (not GPT-4o) evaluate the responses from My Virtual College Advisor. This assured me that GPT-4 was accurate at evaluating whether the responses matched the prompt. This was a minor exercise in prompt engineering to make sure GPT-4 was using the evaluation rubric I wanted.

Further, by feeding the RAG documents into GPT-4, I could have it assess whether the best documents were pulled or not.

GPT-4 was told to evaluate on a scale of 1 to 10. A score of one means the results had no relation to the prompt, and a score of ten means it was perfectly related. The following points could be awarded:

  1. 4 points: Relevance — Does the response properly answer the prompt? 4 points if the prompt is answered.
  2. 4 points: Detail — Does the response provide details? Up to 4 points for additional detail. For instance, instead of just stating that the college has an economics department, does it discuss some of the courses offered and professors’ accomplishments?
  3. 2 points: Inclusion of URLs — My Virtual College Advisor links applicants right to the web pages where things are mentioned, so the response correctly including URLs is very important. 1 point for a URL from the right school and 1 point for it being the exact right page.
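One way to encode that 10-point rubric into the evaluator’s prompt looks like the sketch below. The wording is illustrative, not my project’s actual prompt; the point is to state the criteria explicitly so the LLM cannot invent its own.

```python
# Illustrative rubric text; the exact wording is an assumption, not the
# project's real evaluation prompt.
RUBRIC = (
    "Score the response from 1 to 10 using only these criteria:\n"
    "- Relevance (up to 4 points): does the response properly answer the prompt?\n"
    "- Detail (up to 4 points): does the response provide supporting detail?\n"
    "- URLs (up to 2 points): 1 point for a URL from the right school,\n"
    "  1 more point if it is the exact right page.\n"
    "Do not score on any criteria other than these three."
)

def build_eval_prompt(user_query, response):
    """Combine the fixed rubric with the query/response pair under review."""
    return f"{RUBRIC}\n\nPrompt: {user_query}\n\nResponse:\n{response}"
```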

Anecdotally, I had noticed that having fewer than 1,000 candidates led to lower quality results. I wanted to test this objectively by giving it an official evaluation. My hypothesis was that the larger the number of candidates, the higher the evaluation score would be, but the slower the average query time.
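A sweep over candidate counts like this can be harnessed as sketched below; `run_query` is a stand-in for the real vector search call, and the timing is simple wall-clock averaging per count.

```python
import statistics
import time

# Sketch of the candidate-count sweep: run each count several times and
# average the wall-clock query time. `run_query` wraps the real search call.
def benchmark(run_query, candidate_counts, runs=10):
    avg_times = {}
    for n in candidate_counts:
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            run_query(num_candidates=n)  # the actual MongoDB query goes here
            samples.append(time.perf_counter() - start)
        avg_times[n] = statistics.mean(samples)
    return avg_times

# Usage with a no-op stand-in for the query:
timings = benchmark(lambda num_candidates: None,
                    [5, 10, 25, 50, 100, 250, 500, 750, 1000], runs=2)
```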

To test this, I made 10 runs at candidate counts of 5, 10, 25, 50, 100, 250, 500, 750, and 1,000 to see how the timing changed. For this first test, I left out the reranker, since part of the evaluation is whether the reranker is worth the extra compute time and complexity. A reminder that the evaluation score is computed only on the top five results returned. Here were the results:

This didn’t match my hypothesis at all! Average query time increased linearly as expected, but instead of rising with the number of candidates, the evaluation scores were all over the place.

So, I started digging into the data a little further and here’s what I found:

My evaluator was using different query strings to avoid back-end caching on MongoDB, which would make any reported query times overly optimistic. What I found, a bit to my surprise, is that my evaluation is much more affected by the query itself than by the number of candidates. It does a great job with answering questions about biology departments but is rather lousy at answering similar questions about art at colleges.
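The cache-busting scheme can be as simple as rotating through distinct query strings. The concrete queries below are my own examples, not the project’s actual test set:

```python
import itertools

# Rotate query strings so MongoDB's back-end caching can't flatter the
# reported query times. These example queries are illustrative.
QUERIES = itertools.cycle([
    "Which colleges have strong biology departments?",
    "Which colleges have strong art programs?",
    "Which schools offer baseball scholarships?",
])

def next_query():
    """Return the next query in the rotation, so consecutive runs never
    hit the cache with an identical string."""
    return next(QUERIES)
```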

Second Run

Given the past results, I decided to test the impact of the number of candidates on the evaluation using the same query every time so I could split out any effect of the query and just focus on the effect of having a larger number of candidates. This time, I didn’t worry about the average time. I prefer to make sure results are consistently right before optimizing for speed. Here’s what I got:

In short, no impact. Also, when this goes live, evaluation scores need to be consistently above 9, preferably 9.5, for them to be useful. These scores are far too low for a viable product. They are also much lower than expected from my manual testing.

Add in the Reranker

Since the reranker made a tremendous difference in manual testing, I decided to add it in here and see the results. It takes all of the results the vector store index returns and keeps what it thinks are the best 5.

The results were still awful! This prompted me to look at the raw data, particularly the evaluator’s inputs and outputs. Since I’m using an LLM evaluator, I can simply ask it to detail its reasoning. Of course, asking it to explain its reasoning can skew its results, but it is still a very informative exercise.

These were the results:

Detailed Breakdown: Score: 7/10

Relevance to Prompt (Rugby Programs) — 8/10

· The content is highly relevant as it lists comprehensive details about Regis University’s Men’s Rugby club, its achievements, and its context within the rugby landscape. However, it primarily focuses on one rugby program rather than a broader range of programs.

Content Depth and Useful Information — 7/10

· The results provide in-depth coverage of the rugby program, including history, achievements…

Visual and Aesthetic Appeal — 5/10

· The text alone does not imply any use of visuals. For higher engagement, including images or videos of the rugby team, matches, and campus would significantly increase appeal.

Overall, the results provide valuable and in-depth information about a specific rugby program that could greatly appeal to prospective students. However, improvements in structure, broader context, and visual elements could vastly enhance the user experience and comprehensiveness.

In short, the evaluation rubric it was using did not match, even closely, what I thought it was doing! For instance, visual and aesthetic appeal is not part of the criteria at all! Further investigation showed that the issue was introduced by adding a single character: when I changed the LLM engine from GPT-4 to GPT-4o, the behavior of my prompt changed. I had moved to GPT-4o for its lower cost and was mistaken to assume it would perform similarly to GPT-4. I have found GPT-4o to be a tremendous asset, but this did show that when you change something foundational, you must retest everything thoroughly.

Again, this shows the need for a comprehensive evaluation system built in from the very start.

Start Again

I spent a while working on the prompt until I felt it gave me similar results to what I had seen before. Here’s how it looked after my next run without reranking. This evaluation was done using ALL candidates, not just the top 5.

This data is much closer to what I expected! Funny how getting the prompt right can make a difference 😉.

It’s to be expected that having more candidates would give worse results here: if the semantic search is done well, nearly all of the first 50 or so items might be relevant, whereas far fewer of the first 1,000 would be.

I did some digging into the details, and the results that get low evaluations are items that are semantically similar but not similar enough. For instance, if I ask My Virtual College Advisor to find the best rugby programs, sometimes the semantic search will bring related things like “team”, “league”, and “football”. On occasion, especially when returning 1,000 items, some of those close-but-not-quite-rugby items are picked up.

So, I tried putting the top 20 from the vector store through reranking, with the top 3 picked. This time I did 10 runs of 10 and averaged the results.

This is good! Even in a dataset of 20, the top 3 are highly relevant.

Top 5

Choosing the top 5 with reranking isn’t quite as stellar, but it’s still pretty solid when 100 or more candidates are returned from MongoDB. Based on previous results, we can expect the first 3 to be right on, while the 4th and 5th are mostly OK.

Conclusion and Next Steps

Since a candidate count of 100, with a limit of 20 reranked down to the top 5, is much faster than higher candidate counts and still highly accurate, those are the values I’m going with. I’d prefer to give 5 results instead of 3, as long as I can do so quickly and they are accurate.

Overall, I’m pleasantly surprised. My manual testing led me to believe I’d need a much higher number of candidates, and I was looking at options to speed up processing. I will put those off for now, as they don’t appear to be on the critical path given these results.

Next Steps:

  1. I will rerun all of these tests, with a larger n, to make sure the results are consistent.
  2. Although I’m happy with the results, I’ve built systems long enough to know that stopping when you think everything looks good is a confirmation bias trap. In essence, you are saying, “this looks as I expected it to, so it must be right.” To combat this, I step through the code line by line and make sure every input and output is what I expect. This is in addition to automated tests.
  3. Testing performance with pre-filtering will be important to the final product.
  4. In another post, I want to talk more about how to set up evaluation prompts. Ideally, I’d also like to get some synthetic data going. Ultimately, that should be based on user input, so the testing mirrors what people are asking for.
  5. I will use this framework to test faster and less expensive rerankers. Eventually, I will explore other vector store indexes as well.

My big takeaway from this exercise is that it was a mistake not to build the evaluation component in from the very start. As Danny’s Law states, observability is key to a robust and well-performing system.


Discover more from Lowry On Leadership