Today we explore the impact of using agents on the performance of our RAG AI system. As discussed in my previous post, agents in My Virtual College Advisor are used to determine criteria for filtering our extensive database of over 1,200,000 college and university documents. These criteria include factors such as location, median SAT and ACT scores, religious affiliation, NCAA sports division, tuition costs, and distance from a specified point.
Process Overview
- Criteria Definition: Agents determine the filtering criteria from the query. Typically, only part of the query maps to discrete criteria.
- Semantic Search: Using whatever part of the query was not consumed by the discrete criteria, we perform a semantic search over the documents that meet those criteria.
- Reranking: We may re-rank the results to improve relevance.
- Query Answering: Finally, the RAG AI answers questions based on the retrieved documents.
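The four steps above can be sketched as a single pipeline. This is a minimal illustration, not the actual implementation: the function names (`extract_criteria`, `semantic_search`, `rerank`, `answer`) are hypothetical stand-ins passed in as callables.

```python
def run_pipeline(query, extract_criteria, semantic_search, rerank, answer,
                 use_reranker=True, limit=20):
    # 1. Criteria definition: the agent pulls discrete filters
    #    (state, tuition, etc.) out of the query, leaving a remainder.
    criteria, remainder = extract_criteria(query)

    # 2. Semantic search: only the leftover, non-discrete part of the
    #    query is searched, and only against documents passing the filter.
    candidates = semantic_search(remainder, filters=criteria)

    # 3. Optional reranking of the candidate set for relevance.
    if use_reranker:
        candidates = rerank(remainder, candidates)

    # 4. The RAG step answers from the top documents.
    return answer(query, candidates[:limit])
```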
Note: Caching was turned off wherever possible, and where it could not be disabled I attempted to mitigate its effects by varying queries randomly.
Danny’s Law and Performance
The hypothesis is that agents, by extracting discrete criteria, will allow us to run the semantic search over fewer documents. For instance, we could search only documents from schools in Tennessee instead of all documents. By focusing on fewer documents, we aim to speed up query times. However, using agents consumes resources. The goal is to determine whether the time and resource investment in agents is justified by improved performance and relevance of results.
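To make the Tennessee example concrete, here is a sketch of turning agent-extracted criteria into a metadata pre-filter. The field names (`state`, `tuition`, `sat_median`) and the MongoDB-style operators are assumptions about the schema, not the project's actual fields.

```python
def build_filter(criteria):
    """Translate agent-extracted criteria into a MongoDB-style filter.

    Field names here are illustrative; the real document schema
    may use different keys.
    """
    f = {}
    if "state" in criteria:
        f["state"] = criteria["state"]          # exact match, e.g. "TN"
    if "max_tuition" in criteria:
        f["tuition"] = {"$lte": criteria["max_tuition"]}
    if "min_sat" in criteria:
        f["sat_median"] = {"$gte": criteria["min_sat"]}
    return f
```

The semantic search then runs only over documents matching this filter, which is the whole point of the agent step.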
Danny’s Law suggests that trying to build perfect systems through upfront overengineering is bound to fail. Instead, an iterative approach with a focus on system observability can build reliable, high-performing systems cost-effectively. Overengineering to prevent failure often incurs hidden costs, making Agile methodologies more effective in producing significant results.
Testing the Components
The system we are testing today consists of four main components:
- Database: We use MongoDB to store school data.
- Vector Store: For quick semantic searches, we use Atlas Vector Search.
- Reranker: We use LLMRerank to refine search results.
- Agents: Agents help gather criteria to filter the search, detailed here.
Observability is crucial, and we aim to test each component’s cost, speed, and accuracy. Our modular system, built using llamaindex, provides that observability and will eventually allow quick, easy component swaps, each of which can then be tested for accuracy and performance.
Here, we are testing with 5, 10, 25, 50, 100, 250, 500, 750, and 1,000 candidates, always limiting results to either the candidate size or 20, whichever is less. To understand what “candidates” and “limit” are, an analogy is helpful. It is like going into a library with a million books and asking the librarian to pick out the 100 that best fit the topic you are interested in. You then ask the librarian to look through those 100 and find the 20 best. That corresponds to 100 candidates and a limit of 20.
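The librarian analogy maps directly onto two parameters of the retrieval step. Here is a toy version, using a plain dot product as a stand-in for whatever similarity metric the vector store actually uses:

```python
def retrieve(query_vec, doc_vecs, n_candidates=100, limit=20):
    # Score every document against the query; a real vector store
    # does this with an approximate-nearest-neighbor index, not a scan.
    scored = sorted(
        range(len(doc_vecs)),
        key=lambda i: sum(q * d for q, d in zip(query_vec, doc_vecs[i])),
        reverse=True,
    )
    # "candidates": the librarian's first pass over the whole library.
    candidates = scored[:n_candidates]
    # "limit": the best few kept from the candidate pile
    # (capped at the candidate count, as in the tests above).
    return candidates[:min(limit, n_candidates)]
```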
Evaluation Process
The evaluator scores the results based on relevance to the original query. For instance, for a query about “good economics programs in Alabama schools,” the evaluator rates the results on a scale from 1 (no relevance) to 10 (fully relevant). This rating uses a detailed rubric to ensure consistency.
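As a sketch of what the evaluator call might look like, here is a condensed rubric prompt. The wording is hypothetical; the actual rubric is more detailed than this.

```python
# A condensed, illustrative rubric -- the real one is more detailed.
RUBRIC = """Rate the retrieved results for relevance to the query
on a scale of 1 to 10:
  1  = no relevance
  10 = fully relevant"""

def build_eval_prompt(query, results):
    """Assemble the prompt sent to the evaluator LLM."""
    docs = "\n".join(f"- {r}" for r in results)
    return f"{RUBRIC}\n\nQuery: {query}\n\nResults:\n{docs}\n\nScore (1-10):"
```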
We run twenty different queries through each configuration, with nine candidate numbers tested for each combination of:
- With agents, no reranking
- With agents and reranking
- Without agents, no reranking
- Without agents, with reranking
This results in a total of 7,200 runs.
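For reference, the configuration grid can be enumerated directly. Twenty queries over these 36 agent/reranker/candidate combinations gives 720 distinct query-configuration pairs, so a 7,200-run total suggests each pair was run multiple times.

```python
from itertools import product

agent_options = (True, False)        # with / without agents
rerank_options = (True, False)       # with / without reranking
candidate_counts = (5, 10, 25, 50, 100, 250, 500, 750, 1000)

# 2 agent settings x 2 rerank settings x 9 candidate counts = 36 configs
configs = list(product(agent_options, rerank_options, candidate_counts))

n_queries = 20
distinct_pairs = n_queries * len(configs)  # 720 query-configuration pairs
```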
Agents and Filtering
A few notes on agents:
- If we don’t use agents, we still need another method to filter based on user criteria. Since no such method is included here, the non-agent run times are underestimates.
- I am using an LLM to parse the criteria to feed the agents. A Natural Language Processing (NLP) function would likely be faster.
- Careful monitoring of agent performance will tell us how much rewriting the agents for speed would help.
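As a sketch of the faster non-LLM parsing mentioned above, a crude keyword-and-regex parser can pull out some criteria without any model call. The state list and patterns here are illustrative only, and real coverage would be much harder to get right:

```python
import re

# Truncated for illustration; a real version would cover all states.
STATES = {"tennessee": "TN", "alabama": "AL"}

def parse_criteria_fast(query):
    """Extract discrete criteria with keywords/regex instead of an LLM.

    Far faster than an LLM call, at the cost of coverage and nuance.
    """
    criteria = {}
    q = query.lower()
    for name, code in STATES.items():
        if name in q:
            criteria["state"] = code
    # e.g. "under $30,000" -> max_tuition = 30000
    m = re.search(r"under \$?([\d,]+)", q)
    if m:
        criteria["max_tuition"] = int(m.group(1).replace(",", ""))
    return criteria
```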
Challenges with Evaluators
Metrics can mislead if not properly understood. Ensuring that the evaluator measures what we intend is crucial. It’s essential to review evaluator results thoroughly to avoid misinterpretation. I will discuss these challenges in more detail in a future post.
Reranker Causing Rancor
Let me be clear: I have a love-hate relationship with this Reranker. Rerankers add complexity and overhead but improve results significantly. Ideally, we wouldn’t need a Reranker; a single call would yield the correct data. Unfortunately, semantic searches are quick but can pick up extraneous, related material. For example, a search for “Medical Schools” might return pre-med programs, which the Reranker helps filter out when irrelevant.
While it would be ideal to use one call, we currently use four: the vector index candidate pass, the vector index limit pass, the agent, and the Reranker. Each adds both time and complexity to the program.
Rerankers are slow, often much slower than the initial search. I chose LLMRerank for its strong performance in relevance. Unfortunately, LLMRerank can hit errors, and because it operates on raw text rather than embeddings, it can split documents into multiple pieces, increasing calls and costs. That splitting can also produce redundant results, requiring careful evaluation to avoid duplication. And since the reranking itself is an LLM call, the slowness is inherent.
(For a refresher on embeddings and vector search, see my quick primer).
Results
The first thing we see is the total time to return a result versus the number of candidates:

Observations:
- Queries without the Reranker are faster than those with it, on average a touch over 8 seconds faster. No surprise that the Reranker takes time, but 8 seconds is a lot when someone is sitting and waiting for a result. That’s a big strike against the Reranker.
- Agents, on the other hand, don’t have much impact on performance, even with a high number of candidates. I’m a bit surprised, as I hypothesized that at high candidate counts agents would reduce the search time by more than enough to offset the time it takes to call them. Instead, the numbers look very similar: on average, agents added about 0.2 seconds to total time no matter how many candidates.
Now let’s look at evaluation score by number of candidates:

Well, that’s telling. A few quick observations:
- The Reranker makes a big difference in the quality of the evaluations.
- Using agents slightly decreases the quality of the evaluations. I expected this and hypothesize that it’s because the search has a smaller pool to pull from: in this performance test there are 6,200 school documents before the filter and about 1,100 after.
- Evaluations suffer when 50 or fewer candidates are picked.
For completeness’ sake, let’s look at performance (time) vs. evaluation score.

This one isn’t so clear, but there are a few takeaways:
- The Reranker takes a lot of time.
- Evaluations are pretty variable at lower candidate levels.
- On average, the longer the time taken, the better the evaluation, up until it maxes out at 10.
Conclusion
A few things are clear from these results:
- The reranker improves evaluation scores notably (0.63 on average).
- The reranker is slow (8 seconds on average).
- Agents don’t affect performance much either positively or negatively (an additional 0.2 seconds on average).
- With the reranker, even a small number of candidates achieves high-quality results (perfect scores with as few as 10 candidates).
For now, I’m going to go with 10 candidates, keeping the agents and reranker as-is. A 6.63-second response seems reasonable, and the results are good. I’ll put an even higher volume through to make sure these numbers hold. Frankly, I didn’t expect so few candidates to provide good results, even with a reranker. Of course, this could just be confirmation bias: because it isn’t what I expected, I have to double- and triple-check it.
Longer term, I will look into other rerankers. If I could find something almost as good but faster, and preferably less expensive, that would be ideal. The reranker runs on GPT-3.5-turbo, so there are fees associated with it; this series of tests cost me about $10 in fees. That won’t break the bank, but since speed is an issue, I might as well check out lower-cost options as well.
The beauty of having a lot of observability built into the product is that I can run these tests easily, and I can switch out components as new ones come along or as the need arises.
I’m glad I tested my hypotheses instead of just assuming that agents would speed up longer queries (proved false) and that a reranker was critical to accurate answers (proved true). I think our lives would all be better off if we questioned more of our assumptions and put them to the test instead of assuming them to be true.
Moving forward, I plan to:
- Experiment with different rerankers to find a balance between speed and performance.
- Continue to refine the agents to ensure they are contributing positively to the system.
Thank you for following along in this exploration of our RAG AI system’s performance. Stay tuned for more updates and insights as we further enhance My Virtual College Advisor.
