Automating your evaluations allows you to switch models quickly, add features, and know whether an underlying change in the model you’re using might cause difficulties. Today, I will walk you through the entire process, covering both the code and the prompting.
Overview
The basics are:
- Take the Query: Capture the original question that was sent to the LLM.
- Take the Result: Capture the LLM’s response to that query.
- Combine Them into One Prompt and Evaluate: Merge the original query and the LLM’s response into a single prompt. In this combined prompt, include instructions for the LLM to assess whether the response accurately and appropriately answers the query.
It’s just that simple: take the query and the result, pass them back into the LLM (or into another LLM), and have it evaluate the answer (see the minimal code sketch after this list). Of course, the devil, as always, is in the details.
- Handle Poor Evaluations: Develop a system to manage instances where the LLM’s evaluation indicates that the response is not satisfactory.
- Check the Evaluation Prompt for Accuracy: It’s crucial to verify that the evaluation prompt itself is well-designed and accurate.
- Include Relevant Documents/Embeddings in RAG: When using Retrieval-Augmented Generation (RAG), which involves supplying the LLM with additional documents or data points (embeddings) along with the query, make sure these resources are also part of the evaluation process.
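Here is a minimal sketch of that core loop in Python. I use the OpenAI client purely for illustration; the model name and the exact evaluation wording are placeholders, so swap in whichever LLM and instructions you actually use:

```python
# Minimal sketch of the evaluate-the-answer loop.
# Assumes the OpenAI Python client (v1+); the model name and the
# evaluation wording are placeholders -- adapt them to your setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate(query: str, answer: str) -> str:
    """Feed the original query and the LLM's answer back to an LLM
    and ask it to judge whether the answer is satisfactory."""
    evaluation_prompt = (
        "Evaluate whether the answer below accurately and "
        "appropriately responds to the query.\n\n"
        f"Query: {query}\n\n"
        f"Answer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any capable model works
        messages=[{"role": "user", "content": evaluation_prompt}],
    )
    return response.choices[0].message.content
```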
If you’re interested in how this is coded in Python with Llama Index, I have an example here: https://github.com/Troyusrex2/AIEvaluationAutomation.
The Prompt’s the Thing
You can try this process out in ChatGPT. Let’s start simple:

Here’s a prompt and a result. Now, we just take those and feed them back into ChatGPT or a similar LLM. In this case, I used Claude 3.5 Sonnet. I added some evaluation instructions at the front.

Evaluations are often context-dependent, and it’s up to you to provide that context to the LLM. The place for that context is in the prompt. For example, if the question came up during a discussion about Tesla Motor Corp., the evaluation would come out wrong if the LLM assumed “President” meant the President of the USA.
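For instance, you might prepend a context line to the evaluation prompt, something like (the wording here is mine, purely illustrative): “Context: this question came up in a discussion about Tesla Motor Corp., so ‘President’ refers to the president of that company, not a head of state.”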

Rubrics
Evaluations of more complex material often benefit from structured criteria called rubrics. A rubric provides a consistent way to evaluate any material, and it can be simple or incredibly complex. While standardized tests and other high-stakes exams require detailed, complex rubrics, most LLM automation rubrics require only a few minutes of thought about how the evaluations should be done.
Some areas you might include in the rubric:
- How well does it answer the question?
- Does it provide extra information beyond what is asked?
- Is the output in the requested format?
A typical prompt might look like: “Evaluate the prompt and answer below. A score of 1 means not relevant at all. A score of 10 means a perfect match in relevance. Provide your response in the following format: Score: [Your score as an integer] Explanation: [a one-sentence breakdown explaining why you gave this score.]”
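Because the evaluator is instructed to reply in a fixed format, a few lines of parsing turn its output into structured data. A sketch, assuming the “Score: / Explanation:” format above:

```python
import re


def parse_evaluation(text: str) -> tuple[int, str]:
    """Pull the integer score and the one-sentence explanation out of an
    evaluator reply formatted as 'Score: N' and 'Explanation: ...'."""
    score_match = re.search(r"Score:\s*(\d+)", text)
    explanation_match = re.search(r"Explanation:\s*(.+)", text, re.DOTALL)
    if not score_match:
        raise ValueError(f"No score found in evaluation: {text!r}")
    score = int(score_match.group(1))
    explanation = explanation_match.group(1).strip() if explanation_match else ""
    return score, explanation


print(parse_evaluation("Score: 8\nExplanation: Relevant and concise."))
# (8, 'Relevant and concise.')
```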
Retrieval-Augmented Generation (RAG) AI
Automating evaluations is especially beneficial with RAG AI. In RAG AI, additional information is fed into the LLM along with the prompt to help it answer the question. Here, there are two pieces to evaluate:
- Are the documents fed into the RAG AI pertinent to the topic?
- Did the LLM do a good job of answering the question based on the extra RAG documents provided?
Knowing whether the documents passed to the RAG AI are appropriate is a complex topic focused on the data inputs. For this post, I will focus on #2: the outputs of the LLM.
Let’s consider a simple RAG where the only document is as follows:
“The Mediterranean diet is renowned for its health benefits, primarily due to its emphasis on whole foods, healthy fats, and a balanced intake of nutrients. Key components include olive oil as a primary fat source, abundant consumption of fruits and vegetables, whole grains, legumes, nuts, and moderate consumption of fish and poultry. Red meat is limited, and dairy products are consumed in moderation, primarily as cheese and yogurt. Studies have shown that adherence to the Mediterranean diet can lead to reduced risks of cardiovascular diseases, improved blood sugar control, and lower levels of bad cholesterol (LDL). Additionally, it has been associated with longevity and a decreased risk of chronic diseases such as cancer and Alzheimer’s disease.”
Any calls made to the LLM will automatically include this document, and the LLM will be instructed to only answer from this document. This happens in the background where the user can’t see it, but we can simulate it in ChatGPT to show how it works.
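We can mimic that background step in a few lines of Python. A sketch; the instruction wording is illustrative, and the document string stands in for the Mediterranean diet text quoted above:

```python
# Stand-in for the Mediterranean diet document quoted above.
DOCUMENT = "The Mediterranean diet is renowned for its health benefits..."


def build_rag_prompt(question: str, document: str) -> str:
    """Mimic what a RAG pipeline does in the background: bundle the
    retrieved document with the user's question and restrict the LLM
    to answering only from that document."""
    return (
        "Answer the question using ONLY the document below. "
        "If the document does not contain the answer, say so.\n\n"
        f"Document: {document}\n\n"
        f"Question: {question}"
    )
```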


To evaluate it, again just take the question, answer, and extra information, add an evaluation prompt, and ask the LLM.
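In code, the only change from the earlier evaluator is that the document goes into the evaluation prompt alongside the question and answer. A sketch that reuses the rubric format from above:

```python
def build_rag_evaluation_prompt(question: str, answer: str, document: str) -> str:
    """Bundle the question, the answer, and the RAG document into one
    evaluation prompt, reusing the 1-10 rubric format from earlier."""
    return (
        "Evaluate the answer to the question below. A score of 1 means "
        "not relevant at all; a score of 10 means a perfect match in "
        "relevance. Provide your response in the following format:\n"
        "Score: [your score as an integer]\n"
        "Explanation: [one sentence explaining the score]\n\n"
        f"Document provided to the model: {document}\n\n"
        f"Question: {question}\n\n"
        f"Answer: {answer}"
    )
```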


Note here that the evaluation is low. Why? Because it is evaluating whether the response is correct generally rather than whether the directions were followed. A quick tweak to the evaluation instructions should handle it.
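For example, the tweak might add a line along these lines (the wording is mine, for illustration): “Judge whether the answer follows the instruction to use only the supplied document, not whether it is correct in general. An answer that rightly says the document does not cover the question should score highly.”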


Note that RAG systems generally use embeddings to retrieve relevant chunks of documents rather than passing entire documents to the LLM. I use the full text here for illustrative purposes.
Iterate
Setting up automated evaluation is an iterative process: a human must first verify that the evaluations are accurate. Once the evaluation prompt is finely tuned, you can automate it by adding another call to the LLM API in your code and returning the evaluation.
One trick is to feed any conversation that gets negative feedback from users into the evaluator. Then you can review the result to see whether the response was truly disappointing. This way, your user feedback mechanism also helps refine your evaluations.
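A sketch of that trick, reusing the `evaluate` and `parse_evaluation` helpers sketched earlier (the shape of the feedback records is hypothetical, and it assumes the evaluator was prompted to reply in the Score/Explanation format):

```python
def review_flagged_conversations(flagged: list[dict]) -> list[dict]:
    """Run conversations that received negative user feedback through the
    evaluator and surface disagreements for human review.

    `flagged` is a hypothetical list of {"query": ..., "answer": ...}
    records pulled from your feedback mechanism."""
    needs_review = []
    for convo in flagged:
        score, explanation = parse_evaluation(
            evaluate(convo["query"], convo["answer"])
        )
        if score >= 7:  # user unhappy but evaluator satisfied: investigate
            needs_review.append(
                {**convo, "score": score, "explanation": explanation}
            )
    return needs_review
```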
As with any feedback process, someone must periodically review low-scoring evaluations to determine why they scored low. Take a look at my Rules for Prompting when putting together your evaluation prompt. I’ve seen a trend toward oversized prompts; my advice is to avoid them, but the main thing is to find an evaluation prompt that works for you.
Conclusion
Automating your evaluations is simple, helps you improve your product, and lets you test new LLMs for your products. It really is as simple as putting the question, the answer, and any additional information (if using RAG) back into an LLM and asking it to evaluate. While far more extensive and complex evaluation schemes exist, most implementations need only a simple one, and even if you ultimately plan to use a complex evaluation, a simple one is the right place to start.
