Implementing Caching in Your Chatbot Will Save You 80%

80% of the questions to your chatbot will be repeat questions. Some simple caching will slash your costs.

One reason to build a chatbot is to let people ask any question they want, as opposed to a typical FAQ, which answers only a fixed list of standard questions. However, I find that over 60% of questions asked are exact duplicates of questions asked within the previous 24 hours. If I instead use semantic caching, in which questions that are worded differently but mean the same thing, such as “What time is it?” and “What’s the time?”, are treated as the same question, that number rises to over 80%.
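The difference between exact-match and semantic caching can be sketched as below. This is a minimal illustration, not a production design: the names QuestionCache, normalize, and fake_embed are hypothetical, the 0.9 similarity threshold is an arbitrary assumption, and fake_embed is a hand-written lookup table standing in for a real embedding model.

```python
import math
import re

def normalize(question):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", question.lower())
    return " ".join(cleaned.split())

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# ASSUMPTION: hand-picked vectors standing in for a real embedding model.
FAKE_VECTORS = {
    "what time is it": [0.90, 0.10],
    "whats the time": [0.88, 0.15],
    "where is my order": [0.10, 0.95],
}

def fake_embed(text):
    return FAKE_VECTORS.get(normalize(text), [0.0, 0.0])

class QuestionCache:
    """Tries an exact match first, then falls back to a semantic match."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold  # assumed cutoff; tune for your data
        self.exact = {}             # normalized question -> answer
        self.entries = []           # list of (embedding, answer)

    def put(self, question, answer):
        self.exact[normalize(question)] = answer
        self.entries.append((self.embed(question), answer))

    def get(self, question):
        key = normalize(question)
        if key in self.exact:       # exact duplicate
            return self.exact[key]
        vector = self.embed(question)
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        # Serve the closest cached answer only if it is similar enough.
        return best_answer if best_score >= self.threshold else None

cache = QuestionCache(embed=fake_embed)
cache.put("What time is it?", "cached answer")
exact_hit = cache.get("what time is it")      # same question, different casing
semantic_hit = cache.get("What’s the time?")  # different wording, same meaning
miss = cache.get("Where is my order?")        # unrelated question
```

In this sketch the first two lookups return the cached answer and the third returns None, which is the signal to call the LLM and store its response. The linear scan over entries is fine for a demonstration; a larger cache would likely need a vector index.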

Benefits of Caching

Implementing even a basic caching mechanism has some strong benefits:

Cost: Calls to Large Language Models (LLMs) are expensive, although prices are decreasing rapidly. A cached response avoids that call entirely.
Speed: LLM calls tend to be slow. Returning cached results is quick. You will improve the customer experience mightily by having 4 out of 5 queries return instantly.
Scale: Because serving a cached response is far faster and less resource-intensive than getting an answer from the LLM, caching allows you to process many more transactions for the same cost.
Consistency: LLM responses, especially when tied to a retrieval-augmented generation (RAG) AI, have some level of inherent inconsistency. This is true even if the temperature, or level of randomness, for the LLM is set to 0. Cached responses, however, are guaranteed to always be exactly the same.
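To put rough numbers on the cost, speed, and scale benefits, here is a small back-of-the-envelope calculation. Every figure in it is an assumption chosen for illustration (the per-call price, both latencies, and a cache lookup cost of zero); only the 80% hit rate comes from the discussion above.

```python
def blended(hit_rate, cached_value, llm_value):
    """Average per-query value when a fraction `hit_rate` is served from cache."""
    return hit_rate * cached_value + (1.0 - hit_rate) * llm_value

HIT_RATE = 0.80        # share of queries answered from the cache
LLM_COST = 0.01        # ASSUMPTION: dollars per LLM call
CACHE_COST = 0.0       # ASSUMPTION: cache lookups treated as free
LLM_LATENCY = 2.0      # ASSUMPTION: seconds per LLM call
CACHE_LATENCY = 0.02   # ASSUMPTION: seconds per cache lookup

avg_cost = blended(HIT_RATE, CACHE_COST, LLM_COST)
avg_latency = blended(HIT_RATE, CACHE_LATENCY, LLM_LATENCY)
savings = 1.0 - avg_cost / LLM_COST        # fraction of spend avoided
capacity_multiplier = LLM_COST / avg_cost  # queries served per dollar vs. no cache

print(f"Average cost per query: ${avg_cost:.4f}")
print(f"Average latency: {avg_latency:.3f} s")
print(f"Savings: {savings:.0%}, capacity: {capacity_multiplier:.1f}x")
```

With these assumed inputs, the average cost per query falls by 80%, the average latency drops from 2 seconds to roughly 0.4 seconds, and the same budget covers about five times as many queries. A real cache is not free to run, so actual savings will be somewhat lower.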

Cautions in Caching

Caching is not all upside. There are some things to watch out for:

Outdated Responses: If you are using a RAG AI or another dynamic way to generate your chatbot responses, then cached responses will not reflect any data newer than the time they were generated. Therefore, it is important to renew the cache periodically, on a schedule that depends on the data and retrieval volume. If the data changes frequently, a cache that renews every 15 minutes, or even every 5 minutes, may be needed.
Caching Complicates Systems: The logic needed to create and maintain caching adds complexity to the system. It’s another thing that might break and that must be maintained. The effects of complexity in systems are often greatly underestimated. As a rule of thumb, if I can’t cache 60% of calls, I usually don’t include caching.
Guardrails: Guardrails, which ensure that inputs and outputs are appropriate and not harmful, are important for any chatbot implementation. They are even more important with caching, because a cached response may be served again to other users. Make certain you are not storing any confidential or private information in either the query or the response.
Context Dependency: If there are different possible contexts, make sure to cache within that context. For instance, I have many RAG AIs that go through the same database and cover topics from makeup to fishing to baking. A question such as “What kind of preparation do I need for Salmon?” has a completely different meaning for fishing than for baking. Therefore, I cache my fishing and baking questions separately.
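Two of the cautions above, outdated responses and context dependency, can be handled in the cache key and expiry logic. The following is a minimal sketch under stated assumptions: the class name ScopedTTLCache and its methods are hypothetical, the 15-minute lifetime is taken from the example above, and the injectable clock exists only so that expiry can be demonstrated without waiting.

```python
import time

class ScopedTTLCache:
    """Caches answers per context and expires them after a fixed lifetime."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be simulated
        self.store = {}      # (context, question) -> (expires_at, answer)

    @staticmethod
    def _key(context, question):
        # The context is part of the key, so topics never share entries.
        return (context, question.strip().lower())

    def put(self, context, question, answer):
        expires_at = self.clock() + self.ttl
        self.store[self._key(context, question)] = (expires_at, answer)

    def get(self, context, question):
        key = self._key(context, question)
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self.clock() >= expires_at:
            del self.store[key]  # stale: force a fresh answer next time
            return None
        return answer

# Simulated clock (in seconds) so the example runs instantly.
now = [0.0]
cache = ScopedTTLCache(ttl_seconds=15 * 60, clock=lambda: now[0])

question = "What kind of preparation do I need for Salmon?"
cache.put("fishing", question, "fishing answer")
cache.put("baking", question, "baking answer")

fishing_answer = cache.get("fishing", question)
baking_answer = cache.get("baking", question)
makeup_answer = cache.get("makeup", question)  # never cached in this context

now[0] += 15 * 60 + 1                          # just past the 15-minute lifetime
expired_answer = cache.get("fishing", question)
```

Here the same question returns a different answer in each context, returns None in a context where it was never cached, and returns None again once the entry is older than 15 minutes. Expiring entries on read keeps the sketch short; a periodic refresh job is another reasonable choice.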

Conclusion

When I first built my chatbots, I neglected caching, a lesson that cost me in real dollars. I figured, erroneously, that if people could ask freeform questions, there would be a wide variety of questions. While there were many questions, a large number of them covered topics that earlier users had already asked about. Caching allows quick, inexpensive, and consistent responses. There is a price to be paid in system complexity, and it’s important to have guardrails implemented around privacy and security, but these costs are usually far outweighed by the benefits.

Lowry On Leadership