As of now, the ultimate model for most businesses

If NBA teams evaluated players the way AI journalists evaluate models, the late great Fred “Curly” Neal, the famed Harlem Globetrotters trick play specialist, would be deemed the ultimate player. “He reliably makes shots from half court! He dribbles circles around opponents!” All true, and all extremely impressive, but the skills that look so amazing in exhibition don’t always translate to the real game.

These journalists would never have chosen GPT-4.1-nano as the ultimate model, but for many real businesses, it is.

What most companies need from AI

Most companies need AI that's quick, reliable, cheap and, above all, accurate. A user asking how to reset their internet modem won't wait 20 minutes for a chatbot to reply, and certainly won't tolerate a chatbot that consistently returns wrong answers.

Read most AI journalists and you’ll see them extolling the newest AI model for passing the bar exam, or scoring a new high on some AI benchmark. What you won’t see is reporting on the workhorse models. These models are designed to be good, not great, at reasoning, but to be very fast and very cheap.

All of my products rely on these workhorse models, so I've developed tools to automatically evaluate new models in real-world situations. If a new model produces results that are just as good while being faster or cheaper, I switch to it.
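The core of such an evaluator is simple: run a fixed set of question/expected-answer pairs through a model, and track how often it gets them right and how long each call takes. Here's a minimal sketch of that loop; the `ask` callable stands in for a real model API call, and the FAQ stub and test case are illustrative, not from my actual test suite.

```python
import time

def evaluate(ask, cases):
    """Run a model (via the `ask` callable) over (question, expected) pairs;
    return (accuracy, mean latency in seconds)."""
    correct, total_time = 0, 0.0
    for question, expected in cases:
        start = time.perf_counter()
        answer = ask(question)
        total_time += time.perf_counter() - start
        # Crude correctness check: does the answer contain the expected phrase?
        if expected.lower() in answer.lower():
            correct += 1
    return correct / len(cases), total_time / len(cases)

# Stub standing in for a real model call:
faq = {"How do I reset my modem?": "Hold the reset button for 10 seconds."}
accuracy, latency = evaluate(lambda q: faq.get(q, "I don't know."),
                             [("How do I reset my modem?", "reset button")])
```

In practice you'd swap the stub for a real API call and use a larger, domain-specific test set, but the shape of the harness stays the same.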

OpenAI’s new models

OpenAI just released a slew of new models. I ran all of them through my evaluator, and I'm sharing the results here. There are main models, such as GPT-4o, GPT-4.1, and GPT-4.5-preview, as well as derivative models: mini, nano, and search. OpenAI also offers o1 and o3 reasoning models, which are not reviewed here because initial tests showed them orders of magnitude too slow for my needs.

Mini and nano models are scaled-down versions of large AI models that sacrifice some intelligence for dramatically improved speed and lower costs. Think of them as lighter-weight versions of the full models – a full GPT model might have a trillion parameters (the values that determine how a model processes information), while a mini might have 1/5th as many, and a nano even fewer. Mini models offer a balanced middle ground, while nano models are the smallest and fastest options, making them ideal for applications where rapid response times and cost efficiency matter more than maximum capability.

Search-enabled AI models combine traditional language model capabilities with the ability to access and retrieve up-to-date information from the web in real-time. While they excel at answering factual questions about current events and retrieving specific information, they may perform poorly on tasks that require focusing solely on provided context. For Retrieval-Augmented Generation (RAG) applications, where you want the model to work with specific documents you provide rather than searching the internet, these search models can actually perform worse since they’re programmed to prioritize external information over locally supplied documents.
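For a RAG application, the prompt itself is where you confine the model to your documents. A minimal sketch of that prompt assembly, using the standard chat-message format (the wording of the system instruction and the sample document are my own illustration, not the exact prompts from my products):

```python
def build_rag_messages(documents, question):
    """Assemble a chat prompt that confines the model to supplied documents."""
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(documents))
    system = ("Answer using ONLY the documents below. "
              "If the answer is not in them, say you don't know.\n\n" + context)
    return [{"role": "system", "content": system},
            {"role": "user", "content": question}]

msgs = build_rag_messages(
    ["Modem X: hold the reset button for 10 seconds to factory reset."],
    "How do I reset Modem X?")
```

A non-search model will generally respect the "ONLY the documents below" instruction; the search-preview models, as my results show, tend to go to the web anyway.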

Results:

Model Benchmarking Results

| Model | Accuracy | Speed | Cost per Million Tokens (Input/Output) | Cost Relative to GPT-4o-mini |
|---|---|---|---|---|
| GPT-4.1 | 97.62% | 7.43s | $2.00 / $8.00 | 14.47× more expensive |
| GPT-4.1-mini | 95.24% | 2.83s | $0.40 / $1.60 | 2.89× more expensive |
| GPT-4.5-preview | 92.86% | 18.48s | $75.00 / $150.00 | 325.0× more expensive |
| GPT-4o | 88.10% | 4.73s | $2.50 / $10.00 | 18.09× more expensive |
| GPT-4.1-nano | 83.33% | 1.34s | $0.10 / $0.40 | 0.72× (28% cheaper) |
| GPT-4o-mini | 61.91% | 6.36s | $0.15 / $0.60 | 1.0× (baseline) |
| GPT-4o-search-preview | 21.43% | 7.01s | $2.50 / $10.00 | 18.09× more expensive |
| GPT-4o-mini-search-preview | 21.43% | 7.86s | $0.15 / $0.60 | 1.0× (same as baseline) |
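Per-million-token prices translate to per-request costs straightforwardly. A quick sketch using the listed prices for GPT-4.1-nano and GPT-4o-mini (the 2,000-input / 500-output token counts are illustrative; the relative-cost column in the table reflects the actual token mix in my workload):

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one request, given prices per million tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Example request: 2,000 input tokens, 500 output tokens.
nano = request_cost(2000, 500, 0.10, 0.40)  # $0.0004
mini = request_cost(2000, 500, 0.15, 0.60)  # $0.0006
```

At these volumes a single request costs a tiny fraction of a cent either way; the differences only matter at scale.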

In short, for my RAG AI, all of the non-search models give more accurate results than my current model, GPT-4o-mini. Since GPT-4o-mini already scores well in my product's user testing, the accuracy gains, while welcome, are unlikely to be a game-changer for my app.

The hands-down winner, GPT-4.1-nano

GPT-4.1-nano gives more accurate answers than my current model, GPT-4o-mini, and does it almost 5 times faster at roughly three quarters of the cost. Faster, cheaper, more accurate. It's so clearly head and shoulders above the others that I had already moved most of my products over to it before I finished this blog post.

Honorable mentions:

GPT-4.1-mini: It’s no surprise that the mini version of GPT-4.1 is more accurate, but slower and more expensive, than the nano version. It’s a scaled-down version of GPT-4.1 that gives up a little capability but keeps most of its parent’s accuracy. It delivers much better answers than my original GPT-4o-mini model, and much faster, so it would also be a great choice, especially if top-tier answers were essential to my products. The cost is quite low compared to the big models: about 1/5th the price of its parent GPT-4.1, with results that are almost as good. Unfortunately, while very inexpensive, it still costs 4 times as much as GPT-4.1-nano.

GPT-4o-search-preview and GPT-4o-mini-search-preview: These performed extremely poorly in my tests. Essentially, these models are much too eager to search the web to work well with RAG AI, which, by definition, draws on a smaller, fixed set of data. Worse, they ignored my other prompt guidelines. For instance, every product recommendation is supposed to make clear why that product was recommended over its nearest competitor, but the search models simply ignored that instruction. That said, I wouldn’t be surprised if the next iteration of search is good enough that I can dispense with much of my custom AI for makeup recommendations from influencers and use the search functionality instead.

Closing Thoughts

All of these models are impressive steps forward. GPT-4.1-nano is an amazing choice for me that will make my product more accurate, faster, and cheaper. That’s a combination that’s hard to beat.

But the model that’s best for your needs may be a different one. If I needed more accuracy, GPT-4.1-mini would likely be where I’d go. The searches seem like they have huge potential, although they fail for my purposes.

My advice remains the same: test with your own use case and choose the cheapest model that meets your accuracy and speed requirements. Don’t listen to the AI journalists; most teams want Steph Curry, not Curly Neal. The latest benchmark-breaking model is probably great, but it’s not necessarily the model you need to win at your business.
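That selection rule is easy to automate once you have benchmark numbers. A minimal sketch, filtering models by accuracy and latency floors and then taking the cheapest survivor (the threshold values here are illustrative; the result dicts are built from the table above):

```python
def pick_model(results, min_accuracy, max_seconds):
    """Cheapest model meeting the accuracy and latency floors; None if none qualify."""
    ok = [r for r in results
          if r["accuracy"] >= min_accuracy and r["speed"] <= max_seconds]
    return min(ok, key=lambda r: r["relative_cost"])["model"] if ok else None

results = [
    {"model": "GPT-4.1",      "accuracy": 0.9762, "speed": 7.43, "relative_cost": 14.47},
    {"model": "GPT-4.1-mini", "accuracy": 0.9524, "speed": 2.83, "relative_cost": 2.89},
    {"model": "GPT-4.1-nano", "accuracy": 0.8333, "speed": 1.34, "relative_cost": 0.72},
    {"model": "GPT-4o-mini",  "accuracy": 0.6191, "speed": 6.36, "relative_cost": 1.0},
]
choice = pick_model(results, min_accuracy=0.80, max_seconds=5.0)  # → "GPT-4.1-nano"
```

Tighten `min_accuracy` to 0.90 and the same function would return GPT-4.1-mini instead, which matches the honorable-mention reasoning above.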


Discover more from Lowry On Leadership
