I don't think these models are GPT-4 level. Yes they seem to be on benchmarks, b...

simonw · on March 24, 2025

I'm not talking about GPT-4o here - every benchmark I've seen has had the new models from the past ~12 months outperform the March 2023 GPT-4 model.

To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.

th0ma5 · on March 24, 2025

How have you been able to tie benchmark results to better results?

simonw · on March 24, 2025

Vibes and intuition. Not much more than that.

th0ma5 · on March 25, 2025

Don't you think that presenting this as learning or knowledge is unethical?

tosh · on March 25, 2025

Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".

Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. In those cases, a model with good capability for the use case will do fine (e.g. as well as GPT-4).

That same model might hallucinate when you ask about the plot of a movie, while a larger model like GPT-4 might be better able to recall what the movie is about.