
I don't think these models are GPT-4 level. Yes, they seem to be on benchmarks, but it is well known that models increasingly rely on A/B testing in dataset curation and synthesis (using GPT-4-level models) to optimize not just for the benchmarks but for anything that could be benchmarked, like academic tasks.


I'm not talking about GPT-4o here - every benchmark I've seen has had the new models from the past ~12 months out-perform the March 2023 GPT-4 model.

To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.


How have you been able to tie benchmark results to better results?


Vibes and intuition. Not much more than that.


Don't you think that presenting this as learning or knowledge is unethical?


Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".

Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. In those cases, a model with good capability for the use case will do fine (e.g. as well as GPT-4).

That same model might hallucinate when you ask about the plot of a movie, while a larger model like GPT-4 might be better able to recall what the movie is about.



