I don't think these models are GPT-4 level. Yes they seem to be on benchmarks, but it has been known that models increasingly use A/B testing in dataset curation and synthesis(using GPT 4 level models) to optimize not just the benchmarks but things which could be benchmarked like academics.
Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".
Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. There a model with good capability for the use case will do fine (e.g. as good as GPT-4).
That same model might hallucinate when you ask about the plot of a movie while a larger model like GPT-4 might be able to recall better what the movie is about.