The difference is the impact of contaminated datasets. Exam boards tend to reuse questions, either verbatim or slightly modified. This is not such a problem for assessing humans, because it is easier for a human to learn the material than to learn 25 years of prior exams. Clearly that is not the case for current LLMs.