charlieyu1 10 months ago | on: Benchmarking GPT-5 on 400 real-world code reviews

Even the maths benchmarks only checked the final numerical answer against ground truth, which means an LLM can output a lot of nonsense, guess the correct answer, and still pass.
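To make that concrete, here is a minimal sketch of what an answer-only grader might look like. The function names, the regex, and the two sample responses are all made up for illustration and are not taken from any particular benchmark; the point is only that the reasoning text is never inspected.

```python
import re
from typing import Optional


def extract_final_number(response: str) -> Optional[float]:
    """Return the last number appearing in the response, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None


def answer_only_grade(response: str, ground_truth: float) -> bool:
    """Hypothetical grader: pass if the final number matches the ground truth.

    Nothing before the final number is checked.
    """
    final = extract_final_number(response)
    return final is not None and abs(final - ground_truth) < 1e-9


# Two made-up responses to "What is 17 * 3 + 4?" (ground truth: 55).
sound = "17 * 3 = 51, and 51 + 4 = 55. Answer: 55"
nonsense = "17 * 3 = 60, minus the moon is 12, therefore the answer is 55"

print(answer_only_grade(sound, 55))
print(answer_only_grade(nonsense, 55))
```

Both responses get the same grade under this scheme, since only the last number is compared. Catching the second one would need some check on the intermediate steps, which this sketch deliberately leaves out.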