It would be interesting - but a lot of work - to add "time to modify" and "time to debug" metrics.
Each task would come with a prescribed modification, plus a specific error to introduce into the original implementation. You'd recruit undergraduates who hadn't used the language before. Some would be given the original program to modify, others the broken version to fix. You'd measure how long each took.
This way, language communities that gamed the machine benchmarks would pay a price on the human ones.