More

anentropic · 2026-03-09T16:17:24 1773073044

Yeah...

https://github.com/opengraviton/graviton?tab=readme-ov-file#...

the benchmarks don't show any results for using these larger-than-memory models, only the size difference

it all smells quite sloppy

anentropic · 2026-03-09T10:55:46 1773053746

This vs JTD?

anentropic · 2026-03-06T11:27:41 1772796461

Arguably the original name was the newspeak and the new name is more honest

anentropic · 2026-03-04T23:23:12 1772666592

Do you have a nice way to let it 'use the app' or receive visual feedback?

I imagine that would help the process a lot

anentropic · 2026-03-04T16:31:55 1772641915

8GB RAM is fine for these

topping out at 512GB storage is lame though

zamadatix · 2026-03-04T18:59:15 1772650755

8 GB is plenty for the base of this use case but I'd still wish for a 16 GB upgrade option before I'd wish for a 1 TB upgrade option.

anentropic · 2026-03-04T11:46:18 1772624778

Ah, the Lobster guy!

I have been running across that repo for years and wondered if anything was happening with it - great to see an impressive game project built on it now.

anentropic · 2026-03-04T11:38:40 1772624320

Fantastic stuff!

FYI some code snippets are unreadable in 'light mode' ("what substrings does the regex (a|ab)+ match in the following input?")

ieviev · 2026-03-04T12:14:02 1772626442

ah thank you for letting me know, fixed it now!

anentropic · 2026-03-03T15:42:06 1772552526

Do any frameworks manage to use the neural engine cores for that?

Most stuff ends up running Metal -> GPU I thought

abhikul0 · 2026-03-03T16:44:44 1772556284

It's referring to the neural cores(for matrix mul) in the GPU itself, not the NPU.

https://creativestrategies.com/research/m5-apple-silicon-its...

sumek83 · 2026-03-03T16:51:12 1772556672

https://github.com/maderix/ANE

anentropic · 2026-03-02T17:01:14 1772470874

and Claude is remote inference anyway, just an http api

anentropic · 2026-02-19T10:11:47 1771495907

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks

networked · 2026-02-19T12:38:10 1771504690

51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.

Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, meaning 51% is ⅔ SOTA.

andai · 2026-02-19T14:07:00 1771510020

This is interesting, TFA lists Opus at 59. Which is the same as Claude Code with opus on the page you linked here. But it has Droid agent with Opus scoring 69. Which means the CC harness harness loses Opus 10 points on this benchmark.

I'm reminded of https://swe-rebench.com/ where Opus actually does better without CC. (Roughly same score but half the cost!)

pitched · 2026-02-19T11:59:28 1771502368

That score is on par with Gemini 3 Flash but these scores look much more affected by the agent used than the model, from scrolling through the results.

varispeed · 2026-02-19T12:24:39 1771503879

Gemini 3 Flash is pure rubbish. It can easily get into loop mode and spout information no different than Markov chain and repeat it over and over.

YetAnotherNick · 2026-02-19T12:51:46 1771505506

TerminalBench is like the worst named benchmark. It has almost nothing to do with terminal, but random tools syntax. Also it's not agentic for most tasks if the model memorized some random tool command line flags.

esafak · 2026-02-19T14:58:21 1771513101

What do you mean? It tests whether the model knows the tools and uses them.

YetAnotherNick · 2026-02-19T15:47:08 1771516028

Yeah it's a knowledge benchmark not agentic benchmark.

esafak · 2026-02-19T15:54:03 1771516443

That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

YetAnotherNick · 2026-02-19T16:07:50 1771517270

I am saying the opposite. If a coding benchmark just tests the syntax of a esoteric language, it shouldn't be called coding benchmark.

For a benchmark named terminal bench, I would assume it would require some terminal "interaction", not giving the code and command.