I'm new/uninformed in this world, but I have an idea for an eval that I think ha...

koakuma-chan · 2025-11-18T18:30:53 1763490653

> Can anyone direct me towards how to ... make one?

> What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm?

LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data.

> Finally, any known/commonly used frameworks to do this, or any tooling that can call different LLMs would be enough?

ncgl · 2025-11-18T22:14:52 1763504092

> LLMs are actually pretty deterministic, so there is no need to do more than one attempt with the exact same data.

Is this true? I remember there being a randomization factor in weighting tokens to make the output more... something, I don't recall what.

Obviously I'm not an AI dev.
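
For what it's worth, the "randomization factor" is usually called temperature. Here is a rough sketch of how it is commonly described, not any particular provider's implementation: the model produces a score per candidate token, the scores are divided by the temperature, and a token is drawn at random in proportion to the result. The scores below are made up and `sample_token` is a hypothetical helper for illustration only.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw scores (logits)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide scores by the temperature, then turn them into softmax weights.
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]
    # Draw one index at random, proportionally to its weight.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.1]  # made-up scores for three candidate tokens
rng = random.Random(0)

# Collect the distinct tokens chosen over 100 draws at each setting.
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print("temperature 0 picks:", greedy)
print("temperature 1 picks:", sampled)
```

In this toy version, temperature 0 always picks the same token, while temperature 1 spreads its picks across several. Hosted models may still not be bit-for-bit reproducible at temperature 0, so treat this as intuition rather than a guarantee.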

koakuma-chan · 2025-11-18T23:06:48 1763507208

In my experience, the response may not be exactly the same, but the difference is negligible.
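
One way to check whether the difference really is negligible for a given eval is to run the same prompt several times and look at the pass rate and the number of distinct answers. A minimal sketch, assuming a hypothetical `call_model` function; here it is a random stub standing in for a real LLM call, and the 80% figure is invented purely so the script runs on its own.

```python
import random
from collections import Counter

def call_model(prompt, rng):
    """Hypothetical stand-in for a real LLM call (assumption: right ~80% of the time)."""
    return "4" if rng.random() < 0.8 else "5"

def run_attempts(prompt, expected, n, rng):
    """Ask the same question n times; return the pass rate and the answer counts."""
    answers = [call_model(prompt, rng) for _ in range(n)]
    counts = Counter(answers)
    pass_rate = counts[expected] / n
    return pass_rate, counts

rng = random.Random(42)
pass_rate, counts = run_attempts("What is 2 + 2?", "4", 20, rng)
print(f"pass rate: {pass_rate:.0%}, distinct answers: {len(counts)}")
```

If every attempt gives the same answer, one run per problem is probably enough; if the pass rate moves around, a single attempt can mislead.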

gregsadetsky · 2025-11-18T18:52:49 1763491969

I'm very grateful! Thanks a lot

moltar · 2025-11-18T22:03:37 1763503417

Take a look at promptfoo