I'm new to this area, but I have an idea for an eval that I don't think has been tried yet.
Can anyone point me towards how to actually make one? At the most fundamental level, is it about having test questions with known, golden (verified, valid) answers, asking different LLMs to find the answer, and comparing scores (how many each one got correct)?
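To make the question concrete, this is roughly what I picture. It is only a minimal sketch of my own understanding: the golden items, the `normalize` rule (exact match after trimming and lowercasing), and the two stub "models" are all made up for illustration, and a real version would replace the stubs with actual API calls.

```python
from typing import Callable, Dict, List

# Hypothetical golden set: each question is paired with a verified answer.
GOLDEN: List[Dict[str, str]] = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Chemical symbol for gold?", "answer": "Au"},
]


def normalize(text: str) -> str:
    """Trim, lowercase, and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.strip().lower().split())


def score_model(ask: Callable[[str], str], items: List[Dict[str, str]]) -> float:
    """Return the fraction of items whose normalized answer exactly matches the golden answer."""
    correct = sum(
        normalize(ask(item["question"])) == normalize(item["answer"])
        for item in items
    )
    return correct / len(items)


# Stub "models" standing in for real LLM calls (assumption: a model is just a str -> str function).
def stub_model_a(question: str) -> str:
    answers = {
        "2 + 2 = ?": " 4 ",
        "Capital of France?": "paris",
        "Chemical symbol for gold?": "AU",
    }
    return answers.get(question, "")


def stub_model_b(question: str) -> str:
    answers = {
        "2 + 2 = ?": "4",
        "Capital of France?": "Lyon",  # deliberately wrong
        "Chemical symbol for gold?": "Au",
    }
    return answers.get(question, "")


scores = {
    "stub_model_a": score_model(stub_model_a, GOLDEN),
    "stub_model_b": score_model(stub_model_b, GOLDEN),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Exact match is the simplest possible scorer; I assume free-form answers would need something looser, which is part of what I am asking about.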
What are the "obvious" things that are important to get right: temperature set to 0? At least ~10 or 20 attempts at the same problem for each LLM? And what are the non-obvious gotchas?
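For the repeated-attempts part, this is how I would guess the results get aggregated: a pass rate per question, then a mean and standard error across questions. Again just a sketch under my own assumptions; `run_attempts`, `summarize`, and the stub model are hypothetical names, and I don't know whether this is the standard way to report it.

```python
import math
from statistics import mean, stdev
from typing import Callable, List, Tuple


def run_attempts(ask: Callable[[str], str], question: str, answer: str, n_attempts: int) -> float:
    """Ask the same question n_attempts times and return the fraction of exact-match answers."""
    hits = [ask(question).strip() == answer for _ in range(n_attempts)]
    return sum(hits) / n_attempts


def summarize(per_question_rates: List[float]) -> Tuple[float, float]:
    """Return (mean pass rate, standard error of the mean) across questions."""
    avg = mean(per_question_rates)
    if len(per_question_rates) < 2:
        return avg, float("nan")  # standard error is undefined for a single question
    return avg, stdev(per_question_rates) / math.sqrt(len(per_question_rates))


# Stub model (assumption): always answers "4", so it is right on one question only.
def stub_model(question: str) -> str:
    return "4"


questions = [("2 + 2 = ?", "4"), ("3 + 3 = ?", "6")]
rates = [run_attempts(stub_model, q, a, n_attempts=10) for q, a in questions]
avg, se = summarize(rates)
print(f"mean pass rate = {avg:.2f}, standard error = {se:.2f}")
```

With real models the per-question rates would vary between 0 and 1, and the standard error gives a rough sense of whether a gap between two models is larger than the noise.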
Finally, are there any known or commonly used frameworks for this, or would any tooling that can call different LLMs be enough?
Thanks!