https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
- Current evals could not withstand scrutiny in high-stakes scenarios, e.g. if a responsible scaling policy halted a deployment and affected companies sued over the decision
- Simple formatting changes affect eval results a lot
- e.g. switching between () and [], or between A) and 1) as option labels; likely tied to the learned features and distributions of each model (see the sketch below)
- Results change significantly on the same questions depending on prompting
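A minimal way to probe this kind of format sensitivity (a sketch only; the question, options, and the `query_model` hook are illustrative placeholders, not from the article) is to render the same multiple-choice item under every combination of label and bracket style and compare the model's answers:

```python
from itertools import product

QUESTION = "Which gas makes up most of Earth's atmosphere?"
OPTIONS = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

# Surface-level formatting choices that should not matter,
# but in practice can shift measured results.
LABEL_STYLES = {"letters": ["A", "B", "C", "D"], "numbers": ["1", "2", "3", "4"]}
BRACKET_STYLES = {"paren": "{label})", "square": "[{label}]"}


def render_prompt(label_style: str, bracket_style: str) -> str:
    """Render the same question under one formatting variant."""
    labels = LABEL_STYLES[label_style]
    fmt = BRACKET_STYLES[bracket_style]
    lines = [QUESTION]
    for label, option in zip(labels, OPTIONS):
        lines.append(f"{fmt.format(label=label)} {option}")
    lines.append("Answer with the label of the correct option.")
    return "\n".join(lines)


def format_sensitivity(query_model) -> dict:
    """Run the same question under every format variant.

    `query_model` is a stand-in for whatever model call you use
    (an API client, a local model, ...); it takes a prompt string
    and returns the model's raw answer string.
    """
    results = {}
    for label_style, bracket_style in product(LABEL_STYLES, BRACKET_STYLES):
        prompt = render_prompt(label_style, bracket_style)
        results[(label_style, bracket_style)] = query_model(prompt)
    return results


if __name__ == "__main__":
    # Dummy "model" so the sketch runs end to end; swap in a real call.
    dummy = lambda prompt: "B" if ("B)" in prompt or "[B]" in prompt) else "2"
    for variant, answer in format_sensitivity(dummy).items():
        print(variant, "->", answer)
```

In a real eval you would aggregate accuracy per variant over many items; a large spread across variants suggests the measurement is tracking surface format rather than the underlying capability.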
Questions that a robust eval system should be able to answer (quoted directly from the article):
- What precisely does the evaluation measure, e.g. do we have strong agreement that this is a good way to measure any particular concept?
- How large is the “coverage” of the evaluation, i.e. how large is the potential input space of the system that was covered? For example, an evaluation that would use only one prompt without any paraphrases or semantically similar prompts has minimal coverage.
- How robust and general are the results of our evals, i.e. how does the system perform under different conditions or in varied contexts (e.g. out of distribution)?
- How can we increase the replicability and reliability of evals, e.g. consistently get the same results for reruns?
- Does the evaluation make some form of statistical guarantee, e.g. can we make a statement like “the bad outcome is less likely than 1 in 10000 under normal use conditions”?
- How accurate are the predictions about future systems when using the results of evaluations on current systems?
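To make the statistical-guarantee question concrete: one standard tool (my addition, not named in the article) is an exact binomial upper confidence bound. With zero observed failures in n independent trials, the one-sided 95% upper bound on the failure rate is roughly 3/n (the "rule of three"), so backing a "less likely than 1 in 10,000" claim needs on the order of 30,000 clean trials, and even then the bound assumes the trials are i.i.d. and actually representative of "normal use conditions". A minimal sketch:

```python
import math


def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on the failure rate when 0 failures
    were observed in n_trials independent trials (exact Clopper-Pearson,
    which for zero failures reduces to 1 - alpha**(1/n))."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def trials_needed(target_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that 0 failures in n trials bounds the failure rate
    below target_rate at the given confidence level."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - target_rate))


if __name__ == "__main__":
    n = trials_needed(1e-4)             # target: less likely than 1 in 10,000
    print(n)                            # roughly 30,000 clean trials
    print(zero_failure_upper_bound(n))  # upper bound just under 1e-4
```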