https://arxiv.org/abs/2312.01276
Some interesting takeaways:
- Consider shortcuts models might use to ‘cheat’ their way to the correct answer: models may rely on something akin to heuristics based on statistical co-occurrence in the training data. Beware of potential shortcomings in benchmarks that allow this kind of ‘cheating’, and identify where it occurs
- Of course, make sure the eval is not in the training data, and watch for alternative mechanisms of solving the problem (e.g. in multiple choice, will the model pick a particular letter/numbered option just because of training-data distributions?)
- The need to compare model vs. human performance is often overlooked because of unexamined assumptions about how well humans actually do; when comparing, keep the test conditions fair to both
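One cheap sanity check for the multiple-choice shortcut above is to look at the distribution of a model's predicted answer letters: a heavy skew toward one option hints at label-prior exploitation rather than reasoning. A minimal sketch (the `letter_bias` helper and the sample predictions are hypothetical, not from the paper):

```python
from collections import Counter

def letter_bias(predictions, options=("A", "B", "C", "D")):
    """Return, per option, the count and fraction of times the model chose it.

    If one option's fraction is far above 1/len(options), the model may be
    exploiting answer-letter priors from training data instead of solving
    the questions.
    """
    n = len(predictions)
    counts = Counter(predictions)
    return {opt: (counts.get(opt, 0), counts.get(opt, 0) / n) for opt in options}

# Hypothetical predictions from a model on a 4-option benchmark
preds = ["A", "A", "C", "A", "B", "A", "A", "D", "A", "A"]
for opt, (count, frac) in letter_bias(preds).items():
    print(f"{opt}: {count} ({frac:.0%})")  # "A" dominates at 70% vs the 25% baseline
```

A stronger version of the same check is to shuffle the option order per question and see whether accuracy holds up; if it collapses, the benchmark was letting the model ‘cheat’.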