https://arxiv.org/abs/2312.01276
Some interesting takeaways:
- Consider shortcuts models might use to ‘cheat’ their way to the correct answer: models may rely on something akin to heuristics based on statistical co-occurrence in the training data. Beware of potential shortcomings in benchmarks that allow this kind of ‘cheating’, and identify where it occurs
- Of course, make sure the eval is not in the training data, and watch for alternative mechanisms of solving the problem (e.g. in multiple choice, will the model pick a particular letter/numbered option just because of training-data distributions?)
- The need to compare model vs. human performance is often overlooked because of unexamined assumptions about how well humans actually do; when comparing, keep the test conditions fair to both
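One cheap sanity check for the multiple-choice shortcut above is to look at the distribution of a model's predicted answer letters: a heavy skew toward one option hints at label-prior exploitation rather than reasoning. A minimal sketch (the `letter_bias` helper and the sample predictions are hypothetical, not from the paper):

```python
from collections import Counter

def letter_bias(predictions, options=("A", "B", "C", "D")):
    """Return, per option, the count and fraction of times the model chose it.

    If one option's fraction is far above 1/len(options), the model may be
    exploiting answer-letter priors from training data instead of solving
    the questions.
    """
    n = len(predictions)
    counts = Counter(predictions)
    return {opt: (counts.get(opt, 0), counts.get(opt, 0) / n) for opt in options}

# Hypothetical predictions from a model on a 4-option benchmark
preds = ["A", "A", "C", "A", "B", "A", "A", "D", "A", "A"]
for opt, (count, frac) in letter_bias(preds).items():
    print(f"{opt}: {count} ({frac:.0%})")  # "A" dominates at 70% vs the 25% baseline
```

A stronger version of the same check is to shuffle the option order per question and see whether accuracy holds up; if it collapses, the benchmark was letting the model ‘cheat’.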