Stop overbuilding evals

4 points by softwaredoug


tomhukins

Get people to tell you which search result listing they prefer. Do this blindly - don’t tell anyone which side is the new change and which is the old.

Presenting information blindly matters a lot. Around twenty years ago, the Yahoo search developers tested the results from their own search engine against results from Google. They discovered that users routinely preferred results presented with a Google logo rather than a Yahoo logo, regardless of which engine provided the results.
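Concretely, a blind side-by-side can be as small as something like this sketch (the function name and the input() stand-in for a rating UI are made up, just to show the shape): randomize which system lands on which side, hide the labels, and only map the rater’s pick back to a system afterwards.

    import random

    def blind_preference_trial(results_a, results_b):
        # Randomize which system lands on the left so raters can't pick up
        # a positional or branding cue.
        if random.random() < 0.5:
            left, right, left_label = results_a, results_b, "A"
        else:
            left, right, left_label = results_b, results_a, "B"

        print("Left:")
        for title in left:
            print("  " + title)
        print("Right:")
        for title in right:
            print("  " + title)

        choice = input("Better listing? [left/right/tie] ").strip().lower()
        if choice == "tie":
            return "tie"
        # Map the positional pick back to the hidden system label.
        if choice == "left":
            return left_label
        return "B" if left_label == "A" else "A"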

twotwotwo

Evals don’t only apply to AI, but to the extent you’re thinking about AI evals, a key thing right now is that all this stuff is super unreliable.

There comes a point where a search (or whatever) system is well-tuned and you have to be careful lest a change to improve results in use case A accidentally degrade the existing behavior in use case B and leave you subtly worse off on net in a way that only a careful evaluation with a lot of statistical power is going to catch.

But while use cases A, C, F-J, and V-Z are all broken, just grinding and trying to fix one use case at a time without obviously regressing the rest is often progress. And as the post points out, real use might uncover issues/dimensions/use cases you didn’t think to check.
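A rough sketch of what that per-use-case check can look like, assuming you have paired judgments (new vs. baseline) bucketed by use case - the sign-test arithmetic is standard, the data shape is invented for illustration:

    from math import comb

    def sign_test_p_value(wins, losses):
        # Two-sided sign test: how surprising this win/loss split would be
        # if the new system and the baseline were actually equally good.
        n = wins + losses
        if n == 0:
            return 1.0
        k = min(wins, losses)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    def regression_report(judgments_by_use_case, alpha=0.05):
        # judgments_by_use_case: use-case name -> list of pairwise outcomes,
        # +1 = new system preferred, -1 = baseline preferred, 0 = tie.
        for use_case, outcomes in judgments_by_use_case.items():
            wins = sum(1 for o in outcomes if o > 0)
            losses = sum(1 for o in outcomes if o < 0)
            p = sign_test_p_value(wins, losses)
            flag = "possible regression" if losses > wins and p < alpha else "ok"
            print(f"{use_case}: {wins}W/{losses}L (p={p:.3f}) -> {flag}")

With only a handful of judgments per use case the p-values stay large, which is exactly the "lot of statistical power" problem above.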

Evaluation can still be useful to decide e.g. whether something meets a baseline level of confidence to deploy when the stakes are high, or so you can do science to learn big-picture things you’d have trouble noticing looking response-by-response.
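For the "baseline level of confidence to deploy" case, one common shape (again just a sketch, not anything from the article) is to gate on a conservative lower bound of the measured pass rate rather than the raw number:

    from math import sqrt

    def wilson_lower_bound(successes, trials, z=1.96):
        # Lower edge of the Wilson score interval for a pass rate; small
        # samples get a wide interval instead of looking deceptively perfect.
        if trials == 0:
            return 0.0
        p = successes / trials
        denom = 1 + z * z / trials
        center = p + z * z / (2 * trials)
        margin = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
        return (center - margin) / denom

    def ok_to_deploy(successes, trials, required=0.90):
        return wilson_lower_bound(successes, trials) >= required

    # e.g. 48/50 looks like 96%, but the lower bound is about 0.865,
    # so a 0.90 bar says "not yet".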

But: with this stuff it’s easy for people to assume that they’re already at the stage where it’s time to polish and optimize while they’re in fact still at the stage of getting something kind-of working.