How should we peer review software?
38 points by MiraWelner
For software engineers who don't usually deal with the academic world and think scientific peer review is high quality: you couldn't be more wrong. Or rather, I suppose that depends on the field. But to take an example: currently in Computer Vision / Deep Learning, if you take three random peer-reviewed papers published in well-known journals or conferences and actually do the work of checking everything and reproducing the results, it is very likely that you will find something terribly wrong with one of them.
And indeed the code published with the paper is even worse: either it does something different from the paper, or something very important to the method lives only in the software and is not even discussed in the paper. Of course, the outputs of that software are used to draw the conclusions in the paper, which end up being wrong...
More generally, some (many) researchers fill papers with platitudes expressed as complex-looking mathematical equations to make them seem complex, when the core idea is often very simple, and that shows when you implement them in software properly. More often than not it ends up being a diff of a handful of lines against a popular model in Diffusers.
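To make that concrete, here is a minimal hypothetical sketch (the function names and the alpha parameter are invented for illustration, not taken from any specific paper) of what such a "novel method" can amount to once you strip away the equations: a one-line rescaling of the standard classifier-free guidance step that diffusion pipelines already use.

    import torch

    def cfg_step(noise_uncond: torch.Tensor, noise_text: torch.Tensor,
                 guidance_scale: float = 7.5) -> torch.Tensor:
        # The standard classifier-free guidance combination step.
        return noise_uncond + guidance_scale * (noise_text - noise_uncond)

    def cfg_step_paper(noise_uncond: torch.Tensor, noise_text: torch.Tensor,
                       guidance_scale: float = 7.5, alpha: float = 0.9) -> torch.Tensor:
        # A hypothetical "paper-worthy" variant: pages of derivation in the paper,
        # but the entire contribution is the extra alpha factor below.
        return noise_uncond + guidance_scale * alpha * (noise_text - noise_uncond)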
This is not exceptional, this is the norm. Of course not all researchers or labs are like that. If you want examples of really insightful and well-written papers in that field, look at Kaiming He's work. But unless you know the team, you cannot trust a paper until you review it, implement it and do your own ablation study.
Completely agreed. A few years ago there was a paper that (if memory serves) won the best paper award at one of the top compiler conferences for some novel techniques for vectorisation. They had worked from a version of LLVM that was quite old by the time their paper was accepted. I queried this, because there had been a lot of work in the vectoriser since then and this would affect their baseline (I also queried the lack of any error bars in their graphs, in a field that is known to have 10-30% error margins in experiments). But the work looked promising, so I tried to persuade them to upstream it. This required a chunk of work to update it and then it turned out that their work did not deliver a measurable speedup over what upstream had done by the time their paper was published.
And that's a pretty good paper. There was a panel at ISCA in 2014 (where we presented the first CHERI paper) on the role of simulation in computer architecture research. We went there expecting (at least one of the) people on the panel to advocate for actually building things, rather than relying on simulations. They did not. The debate was all about whether you needed to go as far as a detailed simulation or whether it was okay to just do some rough modelling. Someone from NVIDIA said something like 'we don't trust any evaluation from computer architecture papers. We don't even read it. We just look to see if the idea looks interesting and, if so, implement it ourselves'. Other folks from industry nodded along. Somehow an entire room full of people who consider themselves to be scientists did not take that as a damning indictment of their entire field.
A lot of systems papers are much worse. They start with a problem that isn't even really a problem: it's an artefact of how things happen to be implemented and is trivial to work around if it actually matters. And then they show that they can achieve some local optimum with a 20% speedup by making some change to Linux, ignoring the fact that there's an order-of-magnitude improvement that people are already making in production for the workloads that care about this optimisation axis.
I would really love to see papers become just a thing you do to communicate science, not a thing that you are evaluated on.
I would really love to see papers become just a thing you do to communicate science, not a thing that you are evaluated on.
This. If you read papers from 100+ years ago, they're much more like someone writing something down for their buddies 'cause they want to share something cool. That's literally the origin of scientific papers and journals: (rich) 16th-17th century European nerds getting together to publish a newsletter/mailing list with their nerdy stuff so their nerdy friends could read it. Getting comparable "actually cool research information presented in a useful way" these days often involves reading... the researcher's blog or other informal writing. The stuff they write down for their buddies because they want to share something cool.
But in the late 19th and 20th centuries science became Something That Built Weapons, so governments started funding it, and they needed a way to measure what they were funding. And so it turned into the multi-billion-dollar example of Goodhart's Law we have today.
I’m sure there’s a term for this that I’m forgetting. Let’s call it the rose-tinted rear-view mirror.
There have always been absolutely atrocious works published. They got debunked as time marched forward and people died.
So we are now left with the few works that weren’t completely fabricated and/or super sloppy.
There absolutely was crap. There were a lot of 'gentlemen scientists' who would pay to have their (absolutely terrible) research published.
But, for serious researchers, there wasn't a massive push to publish. If people did publish, it was because they wanted other people to know something, not because they wanted a promotion.
Part of the push to use publications for promotions was an effort to improve fairness. Promotions needed to be more objectively motivated. You can measure the number of papers someone has written, so that was a good first cut. But then it turned out that it was really easy to publish a load of crap. So then the metric became papers at top-tier venues. And that was a bit harder to game because you had to pass peer review, but it led to its own set of problems.
There have always been poorer papers, yes, but this isn't just a rosy view of the past.
The modern mandate to regularly publish new papers (or lose access to funding) means there's an absolute flood of drivel.
When I started at SAP Research, I at some point expressed frustration that I had a hard time reading many papers and drawing conclusions from them. My boss looked at me and said: "Have you considered that most papers are just not that good?" and handed me a stack of papers he considered good material to compare against. And suddenly, I was able to draw conclusions from papers ;). It sounds so simple, but it was a pretty central moment for me.
It's true elsewhere, too. I specialized in antibiotics and mycobacteria in grad school, which is a long way from computer science. I got to the point where my algorithm for working through a new paper in my field was:
At this point I might refill my tea and sit down to actually read it through.
complex-looking mathematical equations to make them seem complex when the core idea is often very simple
This also has the effect of making the paper impenetrable to those without a mathematical background. They would be so much easier to understand if they just walked the reader through the algorithm instead. Things only seem to become accessible when someone comes along and writes blog posts about them.
One of my favourite papers is about spinlocks and cache contention. It starts with the classic CAS-loop spin lock. Then it incrementally improves it to get to their final design.
The thing I really love about this paper is that they explain the problem so well that their solution is obvious. I had designed the same thing as their final solution by about halfway through. I would never have got to their solution without reading the first few pages of their paper.
A paper like that would really struggle to be published today. I've been a reviewer on papers like that and been the only 'accept' vote. All of the other reviewers' comments were about how the idea was probably not sufficiently novel.
Just requiring instructions that work to run the software on input data provided would be a huge first step.
See for instance the ACM’s artifact review and badging policy https://www.acm.org/publications/policies/artifact-review-and-badging-current
I have been involved in several artifact evaluation committees, both as a reviewer and as a chair. Artifact evaluation consists in running some software in some scenarios, to get experimental results that should be close to the ones presented in the accompanying research paper. We usually do not go deeper to check the actual behavior/correctness of the software.
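For what it's worth, the mechanical part of that comparison is simple; a minimal sketch of the kind of check involved (the metric names, numbers, and 5% tolerance are invented for illustration) might look like this:

    # Compare rerun results against the numbers reported in the paper.
    REPORTED = {"top1_accuracy": 0.742, "latency_ms": 18.3}  # from the paper (invented)
    TOLERANCE = 0.05                                          # accept 5% relative drift

    def check_reproduction(measured: dict[str, float]) -> list[str]:
        # Return a message for every metric that drifts more than TOLERANCE.
        failures = []
        for name, reported in REPORTED.items():
            drift = abs(measured[name] - reported) / abs(reported)
            if drift > TOLERANCE:
                failures.append(f"{name}: paper reports {reported}, rerun gives "
                                f"{measured[name]} ({drift:.0%} off)")
        return failures

    result = check_reproduction({"top1_accuracy": 0.731, "latency_ms": 24.9})
    print("\n".join(result) if result else "within tolerance")

The hard part is everything a check like this does not cover: whether the software actually does what the paper says it does.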
...and even if you do find "sloppy work" (or worse) as an artifact reviewer, you can't actually do anything except not award the badge, because the paper is already accepted at that point.
Well, there are conferences (like ECOOP) that have tried to do paper and artefact review at the same time: one year, some reviewers evaluated artifacts together with their corresponding papers. But it creates around 3x more work, since the artifacts of all submitted papers have to be reviewed, not just those of the accepted ones.
Yes, and you would have to include some sort of Docker image so you can’t have an “it works on my machine” situation.
Of course, knowing how academia works, most people would send a conda environment.
I talked with someone the other day who mentioned that even getting people to generate a list of the exact versions of packages they used was difficult.
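Which is a bit depressing, because producing that list needs nothing beyond the standard library; a minimal sketch:

    # Dump "name==version" pins for every package installed in the current
    # environment, so an artifact at least records what it actually ran with.
    from importlib.metadata import distributions

    def freeze() -> list[str]:
        return sorted(f"{dist.metadata['Name']}=={dist.version}" for dist in distributions())

    if __name__ == "__main__":
        print("\n".join(freeze()))

(pip freeze does essentially the same job; the point is that it's a one-command affair.)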
I'm a really big UV fan because it makes this easy. It's also why I hate Conda: this is so much harder than necessary there.