Test-case Reducers Are Underappreciated Debugging Tools
87 points by ltratt
87 points by ltratt
There's some skepticism here, but yes, from the interactions I had when reporting security bugs, I think that there are diagnostic tools like that many developers and security people (!) are blissfully unaware of. I think the main problem is that these features are just not "cool", so we don't talk about this stuff enough. Test case minimizers, in particular, are "write-only": they get reinvented quite often and almost never gain much traction.
When I shipped afl-fuzz, I included a couple of features along these lines. For example, the package contained a completely standalone, fast, execution-path-aware, zero-configuration test case minimizer (afl-tmin) that worked well without needing to be "taught" the syntax of your input data... but I'm pretty sure has gotten very little use - again, because even if it worked for you, it's not something you're gonna blog about?
In the same vein, I shipped a "crash exploration mode" designed for security folks: you give the tool a crashing test case and it catalogs all the execution paths it can get to while still keeping the program in a crashing state, which answers a key question in vuln research and exploit development. In my mind, this was at least as cool as the fuzzer itself, but I think maybe 10 people even knew about it.
Another hugely time-saving feature that took a loooong time to catch on and still isn't used enough: ASAN / MSAN and all the related sanitizers.
I'm legitimately curious - does anyone not appreciate the value of automatic test case reduction? "under appreciated" seems to imply there are people who don't always want a test case reduction? even if you can immediately identify the bug, you need a reduction for your regression tests?
The first line of the post is "Test-case reducers are less well known than they should be, [...]". That's certainly been my experience too, speaking as somebody who has been encouraging people to use fuzzing and property-based testing for years. Both of those often include some form of failure reduction / "shrinking", and it obviously makes them more practical.
I think "less well known" is a much more accurate one - I think there's an element of "I can see the fix locally" that often leads to "reducing this is more effort than I think is necessary", but I feel that that is due to people thinking in the "right now" rather than the long term. Anything the reduces the immediate effort in getting down to a usable saved regression test is a win.
I am vaguely familiar with it in the case of fuzzers: a fuzzer will find a failing case, then automatically reduce it. That part works great. My experience with fuzzing broadly though (mostly AmericanFuzzyLop and AFL++) is that it's an extreme pain to set up, so I mostly stay away.
Most bugs I encounter are also not really "it does the wrong thing given this input file", but more like "it does the wrong thing for some user somewhere". Sometimes I'm able to reduce that to "it does the wrong thing in certain conditions after doing a sequence of steps". 1) not quite sure how to use an automatic test case reducer on "the user doing a sequence of things", and 2) once I have some way to reproduce it locally, 99% of the debugging job is done.
I would guess that the author considers my lack of appreciation "under-appreciation".
it's an extreme pain to set up
If I had a hat, I'd use it, but: fuzzer developer here. What do you find difficult? Harnessing, build configuration, something else?
It's been too long for me to give a good answer, but here's an attempt:
Those are the AFL-specific issues. The more general issue is:
Both fuzzers and test case reducers necessarily assume that input is a file and that it's easy to detect pass/fail (fuzzers specifically assume that a hang, crash or memory corruption is a fail, everything else is a pass). This is not how real world bugs work.
The input is the state of the world (a database, networking on the host machine, GPU drivers, some number of back-end servers, ...) and a sequence of events (computer disconnects and reconnects, back-end sends WebRTC signalling messages, an ICE connection restart happens, it's time to render a new video frame, a hardware video encoder crashes, ...).
The output is a sequence of changes to the state of the world (writing to database, packets sent over the Internet, changes to internal in-memory state which will be reflected in the next screen refresh, changes to routing tables, ...).
Bugs look like, "the other side of this video call is unable to decode the H.264 stream properly and shows a dark green screen". Or "the GPU driver ignored our request to set the stride to be a multiple of 1 pixel and so when the other end sends video frames whose width aren't multiples of 4 pixels, we show a broken diagonal picture". Or "the database somehow entered a weird state on this one machine so that a migration failed". Or "we misinterpreted the response of the DHCP exchange and so we now have the wrong netmask and DNS isn't working for users who have a DNS resolver on their LAN".
I'm not sure how to fuzz this or run a test case reducer.
I often reduce on things that aren't segfaults or even crashes. For example, I might turn logging on, forcibly timeout the program after 2s, and then grep the log for a string /s sequence that I do/don't expect, and use that as the basis for a test. There are clearly limitations to this technique, but it often works well enough to be useful. Occasionally, I admit, it takes me a day or two to work out how I'm going to do this for a context I haven't seen before.
A few years ago we had a project doing differential testing of video accelerators. We did delta debugging by checking the set of hardware devices that exhibited differences. You can use a lot of really interesting levers as a proxy for fault localisation when doing test minimisation, always feels like a fun puzzle :)
Thanks, it's a good feedback signal from a user case I don't often hear from :) Most of the folks I talk to are deeply, deeply embedded in the space, so sometimes even basic usability issues get overlooked...
You're welcome! I don't want to come across as if I'm anti-fuzzing here, I have a lot of respect for it and have used it with great results in some projects. I think it excels at parsers (and a lot of software is either straightforwardly a parser or a parser in disguise).
I was surprised, looking back at it now, how much of the friction is literally just "it's not packaged by distros".
In theory, a more holistic approach is something like what Antithesis can provide (and a deterministic way to deal with the complexity that), but IIRC it's still hard to get your hands on, and for the case of client software, still probably unrealistic.
not quite sure how to use an automatic test case reducer on "the user doing a sequence of things"
You might be interested in searching for "stateful property based testing" or "state-machine property based testing", a la https://docs.rs/proptest-state-machine/latest/proptest_state_machine/
The idea is that you model your system as a set of transitions in a state machine, where each transition can influence the next transition, etc. You can check that certain invariants hold on each transition, or check for specific outcomes at the end of a trace.
Upon failure, the system shrinks the history of actions and gives you a trace of actions that lead to a bug/invariant violation.
Edit: changed link to the right library.
Not everything is suitable for test-case reduction, at least not in the specific context I laid out in my post (there are other contexts, but ... it was long enough already). One thing that is worth many of us thinking about is "what is our input?" I agree that it's not always "newline-separated files". Sometimes one can recast into that format, but but one thing that's surprising about Shrink Ray is that it works well enough to be useful on some random binary file formats! However, I agree with your deeper point: not all inputs can be practically/meaningfully cast into something that our current test-case reducers can really deal with.
Speaking personally I have never until now heard about automatic test case reduction.
So very grateful for articles like this that expose me to new things to think about!
This article and its examples are very much from the perspective of a compiler writer, despite saying that reducers should be more popular in non-compiler situations :-) As ~silentbicycle wrote, most test case reduction happens in the context of fuzzers or property-based testing, where reduction is built-in to a broader framework. Compilers are one of those oddities where it’s useful to have a stand-alone test case reducer. Dunno if there are other cases where stand-alone reducers are helpful?
The section on determinism is curious. It starts off with an example where the determinism is caused by the input file (a script) that provokes the bug, rather than being a feature of the buggy program (an interpreter). It isn’t clear to me if the article is saying that the interestingness technique also works in non-compiler situations when the buggy program itself is nondeterministic.
My advice for how to turn a testing problem into something that’s amenable to fuzzing and therefore to test case reduction is to create a collection of numbered imperative commands. Each command includes some lightweight consistency checks that can detect a test failure (if it doesn’t crash outright). It’s best to put heavyweight consistency checks in their own commands so they don’t slow down the testing too much. For simple randomized testing, the test harness chooses commands at random until something breaks. Then you can turn it into a fuzzer harness by using the fuzzy input byte stream to choose commands. And then you automatically get lots of good things like deterministic regression testing as well as test case reduction.
I haven’t had much success asking libfuzzer explicitly to reduce a test case, I guess because the input was reduced while libfuzzer was generating it. So I haven’t been encouraged to try out interestingness checks that help general-purpose fuzzers to reduce test cases. I’m curious if anyone else has had any success…
My advice for how to turn a testing problem into something that’s amenable to fuzzing and therefore to test case reduction is to create a collection of numbered imperative commands. Each command includes some lightweight consistency checks that can detect a test failure (if it doesn’t crash outright).
1000x this. I do this a lot. Make a symbolic representation of a stateful interface, make a simplistic (typically switch-case/match based) interpreter for that, randomly generate lists of operations to exercise it, and then reduce inputs that trigger asserts for violated pre/post-conditions or other internal corruption. It could be viewed as property-based testing, fuzzing, or lightweight model checking, but however you frame it it can be surprisingly effective at finding deep bugs. I've seen many stateful interfaces with operations that are individually correct yet have slightly misaligned assumptions, and when combined with each other in unexpected ways it compounds into internal corruption.
It's also handy to run the operation list side by side with a simplistic implementation (based on in-memory hash tables or lists) and check that results agree. Any divergence there often points to a bug or an edge case that should be better documented.
"Compilers are one of those oddities where it’s useful to have a stand-alone test case reducer. Dunno if there are other cases where stand-alone reducers are helpful?" - I first came across test case reducers working on some firefox bugs; they had https://github.com/MozillaSecurity/lithium/ as a reducer for crashes and assertions (from fuzzing but also from real bug reports), so browsers are another? But I'll admit, I've only reached for the tool a couple of times in the intervening 16 years or so, and it was for parsing problems.
I periodically work on transportation optimization, and often get quite complex scenarios where an invariant ends up be violated. I would kill for a test-case reducer, but settled for doing it manually in the past.[0]
Sadly, I think the data files are too complex for shrinkray to handle--we're ingesting multi-different "files" of tabular data with long-range dependencies, I'd have to encode my own domain knowledge of to do shrinking.
With the speed of AI development, I'll write a custom reducer the next time that I have a scenario like this.
[0] Dubious ontologies time: optimization problems are search problems that minimize cost, which is really just the same thing as a compiler, so this isn't a perfect example.
Related to this topic, property based testing uses a somewhat similar approach to "shrink" the state space of generated input that it uses to build a counter-example to your test.
The advantage of property based testing is that the search space can be guided and structured, such that your input becomes a set of transitions driving a state machine that models your program.
I'm always surprised to see the technique so underutilized, even in domains where there is an obvious fit: databases and distributed systems. Just the other week I spent less than a couple of hours building one such test at $WORK and quickly found out that our system was not convergent. The test spit out a nice trace I could show coworkers in a way that made immediate sense.
IMO it's the testing technique that offers the best ROI at debugging complex systems.
I read through this three times, trying to figure out how to apply it to tests written for pytest. I'll take a fourth pass at it when I'm not at work, because I would love to reduce the complexity of my test suites.
If you’re using Python then the first step is to adopt Hypothesis, which has test case reduction built in.
This technique won't really reduce the size/complexity of your test suite, but e.g. if you have a large user-provided program that crashes your library, ShrinkRay et al. will (try to) mercilessly slice and dice the user-provided program into a succinct regression test.
Incidentally, I was just reading https://www.st.cs.uni-saarland.de/papers/tse2002/tse2002.pdf which may interest commenters here
Last year when looking at some CI test ordering issue I wrote a tool to help me shrink the list of tests down, basically trying to slice lines in half and the run things.
The script itself is quite buggy but it was cool going from a list of 5000 tests go down to a list of like 4 tests that was causing my concurrency bug.
Very curious if Shrink Ray would have "just" worked for my case. I legit think that "shrink a set of lines down based on a test" is something that should be part of the set of "standard" command line tooling