I didn't mention it in the blog, but Paolo Bonzini was helping me and suggested I run the bootbootboot test for 24 hours, to make sure the bug wasn't latent in the older kernel. I got bored after 21 hours, which happened to be 292,612 boots.
Maybe it would have failed on the 292,613th boot ...
Thanks for mentioning me, but you really did the work!
But in order to contribute something useful: as a rule of thumb, you want 10 times as many passes as failures before you reject a commit. If a bug has taken up to 2500 runs to reproduce, don't consider it a pass until 30000 runs have succeeded.
It's something to do with Poisson distributions. If you have $n$ runs before a failed run on average, and you want to be $P$% certain that a fix (including a revert or moving beyond the bug in a bisect) reduced the failure rate, you can use the formula $-n \ln(1 - P/100)$ for how long to run, and the factor for $P = 99.99$ is about 10.
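To make the rule of thumb concrete, here is a minimal Python sketch of that formula. The function name `runs_needed` and its parameter names are made up for illustration, and it assumes the average number of runs before a failure is already known from earlier testing.

```python
import math


def runs_needed(mean_runs_to_failure: float, confidence_pct: float) -> int:
    """Passing runs required before believing the failure rate went down.

    Implements -n * ln(1 - P/100), rounded up to a whole run.
    mean_runs_to_failure is n, confidence_pct is P (strictly between 0 and 100).
    """
    if mean_runs_to_failure <= 0:
        raise ValueError("mean_runs_to_failure must be positive")
    if not 0 < confidence_pct < 100:
        raise ValueError("confidence_pct must be strictly between 0 and 100")
    return math.ceil(-mean_runs_to_failure * math.log(1 - confidence_pct / 100))


if __name__ == "__main__":
    # The multiplier on n for a few confidence levels.
    for pct in (95, 99, 99.99):
        factor = -math.log(1 - pct / 100)
        print(f"P = {pct}%: factor {factor:.2f}")

    # A bug that shows up once every 2500 runs on average.
    print(runs_needed(2500, 99.99))
```

For $P = 99.99$ the multiplier comes out a little over 9, which is where the "about 10" comes from. If 2500 is the worst case you have seen rather than the average, the answer is more conservative than it needs to be, which errs on the safe side.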
In fact, that means that once you had landed on a merge commit, it was probably much better to switch to a linear backwards search, because it might involve fewer passing runs, and passing runs are 10-15 times more expensive than failures. Is that what you did?
I've been on a similar quest for hard-to-reproduce timing/hardware/... bugs, and if you're facing any kind of skepticism (your own or otherwise), it can be very comforting to have run for 10x or even 100x the usual time to failure without a failure occurring.
It's particularly comforting when the reason for the failure/fix/change in behavior isn't completely understood.
If the bug occurs reasonably often, say usually once every 10 minutes, you can model an exponential distribution of the intervals between the bug triggering and then use the distribution to "prove" the bug is fixed in cases where the root cause isn't clear: https://frdmtoplay.com/statistically-squashing-bugs/
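As a rough illustration of the exponential model (not necessarily the exact method in the linked post), the sketch below computes how likely a failure-free stretch would be if the bug were still present. The function name `chance_of_quiet_period` is hypothetical, and it assumes the intervals between failures are exponentially distributed with a known mean.

```python
import math


def chance_of_quiet_period(quiet_minutes: float,
                           mean_minutes_between_failures: float) -> float:
    """Probability of seeing no failure for quiet_minutes if the bug is NOT fixed.

    Assumes exponentially distributed intervals between failures, so the
    survival probability is exp(-t / mean).
    """
    if quiet_minutes < 0:
        raise ValueError("quiet_minutes must be non-negative")
    if mean_minutes_between_failures <= 0:
        raise ValueError("mean_minutes_between_failures must be positive")
    return math.exp(-quiet_minutes / mean_minutes_between_failures)


if __name__ == "__main__":
    # Bug usually triggers about once every 10 minutes; we watched for an hour.
    p = chance_of_quiet_period(60, 10)
    print(f"Chance of a clean hour with the bug still present: {p:.4f}")
```

With a 10 minute mean, a clean hour would happen by chance only about 0.25% of the time, so an hour without a failure is already fairly strong evidence that something changed, even when the root cause isn't clear.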