I didn't mention it in the blog, but Paolo Bonzini was helping me and suggested ...

bonzini · on June 15, 2023

Thanks for mentioning me, but you really did the work!

But in order to contribute something useful: as a rule of thumb, you want to have 10 times as many passes as failures in order to reject a commit. If a bug has taken up to 2500 runs to reproduce, don't consider it a pass until 30000 runs have succeeded.

It's something to do with Poisson distributions. If you have 𝑛 runs before a failed run on average, and you want to be 𝑃% certain that a fix (including a revert or moving beyond the bug in a bisect) reduced the failure rate, you can use the formula −𝑛 ln(1 − 𝑃/100) for how long to run, and the factor for 𝑃 = 99.99 is about 10.
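
A rough sketch of that formula in Python, in case it helps anyone plug in their own numbers. The function name `runs_needed` and its argument names are made up for illustration, and it assumes failures are independent with a constant per-run failure rate:

```python
import math


def runs_needed(mean_runs_to_failure: float, confidence_pct: float) -> int:
    """Consecutive passing runs required before trusting a fix.

    Implements -n * ln(1 - P/100), rounded up to a whole run.
    Assumes independent runs with a constant failure rate.
    """
    if mean_runs_to_failure <= 0:
        raise ValueError("mean_runs_to_failure must be positive")
    if not 0 < confidence_pct < 100:
        raise ValueError("confidence_pct must be strictly between 0 and 100")
    factor = -math.log(1 - confidence_pct / 100)
    return math.ceil(mean_runs_to_failure * factor)


if __name__ == "__main__":
    # A bug that shows up about once per 2500 runs, at 99.99% confidence.
    print(runs_needed(2500, 99.99))
```

For 𝑛 = 2500 and 𝑃 = 99.99 this comes out at roughly 23,000 runs (a factor of about 9.2), which is where the "about 10" rule of thumb comes from.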

In fact that means that once you had landed on a merge commit it was probably much better to switch to a linear backwards search, because it might have fewer passing runs and passing runs are 10-15 times more expensive than failures. Is that what you did?

rwmj · on June 15, 2023

> it was probably much better to switch to a linear backwards search

Ha ha, nope! I tested each commit starting at the earliest, and it was the last one in the merge :-(

opello · on June 14, 2023

I've been on a similar quest for hard to reproduce, timing/hardware/... bugs, and if you're facing any kind of skepticism (your own or otherwise) it can be very comforting to have a 10x or even 100x no failure occurred confidence.

It's particularly comforting when the reason for the failure/fix/change in behavior isn't completely understood.

bsilvereagle · on June 14, 2023

If the bug occurs reasonably often, say usually once every 10 minutes, you can model an exponential distribution of the intervals between the bug triggering and then use the distribution to "prove" the bug is fixed in cases where the root cause isn't clear: https://frdmtoplay.com/statistically-squashing-bugs/
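
A minimal sketch of the idea, assuming failure intervals follow an exponential distribution with a known mean. This is not necessarily the exact method in the linked article, and the function names are hypothetical:

```python
import math


def prob_quiet_if_still_broken(quiet_time: float, mean_interval: float) -> float:
    """Chance of seeing no failure for `quiet_time` if the bug is NOT fixed.

    Under an exponential model with mean interval `mean_interval`,
    this is exp(-quiet_time / mean_interval). Both arguments must use
    the same time unit (e.g. minutes).
    """
    if quiet_time < 0 or mean_interval <= 0:
        raise ValueError("quiet_time must be >= 0 and mean_interval > 0")
    return math.exp(-quiet_time / mean_interval)


def quiet_time_needed(mean_interval: float, alpha: float) -> float:
    """Failure-free time needed so that the chance above drops to `alpha`."""
    if mean_interval <= 0 or not 0 < alpha < 1:
        raise ValueError("mean_interval must be > 0 and 0 < alpha < 1")
    return -mean_interval * math.log(alpha)


if __name__ == "__main__":
    # Bug used to trigger about once every 10 minutes; accept a 1% chance
    # of being fooled by a lucky quiet stretch.
    print(quiet_time_needed(10, 0.01))
```

With a 10-minute mean interval and a 1% tolerance, that works out to about 46 minutes of failure-free running.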

quickthrower2 · on June 14, 2023

I think your p value is pretty good here

bonzini · on June 15, 2023

With about 1000 runs to reach a failure I think he has p=0.000001 or something like that.

x86x87 · on June 15, 2023

this is unacceptable :):):) only 21 hours!