Edited that claim, and made several clarifications elsewhere. The whole point of this analysis is that outrage is unjustified on the basis of two totally statistically unremarkable releases that no one would have remarked on pre-AI (my further proof of this is that there was a pre-AI remarkably broken release, and no one did comment!) and zero positive evidence outside cherry-picked anecdotes for any negative impact. We should wait for outrage and version pinning and cancelation until there is evidence, no? I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently; I'm not trying to build any kind of predictive model for future Claude releases to say anything grander than "these specific releases are fine, what are we freaking out about?", not some claim about what Claude-exposed releases will look like or trend like in the future or in general.
There is a lot more context to the outrage which is missing from your analysis. People have multiple reasons to be mad at AI usage; you mention some of them in your introduction, and you put a (statistically insignificant) measure on only one of them. In your analysis you have shown that exactly one of these reasons is anecdotal. That does not mean they are wrong, and it especially does not mean they are unjustified.
That you found a single pre-AI release which did not cause outrage is proof of nothing. This single release is equally anecdotal, and statistically insignificant.
So, the biggest context that is missing here is that people hate AI for various reasons, and they don't want their favorite tools to fall victim to AI for equally many reasons. It is only natural that people who hate AI react this way when they find out their favorite tool uses AI, and doubly so when they sniff a correlation between their favorite tool's use of AI and bugs.
> I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently.
Well, there is no evidence against harm either. But what you did here is a bit of a sleight of hand. In your analysis your null hypothesis is: “There is no difference in bug count between releases which include code commits from Claude Code and releases which don't”. (You then go about doing what every psychology major is taught not to do: find evidence for the null hypothesis, not against it.) However, what hypothesis testing is for is to use a representative sample to generalize over a wider population. You do hypothesis testing because you want to demonstrate that your sample is representative of a wider population, and that you did not just happen to pick, by random chance, two samples which show the effect regardless of the experiment.
By calculating the p-values you were telling me that you were in fact ready to make generalizing statements over a wider population of commits, but your results were statistically insignificant, so really you should not draw any conclusions from them. You have not, in fact, shown that they aren’t different from the rest of the population.
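To illustrate why an insignificant result from two releases says so little, here is a minimal sketch of a power calculation. It assumes bug counts per release are Poisson distributed and uses a one-sided exact test on the total count. The rates (3 bugs per release under the null, 4.5 if there really were 50% more bugs) are made-up numbers for illustration only, not figures from the analysis, and the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X == k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


def critical_count(null_mean: float, alpha: float = 0.05) -> int:
    """Smallest total bug count whose one-sided p-value falls below alpha."""
    k = 0
    while poisson_sf(k, null_mean) >= alpha:
        k += 1
    return k


def power(n_releases: int, null_rate: float, true_rate: float,
          alpha: float = 0.05) -> float:
    """Probability of rejecting the null when the true rate is true_rate."""
    k_crit = critical_count(n_releases * null_rate, alpha)
    return poisson_sf(k_crit, n_releases * true_rate)


# Hypothetical rates: 3 bugs/release normally, 4.5 bugs/release if harmed.
for n in (2, 20):
    print(f"{n} releases: power = {power(n, 3.0, 4.5):.2f}")
```

Under these assumed rates, the test would miss a real 50% increase more often than not with only two releases, while twenty releases would catch it most of the time. A non-significant p-value at n = 2 therefore cannot distinguish "no effect" from "an effect we had little chance of detecting".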
> In your analysis you have shown that exactly one of these reasons is anecdotal.
This was actually the convincing one for me though. “Did AI increase the rsync bug rate? Dunno, can’t tell yet” seems like a fine conclusion to me. Plenty of people in this thread and previous ones on the topic seem convinced one way or another, so it’s nice to see actual numbers.
The numbers are statistically insignificant though. So you cannot use them to generalize over a wider population.
I think in this era of scientific literacy people tend to overcorrect in the absence of evidence. Anecdotal evidence is still evidence though, and people are right to react to it.
If we remove our frequentist hats and put on our Bayesian hats (which is a wise thing to do when n is very low) we can take into consideration evidence from multiple directions at the same time as we update our beliefs. A Bayesian might start with the prior that Claude-assisted commits have the same distribution as non-assisted commits. I would start with a Poisson distribution as my prior; they would then factor in all the evidence of AI slop they have seen in their lives and update their posterior accordingly. Claude Code has been wrong about so many things in the past, which should contribute to a smaller lambda than the control group.
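As a rough sketch of what such an update could look like: assume bug counts per release are Poisson with unknown rate lambda, and put a Gamma prior on lambda. The Gamma prior is my assumption here (it is the standard conjugate choice for a Poisson rate, not something specified above), and the prior parameters and observed counts are invented purely for illustration. The class name `GammaBelief` is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GammaBelief:
    """Belief about a Poisson rate, as a Gamma(shape, rate) distribution."""
    shape: float
    rate: float

    @property
    def mean(self) -> float:
        return self.shape / self.rate

    @property
    def variance(self) -> float:
        return self.shape / self.rate ** 2

    def update(self, counts: Sequence[int]) -> "GammaBelief":
        """Conjugate update: add observed bugs to shape, releases to rate."""
        return GammaBelief(self.shape + sum(counts), self.rate + len(counts))


# Hypothetical prior: about 3 bugs per release, held fairly loosely.
prior = GammaBelief(shape=6.0, rate=2.0)

# Hypothetical bug counts for two Claude-exposed releases.
posterior = prior.update([5, 4])

print(f"prior mean {prior.mean:.2f}, variance {prior.variance:.2f}")
print(f"posterior mean {posterior.mean:.2f}, variance {posterior.variance:.2f}")
```

With only two observations the posterior stays close to the prior, which is the point: at very low n, whatever prior evidence you bring (anecdotal or otherwise) dominates the conclusion.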