OK, so maybe we're agreed: you can bet on his abilities in one election. Let's say he's been right 95% of the time (I don't know if that's true) and we believe that's likely to continue, knowing nothing except "this is an election and Silver's predicted the result."
Then if he says "Hillary is likely to win" we can have 95% confidence he's right.
If he says "Hillary has an 80% chance of winning" we ignore the 80, and just observe that it's more than 50.
It's a bit more flexible than that, though. Rather than collapsing 80% vs. 70% into a bare up-or-down call, you can let him predict a few different events in a row, add up the errors, and see how far off he is overall (rough sketch below).
Or, if you do want to review the past, you can look at the error for a category of elections, or for a single year, rather than for his whole prediction career.
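One standard way to "add up the errors" is a quadratic scoring rule, the Brier score. A minimal sketch, with invented forecasts:

```python
# Brier score: mean squared error between stated probabilities and outcomes.
# All numbers below are made up for illustration.
def brier_score(forecasts, outcomes):
    return sum((p - won) ** 2 for p, won in zip(forecasts, outcomes)) / len(forecasts)

forecasts = [0.80, 0.70, 0.55, 0.90]  # hypothetical probabilities for each event
outcomes  = [1, 0, 1, 1]              # what actually happened (1 = occurred)

print(f"Brier score: {brier_score(forecasts, outcomes):.3f}")
# 0 is perfect; always saying 50% scores 0.25, so below that beats a coin flip.
```

Unlike raw accuracy, this rewards saying 90% rather than 60% when you turn out to be right, and punishes it when you're wrong.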
> One could note the number of times that a 25% probability was quoted, over a long period, and compare this with the actual proportion of times that rain fell.
It still depends on many samples, or "over a long period" as your doc puts it.
You can't escape the fact that there are only one or two samples, no matter how much math you throw around.
> You can't escape the fact that there are only one or two samples, no matter how much math you throw around.
That depends on what question you're asking. "How well calibrated are the electoral predictions that FiveThirtyEight makes?" is a sensible question with a lot of data points. It speaks directly to the crowing about the one bad call, and it's well suited to a scoring rule, which lets you compare people making predictions about the same events.
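A rough calibration check, in the spirit of the rain example quoted above: group forecasts by the probability quoted, then see how often the event actually occurred in each group. The data here is invented.

```python
from collections import defaultdict

# (quoted probability, did it happen?) for many forecasts; made-up data.
forecasts = [(0.25, 0), (0.25, 1), (0.25, 0), (0.25, 0),
             (0.80, 1), (0.80, 1), (0.80, 0), (0.80, 1)]

buckets = defaultdict(list)
for p, happened in forecasts:
    buckets[p].append(happened)

for p, results in sorted(buckets.items()):
    actual = sum(results) / len(results)
    print(f"quoted {p:.0%}: happened {actual:.0%} of the time (n={len(results)})")
# Well-calibrated forecasts: quoted and actual frequencies converge as n grows.
```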