While I agree with the spirit of the article, this is one of those cases where a Bayesian treatment is conceptually much clearer.
Assume that ratings are being generated by a stable stochastic process whose underlying distribution is multinomial (ignoring the ordinal character of ratings, for the time being) and use a Dirichlet conjugate prior. This gives you a posterior distribution over new ratings for an item. The benefit of a posterior here is that it lets you rank items by thinking in terms of the probability that a viewer, at random, would rate one item higher than another. By adjusting the magnitude of the alpha parameter of the Dirichlet prior, you adjust your sensitivity to small numbers of observations. A small initial alpha will lead to rapid changes in the posterior upon observing ratings, whereas a large alpha requires a significant body of evidence.
The best part of the multinomial model with a conjugate Dirichlet prior is that the math is REALLY simple. The normalizing constant for the Dirichlet distribution looks scary when stated in terms of the gamma function, but since this is the discrete case, just pretend that everywhere you see gamma(x) it is replaced with (x - 1)!, and you will be fine.
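To make that concrete, here is a minimal Python sketch of the idea. The prior values and rating counts below are made up for illustration, and "rank one item above another" is read here as comparing expected star ratings under posterior draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_a_ranks_above_b(counts_a, counts_b, prior_alpha, n_samples=20000):
    """Monte Carlo estimate of P(item A's expected rating > item B's),
    using Dirichlet posteriors: posterior alpha = prior alpha + observed counts."""
    stars = np.arange(1, len(prior_alpha) + 1)
    post_a = np.asarray(prior_alpha, float) + np.asarray(counts_a, float)
    post_b = np.asarray(prior_alpha, float) + np.asarray(counts_b, float)
    mean_a = rng.dirichlet(post_a, n_samples) @ stars  # expected stars per posterior draw
    mean_b = rng.dirichlet(post_b, n_samples) @ stars
    return float(np.mean(mean_a > mean_b))

prior = [1, 1, 1, 1, 1]              # weak prior: one imaginary rating per star
new_item = [0, 0, 0, 1, 2]           # 3 ratings, mostly 4-5 stars
established = [50, 60, 80, 120, 90]  # 400 ratings
print(prob_a_ranks_above_b(new_item, established, prior))
```

With a weak prior like this, a handful of high ratings already moves the new item a lot; scale up the alphas and it takes far more evidence to budge the ranking.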
Let me know if you would like to learn more, I would be happy to help.
You could also go a step further down the Bayesian path: infer alpha itself from the data on your site, and introduce a loss function on your ordering.
Or you could do a semi-frequentist thing and simplify your math by using MAP estimates to rank. Basically instead of score = #pos/(#pos + #neg), it becomes score = (#pos+x)/(#pos+x + #neg+y), where you choose x and y to suit your needs. You could choose x/y in proportion to the average number of up/down votes on your site or you could even choose x/y in proportion to the average number of up/down votes of the author of the post. That would rank posts of trolls lower than posts of good users. By varying x and y you can tweak the strength of this effect. You can interpret this as giving each item by default x upvotes and y downvotes.
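For instance, a minimal sketch of that scoring rule (the default x = y = 2 is just a made-up choice; in practice you'd set them from your site-wide or per-author averages):

```python
def map_score(pos, neg, x=2.0, y=2.0):
    """Score with x imaginary upvotes and y imaginary downvotes baked in.
    x = y = 2 is an arbitrary default; derive them from site or author averages."""
    return (pos + x) / (pos + x + neg + y)

print(map_score(1, 2))        # few votes: pulled toward the default ratio x/(x+y)
print(map_score(1000, 2000))  # many votes: close to the raw fraction 1/3
```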
This certainly works much better than the formula in the article. For example, if a post has 1 upvote and 2 downvotes, his formula will say it should be ranked lower than a post with 1000 upvotes and 2000 downvotes (because he's using the lower bound of the confidence interval). Obviously that's bad, because while the first post could be a good one, we know for certain that the second one isn't. In general his method will rank posts with a low number of votes very low, even compared to posts with a high number of downvotes.
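For anyone who wants to check that claim, here is a quick sketch of the Wilson lower bound the article uses (z = 1.96 for a 95% interval):

```python
from math import sqrt

def wilson_lower_bound(pos, neg, z=1.96):
    """Lower bound of the Wilson score interval, i.e. the article's ranking score."""
    n = pos + neg
    if n == 0:
        return 0.0
    phat = pos / n
    return (phat + z*z/(2*n) - z*sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

print(wilson_lower_bound(1, 2))        # ~0.06
print(wilson_lower_bound(1000, 2000))  # ~0.32, so this post indeed ranks higher
```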
Absolutely! I just wrote a reply where I alluded to that, unfortunately I didn't refresh and see this post or I would have just plugged you instead.
The benefit of the Bayesian treatment here that I want to drill down on is how natural it is to adjust the prior to capture your beliefs about how items should be perceived in the presence of incomplete information. The frequentist approach is fine, but it does not provide such a pleasant, intuitive knob to tune.
Well, it depends on how you look at it. The Bayesian MAP approach is basically the same as the frequentist approach with made-up data. Instead of using the maximum likelihood estimator p = pos/(pos+neg) on the raw counts, you pretend that each new post already has some up/down votes by default, and then you apply the same formula to the padded counts. Seems like an even more intuitive explanation of the Bayesian knob to me! Rather than an abstract "alpha" to a "Dirichlet prior", you get something concrete (the number of made-up votes). And you get a simple formula, which some would find desirable.
But I agree that the Bayesian approach is conceptually much cleaner. IMO the frequentist approach is just computational corner-cutting for when the math in the Bayesian approach gets too involved, which is sometimes useful. What's nice about the Bayesian approach is that you state your assumptions and then it's just turning the math machinery. In contrast, in the frequentist approach the assumptions are interwoven and hidden in arbitrary choices in how the math is done. (And then they claim that Bayesians are subjective! It's just that Bayesians admit they are subjective; frequentists hide their subjectivity in the math.) The not-so-nice thing is that turning the math machinery is not always so easy and does not always produce fast algorithms. That's where maximum likelihood and friends come in, but I'd view them as an approximation to Bayesian methods.
Yes, the explanation of "imaginary votes" is by far the simplest way of thinking about the situation. Having a mathematical formalism and rigorously studied methods behind it makes me feel OK about doing it, though; otherwise I would be very uncomfortable with the approach :)
I love Bayesian statistics from a conceptual point of view, but the ease with which one ventures into the land of analytic intractability kind of puts me off more complex models. MCMC is such a clumsy tool (in addition to taking forever); variational methods look interesting to me but I don't really feel they are quite there yet.
I read the article, went "The ^&#$?" and came here to post this, more or less. It's amazing the $&#! people will get up to when they don't know Bayesian statistics.
That really isn't fair to what this is doing. If a Bayesian wanted to lower bound the distribution, they would be using exactly the same formula, but with an explicit prior. The typical Bayesian solution to this, the Dirichlet prior, has a significantly different convergence rate (1/n vs 1/sqrt(n)), so the resulting ordering is substantially different as well, and I would argue worse for this application. The method of setting a pessimistic prior cares much less about the number of votes for a popular item than the method of lower bounding the distribution does.
If you think most comments are bad, and you have an explicit prior that reflects this, then your ranking will update to "Hey, the comment was actually good!" at the optimal speed - no more, no less - given the rate of incoming ratings. Why would any other rate of updating be better?
Because there's high variance on your estimate, so in a lot of circumstances you'll end up showing users bad comments at the top just by chance, rather than the good comment that you are certain of.
In this context, the pessimistic prior is more of a hack than the confidence interval, because your mental model isn't that most comments suck. Your mental model is that most of them are ok. The pessimistic prior is just there to say, if 1000 people confirm that a comment is ok, we should show that over a comment for which we don't know whether it's ok or not yet, and this reasoning is much better modeled by a lower bound.
Here's a paper proposing a solution in that space, and which also compares itself to the article linked here (kind of nice to see... papers sometimes fail to cite stuff that's "only" posted online rather than properly published, even if the authors know about it and it's quite relevant): http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...
I emailed Miller a while ago to see what he thought of this reply, and he thought it also seemed like a reasonable approach. But, in his view, the criticisms of his method within their framework include things that in practice he sees as features. In particular, they view the bias caused by using the lower bound as a bug, but he prefers rankings to be "risk-averse" in recommending, avoiding false positives more than false negatives. Of course, that biased preference could also be encoded explicitly in a more complex Bayesian setup, which would also be a bit more principled, since you could directly choose the degree of bias, instead of indirectly choosing it via your choice of confidence level on the Wilson score interval.
I don't think you have to resort to any overly complex machinery to achieve similar behavior. The simplest approach is to just use a non-uniform prior. His pessimistic bound could be emulated by having an initial alpha that places more weight on low star ratings, the intuitive interpretation being, roughly, "things are probably bad unless proven good". Another option would be to generate the prior based on the posterior distributions of other items: just take the distribution of rating observations for all products of a given type (perhaps only items produced by that company?) to get a sensible prior on a new item in that category.
The strength of priors here is that it is very easy to take intuitions and encode them statistically, in an understandable way. Taking the lower bound of a test statistic doesn't admit much in the way of intuition.
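A rough sketch of both options (the pooled star counts and the concentration of 5 imaginary ratings are made-up numbers, just to show the shape of the priors):

```python
import numpy as np

def empirical_prior(category_counts, concentration=5.0):
    """Dirichlet prior built from the pooled rating histogram of existing items
    in a category, rescaled to carry the weight of `concentration` imaginary ratings."""
    counts = np.asarray(category_counts, dtype=float)
    return concentration * counts / counts.sum()

def pessimistic_prior(n_stars=5, concentration=5.0):
    """'Probably bad unless proven good': most of the imaginary mass on 1 star."""
    alpha = np.full(n_stars, 0.25)
    alpha[0] = concentration - alpha[1:].sum()
    return alpha

print(empirical_prior([400, 300, 500, 900, 700]))  # pooled category counts, made up
print(pessimistic_prior())                          # [4.0, 0.25, 0.25, 0.25, 0.25]
```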
I agree the lower bound of a test statistic is a pretty indirect way of encoding intuitions. Somehow I tend to find loss functions the conceptually clearest way of encoding preferences about inference outcomes, though. But, in this case my feeling in that direction isn't very strong, and the priors-based solution seems fine.
There's a distinct difference in the asymptotic behavior though between the lower bound and the prior. The lower bound goes to the mean as 1/sqrt(n), the prior goes to the mean as 1/n.
That makes for a pretty significant difference in practice, and I'm not sure which is preferable.
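A quick numeric illustration of that difference, holding the up/down ratio fixed at 1:2 and using made-up pseudocounts x = y = 2 for the prior-based score:

```python
from math import sqrt

def map_score(pos, neg, x=2.0, y=2.0):
    return (pos + x) / (pos + x + neg + y)

def wilson_lower_bound(pos, neg, z=1.96):
    n = pos + neg
    phat = pos / n
    return (phat + z*z/(2*n) - z*sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

# Fixed 1:2 up/down ratio; how fast does each score close the gap to the true mean 1/3?
for n in (3, 30, 300, 3000):
    pos = n // 3
    neg = n - pos
    print(n, round(abs(map_score(pos, neg) - 1/3), 4),
             round(abs(wilson_lower_bound(pos, neg) - 1/3), 4))
```

The prior-based gap shrinks by roughly a factor of 10 per decade of votes, the lower-bound gap only by about a factor of 3, i.e. roughly 1/n versus 1/sqrt(n).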
You are absolutely correct that they are not mathematically identical. I struggled to word it in a way that would not mislead people; the distinction is important to emphasize.
It has a really big effect I think on the tone of what gets selected at the top. The lower bound prefers things that are preferred by a majority and very popular. The prior method prefers things that are completely un-objectionable and liked by just enough people to be sure of that. My hunch is that with the lower bound you get more interesting things bubbling to the top because it puts a stronger emphasis on popularity.
In all of these models, the giant variable that is completely ignored is the actual choice to rate something at all, versus skipping over it and reading the next one. That's a very significant decision that the user makes. The behavior of each of these systems w.r.t that effect will be the dominant thing differentiating them.
Can anybody replicate their result in Proposition 5? My tests contradict their conclusion that the total score is not monotonic (i.e. when I test, I do find that the total score is monotonic).
I am a beginning stats student. I took the AP class in high school and have been doing independent work since: predicting voters, playing with data sets, using a bit of Python to try to make general solutions, and other non-serious but non-trivial projects. Bayesian solutions have always required more work from me, even when they seemed the optimal way to solve a problem. I still don't entirely understand how to work with continuous distributions of hypotheses, partially because I lack the deep mathematical intuition at the moment. I don't blame the math, but I do think that much of the time the Bayesian treatment can be more challenging. I suspect that this is why non-Bayesian techniques remain so popular.
What he said! I'd like to add that a key to squeezing more out of NathanRice's post is the phrase "conjugate prior." Another totally natural thing would be to use a Gaussian prior & likelihood, then update the posterior as ratings arrive. This would take advantage of the ordinality of ratings as NR suggests. Bishop's Machine Learning book goes into this sort of stuff in more depth.
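As a rough sketch of what that Gaussian update looks like (treating the 1-5 stars as real numbers, with a made-up prior mean of 3 and unit variances):

```python
def gaussian_posterior(ratings, prior_mean=3.0, prior_var=1.0, noise_var=1.0):
    """Conjugate Gaussian update of an item's mean rating, with known noise variance.
    The prior mean of 3 stars and the unit variances are made-up choices."""
    prior_prec = 1.0 / prior_var
    noise_prec = 1.0 / noise_var
    post_prec = prior_prec + len(ratings) * noise_prec
    post_mean = (prior_prec * prior_mean + noise_prec * sum(ratings)) / post_prec
    return post_mean, 1.0 / post_prec  # posterior mean and variance

print(gaussian_posterior([5, 4, 5]))        # few ratings: pulled toward the prior mean
print(gaussian_posterior([5, 4, 5] * 100))  # many ratings: close to the sample mean ~4.67
```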
That's heavy on mathematical lingo and I confess I didn't understand it. Can you recommend an online resource that would explain what you just wrote in a newbie-friendly manner?
I would absolutely love to learn more. I've been trying to solve some novel NLP and machine learning problems lately but my lack of statistical knowledge is becoming apparent the further along I get.
Do you have any recommendations for a good introductory treatment of Bayesian statistics?
By far my favorite book on the subject is conveniently available for free on the internet! "Information Theory, Inference and Learning Algorithms" by David MacKay is wonderfully written, well paced and comprehensive. If you like the book, you should purchase a copy, David is a great guy.
I want to second MacKay's book. I had a terrible statistics class in college. We spent the entire time looking up tables of p-values and t tests, without a very convincing explanation as to why. The entire topic was damaged for me from then on until I read MacKay online, and then bought the paper version.
His book starts from first principles -- simple ideas about probabilities -- and it builds a foundation for understanding Bayesian methods. And he explains, basically, why the p-values and t-test stuff is a bunch of crap, which was an immense relief.