Thanks to the internet†, I can reveal a surprise: with the same methodology, the seven real Harry Potter books concatenated contain 19,245 unique dictionary words and 21,441 unique words in total.
Total word count is 1,122,131, which is longer than HP:MoR by a factor of about three. Plotting mean unique word count for the whole, halves and quarters of MoR gives a fit of uniques=168*length^0.3357, which makes sense given Zipf's law. That formula predicts about 18,050 unique words for a work of the same length as the original HP.
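To make the arithmetic easy to recheck, here is a minimal sketch that just evaluates the fitted power law at a given length. The function name `predicted_uniques` is my own; the coefficient and exponent are the ones quoted above, and nothing here reproduces the fitting step itself.

```python
def predicted_uniques(length, coeff=168.0, exponent=0.3357):
    """Evaluate the fit uniques = coeff * length ** exponent.

    coeff and exponent are the values quoted in the comment above;
    the function name is hypothetical, chosen for illustration only.
    """
    return coeff * length ** exponent


# Total word count of the seven original HP books, as quoted above.
hp_length = 1_122_131
print(round(predicted_uniques(hp_length)))
```

The result should land close to the "about 18,050" figure; small differences come down to rounding.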
(Edit to add obvious test in the other direction.) The first 386,829 words of the original HP contain 12,255 unique words. The last 386,829 words contain 13,635 uniques. So, it's comparable but perhaps slightly more varied (MoR had 12,685).
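For anyone wanting to repeat the equal-length comparison, a rough sketch follows. The tokenization (lowercased runs of letters, with internal apostrophes allowed) is an assumption on my part and may not match the methodology used for the figures above, so counts can differ. The helper names are hypothetical.

```python
import re


def tokenize(text):
    # Assumed tokenization: lowercase runs of letters, allowing internal
    # apostrophes ("don't"). The original methodology may differ.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def head_tail_uniques(text, n):
    """Return (uniques in the first n words, uniques in the last n words)."""
    if n <= 0:
        raise ValueError("n must be positive")
    words = tokenize(text)
    return len(set(words[:n])), len(set(words[-n:]))
```

Calling `head_tail_uniques(full_text, 386_829)` on the concatenated books would give the two counts to set against MoR's total.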
In light of those figures, is it possible Eliezer's vocabulary is less good than he thinks (Dunning-Kruger)? Especially as the Harry Potter books were written for children and presumably edited as such.
On the other hand, the fact that Eliezer seems to have used fewer words in his writing than you'd expect if his vocab was excellent doesn't mean that his known vocab is poor — he might just not use all the words he knows in writing.
Additionally, given the success of J K Rowling as an author, you might expect her vocabulary to be excellent, so it is conceivable that he's good and she's better.
† I have all the Harry Potter books on a shelf at home. Is torrenting the pdfs at work so I can word count them infringing copyright? I could have done it manually, it just would have taken longer.
I thought Eliezer's LessWrong sequences might give different results. Applying your tests to those (from http://jb55.com/lesswrong/), I get 257,646 total words, 11,666 unique dictionary words, and 12,721 unique words (I'm surprised there aren't more unique words, given that the quantum physics sequence is in there).
168*257,646^.3357 = 11,010, so the sequences seem to be at about HP level.
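One way to put the three corpora on the same footing is to divide each observed unique count by what the MoR-derived fit predicts at that length. This is only a sketch: it uses the total-unique figures quoted in this thread, and the names `predicted_uniques` and `observed_over_predicted` are made up for illustration.

```python
def predicted_uniques(length, coeff=168.0, exponent=0.3357):
    # Fit quoted in the parent comment: uniques = 168 * length ** 0.3357.
    return coeff * length ** exponent


# (total words, total unique words), all figures as quoted in this thread.
corpora = {
    "HP:MoR": (386_829, 12_685),
    "Original HP": (1_122_131, 21_441),
    "LessWrong sequences": (257_646, 12_721),
}


def observed_over_predicted(data):
    """Ratio of observed uniques to the fit's prediction, per corpus."""
    return {
        name: uniques / predicted_uniques(length)
        for name, (length, uniques) in data.items()
    }


for name, ratio in observed_over_predicted(corpora).items():
    print(f"{name}: {ratio:.2f}")
```

A ratio near 1 means the text is about as varied as MoR at that length; a ratio above 1 means more varied than the fit predicts.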
Excellent work, by the way; thanks for the analysis.