Maybe there aren't yet good open datasets available for this kind of material?
This gives Amazon, Apple and Google a nice advantage, since they are able to collect huge sample sets of actual voice commands used by people and, to some extent, correlate them with the action the person actually took.
How could we collect such a dataset? It's a bit of a chicken-and-egg problem. I don't want to talk to some open source system unless it has a fairly good chance of understanding me. Should we try to come up, half manually (through crowdsourcing), with potential requests like "Check news from CNN.com" or "Order me a quattro stagioni", which could then be fed to a platform like Common Voice?
Or should we work at a higher level? Come up with task descriptions ("You want to order a taxi to get to the airport for your morning flight at 7am") and then let people record how they would actually request this from a computer by voice. This might more accurately capture the language we actually use when speaking. With some simple automation you could generate variations of the requests, and at least partly the same base material could be used for different languages (task given in English; ask the person to make the request in Finnish).
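The "simple automation" step could be as basic as slot-filling templates: a few hand-written request patterns expanded into many surface variations. A minimal sketch, where all templates, slot names and phrasings are invented examples rather than from any real dataset:

```python
import itertools

# Hypothetical hand-written request templates with named slots.
templates = [
    "{verb} a taxi to {place} for {time}",
    "{verb} me a taxi, I need to be at {place} by {time}",
]

# Hypothetical slot values; real ones would come from crowdsourcing.
slots = {
    "verb": ["order", "book", "get"],
    "place": ["the airport", "the train station"],
    "time": ["7am", "seven in the morning"],
}

def expand(templates, slots):
    """Yield every combination of slot values for every template."""
    keys = sorted(slots)
    for template in templates:
        for values in itertools.product(*(slots[k] for k in keys)):
            yield template.format(**dict(zip(keys, values)))

requests = list(expand(templates, slots))
print(len(requests))  # 2 templates x 3 verbs x 2 places x 2 times = 24
```

Each generated sentence could then be handed to Common Voice contributors to read aloud, and the same template structure could be translated once per language while reusing the slot machinery.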
If you want recordings of people carefully reading books, it is pretty easy to get hold of that kind of data in the form of audiobooks and the work of Recording for the Blind and Dyslexic. Sure, it isn't chunked into sentences, but since you have all of the source text you could do a quite reasonable job automating the slicing, throw out the places you aren't sure about, and still have a near-infinite amount of great data. (Note that these sentences aren't perfect anyway, hence the filtering process with volunteers: while I was judging some audio files, one of the issues was "person turned off microphone a little too soon".)
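The "throw out places you aren't sure" step could be sketched as a simple transcript filter: run a recognizer over each sliced clip, compare its output against the known book sentence, and keep only close matches. This is an illustrative sketch, not a real pipeline; the recognizer output is faked, and `difflib`'s similarity ratio stands in for a proper forced-alignment confidence score.

```python
import difflib

def keep_segment(source_text, recognized_text, threshold=0.9):
    """Keep a sliced audio segment only if the recognizer's transcript
    closely matches the known source sentence. The threshold is an
    arbitrary example value, not a tuned number."""
    ratio = difflib.SequenceMatcher(
        None, source_text.lower(), recognized_text.lower()
    ).ratio()
    return ratio >= threshold

# Invented example data: (book sentence, what a recognizer heard).
segments = [
    ("It was a dark and stormy night.", "It was a dark and stormy night."),
    ("The door creaked open slowly.", "The door creaked"),  # cut off early
]
kept = [text for text, heard in segments if keep_segment(text, heard)]
print(kept)  # only the fully matched sentence survives
```

A clip where the microphone was turned off too soon, as in the judging example above, would score low against its source sentence and be dropped automatically instead of needing a human volunteer.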
Perhaps that's one of the points of using text from books: you can compare how people speak spontaneously with someone who was specifically tasked with reading the book aloud for the audiobook.