Maybe there aren't yet good open datasets available for this kind of material?
This gives Amazon, Apple and Google a nice advantage, since they are able to collect huge sample sets of actual voice commands used by people and, to some extent, correlate them with the action the person actually took.
How could we collect such a dataset? It's a bit of a chicken-and-egg problem. I don't want to talk to some open source system unless it has a fairly good chance of understanding me. Should we try to come up, half manually (through crowdsourcing), with potential requests like "Check news from CNN.com" or "Order me a quattro stagioni", which could then be fed to a platform like Common Voice?
Or should we work at a higher level? Come up with task descriptions ("You want to order a taxi to get to the airport for your morning flight at 7am") and then let people record how they would actually request this from a computer by voice. This might more accurately capture the language we actually use when speaking. With some simple automation you could generate variations of the requests, and at least partly the same base material could be used for different languages (task given in English; ask the person to make the request in Finnish).
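The "simple automation" step could be as basic as slot-filling templates: a few hand-written request patterns expanded into many surface variations. A minimal sketch, where all templates, slot names and phrasings are invented examples rather than from any real dataset:

```python
import itertools

# Hypothetical hand-written request templates with named slots.
templates = [
    "{verb} a taxi to {place} for {time}",
    "{verb} me a taxi, I need to be at {place} by {time}",
]

# Hypothetical slot values; real ones would come from crowdsourcing.
slots = {
    "verb": ["order", "book", "get"],
    "place": ["the airport", "the train station"],
    "time": ["7am", "seven in the morning"],
}

def expand(templates, slots):
    """Yield every combination of slot values for every template."""
    keys = sorted(slots)
    for template in templates:
        for values in itertools.product(*(slots[k] for k in keys)):
            yield template.format(**dict(zip(keys, values)))

requests = list(expand(templates, slots))
print(len(requests))  # 2 templates x 3 verbs x 2 places x 2 times = 24
```

Each generated sentence could then be handed to Common Voice contributors to read aloud, and the same template structure could be translated once per language while reusing the slot machinery.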
If you want recordings of people carefully reading books, it is pretty easy to get hold of that kind of data in the form of audiobooks and the work of Recording for the Blind and Dyslexic. Sure, it isn't chunked into sentences, but since you have all of the source text you could do a quite reasonable job automating the slicing, throw out the places you aren't sure about, and still have a near-infinite amount of great data. (Note that these sentences aren't perfect anyway, hence the filtering process with volunteers: while I was judging some audio files, one of the issues was "person turned off microphone a little too soon".)
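The "throw out places you aren't sure" step could be sketched as a simple transcript filter: run a recognizer over each sliced clip, compare its output against the known book sentence, and keep only close matches. This is an illustrative sketch, not a real pipeline; the recognizer output is faked, and `difflib`'s similarity ratio stands in for a proper forced-alignment confidence score.

```python
import difflib

def keep_segment(source_text, recognized_text, threshold=0.9):
    """Keep a sliced audio segment only if the recognizer's transcript
    closely matches the known source sentence. The threshold is an
    arbitrary example value, not a tuned number."""
    ratio = difflib.SequenceMatcher(
        None, source_text.lower(), recognized_text.lower()
    ).ratio()
    return ratio >= threshold

# Invented example data: (book sentence, what a recognizer heard).
segments = [
    ("It was a dark and stormy night.", "It was a dark and stormy night."),
    ("The door creaked open slowly.", "The door creaked"),  # cut off early
]
kept = [text for text, heard in segments if keep_segment(text, heard)]
print(kept)  # only the fully matched sentence survives
```

A clip where the microphone was turned off too soon, as in the judging example above, would score low against its source sentence and be dropped automatically instead of needing a human volunteer.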
Perhaps that's one of the points of using text from books: you can compare how people speak spontaneously with someone who was specifically tasked with reading the book aloud for the audiobook.