> Hundreds of hours of audio data from your own voice?

I should clarify this. As I mentioned, training the neural net requires tons of audio plus the corresponding text (people should totally contribute[1]; the resulting data sets are released to the public). Once trained, the neural net in DeepSpeech is run on an audio stream and outputs a stream of characters.

Turning that stream of characters into sentences is what the language model is for.
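To make that division of labor concrete, here is a toy sketch, not DeepSpeech's actual decoder. It assumes the neural net has already produced two candidate character streams with made-up acoustic scores, and it uses a tiny hand-written bigram table as a stand-in for a real language model. All names and numbers here are hypothetical.

```python
import math

# Toy bigram counts (hypothetical). A real language model would be
# estimated from a large text corpus in the target language.
BIGRAMS = {
    ("<s>", "i"): 5,
    ("i", "see"): 4,
    ("see", "the"): 4,
    ("the", "sea"): 3,
    ("sea", "</s>"): 3,
}
VOCAB_SIZE = 5  # i, see, the, sea, </s>


def lm_log_prob(sentence):
    """Log-probability of a sentence under the toy bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        pair = BIGRAMS.get((prev, cur), 0)
        context = sum(c for (p, _), c in BIGRAMS.items() if p == prev)
        # Add-one smoothing so unseen word pairs keep a small probability.
        total += math.log((pair + 1) / (context + VOCAB_SIZE))
    return total


def pick_transcript(candidates, lm_weight=1.0):
    """Pick the candidate with the best combined acoustic + LM score."""
    return max(candidates, key=lambda c: c[1] + lm_weight * lm_log_prob(c[0]))[0]


# (text, acoustic log-score): the two sound identical, so the acoustic
# scores are close and the wrong spelling even wins slightly.
candidates = [("i sea the see", -1.0), ("i see the sea", -1.2)]
best = pick_transcript(candidates)
print(best)
```

The acoustic side alone cannot tell "sea" from "see"; the language model tips the balance toward the word order it has seen in text. Real decoders typically fold the language model into the beam search instead of rescoring a finished list, but the idea is the same.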

Training the neural net is very data- and compute-intensive, but fortunately Mozilla provides pre-trained models.

Generating the language model is relatively cheap. And if your target language shares sounds with English, you may get away with using the English-trained neural net but with a non-English language model.

[1]: https://voice.mozilla.org/