Whisper is awesome, but managing it in a production environment is not easy. I'm waiting for OpenAI (or someone else) to offer an API with a real-time factor (RTF) of < 1. RTF is inference time divided by the duration of the audio file. We could really use that.
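To make the definition concrete, here's a minimal sketch of the RTF calculation (the helper name and the example figures, taken from the whisper.cpp numbers mentioned in this thread, are illustrative):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio.
    RTF < 1 means the model transcribes faster than real time."""
    return inference_seconds / audio_seconds

# e.g. ~6 s of inference for a 30 s segment
print(real_time_factor(6.0, 30.0))  # 0.2, i.e. 5x faster than real time
```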
Doesn't whisper.cpp already get you that? It takes ~6 seconds per 30-second segment on an M1 Max with the Large model. Do you mean you want a snappy appearance of words shortly after you say them, rather than having to recognise audio in 30-second segments?