
Whisper is awesome, but managing it in a production environment is not easy. I'm waiting for OpenAI (or someone else) to offer an API with a Real Time Factor (RTF) of < 1. RTF is inference time divided by the duration of the audio file. We could really use that.
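The RTF arithmetic is simple; here's a quick sketch (the function name `real_time_factor` is my own, not from any Whisper tooling):

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = inference time / audio duration; < 1 means faster than real time."""
    return inference_seconds / audio_seconds

# Example: 6 s to transcribe a 30 s clip is an RTF of 0.2.
print(real_time_factor(6.0, 30.0))
```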


Getting a server running is easy if you use https://github.com/ahmetoner/whisper-asr-webservice as a guide. It's then a REST API: you POST an audio file to it and get the transcription back.
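A minimal client for such a service might look like this. This is a sketch, not a definitive implementation: the `/asr` endpoint and `audio_file` field follow the linked repo's README, but the base URL and helper names here are my own assumptions.

```python
def asr_request(base_url: str = "http://localhost:9000"):
    """Build the endpoint URL and query params for a transcription call."""
    return f"{base_url}/asr", {"task": "transcribe", "output": "txt"}

def transcribe(path: str, base_url: str = "http://localhost:9000") -> str:
    """POST an audio file to the webservice and return the transcript text."""
    import requests  # third-party; imported lazily so the sketch loads without it
    url, params = asr_request(base_url)
    with open(path, "rb") as f:
        resp = requests.post(url, params=params, files={"audio_file": f})
    resp.raise_for_status()
    return resp.text
```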

But I don't know what you consider "in production". If it's for internal use, this setup is enough.

Here are some comparisons of running it on GPU vs CPU. According to https://github.com/MiscellaneousStuff/openai-whisper-cpu, the medium model needs 1.7 seconds to transcribe 30 seconds of audio when run on a GPU.


Doesn't whisper.cpp already get you that? It takes ~6 seconds per 30-second segment on an M1 Max with the large model. Or do you mean you want words to appear shortly after you say them, rather than having recognition run in 30-second segments?




