It's a tool that's free as in freedom, and it is incredibly useful. How does dis...

nmfisher · on Feb 21, 2023

Not sure what makes you think I'm diminishing its utility as a tool; like I said, it's an incredible tool and I lean on it very heavily for various speech-processing pipelines.

I'm just pointing out that Whisper definitely hasn't "solved" speech recognition, and there's still a lot more fertile ground to cover from a research perspective.

pixl97 · on Feb 21, 2023

In my personal belief, speech recognition, at least based on any human model, isn't solvable; at best we can get 'as good as an average set of humans that speak the language'.

People mumble crap all the time to other humans and need to repeat what they say with proper enunciation.

People have hearing problems, which would correlate to microphone quality/placement issues when dealing with computer systems.

Then there are issues where people say one thing incorrectly, but the person following/listening to the directions knows the procedure and does the correct thing. If you asked the speaker what they said again, they'd say they said the 'correct' thing in the first place.

Here's an example of something I've done before.

Me: "Click the X button to start the process"

Person writing the notes: "Click the Y button to start the process"

Person writing the notes: You meant click the Y button, right?

Me: Yea, that's what I said.

Oops. It's when we get into things like this that we run into the unsolvable speech recognition issues, because we don't generally understand our own error bars on what we say. The speech quality of a public speaker versus the average Joe, I'm sure, spans a very wide range.

panarky · on Feb 21, 2023

> at best we can get as good as an average set of humans

Whisper is already superhuman, more accurate than experienced human transcribers.

pessimizer · on Feb 21, 2023

5 seconds of silence isn't a weird use case.