
It's a tool that's free as in freedom, and it is incredibly useful.

How does discovering that it doesn't handle some weird use case diminish its utility as a tool?

Hammers are awesome at driving nails into wood. But if you strike the wood directly without a nail, the hammer puts a dent in the wood. Is that a defect in the hammer? Does it somehow make the hammer any less useful as a tool?



Not sure what makes you think I'm diminishing its utility as a tool; like I said, it's an incredible tool and I lean on it very heavily for various speech-processing pipelines.

I'm just pointing out that Whisper definitely hasn't "solved" speech recognition, and there's still a lot more fertile ground to cover from a research perspective.


Personally, I don't think speech recognition, at least based on any human model, is solvable; at best we can get 'as good as an average set of humans that speak the language'.

People mumble crap all the time to other humans and need to repeat what they say with proper enunciation.

People have hearing problems, which would correlate to microphone quality/placement issues when dealing with computer systems.

Then there are issues where people say one thing incorrectly, but the person following/listening to the directions knows the procedure and does the correct thing. If you asked the speaker what they said again, they'd say they said the 'correct' thing in the first place.

As an example, this is something I've done before.

Me: "Click the X button to start the process"

Person writing the notes: "Click the Y button to start the process"

Person writing the notes: "You meant click the Y button, right?"

Me: "Yeah, that's what I said."

Oops. It's when we get into things like this that we run into the unsolvable speech recognition issues, because we don't generally understand our own error bars on what we say. I'm sure the speech quality between a public speaker and the average Joe spans a very wide range.


> at best we can get as good as an average set of humans

Whisper is already superhuman, more accurate than experienced human transcribers.


5 seconds of silence isn't a weird use case.



