Whispers of A.I.’s Modular Future (newyorker.com)
162 points by lispybanana on Feb 21, 2023 | hide | past | favorite | 52 comments



What utilities related to Whisper do you wish existed? What have you had to build yourself?

On the end user application side, I wish there was something that let me pick a podcast of my choosing, get it fully transcribed, and get an embeddings search plus Q&A on top of that podcast or set of chosen podcasts. I've seen ones for specific podcasts, but I'd like one where I can choose the podcast. (Probably won't build it)
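The search half of that idea is simple to sketch. Here a toy bag-of-words cosine similarity stands in for a real sentence-embedding model (the segment/timestamp interface is hypothetical; Whisper's actual output gives you `start` and `text` per segment):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(segments, query, top_k=3):
    # segments: list of (start_seconds, text) pairs from a transcript.
    q = embed(query)
    scored = [(cosine(embed(text), q), start, text) for start, text in segments]
    return sorted(scored, reverse=True)[:top_k]

segments = [(0.0, "welcome to the show"),
            (42.5, "today we discuss speech recognition"),
            (90.0, "thanks for listening")]
print(search(segments, "speech recognition")[0][1])  # → 42.5, timestamp of best match
```

With real embeddings you'd swap `embed` for a model call and store vectors in an index, but the retrieve-then-answer shape stays the same.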

Also on the end user side, I wish there was an Otter alternative (still paid $30/mo, but unlimited minutes per month) that had longer transcription limits. (Started building this, not much interest from users though)

Things I've seen on the dev tool side:

Gladia (API call version of Whisper)

Whisper.cpp

Whisper webservice (https://github.com/ahmetoner/whisper-asr-webservice) - via this thread

Live microphone demo (not real time, it still does it in chunks) https://github.com/mallorbc/whisper_mic

Streamlit UI https://github.com/hayabhay/whisper-ui

Whisper playground https://github.com/saharmor/whisper-playground

Real time whisper https://github.com/shirayu/whispering

Whisper as a service https://github.com/schibsted/WAAS

Improved timestamps and speaker identification https://github.com/m-bain/whisperX

MacWhisper https://goodsnooze.gumroad.com/l/macwhisper

Crossplatform desktop Whisper that supports semi-realtime https://github.com/chidiwilliams/buzz


This demo lets you choose the podcast, and is open-source: https://modal-labs--whisper-pod-transcriber-fastapi-app.moda...

https://github.com/modal-labs/modal-examples/tree/main/06_gp...

Transcribes 1hr of audio in roughly 1min, using parallelisation across CPUs.
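The fan-out pattern that makes this fast can be sketched with a stub in place of the model call. Everything below is illustrative: the real Modal example splits audio on silence rather than fixed windows, and `transcribe_chunk` would actually invoke Whisper on an audio slice:

```python
from concurrent.futures import ProcessPoolExecutor

CHUNK_SECONDS = 30  # Whisper processes audio in 30-second windows

def transcribe_chunk(chunk):
    # Stub: a real version would run Whisper on this time slice of the file.
    start, end = chunk
    return (start, f"[text for {start}-{end}s]")

def chunks(duration):
    # Split [0, duration) seconds into fixed-size windows.
    return [(t, min(t + CHUNK_SECONDS, duration))
            for t in range(0, duration, CHUNK_SECONDS)]

def transcribe_parallel(duration, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transcribe_chunk, chunks(duration))
    # pool.map preserves input order, so segments come back in time order.
    return [text for _, text in results]

if __name__ == "__main__":
    print(len(transcribe_parallel(120)))  # 4 chunks for two minutes of audio
```

Since each chunk is independent, throughput scales roughly with worker count, which is how an hour of audio gets done in about a minute.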


On the end user side Revoldiv.com lets you pick any podcast you want and transcribe it


I’d like to be able to search long videos (or podcasts) for a keyword (or even better, semantic search), and be directed to all time stamps in that video that match.

Things like this sort of exist, but nothing that’s really usable.


Add sponsor-skipping into that. Give me a transcript, let me select a series of words, then audio containing that series of words gets skipped on all remaining episodes.
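Given word-level timestamps (which whisperX provides; plain Whisper only gives segment-level times), turning a selected phrase into skip ranges is a small sliding-window match. The tuple interface below is a hypothetical sketch:

```python
def find_skip_ranges(words, phrase, pad=0.5):
    # words: list of (word, start_sec, end_sec) tuples with per-word timestamps.
    # Returns (start, end) ranges covering every occurrence of `phrase`,
    # padded slightly so the skip doesn't clip adjacent audio.
    target = phrase.lower().split()
    ranges = []
    for i in range(len(words) - len(target) + 1):
        window = [w.lower() for w, _, _ in words[i:i + len(target)]]
        if window == target:
            ranges.append((max(0.0, words[i][1] - pad),
                           words[i + len(target) - 1][2] + pad))
    return ranges

words = [("this", 0.0, 0.2), ("episode", 0.2, 0.6), ("is", 0.6, 0.7),
         ("sponsored", 0.7, 1.2), ("by", 1.2, 1.3), ("acme", 1.3, 1.8)]
print(find_skip_ranges(words, "sponsored by acme"))
```

A player would then seek past any range returned here; matching across future episodes just means re-running this on each new transcript.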


I think AssemblyAI has tools for that.


I love whisper. It is so easy to use. I created a small pipeline that transcribes podcasts within the domain that I'm working in. It helps me and my colleagues revisit and find podcast episodes without having to listen to them again. You can check it out on podcasts.farmonapp.com


This is really interesting. I'm looking at doing something similar. Do you mind me asking what the back-end API call is written in? I had looked at Deepgram and might try putting a small project together.


It is all written in Python and it uses the original python bindings. I'm using mkdocs to convert the transcripts into a website.


Aside: I'd love to see Rust and Zig implementations along the lines of whisper.cpp.

I'll donate $150 USD to the Zig and Rust foundations as a bounty for respective MIT-licensed implementations of these. Let's keep it simple: scalar instructions, no need for intrinsics/assembly. Ideally there would be some tests.

whisper.cpp looks like a simple-enough-but-very practical application and I think it would help promote these modern languages to have a simple and portable demonstration like this.


What does a Rust/Zig port buy that the current implementation cannot do?


Nothing. I tried to address that at the end there: it would serve as an excellent demonstration of the power of these languages (taking on a performance-critical task like audio transcription).


Just taking a stab, but if your code is primarily written in Rust/Zig, it's really annoying calling C/C++ libraries, because you have to build them and keep the bindings in sync after updates.



"There could be even larger changes—we talk a lot, and almost all of it goes into the ether. What if people recorded conversations as a matter of course, made transcripts, and referred back to them the way we now look back to old texts or e-mails?"

Minority Report-level awesome futurism

One thing that changes everything


This exists already https://www.rewind.ai/


I had no idea. instantly subscribed. not joking.


The article concludes:

"Eventually, though, someone will release a program that’s nearly as capable as ChatGPT, and entirely open-source. An enterprising amateur will find a way to make it run for free on your laptop."

Well, not in the near future... Large language models like ChatGPT are called "large" for a reason, and they're too big for your laptop for the foreseeable future. I think you'd need a computer with several high-end GPUs that each have 128+ GB of RAM to run one of those. Maybe a laptop from 2030 will do.

Otherwise a very good article.


I'll preface by saying that Mac is definitely not the platform for deep learning currently. However! The M2 MacBook Pros can optionally be equipped with 96 GB of RAM, all of which can be accessed by the GPU.

Assuming that somebody, somewhere, is working on improving things for Mac, we may very likely already have the hardware to run at least a distilled version of ChatGPT locally on laptops. (And if not the MBP, then the M1 Mac Studio would be a good runner-up with 128 GB of memory, though that's obviously not a laptop)


FlexGen can already run GPT-3 size models on commodity hardware, albeit with high latency and fairly slow throughput (order of 1 token/s).


> to be able to run one of those

It's very important to distinguish the different use cases between training and inference. The amount of memory required to execute the ChatGPT model, once trained, is likely much, much less than 128GB.


Afaik you actually need much more: about 400GB just to load the trained model, according to this tweet thread: https://twitter.com/tomgoldsteincs/status/160019698195510069... I'm not quite sure how reliable the source is, but it makes sense that you at least need to store the 175 billion parameters that define the model in VRAM. GPUDirect Storage has been a thing for a little while now, so that could help for sure, but it would definitely impact execution time as well.


That thread makes a bunch of assumptions that seem a bit dubious to me. We've known since Chinchilla that you don't need 175B parameters to get GPT-3 quality – a 70B model can outperform GPT3 [1]. And his numbers assume the model is loaded into GPU memory in FP16 (175B*2 = 350GB), but people have shown you can quantize down to 8-bit (and in some cases 4 bit) with almost no performance loss. So in 8-bit precision with a 70B model you need ~70GB of VRAM, which you can get with two A6000s on a desktop (each 48GB).

And finally there are lots of other ways to get this down. Aside from quantization, people have also shown that you can do pruning – getting rid of many of the weights – again without much perf loss. You can also offload the weights to CPU RAM or an NVME and stream them in as needed [2]; it's slower but if you arrange things right the performance is not too bad. There are also ways to speed up inference using techniques like early exit [3], where you can skip running the whole model for some tokens that are easy to predict.

Overall it feels like within a year or two a combination of better quantization/pruning, improved understanding of how to train smaller LLMs, and hardware improvements will put inference for ChatGPT-style models within reach of the average user.

[1] https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b...

[2] https://github.com/FMInference/FlexGen

[3] https://ai.googleblog.com/2022/12/accelerating-text-generati...
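The back-of-envelope memory arithmetic in this thread can be written out directly (weights only, ignoring activations and KV cache, using GB = 1e9 bytes as the thread's rough figures do):

```python
def weights_gb(params_billions, bits_per_weight):
    # Approximate memory needed just to hold the weights at a given precision.
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(round(weights_gb(175, 16)))  # 175B weights in fp16 -> 350 GB
print(round(weights_gb(70, 8)))    # Chinchilla-scale 70B in 8-bit -> 70 GB
print(round(weights_gb(70, 4)))    # ...and 35 GB at 4-bit
```

Real deployments need headroom beyond this for activations, the KV cache, and any unquantized layers, but the weight term dominates for large models.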


A few points:

1. Chinchilla has demonstrated models are currently unnecessarily large, and would benefit more from data scaling.

2. Models can be brought down in size massively by a combination of distillation and quantization.

A GPT-3 equivalent with 50B parameters quantized to 4-bit weights is ~25 GB, and it could probably be distilled to half of that or less while still being functional for the vast majority of prompts. That means tens of gigabytes of memory will be a target for devices in the near future.

Once large language models are the main thing people buy GPUs (and even new computers) for, architectures will be redesigned to improve GPU-to-memory bandwidth and latency. I wouldn't be surprised to see GPU-integrated motherboards as a future premium-tier offering; we're already running into heat and space issues with add-on cards, and it should be possible to build a low-latency bus to a unified system memory.


>1. Chinchilla has demonstrated models are currently unnecessarily large, and would benefit more from data scaling.

Not convinced. It showed this for the original self supervised task, but it might be true that the spare parameters end up being useful for the later finetuning/RLHF stages.


Many of these models are already being trained in FP16, and FP8 seems likely now that the H100s support it.


You can run this today on a consumer GPU at slow speed, using swapping and 4-bit weights (which works surprisingly well and is the new hot topic now)


You can run giant models on a high-end laptop today... they'll just be slooooowww since you'll be doing things like swapping data in/out and leveraging the CPU. If you don't mind waiting an hour for a prompt response it can work.

It's the same as it's always been. Any general purpose computer is Turing complete. Spending more gets you faster results.


Great article, we are only just beginning to see the impact of Whisper. I hope at least that it will trickle into my Alexa sooner rather than later, but I've been scheming other uses for it too. Dictation to notes so I can think out loud. Make transcripts of talks that I would rather read than listen to. The possibilities are endless.


Whisper is great, but Google's built-in speech-to-text thing in Chrome with https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_... has been available for quite some time and works well. Not sure if Firefox has an implementation, probably one that works locally but not nearly as well.

Obviously not having to send all of the data to Google is a big deal. But practically speaking the Google recognition seems to perform extremely well. So not sure this is really a new capability for people if they were willing to use Google's servers.

It seems like Firefox and Chrome should now ship with Whisper built in. Or they should work on that... and if Chrome doesn't add it, then that's suspicious.


> So not sure this is really a new capability for people if they were willing to use Google's servers.

Of course it is. Google provides the API at their largesse. One day they can decide they won't do it anymore, or they'll charge you huge piles of money, or they'll decide they don't like something about your data and won't process it for any amount of money. Or they'll do it but send back nonsense. It's their endpoint and they do whatever they please with it.

On the other hand, you can take Whisper and use it on your own computer. Or rent a computer from anyone who offers that kind of service. If you don't like how it performs, you can refine it on your own datasets. If you have tens of thousands of hours of sound files you want to convert, you can just calculate what it will cost you to convert with Whisper. With Google? Good luck.

The difference between running Whisper on your own hardware vs. using Google's endpoint is the difference between having your own car vs. a mercurial rich guy not minding, for now, if skaters hold on to his taillights to hitch a ride.


You may want to read my reply to another comment here; it has a comparison to Google's ASR.


Whisper is great, but I also wouldn't overlook other next-generation models (RNN-T/Zipformer/etc) trained on 50k+ hour datasets. These also perform very well.

That being said, Whisper is clearly a far cry from "intelligence". This should be clear when you feed it 5 seconds of silence and get hallucinated garbage in return. It's much more akin to compressing those huge datasets into something that can feasibly be run on recent hardware. That's not to downplay how impressive that is, just to draw a clear line between "compression" and "intelligence".


It's a tool that's free as in freedom, and it is incredibly useful.

How does discovering that it doesn't handle some weird use case diminish its utility as a tool?

Hammers are awesome at driving nails into wood. But if you strike the wood directly without a nail, the hammer puts a dent in the wood. Is that a defect in the hammer? Does it somehow make the hammer any less useful as a tool?


Not sure what makes you think I'm diminishing its utility as a tool; like I said, it's an incredible tool and I lean on it very heavily for various speech-processing pipelines.

I'm just pointing out that Whisper definitely hasn't "solved" speech recognition, and there's still a lot more fertile ground to cover from a research perspective.


Personally, I don't think speech recognition, at least speech recognition based on any human model, is solvable; at best we can get "as good as an average set of humans that speak the language".

People mumble crap all the time to other humans and need to repeat what they say with proper enunciation.

People have hearing problems, which would correlate to microphone quality/placement issues when dealing with computer systems.

Then there are issues where people say one thing incorrectly, but the person following/listening to the directions knows the procedure and does the correct thing. If you asked the speaker what they said again, they'd say they said the 'correct' thing in the first place.

And this is something that I've done before as an example.

Me: "Click the X button to start the process"

Person writing the notes: "Click the Y button to start the process"

Person writing the notes: You meant click the Y button, right?

Me: Yea, that's what I said.

It's when we get into things like this that we run into unsolvable speech recognition issues, because we don't generally understand our own error bars on what we say. The speech quality between a public speaker and the average Joe, I'm sure, has a very wide range.


> at best we can get as good as an average set of humans

Whisper is already superhuman, more accurate than experienced human transcribers.


5 seconds of silence isn't a weird use case.


>, just to draw a clear line between "compression" and "intelligence"

Not disagreeing with you but your sentence reminded of research looking at the link between "compression" and "intelligence" :

https://www.google.com/search?q=compression+is+a+form+of+int...


Don't state-of-the-art commercial systems do something similar? I assume there must be some automatic gain boosting the noise at the frontend of most pipelines. I know I've gotten transcribed voicemails that really are just silence, but the transcript shows lots and lots of hallucinated words.

Regardless of "intelligence" it's got real utility.


I occasionally get a single hallucinated word (more like a mis-transcription) where the audio contains a clunk/bang/cough/etc, but I've never had full hallucinated phrases from clean silence.

There are a couple of GitHub discussions on the Whisper repository with various fixes/hacks to deal with it: https://github.com/openai/whisper/discussions/679 https://github.com/openai/whisper/discussions/813

If you get a chance, I encourage you to try out the other newer models I mentioned, I think you'd be very impressed.


I don't see this as much different from what commonly happens with humans when we hear our name called and it was really just some environmental noise.

As for the silence, I wonder why the model even receives it. I would think a lot of that would be compressed out of existence to save bandwidth.


It’s not that your audio is being amplified, it’s that the VAD classifier is poorly tuned. The noise should never even reach the recognition stage. Whisper’s hallucinations are pretty severe, but are improved by adding VAD to its pipeline.
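Besides adding a VAD stage in front, you can also filter on the confidence fields Whisper already emits: each entry in its `result["segments"]` carries `no_speech_prob` and `avg_logprob`. A minimal post-hoc filter sketch (the thresholds are common starting points, not canon, and the segment dicts below are fabricated for illustration):

```python
def drop_hallucinations(segments, no_speech_max=0.6, logprob_min=-1.0):
    # Keep a segment only if Whisper thinks it contains speech AND
    # decoded it with reasonable confidence.
    return [s for s in segments
            if s["no_speech_prob"] < no_speech_max
            and s["avg_logprob"] > logprob_min]

segments = [
    {"text": "Der eine hat 12 Grad.",
     "no_speech_prob": 0.02, "avg_logprob": -0.3},
    {"text": "Vielen Dank für's Zuschauen!",  # classic silence hallucination
     "no_speech_prob": 0.92, "avg_logprob": -0.8},
]
print([s["text"] for s in drop_hallucinations(segments)])
```

This won't catch every hallucination (some come back with confident-looking scores), which is why running VAD before the model is the more robust fix.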


I have an app on my phone which creates a 1 minute audio file when I press a button. I have a lavalier microphone connected to the phone and use it to record notes while riding my bike. It's always 1 minute because that is usually enough, and if I see that I need more time, I record an overlapping second file.

Last week I set up a Whisper instance on my server and have been feeding it with these files. The result is pretty good. I usually can remember what I was saying when I read the transcription, which usually contains a couple of errors. Then there are those added hallucinations which are entire sentences, like:

----

00:00.000 --> 00:05.000 Also temperaturmäßig ist es recht gut. [So temperature wise, it's pretty good.]

00:05.000 --> 00:09.000 Der eine hat 12 Grad, der andere 10. [One has 12 degrees, the other 10. (I have two temperature sensors mounted on the bike, ESP32 streaming the data to the phone via BLE)]

00:09.000 --> 00:12.000 Also sagen wir mal, 10 Grad. [So let's say 10 degrees.]

00:14.000 --> 00:19.000 Es ist bewölkt und windig. [It's cloudy and windy.]

00:20.000 --> 00:24.000 Aber irgendwie vom Wetter her gut. [But somehow from the weather it's good.]

00:24.000 --> 00:31.000 Ich habe heute überhaupt nichts gegessen und sehr wenig getrunken. [I ate nothing at all today and drank very little.]

00:54.000 --> 00:59.000 Vielen Dank für's Zuschauen! [Thanks for watching!]

Transcribed in 77.2 seconds

----

The last sentence, "Thanks for watching!" is a complete hallucination. There were 30 seconds remaining which were me breathing and the wind blowing into the microphone and it came up with that comment.

I usually comment on the weather because I take note of what I am wearing, and it allows me to better prepare for future rides.

77 seconds for the 60 second file because my server has no GPU, so I'm running the large model on the CPU (in a VM which has 8 cores assigned to it from a Ryzen 9 5950X). I've been considering buying a small PC with a 3060 RTX only for inferencing, but it may be too expensive. I tried Google Speech-To-Text and it is nowhere as good as Whisper under these conditions (having the wind noise and the heavy breathing).

This is Google's result:

----

"Also temperaturmäßig es ist recht gut, der eine hat 12° andere 10. Es ist angemalte 10 Grad. Es ist bewölkt und windig, aber er hat sie vom Wetter her gut, ich wollte überhaupt nichts gegessen und sehr wenig getrunken."

["So temperature-wise it's pretty good, one has 12° other 10. It's painted 10 degrees. It's cloudy and windy, but he has it good from the weather, I did not want to eat anything at all and drank very little."]

----

Also, whisper.cpp doesn't seem to generate the same results as the Python version, and they tend to be not as good (though in this case it was almost just as good). I just tested the same file on whisper.cpp with the large model and it's even funnier:

----

[00:00:00.000 --> 00:00:05.000] also temperaturmäßig ist es recht gut [...]

[00:00:05.000 --> 00:00:09.000] der eine hat 12 Grad, der andere 10 [...]

[00:00:09.000 --> 00:00:12.000] also sagen wir mal so 10 Grad [...]

[00:00:12.000 --> 00:00:19.000] es ist bewölkt und windig [...]

[00:00:19.000 --> 00:00:24.000] aber irgendwie vom Wetter her gut [...]

[00:00:27.000 --> 00:00:31.000] ich habe heute überhaupt nichts gegessen und sehr wenig getrunken [...]

[00:00:31.000 --> 00:00:35.000] das ist der Grund, warum ich so viel auf dem Knie gehe [this is the reason why I go so much on the knee]

[00:00:35.000 --> 00:00:39.000] das war's, bis zum nächsten Mal! [that's it, until next time!]

[00:00:39.000 --> 00:00:59.000] Danke fürs Zuschauen! [Thanks for watching!]

`time` yields 567.63s user 1.99s system 755% cpu 1:15.36 total

----

The first 30 seconds, where the text is clearly understood, is inferenced within ~10-15 seconds. It's the "silence" which makes the AI go crazy on the workload.

The idea behind this is to set up a system which then sends me an email with a map and trail of the ride as well as the transcriptions of the notes.


Instead of setting up a machine for inference, try modal labs (no affiliation): https://modal.com/docs/guide/whisper-transcriber

Pay per second GPU processing, with an example of running whisper over 10 GPUs in parallel.


Interesting. I thought that with these offerings I had to rent a VM with GPU and pay the hourly rate for as long as a VM is running.

So this is really 0 USD when not in use? I'm also intending to use this for transcribing my phone answering machine recordings, so the transcription requests come in at random times which means that the transcription service should be constantly available.


Most are, modal is a very different offering where it's $0 when not in use. They have some other very interesting ideas like charging you for CPU time rather than wall time.

It's a newer business so I guess you should factor that risk in though.


Whisper is awesome, but managing it in a production environment is not easy. I'm waiting for OpenAI (or someone else) to offer an API with a Real Time Factor (RTF) of < 1. RTF is inference time divided by the duration of the file. We could really use that.
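The RTF definition above as a one-liner, applied to the 77.2 s run on a 60 s recording reported elsewhere in this thread:

```python
def real_time_factor(inference_seconds, audio_seconds):
    # RTF < 1 means transcription finishes faster than the audio plays.
    return inference_seconds / audio_seconds

print(round(real_time_factor(77.2, 60.0), 2))  # → 1.29, slower than real time
```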


Getting a server running is easy if you use https://github.com/ahmetoner/whisper-asr-webservice as a guide. It's then a REST API which you post the file to and get the transcription in return.

But I don't know what you consider being "in production". If it's for internal use then it is enough.

Here are some comparisons of running it on GPU vs. CPU: according to https://github.com/MiscellaneousStuff/openai-whisper-cpu, the medium model needs 1.7 seconds to transcribe 30 seconds of audio when run on a GPU.
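Calling a deployed whisper-asr-webservice instance is then a single multipart POST. This sketch follows that project's README (`/asr` endpoint, `audio_file` form field); double-check the field names against the version you deploy, and the `localhost:9000` address is just the default from its docs:

```python
def asr_url(server="http://localhost:9000", task="transcribe", output="json"):
    # Build the request URL; split out so it's easy to point at another host.
    return f"{server}/asr?task={task}&output={output}"

def transcribe(path, server="http://localhost:9000"):
    import requests  # third-party: pip install requests
    with open(path, "rb") as f:
        resp = requests.post(asr_url(server), files={"audio_file": f})
    resp.raise_for_status()
    return resp.json()
```

From there "production" mostly means the usual REST concerns: queueing uploads, timeouts for long files, and scaling workers behind the endpoint.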


Doesn't whisper.cpp already get you that? It takes ~6 seconds per 30 second segment on an M1 Max with the Large model. Do you mean you want snappy appearance of words shortly after you say them, rather than having to recognise in 30 second segments?


I really like the Sutton quote. Manually written AI systems show promising early results and then fail when compared to machine learning approaches.



