
I've been working solely on voice agents for the past couple years (and have worked at one of the frontier voice AI companies).

The cascading model (STT -> LLM -> TTS) is unlikely to go away anytime soon for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises. Enterprises care about reliability and liability. The cascading approach is much more amenable to specialization (rather than raw flexibility / generality) and auditability.

Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.
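To make that concrete, here's a minimal sketch of one cascaded turn with audit logging at each stage boundary. The `stt`/`llm`/`tts` functions are stand-ins for real provider calls, not any particular vendor's API; the point is that what the agent "heard" and what it's about to "say" are both inspectable before the next stage runs.

```python
from dataclasses import dataclass, field

@dataclass
class TurnAudit:
    heard: str = ""        # what the STT stage transcribed
    reply_text: str = ""   # what the LLM produced, before any synthesis
    events: list = field(default_factory=list)

def stt(audio: bytes) -> str:
    return audio.decode("utf-8")   # stand-in transcriber

def llm(text: str) -> str:
    return f"You said: {text}"     # stand-in language model

def tts(text: str) -> bytes:
    return text.encode("utf-8")    # stand-in synthesizer

def run_turn(audio: bytes, audit: TurnAudit) -> bytes:
    audit.heard = stt(audio)
    audit.events.append(("stt", audit.heard))
    audit.reply_text = llm(audit.heard)
    audit.events.append(("llm", audit.reply_text))
    # A compliance hook can inspect or redact audit.reply_text right
    # here, before any audio is synthesized and played back.
    return tts(audit.reply_text)
```

With an end-to-end speech model you get one opaque audio-in/audio-out hop; there's no equivalent seam where a transcript or draft response can be logged or blocked.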

Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).

But from my experience working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and that don't solely rely on the latest-and-greatest SoTA ML models.


Yea, Deepgram Flux is the secret sauce. Doesn't get talked about much.

For anyone curious: https://flux.deepgram.com/


What is the difference between Flux's end-of-turn detection and OpenAI's automatic turn detection in semantic mode?

In OpenAI's own words about semantic_vad:

> Chunks the audio when the model believes based on the words said by the user that they have completed their utterance.

Source: https://developers.openai.com/api/docs/guides/realtime-vad

OpenAI's Semantic mode is looking at the semantic meaning of the transcribed text to make an educated guess about where the user's end of utterance is.
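For reference, enabling it looks roughly like this `session.update` payload on a Realtime API connection (shape per their VAD guide; field names may drift across API versions, so treat this as a sketch rather than gospel):

```python
# Hedged sketch of the session config that turns on semantic VAD.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",
            # "eagerness" tunes how quickly it commits to end-of-turn:
            # "low" waits longer, "high" interrupts sooner.
            "eagerness": "auto",
        }
    },
}
```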

According to Deepgram, Flux's end-of-turn detection is not just a semantic VAD (which inherently is a separate model from the STT model that's doing the transcribing). Deepgram describes Flux as:

> the same model that produces transcripts is also responsible for modeling conversational flow and turn detection.

[...]

> With complete semantic, acoustic, and full-turn context in a fused model, Flux is able to very accurately detect turn ends and avoid the premature interruptions common with traditional approaches.

Source: https://deepgram.com/learn/introducing-flux-conversational-s...

So according to them, end-of-turn detection isn't just based on the semantic content of the transcript (which makes sense given the latency), but also on the characteristics of the actual audio waveform itself.

Which Pipecat (an open source voice AI orchestration platform) seemingly does as well with its native smart-turn detection model: https://github.com/pipecat-ai/smart-turn (minus the built-in transcription)


Thanks. Then maybe it’s similar to Moshi https://github.com/kyutai-labs/moshi?tab=readme-ov-file

Except an LLM actually is a piece of software. And the brain is not what you said.

Which part of what he said is wrong?

> A brain is a collection of cells that transmit electrical signals and sodium. ...

That it is a collection of cells? Or that they transmit electrical signals and sodium?

Or do you feel that he's leaving out something important about how it works (like generated electrical fields or neural quantum effects)?



"Good" is subjective. But yes, all wealth creation requires working with other people. No one is an island. And most people are increasingly disturbed by the types of decisions required to amass more wealth than sovereign nations.

Yea, it's puzzling to me that this isn't asked of folks like Altman and Amodei in every interview. Maybe it's because Altman would just start shilling his eye scanning orb and start repeating "WORLD COIN" ad nauseam. Either way, they should be getting pressed on this by all media.

It's not puzzling. Journalism was murdered because it asked Nixon too many questions. So now unless you softball interviews, you just don't get to interview anyone, so the only news orgs with content to monetize are the ones just printing Press Releases and being a backboard for "interviews".

It sure is fun how the party who screams about "personal responsibility" seems to get very upset if you ask a responsible person to explain themselves and their actions.


Ed is the anger translator in my head. Good stuff.


Fellow Midwesterner?


Totally agree. Adult life is just mentally taxing. I'm more curious and more eager to learn now in my 30s than I was in any of my schooling. The learning isn't hard but the energy regulation is.

I think it's so easy for people to discount "mental energy" since culturally we don't often acknowledge it as a finite resource the same way we do physical energy. Well maybe the problem is we view them as separate things in the first place.

When I was younger I just didn't have to worry about so much stuff.


This thread makes a really good point. I'm in my late 50s now, and I'm really good with computer hardware because I started when I was 11. But I started wanting to become a sci-fi writer at 35, and it has been an uphill battle to get good at it for all the reasons described in this thread.


Agree immensely. As an adult, I worry about taxes, my health insurance, making doctor’s appointments, etc.


They said they're going to invest like $150B over five years. Which is quite a bit smaller than other big tech firms.

They have their Granite family of models, but they're small language models so surely significantly less resources are going into them.

