The only models that do what you're poking at holistically are 4o (claimed) and that French company's 7B one. They're also bleeding edge: either unreleased, or released and way wilder. E.g., the French one interrupts too much, and occasionally screams back in an alien language.
Until these, you'd use echo cancellation to try to allow interruptible dialogue, and that's unsolved: you need a consistently cooperative chipset vendor for that (read: it wasn't possible even at scale, with carrots, presumably sticks, and much cajoling. So it works consistently on iPhones.)
The partial results are obtained by running inference on the entire audio so far, and silence is determined by VAD, on every stack I've seen that's described as streaming.
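To make that concrete, here's a minimal sketch of that loop: re-run inference on the full audio so far each tick, and let the VAD decide when a segment ends. `transcribe` and `is_speech` are trivial stubs standing in for a real model (e.g. Whisper) and a real VAD (e.g. Silero); the frame sizes and the silence threshold are made-up illustration values, not anyone's actual stack.

```python
# Sketch: "streaming" ASR as repeated full-audio inference + VAD segmenting.
# transcribe() and is_speech() are stubs, not real models.

def transcribe(audio: list[int]) -> str:
    # stub: pretend each sample decodes to one character
    return "".join(chr(ord('a') + (s % 26)) for s in audio)

def is_speech(frame: list[int]) -> bool:
    # stub VAD: any nonzero sample counts as speech
    return any(frame)

def stream(frames, silence_frames_to_end=2):
    """Yield (partial_transcript, is_final) as frames arrive."""
    audio, silent = [], 0
    for frame in frames:
        audio.extend(frame)
        silent = 0 if is_speech(frame) else silent + 1
        final = silent >= silence_frames_to_end
        # partial result = inference over ALL audio so far, not a delta
        yield transcribe(audio), final
        if final:
            audio, silent = [], 0  # VAD segment break: start fresh

frames = [[1, 2], [3, 4], [0, 0], [0, 0]]
results = list(stream(frames))
```

Every yield before the VAD fires is a partial that may revise earlier output, which is exactly why these stacks feel "streaming" even though the model itself is f(all audio) => transcription.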
I find it hard to believe that Google and Apple specifically, and every other audio stack I've seen, are all choosing to do "not the best they can."
This is exactly what Google ASR does. Give it a try and watch how the results flow back to you, it certainly is not waiting for VAD segment breaking. I should know.
Streaming used to be something people cared about more. VAD is always part of those systems as well; you want it to start segments and to hard cut off, but it's just the starting point. It's kind of a big gap (to me) in the models available since Whisper came out, partly, I think, because streaming adds complexity to using the model, and latency has to be tuned/traded off against quality.
Thank you for your insight. It confirms some of my suspicions working in this area (you wouldn't happen to know anybody who makes anything more modern than the Respeaker 4-mic array?). My biggest problem is even with AEC, the voice output is triggering the VAD and so it continually thinks it's getting interrupted by a human. My next attempt will be to try to only signal true VAD if there's also sound coming from anywhere but behind, where the speaker is. It's been an interesting challenge so far though.
Re: mic, alas, no. BigCo kinda sucked; I had to go way out of my way to get to work on interesting stuff, it never mattered, and even when you did, you never got past the immediate wall of your own org, except for brief moments. I.e., we never had anyone even close to knowing anything about the microphones we'd be using; people were shocked to hear what AEC was, even when what we were working on was a marketing tentpole for Pixel. Funny place.
I'm really glad you saw this. So, so, so much time and hope was wasted there on the Nth team of XX people saying "how hard can it be? given physics and a lil ML, we can do $X", and inevitably reality was far more complicated, and it's important to me to talk about it so other people get a sense it's not them, it's the problem. Even unlimited resources and your Nth fresh try can fail.
FWIW my mind's been grinding on how I'd get my little on-device Silero x Whisper gAssistant replica pulling off something akin to the GPT-4o demo. I keep coming back to speaker ID: replace Silero with some of the newer models I'm seeing hit ONNX. Super handwave-y, but I can't help thinking this does an end-around both AEC being shit on presumably most non-Apple devices, and the poor interactions you get from juggling two things that operate differently (VAD and AEC). """Just""" detect when there are >= 2 simultaneous speakers with > 20% confidence --- of course, tons of bits are missing from there; ideally you'd be resilient to e.g. a TV in the background. Sigh. Tough problems.
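The ">= 2 simultaneous speakers with > 20% confidence" idea could look something like this. Everything here is hypothetical: I'm assuming some diarization/overlap model hands you per-frame, per-speaker confidences (the dict-of-floats shape and the speaker names are mine), and a debounce over a few frames to avoid flapping.

```python
# Hedged sketch of barge-in detection via speaker counting.
# frame_probs: per-frame {speaker_label: confidence} from a hypothetical
# diarization/overlap model; the probabilities below are made up.

def active_speakers(frame_probs: dict, floor: float = 0.2) -> list:
    """Speakers whose confidence clears the floor in this frame."""
    return [spk for spk, p in frame_probs.items() if p > floor]

def barge_in(frames, min_frames: int = 3, floor: float = 0.2) -> bool:
    """True once >= 2 speakers are simultaneously active for min_frames
    consecutive frames (debounce against one-frame glitches)."""
    streak = 0
    for probs in frames:
        streak = streak + 1 if len(active_speakers(probs, floor)) >= 2 else 0
        if streak >= min_frames:
            return True
    return False

# assistant's TTS speaking alone, then the user talks over it
frames = [{"tts": 0.9, "user": 0.05}] * 5 + [{"tts": 0.8, "user": 0.4}] * 3
```

The nice property is that the assistant's own TTS being "heard" is expected and ignored; only a second concurrent speaker counts, so you never need the AEC to fully cancel your own output.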
I'm not particularly experienced, but I did have good experiences with Picovoice's services. It's a business specialised in programmatically accessible audio services: TTS, VAD, etc.
They have a VAD that is trained on a 10-second clip of -your- voice, and is then only activated by -your- voice. It works quite well in my experience, although it does add a little additional latency before it starts detecting your voice. That's reasonably easy to overcome by keeping a 1s buffer of audio ready at all times: when the VAD goes active, just prepend the past 100-200ms of the buffer to the recorded audio. Works perfectly fine; it's just that the UI showing "voice detected" or "voice not detected" might lag behind by 100-200ms.
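The pre-roll trick above is just a ring buffer. A minimal sketch, assuming 20ms frames and a 200ms pre-roll (both numbers are my assumptions, tune to taste):

```python
# Sketch: rolling ~1s history; when the VAD fires, seed the recording
# with the last 200ms so the utterance onset isn't clipped.
from collections import deque

FRAME_MS = 20
BUFFER_FRAMES = 1000 // FRAME_MS   # ~1s of rolling history
PREROLL_FRAMES = 200 // FRAME_MS   # last 200ms to prepend

history = deque(maxlen=BUFFER_FRAMES)
recording = []
recording_active = False

def on_frame(frame: bytes, vad_active: bool):
    global recording_active
    history.append(frame)
    if vad_active and not recording_active:
        # VAD just fired: recover the onset from the rolling buffer
        recording.extend(list(history)[-PREROLL_FRAMES:])
        recording_active = True
    elif recording_active:
        if vad_active:
            recording.append(frame)
        else:
            recording_active = False  # segment ended

# simulate 60 frames where the VAD only fires from frame 55 on
for i in range(60):
    on_frame(bytes([i % 256]), vad_active=(i >= 55))
```

After this run the recording starts at frame 46, i.e. 200ms before the VAD triggered, which is exactly the lag the "voice detected" UI indicator can't hide.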
Source: I worked on a VAD + whisper + LLM demo project this year and ran into some VAD issues myself too.
You and I agree fully, then. IMHO it's not too much work, at all, 400 LOC and someone else's models. Of course, as in that old saw, the art is knowing exactly those models, knowing what ONNX is, etc. etc., that's what makes it fast.
The non sequitur is because I can't feel out what's going on from their perspective. The hedging left a huge range where they could have been saying "I saw the GPT-4o demo and there's another way that lets you have more natural conversation", or "hey, think like an LSTM model, like Silero: there are voice recognizers that let you magically get a state and a current transcription out", or, in between, "yeah, in reality the models are f(audio bytes) => transcription", which appears to be closer to your position, given your "it's not a streaming model, though it can be adapted".