
This is a complete non sequitur lol. FYI, Whisper is not a streaming model, though it can, with some work, be adapted into one.


You and I agree fully, then. IMHO it's not too much work at all: ~400 LOC plus someone else's models. Of course, as in that old saw, the art is in knowing exactly which models, knowing what ONNX is, etc. etc. — that's what makes it fast.
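To make the "not too much work" claim concrete, here is a hedged sketch of one common way to wrap a batch model like Whisper into pseudo-streaming: keep a growing audio buffer, re-decode it on every chunk, and only commit the prefix that two consecutive decodes agree on. `fake_model` and `PseudoStreamer` are hypothetical stand-ins, not the real Whisper/ONNX API.

```python
def fake_model(audio: bytes) -> str:
    # Toy stand-in for a batch recognizer: one "word" per 8-byte frame.
    return " ".join(f"w{b}" for b in audio[::8])

def common_prefix(a: str, b: str) -> str:
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

class PseudoStreamer:
    """Wrap a batch f(audio) -> text model into an incremental interface."""
    def __init__(self, model):
        self.model = model
        self.audio = b""
        self.committed = ""   # prefix we consider stable
        self.last = ""        # previous full decode

    def feed(self, chunk: bytes) -> str:
        """Append a chunk, re-decode everything, return newly stable text."""
        self.audio += chunk
        decoded = self.model(self.audio)
        # Commit only the part that the last two decodes agree on, since
        # a batch model may revise its tail as more audio arrives.
        stable = common_prefix(decoded, self.last)
        new = stable[len(self.committed):]
        self.committed = stable
        self.last = decoded
        return new
```

Real adaptations also trim the buffer at silence boundaries (e.g. with a VAD) so the re-decode cost stays bounded, but the commit-the-stable-prefix idea is the same.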

The non sequitur is because I can't feel out what's going on from their perspective. The hedging left a huge range where they could have been saying "I saw the GPT-4o demo and there's another way that lets you have a more natural conversation," or "hey, think of an LSTM-style model like Silero; there are voice recognizers that let you magically get a state and the current transcription out," or, in between, "yeah, in reality the models are f(audio bytes) => transcription", which appears to be closest to your position, given your "it's not a streaming model, though it can be adapted".
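The two interfaces being contrasted above can be sketched as follows. This is an illustrative toy, not the real Silero or Whisper API: a batch recognizer is f(audio bytes) => transcription, while an LSTM-style streaming recognizer threads explicit state through each chunk.

```python
from typing import Tuple

def batch_recognize(audio: bytes) -> str:
    """Batch style: the whole clip in, the whole transcript out."""
    # Toy stand-in "model": one character per 4-byte frame.
    return "".join(chr(ord("a") + b % 26) for b in audio[::4])

# Streaming style: state in, state out, plus the transcript so far.
State = Tuple[bytes, str]  # (undecoded leftover audio, transcript so far)

def stream_recognize(state: State, chunk: bytes) -> State:
    """Feed one chunk; get back new state carrying the partial transcript."""
    leftover, transcript = state
    data = leftover + chunk
    usable = len(data) - len(data) % 4   # decode only whole frames
    transcript += batch_recognize(data[:usable])
    return (data[usable:], transcript)
```

With this shape, chunk boundaries don't matter: feeding the same audio in any sized pieces yields the same transcript as one batch call, which is the "magically get a state and current transcription out" property.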



