We have not released the weights, but the model is fully available to use on your websites or in your applications. I can see how our wording there could be misconstrued -- sorry about that.
So glad you enjoyed it! We've been able to significantly reduce those text hallucinations with a few tricks, but it seems they haven't been fully squashed. The /imagine command only works with the image at the moment, but we'll think about ways to tie that into the personality and voice. Thanks for the feedback!
I didn't know /imagine could be followed by a prompt, but similarly, I asked the avatar about its appearance and it stated it had none. You should probably give it the context of what its appearance is like. The same thing happened for questions like "Where are you?", "What are you holding?", "Who's that behind you?", etc.
This is so obvious now that you say it (* facepalm *). We definitely need to give the LLM context on the appearance (both from the initial image as well as any /imagine updates during the call). Thanks for pointing it out!
We have not released the weights, but the model is fully available to use on your websites or in your applications. I can see how our wording there could be misconstrued -- sorry about that. You can absolutely create a vTuber persona. The link in the post is still live if you want to create one (as simple as uploading an image, selecting a voice, and defining the personality). We even have a prebuilt UI you can embed in a website, just like a YouTube video.
Haha, I kind of get that reaction. Convincing the world "this was hard to do" is generally not easy. Re: user uploads, we're operating in good faith at the moment (no built-in IP moderation). This hasn't been an issue so far. Current pricing reflects our operating costs. Each end-user gets a dedicated GPU for the duration of a call, which is expensive. Advancements on the model-side should eventually allow us to parallelize this.
Thank you! Impressive demo with OVA. Still feels very snappy, even fully local. It will be interesting to see how video plays out in that regard. I think we're still at least a year away from the models being good enough and small enough that they can run on consumer hardware. We compared 6 of the major voice providers on TTFB, but didn't try Sesame -- we'll need to give that one a look. https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...
I wonder how it would come across with the right voice. We're focused on building out the video layer tech, but at the end of the day, the voice is also pretty important for a positive experience.
Thanks for the feedback. The current avatars use an STT-LLM-TTS pipeline (rather than true speech-to-speech), which limits nuanced understanding of pronunciations. Speech-to-speech models should solve this problem. (The ones we've tried so far have counterintuitively not been fast enough.)
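To make the limitation concrete, here's a minimal sketch of a cascaded STT-LLM-TTS turn. All three stage functions are hypothetical stubs standing in for real speech-to-text, language-model, and text-to-speech services -- this is not our actual pipeline code, just an illustration of the shape.

```python
# Minimal sketch of a cascaded STT -> LLM -> TTS pipeline.
# transcribe, generate_reply, and synthesize are placeholder stubs,
# not real service calls.

def transcribe(audio_chunk: bytes) -> str:
    # Placeholder STT stage: a real system would call a streaming ASR model.
    return audio_chunk.decode("utf-8", errors="ignore")

def generate_reply(transcript: str) -> str:
    # Placeholder LLM stage: a real system would prompt a language model.
    return f"You said: {transcript}"

def synthesize(text: str) -> bytes:
    # Placeholder TTS stage: a real system would stream audio from a voice model.
    return text.encode("utf-8")

def run_turn(audio_chunk: bytes) -> bytes:
    # One conversational turn through the cascade. Because the reply is
    # re-synthesized from a plain-text transcript, pronunciation nuance in
    # the user's original audio never reaches the TTS stage -- the
    # limitation mentioned above.
    transcript = transcribe(audio_chunk)
    reply = generate_reply(transcript)
    return synthesize(reply)
```

The key point is that the transcript is the only channel between stages, so anything text can't carry (intonation, how a name was pronounced) is dropped; a true speech-to-speech model avoids that bottleneck.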
This isn't natively supported -- we are continuously streaming frames throughout the conversation session that are generated in real-time. If you were building your own conversational AI pipeline (e.g. using our LiveKit integration), I suppose it would be possible to route things like this with your own logic. But it would probably include jump cuts and not look as good.
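If you did build your own routing logic, it might look something like the sketch below: a scheduler that picks which source the next frame comes from. The source names and helpers here are invented placeholders (not part of LiveKit or our SDK); it just illustrates why switching sources mid-stream produces jump cuts.

```python
# Hypothetical sketch of custom frame routing in a self-built pipeline.
# "live" and "clip" sources and the schedule list are invented for
# illustration; they are not part of any real SDK.

def route_frames(frame_sources, schedule):
    # Yield one frame per schedule entry, pulled from whichever source
    # that entry names. Every switch between sources is a jump cut,
    # since consecutive frames no longer come from one continuous
    # real-time generation.
    for source_name in schedule:
        yield frame_sources[source_name]()

# Two toy sources: a real-time generator and a prerecorded clip.
generated = iter(["gen-0", "gen-1", "gen-2"])
prerecorded = iter(["clip-0", "clip-1"])

sources = {
    "live": lambda: next(generated),
    "clip": lambda: next(prerecorded),
}

# Cut from live generation to a clip and back: a jump cut at each switch.
frames = list(route_frames(sources, ["live", "live", "clip", "live"]))
```

The awkward part is exactly what the reply above notes: the "live" source keeps its own temporal continuity, so splicing in frames from elsewhere and cutting back lands on a pose that doesn't match the last spliced frame.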
Thanks! And sorry! I can see how our wording there could be misconstrued. With a real-time model, the streaming infrastructure matters almost as much as the weights themselves. It will be interesting to see how easily they can be commoditized in the future.