Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST (github.com/toverainc)
13 points by kkielhofner on May 23, 2023 | 13 comments
Hey HN!

Willow Inference Server (WIS) is a focused and highly optimized language inference server implementation. Our goal is to "automagically" enable performant, cost-effective self-hosting of released state-of-the-art/best-of-breed models for speech and language tasks:

Primarily targeting CUDA (works on CPU too) with support for low-end (cheap) devices such as the Tesla P4, GTX 1060, and up. Don't worry - it screams on an RTX 4090 too! (See benchmarks on GitHub.)

Memory optimized - all three default Whisper (base, medium, large-v2) models loaded simultaneously with TTS support inside of 6GB VRAM. LLM support defaults to int4 quantization (conversion scripts included). ASR/STT + TTS + Vicuna 13B require roughly 18GB VRAM. Less for 7B, of course!
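These VRAM figures follow from a standard back-of-envelope rule (bytes ≈ parameters × bits / 8, plus runtime overhead); a rough sketch, with the helper name and the overhead-free estimate being illustrative assumptions rather than WIS internals:

```python
# Back-of-envelope weight memory for an LLM at a given bit width.
# Real usage is higher: KV cache, activations, and framework overhead
# all add on top of the raw weights.
def llm_vram_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits / 8."""
    return params_billions * 1e9 * bits / 8 / 1e9

fp16_13b = llm_vram_gb(13, 16)  # ~26 GB: too big for a 24 GB consumer card
int4_13b = llm_vram_gb(13, 4)   # ~6.5 GB: leaves headroom for ASR + TTS
int4_7b = llm_vram_gb(7, 4)     # ~3.5 GB: "less for 7B, of course"
print(fp16_13b, int4_13b, int4_7b)
```

This is why int4 matters: quartering the bit width is what lets Vicuna 13B plus the full ASR/TTS stack fit in roughly 18 GB.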

ASR. Heavy emphasis - Whisper optimized for very high quality as-close-to-real-time-as-possible speech recognition via a variety of means (Willow, WebRTC, POST a file, integration with devices and client applications, etc). Results in hundreds of milliseconds or less for most intended speech tasks. See YouTube WebRTC demo[0].

TTS. Primarily provided for assistant tasks (like Willow!) and visually impaired users.

LLM. Optionally pass input through a provided/configured LLM for question answering, chatbot, and assistant tasks. Currently supports LLaMA derivatives with a strong preference for Vicuna (I like 13B). Built-in support for int4 quantization to conserve GPU memory.
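For intuition, here is a minimal sketch of symmetric int4 weight quantization - the general idea of mapping floats to a small integer range with a shared scale, not the actual AutoGPTQ algorithm (which calibrates per-group scales against real data):

```python
# Toy symmetric int4 quantization: map floats to integers in [-8, 7]
# using one scale per group of weights.
def quantize_int4(weights):
    """Return (quantized ints, scale); scale maps int 7 to the max |weight|."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int4 codes."""
    return [v * scale for v in q]

w = [0.7, -0.35, 0.14, 0.0]
q, s = quantize_int4(w)
approx = dequantize(q, s)  # close to w, at 4 bits per weight instead of 16
```

Each weight now costs 4 bits instead of 16, at the price of a small rounding error bounded by half the scale.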

Support for a variety of transports: REST, WebRTC, and WebSockets (primarily for LLM).

Performance and memory optimized. Leverages CTranslate2 for Whisper support and AutoGPTQ for LLMs.

Willow support. WIS powers the Tovera-hosted best-effort example server that Willow users enjoy.

Support for WebRTC - stream audio in real-time from browsers or WebRTC applications to optimize quality and response time. Heavily optimized for long-running sessions using WebRTC audio track management. Leave your session open for days at a time and have self-hosted ASR transcription within hundreds of milliseconds while conserving network bandwidth and CPU!

Support for custom TTS voices. With relatively small audio recordings WIS can create and manage custom TTS voices. See API documentation for more information.

Much like the release of Willow[1] last week, this is an early release, but we had a great response from HN and are looking forward to hearing what everyone thinks!

[0] - https://www.youtube.com/watch?v=PxCO5eONqSQ

[1] - https://github.com/toverainc/willow



Thanks for this, I've been looking forward to it.

I used Amazon Echo devices during their first 6 months of public availability before I got sufficiently creeped out to pull the plug permanently. Since then, I've wished for something similar that wasn't a 'black box' doing unknown things with my data.

When you posted about Willow here on HN, I immediately purchased an ESP-BOX (glad I didn't wait, they sold out quickly!)

I have a bunch of unused Raspberry Pi CM4s that I stocked up on a few years back, so I loaded Home Assistant onto one of them and connected Willow to it. I didn't have anything to automate yet, so all I got was error messages about missing HA intents. Then finally, last night, some Zigbee stuff got delivered and now, after an 8 year hiatus, I have a voice assistant again, and it doesn't creep me out.

After a couple hours last night messing around with HA and researching, I have some more stuff on the way. I'm going to be able to automate my window-mounted air conditioner using an IR device, and that same device includes an RF component so I can control my 433 MHz ceiling fan (Broadlink RM4 Pro, for any interested reader). I have some temperature sensors on the way to assist with all that.

Home Assistant has a commercial side with a cloud offering that lets me control this setup from anywhere for about $60/year. It can even tie in to my phone to run automations based on when I leave or return. And all of this is open source, with none of my data going anywhere (except to Tovera's inference server, which I will shortly replace with my own).

I also saw your issue comment last night about a Willow Application Server. The idea is exciting and I hope it happens; I'm very interested in it.

Thanks again for what you're doing. I hope you see success with this; the entire home computing/home automation ecosystem will benefit in the long term.


Thanks!

One note - if you're going to self-host WIS (YOU SHOULD), I suggest making good use of the more powerful hardware and putting HA on it as well.

HA on Raspberry Pi (while popular) is pretty slow when compared to the kinds of response times Willow provides. It's frustrating to see perfect speech recognition come back from Willow + WIS in 200ms (or whatever) and then take HA another 150ms to do something with it :).

We're excited about WAS too - just need to think about it a bit more before we start getting to it!


Thanks for the tips. I do plan to self-host WIS. I only have a single GPU currently, on my home desktop, and I'm going to test with this. Next step is going to be looking into a standalone fanless GPU system and determining whether I want to spend money on that or look at used hardware. The hardware adventure is part of the fun.


WIS has been tested with WSL (if you're running Windows) and supports anything from a GTX 1060 3GB up to H100 so you can certainly throw it on your gaming desktop to start.


Looks fantastic! A question re. benchmarks: the "realtime multiple" means "how many times faster the model is processing the audio compared to real time" here? (I.e. the opposite of "real-time factor" sometimes used in speech recognition contexts?)

Also bigger beam sizes give better quality I guess?

Tnx!


Thanks!

Yes, "realtime multiple" is audio/speech length divided by actual inference time.
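Spelled out as a one-liner (the function name and the sample durations are illustrative, not from WIS):

```python
# "Realtime multiple" as defined above: audio length divided by inference
# time - i.e. the inverse of the "real-time factor" (RTF) often quoted in
# speech recognition papers, where lower RTF is better.
def realtime_multiple(audio_seconds: float, inference_seconds: float) -> float:
    return audio_seconds / inference_seconds

rtm = realtime_multiple(5.0, 0.25)  # e.g. a 5 s utterance decoded in 250 ms
rtf = 1 / rtm                       # the equivalent real-time factor
print(f"{rtm:.0f}x realtime (RTF {rtf})")  # 20x realtime (RTF 0.05)
```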

You got it! The demo video is showing the slowest response times because it is using the highest quality/accuracy settings available with Whisper (large-v2, beam 5). Willow devices use medium with beam 1 by default for comparison, and those responses are measured in the 500 milliseconds or less range (again depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

[0] - https://github.com/toverainc/willow-inference-server#benchma...


Great stuff. Which video card are you running in the demo video?


The fancy: RTX 3090

The not-so-fancy: it's a 10 year old Xeon...

The benchmarks for the other GPUs were across a pretty wide range of hardware. The GTX 1060, for example, is in a REALLY old Xeon that doesn't even support AVX...

I have hardware arriving this week that I purchased from eBay for $320 with shipping and tax - Dell Precision with i7-7700, 16GB RAM, 512GB SSD[0] and GTX 1070[1]. Seems to be the current "best bang for the buck" if you don't have any of this stuff just sitting around.

The idea is because Willow devices are so cost-effective you can buy this hardware to host WIS, HA, other homelab stuff, etc and still come out ahead (even with power) compared to Raspberry Pi:

Qty 5 Willows - $270 with power supplies (support wake word, far-field audio, LCD display, speaker, mics, etc)

WIS hardware - $320 (or less - it's up to you!)

Total cost: $590

Six Raspberry Pis (no wake word, poor audio quality, unusably slow, cumbersome, very DIY):

$720 (retail kit with board, SD, LCD display, mic array, speaker, enclosure). MSRP - can't actually be purchased for that.

I expect this hardware to do well (returning well below 500ms for Willow speech locally with excellent quality) and I'll be documenting the power consumption optimization work I'm doing over the weekend.

[0] - https://www.ebay.com/itm/234908676168

[1] - https://www.ebay.com/itm/115536328587
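The arithmetic behind the comparison, for the curious (the per-Pi-kit price is inferred from the quoted $720 total for six kits; it isn't stated directly):

```python
# Cost comparison from the comment above, using the quoted prices.
willow_devices_total = 270       # five Willow units with power supplies
wis_server = 320                 # used Dell Precision i7-7700 + GTX 1070 (eBay)
willow_setup = willow_devices_total + wis_server

pi_kits = 6
pi_kit_msrp = 720 // pi_kits     # ~$120/kit, inferred from the $720 total
pi_setup = pi_kits * pi_kit_msrp

print(willow_setup, pi_setup)    # 590 720
```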


Brilliant. Willow looks like a game-changer to me, thanks a million!


Thanks! We believe we have a practical, viable approach to provide an Alexa equivalent (or better) experience with self-hosted privacy and control. Not to mention all of the other stuff you can do with WIS :).


It looks amazing, I'll definitely try it out! Quick question: would a GTX 960 also work for inference? I happen to have one lying around and could whip up a system with it. Thanks for the great work - I think OSS has a lot to add, especially around the smart home.


Thank you!

Unfortunately the oldest GPUs we support are Pascal and your GTX 960 is Maxwell. We have this cutoff for GPU hardware support for two reasons:

1) Nvidia doesn't support Maxwell with recent versions of CUDA (fair enough, Maxwell is 9 years old).

2) Among Maxwell cards, anything other than a GTX Titan X doesn't have the required VRAM.

That said, we do support and have recommended hardware configurations for cards such as the Tesla P4, GTX 1070, etc which can be had for roughly $100 on the used market.

WIS does support CPU-only configurations, but GPUs offer such significant fundamental architecture improvements that a $100, six-year-old GPU will best the fastest CPUs in the world for this application, at significantly less cost and power usage. A CPU-only configuration is fundamentally incapable of providing our target end-user experience - self-hosted, private Alexa without compromise.

Copied from a comment below:

I have hardware arriving this week that I purchased from eBay for $320 with shipping and tax - Dell Precision with i7-7700, 16GB RAM, 512GB SSD[0] and GTX 1070[1]. Seems to be the current "best bang for the buck" if you don't have any of this stuff just sitting around.

The idea is because Willow devices are so cost-effective you can buy this hardware to host WIS, HA, other homelab stuff, etc and still come out ahead (even with power) compared to Raspberry Pi:

Qty 5 Willows - $270 with power supplies (support wake word, far-field audio, LCD display, speaker, mics, etc)

WIS hardware - $320 (or less - it's up to you!)

Total cost: $590

Six Raspberry Pis (no wake word, poor audio quality, unusably slow, cumbersome, very DIY):

$720 (retail kit with board, SD, LCD display, mic array, speaker, enclosure). MSRP - can't actually be purchased for that.

I expect this hardware to do well (returning well below 500ms for Willow speech locally with excellent quality) and I'll be documenting the power consumption optimization work I'm doing over the weekend.

[0] - https://www.ebay.com/itm/234908676168

[1] - https://www.ebay.com/itm/115536328587


Thanks for the reply - I didn't know about CUDA versions. Makes sense.



