Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST (github.com/toverainc)
13 points by kkielhofner on May 23, 2023 | 13 comments
Hey HN!

Willow Inference Server (WIS) is a focused and highly optimized language inference server implementation. Our goal is to "automagically" enable performant, cost-effective self-hosting of released state-of-the-art/best-of-breed models for speech and language tasks:

Primarily targeting CUDA (works on CPU too) with support for low-end (cheap) devices such as the Tesla P4, GTX 1060, and up. Don't worry - it screams on an RTX 4090 too! (See benchmarks on GitHub.)

Memory optimized - all three default Whisper (base, medium, large-v2) models loaded simultaneously with TTS support inside of 6GB VRAM. LLM support defaults to int4 quantization (conversion scripts included). ASR/STT + TTS + Vicuna 13B require roughly 18GB VRAM. Less for 7B, of course!
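These VRAM figures follow from a standard back-of-envelope rule (bytes ≈ parameters × bits / 8, plus runtime overhead); a rough sketch, with the helper name and the overhead-free estimate being illustrative assumptions rather than WIS internals:

```python
# Back-of-envelope weight memory for an LLM at a given bit width.
# Real usage is higher: KV cache, activations, and framework overhead
# all add on top of the raw weights.
def llm_vram_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits / 8."""
    return params_billions * 1e9 * bits / 8 / 1e9

fp16_13b = llm_vram_gb(13, 16)  # ~26 GB: too big for a 24 GB consumer card
int4_13b = llm_vram_gb(13, 4)   # ~6.5 GB: leaves headroom for ASR + TTS
int4_7b = llm_vram_gb(7, 4)     # ~3.5 GB: "less for 7B, of course"
print(fp16_13b, int4_13b, int4_7b)
```

This is why int4 matters: quartering the bit width is what lets Vicuna 13B plus the full ASR/TTS stack fit in roughly 18 GB.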

ASR. Heavy emphasis - Whisper optimized for very high quality as-close-to-real-time-as-possible speech recognition via a variety of means (Willow, WebRTC, POST a file, integration with devices and client applications, etc). Results in hundreds of milliseconds or less for most intended speech tasks. See YouTube WebRTC demo[0].

TTS. Primarily provided for assistant tasks (like Willow!) and visually impaired users.

LLM. Optionally pass input through a provided/configured LLM for question answering, chatbot, and assistant tasks. Currently supports LLaMA derivatives with a strong preference for Vicuna (I like 13B). Built-in support for int4 quantization to conserve GPU memory.
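For intuition, here is a minimal sketch of symmetric int4 weight quantization - the general idea of mapping floats to a small integer range with a shared scale, not the actual AutoGPTQ algorithm (which calibrates per-group scales against real data):

```python
# Toy symmetric int4 quantization: map floats to integers in [-8, 7]
# using one scale per group of weights.
def quantize_int4(weights):
    """Return (quantized ints, scale); scale maps int 7 to the max |weight|."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int4 codes."""
    return [v * scale for v in q]

w = [0.7, -0.35, 0.14, 0.0]
q, s = quantize_int4(w)
approx = dequantize(q, s)  # close to w, at 4 bits per weight instead of 16
```

Each weight now costs 4 bits instead of 16, at the price of a small rounding error bounded by half the scale.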

Support for a variety of transports: REST, WebRTC, and WebSockets (primarily for LLM).

Performance and memory optimized. Leverages CTranslate2 for Whisper support and AutoGPTQ for LLMs.

Willow support. WIS powers the Tovera-hosted best-effort example server that Willow users enjoy.

Support for WebRTC - stream audio in real-time from browsers or WebRTC applications to optimize quality and response time. Heavily optimized for long-running sessions using WebRTC audio track management. Leave your session open for days at a time and have self-hosted ASR transcription within hundreds of milliseconds while conserving network bandwidth and CPU!

Support for custom TTS voices. With relatively small audio recordings WIS can create and manage custom TTS voices. See API documentation for more information.

Much like the release of Willow[1] last week, this is an early release, but we had a great response from HN and are looking forward to hearing what everyone thinks!

[0] - https://www.youtube.com/watch?v=PxCO5eONqSQ

[1] - https://github.com/toverainc/willow



Thanks for this, I've been looking forward to it.

I used Amazon Echo devices during their first 6 months of public availability before I got sufficiently creeped out to pull the plug permanently. Since then, I've wished for something similar that wasn't a 'black box' doing unknown things with my data.

When you posted about Willow here on HN, I immediately purchased an ESP-BOX (glad I didn't wait, they sold out quickly!)

I have a bunch of unused Raspberry Pi CM4s that I stocked up on a few years back, so I loaded Home Assistant onto one of them and connected Willow to it. I didn't have anything to automate yet, so all I got was error messages about missing HA intents. Then finally, last night, some Zigbee stuff got delivered and now, after an 8 year hiatus, I have a voice assistant again, and it doesn't creep me out.

After a couple hours last night messing around with HA and researching, I have some more stuff on the way. I'm going to be able to automate my window-mounted air conditioner using an IR device, and that same device includes an RF component so I can control my 433 MHz ceiling fan (Broadlink RM4 Pro, for any interested reader). I have some temperature sensors on the way to assist with all that.

Home Assistant has a commercial side with a cloud offering that lets me control this setup from anywhere for about $60/year. It can even tie in to my phone to run automations based on when I leave or return. And all of this is open source, with none of my data going anywhere (except to Tovera's inference server, which I will shortly replace with my own).

I also saw your issue comment last night about a Willow Application Server. The idea is exciting and I hope it happens; I'm very interested in it.

Thanks again for what you're doing. I hope you see success with this; the entire home computing/home automation ecosystem will benefit in the long term.


Thanks!

One note - if you're going to self-host WIS (YOU SHOULD), I suggest making good use of the more powerful hardware and putting HA on it as well.

HA on Raspberry Pi (while popular) is pretty slow when compared to the kinds of response times Willow provides. It's frustrating to see perfect speech recognition come back from Willow + WIS in 200ms (or whatever) and then take HA another 150ms to do something with it :).

We're excited about WAS too - just need to think about it a bit more before we start getting to it!


Thanks for the tips. I do plan to self-host WIS. I only have a single GPU currently, on my home desktop, and I'm going to test with this. Next step is going to be looking into a standalone fanless GPU system and determining whether I want to spend money on that or look at used hardware. The hardware adventure is part of the fun.


WIS has been tested with WSL (if you're running Windows) and supports anything from a GTX 1060 3GB up to H100 so you can certainly throw it on your gaming desktop to start.


Looks fantastic! A question re. benchmarks: the "realtime multiple" means "how many times faster the model is processing the audio compared to real time" here? (I.e. the opposite of "real-time factor" sometimes used in speech recognition contexts?)

Also bigger beam sizes give better quality I guess?

Tnx!


Thanks!

Yes, "realtime multiple" is audio/speech length divided by actual inference time.
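Spelled out as a one-liner (the function name and the sample durations are illustrative, not from WIS):

```python
# "Realtime multiple" as defined above: audio length divided by inference
# time - i.e. the inverse of the "real-time factor" (RTF) often quoted in
# speech recognition papers, where lower RTF is better.
def realtime_multiple(audio_seconds: float, inference_seconds: float) -> float:
    return audio_seconds / inference_seconds

rtm = realtime_multiple(5.0, 0.25)  # e.g. a 5 s utterance decoded in 250 ms
rtf = 1 / rtm                       # the equivalent real-time factor
print(f"{rtm:.0f}x realtime (RTF {rtf})")  # 20x realtime (RTF 0.05)
```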

You got it! The demo video is showing the slowest response times because it is using the highest quality/accuracy settings available with Whisper (large-v2, beam 5). Willow devices use medium with beam 1 by default for comparison, and those responses are measured in the 500 milliseconds or less range (again depending on speech length) across a wide variety of new and old CUDA hardware. Some sample benchmarks here[0].

Applications using WIS (including Willow) can provide model settings on a per-request basis to balance quality vs latency depending on the task.

[0] - https://github.com/toverainc/willow-inference-server#benchma...


Great stuff. Which video card are you running in the demo video?


The fancy: RTX 3090

The not-so-fancy: it's a 10 year old Xeon...

The benchmarks for the other GPUs were across a pretty wide range of hardware. The GTX 1060, for example, is in a REALLY old Xeon that doesn't even support AVX...

I have hardware arriving this week that I purchased from eBay for $320 with shipping and tax - Dell Precision with i7-7700, 16GB RAM, 512GB SSD[0] and GTX 1070[1]. Seems to be the current "best bang for the buck" if you don't have any of this stuff just sitting around.

The idea is because Willow devices are so cost-effective you can buy this hardware to host WIS, HA, other homelab stuff, etc and still come out ahead (even with power) compared to Raspberry Pi:

Qty 5 Willows - $270 with power supplies (support wake word, far-field audio, LCD display, speaker, mics, etc)

WIS hardware - $320 (or less - it's up to you!)

Total cost: $590

Six Raspberry Pis (no wake word, poor audio quality, unusably slow, cumbersome, very DIY):

$720 (retail kit with board, SD, LCD display, mic array, speaker, enclosure). MSRP - can't actually be purchased for that.

I expect this hardware to do well (returning well below 500ms for Willow speech locally with excellent quality) and I'll be documenting the power consumption optimization work I'm doing over the weekend.

[0] - https://www.ebay.com/itm/234908676168

[1] - https://www.ebay.com/itm/115536328587
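The arithmetic behind the comparison, for the curious (the per-Pi-kit price is inferred from the quoted $720 total for six kits; it isn't stated directly):

```python
# Cost comparison from the comment above, using the quoted prices.
willow_devices_total = 270       # five Willow units with power supplies
wis_server = 320                 # used Dell Precision i7-7700 + GTX 1070 (eBay)
willow_setup = willow_devices_total + wis_server

pi_kits = 6
pi_kit_msrp = 720 // pi_kits     # ~$120/kit, inferred from the $720 total
pi_setup = pi_kits * pi_kit_msrp

print(willow_setup, pi_setup)    # 590 720
```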


Brilliant. Willow looks like a game-changer to me, thanks a million!


Thanks! We believe we have a practical, viable approach to provide an Alexa equivalent (or better) experience with self-hosted privacy and control. Not to mention all of the other stuff you can do with WIS :).


It looks amazing, I'll definitely try it out! Quick question: would a GTX 960 also work for inference? I happen to have one lying around and could whip up a system with it. Thanks for the great work - I think OSS has a lot to add, especially around the smart home.


Thank you!

Unfortunately the oldest GPUs we support are Pascal and your GTX 960 is Maxwell. We have this cutoff for GPU hardware support for two reasons:

1) Nvidia doesn't support Maxwell with recent versions of CUDA (fair enough, Maxwell is 9 years old).

2) Among Maxwell cards, anything other than a GTX Titan X doesn't have the required VRAM.

That said, we do support and have recommended hardware configurations for cards such as the Tesla P4, GTX 1070, etc which can be had for roughly $100 on the used market.

WIS does support CPU-only configurations, but GPUs offer such significant fundamental architecture improvements that a $100, six-year-old GPU will best the fastest CPUs in the world for this application, at significantly less cost and power usage. A CPU-only configuration is fundamentally incapable of providing our target end-user experience - self-hosted, private Alexa without compromise.

Copied from a comment below:

I have hardware arriving this week that I purchased from eBay for $320 with shipping and tax - Dell Precision with i7-7700, 16GB RAM, 512GB SSD[0] and GTX 1070[1]. Seems to be the current "best bang for the buck" if you don't have any of this stuff just sitting around.

The idea is because Willow devices are so cost-effective you can buy this hardware to host WIS, HA, other homelab stuff, etc and still come out ahead (even with power) compared to Raspberry Pi:

Qty 5 Willows - $270 with power supplies (support wake word, far-field audio, LCD display, speaker, mics, etc)

WIS hardware - $320 (or less - it's up to you!)

Total cost: $590

Six Raspberry Pis (no wake word, poor audio quality, unusably slow, cumbersome, very DIY):

$720 (retail kit with board, SD, LCD display, mic array, speaker, enclosure). MSRP - can't actually be purchased for that.

I expect this hardware to do well (returning well below 500ms for Willow speech locally with excellent quality) and I'll be documenting the power consumption optimization work I'm doing over the weekend.

[0] - https://www.ebay.com/itm/234908676168

[1] - https://www.ebay.com/itm/115536328587


Thanks for the reply - I didn't know about CUDA versions. Makes sense.



