Reasoning models spend a whole bunch of time reasoning before returning an answer. I was toying with QwQ 32B last night and hit one question where it spent 18 minutes at 13 tok/s in the <think> phase before returning a final answer. I value local compute, but reasoning models aren't terribly feasible at this speed, since you don't really need to see the first 90% of their thinking output.
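For a sense of scale, that <think> phase alone is roughly 14K tokens most people will never read. A minimal sketch of the arithmetic, and of hiding the thinking span from the stream (generate_chunks() is a hypothetical stand-in for whatever streaming API you use):

    # Rough scale of the hidden reasoning: ~18 minutes at ~13 tok/s.
    think_tokens = 18 * 60 * 13   # ~14,000 tokens you never really need to read
    print(f"~{think_tokens} reasoning tokens before the answer even starts")

    # Sketch of suppressing the <think> span while streaming; generate_chunks()
    # is a hypothetical generator yielding text chunks from your runtime.
    def stream_answer(generate_chunks):
        in_think = False
        for chunk in generate_chunks():
            if "<think>" in chunk:
                in_think = True
            elif "</think>" in chunk:
                in_think = False
            elif not in_think:
                yield chunk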
Exactly! I run it on my old Dell T7910 workstation (2x 2697A V4, 640GB RAM) that I built for way less than $1k. But so what, it's about ~2 tokens/s. Just like you said, it's cool that it runs at all, but that's it.
It's meant to be a test/development setup for people to prepare the software environment and tooling for running the same on more expensive hardware. Not to be fast.
I remember people trying to run the game Crysis using CPU rendering. They got it to run and move around. People did it for fun and the "cool" factor. But no one actually played the game that way.
It's the same thing here. CPUs can run it but only as a gimmick.
> It's the same thing here. CPUs can run it but only as a gimmick.
No, that's not true.
I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM / memory bandwidth than compute.
A crappy 2022 mid-range Android CPU in a Pixel Fold gets you roughly the same speed as a 2024 Apple iPhone's GPU, with the Metal acceleration that dozens of very smart people hack on.
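A back-of-envelope way to see why bandwidth dominates decode speed (the numbers below are illustrative assumptions, not measurements from either device):

    # Every generated token streams the model's (active) weights through the
    # memory bus, so decode speed is roughly bandwidth / weight bytes.
    GB = 1e9

    def rough_tokens_per_sec(weight_bytes, bandwidth_bytes_per_s):
        return bandwidth_bytes_per_s / weight_bytes

    print(rough_tokens_per_sec(4 * GB, 50 * GB))    # ~7B model at Q4 on a phone-class bus: ~12 tok/s
    print(rough_tokens_per_sec(4 * GB, 400 * GB))   # same weights in discrete-GPU VRAM: ~100 tok/s

The compute side barely enters the picture for single-stream decoding, which is why a mid-range phone CPU can keep pace with a phone GPU.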
Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.
The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"
Additionally, the HN headline includes "1 or 2 Arc A770"
It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.
The A770 has 16GB of VRAM. You're shuffling data to the GPU at a rate of 64GB/s, which is an order of magnitude slower than the GPU's internal VRAM bandwidth. Hence, this setup is memory-bandwidth constrained.
However, once you want to do anything useful with it, like a longer context, CPU compute becomes a huge bottleneck for time-to-first-token as well as tokens/s.
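Putting rough numbers on that (assumed round figures, not measurements):

    # Rough ratio of the two memory paths being discussed.
    pcie_gb_s = 64     # host-to-GPU transfer rate cited above
    vram_gb_s = 560    # ballpark internal bandwidth of an A770's 16GB of VRAM
    print(f"{vram_gb_s / pcie_gb_s:.0f}x")
    # ~9x: any weights that have to come from host RAM cost roughly 9x more
    # time per token than VRAM-resident ones.

    # Time-to-first-token is the compute-bound half: prompt processing scales
    # with prompt length, so a long context multiplies the wait before output.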
Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.
Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...
#1, I should highlight it up front this time: we are talking about _G_PUs :)
#2 You can't get a single consumer GPU with enough memory to load a 671B parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.
TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to load a 671B parameter model on only one or two cards.
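Rough MoE arithmetic for why that's plausible at all (my numbers, not taken from the linked page):

    # DeepSeek V3/R1 is a mixture-of-experts: ~37B of the 671B params are active per token.
    active_params = 37e9
    int4_bytes    = 0.5    # ~0.5 bytes per parameter at INT4
    print(active_params * int4_bytes / 1e9, "GB of weights touched per token")   # ~18.5 GB

    # The full weight set stays in host RAM; only the shared layers plus the
    # experts a given token routes to need to reach the one or two 16GB cards,
    # which is presumably what FlashMoE is orchestrating.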
Can you share which LLMs you run on such small devices and what use cases they address?
(Not a rhetorical question; it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)
Excellent and accurate question. You sound like the first person I've talked to who might appreciate the full exposition here; apologies if this is too much info. TL;DR: you're def not missing anything, and we're just beginning to turn a corner and see some rays of hope, where local is a genuine substitute for remote models in consumer applications.
#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.
#2) The "this" in "I put a lot of effort into this", means, I left Google 1.5 years ago and have been quietly building an app that is LLM-agnostic in service of coalescing a lot of nextgen thinking re: computing I saw that's A) now possible due to LLMs B) was shitcanned in 2020, because Android won politically, because all that next-gen thinking seemed impossible given it required a step change in AI capabilities.
This app is Telosnex (telosnex.com).
I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.
I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and believe that, on a long enough timeline, all the rushed hacking people have done on llama.cpp to get model after model supported will give way to UX improvements.
You are completely, utterly correct to note that local models on device are, in my words, useless toys at best. In practice, they kill your battery and barely work.
However, things did pay off recently. How?
#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implementing the formatting each model individually needs.
#2) Phi-4 mini came out. Long story, but tl;dr: until now there have been various gaping flaws in each Phi release, and this one looked free of any obvious issues. So I hacked support for its tool-calling vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first sub-Mixtral-8x7B local model that reliably handles RAG flows (i.e. generate a search query, then accept 2K tokens of parsed web pages and answer a question following the directions I give it) and tool calls (e.g. generate a search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)
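For context, that RAG flow has roughly this shape (llm() and web_search() are hypothetical stand-ins, not Telosnex or llama.cpp APIs):

    # Sketch of the flow: query generation, ~2K tokens of retrieved context,
    # then an answer that follows the caller's directions.
    def answer_with_rag(llm, web_search, question, directions):
        query = llm(f"Write a web search query for: {question}")
        pages = web_search(query)
        context = pages[:8000]   # crude cap, roughly 2K tokens of parsed page text
        return llm(f"{directions}\n\nContext:\n{context}\n\nQuestion: {question}")

The point is that a small model has to stay on-rails across two generations in a row, which is exactly where small local models used to fall apart.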
That's because the OP is linking to the quickstart guide. There are benchmark numbers on the GitHub repo's root page, though they don't appear to include the new DeepSeek yet:
- 380GB CPU memory for DeepSeek V3/R1 671B INT4 model
- 128GB CPU memory for Qwen3MoE 235B INT4 model
- 1-2 ARC A770 or B580
- 500GB Disk space
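For what it's worth, those CPU-memory figures line up with straightforward INT4 sizing plus some headroom (a quick sanity check of my own, not from the repo):

    # INT4 weights are ~0.5 bytes per parameter; the listed figures add working overhead.
    for name, params, listed_gb in [("DeepSeek V3/R1 671B", 671e9, 380),
                                    ("Qwen3MoE 235B",       235e9, 128)]:
        weights_gb = params * 0.5 / 1e9
        print(f"{name}: ~{weights_gb:.0f} GB of weights vs {listed_gb} GB listed")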