Hacker News | colorant's comments

Hardware Requirements:

- 380GB CPU memory for DeepSeek V3/R1 671B INT4 model

- 128GB CPU memory for Qwen3MoE 235B INT4 model

- 1-2 ARC A770 or B580

- 500GB Disk space
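As a rough sanity check on the 380GB figure above, here is the back-of-the-envelope sizing (my own arithmetic, not from the guide):

```python
# Rough sizing for DeepSeek V3/R1 671B at INT4 (~0.5 bytes per parameter).
# The overhead reasoning is my assumption, not an official number.
params = 671e9
weight_gb = params * 0.5 / 1e9
print(f"raw INT4 weights: ~{weight_gb:.0f} GB")   # ~336 GB
# Add quantization scales/zero-points, KV cache and runtime buffers, and
# ~380 GB of CPU memory (plus ~500 GB of disk for the files) is plausible.
```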


The ipex-llm implementation extends llama.cpp and includes additional CPU-GPU hybrid optimizations for sparse MoE.
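For readers unfamiliar with sparse MoE: only a handful of experts fire per token, which is what makes a CPU-GPU split workable at all. Below is a minimal, illustrative top-k routing sketch in Python; it is my own simplification with made-up layer sizes, not the ipex-llm code:

```python
import numpy as np

# Toy sparse-MoE layer: each token is routed to its top-k experts only,
# so most expert weights sit idle on any given step.
n_experts, top_k, d_model, d_ff = 8, 2, 64, 256
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, n_experts))
experts = [(rng.standard_normal((d_model, d_ff)),
            rng.standard_normal((d_ff, d_model))) for _ in range(n_experts)]

def moe_forward(x):                       # x: (d_model,) for one token
    scores = x @ router_w                 # router logits
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
    gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    out = np.zeros(d_model)
    for g, e in zip(gates, chosen):       # only top_k of n_experts execute
        w_in, w_out = experts[e]
        out += g * (np.maximum(x @ w_in, 0) @ w_out)
    return out

print(moe_forward(rng.standard_normal(d_model)).shape)  # (64,)
```

The attention layers and the few active experts per token can run on the GPU while the bulk of the (mostly idle) expert weights stay in CPU memory; that split is roughly the idea behind the hybrid optimizations.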


Currently >8 tokens/s; there is a demo in this post: https://www.linkedin.com/posts/jasondai_run-671b-deepseek-r1...


This is based on llama.cpp


>8 TPS at this moment on a 2-socket 5th Gen Xeon (EMR)



Yes, you are right. Unfortunately HN somehow truncated my original URL link.


Sounds like submission "helper" tools are working about as well as normal :).

Did you have the chance to try this out yourself or did you just run across it recently?


https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

Requirements (>8 tokens/s):

380GB CPU Memory

1-8 ARC A770

500GB Disk



CPU inference is both bandwidth and compute constrained.

If your prompt has 10 tokens, it'll do OK, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.
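A crude way to see the bandwidth side (my own heuristic numbers, not a benchmark): during decode, every generated token has to stream the active weights out of memory at least once, so tokens/s is capped at roughly bandwidth divided by active bytes per token.

```python
# Crude decode-speed ceiling: memory bandwidth / active bytes per token.
# The bandwidth figure is an assumption for a 2-socket DDR5 server.
mem_bandwidth_gbps = 600
active_params = 37e9          # DeepSeek R1 activates ~37B params per token
active_gb = active_params * 0.5 / 1e9   # INT4 ~= 0.5 bytes/param
print(f"decode ceiling: ~{mem_bandwidth_gbps / active_gb:.0f} tokens/s")
# ~32 tokens/s best case; real systems land well below this, and long
# prompts shift the bottleneck toward compute (prefill) instead.
```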


Prompt length mainly impacts prefill latency (TTFT, time to first token), not decoding speed (TPOT, time per output token)


Decoding speed won't matter one bit if you have to sit there for 5 minutes waiting for the model to ingest a prompt that's two sentences long.


With ~1000 input tokens, the TTFT is ~10 seconds
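Taking that at face value, and assuming prefill scales roughly linearly with prompt length at these sizes (a simplification), that works out to about 100 prompt tokens/s:

```python
# Prefill throughput implied by the quoted figure (~1000 tokens in ~10 s).
prompt_tokens, ttft_s = 1000, 10
prefill_tps = prompt_tokens / ttft_s
print(f"~{prefill_tps:.0f} prompt tokens/s")                # ~100 tok/s
print(f"8K-token prompt TTFT ~{8000 / prefill_tps:.0f} s")  # ~80 s
```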


> 1-8 ARC A770

To get more than 8 t/s, is one Intel Arc A770 enough?


Yes, but the context length will be limited due to the VRAM constraint


Anyone got a rough estimate of the cost of this setup?

I’m guessing it’s under 10k.

I also didn’t see tokens per second numbers.



This article keeps getting posted but it runs a thinking model at 3-4 tokens/s. You might as well take a vacation if you ask it a question.

It’s a gimmick and not a real solution.


If you value local compute and don't need massive speed, that's still twice as fast as most people can type.


Human typing speed is orders of magnitude slower than our eyes scanning for the correct answer.

ChatGPT o3 mini high thinks at about 140 tokens/s by my estimation, and I sometimes wish it could return answers quicker.

Getting an answer to a simple prompt would take 2-3 minutes using the AMD system, and forget about longer contexts.


Reasoning models spend a whole bunch of time reasoning before returning an answer. I was toying with QWQ 32B last night and ran into one question I gave it where it spent 18 minutes at 13 tok/s in the <think> phase before returning a final answer. I value local compute, but reasoning models aren't terribly feasible at this speed since you don't really need to see the first 90% of their thinking output.
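Just multiplying those figures (my arithmetic), that one answer burned on the order of 14,000 thinking tokens before any of the visible output started:

```python
# Thinking tokens implied by the figures above (18 min at ~13 tok/s).
think_minutes, tok_per_s = 18, 13
print(f"~{think_minutes * 60 * tok_per_s:,} tokens spent in <think>")  # ~14,040
```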


Exactly! I run it on my old T7910 Dell workstation (2x 2697A V4, 640GB RAM) that I built for way less than $1k. But so what, it's about ~2 tokens/s. Just like you said, it's cool that it runs at all, but that's it.


It's meant to be a test/development setup for people to prepare the software environment and tooling for running the same on more expensive hardware. Not to be fast.


I remember people trying to run the game Crysis using CPU rendering. They got it to run and move around. People did it for fun and the "cool" factor. But no one actually played the game that way.

It's the same thing here. CPUs can run it but only as a gimmick.


> It's the same thing here. CPUs can run it but only as a gimmick.

No, that's not true.

I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM/bandwidth than compute.

Crappy Pixel Fold 2022 mid-range Android CPU gets you roughly the same speed as the 2024 Apple iPhone GPU, with Metal acceleration that dozens of very smart people hack on.

Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.

The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"

Additionally, the HN headline includes "1 or 2 Arc A770"


It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.

A770 has 16GB of VRAM. You're shuffling data to the GPU at a rate of 64GB/s, which is roughly an order of magnitude slower than the GPU's internal VRAM bandwidth. Hence, this setup is memory bandwidth constrained.
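To put rough numbers on that gap (the 64GB/s figure is the one quoted above; the VRAM figure is the card's advertised spec, so treat both as approximations):

```python
# Comparing the links a hybrid CPU+GPU setup has to live with.
host_link_gbps = 64      # host-to-GPU transfer rate quoted above
a770_vram_gbps = 560     # Arc A770 advertised memory bandwidth
print(f"VRAM is ~{a770_vram_gbps / host_link_gbps:.0f}x faster than the host link")
# Anything that must cross the host link or sit in system RAM every token
# sets the ceiling, not the GPU's own compute or VRAM speed.
```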

However, once you want to use it to do anything useful, like a longer context size, the CPU compute will be a huge bottleneck for time-to-first-token as well as tokens/s.

Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.


Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...

#1, I should highlight it up front this time: we are talking about _G_PUs :)

#2 You can't get a single consumer GPU with enough memory to load a 671B-parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.

TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to be loading a 671B-parameter model on only one or two cards


1) This system mostly uses normal DDR RAM, not GPU VRAM.

2) M3 Ultra can load DeepSeek R1 671B Q4.

Running a very large LLM split across the CPU and GPU is not new. It's been done since the beginning of local LLMs.


> Crappy Pixel Fold 2022 mid-range Android CPU

Can you share what LLMs you run on such small devices / what use cases they address?

(Not a rhetorical question, it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)


Excellent and accurate q. You sound like the first person I've talked to who might appreciate full exposition here, apologies if this is too much info. TL;DR is you're def not missing anything, and we're just beginning to turn a corner and see some rays of hope that local models can be a genuine substitute for remote models in consumer applications.

#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.

#2) The "this" in "I put a lot of effort into this" means: I left Google 1.5 years ago and have been quietly building an LLM-agnostic app, in service of coalescing a lot of next-gen thinking about computing that I saw, which is A) now possible due to LLMs and B) was shitcanned in 2020 because Android won politically, since all that next-gen thinking seemed impossible given it required a step change in AI capabilities.

This app is Telosnex (telosnex.com).

I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.

I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and I believe that, on a long enough timeline, all the rushed hacking people have done on llama.cpp to get model after model supported will give way to UX improvements.

You are completely, utterly, correct to note that the local models on device are, in my words, useless toys, at best. In practice, they kill your battery and barely work.

However, things did pay off recently. How?

#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implementing what each model individually needs for formatting

#2) Phi-4 mini came out. Long story, but tl;dr: until now there have been various gaping flaws with each Phi release. This one looked free of any issues. So I hacked support for its tool-calling vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first local model below Mixtral 8x7B that reliably handles RAG flows (i.e. generate a search query, then accept 2K tokens of parsed web pages and answer a question following the directions I give it) and tool calls (i.e. generate a search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)
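For anyone who wants a feel for what such a RAG flow looks like in the small: below is a generic two-step sketch against a locally running llama.cpp server (llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint). This is not Telosnex's code, and the port, payload shape, and prompts are assumptions for illustration.

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # assumed local llama-server

def chat(messages):
    """Send a chat-completion request to the local server and return the text."""
    req = urllib.request.Request(
        URL,
        data=json.dumps({"messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Step 1: have the local model turn the user's question into a search query.
question = "What changed in llama.cpp's tool-call handling recently?"
query = chat([{"role": "user",
               "content": f"Write a short web search query for: {question}"}])

# Step 2: feed the retrieved page text back in and ask for a grounded answer.
retrieved = "...parsed text from the top search results goes here..."
answer = chat([
    {"role": "system", "content": "Answer only from the provided context."},
    {"role": "user", "content": f"Context:\n{retrieved}\n\nQuestion: {question}"},
])
print(answer)
```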


What a teaser article! All this info for setting up the system, but no performance numbers.


That's because the OP is linking to the quickstart guide. There are benchmark numbers on the GitHub repo's root page, but they don't appear to include the new DeepSeek yet:

https://github.com/intel/ipex-llm/tree/main?tab=readme-ov-fi...


Am I missing something? I see a lot of small-scale model results, but no results for DeepSeek-R1-671B-Q4_K_M in their GitHub repos.


llamafile cannot use Intel GPUs (including the integrated GPU on your PC)


There is ipex-llm support for Ollama on Intel GPU (https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...)

