Reasoning models spend a whole bunch of time reasoning before returning an answer. I was toying with QwQ 32B last night and hit one question where it spent 18 minutes at 13 tok/s in the <think> phase before returning a final answer. I value local compute, but reasoning models aren't terribly feasible at this speed, since you don't really need to see the first 90% of their thinking output.
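For a sense of scale, that <think> phase alone is roughly 14K tokens most people will never read. A minimal sketch of the arithmetic, and of hiding the thinking span from the stream (generate_chunks() is a hypothetical stand-in for whatever streaming API you use):

    # Rough scale of the hidden reasoning: ~18 minutes at ~13 tok/s.
    think_tokens = 18 * 60 * 13   # ~14,000 tokens you never really need to read
    print(f"~{think_tokens} reasoning tokens before the answer even starts")

    # Sketch of suppressing the <think> span while streaming; generate_chunks()
    # is a hypothetical generator yielding text chunks from your runtime.
    def stream_answer(generate_chunks):
        in_think = False
        for chunk in generate_chunks():
            if "<think>" in chunk:
                in_think = True
            elif "</think>" in chunk:
                in_think = False
            elif not in_think:
                yield chunk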
Exactly! I run it on my old Dell T7910 workstation (2x 2697A V4, 640GB RAM) that I built for way less than $1k. But so what, it's about ~2 tokens/s. Just like you said, it's cool that it runs at all, but that's it.
It's meant to be a test/development setup for people to prepare the software environment and tooling for running the same on more expensive hardware. Not to be fast.
I remember people trying to run the game Crysis using CPU rendering. They got it to run and move around. People did it for fun and the "cool" factor. But no one actually played the game that way.
It's the same thing here. CPUs can run it but only as a gimmick.
> It's the same thing here. CPUs can run it but only as a gimmick.
No, that's not true.
I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM / memory bandwidth than compute.
A crappy 2022 mid-range Android CPU in a Pixel Fold gets you roughly the same speed as a 2024 Apple iPhone's GPU, with the Metal acceleration that dozens of very smart people hack on.
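A back-of-envelope way to see why bandwidth dominates decode speed (the numbers below are illustrative assumptions, not measurements from either device):

    # Every generated token streams the model's (active) weights through the
    # memory bus, so decode speed is roughly bandwidth / weight bytes.
    GB = 1e9

    def rough_tokens_per_sec(weight_bytes, bandwidth_bytes_per_s):
        return bandwidth_bytes_per_s / weight_bytes

    print(rough_tokens_per_sec(4 * GB, 50 * GB))    # ~7B model at Q4 on a phone-class bus: ~12 tok/s
    print(rough_tokens_per_sec(4 * GB, 400 * GB))   # same weights in discrete-GPU VRAM: ~100 tok/s

The compute side barely enters the picture for single-stream decoding, which is why a mid-range phone CPU can keep pace with a phone GPU.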
Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.
The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"
Additionally, the HN headline includes "1 or 2 Arc A770"
It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.
The A770 has 16GB of VRAM. You're shuffling data to the GPU at a rate of 64GB/s, which is an order of magnitude slower than the GPU's internal VRAM bandwidth. Hence, this setup is memory-bandwidth constrained.
However, once you want to do anything useful with it, like a longer context, CPU compute becomes a huge bottleneck for time-to-first-token as well as tokens/s.
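Putting rough numbers on that (assumed round figures, not measurements):

    # Rough ratio of the two memory paths being discussed.
    pcie_gb_s = 64     # host-to-GPU transfer rate cited above
    vram_gb_s = 560    # ballpark internal bandwidth of an A770's 16GB of VRAM
    print(f"{vram_gb_s / pcie_gb_s:.0f}x")
    # ~9x: any weights that have to come from host RAM cost roughly 9x more
    # time per token than VRAM-resident ones.

    # Time-to-first-token is the compute-bound half: prompt processing scales
    # with prompt length, so a long context multiplies the wait before output.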
Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.
Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...
#1, I should highlight it up front this time: we are talking about _G_PUs :)
#2 You can't get a single consumer GPU with enough memory to load a 671B parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.
TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to load a 671B parameter model on only one or two cards.
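Rough MoE arithmetic for why that's plausible at all (my numbers, not taken from the linked page):

    # DeepSeek V3/R1 is a mixture-of-experts: ~37B of the 671B params are active per token.
    active_params = 37e9
    int4_bytes    = 0.5    # ~0.5 bytes per parameter at INT4
    print(active_params * int4_bytes / 1e9, "GB of weights touched per token")   # ~18.5 GB

    # The full weight set stays in host RAM; only the shared layers plus the
    # experts a given token routes to need to reach the one or two 16GB cards,
    # which is presumably what FlashMoE is orchestrating.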
Can you share which LLMs you run on such small devices and what use cases they address?
(Not a rhetorical question; it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)
Excellent and accurate question. You sound like the first person I've talked to who might appreciate the full exposition here; apologies if this is too much info. TL;DR: you're def not missing anything, and we're just beginning to turn a corner and see some rays of hope, where local is a genuine substitute for remote models in consumer applications.
#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.
#2) The "this" in "I put a lot of effort into this", means, I left Google 1.5 years ago and have been quietly building an app that is LLM-agnostic in service of coalescing a lot of nextgen thinking re: computing I saw that's A) now possible due to LLMs B) was shitcanned in 2020, because Android won politically, because all that next-gen thinking seemed impossible given it required a step change in AI capabilities.
This app is Telosnex (telosnex.com).
I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.
I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and believe that, on a long enough timeline, all the rushed hacking people have done on llama.cpp to get model after model supported will give way to UX improvements.
You are completely, utterly correct to note that local models on device are, in my words, useless toys at best. In practice, they kill your battery and barely work.
However, things did pay off recently. How?
#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implementing the formatting each model individually needs.
#2) Phi-4 mini came out. Long story, but tl;dr: until now there have been various gaping flaws in each Phi release, and this one looked free of any obvious issues. So I hacked support for its tool-calling vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first sub-Mixtral-8x7B local model that reliably handles RAG flows (i.e. generate a search query, then accept 2K tokens of parsed web pages and answer a question following the directions I give it) and tool calls (e.g. generate a search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)
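For context, that RAG flow has roughly this shape (llm() and web_search() are hypothetical stand-ins, not Telosnex or llama.cpp APIs):

    # Sketch of the flow: query generation, ~2K tokens of retrieved context,
    # then an answer that follows the caller's directions.
    def answer_with_rag(llm, web_search, question, directions):
        query = llm(f"Write a web search query for: {question}")
        pages = web_search(query)
        context = pages[:8000]   # crude cap, roughly 2K tokens of parsed page text
        return llm(f"{directions}\n\nContext:\n{context}\n\nQuestion: {question}")

The point is that a small model has to stay on-rails across two generations in a row, which is exactly where small local models used to fall apart.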
That's because the OP is linking to the quickstart guide. There are benchmark numbers on the GitHub repo's root page, though they don't appear to include the new DeepSeek yet:
- 380GB CPU memory for DeepSeek V3/R1 671B INT4 model
- 128GB CPU memory for Qwen3MoE 235B INT4 model
- 1-2 ARC A770 or B580
- 500GB Disk space
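For what it's worth, those CPU-memory figures line up with straightforward INT4 sizing plus some headroom (a quick sanity check of my own, not from the repo):

    # INT4 weights are ~0.5 bytes per parameter; the listed figures add working overhead.
    for name, params, listed_gb in [("DeepSeek V3/R1 671B", 671e9, 380),
                                    ("Qwen3MoE 235B",       235e9, 128)]:
        weights_gb = params * 0.5 / 1e9
        print(f"{name}: ~{weights_gb:.0f} GB of weights vs {listed_gb} GB listed")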