
> The model absolutely can be run at home. There even is a big community around running large models locally

IMO 1T total parameters with 32B active is a different scale from what most people mean when they say local LLMs. I totally agree there will be people messing with this, but the real value of local LLMs is that you can actually use them and get value from them on standard consumer hardware. I don't think that's really possible with this model.



Local LLMs are just LLMs people run locally. It's not a definition of size, feature set, or what's most popular. What the "real" value is for local LLMs will depend on each person you ask. The person who runs small local LLMs will tell you the real value is in small models, the person who runs large local LLMs will tell you it's large ones, those who use cloud will say the value is in shared compute, and those who don't like AI will say there is no value in any.

LLMs whose weights aren't available are an example of what isn't a local LLM, not models that happen to be large.


> LLMs whose weights aren't available are an example of what isn't a local LLM, not models that happen to be large.

I agree. My point was that most people aren't thinking of models this large when they talk about local LLMs. That's what I said, right? This is supported by the download counts on HF: the most downloaded local models are significantly smaller than 1T parameters, normally 1-12B.

I'm not sure I understand what point you're trying to make here?


Mostly a "we know local LLMs as being this, and every variant mentioned can provide real value regardless of which is most commonly referenced" point. I.e. large local LLMs aren't only something people mess with: they often provide a lot of value to relatively few people, rather than a little value to relatively many people as small local LLMs do. Which modality and type brings the most value is largely a matter of opinion for the user getting that value, not simply a question of which option runs on consumer hardware.

You're of course right that smaller LLMs are more commonly deployed; it's just not the part I was really responding to.


32B active is nothing special; there are local setups that will easily support that. 1T total parameters ultimately requires keeping the bulk of them on SSD. This need not be an issue if there's enough locality in expert choice for any given workload; the "hot" experts will simply be cached in available spare RAM.
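
To make the "hot experts cached in spare RAM" idea concrete, here is a minimal sketch of an LRU cache simulation for a single MoE layer. Everything in it is an assumption for illustration: the expert count, the top-k value, the cache size, and especially the Zipf-like `skew` parameter used to model locality. It is not a measurement of any real model.

```python
from collections import OrderedDict
import random

def simulate_expert_cache(num_experts, top_k, cache_slots, num_tokens, skew, seed=0):
    """Return the fraction of expert lookups served from the RAM cache.

    Hypothetical model: one MoE layer, equally sized experts, and an LRU
    cache holding `cache_slots` experts. Misses stand in for SSD reads.
    """
    rng = random.Random(seed)
    # Zipf-like popularity: weight of expert i is 1 / (i + 1) ** skew.
    # skew = 0 means every expert is equally likely (no locality).
    weights = [1.0 / (i + 1) ** skew for i in range(num_experts)]
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    lookups = 0
    for _ in range(num_tokens):
        # Draw top_k distinct experts for this token.
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        for expert in sorted(chosen):
            lookups += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)      # mark as most recently used
            else:
                cache[expert] = None           # "load from SSD"
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict least recently used
    return hits / lookups
```

Comparing `simulate_expert_cache(64, 4, 16, 500, skew=0.0)` with the same call at `skew=1.5` shows how much the hit rate depends on routing being skewed. Whether real workloads have that skew is exactly what is in dispute here.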


When I've measured this myself, I've never seen a medium-to-long task horizon that would have expert locality such that you wouldn't be hitting the SSD constantly to swap layers (not to say it doesn't exist, just that in the literature and in my own empirics, it doesn't seem to be observed in a way you could rely on it for cache performance).

Over any task that has enough prefill input diversity and a decode phase that's more than a few tokens, it's at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why, when you run anything beyond bs=1, you see forward passes light up the whole network.
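
The intuition above can be put in numbers under one strong assumption: that routing is uniform and independent per token. In that case a given expert is missed by one token with probability 1 - k/E, so the expected fraction of experts touched after n tokens is 1 - (1 - k/E)^n. The sketch below just evaluates that expression; the function name and any values you plug in are hypothetical, not any particular model's configuration.

```python
def expected_expert_coverage(num_experts, top_k, num_tokens):
    """Expected fraction of one layer's experts touched after num_tokens.

    Assumes each token picks top_k distinct experts uniformly at random,
    independently of every other token (an idealization, not a measurement).
    """
    miss_per_token = 1.0 - top_k / num_experts
    return 1.0 - miss_per_token ** num_tokens
```

With, say, 256 experts and top-8 routing, a single token touches about 3% of a layer's experts, while a hundred tokens already touch well over 90% of them under this assumption. Real routers are not uniform, so treat this as the no-locality limit.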


> hitting the SSD constantly to swap layers

Thing is, people in the local LLM community are already doing that to run the largest MoE models, using mmap so that spare-RAM-as-cache is managed automatically by the OS. It's a drag on performance to be sure, but still somewhat usable if you're willing to wait for results. And it unlocks these larger models on what's effectively semi-pro if not true consumer hardware. On the enterprise side, high-bandwidth NAND flash is just around the corner and perfectly suited for storing these large read-only model parameters (no wear-and-tear issues with the NAND storage) while preserving RAM-like throughput.
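
For anyone unfamiliar with the mmap trick, here is a minimal sketch using Python's standard `mmap` module. It maps a file read-only and decodes only a small slice, leaving it to the OS page cache to decide what stays in RAM. The flat float32 file layout and both function names are made up for the demo; real model files have their own formats.

```python
import mmap
import os
import struct
import tempfile

def write_demo_weights(num_values):
    """Write num_values little-endian float32 values (0.0, 1.0, ...) to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        values = [float(i) for i in range(num_values)]
        f.write(struct.pack(f"<{num_values}f", *values))
    return path

def read_float32_slice(path, start_index, count):
    """Read `count` float32 values starting at `start_index` through a memory map.

    Only the pages backing the requested byte range need to be resident;
    the OS decides what to keep cached and what to evict.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            offset = start_index * 4  # 4 bytes per float32
            raw = mapped[offset:offset + count * 4]
    return list(struct.unpack(f"<{count}f", raw))
```

Usage would look like `path = write_demo_weights(1000)` followed by `read_float32_slice(path, 10, 3)`, then `os.remove(path)` to clean up.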


I've tested this myself often (as an aside: I'm in said community, I run 2x RTX Pro 6000 locally, 4x 3090 before that), and I think what you said re: "willing to wait" is probably the difference maker for me.

I can run Minimax 2.1 in 5bpw at 200k context fully offloaded to GPU. The 30-40 tok/s feels like a lifetime for long-horizon tasks, especially with subagent delegation etc., but it's still fast enough to be a daily driver.

But that's more or less my cutoff. Whenever I've tested other setups that dip into the single and sub-single digit throughput rates, it becomes maddening and entirely unusable (for me).
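
A rough way to see where single-digit and sub-single-digit rates come from is a bandwidth-only estimate. The sketch below assumes decode speed is limited purely by how many bytes of the active weights must come from SSD per token. The function name, the miss fraction, and the SSD bandwidth are all hypothetical inputs, and the estimate ignores compute, the always-resident non-expert weights, and any overlap of I/O with compute.

```python
def ssd_bound_tokens_per_second(active_params, bpw, miss_fraction, ssd_gb_per_s):
    """Upper bound on decode speed if SSD reads were the only cost.

    active_params: weights used per token (e.g. 32e9)
    bpw: average bits per weight of the quantized model
    miss_fraction: share of the active weights not already cached in RAM
    ssd_gb_per_s: sustained SSD read bandwidth in GB/s (assumed)
    """
    bytes_per_token = active_params * bpw / 8
    ssd_bytes_per_token = bytes_per_token * miss_fraction
    if ssd_bytes_per_token == 0:
        return float("inf")  # everything is cached, so SSD is not the limit
    return ssd_gb_per_s * 1e9 / ssd_bytes_per_token
```

For example, 32e9 active weights at 4.5 bpw is 18 GB touched per token. If half of that missed the cache on a 7 GB/s drive, this bound comes out below 1 token per second.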


What is bpw?


Bits per weight. It's the average precision across all the weights. When you quantize these models, you don't just use a fixed precision across all model layers/weights. There's a mix, and it varies per quant method. This is why you can get bit precisions that aren't "real" in a strict computing sense.

E.g. a 4-bit quant can have half the attention and feed-forward tensors in Q6 and the rest in Q4. Due to how block scaling works, those k-quant dtypes (specifically for llama.cpp/gguf) have a larger bpw than their names suggest. Q4 is ~4.5 bpw, and Q6 is ~6.5.
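
The arithmetic behind those numbers can be sketched as follows. The block layout used in `block_bpw` (32 weights sharing one 16-bit scale) is a simplified illustration of block scaling, not the exact layout of any specific llama.cpp k-quant, and all three function names are made up for this example.

```python
def block_bpw(weights_per_block, bits_per_weight, overhead_bits):
    """Effective bits per weight once per-block metadata (scales etc.) is counted."""
    total_bits = weights_per_block * bits_per_weight + overhead_bits
    return total_bits / weights_per_block

def average_bpw(groups):
    """Weighted average bpw over (num_weights, bpw) groups of tensors."""
    total_weights = sum(n for n, _ in groups)
    total_bits = sum(n * bpw for n, bpw in groups)
    return total_bits / total_weights

def size_gb(num_weights, bpw):
    """Storage needed for num_weights at the given average bpw, in GB (1e9 bytes)."""
    return num_weights * bpw / 8 / 1e9
```

With these, `block_bpw(32, 4, 16)` gives 4.5, `average_bpw([(50, 6.5), (50, 4.5)])` gives 5.5 for the half-Q6, half-Q4 mix described above, and `size_gb(1e12, 4.5)` gives 562.5 GB for a 1T-parameter model at ~4.5 bpw.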


I never said it was special.

I was trying to correct the impression that a lot of people will be running models of this size locally just because there's a local LLM community.

The most commonly downloaded local LLMs are normally <30B (e.g. https://huggingface.co/unsloth/models?sort=downloads). The things you're describing, especially in combination, make this model unusable for a lot of people in the local LLM community at the moment.


Do you guys understand that different experts are loaded PER TOKEN?



