throwdbaaway's comments

90% of what you pay for in agentic coding is cached reads, which are free with local inference serving one user. This has been well known in r/LocalLLaMA for ages, and an article about it also hit the HN front page a few weeks ago.

What about the VRAM requirement for the KV cache? That may matter more than memory bandwidth. These GPUs have more compute capacity than memory bandwidth, and more memory bandwidth than VRAM.

DeepSeek has MLA, and now DSA. Qwen has gated delta-net. These inventions allow efficient inference both at home and at scale. If Anthropic has nothing comparable, then their inference cost could be much higher.

DeepSeek also has 3FS (https://github.com/deepseek-ai/3FS), which makes cached reads a lot cheaper with a much longer TTL. If Anthropic didn't invent something similar and instead uses an expensive solution like Redis, as the short TTL suggests, that also contributes to higher inference cost.


Yours is the only benchmark that puts the 35B A3B above the 27B. Time for human judgement to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts, which then tripped up the 27B more during reasoning. This would also be reflected in the score when thinking is disabled, but with the thinking traces we can sort of debug it.

I inspected manually and indeed the 27B is doing worse, but I believe it could be due to the exact GGUF in the ollama repository and/or the need to adjust the parameters. I'll try more stuff.

Isn’t llama.cpp’s implementation of Qwen 3.5 better, or am I misinformed?

There was a recent fix by ollama and I used it.

There are Qwen3.5 27B quants in the range of 4 bits per weight, which fit into 16G of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.

Can someone explain how a 27B model (quantized, no less) can ever be comparable to a model like Sonnet 4.0, which likely has mid-to-high hundreds of billions of parameters?

Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.


AFAIK post-training and distillation techniques have advanced a lot in the past couple of years. SOTA big models reach a new frontier, and within 6 months it trickles down to open models with 10x fewer parameters.

And mind that the source pre-training data was not made/written for training LLMs; it's just random stuff from the Internet, books, etc. So there's a LOT of completely useless and contradictory information. Better training texts are way better, and you can just generate & curate them from those huge frontier LLMs. This was shown in the TinyStories paper, where GPT-4-generated children's stories let models three orders of magnitude smaller achieve quite a lot.

This is why the big US labs complain China is "stealing" their work by distilling their models. Chinese labs save many billions in training with just a bunch of accounts. (I'm just stating what they say, not giving my opinion).


There's diminishing returns bigly when you increase parameter count.

The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.

Anyways your perception of a model's "quality" is determined by careful post-training.


Interesting. I see papers where researchers fine-tune models in the 7B to 12B range and beat, or at least stay competitive with, frontier models. I wish I knew how this was possible, or had more intuition on such things. If anyone has paper recommendations, I'd appreciate it.

They're using a revolutionary new method called "training on the test set".

So, curve fitting the training data? So, we should expect out of sample accuracy to be crap?

Yeah, that's usually what tends to happen with those tiny models that are amazing in benchmarks.

More parameters improve general knowledge a lot, but you then have to quantize more to fit into a given amount of memory, which, taken to extremes, leads to erratic behavior. For casual chat use even Q2 models can be compelling; agentic use requires more regularity, thus less-quantized parameters and a lower total parameter count to compensate.

The short answer is that more things matter than parameter count, and we are probably nowhere near the most efficient way to make these models. Also: the big AI labs have shown a few times that internally they have way more capable models.

Considering the full-fat Qwen3.5-plus is good, but barely Sonnet 4 good in my testing (though incredibly cheap!), I doubt the quantised versions are somehow as good, let alone better, in practice.

I think it depends on work pattern.

Many do not give Sonnet or even Opus full rein, where it really pushes ahead of other models.

If you're asking for tightly constrained single functions at a time it really doesn't make a huge difference.

I.e. the more you vibe, the better the model you need, especially over long-running tasks and large contexts. Claude is head and shoulders above everyone else in that setting.


>I.e. the more you vibe, the better the model you need, especially over long-running tasks and large contexts

For sure, but the coolest thing about qwen3.5-plus is the 1M context length on a $3 coding plan, super neat. I've found the model isn't really powerful enough to take real advantage of it, though. Still super neat!


When you say Sonnet 4, do you mean literally 4, or 4.6?

It's not as capable as Sonnet 4.6 in my usage over the past couple days, through a few different coding harnesses (including my own for-play one[0], that's been quite fun).

[0] https://github.com/girvo/girvent/


What is the benefit of writing your own harness? I am asking because I need to get better at using AI for programming. I have used Cursor, Gemini CLI, Antigravity quite a bit and have had a lot of difficulties getting them do what I want. They just tend to "know better."

I'm not an expert, but I started with smaller tasks to get a feel for how to phrase things and what I need to include. It's more manageable to manually fix things it screwed up than to give it full rein.

You may want to look at the AGENTS.md file too so you can include your stock style things if it’s repeatedly screwing up in the same way.


Purely as an exercise to see how they operate and understand them better. Then, additionally, because I was curious how much better one could make something like qwen3.5-plus, with its 1M context window, despite its weaker base behaviour, if I gave it something very focused on what I want from it.

The Pi framework is probably right up your alley btw! Very extensible


I think it's the same instinct as making your own Game Engine. You start off either because you want to learn how they work or because you think your game is special and needs its own engine. Usually, it's a combination of both.

It doesn't. I'm not sure it outperforms ChatGPT 3.

You are not being serious, are you? Even 1.5-year-old Mistral and Meta models outperform ChatGPT 3.

3 not 3.5? I think I would even prefer the qwen3.5 0.8b over GPT 3.

With MoE models, if the complete weights for inactive experts almost fit in RAM you can set up mmap use and they will be streamed from disk when needed. There's obviously a slowdown but it is quite gradual, and even less relevant if you use fast storage.

any good packages you recommend for this?

Qwen3.5 35B A3B is much much faster and fits if you get a 3 bit version. How fast are you getting 27B to run?

On my M3 Air with 24GB of memory, the 27B runs at 2 tok/s, but the 35B A3B gets 14-22 tok/s, which is actually usable.


Using ik_llama.cpp to run a 27B 4bpw quant on a RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context.

35B A3B is faster but didn't do too well in my limited testing.


With regular llama.cpp on a 3070 Ti I get 60 tok/s TG with the 9B model; it's quite impressive.

The 27B is rated slightly higher for SWE-bench.

27B needs less memory and does better on benchmarks, but 35B-A3B seems to run roughly twice as fast.

Don't sleep on the 9B version either, I get much faster speeds and can't tell any difference in quality. On my 3070ti I get ~60tok/s with it, and half that with the 35B-A3B.

Say more please if you can. How/why is ik_llama.cpp faster than mainline for the 27B dense? I'd like to run the 27B dense faster on a 24GB VRAM GPU, and also on an M2 Max.

ik_llama.cpp was about 2x faster for CPU inference of Qwen3.5 versus mainline until yesterday. Mainline landed a PR that greatly increased speed for Qwen3.5, so now ik_llama.cpp is only 10% faster on token generation.

I don't quite get the low temperature coupled with the high penalty. We get thinking loops due to the low temperature, and then counter them with a high penalty. That seems backward.

For Qwen3.5 27B, I got good results with --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.2, without penalty. It lets the model explore (temp, top-p, top-k) without going off the rails (min-p) during reasoning. No loops so far.


The guidelines are a little hard to interpret. At https://huggingface.co/Qwen/Qwen3.5-27B Qwen says to use temp 0.6, pres 0.0, rep 1.0 for "thinking mode for precise coding tasks" and temp 1.0, pres 1.5, rep 1.0 for "thinking mode for general tasks." Those parameters are just swinging wildly all over the place, and I don't know if printing potato 100 times is considered to be more like a "precise coding task" or a "general task."

When setting up the batch file for some previous tests, I decided to split the difference between 0.6 and 1.0 for temperature and use the larger recommended values for presence and repetition. For this prompt, it probably isn't a good idea to discourage repetition, I guess. But keeping the existing parameters worked well enough, so I didn't mess with them.


We are all reasonable people here, and while you are (mostly) correct, I think we can all agree that Anthropic's documentation sucks. If I have to infer from the docs:

* Haiku 4.5 by default doesn't think, i.e. it has a default thinking budget of 0.

* By setting a non-zero thinking budget, Haiku 4.5 can think. My guess is that Claude Code may set this differently for different tasks, e.g. thinking for Explore, no thinking for Compact.

* This hybrid thinking is different from the adaptive thinking introduced in Opus 4.6, which when enabled, can automatically adjust the thinking level based on task difficulty.


For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.


I would say 27B matches with Sonnet 4.0, while 397B A17B matches with Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting 262144 context length at good speed with modest hardware is huge for local inference.

Will check your updated ranking on Monday.


Can you describe a bit more how this works? I suppose the speed remains about the same, while the experience is more pleasant?

(Big fan of SQLAlchemy)


Not the user you're responding to, but I feel like I do something similar

I describe what I want roughly on the level I could still code it by hand, to the level of telling Claude to create specific methods, functions and classes (And reminding it to use them, because models love pointless repetition)

Is it faster? Sure, and being this specific has the added benefit of greatly reduced hallucinations (still, it depends on the model; Gemini is more prone to doing extra things, even when uncalled for).

I also don't need to fine-comb everything. Logic and interactions I'll check, but basic everyday stuff is usually already pretty well explained in the repo, and the model usually picks up on it.


From a quick testing on simple tasks, adaptive thinking with sonnet 4.6 uses about 50% more reasoning tokens than opus 4.6.

Let's see how long it will take for DeepSeek to crack this.

