Yeah but *that* strong?

2ndorderthought · 2026-05-12T12:11:10 1778587870

Yes, that strong. It's only lacking in context length, though it's not that small there, and it gets caught in circles more often than, say, a 1T-parameter model does.

That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two and can do agentic programming at a reasonable enough tokens per second.

johndough · 2026-05-12T16:07:15 1778602035

> it gets caught in circles more often than say a 1t parameter model does.

I've found that the Q5+ quants are less loopy than Q4. Still not perfect, but noticeably better.

> reasonable enough tokens per second

The speed has been amazing. I've been running the recent llama.cpp MTP branch with an uncensored variant of Qwen3.6-35B-A3B on my RTX 3090 at over 170 tokens per second, and it was able to turn a buffer overflow into a reliable shell exploit in just a few seconds (with reasoning disabled). Still a bit loopy though. Hopefully the Qwen team will pay more attention to those looping issues. It feels like their models are especially susceptible.

2ndorderthought · 2026-05-12T17:23:01 1778606581

Is that on a single 3090? Sounds like I need to change my settings.

johndough · 2026-05-12T20:05:53 1778616353

Yes, a single RTX 3090 with this model: https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-h... I followed these instructions: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF (you should add "-j 8" to the last cmake command for a parallel build) and ran llama-server with --reasoning off.

Note that the MTP PR https://github.com/ggml-org/llama.cpp/pull/22673 is still under development, so things might be broken.
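
For anyone wanting to try the steps above, here is a rough dry-run sketch of the build and launch sequence. It only covers what was mentioned in this thread ("-j 8" on the last cmake command and "--reasoning off" on llama-server). The `GGML_CUDA` option, the `build` directory, the model path, and the `run` helper are assumptions for illustration, and any MTP-specific flags are deliberately left out, so defer to the linked instructions where they differ. It assumes the MTP branch is already checked out in the current directory.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command by default.
# Set DRY_RUN=0 to actually execute the commands.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Placeholder path: point MODEL at the GGUF file you downloaded.
MODEL="${MODEL:-/path/to/model.gguf}"
# Number of parallel build jobs ("-j 8" as suggested above).
JOBS="${JOBS:-8}"

# Configure with CUDA enabled (assumed; check the linked instructions).
run cmake -B build -DGGML_CUDA=ON
# Build in parallel.
run cmake --build build --config Release -j "$JOBS"
# Start the server with reasoning disabled.
run ./build/bin/llama-server -m "$MODEL" --reasoning off
```

By default this only prints the three commands, which makes it easy to compare them against the linked instructions before running anything with `DRY_RUN=0`.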