I think that a wasteful but good solution would be to tag each token, rather than use opening/closing tags.
Whatever n-dimensional space the tokens occupy, manually add more dimensions to reflect user/agent and trusted/untrusted input.
It should be much harder for the LLM to fuck up this way if every single word it reads screams "suspicion" or "trust". With tag tokens only at the start, it can just forget them.
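A minimal sketch of the idea, assuming you can widen the embedding layer before the first transformer block; the function name and the two-channel encoding are my own invention, not an existing API:

```python
import numpy as np

def tag_tokens(token_embs: np.ndarray, trusted: list) -> np.ndarray:
    """Append per-token provenance dimensions instead of delimiter tags.

    token_embs: (seq_len, d_model) token embeddings.
    trusted:    one boolean flag per token.
    Returns (seq_len, d_model + 2) with [trust, suspicion] channels
    appended, so the signal rides on every single token rather than
    only on boundary markers the model can forget about.
    """
    flags = np.asarray(trusted, dtype=token_embs.dtype).reshape(-1, 1)
    extra = np.concatenate([flags, 1 - flags], axis=1)  # trust / suspicion
    return np.concatenate([token_embs, extra], axis=1)
```

The point of two redundant channels instead of one is that "untrusted" becomes an active signal on every token, not just the absence of a trust bit.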
I had to fall back to that to deliver anything recently - but the last two months were really comfy with me just saying "do x" and just going on a walk and coming back to a working project.
Claude is still useful, but it feels more like a replacement for bashing on a keyboard than a thinking machine now.
That's like saying there's no point in attending a lecture on "how to get the best out of your time at University" because University courses are taught in spoken language so you could just ask the professors.
The idea that AI can write code like a seasoned software developer but can't use its own tooling, which can be learned through an 11-chapter tutorial, doesn't make any sense.
On that topic, anyone here got a decent local coding AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in the memory and they run quick but are just a tad too dumb. I have 64GB of system ram so I can run larger models and they are at least coherent, but really slow compared to running from VRAM.
Not the answer that you are looking for, but I am a fellow AMD GPU owner, so I want to share my experience.
I have a 9070 XT, which has 16GB of VRAM.
My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.
I tried various AI + VRAM calculators, but nothing was as on point as Huggingface's built-in functionality. You simply sign up and configure which GPU you have in the settings [1], so that when you visit a model page, you immediately see which of the quants fit in your card.
Of the open-source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.
The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you'd expect it to be.
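You can back-of-envelope whether a quant fits before downloading anything: the file size is roughly parameters times bits-per-weight divided by 8. The bpw figures below are approximate averages for llama.cpp K-quants (the mixed-precision formats pad the nominal bit width a little), so treat them as estimates, not exact sizes:

```python
# Approximate effective bits-per-weight for common GGUF quants.
BPW = {"Q4_K_S": 4.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def gguf_size_gb(params_b: float, quant: str) -> float:
    """Rough GGUF file size in GB for a model with params_b billion weights."""
    return params_b * BPW[quant] / 8

# A 9B model at 6-bit: ~7.4 GB, leaving headroom for KV cache on a 12GB card.
print(gguf_size_gb(9, "Q6_K"))
# A 35B model at Q4_K_S: ~20 GB, so it needs partial CPU offload on 12GB.
print(gguf_size_gb(35, "Q4_K_S"))
```

The ~20 GB figure lines up with the 20.7GB weight size mentioned below for the 35B model.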
Which model have you tried locally? Also, out of curiosity, what is your host configuration?
For autocomplete, Qwen 3.5 9B should be enough even at Q4_k_m.
The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).
Either that or just load up Qwen3.5-35B-A3B-Q4_K_S
I'm serving it at about 40-50 t/s on an RTX 4070 Super 12GB + 64GB of RAM. The weights are 20.7GB + KV cache (which should shrink soon with the upcoming addition of TurboQuant).
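To see why quantizing the KV cache matters, here's the standard size formula; the layer/head dimensions below are illustrative round numbers, not the actual config of any model in this thread:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int) -> int:
    """K and V are each (n_kv_heads, head_dim) per layer per token,
    hence the factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative dims: 48 layers, 8 KV heads, head_dim 128, 32k context.
fp16 = kv_cache_bytes(48, 8, 128, 32768, 2)  # 16-bit cache: ~6.0 GiB
int8 = kv_cache_bytes(48, 8, 128, 32768, 1)  # 8-bit cache: half of that
print(fp16 / 2**30, int8 / 2**30)
```

With weights already spilling past 12GB of VRAM, halving a multi-GiB cache is the difference between offloading a few more layers to the GPU or not.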
I am definitely looking forward to TurboQuant. Makes me feel like my current setup is an investment that could pay over time. Imagine being able to run models like MiniMax M2.5 locally at Q4 levels. That would be swell.
Considering how my parents still refer to that area of the world as Yugoslavia, I'm pretty sure the postal system will know how to route it. Will probably be escalated to a human for labeling though.
I found the project on YouTube[1] and wanted to share it - but decided to find something text-based for HN, and in the rush to post I failed to check whether the post was even complete. I should've posted the video instead.
I think that eventually, Win32/WoW64 will be the stable common API for Linux programs - or at least games. I won't be surprised if it outlasts Windows.
It is a solution. Once you do it, your problem is solved; that makes it the solution.
If you aren't willing to go with that, you can stay with Windows and just accept the constant abuse.
As for gaming, I've been on Linux for two years now and I haven't had a single game not work.
And as for a better solution: teach kids. Once I'm an ornery PTA parent, I'm going to push for programming and some sort of *nix to be taught at the school, even if I have to volunteer to do it myself.