Hacker News | kamranjon's comments

Can you go into more detail about what you did with your investments? How did you invest in the international market?

Not OP, however I also did that. I used Vanguard's international index fund.

Pretty much this

In my experience Qwen 3 Coder Next is better. I ran quite a few tests yesterday, and it was much better at utilizing tool calls properly and understanding complex code. For its size, though, 3.5 35B was very impressive. Coder Next is an 80B model, so I think it's just a size thing. Also, for whatever reason, Coder Next is faster on my machine; the only model that's competitive in speed is GLM 4.7 flash.

What do you use as the orchestrator? By this I mean opencode, or the like. Is that the right term?

I use the term "harness" for those - or just "coding agent". I think orchestrator is more appropriate for systems that try to coordinate multiple agents running at the same time.

This terminology is still very much undefined though, so my version may not be the winning definition.


I'm basically using the agentic features of the Zed editor: https://zed.dev/agentic

It's really easy to set up with any OpenAI-compatible API. I self-host Qwen 3 Coder Next on my personal MBP using LM Studio and just dial in from my work laptop with Zed and Tailscale, so I can connect from wherever I might be. It's able to do all sorts of things: run linting checks and tests, look for issues, refactor code, create files, and so on. I'm definitely still learning, but it's a pretty exciting jump from just talking to a chatbot and copying and pasting things manually.
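For anyone curious what "any OpenAI-compatible API" looks like in practice, here's a minimal sketch using only the Python standard library. The hostname, port, and model id are assumptions (LM Studio's local server defaults to port 1234; over Tailscale you'd use the machine's tailnet name instead of localhost):

```python
import json
import urllib.request

# LM Studio's local server speaks the OpenAI chat-completions API.
# BASE_URL and the model id are assumptions -- adjust for your own setup.
BASE_URL = "http://localhost:1234"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("qwen3-coder-next", "Write a haiku about tailscale")
# With the server actually running, you'd send it like this:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible harness (Zed's agent included) is ultimately just issuing requests shaped like this one.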


Another vote in favour of "harness".

I'm aligning on "agent" for the combination of harness + model + context history (so after you fork an agent, you now have two distinct agents).

And "orchestrator" means the system that runs multiple agents together.


This has also been my understanding of all of these terms so far

I use Qwen 3 Coder Next daily on my Mac as my main coding agent. It is incredibly capable, and it's strange how you're painting this picture as if it's a fringe use case; there are whole communities that have popped up around running local models.

Can I doubt your claim? I have had such terrible luck with AI coding on <400B models. Not to mention, I imagine your codebase is tiny. Or you are working for some company that isn't keeping track of your productivity.

I am trying super hard to use cheap models, and outside SOTA models, they have been more trouble than they are worth.


Yesterday, I got Qwen-Coder-Next to build a Python script that reads a Postman collection, pulls the data from it to build a request to one of the endpoints, downloads a specific group of files whose URLs were buried in that endpoint's JSON payload, then transforms them all to a specific size of PNG, all without breaking a sweat. I didn't even have to tell it to use Pillow, but it did everything to a T.

Use case means everything. I doubt this model would fare well on a large codebase, but this thing is incredible.
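For a sense of what that kind of script involves, the core of it is mostly JSON-walking. A rough sketch, where the collection filename, field layout, and resize dimensions are all made up for illustration (the Pillow step is left as a comment since it needs actual files):

```python
from urllib.parse import urlparse

def collect_urls(node, found=None):
    """Recursively walk a JSON payload and collect anything that looks like a URL."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_urls(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_urls(item, found)
    elif isinstance(node, str) and urlparse(node).scheme in ("http", "https"):
        found.append(node)
    return found

# The rest of the flow, roughly (names here are hypothetical):
#   collection = json.load(open("collection.postman_collection.json"))
#   ...pick the request definition out of collection["item"], call the endpoint,
#   run collect_urls() over the response payload, and download each file.
#   Resizing with Pillow would be something like:
#       from PIL import Image
#       Image.open(path).resize((512, 512)).save(out_path, "PNG")
```

Nothing exotic, but it's exactly the kind of glue work where a small local model can save real time.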


Absolutely. My codebase is huge; it's a monolith. But my work is in very specific parts of it, and I don't pull the entire codebase into context (I don't think that's common practice even with Claude). I start at a specific point with a specific task and work with the agent toward something clearly defined: writing tests, extracting things into separate files, refactoring, or even scaffolding a new feature. You have to periodically start new threads, because you'll start hitting the context limits, but I max it out at over 200k since I have the memory overhead on my 128GB MBP to do that, so I can get quite a lot done.

I really recommend trying the Qwen models; 3 Coder Next is really incredible. GLM 4.7 flash is also incredibly performant on modest hardware. An important thing to consider is setting the temperature, top_p, top_k, etc. based on what the model's provider recommends; something as simple as that can make a huge difference in performance.
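As a concrete example, the Qwen3 model cards recommend values along the lines of temperature 0.7 / top_p 0.8 / top_k 20 (check the card for your exact model; these numbers are from memory). In an OpenAI-compatible request body that looks like this, noting that top_k is a common server extension rather than part of the official OpenAI schema:

```python
# Sampling settings passed alongside a chat-completions request. The values
# are illustrative -- confirm against the model provider's recommendations.
request_body = {
    "model": "qwen3-coder-next",  # hypothetical model id
    "messages": [{"role": "user", "content": "Refactor this function..."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,  # non-standard extension; LM Studio and llama.cpp servers accept it
}
```

Leaving these at a client's defaults (often temperature 1.0) is one of the easiest ways to make a small model look worse than it is.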

The other big leap for me was switching to Zed editor and getting its agent stuff just seamlessly integrated. If you run LM Studio on your local machine it's super easy and even setting it up on a remote machine and calling out to LM Studio is dead simple.


I'm not sure if you're just unaware or being purposefully dense. It's absolutely possible to get those numbers for certain models on an M4 Max, and it's averaged over many tokens; just yesterday I was getting 127 tok/s on a 700-token response from a 24B MoE model. I tend to use Qwen 3 Coder Next the most, which is closer to 65 or 70 tok/s, but absolutely usable for dev work.

I think the truth is somewhere in the middle. Many people don't realize just how performant some of these models have become on Mac hardware (especially with MLX), and just how powerful the shared-memory architecture is, but there is also a lot of hype and misinformation about performance compared to dedicated GPUs. It's a tradeoff between available memory and performance, but often it makes sense.


What inference runtime are you using? You mentioned MLX, but I didn't think anyone was using that for local LLMs.

LM Studio (which prioritizes MLX models when you're on a Mac and they're available). I have it set up with Tailscale, running as a server on my personal laptop, so while I'm working I can connect to it from my work laptop, wherever I might be, and it's integrated through the Zed editor's built-in agent; it's pretty seamless. Whenever I want to use my personal laptop, I just unload the model and do other things. It's a really nice setup. I'm definitely happy I got the 128GB MBP, because I do a lot of video editing and 3D rendering work as a hobby, and it's sort of dual-purpose in that way: I can take advantage of the compute power when I'm not actually on the machine by running it as an LLM server.

LM Studio has had an MLX engine and models since 2024.

I have always wondered whether the Neural Engine could be used for training; pretty excited for part 3 of this series to see if the juice is actually worth the squeeze.

In principle most if not all inference hardware should be usable for training.

Efficiency is the question.


This is a great idea, but the models seem pretty outdated; it's recommending things like Qwen 2.5 and StarCoder 2 as perfect matches for my M4 MacBook Pro with 128GB of memory.

I thought this part of the write-up was interesting:

"This is, I think, in contradiction with the idea that LLMs are memorizing the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts of the code if prompted to do so, they don’t have a copy of everything they saw during the training set, nor they spontaneously emit copies of already seen code, in their normal operation."

Can't things basically get baked into the weights when trained on enough iterations, and isn't this the basis for a lot of plagiarism issues we saw with regards to code and literature? It seems like this is maybe downplaying the unattributed use of open source code when training these models.


I'm really interested in using this but wonder if the unique architecture means that it will not be able to be converted to a GGUF and used by ollama or llama.cpp? I certainly would understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local ai server (basically just ollama + tailscale) and see how it works as a regular model.


Not immediately, but it's not much more work for llama.cpp than any new foundation model, which typically ships with a tweaked compute graph.


It would be pretty incredible if they could host an embedding model on this same hardware, I would pay for that immediately. It would change the type of things you could build by enabling on the fly embeddings with negligible latency.


I haven't quite figured out whether the open weights they released on Hugging Face amount to being able to run the (realtime) model locally. I hope so, though! For the larger model with diarization, I don't think they open-sourced anything.


The HF page suggests yes, with vllm.

> We've worked hand-in-hand with the vLLM team to have production-grade support for Voxtral Mini 4B Realtime 2602 with vLLM. Special thanks goes out to Joshua Deng, Yu Luo, Chen Zhang, Nick Hill, Nicolò Lucchesi, Roger Wang, and Cyrus Leung for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

https://docs.vllm.ai/en/latest/serving/openai_compatible_ser...

