Hacker News | new | past | comments | ask | show | jobs | submit | joshuaisaact's comments

This was a really interesting paper, but there's a notable gap in what they didn't try: inference-time temperature changes based on the fork/lock distinction.

Maybe I'll try that myself, because it feels like it could be a great source of improvements. It would be really useful to see adaptive per-token sampling as an additional decode-only baseline.
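A minimal sketch of what I have in mind (the entropy threshold and the two temperatures are made-up knobs, standing in for whatever the fork/lock classifier would actually provide):

```python
import math
import random

def softmax(logits, temperature):
    # Scale logits by temperature, then normalise into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_sample(logits, t_lock=0.2, t_fork=1.0, threshold=1.0):
    """Use a low temperature when the distribution is already peaked
    ('lock' positions) and a higher one when it is flat ('fork' positions).
    The entropy threshold is an invented stand-in, not from the paper."""
    base = softmax(logits, 1.0)
    t = t_fork if entropy(base) > threshold else t_lock
    probs = softmax(logits, t)
    idx = random.choices(range(len(logits)), weights=probs)[0]
    return idx, t
```

In practice you'd want the fork/lock signal to come from the model itself rather than raw entropy, but this is the shape of the decode-only baseline I'd like to see.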


Is this some kind of calibration then? I'd expect that the probabilities automatically adjust during training, such that in "lock" mode, for example, syntax-breaking tokens have a very low probability and would not be picked even with a higher temperature.
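That intuition is easy to sanity-check: temperature rescales logits before the softmax, so a token the model has learned to push far down stays rare even when the distribution is flattened. A toy example:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: a syntax-valid token far above a syntax-breaking one.
logits = [12.0, 2.0, 0.0]
p_low = softmax(logits, temperature=1.0)
p_high = softmax(logits, temperature=2.0)
# Doubling the temperature flattens the distribution, but the bad
# token's probability is still well under 1% in absolute terms.
```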

ASML is European and is arguably the most strategically important company in the entire semiconductor supply chain.


ASML Holding dominates chip-making technology with their machines. It's not a lack of invention or intellectual capability that holds Europe back from the digital industry, it's the lack of willing long-term European investors. If you want to scale your digital tech startup in Europe, the most viable way is to look to the US for investors.


ASML is not a chip maker, it is a chip-maker maker. Still important though.

Europe should request a discount for ASML machines in an EU factory.


The hardware is in Europe. The IP and control are in the USA.


Have you heard of a little company called Arm Holdings?

It was a travesty that the UK government let it be sold, admittedly.


UK isn't European. They made that clear when they voted for Brexit.


We are. Do you think someone dragged the whole country to a new location?

Norway, Iceland and Switzerland are also not in the EU. Are they also on a different continent?


The UK is European. Membership of the EU is unnecessary for that criterion to be met.


The UK is not in the EU, but it is surely European.


The UK is no longer in the EU; the UK is still in Europe and is very much European.


I've been exploring Petri nets as a formalism for AI agent safety, specifically, proving properties like termination and human-gate enforcement exhaustively across every reachable state, rather than testing them on sample inputs. This post benchmarks the approach against n8n and ReAct on the same agent. Tomorrow I'm open-sourcing the engine as a declarative rules DSL.
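To give a flavour of the exhaustive part: a toy Petri net can be checked by breadth-first search over its reachable markings. The places and transitions below are illustrative, not the actual DSL:

```python
from collections import deque

# A marking maps place name -> token count. A transition fires when
# every input place holds enough tokens, consuming inputs and
# producing outputs.
TRANSITIONS = {
    "plan":    ({"start": 1}, {"awaiting_human": 1}),
    "approve": ({"awaiting_human": 1}, {"acting": 1}),
    "finish":  ({"acting": 1}, {"done": 1}),
}

def fire(marking, inputs, outputs):
    if any(marking.get(p, 0) < n for p, n in inputs.items()):
        return None
    nxt = dict(marking)
    for p, n in inputs.items():
        nxt[p] -= n
    for p, n in outputs.items():
        nxt[p] = nxt.get(p, 0) + n
    return nxt

def reachable_markings(initial):
    """Enumerate every reachable state by breadth-first search."""
    seen = {frozenset(initial.items())}
    queue = deque([initial])
    while queue:
        m = queue.popleft()
        yield m
        for inputs, outputs in TRANSITIONS.values():
            nxt = fire(m, inputs, outputs)
            if nxt is not None and frozenset(nxt.items()) not in seen:
                seen.add(frozenset(nxt.items()))
                queue.append(nxt)
```

Here the human-gate property holds by construction: the only transition producing a token in "acting" consumes one from "awaiting_human", and the BFS confirms this over every reachable marking rather than a sampled trace.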


I don't think you need two separate models for this - I get similarly good results re-prompting with Claude. Well, not re-prompting, I just have a skill that wipes the context then gets Claude to review the current PR and make improvements before I review it.


Couldn't disagree with this article more. I think the future of software engineering is more T-shaped.

Look at the 'Product Engineer' roles we're seeing spread through forward-thinking startups and scaleups.

That's the future of SWE I think. SWEs take on more PM and design responsibilities as part of the existing role.


I agree. In many cases it's probably easier for a developer to become more of a product person than for a product person to become a dev. Even with LLMs you still need some technical skills and the ability to read code to handle technical tasks effectively.

Of course things might look different when the product is something that requires really deep domain knowledge.


I don't think the two are mutually exclusive! E.g. a T-shaped product engineer on one side and a T-shaped SRE on the other. Both will kind of compact what used to be multiple roles/responsibilities together. The good news (and my prediction) is that the engineering won't be going away as much as the other roles.


Or architects, someone has to draw the nice diagrams and spec files for the robots.

However, like in automated factories, only a small percentage is required to stay around.


This has been pretty comprehensively disproven:

https://arxiv.org/abs/2311.10054

Key findings:

- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions

- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added

- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection

- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random

Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.


Personas are not the same thing as roles. The point of a role is to limit the work of the agent and focus it on one or two behaviors.

What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.

The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the output of the LLM.


Aside from what you said about applicability, the paper actually contradicts their claim!

In the domain alignment section:

> The coefficient for “in-domain” is 0.004(p < 0.01), suggesting that in-domain roles generally lead to better performance than out-domain roles.

Although the effect size is small, why would you not take advantage of it?


I would be interested in an eval that checked both conditions: "you are an amazing X" vs. "you are a terrible X". Also, there have been a bunch of papers recently looking at whether threatening the LLM improves output; I would like to see a variation that tries carrot and stick as well.


How well does such LLM research hold up as new models are released?


Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
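As a sketch (the tasks and names are illustrative), a frozen harness is just tasks plus a scoring rule that never change, with the model as the only swappable part:

```python
# Frozen harness: the tasks and the scoring rule are the stable
# artefact; only the model function is swapped per release.
TASKS = [
    {"prompt": "2 + 2 = ?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def score(model_fn):
    """model_fn: prompt -> answer string. Returns accuracy in [0, 1]."""
    hits = sum(
        1 for t in TASKS
        if t["expected"].lower() in model_fn(t["prompt"]).lower()
    )
    return hits / len(TASKS)

def compare(old_model, new_model):
    # Report the delta under the identical protocol,
    # rather than re-benchmarking on a fresh task set.
    return score(new_model) - score(old_model)
```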


In a discussion about LLMs you link to a paper from 2023, when not even GPT-4 was available?

And then you say:

> comprehensively disproven

? I don't think you understand the scientific method


Fair point on the date - the paper was updated October 2024 with Llama-3 and Qwen2.5 (up to 72B), same findings. The v1 to v3 revision is interesting. They initially found personas helped, then reversed their conclusion after expanding to more models.

"Comprehensively disproven" was too strong - should have said "evidence suggests the effect is largely random." There's also Gupta et al. 2024 (arxiv.org/abs/2408.08631) with similar findings if you want more data points.


A paper’s date does not invalidate its method. Findings stay useful only when you can re-run the same protocol on newer models and report deltas. Treat conclusions as conditional on the frozen tasks, criteria, and measurement, then update with replication, not rhetoric.


...or even how fast technology is evolving in this field.


One study has “comprehensively disproven” something for you? You must be getting misled left, right and centre if that’s how you absorb study results.


This feels like massively overengineering something very simple.

Agents are stateless functions with a limited heap (context window) that degrades in quality as it fills. Once you see it that way, the whole swarm paradigm is just function scoping and memory management cosplaying as an org chart:

Agent = function

Role = scope constraints

Context window = local memory

Shared state file = global state

Orchestration = control flow

The solution isn't assigning human-like roles to stateless functions. It's shared state (a markdown file) and clear constraints.
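Concretely, the whole "swarm" collapses to something like this sketch, with a JSON file standing in for the shared markdown state and call_llm standing in for whatever client you use:

```python
import json
from pathlib import Path

STATE_FILE = Path("shared_state.json")  # stands in for the markdown file

def read_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"tasks": [], "done": []}

def run_agent(role_constraints, task, call_llm):
    """An 'agent' is just a stateless call: scoped prompt in, result out.
    call_llm is a parameter, not a framework."""
    state = read_state()
    prompt = (
        f"{role_constraints}\n\n"
        f"Shared state:\n{json.dumps(state)}\n\n"
        f"Task: {task}"
    )
    result = call_llm(prompt)
    state["done"].append({"task": task, "result": result})
    STATE_FILE.write_text(json.dumps(state))
    return result
```

Orchestration is then ordinary control flow: call run_agent in sequence or in parallel, with the file as the single source of truth.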


I've basically always handled Claude Code this way, by asking it to spawn subagents as much as possible to handle self-contained tasks (I've heard there are hacks to make subagents work with Codex). But Claude Code's new tasks seem to go further: they let subagents coordinate via a common file to avoid stepping on each other's toes (by creating a dependency graph).


I don’t follow. You said it’s overengineering and then proposed what appears to be functionally the exact same thing?

Isn’t a “role” just a compact way to configure well-known systems of constraints by leveraging LLM training?

Is your proposal that everybody independently reinvent the constraints wheel, so to speak?


Fair push back. The distinction I'm drawing is between:

A. Using a role prompt to configure a single function's scope ("you are a code reviewer, focus on X") - totally reasonable, leverages training

B. Building an elaborate multi-agent orchestration layer with hand-offs, coordination protocols, and framework abstractions on top of that

I'm not arguing against A. I'm arguing that B often adds complexity without proportional benefit, especially as models get better at long-context reasoning.

Fairly recent research (arXiv May 2025: "Single-agent or Multi-agent Systems?" - https://arxiv.org/abs/2505.18286) found that MAS benefits over single-agent diminish as LLM capabilities improve. The constraints that motivated swarm architectures are being outpaced by model improvements. I admit the field is moving fast, but the direction of travel appears to be that the better the models get, the simpler your abstractions need to be.

So yes, use roles. But maybe don't reach for a framework to orchestrate a PM handing off to an Engineer handing off to QA when a single context with scoped instructions would do.


Thanks for clarifying. I’ve queued up that paper.

I’m building an agentic solution to a problem (monitoring social media and news sources, then building world views of different participants).

A single agent would have insufficient context window size to achieve this in one API call, which means I need parallel agents. Then I have to consolidate the parallel outputs in a way that correctly updates state. I feel like multi-agent is the only way to solve this.

Effectively I’m treating the agents as threads and the roles as functions, with one agent managing writing state to avoid shared state surprises. Thinking of it with the actor model (= mailboxes) makes orchestration fairly straightforward and not really much more complex than the way we already build distributed/multi-threaded applications so I was wondering if I was missing something about why this would be an issue just because the implementation is an LLM prompt instead of a typical programming language.
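For what it's worth, the mailbox framing translates almost directly. A sketch with queue-backed actors, where a single writer serialises all state updates (the worker body stands in for the real LLM call):

```python
import queue
import threading

def writer_actor(mailbox, state, stop):
    # Single writer: every state mutation flows through one mailbox,
    # so parallel workers never race on shared state.
    while not stop.is_set() or not mailbox.empty():
        try:
            update = mailbox.get(timeout=0.1)
        except queue.Empty:
            continue
        state.append(update)
        mailbox.task_done()

def worker(name, source, writer_mailbox):
    # Each worker summarises its slice and posts the result; in the
    # real system this would be an LLM call over one source's content.
    writer_mailbox.put({"worker": name, "summary": f"view of {source}"})

state, stop = [], threading.Event()
mailbox = queue.Queue()
writer = threading.Thread(target=writer_actor, args=(mailbox, state, stop))
writer.start()

workers = [
    threading.Thread(target=worker, args=(f"w{i}", src, mailbox))
    for i, src in enumerate(["news", "social"])
]
for w in workers:
    w.start()
for w in workers:
    w.join()

mailbox.join()  # block until the writer has drained every update
stop.set()
writer.join()
```

The point being: once you squint at it this way, it's the same discipline we already use for multi-threaded code, which is why I don't think the LLM-ness changes much.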


This is a brilliant library, thanks so much for sharing it


I may have misread your comment, but I don't think soft skills are a 'narrow thing' at all. Effective communication, building trust, bringing people along with you - these are fundamental to being an effective human, not some niche pivot.


"Effective communication, building trust, bringing people along with you" That's a David Brent powerpoint presentation.


Fair. I'll retire 'bringing people along with you' before it ends up on a motivational poster with a stock photo of a rowing team.

Though you're right that there's no I in team. There is one in AI though, which probably tells us something.


Not fair on you. I did not mean to have a dig. I get where you are coming from, and should have elaborated. I've worked with those one or two engineers who were rude by default. Who had an extraordinary knack of vaguely describing the problem set, and then having a full on meltdown, always in front of other people, when the solutions did not match the problem in their head.*

*Goldman Sachs(sorry for invoking that name here) did a report on their high turnover, and the above framing was why many quit.


Look, if we zoom in, then "learning to code" is also quite a broad range of skills that someone needs to master before they can meaningfully carve out a career in a competitive marketplace.

The point is that if you zoom out, it's just a thin slice that can be automated by machines. People keep saying "I'll tell you in my experience, no UAV will ever trump a pilot's instinct, his insight, the ability to look into a situation beyond the obvious and discern the outcome, or a pilot's judgment"... https://www.youtube.com/watch?v=ZygApeuBZdk

But as you can see, they're all wrong. By narrow here I meant a thin layer that believes it's indispensable while all the other layers are removed. Until the system comes for this layer too.

