lsb's comments

The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.


That article is literally a definition of TDD that has been around for years and years. There's nothing novel there at all. It's literally test driven development.

If the agent is writing the tests itself, does it offer better correctness guarantees than letting it write code and tests?

In my experience the agent regularly breaks some current features while adding a new one, much more often than a human would. Agents too often forget about the last feature when adding the next and so will break things. Thus I find agent-generated tests important, as they stop the agent from making a lot of future mistakes.

It is definitely not foolproof but IMHO, to some extent, it is easier to describe what you expect to see than to implement it so I don't find it unreasonable to think it might provide some advantages in terms of correctness.

That definitely depends upon the situation. More often than not, properly testing a component takes me more time than writing it.

In my experience, this tends to be more related to instrumentation / architecture than a lack of ability to describe correct results. TDD is often suggested as a solution.

Given the issues at AWS with Kiro and at GitHub, we already have a few high-profile examples of what happens when AI is used at scale, even when you let it generate tests, which is something you should absolutely not do.

Otherwise in some cases, you get this issue [0].

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...


Don't "let it" generate tests. Be intentional. Define them in a way that's slightly oblique to how the production code approaches the problem, so the seams don't match. Heck, that's why it's good to write them before even thinking about the prod side.
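A rough sketch of what "oblique" could look like, assuming a sort routine as the code under test: the checks state properties of the result (ordered, same elements) rather than mirroring how the implementation works. All names here (`check_sort`, `insertion_sort`, `CASES`) are hypothetical and only for illustration.

```python
from collections import Counter

# Hypothetical inputs, chosen before looking at any implementation.
CASES = [[], [1], [3, 1, 2], [2, 2, 1], [5, 4, 3, 2, 1]]


def is_sorted(xs):
    """True if every element is <= its successor."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def check_sort(sort_fn, cases):
    """Return the inputs on which sort_fn breaks a property.

    The properties say what a correct result looks like; they do not
    describe how the result should be computed.
    """
    failures = []
    for case in cases:
        out = sort_fn(list(case))
        same_elements = Counter(out) == Counter(case)
        if not is_sorted(out) or not same_elements:
            failures.append(case)
    return failures


def insertion_sort(xs):
    """Stand-in production code (an assumption for this sketch)."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


if __name__ == "__main__":
    print(check_sort(insertion_sort, CASES))
```

Because the checks never look at how the sort is done, a rewrite that silently drops duplicates (say `sorted(set(xs))`) still gets caught by the same-elements property.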

The linked article does not speak of tests, it speaks of a team that failed to properly review an LLM refactor then proceeds to blame the tooling.

LLMs are good at writing tests in my experience.


TDD == Prompt Engineering, for Agentic coding tasks.

Wild it’s taken people this long to realize this. Also lean tickets / tasks with all needed context to complete the task, including needed references / docs, places to look in source, acceptance criteria, other stuff.

It’s named after the multi-decade data compression test image https://en.wikipedia.org/wiki/Lenna

Buy the book! https://qntm.org/vhitaos


Just sharing that I bought Valuable Humans in Transit some years ago and I concur that it's very nice. It's a tiny booklet full of short stories like Lena that are way out there. Maximum cool per gram of paper.


[flagged]


If you read the original text, what happens in that story is also grossly inappropriate. Maybe that's the parallel.


that's kind of the point


could you be more specific?


[flagged]


The woman herself says she never had a problem with it being famous. The actual test image is obviously not porn, either. But anything to look progressive, I guess.


From the link above

> Forsén stated in the 2019 documentary film Losing Lena, "I retired from modeling a long time ago. It's time I retired from tech, too... Let's commit to losing me."


It's a ridiculous idea that once you retire all depictions must be destroyed.

Should we destroy all movies with retired actors? All the old portraits, etc.

It's such a deep disrespect to human culture.


That's of course not the meaning of that message. No one is suggesting that.


Everybody knows that. The GP's reaction is what perplexes me. Are they saying the name of the story is inappropriate? I think it's very appropriate.


> Lena is no longer used as a test image because it's porn.

The Lenna test image can be seen over the text "Click above for the original as a TIFF image." at [0]. If you consider that to be porn, then I find your opinion on what is and is not porn to be worthless.

The test image is a cropped portion of porn, but if a safe-for-work image would be porn but for what you can't see in the image, then any picture of any human ever is porn as we're all nude under our clothes.

For additional commentary (published in 1996) on the history and controversy about the image, see [1].

[0] <http://www.lenna.org/>

[1] <https://web.archive.org/web/20010414202400/http://www.nofile...>


Nudity is not pornography. Intent matters.


I agree that not all nudity is porn - nudity is porn if the primary intent of that nudity is sexual gratification. When the nudity in question was a Playboy magazine centerfold, the primary intent is fairly obvious.


I can't see how that would be porn either; it's nudity. There's nudity in the Sistine Chapel and I would find it hilarious if it was considered porn.


It's interesting because where I'm from, there was "erotica" and there was "porn". This image would at best be erotica. It would not be considered porn.

Like the US Supreme Court's "I know it when I see it", the definition isn't straightforward, but it has elements of "is it a depiction of a sexual act or simply nudity", as well as any artistic quality. Generally, erotica has high production values and porn less so.

Anyhoo! What a weird place for the discussion to end up :-). The story is excellent and very Hacker News appropriate, but his entire opus is pretty good. There's a bit of deus ex machina in some of qntm's work, but generally they have the right mix of surreal and puzzling and cryptic and interesting to engage a computer geek's mind :-).


the "porn" angle is very funny to me, since there is nothing pornographic or inappropriate about the image. when I was young, I used to think it was some researcher's wife whom he loved so much he decided to use her picture absolutely everywhere.

it's sufficient to say that the person depicted has withdrawn their consent for that image to be used, and that should put an end to the conversation.


is that how consent works? I would have expected licenses would override that. although it's possible that the original use as a test image may have violated whatever contract she had with her producer in the first place.


tl;dr yes it is

she did not explicitly consent for that photo to be used in computer graphics research or millions of sample projects. moreover, the whole legality of using that image for those purposes is murky because I doubt anyone ever received proper license from the actual rights-holder (playboy magazine). so the best way to go about this is just common-sense good-faith approach: if the person depicted asks you to please knock it off, you just do it, unless you actively want to be a giant a-hole to them.


That's nonsense. If Carrie Fisher had "withdrawn consent" for her depiction in Star Wars, should we destroy the movies, all Princess Leia fan art, etc.?


No, because the replacement value of those things to others is very high, and generally outweighs Carrie Fisher's objection. But we should take her objection into consideration going forwards. The Lena test image is very easy to replace, and it's not all that culturally significant: there's no reason to keep using it, unless we need to replicate historical benchmarks.


I'm using Sonnet with 1M Context Window at work, just stuffing everything in a window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy when I'm using local models with Ollama


The New York Times has said that the US president has reported capturing the president of Venezuela https://www.nytimes.com/live/2026/01/03/world/trump-united-s...

Source about aviation: primary (I am at an airport now) and also there are no flights going into or out of JFK right now https://www.jfkairport.com/flight-tracker?view=VIEW_DEPARTUR...


This is super interesting!

Apache Arrow is trying to do something similar, using FlatBuffers to serialize with zero-copy and zero-parse semantics, and an index structure built on top of that.

Would love to see comparisons with Arrow


Arrow has a different use case I think. Lite3 / TRON is effectively more efficient JSON. Arrow uses an array per property. This allows zero copy per property access across TB scale datasets amongst other useful features - it’s more like the core of a database.
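To make "an array per property" concrete, here is a minimal sketch in plain Python. It uses the standard-library `array` and `memoryview` as stand-ins; it is not Arrow's actual format or API, and the sample records and the `to_columns` helper are made up for illustration.

```python
from array import array

# Row-oriented: one record per object, the way JSON-like formats lay data out.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 7.25},
]


def to_columns(records):
    """Pivot records into one contiguous typed array per property."""
    return {
        "id": array("q", (r["id"] for r in records)),        # 64-bit ints
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
    }


columns = to_columns(rows)

# Reading one property touches only that property's buffer.
price_view = memoryview(columns["price"])  # a view over the buffer, no copy
total = sum(price_view)

# Slicing a memoryview also shares the underlying buffer.
tail = price_view[1:]

if __name__ == "__main__":
    print(total, list(tail))
```

The row form is convenient for reading or writing a whole record at a time; the column form is what makes scanning one property across many records cheap.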

A closer comparison would be to FlatBuffers which is used by Arrow IPC, a major difference being TRON is schemaless.


My threshold for “does not need to be smaller” is “can this run on a Raspberry Pi”. This is a helpful benchmark for maximum likely useful optimization.

A Pi has 4 cores and 16GB of memory these days, so running Qwen3 4B on a Pi is pretty comfortable: https://leebutterman.com/2025/11/01/prompt-optimization-on-a...


Happy to answer any questions you have :)


Curious about comparisons with Apache Arrow, which uses FlatBuffers to avoid memory copying during deserialization, which is well supported by the Pandas ecosystem, and which allows users to serialize arrays as lists of numbers that have hardware support from a GPU (int8-64, float)


Apache Arrow is more of a memory format than a general‑purpose data serialization system. It’s great for in‑memory analytics and GPU‑friendly columnar storage.

Apache Fory, on the other hand, has its own wire‑stream format designed for sending data across processes or networks. Most of the code is focused on efficiently converting in‑memory objects into that stream format (and back) — with features like cross‑language support, circular reference handling, and schema evolution.

Fory also has a row format, which is a memory format, and can complement or compete with Arrow’s columnar format depending on the use case.


fast.ai (some of the authors of this) was transformative for me, and the community was super nice. Cannot recommend looking into this highly enough.


This is halfbakery! I love it!

(For example, a recent half baked idea there is a perpetually burning flag. https://www.halfbakery.com/idea/Perpetually_20Burning_20Flag... )

