nyrikki's comments | Hacker News

The automation bias issue is something that has been raised by many people, myself included, but mostly ignored. The better models get, the worse that problem will get, but IMHO the implications of the claims are not on the code-generation side.

The sandwich story in the model card is the bigger issue.

LLMs have always been good at finding a needle in a haystack, if not a specific needle; it sounds like they are claiming a dramatic increase in that ability.

This will dramatically change how we write and deliver software, which has traditionally been based on the idea of well-behaved, non-malfeasant software with a fix-as-you-go security model.

While I personally find value in the tools as tools, they specifically find a needle and fundamentally cannot find all of the needles that are relevant.

We will either have to move to some form of zero trust model or dramatically reduce connectivity and move to much stronger forms of isolation.

As someone who was trying to document and share a way of improving container isolation that is compatible with current practices, I think I need to revisit that.

VMs are probably a minimum requirement for my use case now, and if verified this new model will dramatically impact developer productivity due to increased constraints.

Due to competing use cases and design choice constraints, none of the namespace based solutions will be safe if even trusted partners start to use this model.

How this lands in the long run is unclear, perhaps we only allow smaller models with less impact on velocity and with less essential complexity etc…

But the ITS-style trust model of open sockets, etc. will probably be dead for production instances.

I hope this is marketing or aspirational to be honest. It isn’t AGI but will still be disruptive if even close to reality.


Came here to say this

X-Accel-Redirect (Nginx's sendfile offload), if supported by your Haskell framework, is the way to go: it is zero-copy and will dramatically help in many cases.

Modifying the response body is one of the cases where it doesn't work.


XFS, ext4, btrfs, etc. all support sparse files, so any app can cause problems. You can try it with:

    dd if=/dev/zero of=sparse_file.img bs=1M count=0 seek=1024

If you add conv=sparse to the dd command with a smaller block size, it will sparsify what you copy too; use the wrong cp flags and sparse files will explode to their full size.

A much harder problem to deal with than the filesystem layers, because the apparent size from stat usually won't match the blocks actually allocated.


Creating sparse files requires the application to purposefully use special calls like fallocate() or seek beyond EOF, like dd with conv=sparse does. You won't accidentally create a sparse file just by filling a file with zeros.

It is an observability issue; even Zabbix tracked reserved space and inodes 20 years ago.

With dedupe, compression, and sparse files you simply can't track utilization from the client's view, which is what du gives you.

The concrete implementation is what matters and, as this case demonstrates, what you should alert on.

Inodes, blocks, extents, etc. are what matter, not the user's view of data size.

Even with rrdtool you could set reasonable alerts, but the heuristics of someone exploding a sparse file with a non-sparse copy make that harder.

rsync over ssh, etc. will do that by default.


> The agent inside the container runs with bypassPermissions — it can use Bash, write files, do whatever it wants. But "whatever it wants" is constrained by what the OS lets it see. No application-level permission checks needed.

While containers can be useful for reducing privileges, that assumption isn't safe. Remember that namespaces only isolate what supports namespacing, and that by themselves namespaces are not security features.

A super-critical part I either didn't see or missed is the importance of changing the UID; the last line of [0] shows one reason.

Remember that the container user has elevated privileges unless you explicitly drop those privileges.

I applaud the effort at hardening, but containers have mostly been successful because the most popular apps, like nginx, operate as they would under a traditional co-hosting system and take responsibility for dropping privileges.

There are tons of kernel calls, LD_PRELOAD tricks, etc. that are well known and easy to find with exploration.

Even dropping elevated privileges and setting no_new_privs still isn't a jail.

Without using separate UIDs don’t expect any real separation at all.

[0] https://www.kernel.org/doc/html/latest/admin-guide/namespace...


The blog post linked in the OP explains this better, IMHO [0]:

   If the data stream encodes values with byte order B, then the algorithm to decode the value on computer with byte order C should be about B, not about the relationship between B and C.
One cannot just ignore the big/little-endian data interchange problem: macOS[1], Java, TCP/IP, JPEG, etc. all run into it.

The point (for me) is not that your code runs on an s390; it is that you abstract your local implementation details away from the data interchange formats. And unfortunately almost all of the processors are little-endian, while many of the popular and unavoidable externalization formats are big-endian...

[0] https://commandcenter.blogspot.com/2012/04/byte-order-fallac...

[1] https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADE...


To cope with data interchange formats, you need a set of big-endian data types: for each kind of signed or unsigned integer of 16 bits or more you must have a big-endian variant, identified e.g. with a "_be" suffix.

Most CPUs (including x86-64) have variants of the load and store instructions that reverse the byte order (e.g. MOVBE in x86-64). The remaining CPUs have byte reversal instructions for registers, so a reversed byte order load or store can be simulated by a sequence of 2 instructions.

So the little-endian types and the big-endian data types must be handled identically by a compiler, except that the load and store instructions use different encodings.

The structures used in a data-exchange format must be declared with the correct types and that should take care of everything.

Any decent programming language must provide means for the user to define such data types, when they are not provided by the base language.

The traditional UNIX conversion functions are the wrong way to handle endianness differences. An optimizing compiler must be able to recognize them as special cases in order to be able to optimize them away from the machine code.

A program that is written using only data types with known endianness can be compiled for either little-endian targets or big-endian targets and it will work identically.

All the problems that have ever existed in handling endianness have been caused by programming languages where the endianness of the base data types was left undefined, for fear that recompiling a program for a target of different endianness could result in a slower program.

This fear is obsolete today.


Having different types seems wrong to me because endianness issues disappear after serialization, so it would make more sense to slap an annotation on the data field so just the serializer knows how to load/store it.

Nah, that's a terrible way to handle endian-ness. Your "big endian" types infect your entire program. And you pay a cost with every computation you do with them.

Just treat the data on disk / on the wire as if it were in some encoded format. Parse on load. Encode back out to the expected format when you save it. Within your program, just use your language's native int formats.

For example, in C I use something like this:

    uint32_t read_be_u32(uint8_t data[4]) {
        return ((uint32_t)data[0] << 24) |
            ((uint32_t)data[1] << 16) |
            ((uint32_t)data[2] << 8)  |
            ((uint32_t)data[3]);
    }
... And the equivalent for little endian data. Modern optimizers will happily turn that into the right instructions - either a noop or bswap - as appropriate depending on the target architecture.

You can do the same thing in Rust, Go, or any other language. No special type definitions or macros necessary.

https://godbolt.org/z/746EaYx4r
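As a sketch of "the equivalent for little endian data" mentioned above, plus the encoder going back out (same pattern, shifts reversed):

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer; the
   compiler turns this into a plain load or a bswap as needed. */
uint32_t read_le_u32(const uint8_t data[4]) {
    return ((uint32_t)data[0])       |
           ((uint32_t)data[1] << 8)  |
           ((uint32_t)data[2] << 16) |
           ((uint32_t)data[3] << 24);
}

/* Encode back out to big-endian when saving or sending. */
void write_be_u32(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)(v);
}
```

Because the shifts fully define the byte positions, these functions behave identically on little- and big-endian hosts.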


MacOS "was" big-endian due to 68k and later PPC CPUs (the PPC Macs could have been little-endian, but Apple picked big for convenience and porting).

Their x86 changeover moved the CPUs to little-endian, and AArch64 solidifies that tradition.

Same with Java: there's probably a strong influence from SPARC, and with PPC, 68k, and SPARC all being relevant back in the 90s it wasn't a bold choice.

But all of this is more or less legacy at this point. I have little reason to believe that the kinds of code I write will ever end up on an s390 or any other big-endian platform unless something truly revolutionizes the computing landscape, since x86, AArch64, RISC-V, and so on all run little-endian now.


Can you please explain the use here? I tried the demo, and cat, cp, echo, etc... seem to do the exact same thing without the cost.

Their demo even says:

   `Paste any code or text below. Our model will produce an AI-generated, byte-for-byte identical output.`

Unless this is a parody site can you explain what I am missing here?

Token echoing isn't even at the lexeme/pattern level, and not even close to WSD, Ogden's lemma, symbol grounding, etc...

The intentionally 'probably approximately correct' statistical learning framework fundamentally limits reproducibility for PAC/statistical methods like transformers.

CFG inherent ambiguity == post correspondence problem == halting == open-domain frame problem == system identification problem == symbol-grounding problem == Entscheidungsproblem

The only way to get around that is to construct a grammar that isn't ambiguous, and that will never exist for CFGs, programs, types, etc. with arbitrary input.

I just don't see why placing a `14-billion parameter identity transformer` that basically just echoes tokens is a step forward on what makes the problem hard.

Please help me understand.


It's satire - just see the About page.

I started using the ports-and-adapters pattern, with protocols, for any packages that have replacements or concerns.

Basically treating HTTP requests as an orthogonal, or cross-cutting concern.

It is sometimes hard to tell if these upstream packages are stable or abandoned.

I should probably document my methodology so it can help others or at least have the chance to find out what mistakes or limitations they might have.


That’s why I either started with the ports and adapters pattern or quickly refactored into it on spikes.

You don't have to choose which flavor of DDD/Clean/… you want to drink; just use some method that keeps domains and use cases separate from implementation.

Just with shapes and domain level tests, the first pass on a spec is easier (at least for me) and I also found feedback was better.

I am sure there are other patterns that do the same, but the trick is to let the problem domain drive, not to choose any particular set of rules.

Keeping the core domain as a fixed point does that for me.


For binary operations, NaN values compare as unordered.

The IEEE 754 specification requires that >, <, and = all evaluate to false when either operand is NaN.

Saying that two incomparable objects become comparable, let alone equal, would break things.

We use specific exponents and significands to represent NaNs but they have no numerical meaning.

I am actually surprised Python got this correct; NaN behavior is often implemented incorrectly out of convenience and causes lots of issues and side effects.


Even with classic compilation, it is only the semantic behavior that is preserved.

The Church–Rosser property (confluence) in term rewriting and the lambda calculus is one possible lens.

To have a formally verified spec, one has to use some decidable fragment of first-order logic.

If you try to replace code generation with rewriting things can get complicated fast.[2]

Rust, as an example, uses affine types, and people try to add Petri nets[0], but in general Petri-net reachability is Ackermann-complete [1].

It is just the trade-off of using a context-free-like system such as an LLM with natural language.

HoTT, and how dependent types tend to break "isomorphic ≃ equal", is another possible lens.

[0] https://arxiv.org/abs/2212.02754v3

[1] https://arxiv.org/abs/2212.02754v3

[2] https://arxiv.org/abs/2407.20822


First, it's not a question of decidability but of tractability. Verifying programs in a language that has nothing but boolean variables, no subroutines, and loops of depth at most 2 - far, far from Turing-completeness - is already intractable (reduction from TQBF).

Second, it's very easy to have some specs decided tractably, at least in many practical instances, but they are far too weak to specify most correctness properties programs need. You mentioned the Rust type system, and it cannot specify properties with interleaved quantifiers, which most interesting properties require.

And as for HoTT - or any of the many equivalent rich formalisms - checking their proofs is tractable, but not finding them. The intractability of verification of even very limited languages (again TQBF) holds regardless of how the verification is done.

I think it's best to take it step by step, and CodeSpeak's approach is pragmatic.


I think there is a bit of a map-territory relation here.

> First, it's not a question of decidability but of tractability

The question of decidability is a form of many-one reduction; in fact, RE-complete is defined via many-one reductions.

In a computational-complexity sense, tractability is a far stronger notion. Basically, an algorithm is efficient if its time complexity is at most PTIME for any size-n input, and a problem is "tractable" if there is an efficient algorithm that solves it.

You are correct if you limit your expressiveness to PTIME, where, because P == co-P, PEM/tight apartness/omniscience principles hold.

But the problem is that the Church–Rosser property[0] (proofs ~= programs) and the Brouwer–Heyting–Kolmogorov interpretation[1] (propositions as types) are NOT binary SAT, and you have concepts like mere propositions[3] that are very different from just BSAT.

But CodeSpeak doesn't have formal specifications, so this is irrelevant. Their example output produced code with path-traversal/resource-exhaustion risks and correctness issues, which is a case in point.

My personal opinion is that we will need to work within the limitations of the systems, and while it is trivial to come up with your own canary, I would recommend playing with [3] before the models directly target it.

Generating new code from a changed spec will be less difficult, specifically when the mess of real-world specs comes into play. You can play with the example on CodeSpeak's front page, trying to close the various holes the software has with malformed/malicious input while giving the LLM the existing code base, and you will see that "brown M&M"[3] problem arise quickly. At least for me, prompting it to look at the changed natural-language spec and generate new code was more successful.

But for some models, like Qwen3 Coder Next, that style resulted in far fewer happy-path protections, which that model seems to have been trained to deliver by default in some cases.

[0] https://calhoun.nps.edu/entities/publication/015f1bab-6642-4...

[1] https://www.cs.cornell.edu/courses/cs6110/2017sp/lectures/le...

[2] https://www.cambridge.org/core/journals/journal-of-functiona...

[3] https://codemanship.wordpress.com/2025/10/03/llms-context-wi...


> But the problem is that the Church–Rosser property[0] (proofs ~= programs) and the Brouwer–Heyting–Kolmogorov interpretation[1] (propositions as types) are NOT binary SAT, and you have concepts like mere propositions[3] that are very different from just BSAT.

The computational limits imposed on program verification are independent of the logical theory used and depend only on its expressive strength. Many if not most interesting program properties require interleaved quantifiers (forall-exists, or forall-exists-forall, etc.), which are intractable to verify.

> But CodeSpeak doesn't have formal specifications, so this is irrelevant.

The lack of formalism also doesn't matter to the limitations on correctness. If you wish to know, with certainty, that a program satisfies some property, that knowledge has a cost. But both humans and LLMs write programs at least in part based on inductive rather than deductive reasoning, which cannot be rigorously given a level of confidence. That may not be what we want, but that's what computational complexity says we can have, so we're forced to work in this way.

We should complain about what we don't have, but demanding things we can't have isn't going to help. There's no fundamental reason why AI shouldn't, someday, be able to program as well as humans, but there are fundamental limitations to producing the software we wish we had. Humans can rigorously use deductive methods, with mechanical help, and AI could possibly use it, too. But there's no reason to believe AI could break the size barrier of such problems. People have been able to rigorously verify only very small programs, and maybe AI could do a little better if only because of its tenacity, but if we expect to produce perfect programs by any means, we'll be waiting for a long time.

