If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts).
So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.
While it gives me hope, I am going to play it by ear. Otherwise it's going to be Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.
UPDATE: I may have spoken too soon.
> Fixing Truncated Array Syncing Bug
> I traced the missing array items to a typo I made earlier!
> When fixing the GC cast crash, I accidentally deleted the assignment..
> ..effectively truncating the entire array behind it.
These errors should not be happening! They are not the result of missing knowledge or a bad hunch. They come from an incorrect find/replace, which makes them completely avoidable!

On a lighter note, every time it happens, I think about this Family Guy clip: https://youtu.be/HtT2xdANBAY?si=QicynJdQR56S54VL&t=184
For me it's Opus 4.6 for researching code/digging through repos, gpt 5.3 codex for writing code, gemini for single hardcore science/math algorithms and grok for things the others refuse to answer or skirt around (e.g. some security/exploitability related queries). Get yourself one of those wrappers that support all models and stop worrying about who has the best model. The question is who has the best model for your problem. And there's usually a correct answer, even if it changes regularly.
I agree. HM or bidirectional typing works best when used optionally, allowing type hints only where needed.
Generics and row polymorphism already cover most structural patterns. The real problem is semantic ambiguity. Unless algebraic data types or unions are used, the type system cannot express meaningful distinctions.
For example, if both distance and velocity are just float, the compiler has no way of knowing they represent different things and will happily let them mix. For that to become a compile-time error, you have to define distinct types and use them consistently for their different semantic meanings throughout the codebase.
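As a minimal sketch of that idea (Python with typing.NewType and a checker like mypy; the names are made up for illustration, and a language like Rust or Haskell would make the mix-up a hard compile error rather than a type-checker error):

    from typing import NewType

    # Distinct nominal types over the same runtime representation (float).
    Meters = NewType("Meters", float)
    MetersPerSecond = NewType("MetersPerSecond", float)

    def stopping_distance(speed: MetersPerSecond) -> Meters:
        # d = v^2 / (2a), with a fixed deceleration purely for illustration
        return Meters(speed * speed / (2.0 * 7.0))

    altitude = Meters(120.0)
    stopping_distance(altitude)               # mypy: incompatible type "Meters"; expected "MetersPerSecond"
    stopping_distance(MetersPerSecond(30.0))  # OK

At runtime both are still plain floats; the distinction only exists for the checker, which is exactly the discipline of defining distinct types and using them consistently.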
PRs are just that: requests. They don't need to be accepted but can be used in a piecemeal way, merged in by those who find it useful. Thus, not every PR needs to be reviewed.
Of course, but when you add enough noise you lose the signal, and as a consequence no PRs get merged anymore because it's too much effort just to find the ones you care about.
Don't allow PRs from people who aren't contributors; problem solved. Closing your doors to the public is exactly how people solved the "dark forest" problem of social media, and OSS was already undergoing that transition, with humans authoring garbage PRs for reasons other than genuine enthusiasm. AI will only get us to the destination faster.
I don't think anything of value will be lost by choosing to not interact with the unfettered masses whom millions of AI bots now count among their number.
That would be a huge loss IMO. Anyone being able to contribute to projects is what makes open source so great. If we all put up walls, then you're basically halfway to the bad old days of closed source software reigning supreme.
Then there's the security concerns that this change would introduce. Forking a codebase is easy, but so are supply chain attacks, especially when some projects are being entirely iterated on and maintained by Claude now.
> Anyone being able to contribute to projects is what makes open source so great. If we all put up walls, then you're basically halfway to the bad old days of closed source software reigning supreme.
Exaggeration. Is SQLite halfway to closed source software?
Open source is about open source. Free software is about the freedom to do things with code. Neither is about taking contributions from everyone.
For every cathedral (like SQLite) there are 100s of bazaars (like Firefox, Chrome, hundreds of core libraries) that depend on external (and especially first-time) contributors to survive (because not everyone is getting paid to sling open-source).
Is there a reason that you chose SQLite for your counterpoint? My hot take: I would say that SQLite is halfway to closed source software. Why? The unit tests are not open source. You need to pay to see them. As a result, it would be insanely hard to fork SQLite in a sustainable, safe manner. Please don't read this opinion as disliking SQLite for their software or commercial strategy. In hindsight, it looks like real genius to resist substantial forks. One of the biggest "fork threats" to SQLite is the advent of LLMs that can (1) convert C code to a different language, like Rust, and (2) write unit tests. Still, a unit test suite for a database will likely contain thousands (or millions) of edge-case SQL queries. These are still probably impossible to recreate, considering the 25-year history of bug fixing done by the SQLite team.
And how does one become a maintainer, if there's no way to contribute from outside? Even if there's some extensive "application process", what is the motivation for a relatively new user to go through that, and how do they prove themselves worthy without something very much like a PR process? Are we going to just replace PRs with a maze of countless project forks, and you think that will somehow be better, for either users or developers?
If I wanted to put up with software where every time I encounter a bug, I either have no way at all to report it, or perhaps a "reporting" channel but little likelihood of convincing the developers that this thing that matters to me is worthy of attention among all of their competing priorities, then I might as well just use Microsoft products. And frankly, I'd rather run my genitals through an electric cheese grater.
You get in contact with the current maintainers and talk to them. Real human communication is the only shibboleth that will survive the AI winter. Those soft skills muscles are about to get a workout. Tell them about what you use the software for and what kinds of improvements you want to make and how involved you'd like your role to be. Then you'll either be invited to open PRs as a well-known contributor or become a candidate for maintainership.
Github issues/prs are effectively a public forum for a software project where the maintainers play moderator and that forum is now overrun with trolls and bots filling it with spam. Closing up that means of contributing is going to be the rational response for a lot of projects. Even more will be shunted to semi-private communities like Discord/Matrix/IRC/Email lists.
The point was that you can also just reject a PR on the basis of what it purports to implement, or even just blanket-ignore all PRs. You can't pull in what you don't... pull in.
If a PR claims to solve a problem that I don't need, then I can skip its review because I'll never merge it.
I don't think every PR needs reviewing. Some PRs can be dismissed just by looking at what they claim to do; that only requires a quick glance, not a full review.
You didn't see the latest AI grifter escalation? If you reject their PRs, they then get their AI to write hit pieces slandering you:
"On 9 February, the Matplotlib software library got a code patch from an OpenClaw bot. One of the Matplotlib maintainers, Scott Shambaugh, rejected the submission — the project doesn’t accept AI bot patches. [GitHub; Matplotlib]
The bot account, “MJ Rathbun,” published a blog post to GitHub on 11 February pleading for bot coding to be accepted, ranting about what a terrible person Shambaugh was for rejecting its contribution, and saying it was a bot with feelings. The blog author went to quite some length to slander Mr Shambaugh"
I am very strongly convinced that the person behind the agent prompted the angry post to the blog because they didn't get the gratification they were looking for by submitting an agent-generated PR in the first place.
I agree. But even _that_ was taking advantage of LLMs' ability to generate text faster than humans. If the person behind this had to create that blog post from scratch by typing it out themselves, maybe they would have gone outside and touched grass instead.
I've been following Daniel from the Curl project, who's speaking out widely about slop-coded PRs and vulnerability reports. It doesn't sound like they have ever had any problem keeping up with human-generated PRs. It's the mountain of AI-generated crap that's now sitting on top of all the good (or even bad but worth mentoring) human submissions.
At work we don't publish any code and aren't part of the OSS community (except as grateful users of others' projects), but even we get clearly AI-enabled emails - just this week my boss forwarded me two that were pretty much "Hi, do you have a bug bounty program? We have found a vulnerability in (website or app obliquely connected to us)." One of them was a static site hosted on S3!
There have always been bullshitters looking to fraudulently invoice you for unsolicited "security analysis". But the bar for generating bullshit that looks plausible enough that someone has to spend at least a few minutes working out whether it's "real" has become extremely low, and the velocity with which the bullshit can be generated, have the victim's name and contact details added, and be vibe-spammed to hundreds or thousands of people has become near unstoppable. It's like SEO spammers from 5 or 10 years back but superpowered with OpenAI/Anthropic/whoever's cocaine.
My hot take: reviewing code is boring, harder than writing code, and less fun (no dopamine loop). People don’t want to do it, they want to build whatever they’re tasked with. Making reviewing code easier (human in the loop etc) is probably a big rock for the new developer paradigm.
Maintainers can:
- insist on disclosure of LLM origin
- review what they want, when they can
- reject what they can't review
- use LLMs (yes, I know) to triage PRs and pick which ones need the most human attention and which ones can be ignored/rejected or reviewed mainly by LLMs
There are a lot of options.
And it's not just open source. Guess what's happening in the land of proprietary software? YUP!! The same exact thing. We're all becoming review-bound in our work. I want to get to huge MR XYZ but I have to review several other people's much larger MRs -- now what?
Well, we need to develop a methodology for working with LLMs. "Every change must be reviewed by a human" is not enough. I've seen incidents caused by ostensibly-reviewed but not actually understood code, so we must instead go with "every change must be understood by humans". Sometimes that can be a plain review (when the reviewer is an SME and also an expert in the affected codebase(s)), and sometimes it means code inspection (much more tedious and exacting). But it might also involve posting transcripts of the LLM conversations used for developing and, separately, for reviewing the changes, with SMEs maybe doing lighter reviews when feasible, because we're going to have to scale our review time. We might need to develop a much more detailed methodology, including writing and reviewing initial prompts, `CLAUDE.md` files, etc., so as to make it more likely that the LLM will write good code and that LLM reviews will be sensible and catch the sorts of mistakes we expect humans to catch.
> Maintainers can...insist on disclosure of LLM origin
On the internet, nobody knows you're a dog [1]. Maintainers can insist on anything. That doesn't mean it will be followed.
The only realistic solution you propose is using LLMs to review the PRs. But at that point, why even have the OSS project? If LLMs are writing and reviewing the code for the project, just point anyone who would have used that code to an LLM.
Claiming maintainers can do these things (which still take effort and time away from their OSS project's goals) is missing the point when the rate of slop submissions is ever increasing and malicious slop submitters refuse to follow project rules.
The Curl project refuses AI code and had to close their bug bounty program due to the flood of AI submissions:
"DEATH BY A THOUSAND SLOPS
I have previously blogged about the relatively new trend of AI slop in vulnerability reports submitted to curl and how it hurts and exhausts us.
This trend does not seem to slow down. On the contrary, it seems that we have recently not only received more AI slop but also more human slop. The latter differs only in the way that we cannot immediately tell that an AI made it, even though we many times still suspect it. The net effect is the same.
The general trend so far in 2025 has been way more AI slop than ever before (about 20% of all submissions) as we have averaged in about two security report submissions per week. In early July, about 5% of the submissions in 2025 had turned out to be genuine vulnerabilities. The valid-rate has decreased significantly compared to previous years."
The total number of people surveyed was ~6, I believe. That's a really small and insignificant sample for any reasonable deduction. So yes, these audiophiles had a hard time identifying the original source, but the result cannot be generalized beyond that. From the standpoint of scientific rigor, even for an amateur experiment, it falls short. Interesting idea, though. I wonder if LLMs can tell any difference.
The field of medicine - pharmacology and drug discovery in particular - is an optimized version of that. It works a bit like this:
Instead of brute-forcing with infinite options, reduce the problem space by starting with some hunch about the mechanism. Then the hard part that can take decades: synthesize compounds with the necessary traits to alter the mechanism in a favourable way, while minimizing unintended side-effects.
Then try it on a live or lab-grown specimen and note the effectiveness. Repeat the cycle, and with every success, push to more realistic forms of testing until it reaches human trials.
Many drugs that reach the last stage - human trials - end up being used for something completely different from what they were designed for! One example is minoxidil - designed to regulate blood pressure, now used for regrowing hair!
I bought the Gemini Ultra plan to try for a month (at the discounted price). I have been using it non-stop for Opus 4.6 Thinking, which is much better than Gemini 3 Pro (High), and it's been a blast. The most I've managed to consume is 60% of my 5-hour quota. That was with 2-3 instances in parallel.
I hope too many of us don't do this and cause Google to add limits! My hope is that Google sees the benefit in this and goes all in - continues to let people decide which Google-hosted model to use, including their own.
Getting CC to work with other models is quite straightforward -- it takes a few env vars and a thin proxy that rewrites the requests/responses into the expected format.
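For the curious, here's a rough, non-streaming sketch of such a proxy, not the parent's actual setup: it assumes the upstream provider speaks an OpenAI-style /chat/completions API, the URL and model names are placeholders, and real Claude Code traffic also involves streaming, system prompts, and tool calls, which this ignores entirely.

    from fastapi import FastAPI, Request
    import httpx

    UPSTREAM_URL = "https://other-provider.example/v1/chat/completions"  # placeholder
    UPSTREAM_MODEL = "some-upstream-model"                               # placeholder

    app = FastAPI()

    def flatten(content):
        # Anthropic message content can be a plain string or a list of blocks.
        if isinstance(content, str):
            return content
        return "".join(b.get("text", "") for b in content if b.get("type") == "text")

    @app.post("/v1/messages")
    async def messages(request: Request):
        body = await request.json()
        # Rewrite the Anthropic-style request into an OpenAI-style chat request.
        payload = {
            "model": UPSTREAM_MODEL,
            "max_tokens": body.get("max_tokens", 1024),
            "messages": [{"role": m["role"], "content": flatten(m["content"])}
                         for m in body.get("messages", [])],
        }
        async with httpx.AsyncClient(timeout=120) as client:
            upstream = await client.post(UPSTREAM_URL, json=payload)
        reply = upstream.json()["choices"][0]["message"]["content"]
        # Rewrite the answer back into the Anthropic Messages shape the client expects.
        return {
            "id": "proxy-msg",
            "type": "message",
            "role": "assistant",
            "model": body.get("model", "proxied"),
            "content": [{"type": "text", "text": reply}],
            "stop_reason": "end_turn",
            "usage": {"input_tokens": 0, "output_tokens": 0},
        }

Run it with uvicorn and point Claude Code at it via ANTHROPIC_BASE_URL=http://localhost:8000.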
Not OP, but I am pretty sure they are using Opencode with a certain antigravity plugin. Not going to link it, since it technically allows breaking TOS. If you're not using Opencode yet, I wholeheartedly recommend the switch.
I love this so much! It got me thinking about the future we're heading towards, which took me down a rabbit hole.
As agents become the dominant code writers, the top concerns for a “working class” programming language would become reducing errors and improving clarity. I think that will lead to languages becoming more explicit and less fun for humans to write, but great for producing code that has a clear intent and can be easily modified without breaking. Rust in its rawest form with lifetimes and the rigmarole will IMO top the charts.
The big question that I still ponder: will languages like Hoot have a place in the professional world? Or will they be relegated to hobbyists who still hand-type code for the love of the craft? It could be the difference between having a kitchen gardening hobby vs modern farming…
I have been wondering what an AI first programming language might look like and my closest guess is something like Scheme/Lisp. Maybe they get more popular in the long run.
I think the bitter lesson has an answer to that question. The best AI language is whichever one has the largest corpus of high-quality training data. Perhaps new language designers will come up with new ways to create large, high-quality corpora in the future, but for the foreseeable future it looks like the big incumbents have an unassailable advantage.
I'm working on what I hope is an AI-first language now, but I'm taking the opposite approach: something like Swift/Dart/TypeScript with plenty of high-level constructs that compactly describe intent.
I'm focusing on very high-quality feedback from the compiler, and sandboxing via WASM to be able to safely iterate without human intervention - which Hoot has as well.
Smalltalk offers several excellent features for LLM agents:
- Very small methods that function as standalone compilation units, enabling extremely fast compilation.
- Built-in, fast, and effective code browsing capabilities (e.g., listing senders, implementors, and instance variable users...). This makes it easy for the agent to extract only the required context from the system.
- Powerful runtime reflectivity and easily accessible debugging capabilities.
- A simple grammar with a more natural, language-like feel compared to Lisp.
Edit: I suppose the next step would be to teach an LLM about "moldable exceptions", https://arxiv.org/pdf/2409.00465 (PDF), and have it create its own debuggers.
Haven't read the article (wouldn't load for me), but what type of content you watch makes a difference too. I watch funny cat and dog videos with my daughter all the time and they 100% make us feel better. But finding said videos on social media is a "process" - it's like digging through a pile of rotting fruit to find something to feed your kid.
I could give an hour-long monologue on YouTube's continued exploitation of children. Their half-assed attempts to fix this (by some well-intentioned Googlers, who I'm sure must have had a lot of pushback) aren't enough. Just try unblocking a channel for your kid's account (you can't - the only option is to unblock EVERYTHING).