Like others have mentioned, I think the premise of looking at the most popular few projects (pypi.org currently lists 771,120 projects) on pypi as any sort of proxy for AI coding is terribly misguided/unrepresentative and that almost no one is going to be packaging up their vibe-coded projects for distribution on pypi.
That being said, I've personally put 3 up recently (more than I've published in total). I'm sure they have close to zero downloads (why would they? they're brand new, solve my own problems, and I'm not interested in marketing or supporting them; they're just shared because they might be useful to others), so they wouldn't show up in their review. 2 of these are pretty meaty projects that would have taken weeks if not months of work but were instead largely built over a weekend or a few days. I'd say it's not just the speed: without the lowered effort, these projects would never have crossed the effort/need bar of being started at all.
I've probably coded 50-100X more AI-assisted code that will never go to pypi, even as someone that has released pypi packages before (which already puts me in a tiny minority of programmers, much less regular people that would even think about uploading a pypi project).
For those interested in the scope of the recent projects:
https://pypi.org/project/realitycheck/ - first pypi: Jan 21 - 57K SLoC - "weekend" project that kept growing. It's a framework that leverages agentic coding tools like Codex/Claude Code to do rigorous, systematic analysis of claims, sources, predictions, and argument chains. It has 400+ tests, and does basically everything I want it to do now. The repo has 20 stars and I'd estimate only a handful of people are using it.
https://pypi.org/project/tweetxvault/ - first pypi: Mar 16 - 29K SLoC - another weekend project (followup on a second weekend). This project is a tool for archiving your Twitter/X bookmarks, likes, and tweets into a local db, with support for importing from archives and letting you search through them. I actually found 3 or 4 other AI-coded projects that didn't do quite what I wanted, so I built my own. This repo has 4 stars, although a friend submitted a PR and mentioned it solved exactly their problem and saved them from having to build it themselves, so that was nice and justifies publishing for me.
https://pypi.org/project/batterylog/ - first pypi: Mar 22 - 857 SLoC - this project is actually something I wrote (and have been using daily) 3-4 years ago but never properly packaged up - it tracks how much battery your laptop drains while asleep, and it's basically the bare minimum script/installer to be useful. I never bothered to package it up b/c quite frankly, manual pypi releases are enough of a PITA to not bother, but LLMs now basically make it a matter of saying "cut a release," so when I wanted to add a new feature, I packaged it up as well, which I would never have done otherwise. This repo has 42 stars and a few forks, although probably 0 downloads from pypi.
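For anyone curious what "cut a release" actually covers, the manual steps boil down to a build plus an upload. A minimal sketch of the commands an agent would run, using the standard `build` and `twine` tools (the `dist/` layout is the default those tools use; nothing here is specific to the projects above):

```python
# Sketch: the two shell commands behind a manual pypi release,
# expressed as argv lists (e.g. for subprocess.run). Assumes a
# project with a pyproject.toml and credentials already configured.
def release_commands(dist_dir: str = "dist") -> list[list[str]]:
    return [
        ["python", "-m", "build"],             # build sdist + wheel into dist/
        ["twine", "upload", f"{dist_dir}/*"],  # push the artifacts to pypi
    ]

for cmd in release_commands():
    print(" ".join(cmd))
```

An agent mostly adds the tedious parts around these two lines: bumping the version, updating the changelog, and tagging.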
(I've spent the past couple years heavily using AI-assisted workflows, and only in the past few months (post Opus 4.6, GPT-5.2) would I have even considered AI tools reliable enough to consider trusting them to push new packages to pypi.)
Funny that you mention multi-monitor, since it's one of the reasons I eventually moved to Wayland. The only way to support different-DPI monitors in X was janky scaling or even jankier multiple X servers.
I don't use KDE (or GNOME anymore) but while I had to deal with a lot of initial speedbumps a couple years ago, these days instead of a full DE, I'm using a Niri setup and it's worked out great for me.
For my laptop, I have my own monitor-detection/wl-mirror script for example that is faster and more reliable for plugging into projectors/meeting room HDMI than even my old Macs.
The funny thing about this myth is that wayland does not even try to support mixed-DPI setups; the only thing it supports is, as you put it, janky scaling. Not that X is any better in the end, but at least it exposes the data if any application wants to try to do correct mixed DPI (nobody does).
So in yet another case of worse-is-better, wayland has the reputation of supporting mixed-DPI environments not because it has any support for actual mixed DPI, but because it is better at faking it (fractional scaling).
Myth or not - it is absolutely much better on wayland. I really don't care to (or know how to) tweak linux, so I've been running stock Fedora installs for years. I also have 4 screens. When Fedora switched to wayland it got much better, and it keeps getting better.
I use a docked ThinkPad with the lid closed and two external monitors. Here are my config bits.
set $laptop eDP-1
set $landscape 'Hewlett Packard HP ZR24w CNT037144C'
set $portrait 'Hewlett Packard HP ZR24w CNT03512JN'
bindswitch --reload --locked lid:on output $laptop disable
bindswitch --reload --locked lid:off output $laptop enable
### Output configuration
output $laptop bg $HOME/pictures/wallpaper/1529004448340.jpg fill
output $landscape bg $HOME/pictures/wallpaper/1529004448340.jpg fill
output $portrait bg $HOME/pictures/wallpaper/portrait/DYabJ0FV4AACG69.jpg fill
# pos args are x coords and y coords, transform is degrees of rotation counter-clockwise
# set $portrait as left monitor and rotate it counterclockwise
output $portrait pos 0 1200 transform 270
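The long identifier strings in the config above ("Make Model Serial") are what sway reports per output; `swaymsg -t get_outputs` dumps them as JSON. A quick Python sketch of pulling them out, using an abridged sample payload rather than a live `swaymsg` call (the real JSON has many more fields):

```python
import json

# Abridged sample in the shape returned by `swaymsg -t get_outputs`;
# only the fields used below are included.
SAMPLE = """
[
  {"name": "eDP-1", "make": "Unknown", "model": "Built-in", "serial": "0x0000"},
  {"name": "DP-3", "make": "Hewlett Packard", "model": "HP ZR24w", "serial": "CNT037144C"}
]
"""

def output_identifiers(payload: str) -> list[str]:
    # Build the 'Make Model Serial' strings sway accepts in `output` config.
    return [f'{o["make"]} {o["model"]} {o["serial"]}' for o in json.loads(payload)]

for ident in output_identifiers(SAMPLE):
    print(ident)
```

Matching on the serial-bearing identifier instead of the connector name (DP-3 etc.) is what keeps the config stable when monitors move between ports.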
I am not a theoretical CS or math expert by any means, but I have been wrangling coding agents for a while, and after reading the paper and seeing the problems Stapper had dealing w/ Claude (context management, instruction following, etc), I decided to see if I could replicate it with a slightly better harness. The results were pretty interesting: https://github.com/lhl/claudecycles-revisited
- My original setup left traces of the PDF paper and after GPT 5.3-Codex xhigh reached an impasse it went looking for it and found it!
- I went and did cleanroom (basically one-shot) passes for GPT 5.2 xhigh, GPT 5.3-Codex xhigh, and Claude Opus 4.6 ultrathink: 5.2/5.3 found alternate solutions for odd m >= 5, while Opus 4.6 did not find any proofs but tried more approaches to solving.
I've also included the session traces and analysis in the repo branches. The AGENTS.md was pretty simple, but that harness produced consistent process outcomes across all three models.
I was a bit interested to do a replication and see if a better harness could avoid some of the problems they ran into w/ context management, poor instruction following, etc, and it looks like yes, it's definitely possible.
I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md - I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.
As described in the readme of your repo (did you read it?) your agent found the Knuth paper located one directory level above its working directory.
So, you didn't produce a replication in 47 minutes, it just took around 30 minutes for your agent to find that you had the answer in a PDF in a nearby directory.
I wonder how common of a problem this will be in the future. The experiment will fail due to improper setup, the human will at best glance over the logs and declare victory, and everyone just believes.
Yes, I read it and specifically pointed it out (that's why there are 3 hours of interactive logs). There are 4 other runs pushed now so you can see what actual clean room runs for 5.2 xhigh, 5.3-Codex xhigh, 5.4 xhigh, and Opus 4.6 ultrathink look like: https://github.com/lhl/claudecycles-revisited/blob/main/COMP... as well as the baseline.
omg this is so cool.
because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness-level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
Anecdotally, I will say that for my toughest jobs GPT-5+ High in `codex` has been the best tool I've used - CUDA->HIP porting, finding bugs in torch, websockets, etc - it's able to test, reason deeply, and find bugs. It can't make UI code for its life, however.
Sonnet/Opus 4.5 is faster, generally feels like a better coder, and makes much prettier TUI/FEs, but in my experience, for anything tough, any time it tells you it understands now, it really doesn't...
Gemini 3 Pro is unusable - I've found the same thing: opinionated in the worst way, unreliable, doesn't respect my AGENTS.md, and for my real-world problems I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say I wasn't impressed w/ Max; hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.
We do live in an age of frontier LLMs... For fun, I'll just use Kimi K2 (on Kagi Assistant).
> Can you explain what this means and its significance? Assume that I'm a layperson with no familiarity with LLM jargon so explain all of the technical terms, references, names. https://github.com/MoonshotAI/Kimi-Linear
Imagine your brain could only “look at” a few words at a time when you read a long letter. Today’s big language models (the AI that powers chatbots) have the same problem: the longer the letter gets, the more scratch paper they need to keep track of it all. That scratch paper is called the “KV cache,” and for a 1 000 000-word letter it can fill a small library.
Kimi Linear is a new way for the AI to read and write that throws away most of that scratch paper yet still understands the letter. It does this by replacing the usual “look at every word every time” trick (full attention) with a clever shortcut called linear attention. The shortcut is packaged into something they call Kimi Delta Attention (KDA).
What the numbers mean in plain English
51.0 on MMLU-Pro: on a 4 000-word school-test set, the shortcut scores about as well as the old, slow method.
84.3 on RULER at 128 000 words: on a much longer test it keeps the quality high while running almost four times faster.
6 × faster TPOT: when the AI is writing its reply, each new word appears up to six times sooner than with the previous best shortcut (MLA).
75 % smaller KV cache: the scratch paper is only one-quarter the usual size, so you can fit longer conversations in the same memory.
Key pieces explained
Full attention: the old, accurate but slow “look back at every word” method.
KV cache: the scratch paper that stores which words were already seen.
Linear attention: a faster but traditionally weaker way of summarising what was read.
Gated DeltaNet: an improved linear attention trick that keeps the most useful bits of the summary.
Kimi Delta Attention (KDA): Moonshot’s even better version of Gated DeltaNet.
Hybrid 3:1 mix: three layers use the fast KDA shortcut, one layer still uses the old reliable full attention, giving speed without losing smarts.
48 B total, 3 B active: the model has 48 billion total parameters but only 3 billion “turn on” for any given word, saving compute.
Context length 1 M: it can keep track of about 1 000 000 words in one go—longer than most novels.
Bottom line
Kimi Linear lets an AI read very long documents or hold very long conversations with far less memory and much less waiting time, while still giving answers as good as—or better than—the big, slow models we use today.
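The "75 % smaller KV cache" figure follows directly from the 3:1 hybrid described above: only one layer in four keeps a full-attention cache, while the KDA layers keep a fixed-size state that doesn't grow with the text. A back-of-the-envelope check in Python (layer count and per-token KV bytes are illustrative assumptions, not Kimi Linear's actual configuration):

```python
# Sanity-check the 3:1 hybrid's cache saving: 1 in 4 layers keeps a
# sequence-length-proportional KV cache; KDA layers' fixed state is
# treated as negligible here. All concrete numbers are illustrative.
n_layers = 48                         # assumed layer count
kv_bytes_per_token_layer = 64 * 1024  # assumed KV bytes per token per layer
seq_len = 128_000

full_cache = n_layers * kv_bytes_per_token_layer * seq_len
hybrid_cache = (n_layers // 4) * kv_bytes_per_token_layer * seq_len

reduction = 1 - hybrid_cache / full_cache
print(f"KV cache reduction: {reduction:.0%}")  # → KV cache reduction: 75%
```

Note the reduction is 75 % regardless of the assumed sizes; it's purely the 3:1 layer ratio doing the work.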
I'd previously been giving Hyprland a try, but after lots of customization work there were still a bunch of things I wasn't happy with, and I ended up back on GNOME as a "just let me get work done" thing (I use multiple workspaces, always have dozens or hundreds of browser windows open, and depend on a bunch of tray extensions). That being said, GNOME just updated versions and broke all my extensions again, so I've decided to recommit to fixing anything that isn't working for my workflow and ditching GNOME forever (I was previously much happier on Openbox, but well, Wayland).
With this latest go I gave River, QTile, and Niri a try. After a bit of swapping back and forth, I've settled on Niri and am slowly adding functionality I'm missing.
- I like multiple dynamic workspaces (grouped by function) and don't see much point beyond a split or two, so Niri worked pretty well, and I was able to largely configure all the keyboard shortcuts to something that made sense to me
- I'm using waybar and swaync for my other DE bits
I've also been using long running Claude Code/Codex in a workspace to build a number of custom scripts:
- niri-workspaces - dynamically generate a workspace display on my waybar showing windows, activity
- niri-workspace-names - integrate w/ fuzzel to let me rename workspaces
- niri-alttab - getting app cycling working in a way that makes sense to me, this is a larger project probably if I want live thumbnails and the like
- niri-terminal-below - I often want a new vertical terminal split; this is a bit hacky but works (have to spawn a new terminal, bring it below, and move back if on the right side)
I haven't gone through all the docs or done much looking around, but one nice thing with these new coding agents is that they can just go and do a passable job that I can then tweak as I want.
In Linux, you can set it as high as you want, although you should probably have swap and still be prepared for your system to die if you set it to 128GiB. Here's how you'd set it to 120GiB:
# This is deprecated, but can still be referenced (value is in MiB):
options amdgpu gttsize=122880
# This specifies GTT by # of 4KiB pages:
# 31457280 * 4KiB / 1024 / 1024 = 120 GiB
options ttm pages_limit=31457280
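Since the two parameters express the same size in different units (gttsize in MiB, pages_limit in 4 KiB pages), a tiny helper makes the conversion explicit; the function here is just a sketch, not part of any tool:

```python
# Convert a desired GTT size in GiB into the two module-parameter forms:
# amdgpu gttsize= (MiB) and ttm pages_limit= (number of 4 KiB pages).
def gtt_params(gib: int) -> tuple[int, int]:
    mib = gib * 1024                            # gttsize value, in MiB
    pages = gib * 1024 * 1024 * 1024 // 4096    # pages_limit value, 4 KiB pages
    return mib, pages

print(gtt_params(120))  # → (122880, 31457280)
```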
RDNA3 CUs do not have FP8 support and its INT8 runs at the same speed as FP16 so Strix Halo's max theoretical is basically 60 TFLOPS no matter how you slice it (well it has double INT4, but I'm unclear on how generally useful that is):
Note, even with all my latest manual compilation tweaks and the latest TheRock ROCm builds, the best I've gotten mamf-finder up to is about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards hit 70-80%), although a huge improvement over the single-digit TFLOPS you might get ootb.
If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark token generation will be about the same as the 395.
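The bandwidth-bound rule of thumb is easy to sanity-check: batch-1 token generation reads roughly the model's (active) weights once per token, so tok/s ≈ memory bandwidth / model bytes. A sketch with illustrative numbers (the ~256 GB/s is the 395's LPDDR5X class; the Spark's is in the same ballpark, hence similar speeds):

```python
# Rough upper bound on batch-1 token generation for a memory-bandwidth-
# bound workload: each token requires streaming ~the model's (active)
# weight bytes from memory. Figures below are illustrative assumptions.
def tok_per_s(bandwidth_gbs: float, model_gb: float) -> float:
    return bandwidth_gbs / model_gb

# e.g. a 16 GB quantized model on ~256 GB/s of LPDDR5X bandwidth
print(round(tok_per_s(256, 16), 1))  # → 16.0
```

Real numbers land below this bound (KV cache reads, imperfect bandwidth utilization), but the scaling with bandwidth is why similar-bandwidth machines generate at similar speeds.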
On general utility, I will say that the 16 Zen5 cores are impressive. It beats my 24C EPYC 9274F in single and multithreaded workloads by about 25%.