Hacker News | extr's comments

Wow, quite surprising results. I have been working on a personal project with the astral stack (uv, ruff, ty) that's using extremely strict lint/type checking settings, you could call it an experiment in setting up a python codebase to work well with AI. I was not aware that ty's gaps were significant. I just tried with zuban + pyright. Both catch a half dozen issues that ty is ignoring. Zuban has one FP and one FN, pyright is 100% correct.

Looks like I will be converting to pyright. No disrespect to the astral team, I think they have been pretty careful to note that ty is still in early days. I'm sure I will return to it at some point - uv and ruff are excellent.


This is the way. It's also 100% pyright for me, for now. I can recommend turning on reportMatchNotExhaustive if you're into Python's match statements but would love the exhaustiveness checking you get in Rust. Eric Traut has done a marvellous job working on pyright, what a legend!

But don't get me wrong, I made an entry in my calendar to remind me of checking out ty in half a year. I'm quite optimistic they will get there.


For big codebases pyright can be pretty slow and memory hungry. Even though ty is still a WIP, I'm adopting it at work because of how fast it is and some other goodies (e.g. https://docs.astral.sh/ty/features/type-system/#intersection...)

Say what you will about Microsoft, but their programming language people consistently seem to make very solid decisions.

Microsoft started as a programming language company (MS-BASIC) and they never stopped delivering serious quality software there. VB (classic), for all its flaws, was an amazing RAD dev product. .NET, especially since the move to open-source, is a great platform to work with. C# and TS are very well-designed languages.

Though they still haven't managed to produce a UI toolkit that is both reliable, fast, and easy to use.


I assume this is pretty rare, but ty sometimes finds real issues that are actually allowed by the spec, like:

  def foo(a: float) -> str:
    return a.hex()

  foo(False)  # raises AttributeError: 'bool' object has no attribute 'hex'
is correct according to PEP 484 (when an argument is annotated as having type float, an argument of type int is acceptable) but this will lead to a runtime error. mypy sees no type error here, but ty does.

You probably just don't have the hang of it yet. It's very good but it's not a mind reader and if you have something specific you want, it's best to just articulate that exactly as best you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to explain that you want tests that assert on observable outcomes and state, not internal structure, use real objects not mocks, property based testing for invariants, etc. It's a feedback loop between yourself and the agent that you must develop a bit before you start seeing "magic" results. A typical session for me looks like:

- I ask for something highly general and claude explores a bit and responds.

- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.

- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.

- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.

- I ask it to review its own work and then critique that review, etc. We write tests.

Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.


I absolutely have the hang of Claude and I still find that it can make those ridiculous mistakes, like replicating logic into a test rather than testing a function directly, talking to a local pg instance that was stale or not even running, etc. I have a ton of skills and pre-written prompts for testing practices but, over longer contexts, it will forget and do these things, get confused, etc.

You can minimize these problems with TLC but ultimately it just will keep fucking up.


Don't know what to tell you. Sounds like you're holding it wrong. Based on the current state of things I would try to get better at holding it the right way.

I can't tell if you're joking?

My favorite is when you need to rebuild/restart outside of claude and it will "fix the bug" and argue with you about whether or not you actually rebuilt and restarted whatever it is you're working on. It would rather call you a liar than realize it didn't do anything.

this is a pretty annoying problem -- i just intentionally solve it by asking claude to always use the right build command after each batch of modifications, etc

"That's an old run, rebuild and the new version will work" lol

With the back and forth refining I find it very useful to tell Claude to 'ask questions when uncertain' and/or to 'suggest a few options on how to solve this and let me choose / discuss'

This has made my planning / research phase so much better.


Yes, pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file; this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).

Pretty good workflow. But you need to change the order of the tests and have it write the tests first. (TDD)

I mean I’ve been using AI close to 4 years now and I’ve been using agents off and on for over a year now. What you’re describing is exactly what I’m doing.

I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.

What’s an example of something you built today like this?


Fair, that's optimistic, and it depends what you're doing. Looking at a personal project I had a PR from this week at +3000 -500 that I feel quite good about, took about 2 nights of about an hour each session to shape it into what I needed (a control plane for a polymarket trading engine). Though if I'm being fair, this was an outlier, only possible because I very carefully built the core of the engine to support this in advance - most of the 3K LoC was "boilerplate" in the sense I'm just manipulating existing data structures and not building entirely new abstractions. There are definitely some very hard-fought +175 -25 changes in this repo as well.

Definitely for my day job it's more like a few hundred LoC per task, and they take longer. That said, at work there are structural factors preventing larger changes, code review, needing to get design/product/coworker input for sweeping additions, etc. I fully believe it would be possible to go faster and maintain quality.


Those numbers are much more believable, but now we’re well into maybe a 2-3x speed up. I can easily write 500 LOC in an hour if I know exactly what I’m building (ignoring that LOC is a terrible metric).

But now I have to spend more time understanding what it wrote, so best case scenario we’re talking maybe a 50% speed up to a part of my job that I spent maybe 10-20% on.

Making very big assumptions that this doesn’t add long term maintenance burdens or result in a reduction of skills that makes me worse at reviewing the output, it’s cool technology.

On par with switching to a memory managed language or maybe going from J2EE to Ruby on Rails.


Thinking in terms of a "speed up multiplier" undersells it completely. The speed-up on a task I would never have even attempted is infinite. For my +3000 PR recently on my polymarket engine control plane, I had no idea how these types of things are typically done. It would have taken me many hours to think through an implementation and hours of online research to assemble an understanding of typical best practices. Now with AI I can dispatch many parallel agents to examine virtually all public resources for this at once.

Basically if it's been done before in a public facing way, you get a passable version of that functionality "for free". That's a huge deal.


1. You think you have something following typical best practices. You have no way to verify that without taking the time to understand the problem and solution yourself.

2. If you’d done 1, you’d have the knowledge yourself next time the problem came up and could either write it yourself or skip the verifications step.

I’m not saying there aren’t problems out there where the problem is hard to solve but easy to verify. And for those use cases LLMs are terrific.

But many problems have the inverse property. And many problems that look like the first type are actually the second.

LLMs are also shockingly good at generating solutions that look plausible, independent of correctness or suitability, so it’s almost always harder to do the verification step than it seems.


The control plane is already operational and does what I need. Copying public designs solved a few problems I didn't even know I had (awkward command and control UX) and seems strictly superior to what I had before. I could have taken a lot longer on this - probably at least a week, to "deeply understand the problem and solution". But it's unclear what exactly that would have bought me. If I run into further issues I will just solve them at that time.

So what is the issue exactly? This pattern just seems like a looser form of using a library versus building from scratch.


Hard to read due to LLM generated prose.

Yeah, it's quite bad. Just some of the classics:

- "Why This Matters"

- "That's accurate, but it's only half the answer — and the less interesting half"

- "this isn't an edge case. It's routine."

I'm at the point, I would just rather read something somebody actually wrote even if it's not grammatically perfect and has lots of spelling mistakes.


Unfortunately the expectation of readers, and algorithms, at large is perfection.

If this contained various grammer mystaeks, but interesting content, it wouldn't have been flagged. As usual with LLM, it is based on other content. Show me the source, we used to say to binaries... ¿Que pasa?

So the upvotes were for? Anyway, we disagree; that's normal.

> As usual with LLM, it is based on other content.

Show me where else on the internet someone waxed poetic about a conceptual separation of transport and function regarding WireGuard. I dare you.

Show me another client library like the one in the article? That’s the double-dare.

Did you even read it?


Since you didn't think it was worth writing it yourself, I don't see how you can expect others to think it's worth spending their time to read.

So no, then? Thanks for your thoughtful engagement.

> So the upvotes were for?

People getting tricked? Who knows?

> Did you even read it?

I quit when I figured it was written by an LLM. I'm not interested in reading LLM 'content' without it providing a source.

I am willing to generate some of my own sauce with a prompt and then request the sources. That way, I know at least some parameters of the input and output.

But with your article, I do not know which sources were used as reference, I do not know which prompt you used.

As for HN, they're busy with tackling the LLM problem. They know it is a problem.


Again, this was novel content. If you find a source of anything similar let me know. I'm belaboring this point for one important reason: content matters. I want to see new thoughts, not repetitive mindless drivel in personal "voice".

There has to be a balance.


One thing I've seen before is people being upfront about using LLMs (at the top of the content). That way, those who dislike it will feel less tricked.

The balance at least on this site is strongly in favour of humans writing things.

You’re belabouring the point because you don’t believe that by filling the internet with slop you’re doing anything wrong when actually it’s antisocial and wrecks the commons.

If you think content matters so much then just invest the time in writing it yourself rather than trying to convince others that it is ok that you didn’t.


The pot calling the kettle black, methinks. How are you improving the internet by vilifying new ideas?

No. It’s authenticity instead of llm-generated blogvertising.

When I ask an LLM, one that's vaunted here for its skill with code, to "clean up obvious errors and improve readability", how is that "LLM generated"?

Yes it’s advertising in that I believe in my product and write about it.


Dude. Give it a rest. You had the LLM write an article, you posted it here. You got called out.

Just write your own blog and this won't happen in future.


Sigh. I did write it, then I used an LLM to clean it up. Seriously, if you can find anything else out there making a similar point or providing a similar library I'd love to hear about it.

You're absolutely right!

This is and has always been trivially configurable. Just put `Task` as a disallowed tool.


Part of the issue with legal weed is that it's much as if all alcohol were sold as minorly different varieties of Everclear at 150+ proof, and brands' primary boast was just how potent and alcoholic their mix is. It doesn't encourage appropriate usage, and IIRC many of these cases of psychosis are from consuming high-THC products 24/7 for weeks/months/years on end.

If anyone is curious, check out brands like Rove, Dompen, Care By Design, which offer THC pens at very low dosage. They're frustratingly undermarketed and understocked, but as a CA resident I buy and use pens that are ~4% THC (rather than 90%+). A single puff occasionally after the kids go to sleep - the effect is marginally psychoactive, scratches the itch for "relaxation without impairment", helps me sleep restfully.

Completely different experience from high-THC products. If you compare the literal amount of THC consumed, it's almost a 20x reduction: the equivalent of having half a glass of wine instead of lining up 10 shots.


I use gummies, ~4-5mg THC (ideally with some of the other TH- chemicals in it), deliberately kept my tolerance low so it doesn’t get more expensive (and I almost only use it for sleep, purely “fun” use is maybe a couple days a year). Take in the evening, start an MST3K episode about an hour later, really enjoy the back half of it, go to bed and fall asleep instantly, wake up feeling like a million bucks. Perfect evening.


I see a lot of people using weed for better sleep, but isn't weed supposed to interfere with REM states? I thought that weed would have the opposite effect that you say. Do you dream if you use weed before bed?


I rarely dream either way (unless I start focusing on that specifically, then my recall will improve quickly). When I was younger and would go to bed severely stoned I would wake up groggy and lethargic - clearly not optimal sleep. On 3-4% THC I usually wake up spontaneously and feel well rested. It mostly just helps me fall asleep and stay asleep. YMMV obviously.


It’s a pretty low dose, doesn’t exactly send me into space—heavy users might need 10x or more that dose to even feel it—just enough to make my brain shut up so I can fall asleep. I think a lot of folks who have a bad time when they try it start at far too high a dose (I wouldn’t even start at 5mg, maybe shoot for like 2), I also don’t much enjoy being properly high, anything past what you’d call a heavyish buzz I find unpleasant (and my standard nighttime dose doesn’t even quite get me to the heavier end of a buzz, that’s more the 7-10mg range for me, though I’d caution that some gummies seem more potent and some nominal-5s do get me closer to that than others)

I dunno about sleep quality effects, but it’s definitely better than even a couple beers (for me, these days) and it’s way better than lying awake until 3am… for the third night in a row. For most of the night it should be mostly worn-off, again, I’m not taking a ton and it takes longer to work through you in edible form than smoking, but we’re still talking less than half the night, especially as I usually time it so it hits just a little while before bed (I don’t want to get in bed without it having hit yet).

I don’t remember having had dreams most nights anyway, so I don’t know about that. Even with some help I’m typically a bit under the low side of the amount of sleep I ought to be getting, over a week. Lucky if I break the eight-hour mark two days of the seven, usually in the 6.5-7.5 range the rest (I don’t take a gummy every single night, either, gotta keep that tolerance at bay). I think I dream (or, at least, remember it) more when I get the rare series of several days of 8+ hours, but I don’t track it so can’t say for sure, and yeah, no idea the effect of weed on that.

I can vouch that at my dose level I get way better sleep than I did the one time I tried a prescription sleep aid, which was Lunesta. If I didn’t get a solid 9 hours on that I’d wake up feeling hung-over, weed doesn’t give me extra trouble like that if I fail to get a full 8+ hours. Hell, even a “good” night on lunesta didn’t leave me feeling awesome in the morning. Other downsides: it mixes worse with other things, had a glass of wine with dinner? Better think twice about the lunesta, at least according to the label. On some decongestant medicine (in addition to antibiotics) for a sinus infection, and the sinus infection is wrecking your ability to sleep so you could really use it? Might not be able to take it with the other stuff. Weed’s so much better for those cases especially, bump the dose slightly and nothing short of something that’s gonna hospitalize me will be able to keep me from sleeping, and it famously doesn’t interact badly with very many other drugs, so it removes the very worst thing about most common illnesses like that (for me, anyway) which is the extreme sleep disruption.


You get what you pay for imo.


My answer was (for which it did zero thinking and answered near-instantaneously):

"Drive. You're going there to use water and machinery that require the car to be present. The question answers itself."

I tried it 3 more times with extended thinking explicitly off:

"Drive. You're going to a car wash."

"Drive. You're washing the car, not yourself."

"Drive. You're washing the car — it needs to be there."

Guess they're serving you the dumb version.


I guess I'm getting the dumb one too. I just got this response:

> Walk — it's only 50 meters, which is less than a minute on foot. Driving that distance to a car wash would also be a bit counterproductive, since you'd just be getting the car dirty again on the way there (even if only slightly). Lace up and stroll over!


Sonnet 4.6 gives me the fairly bizarre:

> Walk! It would be a bit counterproductive to drive a dirty car 50 meters just to get it washed — and at that distance, walking takes maybe 30–45 seconds. You can simply pull the car out, walk it over (or push it if it's that close), or drive it the short distance once you're ready to wash it. Either way, no need to "drive to the car wash" in the traditional sense.

I struggle to imagine how one "walks" a car as distinct from pushing it....

EDIT: I tried it a second time, still a nonsense response. I then asked it to double-check its response, and it realized the mistake.


I got almost the same reply, including the "push it" nonsense:

> Walk! It would be a bit counterproductive to drive a dirty car 50 meters just to get it washed — and the walk will take you less than a minute. You can simply pull the car out and push or walk it over, or drive it the short distance once you're ready to wash it. Either way, no need to "drive" in any meaningful sense for just 50 meters.


You can walk a dog down the street, what's the difference?


GP’s car just isn’t trained well enough


lmao I love how stupid that response is.


I got this: Drive. Getting the car wet while walking there defeats the purpose.

Gotta keep the car dry on the way!


I guess that it generally has a 50/50 chance of drive/walk, but some prompts nudge it toward one or the other.

Btw, explanations don't matter that much. Since it writes the answer first, the only thing that matters is what it decides for the first token. If the first token is "walk" (or "wa" or however it's split), it has no choice but to make up an explanation to defend that answer.
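A toy illustration of that autoregressive commitment (the probabilities are made up, obviously not the real model's):

```python
# With thinking off, the model samples its answer token before any
# rationale exists. Whatever wins here is what the "explanation"
# must then defend.
first_token_probs = {"Drive": 0.52, "Walk": 0.48}
answer = max(first_token_probs, key=first_token_probs.get)
# The continuation is drawn from P(text | prompt, answer), so the
# justification is generated to fit the already-chosen first token.
```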


Same, I haven't been able to get gemini or claude to tell me to walk a single time and I've even tried changing the distance in the prompt, etc.


I get the Anthropic models to screw up consistently. Change the prefix. Say in the preamble that you are going after supper or something. Change the scenario every time. They are caching something across requests. Once you correct it, it fixes its response until you mess with the prompt again.


Maybe Claude knows that they've been trying to increase their step count and lose some weight


FWIW I mentioned this in the thread (I am the guy in the big GH issue who actually used verbose mode and gave specific likes/dislikes), but I find it frustrating that ctrl+o still seems to truncate at strange boundaries. I am looking at an open CC session right now with verbose mode enabled - works pretty well and I'm glad you're fixing the subagent thing. But when I hit ctrl+o, I only see more detailed output for the last 4 messages, with the rest hidden behind ctrl+e.

It's not an easy UI problem to solve in all cases since behavior in CC can be so flexible, compaction, forking, etc. But it would be great if it was simply consistent (ctrl+o shows last N where N is like, 50, or 100), with ctrl+e revealing the rest.


Yes totally. ctrl+o used to show all messages, but this is one of the tricky things about building in a terminal: because many terminals are quite slow, it is hard to render a large amount of output at once without causing tearing/stutter.

That said, we recently rewrote our renderer to make it much more efficient, so we can bump up the default a bit. Let me see what it feels like to show the last 10-20 messages -- fix incoming.


Terminals already solved how to do this decades ago: pagers.

Write the full content to a file and have less display it. That's a single "render" you do once and write to a file.

Your TUI code spawns `less <file>` and waits. Zero rendering loop overhead, zero tearing, zero stutter. `less` is a 40-year-old tool that exists precisely to solve this problem efficiently.

If you need to stream new content in as the session progresses, write it to the file in the background and the user can use `less +F` (follow mode, like tail -f) to watch updates.
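As a sketch (in Python for illustration; the function names are mine), the whole "renderer" collapses to one file write plus a blocking spawn of `less`:

```python
import subprocess
import tempfile

def build_pager_cmd(text, pager=("less", "-R", "+G")):
    # One-shot "render": write the full transcript to a file once.
    # -R preserves ANSI colors, +G starts at the end like a chat log.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(text)
        return [*pager, f.name]

def show_transcript(text):
    # Blocks until the user quits less; the TUI does zero scroll handling.
    subprocess.run(build_pager_cmd(text))
```

For the streaming case, append to the same file from a background thread and launch `less +F` instead.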


thanks dude. you are living my worst nightmare which is that my ultra cool tech demo i made for cracked engineers on the bleeding edge with 128GB ram apple silicon using frontier AI gets adopted by everyone in the world and becomes load bearing so now it needs to run on chromebooks from 2005. and if it doesn't work on those laptops then my entire company gets branded as washed and not goated and my cozy twitter account is spammed with "why didn't you just write it in rust lel".

o7


Your worst nightmare. For me this is the cool part.


Just tell people to install a fast terminal if they somehow happen to have a slow one?

Heck, simply handle the scrolling yourself a la tmux/screen and only update the output at most every 4ms?

It's so trivial, can't you ask your fancy LLM to do it for you? Or have you guys lost the plot at this point and forgotten the most basic rules of writing non-pessimized code?


> It's so trivial, can't you ask your fancy LLM to do it for you?

They did. And the result was a React render loop that takes 16ms to output a hundred characters to screen and tells them it will take a year to rewrite: https://x.com/trq212/status/2014051501786931427


What's extra funny is that curses diffs a virtual "current screen" to "new screen" to produce the control codes that are used to update the display. Ancient VDOM technology, and plenty fast enough.
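The core of that "ancient VDOM technology" is a few lines; a toy row-level version (not curses' actual implementation, which diffs per cell):

```python
def diff_screens(old, new):
    # Compare the virtual "current" and "desired" screens row by row
    # and emit only the rows that changed -- the minimal update set a
    # curses-style renderer turns into cursor-move + write sequences.
    return [
        (row, line)
        for row, (prev, line) in enumerate(zip(old, new))
        if prev != line
    ]
```

Repainting a 50-row screen where one status line changed then costs one row of output, not 50.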


I'm with you on this one. "Terminals are too slow to support lots of text so we had to change this feature in unpopular ways" is just not a plausible reason; terminals have been able to dump ~1 MB per second for decades.

The real problem is their ridiculous "React rendering in the terminal" UI.


> because many terminals are quite slow, it is hard to render a large amount of output at once without causing tearing/stutter.

Only if you use React as your terminal renderer. You're not rendering 10k objects on screen in a few milliseconds. You're outputting at best a few thousand characters. Even the slowest terminal renderer is capable of doing that.


Why would you tailor your product for people that don’t know how to install a good terminal? Just tell them to install whatever terminal you recommend if they see tearing.


Do you have any examples of slow terminals, and what kind of maximum characters per second they have?


I tried this today. It's good, but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh; the speed via fireworks.ai is great, they're doing great work on the hosting side. But I found the model had to double back to fix type issues, broken tests, etc., far more than Opus 4.5, which churned through the tasks with almost zero errors. In fact, I gave the resulting code to Opus, simply said it looked "sloppy", and Opus cleaned it up very quickly.


Had this problem a while ago with my zsh startup being slow. I just opened claude code and told it to benchmark my shell start and then optimize it. Took like 5 minutes and now it's ultra fast. I hardly have any idea what it did exactly, but it worked great.
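The measurement half is easy to reproduce by hand; a sketch of the kind of loop the agent likely ran (function name and details are mine):

```python
import subprocess
import time

def bench_startup(shell="zsh", runs=5):
    # Time an interactive shell that exits immediately; best-of-N
    # filters out cold-cache noise. This is the number your .zshrc
    # changes should drive down.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            [shell, "-i", "-c", "exit"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Anything much past ~100 ms is usually a plugin manager or completion init worth lazy-loading.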

