I used Gemini to look up a relative with a connection to a famous event. The relative himself is obscure, but I have some of his writings and I've heard his story from other relatives. Gemini fabricated a completely false narrative about my relative that was much more exciting than what actually happened. I spent a bunch of time going through the sources Gemini supplied trying to verify things, and although the sources were real, the story Gemini came up with was completely made up.
Yup. I've had Gemini create fake citations to papers. I've also had it hallucinate the contents of paywalled papers, so I know I can't trust anything it writes, though I am getting better at using it recursively to verify things.
I am certain I read an article posted on HN a month or so ago about some researchers who were caught using false citations in their research.
If I remember correctly, some group used an AI tool to sniff for AI-generated citations in others' work. What I remember most was how abhorrent some of the sources the AI sniffer caught were. One citation's author was literally given as "FirstName LastName" -- they didn't even sub in a fake name lol.
My company recently hired a contractor. He submits multi-thousand-line PRs every day, far faster than I can review them. This would maybe be OK if I could trust his output, but I can't. When I ask him really basic questions about the system, he either doesn't know or he gets it wrong.

This week, I asked for some simple scripts that would let someone load data into a local or staging environment, so that the system could be tested in various configurations. He submitted a PR with 3800 lines of shell scripts. We do not have any significant shell scripts anywhere else in our codebase. I spent several hours reviewing it with him - maybe more time than he spent writing it. His PR had tons and tons of end-to-end tests of the system that didn't actually test anything - some said they were validating state, but passed if a GET request returned a 200. There were a few tests that called a create API. The tests would pass if the API returned an ID of the created object. But they would ALSO pass if the API didn't return an ID.

I was trying to be a good teacher, so I kept asking questions like "why did you make this decision", etc., to try to have a conversation about the design choices, and it was very clear that he was just making up bullshit rationalizations - he hadn't made any decisions at all. There was one particularly nonsensical test suite - it said it was testing X but included API calls that had nothing to do with X. I was trying to figure out how he had come up with that, and then I realized: I had given him a Postman export with some example API requests, and in one of the requests I had gotten lazy and modified the request to test something but hadn't updated the name in Postman. So the LLM had assumed the request was related to the old name and used it when generating a test suite, even though the two had nothing to do with each other. He had probably never actually read the output, so he had no idea that it made no sense.
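To make the vacuous-test pattern above concrete, here's a minimal sketch - rendered in Python rather than the original shell, with a made-up endpoint and base URL:

    # Hypothetical reconstruction of the pattern, not the actual PR.
    import requests

    BASE_URL = "http://localhost:8080"  # assumed local environment

    def test_create_widget():
        resp = requests.post(f"{BASE_URL}/widgets", json={"name": "w1"})
        assert resp.status_code == 200       # only proves the server answered
        widget_id = resp.json().get("id")
        if widget_id:
            print(f"PASS: created {widget_id}")
        else:
            print("PASS")                    # no ID returned, still "passes"

The only real assertion is on the status code; the ID branch can never fail, so the test is green whether or not the create actually worked. The fix is a hard assertion on the thing the test claims to validate (assert widget_id).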
When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it, and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I'm reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.
I'm observing pretty much the same pattern in my job. The sad truth is, people, especially non-technical ones, get too easily impressed by vibe-coded projects or contributions made in a few hours, because it's shiny and gives the impression of a productivity boost. Don't you dare ask how that is supposed to scale, whether it's secure or even extensible, or you'll be the one killing the mood in the room. Even though that's precisely the hard part of the job.
This sums up the inherent friction between hype and reality really well.
CEOs and hype men want you to believe that LLMs can replace everyone. In 6 months you can give them the keys to the kingdom and they'll do a better job running your company than you did. No more devs. No more QA. No more pesky employees who need crazy stuff like sleep, and food, and time off to be a human.
Then of course we run face first into reality. You give the tool to an idiot (or a generally well-meaning person not paying enough attention) and you end up with 2k-line PRs that are batshit insane, production databases deleted, malicious code downloaded and executed on your machines, email archives deleted, and entire production infrastructure systems blown away. Then the hype men come back around and go "well yeah, it's not the tool's fault, you still need an expert at the wheel", even though you were told you don't.
LLMs can do amazing things, and I think there are a lot of opportunities to improve software products if they're used correctly, but reality does not line up with the hype, and it never will.
The volume is different. Someone submitted a PR this week that was 3800 lines of shell script. Most of it was crap, and none of it should have been written in shell. He's submitting PRs with thousands of lines of code every day. He has no idea how any of it actually works, and it completely overwhelms my ability to review.
Sure, he could have submitted an ill-considered 3800-line PR five years ago, but it would have taken him at least a week, and there probably would have been opportunities to submit smaller chunks along the way or discuss the approach.
It’s harder when the person doing what you describe has the ability to have you fired. Power asymmetry + irresponsible AI use + no accountability = a recipe for a code base going right to hell in a few months.
I think we’re going to see a lot of the systems we depend on fail a lot more often. You’d often see an ATM or flight status screen have a BSOD - I think we’re going to see that kind of thing everywhere soon.
This happened to me yesterday.
I give a junior engineer a project. He turns it around really quickly with Cursor. I review the code, get him to fix some things (again turned around really quickly with Cursor) and he merges it. I then try a couple of test cases and the system does the wrong thing on the second one I try. I ask him to fix it. He puts a prompt like "fix this for xyz case" into Cursor and submits a PR. But when I look at the PR, it's clearly wrong. The model completely misunderstood the code. So I leave a detailed comment explaining exactly what the code does.
He's moving so fast that he's not bothering to learn how the system actually works. He just implicitly trusts what the model tells him. I'm trying to get him to do end-to-end manual testing using the system itself (log into the web app in a local or staging environment and go through the actions that the user would go through), but he just has the AI generate tests and trusts the output. So he completely misses things that would be clear if you learned the system at a deep level and could see how the individual project you're working on fits in with the larger system.
I see this with all the junior engineers on my team. They've never learned how to use a debugger and don't care to learn. They just ask the model. Sometimes they think critically about the system and the best way to do something, but not always. They often aren't looking that critically at the model's output.
Senior engineers must become more comfortable giving quick, broad feedback that matches the minimal time put into the PR. "This doesn't fit how the system works; please research and write a more detailed prompt and redo this" is the advice they need. It feels taboo to do it to a significant diff, but diff size no longer has much correlation to thought or effort in these situations.
> Coding agents had collapsed the barrier to entry for launching a delivery app. A competent developer could deploy a functional competitor in weeks, and dozens did, enticing drivers away from DoorDash and Uber Eats by passing 90-95% of the delivery fee through to the driver. Multi-app dashboards let gig workers track incoming jobs from twenty or thirty platforms at once, eliminating the lock-in that the incumbents depended on. The market fragmented overnight and margins compressed to nearly nothing.
This doesn't make a ton of sense to me. The barrier to entry isn't the app, it's the network of drivers and restaurants, and all the money that apps like DoorDash poured into marketing. Just having a functioning app doesn't really do very much.
I believe you're referring to Syria, not Iran. And I don't think you're describing the situation accurately at all. The Syrian civil war is incredibly complex, and there are many parties involved. The groups that led the offensive were supported by Turkey at various points, but not by the United States. US forces in Syria didn't really have much to do with that offensive.
> Islamists and communists. Guess which one was helped by USA? :-)
Neither was helped by the USA. The Shah was helped by the USA.
What the USA did is the same thing it does in all of the Islamic dictatorships that it props up - it used its intelligence and its cash to help its dictator exterminate all of his secular opposition. Actually kill them. What was left was religious fundamentalist opposition that it couldn't touch, and that the Shah himself partially relied on to stay in power. That meant that when the general population was finally at the point of exasperation, the only institutions that 1) were prepared to be the vehicle of that exasperation and 2) had a government in waiting that could take charge after the government had fallen were the religious ones.
Same thing that happened in Egypt after decades of helping Mubarak kill members of the secular opposition and destroy their organizations. When the government was overthrown spontaneously by a public driven to their limit, the only people prepared to take over, and supported by the public, were fundamentalists. The US saw another Iran coming and quickly stepped in to destroy the popular will and install another dictator that they could control.
There's some truth to what you're saying, but it's a huge exaggeration. It's absolutely incorrect to say that the US helped the Shah kill all of his secular political opponents. It's generally true that SAVAK had neutered the communist opposition, but there were many secular opponents of the Shah who contributed to the Iranian Revolution. Many of them had been imprisoned at various points, but not killed. Take Shapour Bakhtiar or Mehdi Bazargan for example. There were many, many secular people or moderate Islamists who opposed the Shah during the Iranian Revolution.
What happened is that Khomeini consolidated power after the revolution and eliminated these people.
I've actually read quite a lot about the fall of the Shah, and what you are saying is bullshit. See, for instance, Scott Anderson's recent book King of Kings, which goes into a great deal of detail about the US government's understanding and decision-making during the Iranian Revolution.
There's an old interview on C-SPAN's BookTV with a CIA polygrapher. He seems to genuinely believe in the validity of the polygraph, but watching the interview, I was convinced that the only value comes from intimidation and stress.
(cleaned up from an all-caps auto-transcription; bracketed words are reconstructed)
> The essence of a polygraph test is: if you have something to lose by failing a polygraph test, if you will, or something to gain by passing it, that is what makes the polygraph effective. Without the fear of detection it [doesn't work]; in as simple a way as I can put it, that is what makes it work. You have to be afraid. If you have nothing to lose by taking the polygraph test, then the pressure is not on you. But as I said, that is what makes it work. It has to be [fear of detection] more than guilt. Now, you may feel guilty, but fear of detection is the overriding concern in a polygraph test.
Maybe Reformation religions require belief, but paganism was a set of rituals known to work (by virtue of having worked before), sort of like a spiritual experimental science. Belief was not required.
Religions don't necessarily work because people believe in them, either. There are a number of religious sects that started with end-of-the-world prophecies.
I think that religions work the opposite way: people believe in them because they work. Since the purpose of religion is generally to explain the nature of reality and how to flourish in it, it needs to work for you. If it doesn't, you either just go through the motions, or quit and find a different religion (or swear off religion, which is sort of the same thing).
Reminds me of Julius Caesar describing the druids. Part of his own political career consisted of precisely performing important orthopraxy. He probably never met a druid, but remarkably he described them playing the same role he did as Pontifex Maximus.
The orthopraxy requiring those precise rituals, in Rome and Greece for instance, came with few or maybe no mandatory beliefs. City-state-sized gods in Mesopotamia probably functioned the same way. Traditions still have precise orthopraxy today, but we talk about differences in belief, whereas Caesar doesn't even acknowledge any.
A charitable read would suggest a slight touch of tongue in cheek.
To spell it out a bit:
Point 1: People simply interpreted events as proof that paganism works.
E.g., somebody made an offering to the gods and a year later won a war: proof.
Point 2: Paganism had this transactional notion, with gods giving and taking based on your offerings.
Christianity, on the other hand, doesn't promise anything good in this life (the only promise being: bear all the bad things in this life and you will be rewarded in the afterlife), so there can't be proof.
That's the point though. The testers wouldn't actually abuse their victims without the conviction of doing something righteous. Or they would, accidentally or intentionally, spill the secrets.
But if you make even the instruction material lie, then there is nothing that could be leaked and "expose" the system.
I'll second this. An external recruiter was under the (incorrect) impression that we are a 996 company. We found out because she said that no senior people she talked to were willing to work those hours.
Ultimately you can make a lot of short-term progress with 23-year-olds who are willing to live 5 minutes away from the office, have no life outside of work, and work 72 hour weeks. But you also end up with a product that was built by people who have no idea what they're doing.
I am only seeing that if the person writing the prompts knows what a quality solution looks like at a technical level and is reviewing the output as they go. Otherwise you end up with an absolute mess that may work at least for "happy path" cases but completely breaks down as the product needs change. I've described a case of this in some detail in another comment.
> the person writing the prompts knows what a quality solution looks like at a technical level and is reviewing the output as they go
That is exactly what I recommend, and it works like a charm. The person also has to have realistic expectations for the LLM, and be willing to work with a simulacrum that never learns (as frustrating as it seems at first glance).
I'm trying to work with vibe-coded applications and it's a nightmare. I am trying to make one application multi-tenant by moving a bunch of code that's custom to a single customer into config. There are 200+ line methods, dead code everywhere, tons of unnecessary complexity (for instance, extra mapping layers that were introduced to resolve discrepancies between keys, instead of just using the same key everywhere). No unit tests, of course, so it's very difficult to tell if anything broke. When the system requirements change, the LLM isn't removing old code, it's just adding new branches and keeping the dead code around.
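As a made-up illustration of those extra mapping layers (the names here are invented, not the actual codebase):

    # Hypothetical sketch of the key-mapping antipattern: each LLM session
    # introduced a new spelling of the same key plus a shim to translate it,
    # instead of standardizing on one name.
    KEY_ALIASES = {"client_id": "customerId", "customer_id": "customerId"}

    def normalize_record(record: dict) -> dict:
        out = dict(record)
        for alias, canonical in KEY_ALIASES.items():
            if alias in out and canonical not in out:
                out[canonical] = out.pop(alias)
        return out

The sane fix is to pick one key name at the boundary and delete the shims; every extra layer like this is another place for dead branches to hide.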
I ask the developer the simplest questions, like "which of the multiple entry-points do you use to test this code locally", or "you have a 'mode' parameter here that determines which branch of the code executes - which of these modes are actually used?", and I get a bunch of babble, because he has no idea how any of it works.
Of course, since everyone is expected to use Cursor for everything and move at warp speed, I have no time to actually untangle this crap.
The LLM is amazing at some things - I can get it to one-shot adding a page to a react app for instance. But if you don't know what good code looks like, you're not going to get a maintainable result.