What I missed from the writeup were some specific cases, and how you tested that all this orchestration delivers worthwhile data (actionable and complete/correct).
E.g. you have a screenshot of the AI supply chain - more of these would be useful, and also some info about how you tested that this supply chain agrees with reality.
Unless the goal of the project was to just play with agent architecture - then congrats :)
For demo purposes and to attract attention, I was primarily picking cases with cool visuals (like the screenshot of the AI supply chain you mentioned). We have some internal evals and will try to add more cases to the public repo for reference.
More signs of the AI bubble. Completely unprofessional behavior ("cool visuals" not "real results"). And don't give me that "hacker culture" bullshit, these people are targeting Wall Street as paying customers.
Would it be more professional, in your opinion, if I claimed I make $xxxxx via this tool? I thought I had clearly stated that the cool visuals are for demo purposes and to attract attention. I do not want to post any dramatic statements to trick people into using it. This is an early-stage open source project to help investors and traders organize their thoughts, not an automatic money-making machine that guarantees profit. It's the mind of the person using the tool that decides whether they profit from the market.
>And don't give me that "hacker culture" bullshit
I couldn’t help but be genuinely curious: if you believe AI is a bubble and aren’t a fan of hacker culture, then why are you here on Hacker News?
First of all this project is great and finance is ready for a disruption like this. I'm sure a lot of good research and development went into this.
Quality research indeed doesn't always make money, so I agree that it doesn't make sense to present those kinds of metrics. But at the same time, it will be hard to trust this sort of thing immediately without having a way to validate its output.
At the very least I would like to know that the financial metrics it calculates (especially those based on 20/30 data points) are correct. It looks like there is some transparency built in, and that's a good thing.
But people who are not pros in investment research wouldn't know that it messed up a certain metric, and that the output is therefore different from what it tells them. Or maybe it is not messing up entirely, but a certain sector-specific detail doesn't get picked up, making a signal less strong than the output led you to believe. Maybe you already have this, but if not, you could add some sort of validation layer; that could also serve as a customisable calculation engine. I'd use it right away.
In this case the reason for dropping support is most likely that the only DRM they can support on that older hardware has been broken. There's no technical reason why it can't be supported, and I doubt it would cost them much (or even anything) to continue support.
Meanwhile, I can still read physical books I've had since I was a child, 40 years ago. The Kindle is undeniably more convenient than physical books, but this is absolutely an unnecessary sunset of these devices.
My Kindle 4 hardware works great, I still read it nightly. Since it doesn’t feel like it’s obsolete (in fact it has physical buttons so may be slightly better than a modern Kindle), it feels like a blatant cash grab by Amazon to get us to buy new devices that probably are laden with ads or other revenue generators.
Since the November/December Opus and Claude Code releases, I've found I don't need to read the code any more. Architecture overview, sure, and testing, yes, but not reading the code directly any more.
My friends and I inspect code indirectly now - telling agents to write reports about certain aspects of the code and architecture, etc.
I do regularly read the code that Claude outputs. And about 25% of the time the tests it writes will reimplement the code under test in the test.
Another 25% of the time the tests are wrong in some other way. Usually mocking something in a way that doesn't match reality.
And maybe 5% of the time Claude does some testing that requires a database, it will find some other database lying around and try to use that instead of what it's supposed to be doing.
And even if Claude writes a correct test, it will generally have it skip the test if a dependency isn't there--no matter how fervently I tell it not to.
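To illustrate, here's a minimal sketch of that skip pattern - the dependency name "fancy_db_driver" is made up, but this is the shape it tends to produce:

```python
import importlib.util
import unittest

# "fancy_db_driver" is a hypothetical dependency for illustration.
HAS_DRIVER = importlib.util.find_spec("fancy_db_driver") is not None

class TestTotals(unittest.TestCase):
    @unittest.skipUnless(HAS_DRIVER, "fancy_db_driver not installed")
    def test_totals_against_real_db(self):
        # Would exercise the database here; fails loudly if it ever runs.
        self.fail("this machine has the driver, so the test actually ran")

# On a machine without the driver, the run is "successful": 0 failures, 1 skip.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTotals)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

CI images without that dependency will show green even though the test never executed, which is exactly the failure mode I'm complaining about.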
If you're not looking at the code at all, you're building a house of cards. If you're not reading the tests, you're not even building - you're just covering the floor in a big sloppy pile of runny shit.
> I do regularly read the code that Claude outputs
You probably could have s/Claude/Human/ in your rant and been just as accurate. I don't know how many times I've flagged these issues in code reviews. And that's only assuming the human even bothered to write tests...
What I find is that when I ask AI to write tests it writes too many, and I agree with you that a lot of them are useless. But then I just tell it that, and it agrees with me and cleans it up. Much faster feedback loop and much better final result.
I feel like people that look at a poor result and stop there and conclude it's useless have made up their mind and don't want to see the better results that are right in front of them if they just spend an extra 5 seconds trying.
How do you know whether the tests it spits out are bad if you don't read the tests?
We’re not dealing with AGI here. Tests aren’t strictly necessary for humans. They are for AI. AI requires guardrails to keep it from spinning out. That’s essentially the entire premise of the agentic workflow.
I’m pretty sure they just meant they do testing, not that they read the tests, and that’s how everyone else who responded interpreted it as well.
You can get Claude to write good tests, but based on what I’m seeing at work, that’s not what’s happening. They always look plausible even when they’re wrong, so people either don’t read them, skim them very quickly, or read the first few, assume the rest work, and commit.
I think Claude is great for testing because setting up test data and infrastructure is such a boring slog. But it almost always takes a lot of back and forth and careful handholding to get it right.
I read the tests. It is also really, really good to have Claude verify that removing the changes in question breaks the tests. This brings the quality way, way up for me.
I'd understand not reading the code of the system under test, but you don't even read the tests? I'd do that if my architecture and design were very precise, but at this point I'd have spent too much time designing rather than implementing (and possibly uncovering unknown unknowns in the process).
> Me (and my friends similarly) inspect code indirectly now - telling agents to write reports about certain aspects of the code and architecture etc.
Doesn't this take longer than reading the code?
I can see how some of this is part of the future (I remember an article about Python modules having a big docstring at the top fully describing the public functions, with the author describing how they just update this doc and then regenerate the code fully, never reading it - I find that quite convincing), but in the end I just want the most concise language for what I'm trying to express. If I need an edge case covered, I'd rather have a very simple test making that explicit than more verbose forms. Until we have formal specifications everywhere, I guess.
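For what it's worth, that docstring-as-spec idea might look something like this - the module and both functions are made up, and only the docstring would ever be hand-edited:

```python
"""inventory.py - spec-first module; everything below the docstring is regenerated.

Public functions:
    reserve(sku, qty) -> bool
        Reserve qty units of sku; return False if stock is insufficient.
    release(sku, qty) -> None
        Return previously reserved units to stock.
"""

_stock = {"WIDGET": 5}  # toy in-memory stock table

def reserve(sku: str, qty: int) -> bool:
    if _stock.get(sku, 0) < qty:
        return False
    _stock[sku] -= qty
    return True

def release(sku: str, qty: int) -> None:
    _stock[sku] = _stock.get(sku, 0) + qty
```

The appeal is that the spec stays short and human-owned while the body is disposable; the worry above still applies, since nothing here pins down edge cases the way an explicit test would.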
But maybe I'm just not picturing what you mean exactly by "reports".
I've seen the code these models produce without a human programmer going over the results with care. It's still slop. Better slop than in the past, but slop none the less. If you aren't at minimum reading the code yourself and you're shipping a significant amount of it, you're either effectively the first person to figure out the magic prompt to get the models to produce better code, or you're shipping slop. Personally, I wouldn't bet on the former.
Yeah, these models have definitely become more useful in the last few months, but statements like "I don't need to read the code any more" still say more about the person writing them than about agents.
Have you traveled in Europe? Even without a crisis, gas stations are often way busier on the cheaper country's side of the border than on the more expensive one.
My friends living in Switzerland (near the border) always go to Germany to fuel up. And, even without a crisis, gas stations on the cheaper sides of borders are often way more crowded than on the other side.
Also, keep in mind that Slovenia is roughly the size of Los Angeles, or not much wider than Long Island. If fuel were 30% cheaper on one side of Long Island than on the other, I'm sure plenty of people wouldn't think twice about driving across.
Aside from the cost? It's also managing an actual human being and making sure they have enough work. If the place gets 5-10 calls a day, then it's pointless to hire a receptionist who will do nothing for an hour and then have a two-minute chat. It used to be pointless to build software to do that too, but since Claude Code it's cheap enough to make sense.
Receptionist-as-a-service has been a thing for, like... forever. You are never going to solve the problem of accurately estimating and quoting with AI or an answering service, so pay someone to answer the phone and take down the details; have a mechanic or trained service rep review and estimate. Cheap code that doesn't solve the problem is not cheap.
Yes, of course. The bot can request information and the customer can provide it if they feel like it, and then someone qualified can call them back when they have their hands free.
But there's no bot, per se, needed at all. An answering machine from 1993 can do this same information-gathering job. :)
So update the device from 1993's new-fangled digital answering machine to 2009's Google Voice, and have it do the transcription from voicemail to text.
Someone will still have to call Bill back about his Honda (which is actually the Kia he bought for his daughter -- Bill is not a very technical guy these days[1] and he confuses such concepts regularly) in order to get any trading of money for services done.
It doesn't take an LLM to get there, and Bill would probably prefer to avoid being frustrated by the bot's insistent nature.
Look, you're kicking at an open door.
I think LLMs applied like this are just a layer of complexity that is mostly replacing lower-level programming solutions that could do the same thing.
The transcription + callback loop is honestly underrated. Most of the value here is just capturing intent accurately ("Honda" vs "Kia" aside) so the mechanic can prioritize callbacks. A dumb voicemail-to-text pipeline handles that fine. The LLM layer adds complexity without solving the actual bottleneck, which is someone qualified picking up the phone.
But I'm not sure that a bot can be trusted to make good decisions about priority, either. Even if it can make good decisions based on context (which it increasingly often can, but does not always do), it lacks the context that is necessary to form the basis of those decisions.
Suppose a message comes into the box with this form: "This is Wendy, can you call me? My car is making that noise again."
The bot might deprioritize that call because it lacks actionable contextual information. "My job as a bot is to get more jobs into the shop. This call does not have enough data to do that, so I'll shove to the bottom of list of callbacks behind more-actionable jobs."
But the mechanic? The mechanic knows Wendy's Ford very well, and he also knows Wendy. She's been a good customer for over a decade. The mechanic also knows the noise, and that Wendy has 3 little kids and that she's vacationing 900 miles away on a road trip with those kids in that Ford. The context is all there inside of the mechanic's brain to combine and mean that this might be the highest-priority call he gets all week.
Wendy may not have actively relayed any urgency in her message, but the urgency is real, and she needs to be called back right away. She needs answers about what to do (keep driving and look into it when she gets back? pull over immediately and get a tow to a decent local shop? maybe she even needs help finding such a shop?) pretty much immediately. Not because it means more business today, but because it means more business for years.
The mechanic can spot this from a list of transcripts in an instant and give her a ring back Right Now. The bot is NFG at this.
The addition of the bot only adds noise to the process, and that noise only works to Wendy's detriment. When the bot adds detrimental noise to Wendy's situation, it also adds detriment to the shop's longevity.
The presence of the bot -- even as a prioritizing sorting mechanism -- asymptotically shifts the state from an excellent shop that knows their customers very well to a bot-driven customer-averse hellscape.
(And no, the answer isn't to make the bot into an all-knowing oracle that actively gets fed all context. The documentation burden would be more expensive, time-wise (and thus money-wise) than hiring a competent human receptionist who answers the phone, handles the front door traffic, and absorbs context from their surroundings. A person who chatted with Wendy last Thursday right before she left for her trip is always going to be superior to a bot.)
If someone put on their website and voicemail that they were available for calls only from 8-10am (for example), or that they would return my call at that time, I'd make a point to call them then. It's reasonable that people are busy too.
Instead of asking "what's next?", a good question to ask is "what jobs are now feasible that were previously constrained by the cost of producing software?"
I liked the Apple II and the TRS-80, as I rather liked BASIC. And then I didn't hate DOS, and then I actively hated the graphical shell of Windows 3, but could not afford a Macintosh - so I suffered through it where I had to, but mainly used DOS. Then I discovered UNIX and did almost all of my work on a timeshare - in the early 90s!
Then Windows 95 came out and I actively hated it, but did think it was amazingly pretty - somehow this was the impetus for me to get a pc again, which I put Windows NT on. Which was profitable for freelance gigs in college. Soon after that, I dual booted it to Linux and spent most of my time in Slackware.
After that, I graduated and had enough money to buy a second rig, which I installed OS/2 Warp on - which was good for side gigs. And which I really liked. A lot. But my day job required that I have a Windows NT box to shell into the Solaris servers that we ran. Then I got a better class of employer, and the next several let me run a Linux box to connect to our Solaris (or AIX) servers.
Next, my girlfriend at the time got a PowerBook G4 and installed OS X on it. It was obviously amazing. Windows XP came out, and it was once again so much worse than Windows NT - and crashed so much more - which was odd, as it was based on Windows NT. (Yes, 98 came before this, but it was really bad.) Anyhow, right about then the Linux box I was running at home died. It was obvious that I was not going to buy an XP box, so I bought my first Mac.
And it’s been the same for the last 25 years - every time I look at a Windows box it’s horrible. I pretty much always have a Linux box headless somewhere in the house, and one rented in the cloud, and a Mac for interacting with the world.
And like the parent I actively dislike windows. And that’s interesting because I’ve liked most other operating systems I’ve used in my life, including MS-DOS. Modern windows is uniquely bad.
I use Windows and absolutely hate the Mac UI. Having the current application's menu bar always at the top of the screen doesn't make any sense when you have a very big monitor. It only made sense with the tiny monitors available when the Mac UI was originally created.
Yeah, that is an annoyance for me too, but for a different reason. I have set the menu bar to appear only on the internal display (to avoid issues with my OLED external monitor), so when I have a window on the external monitor, I have to move the mouse over to the internal monitor's screen space whenever I want to open something in the app's menu bar.
On the other hand, it is actually useful that there is mostly one specific place to find settings etc., whereas on Windows/Linux it tends to vary by app (is there a bar on top of the window? Is there a button to expand a menu somewhere? Something else? Who knows).