Hacker News | new | past | comments | ask | show | jobs | submit | niklassheth's comments

So many problems with this:

The benchmark is totally useless. It measures single prompts and only compares output token counts, with no regard for accuracy. I could obliterate this benchmark with the prompt "Always answer with one word."

This line: "If a user corrects a factual claim: accept it as ground truth for the entire session. Never re-assert the original claim." That destroys any chance of getting pushback; any mistake you make in a prompt becomes catastrophic, since the model will never push back on it.

"Never invent file paths, function names, or API signatures." Might as well add "do not hallucinate".


Prompt engineering is back? I think not: for a year or two now I've gotten no better results from meta-prompts that are generic and/or copied from the internet.

“Make no mistakes”

Nice! Your comparison site is probably the best one out there for image models


I put the output from this tool into GPT-5-thinking. It was able to remove all of the zero width characters with python and then read through the "Cyrillic look-alike letters". Nice try!
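The cleanup described above is straightforward to reproduce. Here's a minimal sketch of that kind of de-obfuscation in Python, assuming the usual zero-width codepoints and a small, illustrative (not exhaustive) map of Cyrillic look-alike letters:

```python
import unicodedata

# Common zero-width / invisible codepoints used for text watermarking
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Illustrative map of Cyrillic look-alikes to their Latin counterparts
LOOKALIKES = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
              "\u0441": "c", "\u0440": "p", "\u0445": "x"}

def clean(text: str) -> str:
    out = []
    for ch in text:
        # Drop explicit zero-width chars and any "Cf" (format) characters
        if ch in ZERO_WIDTH or unicodedata.category(ch) == "Cf":
            continue
        out.append(LOOKALIKES.get(ch, ch))
    return "".join(out)
```

For a real defense you'd want the full Unicode confusables table rather than a hand-written map, but even this sketch defeats the simple obfuscation described.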


This is more evidence that Cognition's SWE-1.5 is a GLM-4.6 finetune


Can you provide more context for this? (eg Was SWE-1.5 released recently? Is it considered good? Is it considered fast? Was there speculation about what the underlying model was? How does this prove that it's a GLM finetune?)


People saw Chinese characters in generations from SWE-1.5 (Windsurf's model) and also in Cursor's model. This led to suspicions that the models are finetunes of Chinese models (which makes sense, as there aren't many strong US/EU coding models out there). GLM-4.5/4.6 are the "strongest" coding models atm (with DeepSeek-V3.2 and Qwen somewhat behind), so that's where the speculation came from. Cerebras serving them at roughly the same speeds kinda adds to that story (e.g. if it were something heavier like DeepSeek-V3 or Kimi K2, it would be slower).


Really appreciate this context. Thank you!


I suspect they are referencing the 950 tok/s claim on Cognition's page.


Ah. Thx. Blogpost for others: https://cognition.ai/blog/swe-1-5

Takeaway is that this is a Sonnet-ish model at 10x the speed.


Not at all. Any model with a somewhat-similar architecture and roughly similar size should run at about the same speed on Cerebras.

It's like saying Llama 3.2 3B and Gemma 4B are finetunes of each other because they run at similar speeds on NVIDIA hardware.


It seems like the repo is mostly if not entirely LLM generated; not a great sign.


I know some consumer cards have artificially limited FP64, but the AI-focused datacenter cards have physically fewer FP64 units. Recently, the GB300 removed almost all of them, to the point that a GB300 actually has lower FP64 TFLOPS than a 9-year-old P100. FP32 is the highest precision used during training, so it makes sense.


The majority of phones in the US are iPhones, especially in big cities where phone theft is most common.


I've also found it to be good at digging deep on things I'm curious about, but don't care enough to spend a lot of time on. As an example, I wanted to know how much sugar by weight is in a coffee syrup so I could make my own dupe. My searches were drowned out by marketing material, but ChatGPT found a datasheet with the info I wanted. I would've eventually found it too, but that's too much effort for an unimportant task.

However, the non-thinking search is total garbage. It searches once, and then gives up or hallucinates if the results don't work out. I asked it the same question, and it said that the information isn't publicly available.


Don’t sleep on Gemini Deep Research feature either. I use it for my car work and it beats ChatGPT’s offering at that price point every time.


I dunno, I use Deep Research from Claude, ChatGPT, and Gemini, and Gemini is the only one that ignores my requests and always produces the most inane high school student wannabe management consultant "report" with introduction and restatement of the problem and background and all that. Its "voice" (the prose, I mean, not text to speech) is so irritating I've stopped using it.

The other ones will do the thing I want: search a bunch, digest the results, and give me a quick summary table or something.


Gemini is high on hallucination. When I ask it about my own software it not only changes my own name to a similar one common in my language but also makes up stuff about our team saying some stranger works with us (he works in the same niche but that's about it).

It's annoying when it's so confident making up nonsense.

Imo ChatGPT is just a league above when it comes to reliability.


I just end up using both for research type things. They both end up doing better on certain topics or types of work. For $20/mo why not both :)

I like ChatGPT as a product more, but Gemini does well on many things that ChatGPT struggles with a little more. Just my anecdotes.


>Imo ChatGPT is just a league above when it comes to reliability.

Which is, in my opinion, the #1 metric an LLM should strive for. It can take quite some time to get anything out of an LLM. If the model turns out to be unreliable/untrustworthy, the value of its output is lost.

It's weird that modern society (in general) so blindly buys in to all of the marketing speak. AI has a very disruptive effect on society, only because we let it happen.


I like Gemini Deep Research because ChatGPT's has very low limits, but it is extremely on rails. Yesterday as an experiment I asked it to do a bunch of math rather than write a report, and it did the math but then wrote a report scolding me for not appreciating the beauty of the humanities.


I suppose it depends. I think of it the way this article suggests: it is very good at searching and scraping a lot of websites fast, and then summarizing that some.


I've found the same, but I also haven't gained much value out of "deep research" products as a whole. When I last tested them with topics I'm familiar with, I found the quality of research to be poor. These tools seem to spend their time searching for as much content as possible, then they dump it all into a report. I get better outcomes by extensively searching for a handful of top quality sources. Most of the time your question (or at least some subquestions) has already been answered by an expert, and you're better off using their work than sloppily recreating it.


This raises the question of what would be required to get an AI chatbot to emulate the process you (and others, including myself) use manually, and whether it's possible purely through different prompting.

Is the fundamental problem that it weights all sources equally so a bunch of non-experts stating the wrong answer will overpower a single expert saying the correct answer?


This post has some interesting suggestions about that: https://open.substack.com/pub/mikecaulfield/p/is-the-llm-res...

