Hacker News | XCSme's comments

Gemma 4 is great: https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...

I assume it is the 26B A4B one, if it runs locally?


No, only E2B and E4B.

I tried using Astro for https://aibenchy.com. Initially it went great, but then I ran into static-site limitations (such as dynamically generating all comparison pages, which would have meant generating N^4 pages, where N is the number of tested models).

I ended up switching to plain PHP, and it worked great. It is still mostly "static", but I can dynamically include the same content on multiple pages without having to duplicate/build it every time.
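The idea of one dynamic handler replacing thousands of pre-built pages can be sketched roughly like this (in Python here, though the site itself uses PHP; all names below are illustrative, not the real site's code):

```python
# Hypothetical sketch: one render function serves every comparison page,
# so adding a model does not multiply the number of pre-built pages.
# MODELS, render_card, and compare_page are made-up names for illustration.

MODELS = {
    "model-a": {"score": 8.3},
    "model-b": {"score": 8.1},
}

def render_card(name: str) -> str:
    # Shared partial: written once, included on every page that needs it.
    return f"<div class='card'>{name}: {MODELS[name]['score']}</div>"

def compare_page(a: str, b: str) -> str:
    # Built per request (like compare.php?a=...&b=...), not at build time.
    return "<main>" + render_card(a) + render_card(b) + "</main>"
```

With this shape, N models require one template instead of N^2 (or N^4) generated files.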


It does quite well on my limited/not-so-scientific private tests (note the tests don't include coding tests): https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...

3.6 Plus seems to be simply a refined/more consistent 3.5 Plus: https://aibenchy.com/compare/qwen-qwen3-5-plus-02-15-medium/...

Good work, it's quite close to Gemini 3 Pro in my tests, but 10x cheaper:

https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...


Why no (high) variants in the comparison models?

Good question! I might add them, but there were multiple reasons:

1. Most variants on HIGH/XHIGH provide only marginal improvements in accuracy, but at drastically increased latency and cost. One extreme example is Gemini 3.1 Flash Lite, which on HIGH used 1.5M reasoning tokens, and its cost was 5x that of running 5.3-Codex: https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...

2. On medium, most models seem to use a similar number of reasoning tokens, which should make for a fairer comparison.

3. Most models in the wild are used on medium (chat apps, default coding apps, tools, etc.).

4. Running models on HIGH/XHIGH can lead to huge costs for me in maintaining the test suite. I might add more models on high, if I can do it in a sustainable way.

5. Running models on HIGH would also make test suites take much longer to run, so the results wouldn't be published as fast.

6. Some models even show degradation when used on HIGH, as they tend to overthink/doubt themselves more. This seems to be a trend especially for new models, which were trained to say "wait, but" quite a lot...

Overall, I am happy with how the current leaderboard/comparisons work. I might test some models on high, but for me, a better indication of the true intelligence of a model/AGI is how well it does with "none"/no reasoning than how well it does with high.


Now try to use it to develop a simple app.

I don't have coding tests yet; I will add them soon.

Yup, they do quite poorly on random non-coding tasks:

https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...


Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.

Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)


Wild benchmark. Opus 4.6 is ranked #29, Gemini 3 Flash is #1, in front of Pro.

I'm not saying it's bad, but it's definitely different from the others.


The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.

> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.

Yuck. At that point, don't publish a benchmark; that explains why their results are useless too.

-

Edit since I'm not able to reply to the below comment:

"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.

I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.


Why not? I described this in more detail in other comments.

Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chatbots, article writing, tool usage, calling external APIs, parsing documents, etc.

Most models get this right. Also, this is just one failure mode of Claude.


Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON

I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.


The questions do ask specifically to respond with the answer only, with an example format given in many cases.

Note that all reasoning models are tested with "medium" reasoning.

The benchmarks are questions/data processing tasks that an average user will likely ask, not coding questions (I didn't add any coding tests yet).

Gemini models also tend to be very consistent. Asking the same question will likely give the same result.

The two models you mention scored the same, the only difference is that Gemini was better at domain-specific questions (i.e. you ask something quite technical/niche).


If Gemini 2.5 Flash and GPT 5.4 perform the same for you, I'm glad.

It's not a useful finding for the rest of the world, and I sure hope non-technical people aren't being taken in by a steaming pile that implies those LLMs perform similarly (among many other ridiculous findings), but c'est la vie.

Nowadays anyone can vibecode a "benchmark" with zero understanding of the domain, so what more should I expect?


It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.

I used qwen 3.5 plus in production, it was really good at instruction following and tool calling.

While I like these models, if you're getting similar results to SOTA models from 6 months ago, I have to question how far you pushed those models 6 months ago. It is really easy to find scenarios where these models really underperform. They take far more advanced harnesses to perform reasonably (hence the linked project). It's possible to get good results out of them, but it takes a lot of extra work.

I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.


We used Kimi 2.5; it's really good.

I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.

GLM 5 here is significantly better than GPT-5.4

It's 8.3 vs 8.1, I wouldn't call that significantly better.

I think GLM got a bit in front, because on some tests that both got wrong, GLM did sometimes (inconsistently) respond with the correct answer.

That being said, yes, in this case gpt-5.4 would probably edge in front as more and more tests are added, especially if coding tests were added (there are none yet).


Not really related, but does anybody know if somebody's tracking the same models' performance on benchmarks over time? Sometimes I feel like I'm being A/B tested.

Oh, I didn't think about this; that's a good idea. I also feel that model performance generally changes over time (usually it gets worse).

The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.


Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.

I was thinking that tokens spent in such a case could also be an interesting measure, but an agent might do some small useful refactoring along the way. Although the prompt could specify making the minimal change required to achieve the goal.


> Lite
> For small brands wanting to get started with monitoring and content.
> $249/month

Is $249/month something most small brands/shops can afford? Many have only a few $k in total revenue.


What about images, links? Formatted text like bold or underline?

I also prefer plain text, but in most of my emails I talk about technical stuff, or I send transactional emails that require actions, in which case showing buttons is a much better user experience than plain text.


I don’t want buttons in my emails.

But they are a lot easier to see and click (accessibility, larger hit area).

You could use larger text instead of a button, but changing font size is also HTML, and not plain text anymore.


Every MUA I've used allows the reader to set a font size, so changing font sizes is 100% a feature of plain-text email. The reader gets the link at whatever size they need to read it correctly, and it's absolutely easy to read. This here comment is plain text. Is it hard to read this link:

http://microsoft.com/

I don't think so. I certainly didn't have to resort to HTML to make that link readable and clickable.


I don’t have problems seeing and clicking normal text, thank you very much. I don’t want buttons on my emails.

I think the OP app is meant for creating transactional emails (or bulk-send emails like newsletters).

Those templates should account for all types of people and accessibility levels (including things like ADHD, where you need a big red button to click, otherwise you get overwhelmed by a block of text).


You can just send a link, and the user's client will probably highlight it even if it is plain text.

Yea, but how will they hide all the tracking URLs and base64 encoded PII from you in the email?

Using a URL shortener obviously. But you are right, if they only send plain text, they won't be able to include those 1x1 images at the bottom to track whether you have opened the email. Any sane email client blocks images by default, but whatever.

> What about images, links? Formatted text like bold or underline?

Easy. Don't.

That's the great bit. You don't have to.

https://useplaintext.email/


Why isn't this website plain text then?

Probably because it's a website and not email.

But I have to send the same sort of information (albeit shorter) via email on a regular basis.

A lot of alerts, reporting, quotes, code snippets, short documentation or step by step instructions, etc.

I don't just send emails to say "Hey, let's meet at 5". You know the "this could have been an email" memes; this is usually that case.

Just to be clear, most of those rich emails are the automatic/transactional emails.


Yeah, I get it, I unfortunately live in the real world too. I like to keep it plain text whenever possible but it's extremely useful sometimes to have inline screenshots and stuff like that.

I didn't mean to be sarcastic but it's just that to me, philosophically, email is a plaintext technology that had HTML bolted on to it kicking and screaming, and it's always been kind of crap. People like me hate things that are fundamentally ugly and crap even if they are useful. The web was designed for HTML from the start.


Change is always hard; even if it will be good in 20 years, the transitions are always tough.

Sometimes the transition is tough and then the end state is also worse!

Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.


Not sure if AI can have clever or new ideas; it still seems to me that it just combines existing knowledge and executes algorithms.

I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.


Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.

My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.


This is my take as well. A human who learns, say, a Towers of Hanoi algorithm, will be able to apply it and use it next time without having to figure it out all over again. An LLM would probably get there eventually, but would have to do it all over again from scratch the next time. This makes it difficult to combine lessons in new ways. Any new advancement relying on that foundational skill relies on, essentially, climbing the whole mountain from the ground.

I suppose the other side of it is that if you add what the model has figured out to the training set, it will always know it.


We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.

Clever/novel ideas are very often subtle deviations from known, existing work.

Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.


There is no such thing. All new ideas are derived from previous experiences and concepts.

The difference people are neglecting to point out is the experiences we have versus the experiences the AI has.

We have at least 5 senses, our thoughts, feelings, hormonal fluctuations, sleep and continuous analog exposure to all of these things 24/7. It's vastly different from how inputs are fed into an LLM.

On top of that we have millions of years of evolution toward processing this vast array of analog inputs.


So, just connect LLMs to lava lamps?

Jokes aside, imagine you give LLMs access to real-time, worldwide satellite imagery and just tell them to discover new patterns/phenomena and correlations in the world.


"extrapolation" literally implies outside the extents of current knowledge.

Yes, but not necessarily new knowledge.

It means extending/expanding something, but the information is based on the current data.

In computer games, extrapolation is finding the future position of an object based on its current position, velocity, and a desired time. We do get some "new" position, but the system's entropy/information is the same.

Or if we have a line, we can extend it infinitely and get new points, but this information was already there in the y = m * x + b line formula.
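As a rough sketch of that distinction (hypothetical function names, purely to illustrate the point):

```python
def extrapolate(pos: float, vel: float, dt: float) -> float:
    # A "new" future position, but fully determined by data we already had.
    return pos + vel * dt

def line_point(m: float, b: float, x: float) -> float:
    # Infinitely many "new" points, yet no information beyond m and b.
    return m * x + b
```

Both functions produce outputs never seen before, while adding zero information beyond their inputs.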


How would you know if it wasn't an extrapolation of current knowledge? Can you point me to something humans have done which isn't an extrapolation?

That was my point: "I am not necessarily saying humans do something different".
