Slight tangent: Interesting that they use o3-mini as the comparison rather than ...

kmod · 2025-03-25T18:25:01 1742927101

The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7, which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3, which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers", but the only mention I could find of o3-mini having a score of 86.5 was from this other source [2].

[1] https://openai.com/index/openai-o3-mini/
[2] https://www.vals.ai/benchmarks/aime-2025-03-24

You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.

logicchains · 2025-03-25T17:59:56 1742925596

It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.

FloorEgg · 2025-03-25T17:26:08 1742923568

I noticed this too. I have used both o1 and o3-mini extensively and have run many tests on my own problems: o1 solves one of my hardest prompts quite reliably, but o3-mini is very inconsistent. So from my anecdotal experience, o1 is the superior model in terms of capability.

The fact that they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model, let's just compare against that", but I think for anyone paying attention that decision will backfire.

boldlybold · 2025-03-25T17:19:05 1742923145

I find o3 at least faster to get to the response I care about, anecdotally.

PunchTornado · 2025-03-25T18:22:56 1742926976

Why would you compare against all the models from a competitor? You take their latest one that you can test. OpenAI and Anthropic don’t compare against the whole Gemini family.

jnd0 · 2025-03-25T17:36:11 1742924171

Probably because it is more similar to o3 in terms of size/parameters as well as price (although I would expect this to be at least half price).