
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.

I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.

Are there benchmarks showing o3-mini performing better than o1?



The benchmark numbers don't really mean anything: Google says that Gemini 2.5 Pro has an AIME score of 86.7, which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3, which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers," but the only mention of o3-mini having a score of 86.5 I could find was from this other source [2].

[1] https://openai.com/index/openai-o3-mini/
[2] https://www.vals.ai/benchmarks/aime-2025-03-24

You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.


It's a reasonable comparison given it'll likely be priced similarly to o3-mini. I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive.


I noticed this too. I have used both o1 and o3-mini extensively and have run many tests on my own problems: o1 solves one of my hardest prompts quite reliably, but o3-mini is very inconsistent. So from my anecdotal experience, o1 is the superior model in terms of capability.

The fact that they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, reasoning something like "o3 is the newest model, let's just compare against that," but for anyone paying attention, I think that decision will backfire.


Anecdotally, I find o3 at least faster at getting to the response I care about.


Why would you compare against all the models from a competitor? You take their latest one that you can test. Neither OpenAI nor Anthropic compares against the whole Gemini family.


Probably because it is more similar to o3 in terms of size/parameters as well as price (although I would expect this to be at least half price).



