To get the score they had on ARC AGI, they had to fine-tune o3 on ARC AGI. That is hardly a sign of general intelligence or emergent capability.

PhD-level novel research? What novel research has an LLM discovered since ChatGPT emerged? None. Despite all the knowledge these models accumulate in their weights, they haven't been able to connect the dots and autonomously discover things humans haven't already discovered.

Replace which software engineering job? They are useful, sure; good at benchmarks, yes; but not a drop-in replacement for any software engineer.