Most models are multi-paradigm, and so they get... Fixated on procedural language design. Concepts like the stack, backtracking, etc. violate the logic they've absorbed, leading to... Burning tokens whilst it corrects itself.
This won't show up in a smaller benchmark, because the clutching at straws tends to happen nearer to the edge of the window. The place where you can get it to give up obvious things that don't work, and actually try the problem space you've given.
This won't show up in a smaller benchmark, because the clutching at straws tends to happen nearer to the edge of the window. The place where you can get it to give up obvious things that don't work, and actually try the problem space you've given.