My unvalidated theory is that this comes down to the coding model's training objective: Tetris is fundamentally an optimization problem with delayed rewards. Some models seem to aggressively over-optimize toward near-term wins (clearing lines quickly), which looks good early but leads to brittle board states and catastrophic failures later. Others appear to learn more stable heuristics like board smoothness, height control, and long-term survivability, even if that sacrifices short-term score.
That difference in objective bias shows up very clearly in Tetris but is much harder to notice in typical coding benchmarks. Just a theory, though, based on reviewing the game results and logs.
Answered this in a comment above! It's not turn-based or visual-layout-based, since LLMs aren't trained that way. The representation is a JSON structure, and the LLMs plug in algorithms and keep optimizing them as the game state evolves.
Thanks for the clarification! Kind of reminds me of Brian Moore's AI clocks, which use several LLMs to continuously generate HTML/CSS analog clocks for comparison.
Curious how the token economics compare here to a standard agent loop. It seems like if you're using the LLM as a JIT to optimize the algorithm as the game evolves, the context accumulation would get expensive fast even with Flash pricing.
Thanks for all the questions! More details on how this works:
- Each model starts with an initial optimization function for evaluating Tetris moves.
- As the game progresses, the model sees the current board state and updates its algorithm—adapting its strategy based on how the game is evolving.
- The model continuously refines its optimizer, deciding when it needs to re-evaluate and when it should implement the next optimization function.
- The model generates updated code, executes it to score all placements, and picks the best move.
- The reason I reframed this as a coding problem is that Tetris is an optimization game by nature. At first I did try asking LLMs where to place each piece at every turn, but models are just terrible at visual reasoning. What LLMs are great at, though, is coding.
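To make the loop above concrete, here's a rough sketch of what one round of "generate an evaluator, score all placements, pick the best move" might look like. To be clear, this is my own illustrative stand-in, not the project's actual code: the function names (`score_board`, `best_move`), the board encoding (list of rows, 1 = filled, row 0 at the top), and the heuristic weights are all assumptions. In the real system the evaluator body would be LLM-generated and rewritten as the game evolves.

```python
# Hypothetical sketch of the harness loop. The evaluator below is a
# hand-written stand-in for the LLM-generated optimization function;
# the weights and helper names are illustrative, not from the project.

def column_heights(board):
    """Height of each column; board is a list of rows, 1 = filled, row 0 = top."""
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r  # first filled cell from the top sets the height
                break
        heights.append(h)
    return heights

def count_holes(board):
    """Empty cells with at least one filled cell somewhere above them."""
    rows, cols = len(board), len(board[0])
    holes = 0
    for c in range(cols):
        seen_block = False
        for r in range(rows):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes

def score_board(board):
    """Heuristic score (higher is better): penalize height, holes, bumpiness."""
    heights = column_heights(board)
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return -0.5 * sum(heights) - 2.0 * count_holes(board) - 0.2 * bumpiness

def best_move(candidate_boards):
    """Score every reachable placement's resulting board and pick the best."""
    return max(candidate_boards, key=score_board)
```

The key design point is that the model only has to write and revise `score_board`; the harness enumerates candidate placements and runs the evaluator over all of them, so no visual reasoning is ever required.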
Before I began the test, I thought the agents would be much better at this task than most humans -- after all, they should have better, more stateful memory than us. The results are intriguing.
Here are the scores from 10 attempts:
OpenAI operator: 5, 5, 6, 5, 5, 4, 6, 5, 5, 5
Anthropic computer use agent: 7, 9, 6 (rate limited), 12, 9, 7, 9, 11, 12, 6 (rate limited)