Insincere answer that will probably be attempted sincerely nonetheless: throw even more agents at the problem by having them do code review as well. The solution to problems caused by AI is always more AI.
Technically that's known as "LLM-as-judge" and it's all over the literature. The intuition is that the ability to choose between two candidates doesn't fully overlap with the ability to generate either one from scratch - evaluating is often easier than generating. It's a bit like the discriminator half of a generative adversarial network.
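To make that concrete, here's a minimal sketch of a pairwise LLM-as-judge pass. Everything here is illustrative: `complete` is a stub standing in for whichever model API you'd actually call, and the prompt wording is just one plausible shape, not a canonical template.

```python
# Sketch of an LLM-as-judge pairwise review. `complete(prompt) -> str`
# is a stand-in for a real model call; all names are illustrative.

def build_judge_prompt(diff_a: str, diff_b: str, criteria: str) -> str:
    """Ask the judge model to pick between two candidate patches."""
    return (
        "You are reviewing two candidate patches for the same task.\n"
        f"Criteria: {criteria}\n\n"
        f"Patch A:\n{diff_a}\n\n"
        f"Patch B:\n{diff_b}\n\n"
        "Answer with exactly 'A' or 'B', then one sentence of justification."
    )

def parse_verdict(reply: str) -> str:
    """Extract the judge's choice; fall back to 'A' on malformed output."""
    return "B" if reply.strip().upper().startswith("B") else "A"

# Stub so the sketch runs as-is; replace with a real API call.
def complete(prompt: str) -> str:
    return "B - tighter error handling."

prompt = build_judge_prompt("...", "...", "tests pass, idiomatic style")
print(parse_verdict(complete(prompt)))  # with the stub above, prints B
```

One known wrinkle worth handling in practice: judges show position bias, so it's common to run the comparison twice with A and B swapped and only accept a verdict when both runs agree.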
Simple, just ask an(other) AI! But seriously, different models are better/worse at different tasks, so if you can figure out which model is best at evaluating changes, use that for the review.
Surely humans are the ones initiating the agent though, no? Just do that at a measured pace. And set up comprehensive prompts/mechanisms to make sure the agent satisfies your criteria for tests, style, etc. - there are plenty of prompts and tools around the Cline/Roo community for doing stuff like that.
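For instance, Cline reads project-scoped instructions from a `.clinerules` file (Roo Code has an analogous rules mechanism). The contents below are an illustrative sketch of the kind of criteria people encode, not a canonical template:

```text
# .clinerules (illustrative example)
- Run the full test suite before proposing any commit; never weaken
  or delete a failing test to make it pass.
- Follow the existing code style; run the project formatter/linter
  and fix every warning it reports.
- Keep diffs minimal: no drive-by refactors outside the task scope.
- Stop and ask before adding any new dependency.
```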