I have an idea. What if we used a third LLM to evaluate how good the secondary LLM is at critiquing the primary LLM?
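
Roughly, the setup could look like the sketch below. It assumes each model is just a function from a prompt string to a reply string. The names (`judge_critique`, `stub_primary`, `stub_critic`, `stub_judge`), the prompt wording, and the 1-to-5 rating scale are all made up for illustration and don't come from any real library. The stubs stand in for actual model calls so the thing runs on its own.

```python
import re
from typing import Callable, Optional

# Hypothetical type: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def judge_critique(primary: LLM, critic: LLM, judge: LLM, task: str) -> dict:
    """Run primary -> critic -> judge and return all three outputs."""
    # 1. The primary model answers the task.
    answer = primary(task)

    # 2. The secondary model critiques that answer.
    critique = critic(
        f"Task: {task}\nAnswer: {answer}\nCritique this answer."
    )

    # 3. The third model rates the critique itself, not the answer.
    verdict = judge(
        f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
        "Rate the critique from 1 (useless) to 5 (excellent). "
        "Reply with the number only."
    )

    # Take the first digit 1-5 in the reply; None if the judge gave no rating.
    match = re.search(r"[1-5]", verdict)
    score: Optional[int] = int(match.group()) if match else None

    return {"answer": answer, "critique": critique, "critique_score": score}


# Stub models (assumptions) so the sketch runs without any API access.
def stub_primary(prompt: str) -> str:
    return "2 + 2 = 5"


def stub_critic(prompt: str) -> str:
    return "The sum is wrong; 2 + 2 = 4."


def stub_judge(prompt: str) -> str:
    return "5"


result = judge_critique(stub_primary, stub_critic, stub_judge, "What is 2 + 2?")
print(result)
```

To try it for real, you'd swap the stubs for functions that call whatever models you have. The obvious catch is that nothing here checks whether the third model is any good at judging, so the same question comes back one level up.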