> My observation is that the [o1-like] models are better at evaluating than they are generating
This is very good (it shows that the out-loud reasoning works well as judgement), but at this stage we face an architectural problem. The "model, exemplary" entities will iteratively judge and, in the process, both:

* approximate their world model towards progressive truthfulness and completeness, and
* refine their judgement abilities and general intellectual proficiency.

That requires, in a way, that the main body of knowledge (including "functioning": proficiency in the better processes) be updated. The architectures I currently know of are static... Instead, we want them to learn: to understand (not memorize), e.g., that Copernicus is better than Ptolemy, and to use the gained intellectual keys in subsequent relevant processes.
The main body of knowledge - notions, judgements, and abilities - should be updated in a permanent way, so that it grows (the way natural minds can).
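To make the shape of that loop concrete, here is a minimal Python sketch of a judge-then-consolidate cycle. Every name in it (`Model`, `generate`, `judge`, `consolidate`) is a hypothetical placeholder rather than any existing API, and the set standing in for the weights is a deliberate caricature; the only point is the structure in which the model's own evaluations feed a permanent update.

```python
# A minimal sketch of the evaluate-then-consolidate loop described above.
# All names (Model, generate, judge, consolidate) are hypothetical
# placeholders, not a real API: what matters is the loop's shape, where
# the model's own judgements are folded back into its body of knowledge.

import random
from dataclasses import dataclass, field


@dataclass
class Model:
    # Stand-in for the "main body of knowledge". Here it is just a growing
    # set of accepted claims; in a real system it would be the weights.
    knowledge: set = field(default_factory=set)

    def generate(self, question: str, n: int = 4) -> list[str]:
        # Placeholder generator: propose n candidate answers.
        return [f"{question} :: candidate-{i}" for i in range(n)]

    def judge(self, candidate: str) -> float:
        # Placeholder judge: score a candidate. The premise above is that
        # this evaluation step is already stronger than generation.
        return random.random()

    def consolidate(self, accepted: str) -> None:
        # The piece missing from static architectures: a permanent update,
        # so the gained "intellectual key" is available to later episodes.
        self.knowledge.add(accepted)


def improvement_loop(model: Model, questions: list[str]) -> None:
    for q in questions:
        candidates = model.generate(q)
        best = max(candidates, key=model.judge)  # evaluation selects
        model.consolidate(best)                  # selection persists


model = Model()
improvement_loop(model, ["Ptolemy vs Copernicus"])
print(model.knowledge)
```

In a real system, `consolidate` would have to be an actual weight update (e.g. fine-tuning on the self-preferred outputs), and that is exactly where the static architectures mentioned above stop short.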