Agree on all points!

The difference between Gemma and Qwen here is that Qwen followed a much more detailed process: it considered leap years and seconds in its calculations, whereas Gemma used estimates like "roughly x years".

fc417fc802 a day ago | parent [-]

Turns out I wasn't reading closely enough. Notice that it first comes up with the number out of thin air, before doing any of the math that is supposed to "verify" it.

Following this charade, the "precise check" using "common tools" (which it does not have access to) pulls an entirely different number out of thin air.

It then asks whether this new, different number is correct, checks by "converting it back" with a utility it doesn't have access to, declares success, and then prints this second number.

Both numbers are wrong.
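For contrast, here is a minimal sketch of what a real "convert it back" check looks like when a tool is actually run. It assumes the task was a date-to-Unix-timestamp conversion, which is my guess; the example date and the helper names are hypothetical and are not taken from the model's output. Python's `datetime` handles leap years, and Unix time ignores leap seconds by definition.

```python
from datetime import datetime, timezone


def to_unix_seconds(dt: datetime) -> int:
    """Whole seconds since 1970-01-01T00:00:00Z for a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(dt.timestamp())


def from_unix_seconds(seconds: int) -> datetime:
    """Inverse of to_unix_seconds, always returning a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def round_trip_ok(dt: datetime) -> bool:
    """True if converting to seconds and back reproduces the same instant."""
    return from_unix_seconds(to_unix_seconds(dt)) == dt.replace(microsecond=0)


# Hypothetical example date, chosen only for illustration.
example = datetime(2000, 1, 1, tzinfo=timezone.utc)
seconds = to_unix_seconds(example)
print(seconds, round_trip_ok(example))
```

The point is that the check only means something because both directions are computed, not asserted.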

The fact that I was so easily misled on such a basic task, when I was actively interested in where things had gone wrong, is concerning to say the least. I'm beginning to think that thinking traces are actually quite nefarious in many contexts, and that the entire exercise is some sort of trained hallucination task rather than anything even remotely resembling what's actually going on.

	imtringued a day ago | parent [-]
		There were research papers that showed that even just printing out dots in the thinking phase improves performance.