fc417fc802 a day ago

I find it interesting that it decided to write a script despite not having access to tools, and is apparently aware of this lack of access since it then proceeds to do the computation manually.

It's impressive it got as close as it did with estimates (and that it can actually do basic math now). Yet then it goes "let's do a precise check using tools" and just blatantly makes the whole thing up. Comedic but also worrisome.

I find the entire sequence pretty weird. It's such a bizarre mix of competence with blatant incompetence that borders on deceit.

neonstatic a day ago | parent [-]

Agree on all points!

The difference between Gemma and Qwen here is that Qwen followed a much more detailed process: it considered leap years and seconds in its calculations (where Gemma used estimates like "roughly x years").

fc417fc802 a day ago | parent [-]

Turns out I wasn't reading closely enough. Notice that it first comes up with the number out of thin air, prior to the math that is supposed to "verify" it.

Following this charade, the "precise check" using "common tools" (which it does not have access to) pulls an entirely different number out of thin air.

It then asks if this new different number is correct, checks by "converting it back" with a utility it doesn't have access to, declares success, and then prints this second number.

Both numbers are wrong.

The fact that I was so easily misled on such a basic task when I was actively interested in where things had gone wrong is concerning to say the least. I'm beginning to think that thinking traces are actually quite nefarious in many contexts and that the entire exercise is some sort of trained hallucination task as opposed to even remotely resembling what's actually going on.

imtringued a day ago | parent [-]

There have been research papers showing that even just printing out dots in the thinking phase improves performance.