fellowniusmonk a day ago

In all my unpublished tests, which focus on 1. unique logic puzzles that are intentionally adjacent to existing puzzles and 2. implementing a specific, unique CRDT algorithm that is not particularly common but has an official reference implementation on GitHub (so the models have definitely been trained on it), I find that 5.2 overfits to the more common implementation and will actively break working code and puzzles.

I find it pattern matches incorrectly with a very narrow focus, and it will ignore real, documented differences even when they are explicitly highlighted in the prompt text (this is X CRDT algo, not Y CRDT algo).

I've canceled my subscription. On any larger edit it will just start wrecking nuance and then refuse to accept prompts that point this out, which is an extremely dangerous form of target fixation.

pillefitz 21 hours ago | parent [-]

How does Claude perform?

fellowniusmonk 21 hours ago | parent [-]

They all have difficulty with certain CRDT types in general. Opus 4.5 needs a round of ask first so you can give it clarifying instructions, but then it's fine. Neither gets it perfectly in one shot; Claude, if you jump straight into agent, won't break code but will churn for a bit.