| ▲ | nsingh2 4 hours ago | |
Oh, this seems bad, and it is fairly easy to reproduce using codex cli. Give it a puzzle prompt that it has to reason about and solve, and occasionally it will seemingly short-circuit: it thinks for exactly 516 tokens and returns the wrong result. When it uses 6000-8000 thinking tokens, it returns the correct result. Maybe some issue with adaptive thinking? Another point for local models, I guess: you don't have to worry about silent server-side changes. Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4 had this 516-thinking-token issue, and every one of those produced the wrong solution. So in roughly 40% of runs, 5.5 xhigh could be short-circuiting and degrading performance. Granted, the sample size is small. | ||
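To put a number on "the sample size is small": a rough sketch, assuming the 10 runs are independent, that computes a 95% Wilson score interval for the 4-out-of-10 rate. The function name `wilson_interval` is just illustrative, and the Wilson interval is my choice of method here, not something the runs themselves dictate.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width

# 4 of 10 runs hit the 516-token short circuit (figures from the comment above).
low, high = wilson_interval(4, 10)
print(f"short-circuit rate: 40%, 95% interval roughly {low:.0%} to {high:.0%}")
```

With only 10 runs the interval is wide (somewhere around 17% to 69%), so the true rate could plausibly be well below or above 40%. More runs would tighten it.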
| ▲ | dannyw 35 minutes ago | parent | next [-] | |
I wonder whether testing at different times of day or on different days would show a pattern. For example, does the short-circuiting happen more often during workday peak hours? | ||
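One way to check this, as a minimal sketch: log each run's timestamp and thinking-token count, then bucket by hour of day. The record format `(timestamp, thinking_tokens)` and the function name are hypothetical, and treating exactly 516 tokens as the short-circuit signature is an assumption based on the report upthread.

```python
from collections import defaultdict
from datetime import datetime

SHORT_CIRCUIT_TOKENS = 516  # thinking-token count reported upthread (assumed signature)

def short_circuit_rate_by_hour(runs):
    """runs: iterable of (timestamp: datetime, thinking_tokens: int).

    Returns {hour_of_day: (short_circuit_runs, total_runs, rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # hour -> [short-circuit runs, total runs]
    for timestamp, thinking_tokens in runs:
        bucket = counts[timestamp.hour]
        bucket[1] += 1
        if thinking_tokens == SHORT_CIRCUIT_TOKENS:
            bucket[0] += 1
    return {hour: (s, n, s / n) for hour, (s, n) in sorted(counts.items())}
```

Timestamps should all be in one timezone before bucketing, and each hour needs a fair number of runs before any difference between peak and off-peak hours means much.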