GodelNumbering 6 hours ago
> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7 and 51.9% for Opus 4.6. (Mythos preview is at 27.6%.)