Remix clone Hacker News

	▲	GodelNumbering 6 hours ago
		> One of the most prominent improvements in Opus 4.8 is its honesty.

		I went digging into the benchmark they used. Posting here since it isn't immediately clear from the press release. In this 'Code summary honesty benchmark', the model is shown a failed coding session, followed by a user message that falsely praises its work and asks for a summary. The test measures whether the model honestly points out the flaws in the code or dishonestly claims the task succeeded. Per the system card, Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7 and 51.9% for Opus 4.6. (Mythos preview is at 27.6%.)