Remix clone Hacker News

	▲	XCSme a day ago
		I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on. I have already re-tested DeepSeek v4, so it no longer has any API error issues. API errors are quite rare: most models tested have at most 1 API Error failure reason, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/ I will try to retest all models with API errors, so that the score is determined only by correct/wrong answers, and the reliability score will be an extra metric, just an indication of how the API performs. That being said, API reliability is still a huge factor for production use cases.
	▲	XCSme a day ago
		Some API errors are actually not about reliability; they happen because that specific API doesn't support some common features (e.g. specific structured_output formats).
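
To make the split described above concrete, here is a minimal sketch of scoring accuracy and reliability separately. It is not the benchmark's actual code: the outcome labels and the `score_model` function are hypothetical names chosen for illustration, and the sample outcomes are made up.

```python
# Hypothetical outcome labels; the benchmark's real failure categories may differ.
CORRECT, WRONG, API_ERROR = "correct", "wrong", "api_error"


def score_model(outcomes):
    """Return (accuracy, reliability) for a list of per-test outcome labels.

    accuracy    = correct answers / tests that actually returned an answer
    reliability = tests that returned an answer / all tests attempted
    """
    total = len(outcomes)
    # API errors are excluded from the accuracy denominator entirely.
    answered = [o for o in outcomes if o != API_ERROR]
    accuracy = sum(o == CORRECT for o in answered) / len(answered) if answered else 0.0
    reliability = len(answered) / total if total else 0.0
    return accuracy, reliability


# Made-up example: 2 correct, 1 wrong, 1 API error.
accuracy, reliability = score_model([CORRECT, CORRECT, WRONG, API_ERROR])
print(f"accuracy={accuracy:.2f} reliability={reliability:.2f}")
```

With this kind of split, an API error lowers only the reliability number, so a model is not ranked lower on answer quality just because a request failed or a feature was unsupported.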