admax88qqq 2 hours ago

> beats Claude in our Cyber Benchmarks

Beats which Claude model? Whenever a "benchmark" doesn't put precise model numbers in its headline, I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).

It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.

InsideOutSanta 2 hours ago

They say "Claude Opus 4.8" in the first paragraph.

crm9125 an hour ago

We're supposed to read the article?

How are we supposed to stay skeptical of everything if we read anything!?

ls612 2 hours ago

Opus 4.8, according to TFA. Whether or not the safety guardrails were responsible for the difference is an open question, but for a dev who wants to secure their software and doesn't work at one of the blessed Glasswing companies, it doesn't really matter why. What matters is the best tool you actually have.