po1nt 6 hours ago

From everything I've read, I'm pretty convinced that Mythos is just a standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as well as Mythos.

SwellJoe 5 hours ago | parent | next [-]

Early on, I had a vague suspicion that the reason some of the Chinese models, including quite small ones, perform so well on this task, especially relative to their size and cost, is that they don't have the same software-security guardrails baked in that US models seem to have. Gemini 3.1 Pro doing so poorly sort of reinforced that gut feeling.

But then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved the idea that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4 yet, where I gave it multiple opportunities, but the dense version consistently found four of the nine bugs exactly. It also occasionally found two other very difficult bugs, sometimes with a not-quite-accurate description (which gets partial credit in its own column on the big benchmark), for six altogether. That leaves three bugs in the corpus that no model other than Mythos ever found, and it makes Gemma 4 31B the best model I have results for (though it got multiple attempts, which I assume would make any of the models perform better).

So my conclusion, not very strongly held, is that Mythos is both better than other public models and has fewer guardrails, but also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails; I'm not sure, since it failed to find any bugs. Maybe it just sucks.
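To make the "multiple attempts" caveat concrete, here is a minimal sketch of how exact and partial credit might be tallied across repeated runs. It is an illustration only, not the benchmark's actual scoring code: the `Credit` levels, both function names, and the bug IDs and attempt data are all hypothetical.

```python
from enum import IntEnum


class Credit(IntEnum):
    """Hypothetical credit levels; higher is better."""
    MISSED = 0
    PARTIAL = 1  # bug located, but the description is not quite accurate
    EXACT = 2    # bug located and described accurately


def best_credit_per_bug(attempts, bug_ids):
    """Keep the best credit each bug received across all attempts.

    attempts: list of dicts mapping bug ID -> Credit for a single run.
    bug_ids:  every bug in the corpus, so unfound bugs count as MISSED.
    """
    best = {bug: Credit.MISSED for bug in bug_ids}
    for attempt in attempts:
        for bug, credit in attempt.items():
            if bug not in best:
                raise KeyError(f"unknown bug ID: {bug}")
            best[bug] = max(best[bug], credit)
    return best


def summarize(best):
    """Count bugs per credit level, with partial credit in its own column."""
    return {
        "exact": sum(1 for c in best.values() if c == Credit.EXACT),
        "partial": sum(1 for c in best.values() if c == Credit.PARTIAL),
        "missed": sum(1 for c in best.values() if c == Credit.MISSED),
    }


# Made-up data: three bugs, three runs of the same model.
bug_ids = ["A", "B", "C"]
attempts = [
    {"A": Credit.EXACT, "B": Credit.PARTIAL},
    {"B": Credit.EXACT},
    {},  # a run that found nothing
]
summary = summarize(best_credit_per_bug(attempts, bug_ids))
print(summary)
```

Because only the best result per bug is kept, every extra run can raise a model's score but never lower it, which is why a multi-attempt result is not directly comparable to single-attempt results from other models.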

scorpioxy 4 hours ago | parent | next [-]

Can you elaborate on the safety guardrails "regarding software security that US models seem to have"? According to blog posts I read, the generated code had security problems, and naive ones at that. Perhaps it has gotten better, or people have learned not to blindly vibe-code applications meant for public use, but it certainly didn't feel like there were security guardrails.

SwellJoe 4 hours ago | parent [-]

I'm talking about guardrails that prevent finding exploits, which is only peripherally related to writing secure code.

This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.

coldtea 4 hours ago | parent | prev | next [-]

>But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes.

Did it "disprove" it retroactively, or did it just change the situation, given that until then they were indeed weaker at small sizes?

SwellJoe 4 hours ago | parent [-]

I don't know. I think it proves that if Google is baking guardrails into their models that prevent them from finding security bugs, they didn't bake those guardrails into Gemma 4, because it is very good at it. Maybe that means Google devs had a change of heart. Maybe it means something about the Gemma 4 architecture is better for this task than Gemini 3.1 Pro's. Gemini Flash 3.5 did OK, though.

Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.

pbgcp2026 an hour ago | parent | prev [-]

I concur with "Gemma 4 31B the best model I have results for". My workflow includes a lot of Gemma 4, specifically the dense 31B non-quantised version. (BTW, I found it is most cost-effective to run on Bedrock.)

kevinh456 5 hours ago | parent | prev | next [-]

Fable, the same model as Mythos with extra safety controls, was much faster, more accurate, and more token-efficient than previous models. What I got done with it in 48 hours took my personal project from concept to deployed prototype.

pbgcp2026 an hour ago | parent [-]

Fable is not simply the same model as Mythos with guardrails added. There are many things that Project Glasswind never disclosed, and probably never will.

cheeze 5 hours ago | parent | prev [-]

Why wouldn't OpenAI offer the same?

pbgcp2026 an hour ago | parent [-]

My bet is actually on GLM. Z.ai does amazing work and they will overtake Western models, IMO faster than DS or Qwen will. They have an amazing team and a very capable, smart leader.