ferguess_k 5 hours ago

Are we already at, or close to, the point where well-trained LLMs are more efficient at finding security holes than all but the best developers out there, even for OS kernel code? Can someone educate me on this?

stratos123 4 hours ago | parent | next [-]

In terms of quantity, definitely yes (a single person managing a swarm of Opusi can already find many more real bugs than a security researcher, hence the rise in reports).

In terms of quality ("are there bugs that professional humans can't see at any budget but LLMs can?") - it's not very clear, because Opus is still worse than a human specialist, but Mythos might be comparable. We'll just have to wait and see what results Project Glasswing gets.

Either way, cybersecurity is going to get real weird real soon, because even slightly-dumb models can have a large effect if they are cheap and fast enough.

EDIT: Mozilla thinks "no" to the second question, by the way: "Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher.", when talking about the 271 vulnerabilities recently found by Mythos. https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...

DanielHB 4 hours ago | parent | next [-]

There is also a huge surface area of security problems that can't happen in practice due to how other parts of the code work. A classic example is unsanitized input being used somewhere where untrusted users can't inject any input.

Being flooded with this kind of report can make the actual real problems harder to see.
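
The reachability point above can be sketched as a graph search: a flagged sink only matters if some entry point that untrusted users control can reach it. This is a minimal illustration, not how any particular scanner works. The call graph, the function names, and the choice of `http_handler` as the only untrusted entry point are all made up for the example.

```python
from collections import deque

# Hypothetical call graph: caller -> callees. Every name here is invented.
CALL_GRAPH = {
    "http_handler": ["parse_query", "render_page"],
    "parse_query": ["build_sql"],
    "render_page": [],
    "admin_cli": ["load_config"],
    "load_config": ["run_shell"],
    "build_sql": [],
    "run_shell": [],
}

# Entry points an untrusted user can drive (an assumption of this sketch).
UNTRUSTED_ENTRY_POINTS = {"http_handler"}


def reachable_from(graph, roots):
    """Return every function reachable from the given roots (breadth-first)."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def triage(findings, graph, entry_points):
    """Split flagged functions into reachable and unreachable by untrusted input."""
    exposed = reachable_from(graph, entry_points)
    real = [f for f in findings if f in exposed]
    unreachable = [f for f in findings if f not in exposed]
    return real, unreachable


if __name__ == "__main__":
    real, unreachable = triage(
        ["build_sql", "run_shell"], CALL_GRAPH, UNTRUSTED_ENTRY_POINTS
    )
    print("reachable by untrusted input:", real)
    print("not reachable by untrusted input:", unreachable)
```

In this toy graph both `build_sql` and `run_shell` consume unsanitized strings, but only `build_sql` sits behind the untrusted entry point, so only that report deserves attention first.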

chuckadams 4 hours ago | parent | prev [-]

> Opusi

The plural of "Opus" is "Opera". Might be a tad confusing tho :)

skeledrew 3 hours ago | parent [-]

Wondered for a second "what does that browser have to do with all this?"

yk 4 hours ago | parent | prev | next [-]

My theory is that a lot of security bugs are low-hanging fruit for LLMs, in the sense that finding them is a bit tedious but not especially hard pattern matching. (Let's see: the free occurs in foo(), so if I trigger bar() after foo() then I have a use-after-free; that should be possible if I trigger an exception in baz::init().)
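
To make the "tedious pattern matching" concrete, here is a toy checker that scans one recorded sequence of memory events and flags any use of a pointer after it was freed. It is a sketch under heavy assumptions: real analysis has to reason about many possible paths through the code, while this looks at a single hypothetical trace, and the event format and names are invented for the example.

```python
def find_use_after_free(trace):
    """Return (index, pointer) for each use or second free of a freed pointer.

    `trace` is a list of (operation, pointer_name) pairs, where operation
    is "alloc", "free", or "use".
    """
    freed = set()
    problems = []
    for index, (operation, pointer) in enumerate(trace):
        if operation == "alloc":
            freed.discard(pointer)  # the pointer is live again
        elif operation == "free":
            if pointer in freed:
                problems.append((index, pointer))  # double free
            freed.add(pointer)
        elif operation == "use":
            if pointer in freed:
                problems.append((index, pointer))  # use after free
    return problems


if __name__ == "__main__":
    # Hypothetical ordering: foo() frees the buffer, then bar() touches it.
    trace = [
        ("alloc", "buf"),
        ("use", "buf"),
        ("free", "buf"),  # happens inside foo()
        ("use", "buf"),   # happens inside bar(), called after foo()
    ]
    print(find_use_after_free(trace))
```

The hard part in practice is the step the comment describes: finding an input that actually produces the foo()-then-bar() ordering.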

toast0 2 hours ago | parent | prev | next [-]

Efficiency in finding isn't really the metric to consider. I'm sure a good security person could look at these and find the bugs, but nobody did.

IMHO, if you were to do a manual audit of the Linux kernel, the first thing to do is exclude all the stuff you're never going to run, because why spend time on it?

These scans are looking at everything, because once you set it up, the incremental cost to look at everything is not so bad.

This is going to push lesser used stuff out of the mainline, which sucks for people who were using it, but is better for everyone else.

jcalvinowens 4 hours ago | parent | prev | next [-]

My experience with these tools is that they generate absolutely enormous amounts of insidiously wrong false positives, and it actually takes a decent amount of skill to work through the 99% which is garbage with any velocity.

Of course some people don't do that, and send all the reports anyway... and then scream from the hilltops about how incredible LLMs are when by sheer luck one happens to be right. Not only is that blatant p-hacking, it's incredibly antisocial.

It's disingenuous marketing speak to say LLMs are "finding" any security holes at all: they find a thousand hypotheticals of which one or two might be real. A broken clock is right twice a day.
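
For a sense of scale, the "a thousand hypotheticals of which one or two might be real" figure can be turned into a rough triage budget. The 15 minutes per report is purely an assumption for illustration and is not a number from this thread.

```python
def triage_cost(total_reports, real_bugs, minutes_per_report):
    """Return (precision, total triage hours, triage hours per real bug)."""
    precision = real_bugs / total_reports
    hours_total = total_reports * minutes_per_report / 60
    hours_per_real_bug = hours_total / real_bugs
    return precision, hours_total, hours_per_real_bug


if __name__ == "__main__":
    # 1000 reports and 2 real bugs come from the comment above;
    # 15 minutes of review per report is an assumed figure.
    precision, hours_total, hours_per_real_bug = triage_cost(1000, 2, 15)
    print(f"precision: {precision:.1%}")
    print(f"total triage time: {hours_total:.0f} hours")
    print(f"triage time per real bug: {hours_per_real_bug:.0f} hours")
```

Changing either input moves the result a lot, which is why the disagreement in this thread is largely about what the real precision is today.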

binaryturtle 4 hours ago | parent | next [-]

I used GitHub's Copilot once and let it check one of my repositories for security issues. It found countless issues (like 30 or 40 or so for a single PHP file of some ~400 lines). Some even sounded reasonable enough, so I had a closer look, just to make sure. In the end none of it was an issue at all. In some cases it invented problems which would have forced me to add wild workaround code around simple calls into the PHP standard library. And that was the only time I wasted my time with that. :D

NitpickLawyer 4 hours ago | parent | prev | next [-]

Your experience seems to be at least 3-6 months old. Longtime kernel maintainers have recently written on this subject. They say that ~3 months ago the quality and accuracy of the reports crossed a threshold, and the reports are now legitimately useful.

jcalvinowens 3 hours ago | parent [-]

The experience I'm describing was two weeks ago.

Yes, what we see coming out of the bottom of the funnel now is a little better. But it's sort of like reading day trading blogs: nobody shares their negative results, which in my direct experience are so bad they almost negate any investigative benefit. I also think part of this is that a small set of very prolific spammers were sufficiently discouraged to stop.

Legend2440 4 hours ago | parent | prev | next [-]

This is incorrect. Here's the curl maintainer talking about dozens of bugs found using LLMs: https://daniel.haxx.se/blog/2025/10/10/a-new-breed-of-analyz...

warkdarrior 3 hours ago | parent [-]

From the curl blog post:

> "Remarkably few of them complete false positives."

defmacr0 3 hours ago | parent [-]

That's worse than a report that can be easily dismissed

bri3d 3 hours ago | parent | prev [-]

I strongly disagree with this take, and frankly, this reads like the state of "research" pre-LLMs where people would run fuzzers and scripted analysis tools (which by their nature DO generate enormous amounts of insidiously wrong false positives) and stuff them into bug bounty boxes, then collect a paycheck when one was correct by luck.

Modern LLMs with a reasonable prompt and some form of test harness are, in my experience, excellent at taking a big list of potential vulnerabilities and figuring out which ones might be real. They're also pretty good, depending on the class of vuln and the guardrails in the model, at developing a known-reachable vulnerability into real exploit tooling, which is also a big win. This does require the _slightest_ bit of work (ie - don't prompt the LLM with "find possible use after free issues in this code," or it will give you a lot of slop; prompt the LLM with "determine whether the memory safety issues in this file could present a security risk" and you get somewhere), but not some kind of elaborate setup or prompt hacking, just a little common sense.

LeCompteSftware 3 hours ago | parent | prev | next [-]

"Even for OS kernel code" is doing a lot of work. What you really mean is "legacy C code" and yes, since about 6 months ago these systems have gotten reliable enough that they are basically superhuman at identifying buffer overflows / etc. A remarkable number of these bugs are fixed by adding a (if (length > MAX_BUFFER) {return -1;}), just the classic C footguns. Even as a huge LLM skeptic I am not too too surprised that these systems might be superhuman at finding tedious tricky stuff like this.

At the same time, a lot of these bugs were in places that people weren't looking because the code isn't actually important. This kernel code had already been a longstanding problem in terms of low-effort bot-driven security reports and nobody had any interest in maintaining it. So this was more LLM-assisted technical management than LLM-assisted security: it finally made a situation uncomfortable enough for the team to do something about it.

Another example: Mythos found a real bug in FreeBSD that occurs when running as an NFS server with a public connection. But... who on earth is doing that? I would guess 99.9% of FreeBSD NFS installations are on home LANs. More importantly, Anthropic spent $20,000 to find this bug. Just think in terms of paying a full-time FreeBSD dev for a month and that's what they find: I'd say "ok, looks like FreeBSD has a pretty secure codebase, let's fix that stupid bug, stop wasting our money, and get you on a more exciting project."

I do think anyone who has a legacy open-source C/C++ codebase owes it to their users to run it by Claude/Codex, check your pointers and arrays, make sure everything looks ok. I just wish people were able to discuss it in proper context about other native debugging tools!

traceroute66 4 hours ago | parent | prev | next [-]

> well-trained LLMs are more efficient in finding security holes than all but the best developers out there, even for OS kernel code?

No.

Like everything else an LLM touches, it is prone to slop and hallucinations.

You still need someone who knows what they are doing to review (and preferably manually validate) the findings.

What all this recent hype carefully glosses over is the volume of false-positives. I guarantee you it is > 0 and most likely a fairly large number.

And like most things LLM, the bigger the codebase the more likely the false-positives due to self-imposed context window constraints.

It's all very well these blog posts saying "LLM found this serious bug in Firefox". Well yeah, but that's only because the security analyst filtered out all the junk (and knew what to ask the LLM in the prompt in the first place).

stratos123 4 hours ago | parent [-]

A 0% false-positive rate is not necessary for LLM-powered security review to be a big deal. It was worthless a few months ago, when the models were terrible at actually finding vulnerabilities and so basically all the reports were confabulated, with a false positive rate of >95%. Nowadays things are much better - see e.g. [1] by a kernel maintainer.

Another way to see this is that you mentioned "LLM found this serious bug in Firefox", but the actual number in that Mozilla report [2] was 14 high-severity bugs, and 90 minor ones. However you look at it, it's an impressive result for a security audit, and I doubt that the Anthropic team had to manually filter out hundreds-to-thousands of false positives to produce it.

They did have to manually write minimal exploits for each bug, because Opus was bad at it[3]. This is a problem that Mythos doesn't have. With access to Mythos, to repeat the same audit, you'd likely just need to make the model itself write all the exploits, which incidentally would also filter out a lot of the false positives. I think the hype is mostly justified.

[1] https://lwn.net/Articles/1065620/

[2] https://blog.mozilla.org/en/firefox/hardening-firefox-anthro...

[3] https://www.anthropic.com/news/mozilla-firefox-security
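
The idea that writing exploits "would also filter out a lot of the false positives" can be sketched as a simple gate: a finding is kept only if its proof of concept actually reproduces against the target. Everything below is a hypothetical toy (the `Finding` type, the `read_header` target, and both proofs of concept) and says nothing about how Anthropic or Mozilla ran their audit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    title: str
    proof_of_concept: Callable[[], bool]  # returns True if the bug reproduces


def read_header(data: bytes) -> int:
    """Toy target with a real bug: it reads data[0] without a length check."""
    return data[0]


def poc_empty_input() -> bool:
    """Reproduces the real bug: empty input raises IndexError."""
    try:
        read_header(b"")
    except IndexError:
        return True
    return False


def poc_imagined_overflow() -> bool:
    """A confabulated report: the claimed failure never occurs."""
    try:
        read_header(b"\xff")
    except OverflowError:
        return True
    return False


def confirmed(findings):
    """Keep only findings whose proof of concept reproduces."""
    kept = []
    for finding in findings:
        try:
            if finding.proof_of_concept():
                kept.append(finding)
        except Exception:
            # A proof of concept that itself crashes proves nothing; drop it.
            pass
    return kept


if __name__ == "__main__":
    reports = [
        Finding("missing length check in read_header", poc_empty_input),
        Finding("integer overflow in read_header", poc_imagined_overflow),
    ]
    for finding in confirmed(reports):
        print("confirmed:", finding.title)
```

The gate only removes reports that fail to reproduce. It does not say whether a confirmed bug is reachable or severe, so human review still has a job.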

traceroute66 2 hours ago | parent [-]

> A 0% false-positive rate is not necessary

To be clear, I'm not saying 0% false-positive because that will always be impossible with any LLM.

However, to greatly over-simplify what I already said ...

The presence of >0 false-positives means you still need someone who knows what they are doing behind the keyboard.

The presence of an LLM, no matter how good, will never remove the need for a human with domain expertise in security analysis.

You cannot blindly fix stuff just because the LLM says it needs fixing.

You cannot report stuff just because the LLM says it needs reporting.

There may well be scope for LLM-assisted workflows, but WHO is being assisted is a critical part of the equation.

That is the fundamental point I am making.

olmo23 4 hours ago | parent | prev | next [-]

We are there. This is pretty much the reason why Mythos isn't being released publicly.

pocksuppet 4 hours ago | parent [-]

The reason Mythos isn't being released publicly is to drive up Anthropic's valuation by making big promises.

dymk 4 hours ago | parent [-]

https://blog.mozilla.org/en/privacy-security/ai-security-zer...

> As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation.

warkdarrior 3 hours ago | parent [-]

So you're saying Mozilla is in on it, hyping up Anthropic. Are they getting a kickback?

dymk 2 hours ago | parent | next [-]

What I’m saying is the youths call this “smoking copium”

pocksuppet 18 minutes ago | parent [-]

Both can be true at once. It can be good at finding vulnerabilities, and also overhyped to pump the stock price.

bitwize 2 hours ago | parent | prev [-]

What they're saying is that the capabilities of Mythos to find overlooked vulnerabilities in large code bases are real.

We're in a new era for security. You're either using AI to catch vulnerabilities in your code... or someone else is, and 0wning you.

bri3d 3 hours ago | parent | prev | next [-]

"More efficient" of course has many axes (cost, energy consumption, manual labor requirement vs cost of human, time, quality, etc.). However, as a long-time reverse engineer and exploit developer who has worked in the field professionally, I would say LLMs are now useful; their utility exceeds that which was previously available. That is, LLM assisted exploit discovery and especially development is faster, more efficient, and ultimately cheaper than non-LLM assisted processes.

What commenters don't seem to understand is that especially CVE spam / bug bounty type vulnerability research has always been an exercise in sifting through useless findings and hallucinations, and LLMs, used well, are great at reducing this burden.

Previously, a lot of "baseline" / bottom tier research consisted of "run fuzzers or pentest tools against a product; if you're a bottom feeder just stuff these vulns all into the submission box, if you're more legit, tediously try to figure out which ones are reachable." LLMs with a test harness do an _amazing_ job at reducing this tedium; in the memory safety space "read across 50 files to figure out if this UAF might be reachable" or in the web space, "follow this unsanitized string variable to see if it can be accessed by the user" are tasks that LLMs with a harness are awesome at. The current models are also about 50% there at "make a chain for this CVE," depending on the shape of the CVE (they usually get close given a good test harness).

It seems that the concern with the unreleased models is pretty much that this has advanced once again from where it is today (where you need smart prompting and a good harness) to the LLM giving you exploit chains in exchange for "giv 0day pl0x," and based on my experience, while this has got an element of puffery and classic capitalist goofiness to it ("the model is SO DANGEROUS only our RICHEST CUSTOMERS can have it!"), I believe this is just a small incremental step and entirely believable.

To summarize: "more efficient than all but the best" comes with too many qualifiers, but "are LLMs meaningfully useful in exercising vulnerabilities in OS kernel code," or "is it possible to accelerate vulnerability research and development with LLMs" - 100% absolutely.

And you don't have to believe one random professional (me); this opinion is fairly widespread across the community:

https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...

https://lwn.net/Articles/1065620/

etc.
