Sytten 2 days ago

Automated app pentest scanners find the bottom 10-20% of vulns; no real pentester would consider them great. Agents might get us to the 40-50% range. What they are really good at is finding "signals" that the human should investigate.

tptacek 2 days ago | parent [-]

I agree with you about scanners (we banned them at Matasano), but not about the ceiling for agents. Having written agent loops for somewhat similar "surface and contextualize hypotheses from large volumes of telemetry" problems, and, of course, having delivered hundreds of application pentests: I think 80-90% of all the findings in a web pentest report, and functionally all of the findings in a netpen report, are within 12-18 months' reach of agent developers.
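
A stripped-down sketch of what such a loop can look like, with the model call stubbed out so it runs on its own; the names, the scoring step, and the threshold are illustrative, not tptacek's actual tooling:

  # Shape of a "surface and contextualize hypotheses from telemetry" loop.
  # llm_propose() is a stub standing in for a real model call.
  from dataclasses import dataclass

  @dataclass
  class Hypothesis:
      claim: str       # e.g. "param 'id' on /orders looks injectable"
      evidence: list   # telemetry lines that prompted the claim
      score: float     # model's own confidence, 0..1

  def llm_propose(chunk: list[str]) -> list[Hypothesis]:
      # Stand-in: a real agent would hand the chunk to a model and parse its answer.
      return [Hypothesis(claim=f"suspicious: {line}", evidence=[line], score=0.4)
              for line in chunk if "error" in line.lower()]

  def surface(telemetry: list[str], chunk_size: int = 50, threshold: float = 0.3):
      hypotheses = []
      for i in range(0, len(telemetry), chunk_size):
          hypotheses.extend(llm_propose(telemetry[i:i + chunk_size]))
      # Only surface what is worth a human's attention, highest confidence first.
      return sorted((h for h in hypotheses if h.score >= threshold),
                    key=lambda h: h.score, reverse=True)

  if __name__ == "__main__":
      logs = ["GET /orders?id=1 200",
              "GET /orders?id=1' 500 SQL error",
              "GET /health 200"]
      for h in surface(logs):
          print(f"{h.score:.2f}  {h.claim}")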

KurSix 2 days ago | parent | next [-]

I agree with the prediction. The key driver here isn't even model intelligence, but horizontal scaling. A human pentester is constrained by time and attention, whereas an agent can spin up 1,000 parallel sub-agents to test every wild hypothesis and every API parameter for every conceivable injection. Even if the success rate of a single agent attempt is lower than a human's, the sheer volume of attempts more than compensates for it.
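
A rough sketch of that fan-out in Python's asyncio; the probe is a stub standing in for real requests, and the payload list, concurrency limit, and hit rate are invented for illustration:

  # Many cheap, parallel hypothesis checks instead of one careful sequential pass.
  import asyncio, random

  PAYLOADS = ["'", "\"", "../", "${7*7}", "<svg onload=1>"]

  async def probe(param: str, payload: str) -> tuple[str, str, bool]:
      await asyncio.sleep(random.uniform(0.01, 0.05))  # pretend network I/O
      return param, payload, random.random() < 0.02    # pretend low per-try success

  async def fan_out(params: list[str], concurrency: int = 100):
      sem = asyncio.Semaphore(concurrency)
      async def bounded(p, pl):
          async with sem:
              return await probe(p, pl)
      results = await asyncio.gather(*[bounded(p, pl) for p in params for pl in PAYLOADS])
      return [r for r in results if r[2]]  # only the hits a human should triage

  if __name__ == "__main__":
      params = [f"param_{i}" for i in range(200)]  # every API parameter we know about
      hits = asyncio.run(fan_out(params))
      print(f"{len(params) * len(PAYLOADS)} attempts, {len(hits)} signals for a human")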

tptacek 2 days ago | parent [-]

They also don't fatigue in the same way humans do. Within the constraint of a netpen, a human might be, say, 20% more creative at peak performance than an agent loop. But an agent loop will operate within a narrow band of its own peak performance throughout the whole test, on every stimulus/response trial it does. Humans cannot do that.

torginus 2 days ago | parent | prev | next [-]

I wonder how the baseline for 100% is established: is there (security-relevant) software that you'd say is essentially free of vulnerabilities?

tptacek 2 days ago | parent [-]

Nope! It's extremely unknowable.

EE84M3i 2 days ago | parent | prev | next [-]

Would be curious to hear your hypothesis on what the remaining 10-20% that might stay out of reach looks like. Business logic bugs?

tptacek 2 days ago | parent [-]

Honestly I'm just trying to be nice about it. I don't know that I can tell you a story about the 90% ceiling that makes any sense, especially since you can task 3 different high-caliber teams of senior software security people on an app and get 3 different (overlapping, but different) sets of vulnerabilities back. By the end of 2027, if you did a triangle test, 2:1 agents/humans or vice versa, I don't think you'd be able to distinguish.

Just registering the prediction.
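
For reference, a triangle test here means each reviewer gets three reports, two from one source and one from the other, and tries to pick the odd one out, so chance is 1/3. A small scoring sketch; the reviewer numbers are made up for illustration:

  # Can reviewers tell agent reports from human reports better than chance?
  from math import comb

  def triangle_p_value(correct: int, trials: int, chance: float = 1 / 3) -> float:
      # Exact one-sided binomial tail: P(X >= correct) if picks were random guesses.
      return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
                 for k in range(correct, trials + 1))

  if __name__ == "__main__":
      # 30 reviewers, 14 correct picks -> p around 0.09, short of the usual 0.05 bar.
      print(round(triangle_p_value(14, 30), 3))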

karlmdavis 2 days ago | parent [-]

I would take the other side of that bet.

  # if >10 then was_created_by_agent = true
  # (needs a grep whose -P engine understands \p{Emoji})
  $ grep -oP '\p{Emoji}' vulns.md | wc -l
tptacek 2 days ago | parent [-]

I don't understand what you're trying to say here.

Paracompact 2 days ago | parent [-]

Just that the superficial details of how AIs communicate (e.g. with lots of emojis) might give them away in any triangle test :)

tptacek 2 days ago | parent | next [-]

Ah! Touché.

worksonmine 2 days ago | parent | prev [-]

I see this emoji thing being mentioned a lot recently, but I don't remember ever seeing one. Granted, I rarely use AI, and when I do it's on duck.ai. What models are (ab)using emojis?

nullcathedral 2 days ago | parent | prev [-]

I'd say I agree with you there for the low-hanging fruit. The deep research (there's an image filter here, but we can bypass it by knowing some obscure corner of the SVG spec) is where they still fall over and need hand-holding: pointing them at the browser rendering stack, the specs, etc.

hrimfaxi 2 days ago | parent [-]

Until those obscure corner cases are fed into the next training round.

viraptor 2 days ago | parent [-]

It doesn't even need to be trained on them. Just feed it parts of the spec. I found some interesting implementation edge cases just by submitting the source and the PDF spec of a chip to Claude, not even in a fancy way.
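
A minimal sketch of that workflow with the Anthropic Python SDK, assuming the PDF spec has already been extracted to text; the file paths, prompt, and model name are placeholders, not what viraptor actually used:

  # Hand the model the spec and the implementation side by side and ask
  # where they could disagree. Needs ANTHROPIC_API_KEY in the environment.
  import pathlib
  import anthropic

  spec = pathlib.Path("chip_spec_extracted.txt").read_text()  # text pulled from the PDF spec
  source = pathlib.Path("driver.c").read_text()               # the implementation under review

  prompt = (
      "Here is an excerpt of a chip's spec, followed by source that claims to "
      "implement it. List places where the code's behaviour could diverge from "
      "the spec, especially edge cases the spec calls out explicitly.\n\n"
      f"--- SPEC ---\n{spec}\n\n--- SOURCE ---\n{source}"
  )

  client = anthropic.Anthropic()
  reply = client.messages.create(
      model="claude-sonnet-4-20250514",  # placeholder; use whatever model is current
      max_tokens=2048,
      messages=[{"role": "user", "content": prompt}],
  )
  print(reply.content[0].text)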