Model cards are just marketing material. I wouldn’t trust them one bit.

You don't need to trust anyone. GPT 5.4 xhigh is available and you can test it for $20, to verify it is actually able to find complex bugs in old codebases. Do the work instead of denying AI can do certain things. It's a matter of an afternoon. Or, trust the people that did this work. See my YouTube video where I find tons of Redis bugs with GPT 5.4.

▲

Hendrikto 4 hours ago | parent | next [-]

I did not claim or deny anything. You cited the model card; I just pointed out that it is not a reliable source. If you have better sources, like your YT video, you should cite those instead.

▲

otterley 4 hours ago | parent [-]

You are claiming something: that the model card is not reliable, therefore it's as useful as nothing. Sowing doubt without a possible solution adds little value to the conversation. Moreover, your rebuttal is unsubstantiated.

	▲	cyanydeez 3 hours ago | parent [-]
		Guys, think about all the security vulnerabilities you're aware of; now, think about how many of those you know how to technically reproduce. Now imagine that you actually don't know how to reproduce most things and you'll never actually be able to judge the result. Well, just because these are all AI people doesn't mean they verified enough of the output of these models to actually substantiate the significant security implications they're advertising.

▲

Yokohiii 7 hours ago | parent | prev | next [-]

The whole discussion started out as an attempt to disprove/verify Anthropic's (model card) claims.

He also transfers the logic of their claims to the actual real world. You can say that model cards are marketing garbage. You have to prove that experienced programmers are not significantly better at security.

▲

root_axis 7 hours ago | parent [-]

> You have to prove that experienced programmers are not significantly better at security.

That has not been my experience. It's true that they are "better at security" in the sense that they know to avoid common security pitfalls like unparameterized SQL, but essentially none of them have the ability to apply their knowledge to identify vulnerabilities in arbitrary systems.

	▲	tracker1 3 hours ago | parent | next [-]
		I would think pwn2own competitions would signal the opposite. I'm consistently amazed at how a unique combination of exploits can add up to a larger exploit, often in ways that most wouldn't even consider. I think it takes a level of knowledge, experience, creativity and paranoia to be really good with security issues all around as a person.
	▲	Yokohiii 6 hours ago | parent | prev | next [-]
		An expert-level human doesn't have to be an expert in every programming category. A webdev wouldn't spot a use after free. A systems engineer wouldn't know about CSRF. That is, if neither researches security beyond their field. Requiring a programmer to apply their knowledge to an arbitrary system is asking too much. On the other hand, an LLM can be expert level in every programming field, able to spot and combine vulnerabilities creatively. That is all pretty hard and I don't think a security expert with vast knowledge would say "that's easy". My point is that more experienced programmers are better at security on average, not that they are security experts.
	▲	inetknght 6 hours ago | parent | prev [-]
		> essentially none of them have the ability to apply their knowledge to identify vulnerabilities in arbitrary systems.

		I've found it to be the opposite. Many of them do have the ability to apply their knowledge in that fashion. They're just either not incentivised to do so, or incentivised to not do so.

▲

mbesto 5 hours ago | parent | prev | next [-]

And benchmarks can easily be gamed by overfitting. Yet here we are with the top HN comment on the HN Mythos thread outlining its benchmark performance gains.

I guess we'll never learn.

▲

2983592 7 hours ago | parent | prev [-]

But they are treated as holy scripture ...