ronald_petty | an hour ago:
Not the author - just my thoughts on supplying context during tests like these. When I run tests, I focus on the "out of the box" experience. I suspect the vast majority of actors (good and bad, junior and senior) will use the out-of-the-box setup more than they will try to shape the outcome with context engineering. We do expect tweaked prompts to produce better outcomes, but that also requires work (for now). Another way to think about it is reducing system complexity by starting at the bottom (no configuration) before moving to the top (more configuration). We can't even replicate out-of-the-box results today, much less any level of configuration (randomness is going to random). I agree it's a good test to try, but there are huge benefits to being able to understand, and better recreate, zero-configuration tests.
decidu0us9034 | an hour ago:
All the docs are already in its training data; wouldn't that just pollute the context? I do think giving a model better, non-free tooling would help, as mentioned. Binja code mode can be useful, but you definitely need to give these models a lot of babysitting and encouragement, and their limitations show with large binaries or functions. But sometimes, if you have a lot to go through and just need a starting point for triage, false positives are fine.
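For illustration, a rough sketch of the kind of triage pass described above, assuming the Binary Ninja Python API (`binaryninja.load`, `get_symbols_of_type`, `get_code_refs`); the import list and file path are placeholders rather than anything from the thread, and real triage would obviously need more than this:

```python
# Rough triage sketch using the Binary Ninja Python API (licensed install
# assumed). It flags functions that reference commonly abused imports so a
# model or analyst has somewhere to start; false positives are acceptable.
import binaryninja
from binaryninja import SymbolType

# Placeholder list; a real run would tailor this to the target platform.
SUSPICIOUS_IMPORTS = {"VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"}

def triage(path):
    bv = binaryninja.load(path)      # open and auto-analyze the binary
    bv.update_analysis_and_wait()
    hits = []
    for sym in bv.get_symbols_of_type(SymbolType.ImportedFunctionSymbol):
        if sym.name not in SUSPICIOUS_IMPORTS:
            continue
        # Every code reference to a suspicious import is a triage candidate.
        for ref in bv.get_code_refs(sym.address):
            if ref.function is not None:
                hits.append((ref.function.name, sym.name, hex(ref.address)))
    return hits

if __name__ == "__main__":
    for func, imp, addr in triage("sample.bin"):   # placeholder path
        print(f"{func} references {imp} at {addr}")
```

The point is not accuracy but narrowing the search space: a model (or a person) gets a short list of functions to read first instead of the whole binary.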
stared | 3 hours ago:
You can solve any problem with AI if you give it enough hints. The question we asked is whether it can solve a problem autonomously, with instructions that would be clear to a reverse engineering specialist. That said, I found these models useful for many binary tasks - just not (yet) the end-to-end ones.
anamexis | an hour ago:
"With instructions that would be clear for a reverse engineering specialist" is a big caveat, though. It seems like an artificial restriction to add. With a longer and more detailed prompt (while still keeping the prompt completely non-specific to a particular type of malware/backdoor), the AI could most likely solve the problem autonomously much better.
embedding-shape | 2 hours ago:
> The question we asked is if they can solve a problem autonomously

What level of autonomy, though? At some point a human has to fire them off, so what that means here is already somewhat shaky. What about providing a bunch of manuals in a directory and including "There are manuals in manuals/ you can browse to learn more." in the prompt? If they take the hint, is that still "autonomous"?
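For illustration, a minimal sketch of that "manuals in a directory" setup; the directory name, hint wording, and task are placeholders, and whether the agent ever browses manuals/ is left entirely to it:

```python
# Minimal sketch of the hint-only configuration described above.
# The directory name, hint wording, and task text are all placeholders.
from pathlib import Path

MANUALS_DIR = Path("manuals")

def build_prompt(task: str) -> str:
    """Append the one-line hint only if the directory actually has content;
    the agent still has to decide on its own whether to browse it."""
    hint = ""
    if MANUALS_DIR.is_dir() and any(MANUALS_DIR.iterdir()):
        hint = "There are manuals in manuals/ you can browse to learn more."
    return f"{task}\n\n{hint}".strip()

if __name__ == "__main__":
    print(build_prompt("Determine whether ./sample.bin contains a backdoor and explain how it triggers."))
```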