localuser13 4 hours ago

Is it? Gemini 3-pro-preview and 3-flash-preview, ranked second and third respectively, had 44% and 37% true positives and a whopping 65% and 86% false positives. This is worse than a coin toss. A false positive rate above 0% (3% to be generous) is useless in the real world. That leaves only Grok and GPT, with 18%, 9% and 2% success rates.
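To make the "worse than a coin toss" point concrete, here is a rough sketch of the arithmetic. The true/false positive rates are the ones quoted above; the function names and the 1% backdoor prevalence are my own assumptions for illustration, not numbers from the article.

```python
def youden_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: 0 is chance level, negative is worse than chance."""
    return tpr - fpr


def precision(tpr: float, fpr: float, prevalence: float) -> float:
    """Fraction of flagged binaries that are actually backdoored."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


# (true positive rate, false positive rate) as quoted in this comment.
models = {
    "Gemini 3-pro-preview": (0.44, 0.65),
    "Gemini 3-flash-preview": (0.37, 0.86),
}

for name, (tpr, fpr) in models.items():
    j = youden_j(tpr, fpr)
    p_balanced = precision(tpr, fpr, prevalence=0.5)  # half of all binaries backdoored
    p_rare = precision(tpr, fpr, prevalence=0.01)     # assumed 1% prevalence
    print(f"{name}: J={j:+.2f}, precision@50%={p_balanced:.3f}, precision@1%={p_rare:.4f}")
```

Both models come out with a negative J, meaning they flag clean binaries more often than backdoored ones. Even on a balanced test set less than half of the flags would be correct, and at the assumed 1% prevalence it drops below 1%.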

In fact, this is what authors said themselves: "However, this approach is not ready for production. Even the best model, Claude Opus 4.6, found relatively obvious backdoors in small/mid-size binaries only 49% of the time. Worse yet, most models had a high false positive rate — flagging clean binaries." So I'm not sure if we're even discussing the same article.

I also don't see a comparison with any other methodology. What is the success rate of ./decompile binary.exe | grep "(exec|system)/bin/sh"? What is the success rate of state-of-the-art alternative approaches?
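For reference, the kind of naive baseline I have in mind could look roughly like this. It is only a sketch: it scans raw bytes for printable strings rather than running a real decompiler, and the pattern list and function names are made up for illustration.

```python
import re

# Runs of at least 4 printable ASCII characters, similar to what `strings` reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Hypothetical indicator list: a shell path or a process-spawning libc call.
SUSPICIOUS = re.compile(rb"/bin/sh|\b(?:exec[lv]p?e?|system|popen)\b")


def extract_strings(data: bytes) -> list[bytes]:
    """Return all printable ASCII runs found in the binary blob."""
    return PRINTABLE_RUN.findall(data)


def flag_suspicious(data: bytes) -> bool:
    """Flag the blob if any extracted string matches an indicator."""
    return any(SUSPICIOUS.search(s) for s in extract_strings(data))
```

Running this over the same set of clean and backdoored binaries would give a true/false positive rate to compare the models against.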