operatingthetan 6 hours ago

A more interesting benchmark would probably be one that scores the LLM on finding exploits in the benchmark itself.