jychang 6 hours ago
It's also factually incorrect. Pretty much the entire field of mechanistic interpretability would point out that models have an internal representation of what a bug is.

Here's the most approachable paper showing a real model (Claude 3 Sonnet) clearly having an internal representation of bugs in code: https://transformer-circuits.pub/2024/scaling-monosemanticit...

Read the entire section around this quote:

> Thus, we concluded that 1M/1013764 represents a broad variety of errors in code.

(Also the section after "We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions")

This feature fires on actual bugs; it's not just the model pattern-matching on "what a bug hunter might say next".
mrbungie 4 hours ago
Was this "paper" eventually peer reviewed?

PS: I know it is interesting and I don't doubt Anthropic, but I find it fascinating that they get such a pass in science.