gregfrank 9 hours ago
"Trendslop" is a great name for something I think is a deeper structural problem than it appears. The issue isn't just that LLMs produce generic outputs, it's that our evaluation methods reward the appearance of the right behavior rather than the behavior itself. In safety/alignment research specifically, I've found that refusal-rate benchmarks have the same failure mode: a model can score well on refusal probes (accurately representing the "don't answer this" concept in its latent space) while routing around that representation behaviorally. The benchmark looks fine; the model isn't actually doing what the benchmark measures. | ||