Remix.run Logo
hansvm 3 days ago

There's a subtle distinction though; if you're able to filter out illegal behavior, the move quality conditioned on legality can be extremely different from arbitrary move quality (and, as you might see in LLM json parsing, conditioning per-token can be very different from conditioning per-response).

If you're arguing that the singularity already happened then your criticism makes perfect sense; these are dumb machines, not useful yet for most applications. If you just want to use the LLM as a tool though, the behavior when you filter out illegal responses (assuming you're able to do so) is the only reasonable metric.

Analogizing to a task I care a bit about: Current-gen LLMs are somewhere between piss-poor and moderate at generating recipes. With a bit of prompt engineering most recipes pass my "bar", but they're still often lacking in one or more important characteristics. If you do nothing other than ask it to generate many options and then as a person manually filter to the subset of ideas (around 1/20) which look stellar, it's both very effective at generating good recipes, and they're usually much better than my other sources of stellar recipes (obviously not generally applicable because you have to be able to tell bad recipes from good at a glance for that workflow to make sense). The fact that most of the responses are garbage doesn't really matter; it's still an improvement to how I cook.