aesthesia 9 hours ago

This could be interesting work; it's definitely possible that pre-training corpus filtering has a hard-to-erase effect on post-trained model behavior. But it's hard to take this article seriously given the slop AI-research-report style and the lack of detail about the actual probing method. None of the models they experiment with are trained for fill-in-the-blank language modeling, and it's hard to prompt a base model to tell you what word fills in a blank, so I'm not sure what the Pythia vs Qwen 3.5 comparison actually means. I suspect that they effectively prompted with the prefix "The family faces immediate" and looked at the next-token distribution. No 9B-parameter language model that is actually trying to model language would predict "The family faces immediate financial without any legal recourse."

The only details they give are:

> Scoring. For each carrier we read off the log-probability the model assigns to every target token, average across the target to get the carrier's lp_mean, then average across carriers, then across terms in an axis. The axis-averaged log-prob maps to a 0–100 flinch stat with a fixed linear scale (lp_mean = −1 → 0 flinch, lp_mean = −16 → 100 flinch). Endpoints fixed across models, so the numbers are directly comparable.
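Read literally, that mapping is just a fixed linear rescale of the axis-averaged log-prob. A minimal sketch of what the quoted endpoints imply (the clamping outside the endpoints is my assumption; the quote doesn't say what happens past them):

```python
def flinch_score(lp_mean, lp_zero=-1.0, lp_hundred=-16.0):
    """Map an axis-averaged log-prob to the 0-100 'flinch' stat.

    Endpoints from the quoted text: lp_mean = -1 -> 0 flinch,
    lp_mean = -16 -> 100 flinch, fixed across models.
    Clamping to [0, 100] outside that range is my guess.
    """
    frac = (lp_zero - lp_mean) / (lp_zero - lp_hundred)
    return 100.0 * min(1.0, max(0.0, frac))
```

So a carrier-averaged log-prob of −8.5 would land at exactly 50, and anything above −1 nats pins to 0, which is worth keeping in mind when comparing "directly comparable" numbers near the endpoints.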

It's not certain, but this seems to imply that what they did is run a forward pass on each probe sentence and read off the probability the model assigns to the token they designate as the "flinch" token. The model makes that prediction from the preceding tokens alone, so it's not surprising at all that the top predictions aren't fluent with the specified continuation. That's how causal LLMs work. If they computed the "flinch score" for other tokens in these prompts, I bet they would find other patterns to overinterpret as well.
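Mechanically, that's just the teacher-forced readout: in a causal LM the logits at position t−1 score the token at position t, so the log-prob of any designated "flinch" token comes from the prefix alone. A toy numpy sketch of that indexing (my illustration, not their code; `logits` stands in for a model's per-position next-token logits):

```python
import numpy as np

def token_logprob(logits, token_ids, pos):
    # In a causal LM, logits[t] scores the *next* token given token_ids[:t+1],
    # so the log-prob of token_ids[pos] is read from logits[pos - 1]:
    # nothing after position pos enters the computation.
    row = logits[pos - 1]
    log_probs = row - np.log(np.exp(row).sum())  # log-softmax over the vocab
    return float(log_probs[token_ids[pos]])
```

With uniform logits over a vocab of 8, this comes out to −log(8) ≈ −2.08 no matter what tokens follow `pos`, which is the whole point: the "continuation" the probe sentence specifies never constrains the prediction.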