vicchenai 2 days ago
the rl loop here is clever but i wonder how the reward signal degrades over time. if you're optimizing for user acceptance of suggestions, you're inevitably training on a mix of "this was actually correct" and "i accepted because editing the suggestion was more work than accepting it." that second case creates a subtle bias toward suggestions that are close-enough-to-not-bother-fixing rather than actually correct.

also curious whether they see different convergence patterns across languages. my gut says something like python, where there's more stylistic variation, would be harder to get a clean reward signal for vs something like rust where there are fewer idiomatic ways to do things.
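the acceptance-bias point can be made concrete with a toy simulation. everything here is hypothetical (the 0.6 correctness rate, the edit-cost threshold, the function names are all made up for illustration), but it shows how an acceptance-based reward signal systematically overcounts correctness when some accepts are really "too lazy to fix":

```python
import random

random.seed(0)

def simulate(n=10_000, correct_rate=0.6, edit_cost_threshold=0.7):
    """Toy model: each suggestion has a true correctness flag and a
    random 'cost to fix'. The user accepts if the suggestion is correct,
    OR if it's close enough that editing feels like more work than
    just accepting it."""
    accepted = correct_and_accepted = 0
    for _ in range(n):
        correct = random.random() < correct_rate
        edit_cost = random.random()  # how annoying this one would be to fix
        if correct or edit_cost > edit_cost_threshold:
            accepted += 1
            correct_and_accepted += correct
    acceptance_rate = accepted / n
    # fraction of positive reward events that were actually correct
    signal_precision = correct_and_accepted / accepted
    return acceptance_rate, signal_precision

rate, precision = simulate()
print(f"acceptance rate: {rate:.2f}, fraction actually correct: {precision:.2f}")
```

with these made-up numbers, roughly 72% of suggestions get accepted but only ~83% of those accepts correspond to correct code, so a policy trained on raw acceptance is getting positive reward on a meaningful slice of wrong-but-close outputs.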
addcn 2 days ago
Yeah, great callout. You basically need signal all the way to production. Did it get nuked during code review? Is it still running two months later? How many incidents were caused by this code? This is tough though, because enterprises go absolutely ballistic over "training on our data", which is understandable but will also hold us back.
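one way to sketch the "signal all the way to production" idea is a delayed reward that blends immediate acceptance with the downstream signals mentioned above (survived review, still alive later, incidents). the weights and field names here are invented for illustration; a real system would have to tune them:

```python
from dataclasses import dataclass

@dataclass
class SuggestionOutcome:
    accepted: bool
    survived_review: bool       # not reverted or rewritten in code review
    alive_after_60_days: bool   # still present in the codebase later
    incidents_attributed: int   # postmortems tracing back to this code

def delayed_reward(o: SuggestionOutcome) -> float:
    """Hypothetical reward blend: small credit for acceptance,
    larger credit for surviving review and staying in production,
    heavy penalty per attributed incident. All weights are made up."""
    if not o.accepted:
        return 0.0
    reward = 0.2                                   # acceptance itself
    reward += 0.3 if o.survived_review else -0.3   # review is a gate
    reward += 0.5 if o.alive_after_60_days else 0.0
    reward -= 1.0 * o.incidents_attributed         # incidents dominate
    return reward

# accepted but reverted in review nets slightly negative;
# shipped and still healthy nets strongly positive
print(round(delayed_reward(SuggestionOutcome(True, False, False, 0)), 2))  # -0.1
print(round(delayed_reward(SuggestionOutcome(True, True, True, 0)), 2))    # 1.0
```

the design point is that acceptance alone is a weak proxy, so later observations revise the reward, which is exactly where the "training on our data" tension bites: the best signals live inside customer repos and incident trackers.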