How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem-solving tasks to see how the models compare?

I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow

bcherny 3 hours ago | parent | prev [-]

A mix of evals and vibes.

efields an hour ago | parent | next [-]

"Evals and vibes": can I put that on a T-shirt?

giwook 3 hours ago | parent | prev | next [-]

What's that ratio, exactly?

nothinkjustai 2 hours ago | parent [-]

99/1

capnchaos 3 hours ago | parent | prev [-]

Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.