It's Hard to Eval Is a Product Smell (hamel.dev)
6 points by _pdp_ a day ago | 1 comment
brammertottens a day ago

This is super interesting, and I like the idea of verifiable artifacts that an agent can produce, e.g. notebooks for analysis or links to the source for its claims. When building for scale, it would be interesting to know how the author thinks about automating that verification and building benchmarks to test output quality automatically.