mike_hearn 4 days ago

The most interesting thing about this is the apparent absence of unit tests. The test for the XLA compiler bug just prints the outputs; it's more a repro case than a unit test in the sense of something a test harness would run with coverage tracked. And the action items are simply to lean more aggressively into evals.

Although unit testing an entire LLM is not really feasible right now, all these bugs were in small deterministic parts of the system. Load balancing, top-k probability calculations and so on are all engineered parts no different to other software, and should in principle all be unit testable. At most you need an injectable PRNG. Yes, non-deterministic optimization bugs are awful but I've personally found compiler and database bugs in the past using just regular app test suites. With CI you get a lot of runs so rare events can still surface as long as you investigate flakes. One of my current projects runs every unit test in the same process in parallel, which has proven an excellent and cheap strategy for flushing out rare thread safety issues and database deadlocks.
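
To make the injectable-PRNG point concrete, here's a minimal sketch (my own illustrative names and shapes, not anything from the incident report): a top-k sampler that takes the random source as a parameter, so a test can seed it and assert on exact values.

```python
import random

def top_k_sample(logits, k, rng=random):
    """Pick one of the k highest-scoring indices, weighted by score.

    `rng` is injectable so a test can pass a seeded random.Random
    and assert on exact, reproducible outputs.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [logits[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Deterministic unit test: the pick must be in the top k, and the same
# seed must always produce the same pick.
pick = top_k_sample([0.1, 0.7, 0.2, 0.9], k=2, rng=random.Random(42))
assert pick in (1, 3)
assert top_k_sample([0.1, 0.7, 0.2, 0.9], k=2, rng=random.Random(42)) == pick
```

With the default `rng=random` the production call sites don't change at all; only the tests pass an explicit seeded instance.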

A few days ago I commented on a thread about the Java launch that people often feel Java is "enterprisey" compared to Python because Java code is typically written to be heavily unit testable. A lot of abstraction is driven by the desire for dependency injection, for example. I contrasted that to scripting language culture where I've found testing is often either missing or kinda surface level (e.g. mostly just asserting on types).

When I was learning PyTorch a few years ago I noticed the same thing. The tutorials took you from simple to complex stuff without talking much about how to test or best structure the code. This makes sense for ML research, where you don't have a clear goal and success boils down to maxing a score in some kind of eval, but it doesn't make sense for production deployment at scale.

I wonder if the AI labs could use more people with SRE and HA SWE background to focus on things like this. I'm kinda skeptical that more aggressive rolling evals-in-prod are the best way to avoid bugs like these happening again.

vintagedave 4 days ago | parent [-]

I've had to write some detailed prompts and examples to have AI generate the kind of unit tests I want in Python. I've seen the assertions on types alone too. I want assertions on values and more.

Even more than that, AI tends to mock _everything_. Mocking is useful, but the more real code a unit test invokes, the better, because the risk is not only in the code itself but in its interactions, the interface. Yet AI in Python will mock so heavily it barely tests even the code itself, with tautological assertions.

I've prompted with heavy warnings against mocking and pointed directly at examples of thorough tests. FWIW, Python does have excellent tools for injection, and you can write really nicely structured code in it.
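
A toy sketch of the difference (invented example, but the shape is what I keep seeing): a tautological mocked test versus one that exercises the real collaborator and asserts on values.

```python
from unittest import mock

def apply_discount(price, rules):
    # Code under test: delegates the rate decision to a rules object.
    return round(price * (1 - rules.rate(price)), 2)

class Rules:
    """Real collaborator with actual business logic."""
    def rate(self, price):
        return 0.10 if price >= 100 else 0.0

# Anti-pattern: tautological mocked test. The asserted value comes straight
# back out of the mock, so this passes even if Rules is completely wrong.
fake = mock.Mock()
fake.rate.return_value = 0.10
assert apply_discount(100, fake) == 90.0

# Better: invoke the real collaborator and assert on concrete values,
# so the test also covers the interaction across the interface.
assert apply_discount(100, Rules()) == 90.0
assert apply_discount(99.99, Rules()) == 99.99
```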

redman25 4 days ago | parent | next [-]

I wish I had 100 upvotes to give you. Weak, heavily mocked tests are my biggest pet peeve. Test “quality” is important and not something a lot of devs pay attention to.

I’ve found myself preferring integration tests or unit tests with a “real” database set up because the tests are much more effective. If you design them right, they don’t even need to be slower.

whatevaa 3 days ago | parent [-]

They will be slower locally if you also have to run 3 virus scanners :)

bobbylarrybobby 3 days ago | parent | prev | next [-]

When asked to write UI tests (playwright), I've seen Claude Code do essentially the following:

    const elem = document.querySelector(".foo"); // actual element that exists
    elem.innerHTML = '<div class="bar"></div>';
    const child = elem.locator(".bar"); // child we actually want to test for
    expect(child).toExist()

Gee thanks Claude, what a great test...

vintagedave 3 days ago | parent [-]

Same. Drives me up the wall. I’m writing my own coding agent now and I’m baking into it prompts against all the anti-patterns I’ve seen.

andoando 3 days ago | parent | prev | next [-]

Mocked tests also make refactoring a pain in the ass.

This is why I heavily prefer integration tests

mike_hearn 4 days ago | parent | prev [-]

I'm curious how you structure your Python to be well testable. I have to admit, my own use of Python has been limited to scripts and (a long time ago) a game engine, not large codebases. So unit testing for those hardly came up.

It seems there are a couple of dependency injection frameworks, but they're clones of what's found in Java, right down to the type names. One of them even calls injectable objects beans! (Rhazes)

Balinares 4 days ago | parent | next [-]

Same as you do it in any language: you compose instead of inheriting, you avoid shared state, you generally think about how this thing you're implementing can be tested even as you are implementing it. Test-driven development tends to constrain your interfaces too early but you can get a lot of the same benefits with, let's call it, test-mindful development. That works in any language.

vintagedave 4 days ago | parent | prev | next [-]

Most of my Python is web. Individual components, same as always: design with a set API and not too many dependencies, and allow injection via some route where needed. I also test web endpoints. One thing I really like is isolating tests that require data -- rather than mocking the database, for example, I'll create an in-memory SQLite DB used while running tests. That way I can test the full stack: call a web API, see its results, and check what was changed in the database at the same time, all isolated from the 'real' stack.
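
Roughly like this, stripped down to stdlib sqlite3 (in a real app the handler would sit behind Flask/FastAPI with the connection injected per request, but the testing idea is the same):

```python
import sqlite3

def make_schema(conn):
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def create_user(conn, name):
    # Stand-in for a web endpoint handler.
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    conn.commit()
    return {"id": cur.lastrowid, "name": name}

# Throwaway in-memory database: fast, isolated, and real SQL.
conn = sqlite3.connect(":memory:")
make_schema(conn)

resp = create_user(conn, "ada")
assert resp == {"id": 1, "name": "ada"}
# Check what actually landed in the database, not just the response.
assert conn.execute("SELECT name FROM users").fetchall() == [("ada",)]
```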

lordmathis 4 days ago | parent | prev [-]

I learned to write well-testable code when I learned Go. It pushes you to pass interfaces instead of direct implementations. There's also no inheritance, just composition. While there's no 1-to-1 translation to Python, the concepts are still useful. It can be even easier in Python thanks to duck typing.
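
In modern Python you can even make the Go-style interface explicit with `typing.Protocol` (structural typing, so no inheritance is required). A small invented example:

```python
from typing import Protocol

class Notifier(Protocol):
    # Structural "interface": anything with a matching send() satisfies
    # it, no subclassing needed -- similar to a Go interface.
    def send(self, message: str) -> None: ...

class EmailNotifier:
    """Production implementation."""
    def send(self, message: str) -> None:
        print(f"emailing: {message}")

class SpyNotifier:
    """Test double: records messages instead of sending them."""
    def __init__(self):
        self.sent = []
    def send(self, message: str) -> None:
        self.sent.append(message)

def alert_on_failure(job_ok: bool, notifier: Notifier) -> None:
    if not job_ok:
        notifier.send("job failed")

# The test injects the spy and asserts on values, not types.
spy = SpyNotifier()
alert_on_failure(False, spy)
assert spy.sent == ["job failed"]
alert_on_failure(True, spy)
assert spy.sent == ["job failed"]  # no extra message on success
```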