layer8 4 days ago

I’d go further and say that while testing is necessary, it is not sufficient. You have to understand the code and convince yourself that it is logically correct under all relevant circumstances, by reasoning over the code.

Testing only “proves” correctness for the specific state, environment, configuration, and inputs the code was tested with. In practice that only tests a tiny portion of possible circumstances, and omits all kinds of edge and non-edge cases.

crabmusket 4 days ago | parent | next [-]

> "proves"

I like using the word "demonstrates" in almost every case where people currently use the word "proves".

A test is a demonstration of the code working in a specific case. It is a piece of evidence, but not a general proof.

And these kinds of narrow ad-hoc proofs are fine! Usually adequate.

To rephrase the title of TFA, we must deliver code that is demonstrated to work.

aspbee555 4 days ago | parent | prev | next [-]

I find myself not really trusting just tests; I really need to try the app or new function in multiple ways with the goal of breaking it. In that process I may not break it, but I will notice something that might break, so I rewrite it to be better.

lanstin 4 days ago | parent | next [-]

If you don't push your system to failure, you can't really say you understand it. And anyway, the precise failure modes under various conditions are important characteristics for stability/resiliency. (Does it shed load all the way up to the network bandwidth of SYNs; allocate all the memory and then exit; freeze up with deadlocks/disk contention; go unresponsive for a few minutes and then recover if the traffic dies off; answer health-check pings only while making no progress on actual work?)

Nizoss 4 days ago | parent | prev [-]

If you write your tests the Test-Driven Development way, so that they fail before the production changes are introduced, you will be able to trust them a lot more. Especially if they are well-written tests that test behavior or contracts, not implementation details. I find that dependency injection helps a lot with this. I try to avoid mocking and complex dependencies as much as possible. This also allows me to easily refactor the code without having to worry about breaking anything, as long as all the tests still pass.
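
A minimal sketch of what I mean by injecting a dependency and testing behavior rather than implementation details (the names here are made up for illustration):

    from dataclasses import dataclass

    # Hypothetical clock dependency, injected so the test controls time
    # without patching or mocking anything. Production code would pass
    # a real clock object with the same now() interface.
    class FixedClock:
        def __init__(self, now: float) -> None:
            self._now = now

        def now(self) -> float:
            return self._now

    @dataclass
    class Session:
        created_at: float
        ttl: float

        def is_expired(self, clock) -> bool:
            # The behavior under test: expiry, not how time is obtained.
            return clock.now() >= self.created_at + self.ttl

    def test_session_expires_after_ttl() -> None:
        session = Session(created_at=100.0, ttl=60.0)
        assert not session.is_expired(FixedClock(now=159.0))
        assert session.is_expired(FixedClock(now=160.0))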

When it comes to agentic coding, I created an open-source tool that enforces those practices. The agent gets blocked by a hook if it tries to do anything that violates those principles. I think it helps a lot, if I may say so myself.

https://github.com/nizos/tdd-guard

Edit: I realize now that I misunderstood your comment. I was quick to respond.

Yodel0914 4 days ago | parent | prev | next [-]

Came to leave the same comment. It’s very possible to deliver code that’s proven to work but is still shit.

roeles 4 days ago | parent | prev | next [-]

Since we can't really formally prove most code, I think property-based testing, such as with Hypothesis [1], would make sense. I have not used it yet, but am about to for stuff that really needs to work.

[1] https://news.ycombinator.com/item?id=45818562
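
For anyone who hasn't tried it, a property-based test with Hypothesis looks roughly like this (run-length encoding is just an illustrative example, not from TFA):

    from hypothesis import given
    from hypothesis import strategies as st

    def run_length_encode(data: str) -> list[tuple[str, int]]:
        encoded: list[tuple[str, int]] = []
        for ch in data:
            if encoded and encoded[-1][0] == ch:
                encoded[-1] = (ch, encoded[-1][1] + 1)
            else:
                encoded.append((ch, 1))
        return encoded

    def run_length_decode(encoded: list[tuple[str, int]]) -> str:
        return "".join(ch * count for ch, count in encoded)

    # Hypothesis generates many inputs, including edge cases like "".
    @given(st.text())
    def test_decode_inverts_encode(s: str) -> None:
        assert run_length_decode(run_length_encode(s)) == s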

xendo 4 days ago | parent [-]

We can't really property test most code. So it comes down, as with everything, to good judgement and experience.

epgui 4 days ago | parent [-]

You can property test most code.

array_key_first 4 days ago | parent | prev | next [-]

I agree - it's trivial to hit 100% test coverage if your code isn't robust or resilient and just does "happy path" type stuff.
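
For example, something like this hits every line and still only exercises the happy path (hypothetical code):

    def parse_port(value: str) -> int:
        # One happy-path test covers 100% of these lines, yet "abc",
        # negative numbers, and ports above 65535 are never exercised.
        return int(value)

    def test_parse_port() -> None:
        assert parse_port("8080") == 8080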

9rx 4 days ago | parent | prev | next [-]

Testing is not perfect, but what else is there? Even formal proofs are just another expression of testing. With greater mathematical guarantees than other expressions, granted, but still testing all the same; prone to all the very same human problems testing is burdened with.

layer8 4 days ago | parent [-]

The difference with proofs (whether formal or informal) is that they quantify over all possible cases, whereas testing is always limited to specific cases.

9rx 3 days ago | parent [-]

There is no difference. It is all testing. Testing captures the full gamut, from simply manually using the software all the way up to formal proofs. The advantages of formal proofs over other modes of testing were already written about, though, so it is unclear what you are trying to add. Perhaps you want to clarify?

shepherdjerred 4 days ago | parent | prev | next [-]

A good type system helps with this quite a lot

crazygringo 4 days ago | parent [-]

It helps some. There are plenty of errors, a large majority, I'd say, where types don't help at all. Types don't free up memory or avoid off-by-one errors or keep you from mixing up two counter variables.
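
For instance, both of these bugs type-check cleanly (illustrative Python, not from the thread):

    def count_vowels_and_spaces(text: str) -> tuple[int, int]:
        vowels, spaces = 0, 0
        for ch in text:
            if ch in "aeiou":
                vowels += 1
            elif ch == " ":
                vowels += 1  # mixed-up counter: should be spaces += 1
        return vowels, spaces

    def last_index(items: list[str]) -> int:
        return len(items)  # off-by-one: should be len(items) - 1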

anthonypasq 4 days ago | parent | prev | next [-]

If your tests cover the acceptance criteria as defined in the ticket, why is all that other stuff necessary?

sunsetMurk 4 days ago | parent | next [-]

Acceptance criteria are often buggy themselves, and they require more context to interpret and develop a solution from.

otterley 4 days ago | parent [-]

If you don't have sufficiently detailed acceptance criteria, how can anyone be expected to write code to satisfy them?

That's why you have to start with specifications. See, e.g., https://martinfowler.com/articles/exploring-gen-ai/sdd-3-too...

9rx 4 days ago | parent [-]

I wonder how many more times we'll rebrand TDD (BDD, SDD)?

Just 23 more times? ADD, CDD, EDD, DDD, etc.

Or maybe more?! AADD, ABDD, ACDD, ..., AAADD, AABDD, etc.

pydry 4 days ago | parent | next [-]

BDD is different; it's a way of gathering requirements.

SDD, as it stands, is some sort of AI nonsense.

9rx 3 days ago | parent | next [-]

BDD was trying to recapture what TDD was originally, renamed from TDD in an effort to shed all the confusion that surrounded TDD. Of course, BDD picked up all of its own confusion (e.g. Gherkin/Cucumber and all that ridiculousness). So now it is rebranded as SDD to try and shed all of that confusion, with a sprinkle of "AI" because why not. Of course, SDD already is clouded in its own confusion.

Testing is the least understood aspect of computer science and it turns out that you cannot keep changing the name and expect everyone to suddenly get it. But that won't stop anyone. We patiently await the next rebrand.

otterley 4 days ago | parent | prev [-]

Developers who aren't yet using AI would benefit from specs as well. They're good to have whether it's you or an LLM that's writing code. As a general rule, the clearer and less ambiguous the criteria you have, the better.

layer8 4 days ago | parent | prev | next [-]

If your acceptance criteria state something like “produces output f(x) for any input x, where f(x) is defined as follows: […]”, then you can’t possibly test that, because you can’t test all possible values of x. And if the criteria don’t state that, then they don’t cover the full specification of how the software is expected to behave, and hence you have to go beyond those criteria to ensure that the software always behaves as expected.

You can’t prove that something is correct by example. Examples can only disprove correctness. And tests are always only examples.
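
A toy illustration: this function is intended to compute f(x) = |x| for all x, and the example-based tests below all pass, yet it is wrong for most small negative inputs (hypothetical code):

    def f(x: int) -> int:
        # Intended: f(x) = abs(x) for every integer x.
        return x if x > -5 else -x  # wrong for -4 <= x <= -1

    assert f(0) == 0
    assert f(7) == 7
    assert f(-10) == 10
    # All three examples pass, but f(-2) == -2, not 2.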

Yodel0914 4 days ago | parent | prev [-]

Because AC don’t cover non-functional things like maintainability/understandability, adherence to corporate/team standards, etc.

simianwords 4 days ago | parent | prev | next [-]

I would like to challenge this claim. I think LLMs are maybe accurate enough that we don't need to check every line and remember everything. High-level design is enough.

abathur 4 days ago | parent | next [-]

I've been tasked with doing a very superficial review of a codebase produced, with the assistance of a well-known agent, by an adult who purports to have decades of database/backend experience.

While skimming tests for the Python backend, I spotted the following:

    @patch.dict(os.environ, {"ENVIRONMENT": "production"})
    def test_settings_environment_from_env(self) -> None:
        """Test environment setting from env var."""
        from importlib import reload

        import app.config

        reload(app.config)

        # Settings should use env var
        assert os.environ.get("ENVIRONMENT") == "production"

This isn't an outlier. There are smells everywhere.
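
The assert above only checks the patched environment variable itself, so it passes regardless of what app.config does. Presumably it was meant to check that the reloaded settings picked the value up, something like this (I'm guessing at the settings object and its environment attribute):

    @patch.dict(os.environ, {"ENVIRONMENT": "production"})
    def test_settings_environment_from_env(self) -> None:
        """Test environment setting from env var."""
        from importlib import reload

        import app.config

        reload(app.config)

        # Assert on behavior: the settings object reflects the env var.
        assert app.config.settings.environment == "production"
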
simianwords 3 days ago | parent [-]

If it is so obvious to you that there is a smell here, then an agent would have caught it. Try it yourself.

stuffn 4 days ago | parent | prev | next [-]

I have plenty of anecdata that counters your anecdata.

LLMs can generate code that works. That much is true. You can generate sufficiently complex projects that simply run on the first (or second) try. You can even get the LLM to write tests for the code. You can prompt it for 100% test coverage and it will provide you exactly what you want.

But that doesn't mean OP isn't correct. First, you shouldn't be remembering everything. If you find yourself remembering everything, your project is either small (I'd guess less than 1000 lines) or you are overburdened and need help. Reasoning logically through code you write can be done JIT, as you're writing the code. LLMs even suffer from the same problem; instead of calling it "having to remember too much", we refer to it as a quantity called the "context window". The only problem is the LLM won't prompt you to tell you that its context window is so full it can't do its job properly. A human will.

I think an engineer should always be reasoning about their code. They should be especially suspicious of LLM-generated code. Maybe I'm alone, but if I use an LLM to generate code I will review it and typically end up modifying it. I find that even prompting with something like "the code you write should be maintainable by other engineers" doesn't produce good results.

newsoftheday 4 days ago | parent | prev [-]

My jaw hit the table when I read that. Just checking here, but are you being serious?

simianwords 3 days ago | parent [-]

I absolutely believe this and follow what I said, to an extent. You don't need to triple-check every line of code and deeply understand what it has done - just the high-level design.

I usually skim through the code (spotting issues like: are they using a modern version of the language?), check the high-level design (which interfaces there are), and do manual testing. That is more than enough.

user34283 4 days ago | parent | prev [-]

I'd go further and say vibe coding it up, testing the green case, and deploying it straight into the testing environment is good enough.

The rest we can figure out during testing, or maybe you even have users willing to beta-test for you.

This way, while you're still on the understanding-and-reasoning-over-the-code part, your competitor has already shipped ten features, most of them working.

Ok, that was a provocative scenario. Still, nowadays I am not sure you even have to understand the code anymore. Maybe having a reasonable belief that it works will be sufficient in some circumstances.

TheTxT 4 days ago | parent | next [-]

This approach sounds like a great way to get a lot of security holes into your code. Maybe your competitors will be faster at first, but it’s probably better to be a bit slower and not leak all your users’ data.

user34283 4 days ago | parent [-]

I'm mostly thinking about the frontend.

If I had a backend API that was serving user data, I'd of course check more carefully.

This kind of mistake always seemed amateurish to me.

TheTxT 4 days ago | parent [-]

Fair enough. I would still personally feel uneasy about it, but I guess it’s alright if it works for others.

doganugurlu 4 days ago | parent | prev [-]

How often do you buy stuff that doesn't work and are OK with the provider telling you "we had a reasonable belief that it worked"?

How are we supposed to use software in healthcare, defense, transportation if that's the bar?

user34283 4 days ago | parent [-]

There's a lot of functionality in the frontend I'm building that I did not review. If it worked in testing, that's good enough.

You're free to review every line the model produces. Not every project is in healthcare or defense, and sometimes different standards apply.

doganugurlu 4 days ago | parent [-]

I’m assuming you work in a setting where there is a QA team?

I haven’t been in such a setting since 2008, so you can ignore everything I said.

But I wouldn’t want to be somewhere where people don’t test their code, and where I have to write code on top of code that is never tested until the QA cycle.

user34283 3 days ago | parent [-]

No, in my day job I obsess over every line I add, although there is QA.

In my side project I'm building a frontend that, according to me, is the best-looking and most feature-rich option out there.

I find that I'm making great progress with it, even though I don't know every line in the project. I understand the architecture and roughly where each piece of functionality is located, and that is good enough for me.

If in testing I see issues with some functionality, I can first ask the model to summarize the implementation. I can then come up with a better approach and have the model make the change. Or, alternatively, I edit some values myself. So far it hasn't often been the case that I've felt the need to write more than a few lines of code manually.