| ▲ | Hypothesis: Property-Based Testing for Python(hypothesis.readthedocs.io) |
| 224 points by lwhsiao a day ago | 153 comments |
| |
|
| ▲ | rmunn a day ago | parent | next [-] |
| I love property-based testing, especially the way it can uncover edge cases you wouldn't have thought about. Haven't used Hypothesis yet, but I once had FsCheck (property-based testing for F#) find a case where the data structure I was writing failed when there were exactly 24 items in the list and you tried to append a 25th. That was a test case I wouldn't have thought to write on my own, but the particular number (it was always the 25th item that failed) quickly led me to find the bug. Once my property tests were running overnight and not finding any failures after thousands and thousands of random cases, I started to feel a lot more confident that I'd nailed down the bugs. |
| |
▲ | rmunn 11 hours ago | parent | next [-] | | I can no longer edit the parent comment, but I want to address one thing: > Once my property tests were running overnight This was happening because I was setting my test runner to say "Keep generating random data and running tests for 8 hours, no matter how many tests that is", not because the tests were slow. I was running literally millions of tests overnight (usually about 2-3 million), though later when I added some slow tests that number went down into the tens of thousands of tests overnight. I may have given the impression that property-based testing is slow. It doesn't have to be. It can be very, very quick. But the more random tests you run, the more likely you are to uncover the really, REALLY bizarre edge cases. | |
▲ | tombert a day ago | parent | prev | next [-] | | I had a similar thing, with F# as well actually. We had some code that used a square root, and in some very rare circumstances, we could get a negative number, which would throw an exception. I don't think I would have even considered that possibility if FsCheck hadn't generated it. | |
| ▲ | dkarl 17 hours ago | parent | prev | next [-] | | > Once my property tests were running overnight and not finding any failures Quickly addressing the time it takes to run property-based tests, especially in Python: it's extremely helpful to make sure that running subsets of unit tests is convenient and documented and all your developers are doing it. Otherwise they may revolt against property-based testing because of the time it takes to run tests. Again, especially in Python. | | |
| ▲ | rmunn 11 hours ago | parent [-] | | I may have given a misleading impression. Each property test took milliseconds to run, and FsCheck defaults to generating 100 random inputs for each test. Running the whole test suite took 5-10 minutes depending on whether I ran the longer tests or skipped them (the tests that generated very large lists, then split and concatenated them several times, took longer than the rest of the test suite combined). What I was doing that ran overnight was the stress testing feature of the Expecto test runner (https://github.com/haf/expecto?tab=readme-ov-file#stress-tes...), where instead of running 100 tests for each property test you define, it keeps on generating random input and running tests for a fixed length of time. I would set it to run for 8 hours then go to bed. In the morning I would look at the millions (literally millions, usually between 2-3 million) of random tests that had been run, all of which were passing, and say "Yep, I probably don't have any bugs left". | | |
| ▲ | dkarl 9 hours ago | parent [-] | | That's pretty cool, and now I'm curious if there's something similar for ScalaCheck. My comment comes from my own experience, though, introducing Hypothesis and ScalaCheck into codebases and quickly causing noticeable increases in unit test times. I think the additional runtime for tests is undoubtedly worth it, but maybe not a good trade-off when people are used to running unit tests several times an hour as part of their development cycle. To avoid people saying, "Running four minutes of tests five times per hour is ruining my flow and productivity," I make sure they have a script or command to run a subset of basic, less comprehensive tests, or to only run the tests relevant to the changes they've made. | | |
▲ | rmunn 9 hours ago | parent [-] | | Or a watch command that runs tests in the background on save, and an IDE setting to flag code when the watched tests produce a failure. Get used to that, and it's not even a matter of stopping to run the tests: they run every time you hit Ctrl-S, and you just keep on typing — and every so often the IDE notifies you of a failed test. The drawback is that you might get used to saying "Well, of course the tests failed, I'm in the middle of refactoring and haven't finished yet" and ignoring the failed-test notifications. But just as I ignore typecheck errors until I finish typing and then check whether they're still there, you probably won't end up ignoring test failures all the time. Though having a five-minute lag time between "I'm finished typing, now those test errors should go away" and having them actually go away might be disconcerting. | |
| ▲ | eru 8 hours ago | parent [-] | | You can also set things up so that you only run 10 examples per test when doing a quick check during development, but your CI runs the full 200 examples per test (or even more). |
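A minimal sketch of that dev/CI split using Hypothesis settings profiles, typically placed in conftest.py. The profile names, example counts, and environment variable below are illustrative choices, not something from the thread:

    import os
    from hypothesis import settings

    # A small, fast profile for local development and a more thorough one for CI.
    settings.register_profile("dev", max_examples=10)
    settings.register_profile("ci", max_examples=200)

    # Pick the profile via an environment variable; default to the fast one.
    settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))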
|
|
|
| |
▲ | bunderbunder 20 hours ago | parent | prev | next [-] | | I love them for this, too. Sadly I have a really hard time getting teammates to agree to using property-based testing - or letting me use it - because they take "no non-deterministic tests" as ironclad dogma without really understanding the principle's real intent. (I can do it to find edge cases to convert to deterministic unit tests in the privacy of my own home, of course. But not being able to commit a library of useful strategies without people calling foul in code review is obnoxious.) | |
▲ | iLemming 19 hours ago | parent | next [-] | | > "no non-deterministic tests" These tests can be made deterministic. Shrinking to the minimal reproducing case is always deterministic. Also, you can [re]start the tests with a specific seed - you typically log the seed, so you can replay the exact failure deterministically - prop-based tests are stochastic by design but reproducible by capture. | |
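For example, a sketch of pinning a logged seed in Hypothesis (the seed value and the encode/decode pair are hypothetical stand-ins, not anything from the comment):

    from hypothesis import given, seed, strategies as st

    @seed(20251106)        # the seed Hypothesis reported for the failing run
    @given(st.text())
    def test_roundtrip_replay(s):
        # encode/decode are hypothetical functions under test
        assert decode(encode(s)) == s

Hypothesis also offers settings(derandomize=True) for fully deterministic runs, and a @reproduce_failure decorator for replaying a specific failing example.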
| ▲ | dd82 18 hours ago | parent [-] | | correct, I've done this at past places to verify rules engine output for variety of inputs. |
| |
| ▲ | eru 8 hours ago | parent | prev [-] | | You could just hard code a PRNG seed, and then the test would trivially be deterministic? I'm not sure what they say their objection is, is actually their real objection? |
| |
| ▲ | teiferer a day ago | parent | prev [-] | | That example caught my attention. What was it in your code that made length 24 special? | | |
▲ | rmunn 21 hours ago | parent [-] | | I was trying out an implementation of Relaxed Radix-Balanced Trees, a data structure for immutable lists that allows splitting and joining them in constant* time. In a normal RRB Tree implementation, each leaf in the tree has up to 32 items in it, and a tree node that's not a leaf has up to 32 pointers to leaves or to lower tree nodes. There's also a tail, a single array of up to 32 items which, when full, will be promoted to a leaf and a new empty tail array created. (Or the full tail will be promoted the next time an item is appended and a new tail of length 1 will be created). For my implementation, I was using nodes of size 4 rather than size 32, because that made testing much easier; then after testing, I would switch the branching constant to 32 and run final tests. With a branching factor of 4, that means that a tree of size 24 contained one root node with two children, a left child that was a full tree of 16 (a tree node pointing to four leaf nodes, all of them full of 4 items), and a right child that was a leaf node of size 4. That makes 20 items in the tree, plus a full tail array of 4 making 24 items. The failure came when the next item was appended. Now, if you've never implemented an RRB tree you won't have noticed the mistake. But in the previous paragraph, when I said that the right child of the root was a leaf? That was the mistake; in an RRB tree all leaves are supposed to be at the same level, and the right child of the root should have been a one-child node, whose child was a leaf of size 4. My code that promoted the root up one level when it was full (when the tree was size 20, a full 16-item tree plus a tail of 4) was faulty, and had created a new root with the previous root as the left child and the previous tail as the right child, instead of creating a one-child node as the right child of the new root with the 4-item leaf as its child. The tree still functioned for all read-only purposes (traversal, lookups, etc) and even for appends, as long as the appends did no more than fill the tail. Once the tail was full (making a total of 24 items stored in the structure) and I needed to append a 25th item, the tail-promotion code found an illegal tree state and did the wrong thing. I no longer remember which wrong thing it did, just that once that 25th item was appended, the resulting tree no longer passed all the property checks. * Constant for all practical purposes. In fact it's O(log32 N), but since log32 N is no more than 7 when N is 2^32, the operation is bounded by a constant. So it's O(1) for all practical purposes: most lists of any reasonable size will translate into trees whose height is no more than 2 or 3, which means only 2 or 3 nodes will need to be split when splitting the entire list (and the RRB tree structure that models the list) into two parts. And likewise for joining two RRB structures together (a list append operation): there might be up to O(H) merged nodes where H is the height of the tree, which again is no more than log32 N for a tree of branching factor 32.
|
|
|
| ▲ | NortySpock a day ago | parent | prev | next [-] |
I keep thinking I have a possible use case for property-based testing, and then I am up to my armpits in trying to understand the on-the-ground problem and don't feel like I have time to learn a DSL for describing all possible inputs and outputs when I already had an existing function (the subject-under-test) that I don't understand. So rather than try to learn two black boxes at the same time, I fall back to "several more unit tests to document more edge cases to defensibly guard against". Is there some simple way to describe this defensive programming iteration pattern in Hypothesis? Normally we just null-check and return early and have to deal with the early-return case. How do I quickly write property tests to check that my code handles the most obvious edge cases?
| |
▲ | eru a day ago | parent | next [-] | | In addition to what other people have said: > [...] time to learn a DSL for describing all possible inputs and outputs when I already had an existing function [...] You don't have to describe all possible inputs and outputs. Even just being able to describe some classes of inputs can be useful. As a really simple example: many example-based tests have some values that are arbitrary and the test shouldn't care about them, like eg employee names when you are populating a database or whatever. Instead of just hard-coding 'foo' and 'bar', you can have hypothesis create arbitrary values there. Just like learning how to write (unit) testable code is a skill that needs to be learned, learning how to write property-testable code is also a skill that needs practice. What's less obvious: retro-fitting property-based tests on an existing codebase with existing example-based tests is almost a separate skill. It's harder than writing your code with property based tests in mind. --- Some common properties to test: * Your code doesn't crash on random inputs (or only throws a short whitelist of allowed exceptions). * Applying a specific functionality should be idempotent, ie doing that operation multiple times should give the same results as applying it only once. * Order of input doesn't matter (for some functionality) * Testing your prod implementation against a simpler implementation, that's perhaps too slow for prod or only works on a restricted subset of the real problem. The reference implementation doesn't even have to be simpler: just having a different approach is often enough. | |
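As an illustration of the idempotence property from that list, a minimal Hypothesis sketch (normalize_whitespace is a made-up example function, not anything from the comment):

    from hypothesis import given, strategies as st

    def normalize_whitespace(s: str) -> str:
        # collapse runs of whitespace and trim the ends
        return " ".join(s.split())

    @given(st.text())
    def test_normalize_is_idempotent(s):
        once = normalize_whitespace(s)
        # applying the operation twice must give the same result as applying it once
        assert normalize_whitespace(once) == once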
▲ | wodenokoto a day ago | parent [-] | | But let's say employee names fail on apostrophe. Won't you just have a unit test that sometimes fails, but only when the testing tool randomly happens to add an apostrophe in the employee name? | |
| ▲ | IanCal a day ago | parent | next [-] | | Hypothesis keeps a database of failures to use locally and you can add a decorator to mark a specific case that failed. So you run it, see the failure, add it as a specific case and then that’s committed to the codebase. The randomness can bite a little if that test failure happens on an unrelated branch, but it’s not much different to someone just discovering a bug. edit - here's the relevant part of the hypothesis guide https://hypothesis.readthedocs.io/en/latest/tutorial/replayi... | | |
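A sketch of what pinning a found failure looks like, picking up the apostrophe scenario from this thread (save_employee is a hypothetical function under test):

    from hypothesis import example, given, strategies as st

    @given(st.text(min_size=1))
    @example("O'Brien")   # the counterexample Hypothesis once found, now checked on every run
    def test_employee_name_roundtrips(name):
        # save_employee stands in for whatever persists and returns the record
        assert save_employee(name).name == name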
| ▲ | theptip 19 hours ago | parent [-] | | You can also cache the DB across CI runs, which will reduce the randomness (ie failures won’t just disappear between runs). |
| |
| ▲ | movpasd a day ago | parent | prev | next [-] | | You can either use the @example decorator to force Hypothesis to check an edge case you've thought of, or just let Hypothesis uncover the edge cases itself. Hypothesis won't fail a test once and then pass it next time, it keeps track of which examples failed and will re-run them. The generated inputs aren't uniformly randomly distributed and will tend to check pathological cases (complex symbols, NaNs, etc) with priority. You shouldn't think of Hypothesis as a random input generator but as an abstraction over thinking about the input cases. It's not perfect: you'll often need to .map() to get the distribution to reflect the usage of the interface being tested and that requires some knowledge of the shrinking behaviour. However, I was really surprised how easy it was to use. | |
▲ | jcattle a day ago | parent | prev | next [-] | | As far as I remember, hypothesis tests smartly, which means that possibly problematic strings are tested first. It then narrows down which exact part of the tested strings caused the failure. So it might as well just throw the kitchen sink at the function. If it handles that: great. If not: that string will get narrowed down until you arrive at a minimal set of failing inputs. | |
| ▲ | dragonwriter 17 hours ago | parent | prev | next [-] | | > But let's say employee names fail on apostrophe. Won't you just have a unit test that sometimes fail, but only when the testing tool randomly happens to add an apostrophe in the employee name? If you just naively treat it as a string and let hypothesis generate values, sure. Which is better than if you are doing traditional explicit unit testing and haven’t explicitly defined apostrophes as a concern. If you do have it (or special characters more generally) as a concern, that changes how you specify your test. | |
| ▲ | chriswarbo 11 hours ago | parent | prev | next [-] | | Hypothesis is a search algorithm for finding ways that we have misunderstood our code (whether that be in the codebase, or the claims we have made about it in the spec/tests). Conceptually, that's the inverse of "a unit test that sometimes fails": it's a search that sometimes succeeds at finding out that we're wrong. That's infinity% more effective than a search procedure which is hard-coded to check a single example; especially when that example is chosen by the same dev (with the same misconceptions!) that implemented the code-under-test! Now, as for what would actually happen in that situation, when using Hypothesis: - As others have indicated, Hypothesis keeps a database of the failures it's found. Committing that database to your project repo would act like a ratchet: once they've successfully found a problem, they will continue to do so (until it's fixed). - If you don't want to commit your database (to avoid churn/conflicts), then the same would happen per working-copy. This is fine for dev machines, where repo directories can live for a long time; though not so useful for CI, if the working-copy gets nuked after each run. - Even if we disable the failure database, property failures will always show the inputs which caused it (the "counterexample"), along with the random seed used to generate it. Either of those is enough to reproduce that exact run, giving us a ready-made regression test we can copy/paste to make the failure permanent (until it's fixed). As others have said, Hypothesis properties can be decorated with `@example(foo)` to ensure a particular input is always checked. - Even if we disable the failure database, and don't copy the counterexample as a regression test, property checkers like Hypothesis will still "shrink" any counterexamples they find. Your example of apostrophes causing problems is trivial for Hypothesis to shrink, so if/when it happens to find some counterexample, it will always manage to shrink that down to the string "'"; essentially telling us that "employee names fail on apostrophe", which we could either fix right now, or stick in our bug tracker, or whatever. | |
▲ | bluGill a day ago | parent | prev | next [-] | | If you know it will fail on apostrophe you should have a specific test for that. However, if that detail is buried in some function 3 levels deep that you don't even realize is used, you wouldn't write the test or handle it even though it matters. This should find those issues too. | |
| ▲ | vlovich123 a day ago | parent | prev | next [-] | | Either your code shouldn’t fail or the apostrophe isn’t a valid case. In the former, hypothesis and other similar frameworks are deterministic and will replay the failing test on request or remember the failing tests in a file to rerun in the future to catch regressions. In the latter, you just tell the framework to not generate such values or at least to skip those test cases (better to not generate in terms of testing performance). | | |
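Two sketches of the "skip it" versus "don't generate it" options being described, assuming a hypothetical normalize_name function and treating apostrophes as invalid input:

    from hypothesis import assume, given, strategies as st

    @given(st.text())
    def test_names_skip_style(name):
        assume("'" not in name)              # discard generated cases we consider invalid
        assert normalize_name(name) is not None

    # Alternative: constrain the strategy so rejection happens at generation time.
    names_without_apostrophes = st.text().filter(lambda s: "'" not in s)

    @given(names_without_apostrophes)
    def test_names_strategy_style(name):
        assert normalize_name(name) is not None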
| ▲ | reverius42 a day ago | parent [-] | | I think what they meant is, "won't Hypothesis sometimes fail to generate input with an apostrophe, thus giving you false confidence that your code can handle apostrophes?" I think the answer to this is, in practice, it will not fail to generate such input. My understanding is that it's pretty good at mutating input to cover a large amount of surface area with as few as possible examples. | | |
| ▲ | eru a day ago | parent [-] | | Hypothesis is pretty good, but it's not magic. There's only so many corner cases it can cover in the 200 (or so) cases per tests it's running by default. But by default you also start with a new random seed every time you run the tests, so you can build up more confidence over the older tests and older code, even if you haven't done anything specifically to address this problem. Also, even with Hypothesis you can and should still write specific tests or even just specific generators to cover specific classes of corners cases you are worried about in more detail. | | |
| ▲ | thunky a day ago | parent [-] | | > But by default you also start with a new random seed every time you run the tests, so you can build up more confidence over the older tests and older code Is it common practice to use the same seed and run a ton of tests until you're satisfied it tested it thoroughly? Because I think I would prefer that. With non-deterministic tests I would always wonder if it's going to fail randomly after the code is already in production. | | |
| ▲ | eru 8 hours ago | parent | next [-] | | You can think of property based tests as defining a vast 'universe' of tests. Like 'for all strings S and T, we should have to_upper(S) + to_upper(T) == to_upper(S+T)' or something like that. That defines an infinite set of individual test cases: each choice for S and T gives you a different test case. Running all of these tests would take too long. So instead we take a finite sample from our universe, and only run these. > With non-deterministic tests I would always wonder if it's going to fail randomly after the code is already in production. You could always take the same sample, of course. But that means you only ever explore a very small fraction of that universe. So it's more likely you miss something. Remember: closing your eyes doesn't make the tiger go away. If there are important cases you want to have checked every time, you can use the @example decorator in Hypothesis, or you can just write a traditional example based test. | |
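That example property, written out as a Hypothesis test that samples (S, T) pairs from the universe being described:

    from hypothesis import given, strategies as st

    @given(st.text(), st.text())
    def test_upper_distributes_over_concatenation(s, t):
        # each run checks a finite sample of (S, T) pairs from the infinite universe
        assert (s + t).upper() == s.upper() + t.upper()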
| ▲ | kobebrookskC3 8 hours ago | parent | prev | next [-] | | > With non-deterministic tests I would always wonder if it's going to fail randomly after the code is already in production. if you didn't use property-based testing, what are the odds you would've thought of the case? | |
▲ | bluGill 21 hours ago | parent | prev | next [-] | | The more the test runs, the less likely it is that there's an uncovered case left, so your confidence grows. Remember too that anything found before release is something a customer won't find. | |
| ▲ | vlovich123 12 hours ago | parent | prev | next [-] | | You should save the seeds so you can reproduce the issue. But you should let the seed float so that you test as many cases as possible over time. | |
| ▲ | nyrikki 19 hours ago | parent | prev [-] | | Saving the seed in the build artifacts/logs has saved a lot of time for me even with tools like faker. | | |
| ▲ | eru 8 hours ago | parent [-] | | Yes, you should note the seed you use for a run in the logs, but you should use a new seed each run (unless trying to reproduce a known bug) so you cover more of the search space. |
|
|
|
|
| |
| ▲ | Balinares a day ago | parent | prev [-] | | No, Hypothesis iterates on test failures to isolate the simplest input that triggers it, so that it can report it to you explicitly. |
|
| |
| ▲ | akshayshah a day ago | parent | prev | next [-] | | Sibling comments have already mentioned some common strategies - but if you have half an hour to spare, the property-based testing series on the F# for Fun and Profit blog is well worth your time. The material isn’t really specific to F#. https://fsharpforfunandprofit.com/series/property-based-test... | |
| ▲ | sunshowers a day ago | parent | prev | next [-] | | The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same. A more complex kind of PBT is if you have two implementations of an algorithm or data structure, one that's fast but tricky and the other one slow but easy to verify. (Say, quick sort vs bubble sort.) Generate data or operations randomly and ensure the results are the same. | | |
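A minimal sketch of that round-trip test with Hypothesis, using a recursive strategy for JSON-shaped values (floats are left out here to sidestep NaN comparisons; that choice is illustrative, not part of the comment):

    import json
    from hypothesis import given, strategies as st

    # None, bools, ints and text at the leaves; lists and string-keyed dicts as containers.
    json_values = st.recursive(
        st.none() | st.booleans() | st.integers() | st.text(),
        lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    )

    @given(json_values)
    def test_json_roundtrip(value):
        assert json.loads(json.dumps(value)) == value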
▲ | eru a day ago | parent [-] | | > The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same. Testing that f(g(x)) == x for all x and some f and g that are supposed to be inverses of each other is a good test, but it's probably not the simplest. The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly. For things like sorting, testing against an oracle is great. But even when you don't have an oracle, there's lots of other possibilities: * Test that sorting twice has the same effect as sorting once. * Start with a known already in-order input like [1, 2, 3, ..., n]; shuffle it, and then check that your sorting algorithm re-creates the original. * Check that the output of your sorting algorithm is in-order. * Check that input and output of your sorting algorithm have the same elements in the same multiplicity. (If you don't already have a data structure / algorithm that does this efficiently, you can probe it with more randomness: create a random input (say a list of numbers), pick a random number X, count how many times X appears in your list (via a linear scan); then check that you get the same count after sorting.) * Check that permuting your input doesn't make a difference. * Etc. | |
| ▲ | sunshowers a day ago | parent | next [-] | | Speaking for myself — those are definitely all simpler cases, but for me I never found them compelling enough (beyond the "it doesn't crash" property). For me, the simplest case that truly motivated PBT for me was roundtrip serialization. Now I use PBT quite a lot, and most of them are either serialization roundtrip or oracle/model-based tests. | | |
| ▲ | eru a day ago | parent [-] | | Oh, yes, I was just listing simple examples. I wasn't trying to find a use case that's compelling enough to make you want to get started. I got started out of curiosity, and because writing property based tests is a lot more fun than writing example based tests. |
| |
▲ | jgalt212 a day ago | parent | prev [-] | | > The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly. For this use case, we've found it best to just use a fuzzer, and work off the tracebacks. That being said, we have used hypothesis to test data validation and normalizing code to decent success. We use it on a one-off basis, when starting something new or making a big change. We don't run these tests every day. Also, I don't like how hypothesis integrates much better with pytest than unittest. | |
| ▲ | eru 9 hours ago | parent [-] | | > For this use case, we've found it best to just use a fuzzer, and work off the tracebacks. Yes, if that was the only property you want to test, a dedicated fuzzer is probably better. But if you are writing a bunch of property based tests, you can add a basic "this doesn't crash in unexpected ways" property nearly for free. |
|
|
| |
| ▲ | bunderbunder 20 hours ago | parent | prev | next [-] | | For working with legacy systems I tend to start with Approval Tests (https://approvaltests.com/). Because they don't require me to understand the SUT very well before I can get started with them, and because they help me start to understand it more quickly. | |
| ▲ | disambiguation a day ago | parent | prev | next [-] | | I've only used it once before, not as unit testing, but as stress testing for a new customer facing api. I wanted to say with confidence "this will never throw an NPE". Also the logic was so complex (and the deadline so short) the only reasonable way to test was to generate large amounts of output data and review it manually for anomalies. | |
| ▲ | dd82 18 hours ago | parent | prev | next [-] | | Essentially this is a good example of parametrized tests, just supercharged with generated inputs. So if you already have parametrized tests, you're already halfway there. | | |
▲ | eru 8 hours ago | parent [-] | | Yes, when I saw eg Golang people use table driven tests like this, I was wondering why nobody seems to have told them about generating these tables automatically... |
| |
▲ | meejah a day ago | parent | prev | next [-] | | Here are some fairly simple examples: testing port parsing https://github.com/meejah/fowl/blob/e8253467d7072cd05f21de7c... ...and https://github.com/magic-wormhole/magic-wormhole/blob/1b4732... The simplest ones to get started with are "strings", IMO, and also give you lots of mileage (because it'll definitely test some weird unicode). So, somewhere in your API where you take some user-entered strings -- even something "open ended" like "a name" -- you can make use of Hypothesis to try a few things. This has definitely uncovered unicode bugs for me. Some more complex things can be made with some custom strategies. The most-Hypothesis-heavy tests I've personally worked with are from Magic Folder strategies: https://github.com/tahoe-lafs/magic-folder/blob/main/src/mag... The only real downside is that a Hypothesis-heavy test-suite like the above can take a while to run (but you can instruct it to only produce one example per test). Obviously, one example per test won't catch everything, but it is way faster when developing, and Hypothesis remembers "bad" examples so if you occasionally do a longer run it'll remember things that caused errors before. | |
| ▲ | chriswarbo 10 hours ago | parent | prev | next [-] | | When I'm "up to my armpits in trying to understand the on-the-ground problem", I find PBT great for quickly find mistakes in the assumptions/intuitions I'm making about surrounding code and helper functions. Whenever I find myself thinking "WTF? Surely ABC does XYZ?", and the code for ABC isn't immediately obvious, then I'll bang-out an `ABC_does_XYZ` property and see if I'm wrong. This can be much faster than trying to think up "good" examples to check, especially when I'm not familiar with the domain model, and the relevant values would be giant nested things. I'll let the computer have a go first. | |
| ▲ | fwip a day ago | parent | prev [-] | | I think the easiest way is to start with general properties and general input, and tighten them up as needed. The property might just be "doesn't throw an exception", in some cases. If you find yourself writing several edge cases manually with a common test logic, I think the @example decorator in Hypothesis is a quick way to do that: https://hypothesis.readthedocs.io/en/latest/reference/api.ht... | | |
| ▲ | NortySpock 17 hours ago | parent [-] | | Thanks, the "does not throw an exception" property got my mental gears turning in terms of how to get started on this, and from there I can see how one could add a few more properties as one goes along. Appreciate you taking the time to answer. | | |
▲ | eru 8 hours ago | parent [-] | | Yes. One of the first things you have to do when writing any property based tests, no matter what you actually want to test, is defining your input generators. And once you have those generators, you might as well throw together the "doesn't crash unexpectedly" test. That not only tests your code, but also exercises your just-written input generators. | |
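A sketch of that starter test, assuming a hypothetical parse_config function and an input strategy (config_texts) you have already written for it:

    from hypothesis import given

    @given(config_texts())   # config_texts is your own, already-written input strategy
    def test_parse_config_does_not_crash(text):
        try:
            parse_config(text)
        except ValueError:
            pass             # the one failure mode we treat as expected on bad input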
|
|
|
|
| ▲ | ed_blackburn a day ago | parent | prev | next [-] |
I'm using sqlglot to parse hundreds of old MySQL backup files to find diffs of schemas. The joys of legacy code. I've found hypothesis to be super helpful for tightening up my parser. I've identified properties (invariants) and built some strategies. I can now generate many more permutations of DDL than I'd thought of before. And I have absolute confidence in what I'm running. I started off with TDD and covered the basics. Learned what I needed to learn about the files I'm dealing with, edge cases, sqlglot, and then I moved on to Hypothesis for extra confidence. I'm curious to see if it'll help with commands for APIs. If nothing else, it'll help me appreciate how liberal my API is when perhaps I don't want it to be.
| |
| ▲ | groundhogstate a day ago | parent [-] | | You might find schemathesis useful for API testing (which IIRC is built on Hypothesis). Certainly helped me find a stack of unhandled edge cases. YMMV, but once I first set up Schemathesis on one of my MVPs, it took several hours to iron out the kinks. But thereafter, if built into CI, I've found it guards against regression quite effectively. | | |
| ▲ | IanCal a day ago | parent | next [-] | | https://schemathesis.readthedocs.io/en/stable/ This looks really good. I like using hypothesis for APIs so this is a really nice layer on top, thanks for pointing this one out. | |
▲ | agumonkey 19 hours ago | parent | prev [-] | | pretty nice tool, helped us poke into our app in ways we didn't foresee. Also pretty happy to see the reports of thousands of API calls that matched the expected responses; that increases the confidence you have in your system | |
|
|
|
| ▲ | benrutter a day ago | parent | prev | next [-] |
| I love the idea of hypothesis! Haven't found a lot of use cases for it yet, I think the quick start example helps explain why. Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all. Obviously it's just a tiny example, but in general I often find writing a test that will cover every hypothetical case isn't realistic. Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it? |
| |
▲ | Doxin 38 minutes ago | parent | next [-] | | I've used hypothesis in practice for testing a slugify function. When someone signs up for our product our software produces a subdomain based on the organisation name you input. People put all sorts of weird things in that field, and we want to be pretty broad in what we'll accept. The hypothesis test for this is near trivial. Pass some text to the slugify function, and check that it either a) throws an invalid input error (e.g. input too short etc) or b) returns a string that's a valid domain name. Doing this found me so many odd little edge cases around length truncation and idna encoding. The one bug I would've never found myself is the fact that lowercasing "ß" turns it into "ss" which broke truncation, resulting in an invalid domain only if it includes ß and is exactly 64 characters long. hypothesis is a pretty niche tool, but boy when it's the right tool it's the right tool. | |
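A sketch of what that property might look like in Hypothesis (slugify, InvalidInput, and the 63-character label limit are assumptions for illustration, not the poster's actual code):

    import re
    from hypothesis import given, strategies as st

    # A valid DNS label: 1-63 chars, lowercase alphanumerics and hyphens, no hyphen at the ends.
    LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")

    @given(st.text())
    def test_slugify_rejects_or_returns_valid_label(org_name):
        try:
            slug = slugify(org_name)
        except InvalidInput:
            return                       # rejecting the input is an acceptable outcome
        assert LABEL_RE.fullmatch(slug)  # otherwise the result must be a valid label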
| ▲ | masklinn a day ago | parent | prev | next [-] | | What you're talking about is using an oracle (a different implementation of what you're testing), it's an option for property (or exhaustive) testing but is by no means a requirement. Even for a sort function there are plenty of properties you can check without needing an oracle e.g. - that the output sequence is the same length as the input - that the output sequence is sorted - that the population counts are the same in the input and the output > In real life, if you had that you probably wouldn't have written the function "my_sort" at all. Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate. | | |
| ▲ | eru 8 hours ago | parent [-] | | > Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate. The opposite might also be true. Suppose you already have a specialised implementation, and you want to write a new generalised one. You can still test them against each other. Eg suppose you are writing a library that supports sorting crazy huge datasets that don't fit into memory. You can still check that it gives the same answers as the built-in sorting algorithm from the standard library on input that's small enough to fit into memory. |
| |
▲ | arnsholt a day ago | parent | prev | next [-] | | I've only used PBT a few times, but when it fits it's been extremely useful. A concrete example from my practice of what has been pointed out in this thread, that you want to test properties of your function's output rather than the output itself: I was implementing a fairly involved algorithm (the Zhang-Shasha edit distance algorithm for ordered trees), and PBT was extremely useful in weeding out bugs. What I did was write a function that generated random tree structures in the form I needed for my code, and tested the four properties that all distance functions should have: 1. d(x, x) = 0 for all x
2. d(x, y) >= 0 for all x, y
3. d(x, y) = d(y, x) for all x, y
4. d(x, z) <= d(x, y) + d(y, z) for all x, y, z (the triangle inequality) Especially the triangle inequality check weeded out some tricky corner cases I probably wouldn't have figured out on my own. Some will object that you're not guaranteed to find bugs with this kind of random-generation strategy, but if you blast through a few thousand cases every time you run your test suite, and the odd overnight run testing a few million, you quickly get fairly confident that the properties you test actually hold. Of course any counterexamples the PBT finds should get lifted to regression tests in addition, to make sure they're always caught if they crop up again. And as with any testing approach, there are no guarantees, but it adds a nice layer of defense in depth IMO. | |
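In Hypothesis terms the same checks might look roughly like this, with random_trees and tree_distance standing in for the poster's own tree generator and Zhang-Shasha implementation:

    from hypothesis import given

    @given(random_trees(), random_trees(), random_trees())
    def test_distance_is_a_metric(x, y, z):
        assert tree_distance(x, x) == 0                     # identity
        assert tree_distance(x, y) >= 0                     # non-negativity
        assert tree_distance(x, y) == tree_distance(y, x)   # symmetry
        # triangle inequality: going via y is never a shortcut
        assert tree_distance(x, z) <= tree_distance(x, y) + tree_distance(y, z)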
▲ | aflukasz a day ago | parent | prev | next [-] | | > Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it? One example would be when you have a naive implementation of some algorithm and you want to introduce a second one, optimized but with a much more complex implementation. Then this naive one will act as a model for comparisons. Another case that comes to mind is when you have rather simple properties to test (like: does it finish without a crash within a given time? does it stay within some boundaries on the output?), and want to easily run over a sensible set of varying inputs. | |
| ▲ | sevensor a day ago | parent | prev | next [-] | | I use hypothesis to test that the happy path is really happy. I use it to define the set of expected inputs for a function, and it looks for the nastiest possible sets of inputs. It’s kind of the opposite of how I write ordinary tests, where I pick challenging but representative examples. Often I’m just checking some basic property, like “function f is the inverse of function g” for round tripping data through a different representation. Sometimes it’s just “function f does not divide by zero.”. Hypothesis seems to have a library of corner cases that it checks first. If you don’t say you can’t handle floats above 2^53, it will try 2^53, for instance. (That’s the threshold where doubles stop being able to represent every integer.) Often the result of running hypothesis is not a code change but a more precise description of what inputs are valid. | |
▲ | dragonwriter 17 hours ago | parent | prev | next [-] | | > Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all. You don't need that, that’s just a convenient shortcut when you do have a function that does that (but, I would argue, a not-great example for precisely that reason.) All you actually need is the ability to verify the property (or set of properties) that you are testing in the output. You could replace the test in that sample with something like this, which I think is a better illustration of what is going on (this is somewhat simplified; rather than just a length test, you really want to test that it has the same elements):

    @given(st.lists(st.integers() | st.floats(allow_nan=False)))  # NaN excluded: it breaks ordering comparisons
    def test_sort_correct(lst):
        result = my_sort(lst)
        # result has the same unique items as the original list
        assert set(result) == set(lst)
        # for each item, result has the same number of copies as the original
        assert all(
            result.count(item) == lst.count(item)
            for item in set(result)
        )
        # result is sorted
        assert all(result[n] <= result[n+1] for n in range(len(result)-1))
| |
▲ | thekoma a day ago | parent | prev | next [-] | | Indeed. For a sorting algorithm it would be more insightful to show it testing the actual property of something that is sorted: every consecutive element is larger than the previous one (or the other way around). You don’t need a sorting function to test the “sorted” property. | |
| ▲ | wongarsu a day ago | parent | prev | next [-] | | Imho it's just not a great example. A better example would be testing the invariants of the function. "For any list of numbers, my_sort does not throw an exception" is trivial to test. "For any list of numbers, in the list returned by my_sort the item at index n is smaller than the item at index n+1" would be another test. Those two probably capture the full specification of my_sort, and you don't need a sort function to test either. In a real-world example you would more likely be testing just some subset of the specification, those things that are easy to assert | | |
| ▲ | lgas a day ago | parent [-] | | Always returning the empty list meets your spec. | | |
▲ | wongarsu a day ago | parent [-] | | Good point. I suppose we should add "number of input elements equals number of output elements" and "every input element is present in the output". Translated into a straightforward test that still allows my_sort([1,1,2]) to return [1,2,2], but we have to draw the line somewhere. | |
| ▲ | travisjungroth 16 hours ago | parent | next [-] | | Just use Counter and if the objects aren’t hashable, use the count of IDs. Grab this before calling the function, in case the function is destructive. Check it against the output. Add in checking each item is less than or equal to its successor and you have the fundamental sort properties. You might have more, like stability. | |
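A sketch of that Counter-based check against the thread's hypothetical my_sort:

    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_my_sort_is_an_ordered_permutation(lst):
        before = Counter(lst)             # captured first, in case my_sort mutates its input
        result = my_sort(lst)
        assert Counter(result) == before  # same elements with the same multiplicities
        assert all(a <= b for a, b in zip(result, result[1:]))  # nondecreasing order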
▲ | masklinn 18 hours ago | parent | prev [-] | | > we have to draw the line somewhere Do we? You can pop-count the two lists and check that those are equal. | |
|
|
| |
▲ | chriswarbo 9 hours ago | parent | prev | next [-] | | Here's a property check I wrote yesterday, which found a couple of bugs in a large, decade-old codebase. I'd just changed a data structure with three components, and made sure the test suite was still passing. I happened to notice a function in the same file, for parsing strings into that data structure, which had a docstring saying that it ignores whitespace at the start/end of the string, and in-between the components. It had tests for the happy-path, like "foo123x" -> ("foo", 123, 'x'), as well as checking that optional components could be left out, and it even checked some failure cases. Yet none of the tests used any whitespace. I thought it would be good to test that, given that somebody had gone to the effort of documenting it. Yet I didn't want to write a bunch of combinations like " foo123x", "foo 123x", "foo123 x", " foo 123x", "foo 123 x", and so on. Instead, I wrote a property which adds some amount of whitespace (possibly none) to each of those places, and asserts that it gets the same result as with no whitespace (regardless of whether it's a successful parse or not). I wasn't using Python, but it was something like this:

    def whitespace_is_ignored(b1: bool, b2: bool, b3: bool, s1: int, s2: int, s3: int, s4: int):
        v1 = "foo" if b1 else ""
        v2 = "123" if b2 else ""
        v3 = "x" if b3 else ""
        spaces = lambda n: " " * n
        spaced = "".join([spaces(s1), v1, spaces(s2), v2, spaces(s3), v3, spaces(s4)])
        assert parser(v1 + v2 + v3) == parser(spaced)
The property-checker immediately found that "foo123x " (with two spaces at the end) will fail to parse. When I fixed that, it found that spaces after the first component will end up in the result, like "foo 123x" -> ("foo ", 123, 'x'). Of course, we could make this property more general (e.g. by taking the components as inputs, instead of hard-coding those particular values); but this was really quick to write, and managed to find multiple issues! If I had written a bunch of explicit examples instead, then it's pretty likely I would have found the "foo 123x" issue; but I don't think I would have bothered writing combinations with multiple consecutive spaces, and hence would not have found the "foo123x " issue. | |
▲ | FuckButtons 16 hours ago | parent | prev | next [-] | | Mostly I found making composite strategies most helpful. I had a lot of fiddly details to get right when validating some DSP algorithms and the intermediate data structures they required, and for that it was very helpful: mostly I used property testing with some kind of composite strategy for generating valid inputs. | |
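A sketch of a composite strategy of that kind, with a made-up "sample rate plus bounded samples" input for a DSP routine (the names and the RMS property are illustrative assumptions):

    from hypothesis import given, strategies as st

    @st.composite
    def signals(draw):
        # draw a plausible sample rate and a non-empty block of samples in [-1.0, 1.0]
        sample_rate = draw(st.sampled_from([8_000, 16_000, 44_100, 48_000]))
        samples = draw(st.lists(st.floats(min_value=-1.0, max_value=1.0), min_size=1, max_size=1024))
        return sample_rate, samples

    @given(signals())
    def test_rms_stays_in_range(signal):
        sample_rate, samples = signal
        rms = (sum(x * x for x in samples) / len(samples)) ** 0.5
        assert 0.0 <= rms <= 1.0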
▲ | IanCal a day ago | parent | prev | next [-] | | I definitely had the same issue when I started, I think you've hit on the two main mental blocking points many people hit. > this means you need a second function that acts the same as the function you wrote. Actually no, you need to check that outcomes match certain properties. The example there works for a reimplementation (for all inputs, the output of the existing function and new one must be identical) but you can break down a sorting function into other properties. Here are some for a sorting function that sorts list X into list Y. Lists X and Y should be the same length. All of the elements in list X should be in list Y. Y[n] <= Y[n+m] for n+m<len(X) If you support a reverse parameter, iterating backwards over the sorted list should match the reverse sorted list. Sorting functions are a good simple case but we don't write them a lot and so it can be hard to take that idea and implement it elsewhere. Here's another example: In the app focus ELEMENT A. Press tab X times. Press shift-tab X times. Focus should be on ELEMENT A. In an app with two or more elements, focus ELEMENT A. Press tab. The focus should have changed. In an app with N elements. Focus ELEMENT A. Press tab N times. The focus should be on ELEMENT A. Those are simple properties that should be true, are easy to test and maybe you'd have some specific set of tests for certain cases and numbers but there's a bigger chance you then miss that focus gets trapped in some part of your UI. I had a library for making UIs for TVs that could be shaped in any way (so a carousel with 2 vertical elements then one larger one, then two again for example). I had a test that for a UI, if you press a direction and the focus changes, then pressing the other direction you should go back to where you were. Extending that, I had that same test run but combined with "for any sequence of UI construction API calls...". This single test found a lot of bugs in edge cases. But it was a very simple statement about using the navigation. I had a similar test - no matter what API calls are made to change the UI, either there are no elements on screen, or one and only one has focus. This was defined in the spec, but we also had defined in the spec that if there was focus, removing everything and then adding in items meant nothing should have focus. These parts of the spec were independently tested with unit tests, and nobody spotted the contradiction until we added a single test that checked the first part automatically. This I find exceptionally powerful when you have APIs. For any way people might use them, some basic things hold true. I've had cases where we were identifying parts of text and things like "given a random string, adding ' thethingtofind ' results in getting at least a match for ' thethingtofind ', with start and end points, and when we extract that part of the string we get ' thethingtofind '". Sounds basic, right? Almost pointless to add when you've seen the output work just fine and you have explicit test cases for this? It failed. At one step we lowercased things for normalisation and we took the positions from this lowercased string. If there was a Turkish I in the string, when lowercased it became two characters: >>> len("İ") == len("İ".lower())
False
So anything where this appeared before the match string would throw off the positions. Have a look at your tests, I imagine you have some tests like "Doing X results in Y" and "Doing X twice results in Y happening once". This is two tests expressing "Doing X n times where n>=1, Y happens". This is much more explicitly describing the actual property you want to test as well. You probably have some tests where you parameterise a set of inputs but actually check the same output - those could be randomly generated inputs testing a wider variety of cases. I'd also suggest that if you have a library or application and you cannot give a general statement that always holds true, it's probably quite hard to use. Worst case, point hypothesis at an HTTP API, have it auto generate some tests that say "no matter what I send to this it doesn't return a 500". In summary (and no this isn't LLM generated): 1. Don't test every case. 2. Don't check the exact output, check something about the output. 3. Ask where you have general things that are true. How would you explain something to a user without a very specific setup (you can't do X more than once, you always get back Y, it's never more than you put in Z...) 4. See if you have numbers in tests where the specific number isn't relevant. 5. See if you have inputs in your tests where the input is a developer putting random things in (sets of short passwords, whatever)
| ▲ | pfdietz 21 hours ago | parent | prev | next [-] | | I don't do the testing using Hypothesis, but I do a great deal of property based testing of Common Lisp implementations, where it has been invaluable. One large testing loop is testing CL's compile function. This involves creating random valid lambda expressions (unnamed functions), compiling them, and running them on random input. The properties tested are things like "the compiler doesn't crash, the generated code doesn't crash, different versions of the code with different valid declarations compute the same thing". More specific properties are tested for specific CL functions. For example, CL has a function subtypep which, given two types, says either the first value is or is not a subtype of the second, or that it couldn't figure it out. One can test this on randomly generated types, evaluating properties like (subtypep T1 T2) should be consistent with (subtypep `(not ,T2) `(not ,T1)). The point of PBT isn't that you write tests to cover every possible case, it's that PBT will cook up tests for all sorts of cases you will never come up with yourself. I have repeatedly found it kicking out the whackiest test inputs that happen to trigger bugs from strange interactions. | |
▲ | rowanG077 a day ago | parent | prev [-] | | On more complex logic you can often reason about many invariants. I don't think using sorting and then comparing to the existing sort is a good example. Precisely for the reason you mention. A sort itself is a good example though. You can do much better than just compare to an existing sort implementation. Because you can easily check if an array is sorted with a linear scan. If your sort is stable you can do an additional check that for all pairs of adjacent elements that compare as equal the order of them is the same as in the input array. | |
|
|
| ▲ | dbcurtis a day ago | parent | prev | next [-] |
| It’s been quite some time since I’ve been in the business of writing lots of unit tests, but back in the day, I found hypothesis to be a big force multiplier and it uncovered many subtle/embarrassing bugs for me. Recommend. Also easy and intuitive to use. |
| |
| ▲ | eru a day ago | parent | next [-] | | Hypothesis is also a lot better at giving you 'nasty' floats etc than Haskell's QuickCheck or the relevant Rust and OCaml libraries are. (Or at least used to be, I haven't checked on all of them recently.) | |
| ▲ | aethor a day ago | parent | prev | next [-] | | I concur. Hypothesis saved me many times. It also helped me prove the existence of bugs in third party code, since I was able to generate examples showing that a specific function was not respecting certain properties. Without that I would have spent a lot of time trying to manually find an example, let alone the simplest possible example. | |
▲ | IanCal a day ago | parent | prev [-] | | Huge second. I’ve never used PBT and failed to find a new bug. I recommended it in a job interview, they used it and discovered a pretty clear bug on their first test. It’s really powerful. | |
|
|
| ▲ | spooky_deep a day ago | parent | prev | next [-] |
| Property based testing is fantastic. Why is it not more popular? My theory is that only code written in functional languages has complex properties you can actually test. In imperative programs, you might have a few utils that are appropriate for property testing - things like to_title_case(str) - but the bulk of program logic can only be tested imperatively with extensive mocking. |
| |
▲ | IanCal a day ago | parent | next [-] | | I strongly disagree, but I think there are a few problems: 1. Lots of devs just don't know about it. 2. It's easy to do it wrong and try and reimplement the thing you're testing. 3. It's easy to try and redo all your testing like that and fail and then give up. Here are some I've used: Tabbing changes focus for UIs with more than 1 element. Shift tabbing the same number of times takes you back to where you came from. This one on TVs with u/d/l/r nav -> if pressing a direction changes focus, pressing the opposite direction takes you back to where you came from. An extension of the last -> regardless of the set of API calls used to make the user interface, the same is true. When finding text ABC in a larger string and getting back `Match(x, start, end)`, if I take the string and chop out string[start:end] then I get back exactly ABC. This failed because of a dotted I that when lowercased during a normalisation step resulted in two characters - so all the positions were shifted. Hypothesis found this and was able to give me a test like "find x in 'İx' -> fail". No input to the API should result in a 500 error. N PUT requests, where N>0, result in one item created. Clicking around the application should not result in a 404 page or error page. Overall I think there are lots of wider things you can check, because we should have UIs and tools that give simple rules and guarantees to users. | |
▲ | chamomeal a day ago | parent | prev | next [-] | | I think testing culture in general is suffering because the most popular styles/runtimes don’t support it easily. Most apps (at least in my part of the world) these days are absolutely peppered with side effects. At work our code is mostly just endpoints that trigger tons of side effects, then return some glob of data returned from some of those effects. The joys of micro services!! If you’re designing from the ground up with testing in mind, you can make things super testable. Defer the actual execution of side effects. Group them together and move local biz logic to a pure function. But when you have a service that’s just a 10,000 line tangle of reading and writing to queues, databases and other services, it’s really hard to do ANY kind of testing. I think that’s why unit testing and full on browser based E2E testing are popular these days. Unit testing pretends the complexity isn’t there via mocks, and lets you get high test coverage to pass your 85% coverage requirement. Then the E2E tests actually test user stories. I’m really hoping there’s a shift. There are SO many interesting and comprehensive testing strategies available that can give you such high confidence in your program. But it mostly feels like an afterthought. My job has 90% coverage requirements, but not a single person writes useful tests. We have like 10,000 unit tests literally just mocking functions and then spying on the mocked return. For anybody wanting to see a super duper interesting use of property based testing, check out “Breaking the Bank with test contract”, a talk by Allen Rohner. He pretty much uses property based testing to verify that mocks of services behave identically to the actual services (for the purpose of the program) so that you can develop and test against those mocks. I’ve started implementing a shitty version of this at work, and it’s wicked cool!! | |
▲ | DRMacIver 17 hours ago | parent | prev | next [-] | | How popular do you want it to be? The Python survey data (https://lp.jetbrains.com/python-developers-survey-2024/) holds pretty consistently at 4% of Python users saying they use it, which isn't as large as I'd like, but given that only 64% of people in the survey say they use testing at all, it isn't doing too badly, and I think it certainly falsifies the claim that Python programs don't have properties you can test. | |
▲ | ibizaman a day ago | parent | prev | next [-] | | I actually used property testing very successfully to test a DB driver and a migration to another DB driver in Go. I wrote about it here https://blog.tiserbox.com/posts/2024-02-27-stateful-property... | |
| ▲ | imiric a day ago | parent [-] | | Thanks for sharing! Your article illustrates well the benefits of this approach. One drawback I see is that property-based tests inevitably need to be much more complex than example-based ones. This means that bugs are much more likely, they're more difficult to maintain, etc. You do mention that it's a lot of code, but I wonder if the complexity is worth it in the long run. I suppose that since testing these scenarios any other way would be even more tedious and error-prone, the answer is "yes". But it's something to keep in mind. | | |
| ▲ | dragonwriter 15 hours ago | parent | next [-] | | > One drawback I see is that property-based tests inevitably need to be much more complex than example-based ones. I don’t think that’s true; I just think the complexity is more explicit (in code) rather than implicit (in the process of coming up with examples). Example-based testing usually involves defining conditions and properties to be tested, then constructing sets of examples to test them which attempt to anticipate edge cases from the description of the requirements (black box) or from knowledge of how the code is implemented (white box). Property based testing involves defining the conditions and properties, writing code that generates the conditions and for each property writing a bit of code that can refute it by passing if and only if it is true of the subject under test for a particular set of inputs. With a library like Hypothesis which both has good generators for basic types and good abstractions for combining and adapting generators, the latter seems to be less complex overall, as well as moving the complexity into a form where it is explicit and easy to maintain/adapt, whereas adapting example-based tests to requirements changes involves either throwing out examples and starting over or revalidating and updating examples individually. | |
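As a rough illustration of that last point, Hypothesis strategies compose into generators for domain objects with little ceremony; the Order type and its fields below are made up for the sketch, not taken from the comment.

    from dataclasses import dataclass
    from hypothesis import given, strategies as st

    @dataclass
    class Order:
        sku: str
        quantity: int
        discount: float

    # combine and adapt the primitive strategies into a domain-object generator
    orders = st.builds(
        Order,
        sku=st.text(alphabet="ABCDEFGHJKLMNPQRSTUVWXYZ0123456789", min_size=1, max_size=8),
        quantity=st.integers(min_value=1, max_value=1_000),
        discount=st.floats(min_value=0.0, max_value=1.0, allow_nan=False),
    )

    @given(orders)
    def test_total_is_never_negative(order):
        total = order.quantity * (1 - order.discount)  # stand-in for real pricing logic
        assert total >= 0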
| ▲ | imiric 11 hours ago | parent [-] | | > Property based testing involves defining the conditions and properties, writing code that generates the conditions and for each property writing a bit of code that can refute it by passing if and only if it is true of the subject under test for a particular set of inputs. You're downplaying the amount of code required to properly set up a property-based test. In the linked article, the author implemented a state machine to accurately model the SUT. While this is not the most complex of systems, it is far from trivial, and certainly not a "bit of code". In my many years of example-based unit/integration/E2E testing, I've never had to implement something like that. The author admits that the team was reluctant to adopt PBT partly because of the amount of code. This isn't to say that example-based tests are simple. There can be a lot of setup, mocking, stubbing, and helper code to support the test, but this is usually a smell that something is not right. Whereas with PBT it seems inevitable in some situations. But then again, I can see how such tests can be invaluable, since they would be very difficult and likely more complex to implement otherwise. So, as with many things, it's a tradeoff. I think PBT doesn't replace EBT, nor vice-versa, but they complement each other.
| |
| ▲ | Jtsummers 14 hours ago | parent | prev [-] | | My experience is that PBT tests are mostly hard in devising the generators, not in the testing itself. Since it came up in another thread (yes, it's trivial), a function `add` is no easier or harder to test with examples than with PBT; here are some of the tests in both PBT style and example-based style: @given(st.integers())
  def test_left_identity_pbt(a):
      assert add(a, 0) == a
  def test_left_identity():
      assert add(10, 0) == 10
  @given(st.integers(), st.integers())
  def test_commutative_pbt(a, b):
      assert add(a, b) == add(b, a)
  @pytest.mark.parametrize("a,b", examples)
  def test_commutative(a, b):
      assert add(a, b) == add(b, a)
They're the same test, but one is more comprehensive than the other. And you can use them together. Supposing you do find an error, you add it to your example-based tests to build out your regression test suite. This is how I try to get people into PBT in the first place: just take your existing example-based tests and build a generator. If they start failing, that means your examples weren't sufficiently comprehensive (not surprising). Because PBT systems like Hypothesis run so many tests, though, you may need to either restrict the number of generated examples for performance reasons or break up complex tests into a set of smaller, but faster running, tests to get the benefit. Other things become much simpler, or at least simpler to test comprehensively, like stateful and end-to-end tests (assuming you have a way to programmatically control your system). As a real-world example, I used Hypothesis to drive an application by sending a series of commands/queries and seeing how it behaved. There are so many possible sequences that manually developing a useful set of end-to-end tests is non-trivial. However, with Hypothesis it just generated sequences of interactions for me and found errors in the system. After each command (which may or may not change the application state) it issued queries in the invariant checks and verified the results against the model. Like with example-based testing, these can be turned into hard-coded examples in your regression test suite. | |
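A minimal sketch of the stateful style described above, using Hypothesis's hypothesis.stateful module; the in-memory KeyValueStore stands in for the real application, and a plain dict serves as the model.

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

    class KeyValueStore:
        # stand-in for the real system under test
        def __init__(self):
            self._data = {}
        def put(self, key, value):
            self._data[key] = value
        def delete(self, key):
            self._data.pop(key, None)
        def get(self, key):
            return self._data.get(key)

    class StoreMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.real = KeyValueStore()
            self.model = {}  # the trivially-correct model

        @rule(key=st.text(max_size=3), value=st.integers())
        def put(self, key, value):
            self.real.put(key, value)
            self.model[key] = value

        @rule(key=st.text(max_size=3))
        def delete(self, key):
            self.real.delete(key)
            self.model.pop(key, None)

        @invariant()
        def agrees_with_model(self):
            for key, value in self.model.items():
                assert self.real.get(key) == value

    TestStore = StoreMachine.TestCase  # picked up by pytest/unittest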
| ▲ | imiric 11 hours ago | parent [-] | | > Since it came up in another thread (yes, it's trivial), a function `add` is no easier or harder to test with examples than with PBT Come on, that example is practically useless for comparing both approaches. Take a look at the article linked above. The amount of non-trivial code required to set up a PBT should raise an eyebrow, at the very least. It's quite possible that the value of such a test outweighs the complexity overhead, and that implementing all the test variations with EBT would be infeasible, but choosing one strategy over the other should be a conscious decision made by the team. So as much as you're painting PBT in a positive light, I don't see it that clearly. I think that PBT covers certain scenarios better than EBT, while EBT can be sufficient for a wide variety of tests, and be simpler overall. But again, I haven't actually written PBTs myself. I'm just going by the docs and articles mentioned here. | |
| ▲ | Jtsummers 7 hours ago | parent [-] | | > Come on, that example is practically useless for comparing both approaches. Come on, I admitted it was trivial. It was a quick example that fit into a comment block. Did you expect a dissertation? > that implementing all the test variations with EBT would be infeasible That's kind of the point of my previous comment. PBTs will generate many more examples than you would create by hand. If you have EBTs already, you're one step away from PBTs (the generators; I never said this was trivial, just to preempt another annoying "Come on"). And then you'll have more comprehensive testing than you would have had sticking to just your carefully handcrafted examples. This isn't the end of property-based testing, but it's a really good start and the easiest way to bring it into an existing project because you can mostly reuse the existing tests. Extending this, once you get used to it, to stateful testing (which many PBT libraries support, including Hypothesis), you can generate a lot of very useful end-to-end tests that would be even harder to come up with by hand. And again, if you have any example-based end-to-end tests or integration tests, you need to come up with generators and you can start converting them into property-based tests. > but choosing one strategy over the other should be a conscious decision made by the team. Ok. What prompted this? I never said otherwise. It's also not an either/or situation, as you seem to want to make it. As I wrote in that previous comment, you can use both and use the property-based tests to bolster the example-based tests, and convert counterexamples into more example-based tests for your regression suite. > I haven't actually written PBTs myself. Huh. | |
|
|
|
| |
| ▲ | iLemming 18 hours ago | parent | prev | next [-] | | > Why is it not more popular? Probably because it's difficult to build simplistic intuition around it in languages that don't have robust built-in mechanisms that enable that kind of thinking. For example, gen-testing in Clojure feels simple and intuitive, but if I have to do something similar, hmmm... I dunno, say in Java, I wouldn't even know where to start - the code inevitably would be too verbose; mutable state would obscure invariants; and the required boilerplate alone would cloud the core idea. While in Clojure, immutability-by-default feels transformative for gen-testing; data generation is first-class - generators are just values or functions; composition is natural; debugging is easier; invariants are verifiable; FP mindset already baked in - easier to reason about stateful sequences/event streams; there's far less ceremony overall. But if I'm forced to write Java and see a good case for prop-based testing, I'd still probably try doing it. But only because I already acquired some intuition for it. That's why it is immensely rewarding to spend a year or more learning an FP language, even if you don't see it used at work or every day. So, the honest answer to your question would probably be: "because FP languages are not more popular" | |
| ▲ | vrnvu a day ago | parent | prev | next [-] | | >> Why is it not more popular? Property, fuzz, and snapshot testing. Great tools that make software more correct and reliable. The challenge for most developers is that they need to change how they design code and think about testing. I’ve always said the hardest part of programming isn’t learning, it’s unlearning what you already know… | |
| ▲ | maweki a day ago | parent | prev | next [-] | | I think the core problem in property-based testing is that the property/specification needs to be quite simple compared to the implementation. I did some property-based testing in Haskell and in some cases the implementation was the specification verbatim. So what properties should I test?
It was clearer where my function should be symmetric in the arguments or that there is a neutral element, etc. If the property is basically your specification, which (as the language is very expressive) is your implementation, then you're just going in circles. | |
| ▲ | chriswarbo 9 hours ago | parent [-] | | Yeah, reimplementing the solution just to have something to check against is a bad idea. I find that most tutorials talk about "properties of function `foo`", whereas I prefer to think about "how is function `foo` related to other functions". Those relationships can be expressed as code, by plugging outputs of one function into arguments of another, or by sequencing calls in a particular order, etc. and ultimately making assertions. However, there will usually be gaps; filling in those gaps is what a property's inputs are for. Another good source of properties is trying to think of ways to change an expression/block which are irrelevant. For example, when we perform a deletion, any edits made beforehand should be irrelevant; boom, that's a property. If something would filter out negative values, then it's a property that sprinkling negative values all over the place has no effect. And so on. |
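The "sprinkling negatives is irrelevant" idea translates into a few lines of Hypothesis; keep_positive() is a toy stand-in for real code, and for brevity the extra negatives are prepended and appended rather than interleaved.

    from hypothesis import given, strategies as st

    def keep_positive(xs):
        # toy system under test: drops non-positive values
        return [x for x in xs if x > 0]

    @given(st.lists(st.integers()), st.lists(st.integers(max_value=-1)))
    def test_extra_negatives_are_irrelevant(xs, negatives):
        # adding values that get filtered out anyway must not change the result
        assert keep_positive(negatives + xs + negatives) == keep_positive(xs)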
| |
| ▲ | eru a day ago | parent | prev [-] | | But wouldn't that apply just as much to example based testing? |
|
|
| ▲ | rtu8lu a day ago | parent | prev | next [-] |
| Great project. I used it a lot, but now I mostly prefer ad hoc generators. Hypothesis combinators quickly become an unmaintainable mess for non-trivial objects. Also, shrinking is not such a big deal when you can generate your data in a complexity-sorted order. |
| |
| ▲ | ctenb a day ago | parent [-] | | What is complexity sorted order? | | |
| ▲ | rtu8lu 21 hours ago | parent [-] | | When you generate randomized data in an order from the simplest to the most complex. | | |
| ▲ | ctenb 4 hours ago | parent [-] | | Isn't that very wasteful or difficult to do in practice? If you consider that shrinkers generally take lower numbers to be 'simpler' than higher numbers, then complexity-ordering requires you to generate all the numbers from low to high. |
|
|
|
|
| ▲ | __mharrison__ a day ago | parent | prev | next [-] |
| I've taught this in my testing courses. I find that (pytest) fixtures are often good enough for coming up with multiple tests, while being simple enough to implement. |
|
| ▲ | jamesponddotco 19 hours ago | parent | prev | next [-] |
| Never heard of “property-based testing” before. Coming from Go, I mostly use table-driven and fuzz tests. Is this approach something that makes sense in Go as well? I see at least one package[1][2] for Go, but I wonder if the approach itself makes sense for the language. [1]: https://github.com/flyingmutant/rapid [2]: Oh, I just found out about testing/quick. |
| |
| ▲ | akshayshah 17 hours ago | parent [-] | | Rapid is excellent. It also integrates with the standard library's fuzz testing, which is handy to persist a high-priority corpus of inputs that have caused bugs in the past. Testing/quick is adequate for small things and doesn't introduce new dependencies, but it's also frozen. Many years ago, the Go team decided that PBT is complex enough that it shouldn't be in stdlib. Here's a small example of a classic PBT technique used in a very practical Go project: https://github.com/connectrpc/connect-go/blob/cb2e11fb88c9a6... |
|
|
| ▲ | anordin95 13 hours ago | parent | prev | next [-] |
| I am a huge fan of the "Explanations" page of their docs: "These explanation pages are oriented towards deepening your understanding of Hypothesis, including its design philosophy." I feel much more software documentation could greatly benefit from this approach. Describing the "why" and design tradeoffs helps me grok a system far better than the typical quickstart or tutorials which show snippets but offer little understanding. That is, they rarely help me answer: is this going to solve my problem and how? |
| |
| ▲ | lwhsiao 12 hours ago | parent [-] | | You would like the Diátaxis framework: https://diataxis.fr/ That is the structure they (and many others) are following :). | |
| ▲ | anordin95 6 hours ago | parent [-] | | I know it and do like it! Even for those that follow Diataxis, in my experience the "Explanation" sections are often lacking, especially compared to the "How-To" or "Tutorial" ones. Python is one example that comes to mind. They do have explanations here: https://docs.python.org/3/howto/index.html. And, to be fair, they are generally excellent in my opinion! But they're far from front and center and there's much less overall content compared to the other Diataxis types I think (granted, I haven't rigorously checked). |
|
|
|
| ▲ | cat-whisperer a day ago | parent | prev | next [-] |
| A little off topic, but there's this great video by Jon about property testing in Rust: https://youtu.be/64t-gPC33cc |
|
| ▲ | vlade11115 a day ago | parent | prev | next [-] |
| One cool application of this is schemathesis. I really enjoyed it, and I found more input validation bugs in my code than I can count.
Very useful for public APIs. |
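For reference, a minimal Schemathesis setup looks roughly like this (3.x-style API; the URL is a placeholder and the exact entry point varies between versions):

    import schemathesis

    # point it at a running service's OpenAPI schema (placeholder URL)
    schema = schemathesis.from_uri("http://localhost:8000/openapi.json")

    @schema.parametrize()
    def test_api(case):
        # generate inputs from the schema, call the service, and check the
        # response against the documented status codes and response schemas
        case.call_and_validate()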
|
| ▲ | pyuser583 a day ago | parent | prev | next [-] |
| Make sure to read the docs and understand this well. It has its own vocabulary that can be very counterintuitive. |
|
| ▲ | thdhhghgbhy a day ago | parent | prev | next [-] |
| Is there something this nice for JS, with the decorators like that? |
| |
| ▲ | ngruhn a day ago | parent | next [-] | | No decorators, but fast-check has add-ons to various test frameworks. E.g. if you use Vitest you can write: import { test, fc } from '@fast-check/vitest'
  import { expect } from 'vitest'
  test.prop([fc.array(fc.double())])('sort is correct', (lst) => {
    expect(lst.toSorted((a, b) => a - b)).toEqual([...lst].sort((a, b) => a - b))
  })
https://www.npmjs.com/package/@fast-check/vitest?activeTab=r... | | |
| ▲ | chamomeal a day ago | parent [-] | | Fast check is fantastic!! I found it to be pretty verbose but I think that’s just a typescript limitation. It’s VERY well typed, which was a nice surprise. Such a great library. Shrinking, model based testing, it’s really comprehensive |
| |
| ▲ | eru a day ago | parent | prev | next [-] | | The decorators are a nice approach in Python, but they aren't really core to what Hypothesis does, nor what makes it better than eg Haskell's QuickCheck. | |
| ▲ | aidos a day ago | parent | prev | next [-] | | Not decorators (or at least not last time I looked) but we use fast-check. Was already familiar with and using Hypothesis in Python so went in search of something with similar nice ergonomics. Am happy with fast-check in that regard. https://fast-check.dev/ | |
| ▲ | greener_grass a day ago | parent | prev [-] | | Decorators are just function application with more syntax. |
|
|
| ▲ | imiric a day ago | parent | prev | next [-] |
| Coincidentally, I recently stumbled upon a similar library for Go[1]. I haven't used it, or property-based testing, but I can see how it could be useful. [1]: https://github.com/flyingmutant/rapid |
|
| ▲ | toast_x 20 hours ago | parent | prev | next [-] |
| This is pretty sick - I imagine this speeds up hardening a lot. Will try it out - thank you! |
|
| ▲ | languagehacker 20 hours ago | parent | prev | next [-] |
| My first impression of property-based testing is that it's likely to introduce flakes that distract engineers from their task at hand, forcing them to investigate a new failed test in an area that they weren't even touching. |
|
| ▲ | klntsky a day ago | parent | prev | next [-] |
| It seems to only implement half of the QuickCheck idea, because there is no counterexample shrinking. Good effort though! I wonder how hard it would be to derive generators for any custom types in Python - probably not too hard, because types are just values |
| |
| ▲ | sunshowers a day ago | parent | next [-] | | Shrinking is by far the most important and impressive part of Hypothesis. Compared to how good it is in Hypothesis, it might as well not exist in QuickCheck. Proptest in Rust is mostly there but has many more issues with monadic bind than Hypothesis does (I wrote about this in https://sunshowers.io/posts/monads-through-pbt/). | | |
| ▲ | eru a day ago | parent [-] | | Python's Hypothesis has some very clever features to deal with shrinking past a monadic bind. If I remember right, it basically uses a binary 'tape' of random decisions. Shrinking is expressed as manipulations of that tape. Your generators (implicitly) define a projection from that tape to your desired types. Shrinking an early part of the tape leaves the later sub-generators to try and re-use the later parts of the tape. That's not guaranteed to work. But it doesn't have to work reliably for every shrink operation the library tries! It's sufficient if you merely have a good-enough chance of recovering enough of the previous structure to trigger the bug again. | |
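A toy illustration of that tape idea, not Hypothesis's actual internals: each generator reads integers off a tape, so a dependent ("monadic") draw re-interprets whatever is left of the tape after an earlier entry has been shrunk, and the invariant it encodes survives.

    import random

    def draw_int(tape, pos, lo, hi):
        # interpret one tape entry as an integer in [lo, hi]
        return lo + tape[pos] % (hi - lo + 1), pos + 1

    def gen_pair(tape):
        # the second draw's range depends on the first, so second <= first holds
        # no matter how the tape is mangled
        first, pos = draw_int(tape, 0, 0, 100)
        second, _ = draw_int(tape, pos, 0, first)
        return first, second

    tape = [random.randrange(256) for _ in range(2)]
    print(gen_pair(tape))            # e.g. (57, 3)
    print(gen_pair([1] + tape[1:]))  # "shrink" the first entry; the rest of the tape is reused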
| ▲ | ctenb a day ago | parent | next [-] | | > Shrinking is expressed as manipulations of that tape. How do you do that in general? I can't find any documentation on that. | | | |
| ▲ | sunshowers a day ago | parent | prev [-] | | I've always wondered if there could be a small machine learning model trained on shrinking. | | |
| ▲ | eru a day ago | parent [-] | | I'm not sure whether it would be useful, but it would definitely get you a grant (if done academically) or VC money (if done as a company) these days. |
|
|
| |
| ▲ | Jtsummers a day ago | parent | prev | next [-] | | > because there is no counterexample shrinking Hypothesis does shrink the examples, though. | | |
| ▲ | eru a day ago | parent [-] | | And Hypothesis is miles ahead of QuickCheck in how it handles shrinking! Not only does it shrink automatically, it has no problem preserving invariants from generation in your shrinking; like only prime numbers or only strings that begin with a vowel etc. | | |
| ▲ | lgas a day ago | parent [-] | | QuickCheck also shrinks automatically and preserves invariants though? | | |
| ▲ | eru 8 hours ago | parent | next [-] | | Others have pointed out that QuickCheck doesn't shrink automatically. But in addition: QuickCheck's shrinking also doesn't preserve invariants (in general). QuickCheck's shrinking is type based. There are lots of different ways to generate eg integers. Perhaps you want them in a specific range, or only prime numbers or only even numbers etc. To make QuickCheck's shrinker preserve these invariants, you'd have to make a typed wrapper for each of them, and explicitly write a new shrinking strategy. It's annoying and complicated. Hypothesis does this automatically. | |
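A small Hypothesis illustration of that difference: the "only even numbers" constraint lives inside the strategy itself, so any shrunk counterexample still satisfies it, with no wrapper type needed.

    from hypothesis import given, strategies as st

    # the constraint is part of the generator, so shrinking cannot violate it
    evens = st.integers(min_value=0).map(lambda n: 2 * n)

    @given(evens)
    def test_stays_small(n):
        # deliberately false property; the reported counterexample is still even
        # (typically the minimal one, 100)
        assert n < 100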
| ▲ | chriswarbo 8 hours ago | parent | prev | next [-] | | QuickCheck won't preserve invariants, since its shrinkers are separate from its generators. For example: data Rat = Rat Int Nat deriving (Eq, Show)
genRat = do
(num, den) <- arbitrary
pure (Rat num (1 + den))
`genRat` is a QuickCheck generator. It cannot do shrinking, because that's a completely separate thing in QuickCheck. We can write a shrinker for `Rat`, but it will have nothing to do with our generator, e.g. shrinkRat (Rat num den) = do
(num', den') <- shrink (num, den)
pure (Rat num' den')
Sure, we can stick these in an `Arbitrary` instance, but they're still independent values. The generation process is essentially state-passing with a random number generator; it has nothing to do with the shrinking process, which is a form of search without backtracking. instance Arbitrary Rat where
arbitrary = genRat
shrink = shrinkRat
In particular, `genRat` satisfies the invariant that values will have non-zero denominator; whereas `shrinkRat` does not satisfy that invariant (since it shrinks the denominator as an ordinary `Nat`, which could give 0). In fact, we can't even think about QuickCheck's generators and shrinkers as different interpretations of the same syntax. For example, here's a shrinker that follows the syntax of `genRat` more closely: shrinkRat2 (Rat n d) = do
(num, den) <- shrink (n, d)
pure (Rat num (1 + den))
This does have the invariant that its outputs have non-zero denominators; however, it will get stuck in an infinite loop! That's because the incoming `d` will be non-zero, so when `shrink` tries to shrink `(n, d)`, one of the outputs it tries will be `(n, 0)`; that will lead to `Rat n 1`, which will also shrink to `Rat n 1`, and so on. In contrast, in Hypothesis, Hedgehog, falsify, etc. a "generator" is just a parser from numbers to values; and shrinking is applied to those numbers, not to the output of a generator. Not only does this not require separate shrinkers, but it also guarantees that the generator's invariants hold for all of the shrunken values, since those shrunken values have also been outputted by the generator (when it was given smaller inputs). | |
| ▲ | sunshowers 13 hours ago | parent | prev | next [-] | | No, QuickCheck very importantly does not shrink automatically. You have to write the shrinker yourself. Hypothesis, Hedgehog, proptest and a few others shrink automatically. | |
| ▲ | valcron1000 13 hours ago | parent | prev [-] | | Yes, but instances require the user to provide shrinking while Hypothesis does not: shrinking is derived automatically. |
|
|
| |
| ▲ | pfdietz a day ago | parent | prev [-] | | The way it does counterexample shrinking is the most clever part of Hypothesis. | | |
| ▲ | ctenb a day ago | parent [-] | | Do you have a reference where it is explained? It's not part of the docs as far as I can tell | | |
|
|
|
| ▲ | teiferer a day ago | parent | prev [-] |
| This approach has two fundamental problems. 1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it. Is your function doing a+b? Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. Any logical error that might be slipped into your SUT implementation has a high risk of also slipping into your test and will therefore be hidden by the complexity, even though it would be obvious from just looking at a few well thought through examples. 2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. You as the developer know the implementation and have a chance of coming up with the edge cases. You know the dark corners of your code. Random tests are just playing the lottery, replacing thinking hard. |
| |
| ▲ | ChadNauseam a day ago | parent | next [-] | | > Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. Not true. For example, if `f` is `+`, you can assert that f(x,y) == f(y,x). Or that f(x, 0) == x. Or that f(x, f(y, z)) == f(f(x, y), z). Even a test as simple as "don't crash for any input" is actually extremely useful. This is fuzz testing, and it's standard practice for any safety-critical code, e.g. you can bet the JPEG parser on the device you're reading this on has been fuzz tested. > You basically just give up and leave edge case finding to chance. I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is because the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed. Doing this for a JPEG parser actually causes it to produce valid images, which you would never expect to happen by chance. See: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th... > Such a fuzzing run would be normally completely pointless: there is essentially no chance that a "hello" could be ever turned into a valid JPEG by a traditional, format-agnostic fuzzer, since the probability that dozens of random tweaks would align just right is astronomically low. > Luckily, afl-fuzz can leverage lightweight assembly-level instrumentation to its advantage - and within a millisecond or so, it notices that although setting the first byte to 0xff does not change the externally observable output, it triggers a slightly different internal code path in the tested app. Equipped with this information, it decides to use that test case as a seed for future fuzzing rounds: | | |
| ▲ | eru a day ago | parent [-] | | > I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is because the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed. The author of Hypothesis experimented with this feature once, but people usually want their unit tests to run really quickly, regardless of whether property based or example based. And the AFL style exploration of branch space typically takes quite a lot longer than what people have patience for in a unit test that runs eg on every update to every Pull Request. | | |
| ▲ | tybug a day ago | parent [-] | | (Hypothesis maintainer here) Yup, a standard test suite just doesn't run for long enough for coverage guidance to be worthwhile by default. That said, coverage-guided fuzzing can be a really valuable and effective form of testing (see eg https://hypofuzz.com/). | | |
|
| |
| ▲ | vlovich123 a day ago | parent | prev | next [-] | | I have not met anyone that says you should only fuzz/property test, but claiming it can’t possibly find bugs or is unlikely to is silly. I’ve caught numerous non-obvious problems, including a non-fatal but undesirable off-by-1 error in math-heavy code, due to property testing. It works well when it’s an “NP”-hard style problem where the code is harder than the verification. It does not work well for a+b, but for most problems it’s generally easier to write assertions that have to hold when executing your function. But if it’s not, don’t use it - like all testing, it’s an art to determine when it’s useful and how to write it well. Hypothesis in particular does something neat where it tries to generate random inputs that are more likely to execute novel paths within the code under test. That’s not replicated in Rust but is super helpful for reaching more paths of your code, and that’s simply not able to be done manually if you have a lot of non-obvious boundary conditions. | |
| ▲ | eru a day ago | parent | prev | next [-] | | > 1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it. No. That's one valid approach, especially if you have a simpler alternative implementation. But testing against an oracle is far from the only property you can check. For your example: suppose you have implemented an add function for your fancy new data type (perhaps it's a crazy vector/tensor thing, whatever). Here are some properties that you might want to check: a + b == b + a; a + (b + c) == (a + b) + c; a + (-a) == 0. For all a and b and c, and assuming that these properties are actually supposed to hold in your domain, and that you have an additive inverse (-). Eg many of them don't hold for floating point numbers in general, so it's good to note that down explicitly. Depending on your domain (eg https://en.wikipedia.org/wiki/Tropical_semiring), you might also have idempotence in your operation, so a + b + b == a + b is also a good one to check, where it applies. You can also have an alternative implementation that only works for some classes of cases. Or sometimes it's easier to prepare a challenge than to solve it, eg you can randomly move around in a graph quite easily, and you can check that the A* algorithm you are working on finds a route that's at most as long as the number of random steps you took. > 2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. [...] You'd be surprised how often the generic heuristics for edge cases actually work and how often manual test writers forget that zero is also a number, and how often the lottery does a lot of the rest. Having said this: Python's Hypothesis is a lot better at its heuristics for these edge cases than eg Haskell's QuickCheck. | |
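The random-walk trick, sketched with Hypothesis; shortest_path_length() is a stand-in for the A* code under test, here just the Manhattan distance on an unbounded grid so the example is self-contained.

    from hypothesis import given, strategies as st

    def shortest_path_length(start, goal):
        # placeholder for the real A* implementation; on an unbounded grid the
        # true shortest path length is the Manhattan distance
        return abs(goal[0] - start[0]) + abs(goal[1] - start[1])

    MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

    @given(st.lists(st.sampled_from("UDLR")))
    def test_search_never_beats_a_witnessed_walk(steps):
        x = y = 0
        for step in steps:
            dx, dy = MOVES[step]
            x, y = x + dx, y + dy
        # the walk reaches (x, y) in len(steps) moves, so a correct shortest-path
        # search must find a route at most that long
        assert shortest_path_length((0, 0), (x, y)) <= len(steps)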
| ▲ | teiferer a day ago | parent [-] | | > a + b == b + a > a + (b + c) == (a + b) + c > a + (-a) == 0 Great! Now I have a stupid bug that always returns 0, so these all pass, and since I didn't think about this case (otherwise I'd not have written that stupid bug in the first place), I didn't add a property about a + b only being 0 if a == -b and boom, test is happy, and there is nothing that the framework can do about it. Coming up with those properties is hard for real-life code, and it's my main gripe with formal-methods-based approaches too, like model checking or deductive proofs. They move the bugs from the (complicated) code to the list of properties, which ends up just as complicated and error prone, and is entirely un...tested. Contrast that with an explicit dead-simple test. Test code doesn't have tests. It needs to be orders of magnitude simpler than the system it's testing. Its correctness must be obvious. Yes, it is really hard to write a good test. So hard that it should steer how you architect your system under test and how you write code. How can I test this to have confidence that it's correct? That must be a guiding principle from the first line of code. Just doing this as an afterthought by playing the lottery and trying to come up with smart properties after the fact is not going to get you the best outcome. | |
| ▲ | Jtsummers 20 hours ago | parent | next [-] | | For the example it would be easily caught with PBT by testing the left and right identities for addition. In general though, the advice is: don't be excessively dogmatic. If you can't devise a property to test, use traditional testing with examples. It's not that hard. It's the same way you deal with the problem of end-to-end tests or unit tests not being useful for certain things: use the appropriate test style and approach for your problem. | |
| ▲ | chriswarbo 8 hours ago | parent | prev | next [-] | | > Just doing this as an afterthought by playing lottery and trying to come up with smart properties after the fact is not going to get you the best outcome. This sounds backwards to me. How could you write any tests, or indeed implement any functions, if you don't know any relationships between the arguments/return-value, or the state before/after, or how it relates to other functions, etc.? For the addition example, let's say the implementation includes a line like `if (x == 0) return y`; would you seriously suggest that somebody writing that line doesn't know a property like `0 + a == a`? Would that only be "an afterthought" when "trying to come up with smart properties after the fact"? On the contrary, I would say that property came first, and steered how the code was written. Incidentally, that property would also catch your "always returns 0" counterexample. I also don't buy your distinction that "real life code" makes things much harder. For example, here's another simple property: delete(key); assert lookup(key) == []
This is pretty similar to the `a + (-a) == 0` example, but it applies to basically any database; from in-memory assoc-lists and HashMaps, all the way to high-performance, massively-engineered CloudScale™ distributed systems. Again, I struggle to imagine anybody implementing a `delete` function who doesn't know this property; indeed, I would say that's what deletion means. It's backwards to characterise such things as "com[ing] up with smart properties after the fact". | | |
| ▲ | eru 5 hours ago | parent [-] | | I agree with you. However the grand-parent comment has a point that it's not easy to extract testable properties from code that's already written. It's much easier to proceed explicitly like you suggest: have some properties in mind, co-develop property tests and code. Often when I try to solve a problem, I start with a few simple properties and let them guide my way to the solution. Like your deletion example. Or, I already know that the order of inputs shouldn't matter, or that adding more constraints to an optimiser shouldn't increase the maximum, etc. And I can write down some of these properties before I have any clue about how to solve the problem in question. --- However, writing properties even after the fact is a learnable skill. You just need to practice, similarly to any other skill. |
| |
| ▲ | eru 9 hours ago | parent | prev [-] | | > Great! Now I have a stupid bug that always returns 0, so these all pass, and since I didn't think about this case (otherwise I'd not have written that stupid bug in the first place), I didn't add a property about a + b only being 0 if a == -b and boom, test is happy, and there is nothing that the framework can do about it. Well, just add one example based test for eg 1 + 2 = 3, and you would catch this bug. You can freely mix and match property based tests with example based tests. > Coming up with those properties is hard for real life code and my main gripe with formal methods based approaches too, like model checking or deductice proofs. Yes, property based testing is an acquired skill that has to be learned. It's not so bad, once you've done it a few times, though. It also changes how you design code in the first place. Just like even example-based testing changes how you design your code. > They move the bugs from the (complicated) code to the list of properties, which ends up just as complicated and error prone, and is entirely un...tested. You might be doing this wrong, if your list of properties is as complicated as the code. And it's not 'untested': you are exercising your properties against your code. For a conceptually really simple example: suppose you have two different ways to solve a certain problem, and suppose neither is really simpler than the other. You still gain something from comparing the output of both approaches on random input: unless you made the mistake in both approaches, many bugs will show up as discrepancies that you can investigate. Hypothesis even includes 'shrinking' to helpfully provide you with a minimal counterexample that shows the divergence. From a formal point of view, you can say that this property is 'untested'; but in practice you still gain something by comparing the two implementations. |
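The two-implementations idea in miniature, with maximum-subarray as a made-up stand-in problem (not something from the thread): a slow, obviously-correct version acts as the oracle for the clever one.

    from hypothesis import given, strategies as st

    def max_subarray_reference(xs):
        # slow but obviously correct: best sum over every contiguous slice,
        # including the empty slice (sum 0)
        return max(sum(xs[i:j]) for i in range(len(xs) + 1) for j in range(i, len(xs) + 1))

    def max_subarray_fast(xs):
        # the clever linear-time version we actually want to ship (Kadane's
        # algorithm, also allowing the empty subarray)
        best = current = 0
        for x in xs:
            current = max(x, current + x)
            best = max(best, current)
        return best

    @given(st.lists(st.integers(min_value=-1000, max_value=1000), max_size=30))
    def test_fast_agrees_with_reference(xs):
        assert max_subarray_fast(xs) == max_subarray_reference(xs)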
|
| |
| ▲ | 12_throw_away a day ago | parent | prev | next [-] | | > Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b Not really, no, it's right there in the name: you should be testing properties (you can call them "invariants" if you want to sound fancy). In the example of testing an addition operator, you could test: 1. f(x,y) >= max(x,y) if x and y are non-negative 2. f(x,y) is even iff x and y have the same parity 3. f(x, y) = 0 iff x=-y etc. etc. The great thing is that these tests are very easy and fast to write, precisely because you don't have to re-model the entire domain. (Although it's also a great tool if you have 2 implementations, or are trying to match a reference implementation) | |
| ▲ | robertfw a day ago | parent | prev | next [-] | | I feel like this talk by John Hughes showed that there is real value in this approach with production systems of varying levels of complexity, with two different examples of using the approach to find very low level bugs that you'd never think to test for in traditional approaches. https://www.youtube.com/watch?v=zi0rHwfiX1Q | |
| ▲ | locknitpicker a day ago | parent | prev | next [-] | | > (...) but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. I think you're manifesting some misconceptions and ignorance about property-based testing. Property-based testing is still automated testing. You still have a sut and you still exercise it to verify and validate invariants. This does not change. The core trait of property-based testing is that instead of having to define and maintain hard-coded test data to drive your tests, which are specific realizations of the input state, property-based testing instead focuses on generating sequences of randomly-generated input data, and in the event of a test failing it follows up by employing reduction strategies to distil input values that pinpoint minimal reproducible examples. As a consequence, tests don't focus on which specific value a sut returns when given a specific input value. Instead, they focus on verifying more general properties of a sut. Perhaps the main advantage of property-based testing is that developers don't need to maintain test data anymore, and thus tests are no longer green just because you forgot to update the test data to cover a scenario or to reflect an edge case. Developers instead define test data generators, and the property-based testing framework implements the hard parts such as the input distillation step. Property-based testing is no silver bullet though. > Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. Your comment completely misses the point of property-based testing. You still need to exercise your sut to cover scenarios. Where property-based testing excels is that you no longer have to maintain curated sets of test data, or update them whenever you update a component. Your inputs are already randomly generated following the strategy you specified. | |
| ▲ | dragonwriter 16 hours ago | parent | prev [-] | | Your comment is downvoted currently, but I think it has value in the discussion (despite being wrong in literally every respect) because it shows the immense misleading power of a single extraordinarily poorly chosen headline example on the project page. Testing a sorting function against the results of the sorted builtin is concise, and technically correct, but (even though there are situations where it would be exactly the right thing to do) it is completely misleading to anyone new to the concept when it comes to indicating what property-based testing is all about. > It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it. It does not, and it would be next to worthless if it did. It requires being able to define the properties required of the SUT and to write code that can refute them if they are not present (the name "hypothesis" for this library is, in fact, a reference to that; PBT treats the properties of code as a hypothesis, and attempts to refute it.) > but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. No, see my reimplementation of the sorting example without resorting to the builtin (or any) sorting function other than the one being tested: https://news.ycombinator.com/item?id=45825482 > Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You may think this based on zero experience, but I have seen no one who has tried Hypothesis even once who has had that experience. It's actually very good at finding edge cases. > You as the developer know the implementation and have a chance of coming up with the edge cases. You as the developer have a very good chance of finding the same edge cases when writing tests for your own code that you considered when writing the code. You have much less chance of finding, when writing tests, the edge cases that you missed when writing code. You can incorporate the knowledge of probable edge cases you have when crafting Hypothesis tests just as with more traditional unit tests, but with traditional unit tests you have zero chance of finding the edge cases you didn't think of. Hypothesis is, actually, quite good at that. > Random tests are just playing the lottery, replacing thinking hard. Property-based testing doesn’t replace the thinking that goes into traditional unit testing, it just acts as a force multiplier for it. | |
| ▲ | tybug 7 hours ago | parent [-] | | (Hypothesis maintainer here) If you have recommendations for a better example on the front page, I'd love to hear them! (I mean this entirely genuinely and non-sarcastically; I agree sorting can give misleading ideas, but it is also concise and well understood by every reader). | | |
| ▲ | dragonwriter 4 hours ago | parent [-] | | The more I think about it the more I think calling it a bad example may be unfair. It can be extremely misleading for someone unfamiliar with the concept coming at it with a particular viewpoint, but I’m less sure, with more time to think, that an example that is better for that wouldn’t be worse in other ways. I like sorting as an example, and I like that using the built-in is concise, and reimplementing the behavior of an existing function where using the existing function as an oracle is a reasonable thing to do for a test isn't all that uncommon. I feel like something with a couple of properties described in comments with assertions testing those properties (but where the functionality and properties are familiar enough that it would make a clear connection) would be a bit better, in theory, but I don't have a great particular example to use, and anything done that way will be, at best, somewhat less concise. |
|
|
|