benrutter a day ago

I love the idea of hypothesis! Haven't found a lot of use cases for it yet, I think the quick start example helps explain why.

Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

Obviously it's just a tiny example, but in general I often find writing a test that will cover every hypothetical case isn't realistic.

Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

masklinn a day ago | parent | next [-]

What you're talking about is using an oracle (a different implementation of what you're testing). It's an option for property (or exhaustive) testing, but by no means a requirement. Even for a sort function there are plenty of properties you can check without needing an oracle, e.g.:

- that the output sequence is the same length as the input

- that the output sequence is sorted

- that the population counts are the same in the input and the output

> In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate.

eru 10 hours ago | parent [-]

> Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate.

The opposite might also be true. Suppose you already have a specialised implementation, and you want to write a new generalised one. You can still test them against each other.

Eg suppose you are writing a library that supports sorting crazy huge datasets that don't fit into memory. You can still check that it gives the same answers as the built-in sorting algorithm from the standard library on input that's small enough to fit into memory.
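
A minimal sketch of that with Hypothesis, where my_external_sort stands in for the hypothetical out-of-core implementation:

    from hypothesis import given, strategies as st

    # the built-in sorted() acts as the oracle on inputs small enough to fit in
    # memory; my_external_sort is the hypothetical out-of-core sort under test
    @given(st.lists(st.integers(), max_size=1000))
    def test_external_sort_matches_builtin(xs):
        assert my_external_sort(xs) == sorted(xs)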

Doxin 3 hours ago | parent | prev | next [-]

I've used hypothesis in practice for testing a slugify function. When someone signs up for our product, our software produces a subdomain based on the organisation name they input. People put all sorts of weird things in that field, and we want to be pretty broad in what we'll accept.

The hypothesis test for this is near trivial: pass some text to the slugify function, and check that it either a) throws an invalid-input error (e.g. input too short) or b) returns a string that's a valid domain name.
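
In Hypothesis terms it's roughly this (slugify, InvalidInput and is_valid_domain_label are stand-ins for our actual names):

    from hypothesis import given, strategies as st

    # slugify, InvalidInput and is_valid_domain_label stand in for our real code
    @given(st.text())
    def test_slugify_rejects_or_returns_valid_label(name):
        try:
            slug = slugify(name)
        except InvalidInput:
            return  # rejecting the input is fine
        assert is_valid_domain_label(slug)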

Doing this found me so many odd little edge cases around length truncation and IDNA encoding. The one bug I would never have found myself is that lowercasing "ß" turns it into "ss", which broke truncation, producing an invalid domain only when the input includes ß and the result is exactly 64 characters long.

hypothesis is a pretty niche tool, but boy when it's the right tool it's the right tool.

arnsholt a day ago | parent | prev | next [-]

I've only used PBT a few times, but when it fits it's been extremely useful. A concrete example from my practice of what has been pointed out in this thread, that you want to test properties of your function's output rather than the output itself: I was implementing a fairly involved algorithm (the Zhang-Shasha edit distance algorithm for ordered trees), and PBT was extremely useful in weeding out bugs. What I did was write a function that generated random tree structures in the form I needed for my code, and test the four properties that all distance functions should have:

1. d(x, x) = 0 for all x

2. d(x, y) >= 0 for all x, y

3. d(x, y) = d(y, x) for all x, y

4. d(x, z) <= d(x, y) + d(y, z) for all x, y, z (the triangle inequality)

Especially the triangle inequality check weeded out some tricky corner cases I probably wouldn't have figured out on my own. Some will object that you're not guaranteed to find bugs with this kind of random-generation strategy, but if you blast through a few thousand cases every time you run your test suite, and the odd overnight run testing a few million, you quickly get fairly confident that the properties you test actually hold. Of course any counterexamples the PBT finds should get lifted to regression tests in addition, to make sure they're always caught if they crop up again. And as with any testing approach, there are no guarantees, but it adds a nice layer of defense in depth IMO.
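
In Hypothesis that would look roughly like this; tree_distance and the nested-tuple tree encoding are stand-ins, and my actual generator was more involved:

    from hypothesis import given, strategies as st

    # trees as nested tuples of children; tree_distance stands in for the real implementation
    trees = st.recursive(st.just(()), lambda kids: st.lists(kids, max_size=3).map(tuple))

    @given(trees, trees, trees)
    def test_distance_is_a_metric(x, y, z):
        assert tree_distance(x, x) == 0
        assert tree_distance(x, y) >= 0
        assert tree_distance(x, y) == tree_distance(y, x)
        assert tree_distance(x, z) <= tree_distance(x, y) + tree_distance(y, z)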

aflukasz a day ago | parent | prev | next [-]

> Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

One example would be when you have a naive implementation of some algorithm and you want to introduce a second one, optimized but with much more complex implementation. Then this naive one will act as a model for comparisons.

Another case that comes to mind is when you have rather simple properties to test (like: does it finish without a crash within a given time? does it stay within some bounds on the output?) and want to easily run over a sensible set of varying inputs.
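
As a sketch (process and its output range are made up):

    from hypothesis import given, settings, strategies as st

    # hypothetical: process() must not crash, must finish within the deadline,
    # and must keep its output within the documented range
    @given(st.lists(st.floats(allow_nan=False)))
    @settings(deadline=200)  # milliseconds per example
    def test_process_is_fast_and_bounded(xs):
        result = process(xs)
        assert 0.0 <= result <= 1.0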

sevensor a day ago | parent | prev | next [-]

I use hypothesis to test that the happy path is really happy. I use it to define the set of expected inputs for a function, and it looks for the nastiest possible sets of inputs. It’s kind of the opposite of how I write ordinary tests, where I pick challenging but representative examples. Often I’m just checking some basic property, like “function f is the inverse of function g” for round-tripping data through a different representation. Sometimes it’s just “function f does not divide by zero.” Hypothesis seems to have a library of corner cases that it checks first. If you don’t say you can’t handle floats above 2^53, it will try 2^53, for instance. (That’s the threshold where doubles stop being able to represent every integer.) Often the result of running hypothesis is not a code change but a more precise description of what inputs are valid.
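
A minimal example of the round-trip pattern, using base64 rather than anything from my actual code:

    from base64 import b64decode, b64encode
    from hypothesis import given, strategies as st

    # decoding should invert encoding for arbitrary byte strings
    @given(st.binary())
    def test_b64_round_trip(data):
        assert b64decode(b64encode(data)) == data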

dragonwriter 19 hours ago | parent | prev | next [-]

> Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

You don't need that; it's just a convenient shortcut when you do have such a function (but, I would argue, a not-great example for precisely that reason).

All you actually need is the ability to verify the property (or set of properties) that you are testing in the output. You could replace the test in that sample with something like this, which I think is a better illustration of what is going on (this is somewhat simplified; rather than the length test, you really want to test that it has the same elements):

  @given(st.lists(st.integers() | st.floats(allow_nan=False)))  # NaN breaks the ordering check below
  def test_sort_correct(lst):
    result = my_sort(lst)

    # result has the same unique items as the original list
    assert set(result) == set(lst)

    # for each item, result has the same number of copies as the original
    assert all(
      result.count(item) == lst.count(item)
      for item in set(result)
    )

    # result is sorted
    assert all(result[n] <= result[n+1] for n in range(len(result)-1))
thekoma a day ago | parent | prev | next [-]

Indeed. For a sorting algorithm it would be more insightful to show it testing the actual property of something that is sorted: every element is greater than or equal to the previous one (or the other way around). You don’t need a sorting function to test the “sorted” property.

wongarsu a day ago | parent | prev | next [-]

Imho it's just not a great example. A better example would be testing the invariants of the function. "For any list of numbers, my_sort does not throw an exception" is trivial to test. "For any list of numbers, in the list returned by my_sort the item at index n is less than or equal to the item at index n+1" would be another test. Those two probably capture the full specification of my_sort, and you don't need a sort function to test either. In a real-world example you would more likely be testing just some subset of the specification: those things that are easy to assert.

lgas a day ago | parent [-]

Always returning the empty list meets your spec.

wongarsu a day ago | parent [-]

Good point. I suppose we should add "number of input elements equals number of output elements" and "every input element is present in the output". Translated into a straightforward test that still allows my_sort([1,1,2]) to return [1,2,2], but we have to draw the line somewhere.

travisjungroth 18 hours ago | parent | next [-]

Just use Counter and if the objects aren’t hashable, use the count of IDs. Grab this before calling the function, in case the function is destructive. Check it against the output.

Add in checking each item is less than or equal to its successor and you have the fundamental sort properties. You might have more, like stability.
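
Something like this, with my_sort standing in for whatever is under test:

    from collections import Counter
    from hypothesis import given, strategies as st

    # grab the multiset before calling my_sort in case it mutates its input
    @given(st.lists(st.integers()))
    def test_sort_properties(xs):
        before = Counter(xs)
        result = my_sort(xs)
        assert Counter(result) == before
        assert all(a <= b for a, b in zip(result, result[1:]))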

masklinn 20 hours ago | parent | prev [-]

> we have to draw the line somewhere

Do we? You can pop-count the two lists and check that those are equal.

chriswarbo 11 hours ago | parent | prev | next [-]

Here's a property check I wrote yesterday, which found a couple of bugs in a large, decade-old codebase.

I'd just changed a data structure with three components, and made sure the test suite was still passing. I happened to notice a function in the same file, for parsing strings into that data structure, which had a docstring saying that it ignores whitespace at the start/end of the string, and in-between the components. It had tests for the happy-path, like "foo123x" -> ("foo", 123, 'x'), as well as checking that optional components could be left out, and it even checked some failure cases. Yet none of the tests used any whitespace.

I thought it would be good to test that, given that somebody had gone to the effort of documenting it. Yet I didn't want to write a bunch of combinations like " foo123x", "foo 123x", "foo123 x", " foo 123x", "foo 123 x", and so on. Instead, I wrote a property which adds some amount of whitespace (possibly none) to each of those places, and asserts that it gets the same result as with no whitespace (regardless of whether it's a successful parse or not). I wasn't using Python, but it was something like this:

    # `parser` is the parsing function under test; the booleans toggle optional
    # components and the integers control how much whitespace gets inserted
    @given(st.booleans(), st.booleans(), st.booleans(),
           st.integers(0, 3), st.integers(0, 3), st.integers(0, 3), st.integers(0, 3))
    def whitespace_is_ignored(b1: bool, b2: bool, b3: bool, s1: int, s2: int, s3: int, s4: int):
      v1 = "foo" if b1 else ""
      v2 = "123" if b2 else ""
      v3 = "x" if b3 else ""

      spaces = lambda n: " " * n
      spaced = "".join([spaces(s1), v1, spaces(s2), v2, spaces(s3), v3, spaces(s4)])
      assert parser(v1 + v2 + v3) == parser(spaced)
The property-checker immediately found that "foo123x " (with two spaces at the end) will fail to parse. When I fixed that, it found that spaces after the first component will end up in the result, like "foo 123x" -> ("foo ", 123, 'x').

Of course, we could make this property more general (e.g. by taking the components as inputs, instead of hard-coding those particular values); but this was really quick to write, and managed to find multiple issues!

If I had written a bunch of explicit examples instead, then it's pretty likely I would have found the "foo 123x" issue; but I don't think I would have bothered writing combinations with multiple consecutive spaces, and hence would not have found the "foo123x " issue.

FuckButtons 18 hours ago | parent | prev | next [-]

Mostly I found composite strategies the most helpful part. I had a lot of fiddly details to get right when validating some DSP algorithms and the intermediate data structures they required, and for that it was very helpful; mostly I used property testing with some kind of composite strategy for generating valid inputs.
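
Roughly the shape of such a strategy (the details here are invented, not my actual DSP code):

    from hypothesis import strategies as st

    # a composite strategy producing a signal whose length is always a whole
    # number of windows, so the code under test only ever sees valid input
    @st.composite
    def signal_and_window(draw):
        window = draw(st.integers(min_value=2, max_value=64))
        n_windows = draw(st.integers(min_value=1, max_value=8))
        samples = draw(st.lists(
            st.floats(allow_nan=False, allow_infinity=False),
            min_size=window * n_windows, max_size=window * n_windows))
        return samples, window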

IanCal a day ago | parent | prev | next [-]

I definitely had the same issue when I started; I think you've hit on the two main mental blocks many people run into.

> this means you need a second function that acts the same as the function you wrote.

Actually no, you need to check that outcomes match certain properties. The example there works for a reimplementation (for all inputs, the output of the existing function and new one must be identical) but you can break down a sorting function into other properties.

Here are some for a sorting function that sorts list X into list Y.

Lists X and Y should be the same length.

All of the elements in list X should be in list Y.

Y[n] <= Y[n+m] for all n, m >= 0 with n+m < len(X)

If you support a reverse parameter, iterating backwards over the sorted list should match the reverse sorted list.

Sorting functions are a good simple case but we don't write them a lot and so it can be hard to take that idea and implement it elsewhere.

Here's another example:

In the app, focus ELEMENT A. Press tab X times. Press shift-tab X times. Focus should be on ELEMENT A.

In an app with two or more elements, focus ELEMENT A. Press tab. The focus should have changed.

In an app with N elements. Focus ELEMENT A. Press tab N times. The focus should be on ELEMENT A.

Those are simple properties that should be true and are easy to test; maybe you'd have some specific set of tests for certain cases and numbers, but there's a bigger chance you'd then miss that focus gets trapped in some part of your UI.
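
The first of those as a sketch; the app driver API here is made up:

    from hypothesis import given, strategies as st

    # hypothetical driver: tab()/shift_tab() move focus, focused() reports it
    @given(st.integers(min_value=0, max_value=20))
    def test_tab_then_shift_tab_restores_focus(n):
        app = build_app()
        start = app.focused()
        for _ in range(n):
            app.tab()
        for _ in range(n):
            app.shift_tab()
        assert app.focused() == start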

I had a library for making UIs for TVs that could be shaped in any way (so a carousel with 2 vertical elements then one larger one, then two again for example). I had a test that for a UI, if you press a direction and the focus changes, then pressing the other direction you should go back to where you were. Extending that, I had that same test run but combined with "for any sequence of UI construction API calls...". This single test found a lot of bugs in edge cases. But it was a very simple statement about using the navigation.

I had a similar test - no matter what API calls are made to change the UI, either there are no elements on screen, or one and only one has focus. This was defined in the spec, but we also had defined in the spec that if there was focus, removing everything and then adding in items meant nothing should have focus. These parts of the spec were independently tested with unit tests, and nobody spotted the contradiction until we added a single test that checked the first part automatically.

This I find exceptionally powerful when you have APIs. For any way people might use them, some basic things hold true.

I've had cases where we were identifying parts of text and things like "given a random string, adding ' thethingtofind ' results in getting at least a match for ' thethingtofind ', with start and end points, and when we extract that part of the string we get ' thethingtofind '". Sounds basic, right? Almost pointless to add when you've seen the output work just fine and you have explicit test cases for this? It failed. At one step we lowercased things for normalisation and we took the positions from this lowercased string. If there was a Turkish I in the string, when lowercased it became two characters:

    >>> len("İ") == len("İ".lower())
    False
So anything where this appeared before the match string would throw off the positions.

Have a look at your tests; I imagine you have some tests like

"Doing X results in Y" and "Doing X twice results in Y happening once". These are two tests expressing one property: "Doing X n times, where n >= 1, Y happens once". That describes the actual property you want to test much more explicitly.

You probably have some tests where you parameterise a set of inputs but actually check the same output - those could be randomly generated inputs testing a wider variety of cases.

I'd also suggest that if you have a library or application and you cannot give a general statement that always holds true, it's probably quite hard to use.

Worst case, point hypothesis at an HTTP API and have it auto-generate some tests that say "no matter what I send to this, it doesn't return a 500".
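
Something like this, with a made-up endpoint:

    import requests
    from hypothesis import given, settings, strategies as st

    # hypothetical endpoint: whatever body we send, it must never return a 500
    @given(st.text())
    @settings(deadline=None)  # network calls are slow and jittery
    def test_api_never_returns_500(payload):
        response = requests.post("http://localhost:8000/api/widgets", data=payload)
        assert response.status_code != 500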

In summary (and no this isn't LLM generated):

1. Don't test every case.

2. Don't check the exact output, check something about the output.

3. Ask where you have general things that are true. How would you explain something to a user without a very specific setup (you can't do X more than once, you always get back Y, it's never more than you put in Z...)

4. See if you have numbers in tests where the specific number isn't relevant.

5. See if you have inputs in your tests where the input is a developer putting random things in (sets of short passwords, whatever)

pfdietz a day ago | parent | prev | next [-]

I don't do the testing using Hypothesis, but I do a great deal of property based testing of Common Lisp implementations, where it has been invaluable.

One large testing loop is testing CL's compile function. This involves creating random valid lambda expressions (unnamed functions), compiling them, and running them on random input. The properties tested are things like "the compiler doesn't crash, the generated code doesn't crash, different versions of the code with different valid declarations compute the same thing".

More specific properties are tested for specific CL functions. For example, CL has a function subtypep which, given two types, says either the first value is or is not a subtype of the second, or that it couldn't figure it out. One can test this on randomly generated types, evaluating properties like (subtypep T1 T2) should be consistent with (subtypep `(not ,T2) `(not ,T1)).

The point of PBT isn't that you write tests to cover every possible case, it's that PBT will cook up tests for all sorts of cases you will never come up with yourself. I have repeatedly found it kicking out the whackiest test inputs that happen to trigger bugs from strange interactions.

rowanG077 a day ago | parent | prev | next [-]

On more complex logic you can often reason about many invariants. I don't think using sorting and then comparing to the existing sort is a good example, precisely for the reason you mention.

A sort itself is a good example though. You can do much better than just comparing to an existing sort implementation, because you can easily check whether an array is sorted with a linear scan. If your sort is stable you can do an additional check that for all pairs of adjacent elements that compare as equal, their order is the same as in the input array.
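
A sketch of that stability check (done per key rather than per adjacent pair), with my_stable_sort standing in for the implementation:

    from hypothesis import given, strategies as st

    # sort pairs by their first element only, then check that items with equal
    # keys keep their original relative order
    @given(st.lists(st.tuples(st.integers(0, 5), st.integers())))
    def test_sort_is_stable(pairs):
        result = my_stable_sort(pairs, key=lambda p: p[0])
        for key in {p[0] for p in pairs}:
            assert [p for p in result if p[0] == key] == [p for p in pairs if p[0] == key]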

a day ago | parent | prev [-]
[deleted]