jandrese 2 days ago

> Make illegal states unrepresentable

This is a nice ideal to shoot for, but strict adherence as advocated in the article is a short path to algorithmic explosions and unusable interfaces on real-life systems.

For example, if you have two options that are mutually incompatible, this principle says you don't make them booleans, but instead a strict enum type populated with only the legal combinations of the options. A great idea until you have 20 options to account for and your enum is now 2^16 entries long. Then your company opens a branch in a different country with a different regulatory framework, the options list grows to 50, and your code no longer fits on a hard drive.

MeetingsBrowser 2 days ago | parent | next [-]

The mistake here is having 2^16 valid options.

If you truly do have 2^16 valid and distinct behaviors, it is not possible for humans to correctly write 2^16 different code paths anyway.

More than likely, the majority of the combinations of your 20 Boolean flags are invalid. You should strive to have that checked by the program, and not only as a note in external documentation.

No one is saying int should be turned into an enum.

jandrese 2 days ago | parent | next [-]

It's only 20 options, but a real system accumulating that many distinct options is not exactly a stretch. It's not like every single set of options is its own code path; most options represent a small deviation at one particular part of the code. Most of the options aren't mutually exclusive either; only a few combinations are illegal.

Imagine a simple shipping system for example. The package might be routed via truck, boat, plane, pipeline, foot, etc... Probably even a combination of those options. The package might be low, medium, or high priority, although high priority packages are not allowed to be delivered by boat. The package might be small, large, or liquid, but liquids can't be delivered by foot. There are 195 different countries with even more regulatory regimes to consider, some of which may have different customs requirements based on mode of transport. Now imagine a problem that is actually complicated.
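
For concreteness, here is a rough sketch (TypeScript, with invented names, covering just two of those rules) of what pushing this into the types looks like:

    // Hypothetical sketch: only two of the rules above, encoded as types.
    type Mode = "truck" | "boat" | "plane" | "pipeline" | "foot";
    type Priority = "low" | "medium" | "high";
    type Cargo = "small" | "large" | "liquid";
    
    // "high priority packages are not allowed to be delivered by boat"
    type PriorityRule =
      | { mode: Exclude<Mode, "boat">; priority: Priority }
      | { mode: "boat"; priority: Exclude<Priority, "high"> };
    
    // "liquids can't be delivered by foot"
    type CargoRule =
      | { mode: Exclude<Mode, "foot">; cargo: Cargo }
      | { mode: "foot"; cargo: Exclude<Cargo, "liquid"> };
    
    // every additional rule (country, customs, ...) multiplies the cases
    type Shipment = PriorityRule & CargoRule;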

The idea of encoding all of this business logic into the datatype is a road to madness IMHO. Especially if the rules can change on you and require you to rework your datatypes to match. On the small scale this makes a lot of sense and is good advice, but strict adherence is impractical.

MeetingsBrowser 2 days ago | parent [-]

You don't need a single type to represent the entire program state.

We probably both agree that separate types for shipping methods, priorities, size, and country make sense.

The API can be designed to prevent illegal transitions between those types, so the program can never arrive at an invalid state. The exact implementation depends on the language.
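
As a hedged sketch (TypeScript for concreteness, names invented): each state type only exposes the transitions that are legal from it, so an illegal jump simply has no function to call.

    // Sketch: a package can only be delivered after it has been routed.
    type Received = { kind: "received"; packageId: string };
    type Routed = { kind: "routed"; packageId: string; mode: "truck" | "boat" | "plane" };
    type Delivered = { kind: "delivered"; packageId: string; mode: "truck" | "boat" | "plane" };
    
    function route(p: Received, mode: Routed["mode"]): Routed {
      return { kind: "routed", packageId: p.packageId, mode };
    }
    
    function deliver(p: Routed): Delivered {
      return { ...p, kind: "delivered" };
    }
    // deliver(someReceivedPackage) does not type-check: it was never routed.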

> The idea of encoding all of this business logic into the datatype is a road to madness IMHO

The alternative is hoping that every developer remembers not to violate any of the implicit and unenforced rules.

If the system is too complicated to be represented in code, can a human realistically hold the entire state machine in their head as they make changes?

joe_the_user 2 days ago | parent | prev [-]

The parent is discussing a situation where you have 16 distinct binary states, many but not all of which are mutually compatible. So you can have a 16-bit vector of the states, 16 binary variables, or an enum of valid states. The enum could have on the order of 2^16 members, depending on the distribution of valid states, but unlike the others, an invalid state would not be possible to represent.

yazzku 2 days ago | parent | prev | next [-]

If you only have 3 states, then yes, that should be an enum, not a pair of booleans, because you have 3 states, not 2x2 independent ones. Making the 4th state unrepresentable removes code and error checking. It's also just simple domain modeling.
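
A minimal illustration (TypeScript, names invented):

    // Two booleans admit 4 combinations; if only 3 are meaningful, name them.
    // Before: { isConnecting: boolean; isConnected: boolean }
    //         allows the nonsense state isConnecting && isConnected.
    type ConnectionState = "disconnected" | "connecting" | "connected";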

Your latter example needs context. In what situation have you had an enum with 2^16 states? In any case, if you generally have a reasonable number of booleans, with some states being logically invalid, then you'll need error checking on those anyway.

Leaving them as booleans gives you the ability to monkey-patch things later in an ad-hoc manner, which is useful when you're still in a prototyping phase and the requirements aren't clear (and you could argue that this is almost always the case in many problem domains; I think that would be valid criticism.) But if you've already modeled the domain and you want any assurance of correctness, then I don't see why you wouldn't enforce the constraints at the type level. Your criticism seems to me the usual trade-off between strong static typing and dynamic and/or weak monkey-typing.

kccqzy 2 days ago | parent | prev | next [-]

Nobody says you have to have a single enum type containing all the combinations. Chances are, you can use sum types (discriminated unions) to factor things nicely if you think about them. For example if option B is only relevant when option A is set to true, you can have something like

    data OptA = ATrue OptB | AFalse
    data OptB = BTrue | BFalse

There are three valid combinations but no type has three alternatives. Nobody in their right mind would write out an enum with 2^16 cases. If you do, you are misusing enums and it's time to consider other tools in your language.
joe_the_user 2 days ago | parent | next [-]

> Nobody says you have to have a single enum type containing all the combinations.

No, no one would continue up to 2^16, and the code would get unmanageable long before that. But it's an illustration of the problem with starting out by handling the invalid states of two variables with an enum: what happens when more and more variables arrive? Sure, the standard answer is "just refactor", but my experience is that no client or boss wants to hear "adding this small state is going to require a lot of change", and a trickle of binary conditions is a very common occurrence, as is code expanding to handle these (and becoming excessively complicated).

> Chances are, you can use sum types (discriminated unions) to factor things nicely if you think about them.

Maybe you have a good chance of combining these binary conditions in a good way. But note that you've substituted a hard problem instance (factoring binary conditions) for an easy problem instance (checking binary conditions). Functional programming has a weird dual personality where on the one hand you hear "a functional programmer is always a smarty and solves hard problems as a matter of course", but you also hear "functional programming would be the dominant paradigm if only ... we taught people young so they wouldn't have their bad procedural habits".

BoiledCabbage 2 days ago | parent [-]

> and a trickle of binary conditions is a very common occurrence, as is code expanding to handle these (and becoming excessively complicated).

But you still have to handle this in your code. Wherever your conditions live, the nest of if statements still needs to cover all of these invalid combinations and ensure your app doesn't silently do the wrong thing or just crash (the latter being the better of the two).

Changing requirements requires changing code. I don't think it's a valid argument to say "we shouldn't take that approach because as requirements change we'll have to change the code". That's essentially software development.

Practically, if you don't want to use enums and want another option, use a "builder" object. Pass in all of your booleans there and have it do your validation when you call its build method. It returns a read-only configuration that the rest of your system can use, and the build method fails if an invalid combination of flags is passed in.
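
A rough sketch of that builder (TypeScript, names invented; the w/x and y/z rules mirror the examples elsewhere in the thread):

    // Flags can be set freely on the builder, but only build() hands out a
    // config, and it refuses invalid combinations.
    class ConfigBuilder {
      private flags: Record<string, boolean> = {};
    
      set(name: string, value: boolean): this {
        this.flags[name] = value;
        return this;
      }
    
      build(): Readonly<Record<string, boolean>> {
        if (this.flags["y"] && this.flags["z"]) throw new Error("y and z cannot both be set");
        if (this.flags["x"] && !this.flags["w"]) throw new Error("x requires w");
        return Object.freeze({ ...this.flags });
      }
    }
    
    // const config = new ConfigBuilder().set("w", true).set("x", true).build();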

Again you force only valid combinations to exist after you call "build". And all code relies on the config produced by the build method.

joe_the_user a day ago | parent [-]

> Changing requirements requires changing code. I don't think it's a valid argument to say "we shouldn't take that approach because as requirements change we'll have to change the code". That's essentially software development.

You're misunderstanding me. Of course changing requirements means changing code. The distinction is between a situation where "a small change in a requirement means a small change in code" and one where "a small change in a requirement means a BIG change in code". The "make an enum of all the legal cases" approach produces a situation where adding more binary conditions, and requirements around them, results in each change causing a larger and larger increase in code. And this in turn can lead to "we have to refactor this if we get one more change", which is even more frustrating to those dictating requirements (and not uncommon in enterprise software).

> Practically, if you don't want to use enums and want another option, use a "builder" object. Pass in all of your booleans there and have it do your validation when you call its build method.

Cases are fine in some circumstances, enums of legal cases are fine in other circumstances, and these builders might even be useful in a few bizarre cases. The main argument I'd make is that when one passes from "this approach can be useful, let's look at the situation" to "anything without this is bad", you often wind up with a complete misallocation of resources and a fragile inability to make changes, as can be seen in today's legacy code (which was often yesterday's "best coding practices" code).

BoiledCabbage 12 hours ago | parent [-]

I still feel like there is an important point you're walking past.

If there are 2^16 different combinations that are relevant, then you still need to handle those 2^16 combinations in your code. If you're reading a configuration file from a user and only certain subsets of the options are valid, somewhere in your code you still need all of the complexity to let the user know they have passed in an invalid combination. And all of that logic is equally or more complex than an enum.

If all of those cases can be handled by a few simple "if"s in your code, then you'll have only a few valid options in your enum. If you have a ton of valid options you need to list in your enum, then you'll have a ton of cases you need to handle in your code.

Your underlying complaint sounds to me like you've received a lot of complex cases via your requirements. But either way the complexity is there, regardless of whether you validate it up front in an enum or deep in your code base.

jandrese 2 days ago | parent | prev [-]

Imagine a case where you have 4 options: W, X, Y, Z.

Y and Z are mutually exclusive.

X can only be set if W is set.

If Y is set then X must be set.

Going down this road you end up encoding your business logic into your datatypes. Which is good to a degree, but makes things messy when new options are added or requirements change. Imagine a new option U is introduced that is only valid when W is unset and Z is set but allows X to be set. Your interface becomes very hard to manage with even a small amount of complexity added to the options.

kccqzy 2 days ago | parent | next [-]

This is an instance of inherent complexity. Your domain is complex. You either place them into a series of nested if statements (which is what majority of programmers do), or you place it into the type system. You cannot avoid complexity either way. We are merely arguing where this complexity belongs. Such complexity is hard to manage in either case.

aiono 2 days ago | parent | prev [-]

What is the alternative? And is it really better than encoding into the data type?

The only option I can think of is writing unit tests for each case, which doesn't seem all that different.

And without type encoding you're never sure that the invariant always holds. You can manipulate any option in any part of the program, so after any modification you have to perform a "sanity check" to see whether the options are in a well-defined state. I don't see how this is better than encoding the invariant into the types.
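
For comparison, the runtime "sanity check" looks roughly like this (TypeScript sketch; the flags follow the W/X/Y/Z example from the parent, the function name is invented):

    // Nothing stops code elsewhere from producing a bad combination, so every
    // place that touches the flags has to remember to run this check again.
    interface Options { w: boolean; x: boolean; y: boolean; z: boolean; }
    
    function assertValid(o: Options): void {
      if (o.y && o.z) throw new Error("Y and Z are mutually exclusive");
      if (o.x && !o.w) throw new Error("X can only be set if W is set");
      if (o.y && !o.x) throw new Error("if Y is set then X must be set");
    }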

galaxyLogic 2 days ago | parent | prev | next [-]

True. And what is an "illegal state" anyway? If two options are "mutually incompatible", it means that, as far as we know so far, they have never occurred at the same time.

But that is just how things are currently. The world may change. And it's important to strive for maintainable code that can accommodate such change easily.

Expanding company operations to work in a different country is an example of this (our) "world changing". States that never occur together here, may occur together there. Or in the future more generally.

So, making illegal states non-representable is about avoiding errors in respect to the system specification. But it does not take into account the fact that in the real world specifications typically evolve and change even during development, and perhaps more so later.

thfuran 2 days ago | parent [-]

So, what, you’d advocate declaring all function parameters and returns as void* to maximize flexibility with respect to future changes in requirements?

galaxyLogic 2 days ago | parent [-]

No, of course not. Just that we shouldn't forget that "specs" can change and typically do. Keep an open mind. Never say "never". Illegal states can become legal.

aiono 2 days ago | parent | prev | next [-]

In this example, is it really the case that all 20 are dependent on each other, so that you have to combine all of them into a single enum? The ones that are independent you can keep as booleans, and you should.

keybored 2 days ago | parent | prev | next [-]

Where does this exponential size requirement come from?

jandrese 2 days ago | parent | next [-]

Imagine you have a program where there are 4 options, W, X, Y, Z. Y and Z can't be set at the same time, and X can't be set unless W is set. If Y is set then X must be set as well.

How do you represent this in a way that makes it impossible, even through programmer error elsewhere in the program, to have the flags in an invalid state?

You can create an enum that looks like:

    enum program_state {
        W_X_Y_NZ,
        W_NX_NY_Z,
        NW_NX_NY_Z,
        /* ... and so on */
    };
Hackbraten 2 days ago | parent | next [-]

Maybe I'm spoiled by TypeScript but this is how I'd do it in a structural typing system:

    type ConstrainedByW =
    | { w: false, x: false }
    | { w: true, x: boolean }
    
    type ConstrainedByY =
    | { x: boolean, y: false, z: boolean }
    | { x: true, y: true, z: false }
    
    type ProgramState = ConstrainedByW & ConstrainedByY
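
Usage is then plain assignment; a combination that breaks one of the rules should be rejected by the compiler (assuming the types above):

    const ok: ProgramState = { w: true, x: true, y: true, z: false };
    // const bad: ProgramState = { w: false, x: true, y: false, z: false }; // error: x without w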
jandrese 2 days ago | parent [-]

Yes, but certainly you can see how even a toy example with just 4 booleans is already getting complicated?

Hackbraten 2 days ago | parent [-]

It's certainly complicated. But not WTF-level complicated.

Complexity is both subjective and relative. Allowing invalid states in your model is going to complicate things, too.

I wouldn't mind having this example in production. It will scale just fine with the number of booleans.

aiono 2 days ago | parent | prev | next [-]

  enum program_state {
    NONE,
    X_W,
    Y_X_W,
    Z,
    W,
    Z_W,
    Z_X_W,
  }
You only have these 7 options (counting the state where nothing is set). And in this case you can easily use a SAT solver to generate the cases for you so that you don't have to write them by hand.
keybored 2 days ago | parent | prev [-]

And the Make Impossible States Unrepresentable crowd programs like that?

WorldMaker 2 days ago | parent [-]

It is not too bad in languages with discriminated unions. It's also not hard to fake discriminated unions in languages without them, even if you will miss some of the niceties.

Rather than thinking of it as an enum, think of it as a list of constructors:

    class ProgramState {
        bool w, x, y, z;

        ProgramState(x, z) // implies y = true, w = true
        ProgramState(w, z) // cannot set x; implies y = false (cannot set y)
    }
Even if the class needs all four fields internally to represent all the possible combinations of data, there are no constructors/setters to work with them independently. (Which is also related to why "Make Impossible States Unrepresentable" generally implies immutability: having no setters at all makes it much easier to make states impossible.)

In languages with discriminated unions you might even have some natural field sharing depending on how your "constructors" are written and the memory expansion isn't necessarily "exponential".

crdrost a day ago | parent [-]

Also to mention it, languages without discriminated unions often have generics and function types, which can be used to build discriminated unions with Church encodings:

    // idiomatic typescript
    type Optional<IfAbsent, IfPresent> = 
     | {type: 'absent', detail: IfAbsent}
     | {type: 'present', value: IfPresent}
    
    // Church-encoded version
    type Either<x, y> = <z>(ifLeft: (x: x) => z, ifRight: (y: y) => z) => z
    
    // isomorphism between the two
    function church<x, y>(opt: Optional<x, y>): Either<x, y> {
      return (ifLeft, ifRight) => opt.type === 'absent'? ifLeft(opt.detail) : ifRight(opt.value)
    }
    function unchurch<x, y>(opt: Either<x, y>): Optional<x, y> {
      return opt<Optional<x,y>>(x => ({type: 'absent', detail: x}), y => ({type: 'present', value: y}))
    }
In addition, the Church encoding of a sum type is a function that takes N handler functions and calls the appropriate one for the case that the data type is in. With a little squinting, this is the Visitor pattern.

    interface LeftRightVisitor<X, Y, Z> {
      visit(x: Left<X>): Z
      visit(y: Right<Y>): Z
    }
    interface LeftRight<X, Y> {
      accept<Z>(visitor: LeftRightVisitor<X, Y, Z>): Z;
    }
    class Left<X> implements LeftRight<X, any> {
      constructor(public readonly x: X) {}
      accept<Z>(visitor: LeftRightVisitor<X, any, Z>) {
        return visitor.visit(this)
      }
    }
    class Right<Y> implements LeftRight<any, Y> {
      constructor(public readonly y: Y) {}
      accept<Z>(visitor: LeftRightVisitor<any, Y, Z>) {
        return visitor.visit(this)
      }
    }
    // isomorphism
    function visitify<X, Y>(opt: Optional<X, Y>): LeftRight<X, Y> {
      return opt.type === 'absent' ? new Left(opt.detail) : new Right(opt.value)
    }
    function unvisitify<X, Y>(opt: LeftRight<X, Y>): Optional<X, Y> {
      return opt.accept({
        visit(value: Left<X> | Right<Y>) {
          return value instanceof Left? {type: 'absent', detail: value.x} : {type: 'present', value: value.y}
        }
      })
    }
The main difference from the usual visitor pattern is that the usual visitor pattern doesn't return anything (it expects you to be holding some mutable state that the visitor will mutate); you can do that too if you don't have access to a suitable generic for the Z parameter.
crdrost 2 days ago | parent | prev [-]

Approximate expansion of the original claim, without direct endorsement:

Suppose you have an Order object that needs to track where some larger process is in relation to three subtasks. We could imagine say that the Inventory department needs to set a physical object aside, then the Billing department needs to successfully charge for it, then the Shipping department needs to retrieve that physical object and ship it.

You start from a description of this as "one Order contains three Subtasks" where a Subtask contains the Go-style (Optional[Result], Optional[Error]) type. This architecture almost fits into a relational database, except that foreign key constraints are a bit funky if you shove everything into one nullable Result column. But let's just have the Result be some random JSON in the Subtasks table and let our relational purists weep.

Then you read this advice and you start to see that this allows for a lot of illegal states: things could contain both a result AND an error, or neither. You eventually decide that "neither" is an allowed state (it means pending). These are two boolean flags representing only 3 legal states, and so they need to be factored into an enum: the enum is "Pending | Success[Result] | Failure[Error]".
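
In TypeScript-flavoured notation (a sketch; the names above are language-neutral pseudotypes), that refactor is roughly:

    // Before: Go-style pair, four representable shapes per subtask
    type SubtaskBefore<R> = { result?: R; error?: Error };
    
    // After: one three-way enum per subtask, only the legal shapes remain
    type SubtaskState<R> =
      | { kind: "pending" }
      | { kind: "success"; result: R }
      | { kind: "failure"; error: Error };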

Well, except the problem is a bit more nuanced because the pending-states also need to be consistent among the different subtasks: there is a dependency graph among them. So you should actually have an enum that says:

    Inventory_Pending
    Inventory_Failure[Error]
    Inventory_OK_Billing_Pending[InventoryData]
    Inventory_OK_Billing_Failure[InventoryData, Error]
    Inventory_OK_Billing_OK_Shipping_Pending[InventoryData, BillingData]
    Inventory_OK_Billing_OK_Shipping_Failure[InventoryData, BillingData, Error]
    Inventory_OK_Billing_OK_Shipping_OK[InventoryData, BillingData, ShippingData]
See, you would have had 3x3x3 = 27 representable states for the Order before, but we have reduced that to only the 7 legal states. Yay!

But now consider e.g. the following mutation. On Failure cases the executives at our company mandate that we never return a failed Order to a Pending status; rather, we must always create a separate Order. This Order might skip inventory and/or billing, and those need to be represented separately, as Inventory_Skipped[OrderID] or Inventory_And_Billing_Skipped[OrderID]. So now our list of states, following the "make illegal states unrepresentable" logic, should really be:

    [... the 7 above, plus ...]
    Inventory_Skipped_Billing_Pending[OrderID]
    Inventory_Skipped_Billing_Failure[OrderID, Error]
    Inventory_Skipped_Billing_OK_Shipping_Pending[OrderID, BillingData]
    Inventory_Skipped_Billing_OK_Shipping_Failure[OrderID, BillingData, Error]
    Inventory_Skipped_Billing_OK_Shipping_OK[OrderID, BillingData, ShippingData]
    Inventory_And_Billing_Skipped_Shipping_Pending[OrderID]
    Inventory_And_Billing_Skipped_Shipping_Failure[OrderID, Error]
    Inventory_And_Billing_Skipped_Shipping_OK[OrderID, ShippingData]
Now someone else wants to add remediation actions, but only to remediate the exact error in the failure state, so _Failure is going to mean "no remediation taken" but we need to add some _Remediation with a boolean saying whether that process has completed or not. So we add:

    Inventory_Remediation[Error, Bool, Array[RemediationEvent]]
    Inventory_OK_Billing_Remediation[InventoryData, Error, Bool, Array[RemediationEvent]]
    Inventory_OK_Billing_OK_Shipping_Remediation[InventoryData, BillingData, Error, Bool, Array[RemediationEvent]]
    Inventory_Skipped_Billing_Remediation[OrderID, Error, Bool, Array[RemediationEvent]]
    Inventory_Skipped_Billing_OK_Shipping_Remediation[OrderID, BillingData, Error, Bool, Array[RemediationEvent]]
    Inventory_And_Billing_Skipped_Shipping_Remediation[OrderID, Error, Bool, Array[RemediationEvent]]
We're only up to 21 total states so far which is still probably manageable? But these changes do demonstrate exponential growth, which is a technical term that means that the growth of each step is some small fraction of the total growth that has happened up until that point. Because everything depends on Inventory (it's at the root of the tree), when we add a new state that the Inventory can be in (Skipped) we have to add enum cases for all of the other states, and we pay a cost proportional to the size of the tree. Similarly when everything can have an error (at the leaves of the tree), when we add a new uniform requirement for errors we have to add new leaves all throughout the tree and we pay a cost proportional to the size of the tree. (Another thing to notice about the Remediation state is that it is a Pending state for another Subtask that could have been added to the original Order whenever something moves into Failure mode.)

You get something powerful by reducing the 256ish-or-whatever states into the 21 legal states; you have a compile-time assurance that no bugs in your code have created weird states that can propagate their weirdnesses throughout the system. But you also have to maintain the 21 legal states all at once, instead of maintaining 4 subtasks each having one of 4 statuses.

reval 2 days ago | parent | prev | next [-]

This is technically correct but disingenuous. It reminds me of the climate change comic where a scientist asks "What if climate change is a big hoax and we create a better world for nothing?"

joe_the_user 2 days ago | parent | prev [-]

This is an excellent illustration. I feel functional programming provides great tools and concepts for a cohesive system (one with consistent requirements, which accomplishes a single thing, etc). But many programs you write involve parts that aren't very cohesive, and trying to make them cohesive is far more work than just bolting on a case statement or similar.

Parent currently downvoted without comments. Seems like a sad way to respond.

kccqzy 2 days ago | parent [-]

I feel like you might be missing the point. A case statement is not to be treated as something you bolt on as a hack. It is the right tool for the job the vast majority of the time. When you use case statements, you refine and reduce your state space. It makes code easier to understand. When you combine this with the idea of making illegal states unrepresentable, a case statement gives you an exhaustive listing of what could happen. Even a lot of Haskell programmers, after using things like the `maybe` function to eliminate Maybe types and things like `fmap` over Maybe, eventually find that using case expressions produces the clearest code even though it may be one line longer.
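
A small TypeScript analogue of that exhaustive case analysis (type and names invented):

    type Shape =
      | { kind: "circle"; radius: number }
      | { kind: "square"; side: number };
    
    function area(s: Shape): number {
      switch (s.kind) {
        case "circle": return Math.PI * s.radius ** 2;
        case "square": return s.side ** 2;
        default: {
          // Adding a new Shape variant makes this assignment stop compiling,
          // which forces every case analysis to be revisited.
          const unreachable: never = s;
          return unreachable;
        }
      }
    }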

I really hope HN enforces a rule that a downvote must be accompanied by a reply.