Remix.run Logo
andsoitis 3 days ago

> My problem is that I cannot see how control flow works in Forth, e.g. a simple if-then-else.

Hopefully this is helpful: https://www.forth.com/starting-forth/4-conditional-if-then-s...

bxparks 3 days ago | parent [-]

Thanks for that. I kinda know how it "works" at the user-level. I meant to say, I don't know how it is implemented.

My mental model of Forth is that there is a simple parser that consumes space-delimited keywords. The interpreter looks up that keyword in a dictionary, which gives the address of the machine code that handles that word. The interpreter either makes a subroutine call to that address (subroutine threaded), or jumps to that address (called "direct threaded" if I recall, where the handler jumps back to the interpreter instead of executing a RET).

But that's where my mental model of Forth breaks down, because IF-THEN-ELSE cannot be implemented in that model. So there must be something else fundamental in the Forth interpreter that I don't understand.

addaon 3 days ago | parent | next [-]

> So there must be something else fundamental in the Forth interpreter that I don't understand.

The missing bit is IMMEDIATE mode. Words can be tagged as IMMEDIATE, which means that they get executed at compile time (or parse time, for an interpreter), rather than a call to them getting compiled (executed at run time, for an interpreter). IF/ELSE/THEN are then "just" IMMEDIATE mode words -- but you can add your own. The "special sauce" is that IF compiles (easier to talk about for a compiler; generalize as needed) a conditional branch to an unknown address, and puts the address of that branch instruction (or an equivalent) on the /compile time/ data stack; THEN then looks at the address on the /compile time/ data stack and patches that instruction to branch to the correct address. Plenty of subtlety possible, but the basic primitive of IMMEDIATE mode is the key.

bxparks 3 days ago | parent [-]

Ah I see, this is peeling back a few layers of obscurity about the Forth interpreter for me. Let's stick with a Forth interpreter because that seems easier to think about for me.

Are you saying that the Forth interpreter is a 2-pass interpreter? Or does the interpreter go into a special IMMEDIATE mode upon hitting the IF keyword, then it just consumes subsequent tokens without doing any dispatching, until it hits the THEN token? It sounds like nested IF-THEN-ELSE becomes tricky to handle.

How does the FORTH interpreter handle loops? Does the interpreter hit the WHILE token, goes into IMMEDIATE mode, remembers the location of the WHILE, then dispatches all the subsequent code, until it hits the REPEAT token, then branches back to the WHILE?

addaon 3 days ago | parent | next [-]

Oh, and because I didn't address it directly in the longer answer...

> Does the interpreter hit the WHILE token, goes into IMMEDIATE mode, remembers the location of the WHILE, then dispatches all the subsequent code, until it hits the REPEAT token, then branches back to the WHILE?

Yes. The beauty is that, in the context of a threaded code compiler (which, again, I encourage you to use as your default model for Forth, even though other options are possible), WHILE just pushes the address of the output code stream onto the compile-time stack. REPEAT expects the address to be on the compile-time stack and compiles a conditional jump to that address. This obviously and trivially provides support for nested control structures of all types; as long as the word that pushes compile-time data onto the stack is correctly paired with the word the pops it, we have stack semantics for nested control, which is exactly the expectation. So while your description is completely correct, "remembers" is almost trivial here -- "puts data on stack" is the primitive operation of remembering anything in Forth, and that's all that's needed here, no fancy data structures or look-aside buffers or anything. (Note that the compiler does require at least two non-stack data structures, the symbol table and the output code stream, but those reflect real domain complexity.)

addaon 3 days ago | parent | prev | next [-]

I've actually never worked with a "pure" interpreter in Forth, only compilers of various levels of complexity. Threaded code compilers are (in my experience) by far the most common way to deal with forth -- and they are very much 2-pass. Even when used as an "interpreter," they generate (trivial, usually) machine code, then jump to it.

Consider a definition (in some ill-defined Forth variant) like

    : abs-sqr ( n -- |n^2| ) * 0 < if neg then ;
We can categorize things:

    IMMEDIATE words used here are : ( if then ;
    Normal words are * < neg
    Literals are 0
    Tokens that are not seen by the compiler directly (!) are abs-sqr, the contents of the comment, and )
So the compiler goes through one token (that it sees) at a time.

First up is `:`. `:` is an IMMEDIATE word, so the compiler just calls it now. `:` then consumes a symbol (`abs-sqr`) from the token stream so the compiler won't see it (think of calling next() on an iterator in python or equivalent), then creates a symbol table entry from that symbol to the /current compiled code output stream pointer/ -- that is, just after the last piece of code that was compiled.

Next up is `(`, since we already consumed `abs-sqr`. This is an IMMEDIATE word again -- and it just consumes tokens until one of them is exactly `)`, discarding them -- that is, it defines a comment.

Finally we get to the "easy" case, `*`. The compiler finally compiles! It looks up this symbol in the symbol table, sees that it is /not/ IMMEDIATE, and compiles a call to this address.

Now the compiler sees `0`. This is a literal token, so we don't even bother with the symbol table; we special-case code to push this value on the stack.

'<' is a non-IMMEDIATE symbol, we already know this case.

We've already discussed `if`, `neg`, and `then`. And `;` is an IMMEDIATE word that just puts a return instruction into the code stream.

Clear as mud?

There's one more step from here that's important to make, which is that the choice of what's IMMEDIATE or not is not strictly defined. Some words must be IMMEDIATE for correctness, if they interact with the compiler in interesting ways, like consuming tokens or back-patching instructions. But suppose we want to be clever... `<` works fine as a non-IMMEDIATE word. If we want to inline it, we /could/ have the compiler generalize by looking at the instructions pointed to by it, seeing how long they are (or tracking that in the symbol table), and deciding whether to inline... or we can just re-implement `<` as an immediate word that adds the appropriate instructions directly into the code stream. Combined with assembly words, this can be pretty trivially expressed, and it really changes the paradigm a bit.

Someone 2 days ago | parent | next [-]

> We can categorize things: > IMMEDIATE words used here are : ( if then ;

`:` normally isn’t immediate

> First up is `:`. `:` is an IMMEDIATE word, so the compiler just calls it now

`:` gets executed because the interpreter, when it isn’t compiling, goes through a loop:

  1) read a token until the next space in the input
  2) look up that token in the dictionary
    3a) if a word is found: call it
    3b) if no word is found: try interpreting the token as a number
      4a) if it can be interpreted such: push that number on the stack
      4b) if it cannot: bail out with an error message
  
So, `:` gets called in step 3a.

> Now the compiler sees `0`. This is a literal token, so we don't even bother with the symbol table; we special-case code to push this value on the stack.

As indicated above, that’s not how ‘normal’ forths work. A lookup is done for a word named `0`, and if it exists, a call to it is compiled.

Many forths had words named after small constants such as `0`, `1`, `2` or `-1` because compiling a call to a function took less memory than compiling a call to the “LIT” function and compiling the constant value.

bxparks 2 days ago | parent | prev [-]

> I've actually never worked with a "pure" interpreter in Forth, only compilers of various levels of complexity. Threaded code compilers are (in my experience) by far the most common way to deal with forth -- and they are very much 2-pass. Even when used as an "interpreter," they generate (trivial, usually) machine code, then jump to it.

Lots of good info, thank you. I don't think I will fully understand what you wrote until I implement a Forth interpreter myself.

So a side question: If most Forth "interpreters" are compilers, how does a Forth interpreter work in a Harvard architecture microprocessor (with separate memory space for data and instructions) instead of a Von Neumann architecture with a unified memory layout? In other words, in a Harvard architecture (e.g. AVR microcontrollers), the Forth compiler will live in read-only flash ROM, and it cannot generate machine code into RAM and execute it, because the data memory is not executable.

addaon 2 days ago | parent [-]

> how does a Forth interpreter work in a Harvard architecture microprocessor

You compile to "direct threaded code" in data memory; direct threaded code represents a sequence of calls as a sequence of addresses to call. So while "normal" threaded code (what Wikipedia calls "subroutine threading") would just have

    call word_a
    call word_b
    call word_c
And then executing that means jumping to the first instruction, direct threaded code would have

    &word_a
    &word_b
    &word_c
And then there's a suuuuper tiny runtime (like four of five instructions, literally) that has a "runtime instruction pointer" or whatever you want to call it, and just increments that and does an indirect call through to the next word whenever it's returned to.
vdupras 3 days ago | parent | prev [-]

No, that's not it. It's much simpler than that, yet has much deeper implications than you think. You don't see it in other languages. The closest thing would maybe be compile-time macros in Zig? But in Forth, the power it unlocks comes in its purest form, without any fluff around it.

vdupras 3 days ago | parent | prev | next [-]

As @addaon writes, your missing ingredient is immediateness. This is one of the most powerful, yet mind-boggling aspects of Forth. I encourage you to check it out, it will make you grow as a developer.

bxparks 3 days ago | parent [-]

I will definitely look into that.

If understanding this special IMMEDIATE mode is required to understand the Forth interpreter for something as fundamental as control-flow, it seems fair to say that Forth is not a simple language. It's not just an advanced programmable RPN calculator An RPN calculator has a program counter, which makes control-flow easy to understand.

In comparison, C is a high level language, but the mapping from C code to assembly language is relatively simple. (Yes, compiler optimizations against the C "abstract machine" can make the resulting code completely obscure. But if we turn off optimization, the resulting assembly code matches the C code fairly directly.)

kragen 2 days ago | parent | next [-]

Very much the contrary! In C all the syntax and control structures have to be built into the language; this makes C a much more complex language than Forth, because in Forth the language and interpreter don't even have to support things like comments, string literals, variables, and control flow. Because of immediate words, all of that can be built on top of the base language in high-level Forth, and almost always is. This allows the language itself to be enormously simpler.

It's also generally the case that in a native-code-compiling Forth the mapping from the Forth source to the machine code emitted is very much simpler and more direct than in C; as Virgil implicitly pointed out, the machine code is generally more or less in the same order as the source code, which in C it is not, and you don't have a bunch of implicit type conversions, ad-hoc polymorphic arithmetic operators, and so on. (It doesn't have to be more direct, since you can do arbitrary computation at compile time, but it usually is.)

addaon 3 days ago | parent | prev | next [-]

> If understanding this special IMMEDIATE mode is required to understand the Forth interpreter for something as fundamental as control-flow, it seems fair to say that Forth is not a simple language. It's not just an advanced programmable RPN calculator An RPN calculator has a program counter, which makes control-flow easy to understand.

"Simple" is not a well-defined threshold but rather a continuum, so it's hard to agree or disagree with this. I think it's perfectly valid to observe that Forth is more complex than an RPN calculator, though.

But think of it this way: An RPN calculator has two types of tokens, literals and symbols. When seeing a literal, it evaluates a push of that literal to the stack. When seeing a symbol, it evaluates an immediate call to the behavior associated with that symbol.

Forth adds exactly one more concept: non-IMMEDIATE words. Everything an RPN calculator can do can be done as IMMEDIATEs in Forth. But by adding one metadata bit to the symbol table (IMMEDIATE or not), and adding a threaded call to any non-IMMEDIATE words to the output code stream, Forth gains function definition, full generic control flow support, compiler extensibility, support for embedding domain-specific languages (just IMMEDIATE words that consume the interesting tokens), and more.

I don't know if this counts as "simple" compared to C, but it surely counts as "parsimonious." It's hard to think of a more cleanly defined single semantic change that adds as much power to a language.

(And of course in C, once you understand the language understanding the runtime library is mostly about understanding runtime behavior, some macros not withstanding; but in Forth, the runtime library and the language are conflated through IMMEDIATE symbols, so this separation is much less clear; totally accept that this could be considered less "simple", although in practice most Forths have about as many pre-defined IMMEDIATE words as C has keywords.)

vdupras 3 days ago | parent | prev [-]

The mapping to assembly of:

42 = if ."hey!" then

is much more straightforward than

if (n == 42) printf("hey!");

I understand that to the newcomer, it might not appear that way, but implementing a Forth is really eye-opening in that regard.

If I might allow myself a bit of promotion, I wrote https://tumbleforth.hardcoded.net/ as such an eye-opening process. It's less "gentle" than Easy Forth here, but it digs deeper.

bxparks 2 days ago | parent [-]

From the comments in this thread, it seems that to understand how Forth implements a simple IF-THEN-ELSE control-flow, I have to understand the difference between non-immediate and immediate words. I also have to understand the difference between outer and inner interpreter. And I have to understand how Forth generates snippets of machine code (where does that get stored? I thought Forth only has 2 stacks, does it also have a general heap?). Then understand how the THEN token goes back and patches the placeholder address generated by the IF token. And understand the difference between the parsing phase and the interpreted phase of the Forth interpreter/compiler.

But you are saying that the Forth version is simpler than C version which will kinda look like this after it's compiled (Z80 assembly code, it's in my head right now):

    ld a, (variableN)
    cp 42
    ld hl, StringHey
    call z, Printf
    ...
 StringHey:
    .db "hey!", 0
I find that hard to believe, but I accept that you believe that.
vdupras 2 days ago | parent [-]

It's fine, I can't force you in either. Maybe one day you'll dive into the subject. From the look of the comments here, you have all the hints you need.

kragen 3 days ago | parent | prev [-]

You're describing the outer interpreter in interpretation state; Forth control flow words don't work properly in interpretation state, only in compile state. They're immediate words, so they execute at compile time instead of run time, so they can do arbitrary things to the code being compiled. Here's Mike Perry and Henry Laxen's implementation of the main control-flow words from F83, which is an indirect-threaded Forth:

    \ Run Time Code for Control Structures                04OCT83HHL \ \ Run Time Code for Control Structures                05MAR83HHL
    CODE BRANCH   (S -- )                                            \ BRANCH    Performs an unconditional branch.  Notice that we
    LABEL BRAN1   0 [IP] IP MOV   NEXT END-CODE                      \    are using absolute addresses insead of relative ones. (fast)
    CODE ?BRANCH   (S f -- )                                         \ ?BRANCH   Performs a conditional branch.  If the top of the
      AX POP   AX AX OR   BRAN1 JE   IP INC   IP INC   NEXT END-CODE \    parameter stack in True, take the branch.  If not, skip
                                                                     \    over the branch address which is inline.

    \ Extensible Layer            Structures              03Apr84map \ \ Extensible Layer            Structures              03Apr84map
    : ?CONDITION   (S f -- )                                         \ ?CONDITION
       NOT ABORT" Conditionals Wrong"   ;                            \    Simple compile time error checking.  Usually adequate
    : >MARK      (S -- addr )    HERE 0 ,   ;                        \ >MARK        Set up for a Forward Branch
    : >RESOLVE   (S addr -- )    HERE SWAP !   ;                     \ >RESOLVE     Resolve a Forward Branch
    : <MARK      (S -- addr )    HERE    ;                           \ <MARK        Set up for a Backwards Branch
    : <RESOLVE   (S addr -- )    ,   ;                               \ <RESOLVE     Resolve a Backwards Branch

    : ?>MARK      (S -- f addr )   TRUE >MARK   ;                    \ ?>MARK       Set up a forward Branch with Error Checking
    : ?>RESOLVE   (S f addr -- )   SWAP ?CONDITION >RESOLVE  ;       \ ?>RESOLVE    Resolve a forward Branch with Error Checking
    : ?<MARK      (S -- f addr )   TRUE   <MARK   ;                  \ ?<MARK       Set up for a Backwards Branch with Error Checking
    : ?<RESOLVE   (S f addr -- )   SWAP ?CONDITION <RESOLVE  ;       \ ?<RESOLVE    Resolve a backwards Branch with Error Checking

    : LEAVE   COMPILE (LEAVE)   ; IMMEDIATE                          \ LEAVE and ?LEAVE could be non-immediate in this system,
    : ?LEAVE  COMPILE (?LEAVE)  ; IMMEDIATE                          \   but the 83 standard specifies an immediate LEAVE, so they
                                                                     \   both are for uniformity.
     
    \ Extensible Layer            Structures              01Oct83map \ \ Extensible Layer            Structures              27JUL83HHL
    : BEGIN   ?<MARK                                   ; IMMEDIATE   \ These are the compiling words needed to properly compile
    : THEN    ?>RESOLVE                                ; IMMEDIATE   \ the Forth Conditional Structures.  Each of them is immediate
    : DO      COMPILE (DO)   ?>MARK                    ; IMMEDIATE   \ and they must compile their runtime routines along with
    : ?DO     COMPILE (?DO)  ?>MARK                    ; IMMEDIATE   \ whatever addresses they need.  A modest amount of error
    : LOOP                                                           \ checking is done.  If you want to rip out the error checking
        COMPILE (LOOP)  2DUP 2+ ?<RESOLVE ?>RESOLVE    ; IMMEDIATE   \ change the ?> and ?< words to > and < words, and
    : +LOOP                                                          \ all of the 2DUPs to DUPs and the 2SWAPs to SWAPs.  The rest
        COMPILE (+LOOP) 2DUP 2+ ?<RESOLVE ?>RESOLVE    ; IMMEDIATE   \ should stay the same.
    : UNTIL   COMPILE ?BRANCH    ?<RESOLVE             ; IMMEDIATE
    : AGAIN   COMPILE  BRANCH    ?<RESOLVE             ; IMMEDIATE
    : REPEAT  2SWAP [COMPILE] AGAIN   [COMPILE] THEN   ; IMMEDIATE
    : IF      COMPILE  ?BRANCH  ?>MARK                 ; IMMEDIATE
    : ELSE    COMPILE  BRANCH ?>MARK  2SWAP ?>RESOLVE  ; IMMEDIATE
    : WHILE   [COMPILE] IF                             ; IMMEDIATE
When the interpreter is toodling along in compile state, compiling a colon definition by stowing pointers one after another (at the pointer here) into the definition of some word you're compiling, and it encounters an if, it sees that if is immediate, and so instead of stowing a pointer to if it just runs it immediately. The definition of if is compile ?branch ?>mark. compile is also an immediate word [correction, no, it's not, see below comment, though the following is still correct]; compile ?branch stows a pointer to ?branch into the colon definition being compiled, and then ?>mark writes a 0 into the entry following the ?branch and pushes true and the address of the 0 on the operand stack, at compile time, with the sequence true here 0 ,. The interpreter toodles along compiling the body of the if and eventually gets to, for example, then, which is also immediate, and is defined as ?>resolve, which overwrites the 0 into the address of the indirect-threaded code that will be compiled following the then. It does this with swap ?condition here swap !. The swap ?condition part aborts with an error if there isn't an unresolved if or similar on the stack to resolve, consuming the true, leaving only the address of the 0 that ?>mark had pushed. So then here swap ! overwrites that 0 with the current value of here.

?branch is a word written in assembly which does a conditional jump in the inner interpreter (the one that interprets the indirect-threaded code); when it's executed, it pops a value off the stack and checks to see if it's zero, and if so, it changes the interpreter's execution pointer ip (which is defined elsewhere as the register si) to the number stored in the threaded code following the pointer to ?branch. If, on the other hand, the value it popped was nonzero, it increments ip twice to skip over that number. (Note that Laxen's comment on ?branch is incorrect in that it reverses the sense of the test.)

All the forward jumps work in pretty much the same way: when you begin a control structure you call ?>mark to write a zero placeholder and push its address, and later on you "resolve" that placeholder by popping its address off the stack and overwriting it with the correct address. leave (break) and ?leave (if (...) break) work slightly differently, but mostly the same.

Backward jumps work the other way around: when you begin a control structure, as in begin, you call ?<mark to save the current address on the stack so that you can jump to it later, which ends up just being true here. Then, to actually compile the jump, for example in until or again, you call ?<resolve, which ends up just being swap ?condition ,—the , pops the jump target address off the stack and compiles it into the indirect threaded code, serving as an argument the ?branch or branch instruction compiled immediately before it.

begin ... while ... repeat is handled, as you can see, by treating the while ... repeat part as an if ... then with an unconditional jump back to the begin jammed in right before the then.

Hopefully this is helpful!

BTW, for the above, I reformatted the block files from the F83 distribution with http://canonical.org/~kragen/sw/dev3/blk2unix.py, which you may find useful if you want to do the same thing.

alexisread 3 days ago | parent | next [-]

Oof, I forget that most forths are a bit mind bending with the compiler STATE. There are 2/3 alternatives to using compiler state aka IMMEDIATE.

https://github.com/dan4thewin/FreeForth2 This uses a two-pass search, for macros` and after that immediate words.

The most interesting one is Able forth https://github.com/ablevm which uses flow control to defer execution, aka quotations. I find using quotations instead of immediate modes easier to understand.

With both of these, they always compile expressions before executing them, so IF/THEN/ELSE can be used at any time.

andrewla 3 days ago | parent | prev | next [-]

Thanks for this expansion of the ideas involved. My question here is what does the COMPILE word do? What is the state of the VM / compiler / repl or whatever after it encounters that word?

That "IF" is implemented in terms of other more fundamental operators is fine, but can we write a program that just uses the fundamental operators that demonstrates IF-like behavior but doesn't introduce any intermediate words?

kragen 3 days ago | parent [-]

Actually compile is not an immediate word (at least in F83). I was wrong about that. Here's the F83 definition:

    : COMPILE   (S -- )   R> DUP 2+ >R   @ ,   ;                     \ COMPILE     Compile the following word when this def. executes
This takes its return address (which points to the following word in the colon definition that called it), dups it, adds 2 to it, and puts that back on the return stack as its new return address. Then, it fetches from its original return address with @ (thus getting the address of the word that followed it in the colon definition, such as ?branch in my if example above) and compiles it with , into whatever is currently being compiled. Then, when it returns, having added 2 to the return address means that we don't actually execute ?branch or whatever; we've skipped over it.

So it doesn't change the state of the interpreter at all!

I think you're asking if you can use things like ?branch usefully without writing any immediate words. In some sense I think the answer is yes in F83 but no in standard Forth. I think you can put a code sequence like ?branch [ here 0 , ] into a colon definition to do what if does, and then later on say [ here swap ! ] to do what then does. I just typed this definition into F83, and it seems to work†:

    : is3 3 = ?branch [ here 0 , ] ." yes" [ here swap ! ] ;
You could sort of think of if and then as being macros for ?branch [ here 0 , ] and [ here swap ! ] respectively (although I'm omitting the checks they use for proper control structure nesting).

On the other hand, this is only possible because [ is an immediate word, and because ?branch is exposed, and happens to take an absolute address in the next word in the colon definition (as opposed to a byte delta or something). As it happens, exactly the same definition of is3 appears to work in GForth 0.7.3 and PFE 0.33.71, but it definitely will not work on, for example, any native-code-compiling Forth.

The standard way to invoke things like ?branch is using if, while, and so on. And you don't have to define any immediate words to do that, either.

______

† By "work" I mean it seems to behave the same as

    : is3 3 = if ." yes" then ;
bxparks 3 days ago | parent | prev [-]

Wow, that's going to take some time and effort to digest, but thank you.

Yes, I think control-flow is easier to understand in assembly language than the implementation you showed in Forth. :-)

kragen 3 days ago | parent [-]

Happy to help!

I think you're mistaken about assembly language.

In assembly language, the thing that plays the role of these definitions like if and then and ?<resolve is the assembler's symbol table and relocation logic, which goes back and changes your jump instructions (etc.) to jump to the places where it finds that your labels have been defined to point. Typically this involves things like hash functions, hash table collision resolution, various operand encodings for things like short jumps and long jumps, and so on.

Although you can write an assembler that does all this in an afternoon, I don't think you will ever find an assembler whose implementation of all this functionality is easier to understand than the above 30 lines of code. It might be easier to understand per line of code but there will be a lot more lines of code to understand, like 10× or 100×.