Remix.run Logo
addaon 3 days ago

I've actually never worked with a "pure" interpreter in Forth, only compilers of various levels of complexity. Threaded code compilers are (in my experience) by far the most common way to deal with forth -- and they are very much 2-pass. Even when used as an "interpreter," they generate (trivial, usually) machine code, then jump to it.

Consider a definition (in some ill-defined Forth variant) like

    : abs-sqr ( n -- |n^2| ) * 0 < if neg then ;
We can categorize things:

    IMMEDIATE words used here are : ( if then ;
    Normal words are * < neg
    Literals are 0
    Tokens that are not seen by the compiler directly (!) are abs-sqr, the contents of the comment, and )
So the compiler goes through one token (that it sees) at a time.

First up is `:`. `:` is an IMMEDIATE word, so the compiler just calls it now. `:` then consumes a symbol (`abs-sqr`) from the token stream so the compiler won't see it (think of calling next() on an iterator in python or equivalent), then creates a symbol table entry from that symbol to the /current compiled code output stream pointer/ -- that is, just after the last piece of code that was compiled.

Next up is `(`, since we already consumed `abs-sqr`. This is an IMMEDIATE word again -- and it just consumes tokens until one of them is exactly `)`, discarding them -- that is, it defines a comment.

Finally we get to the "easy" case, `*`. The compiler finally compiles! It looks up this symbol in the symbol table, sees that it is /not/ IMMEDIATE, and compiles a call to this address.

Now the compiler sees `0`. This is a literal token, so we don't even bother with the symbol table; we special-case code to push this value on the stack.

'<' is a non-IMMEDIATE symbol, we already know this case.

We've already discussed `if`, `neg`, and `then`. And `;` is an IMMEDIATE word that just puts a return instruction into the code stream.

Clear as mud?

There's one more step from here that's important to make, which is that the choice of what's IMMEDIATE or not is not strictly defined. Some words must be IMMEDIATE for correctness, if they interact with the compiler in interesting ways, like consuming tokens or back-patching instructions. But suppose we want to be clever... `<` works fine as a non-IMMEDIATE word. If we want to inline it, we /could/ have the compiler generalize by looking at the instructions pointed to by it, seeing how long they are (or tracking that in the symbol table), and deciding whether to inline... or we can just re-implement `<` as an immediate word that adds the appropriate instructions directly into the code stream. Combined with assembly words, this can be pretty trivially expressed, and it really changes the paradigm a bit.

Someone 2 days ago | parent | next [-]

> We can categorize things: > IMMEDIATE words used here are : ( if then ;

`:` normally isn’t immediate

> First up is `:`. `:` is an IMMEDIATE word, so the compiler just calls it now

`:` gets executed because the interpreter, when it isn’t compiling, goes through a loop:

  1) read a token until the next space in the input
  2) look up that token in the dictionary
    3a) if a word is found: call it
    3b) if no word is found: try interpreting the token as a number
      4a) if it can be interpreted such: push that number on the stack
      4b) if it cannot: bail out with an error message
  
So, `:` gets called in step 3a.

> Now the compiler sees `0`. This is a literal token, so we don't even bother with the symbol table; we special-case code to push this value on the stack.

As indicated above, that’s not how ‘normal’ forths work. A lookup is done for a word named `0`, and if it exists, a call to it is compiled.

Many forths had words named after small constants such as `0`, `1`, `2` or `-1` because compiling a call to a function took less memory than compiling a call to the “LIT” function and compiling the constant value.

bxparks 2 days ago | parent | prev [-]

> I've actually never worked with a "pure" interpreter in Forth, only compilers of various levels of complexity. Threaded code compilers are (in my experience) by far the most common way to deal with forth -- and they are very much 2-pass. Even when used as an "interpreter," they generate (trivial, usually) machine code, then jump to it.

Lots of good info, thank you. I don't think I will fully understand what you wrote until I implement a Forth interpreter myself.

So a side question: If most Forth "interpreters" are compilers, how does a Forth interpreter work in a Harvard architecture microprocessor (with separate memory space for data and instructions) instead of a Von Neumann architecture with a unified memory layout? In other words, in a Harvard architecture (e.g. AVR microcontrollers), the Forth compiler will live in read-only flash ROM, and it cannot generate machine code into RAM and execute it, because the data memory is not executable.

addaon 2 days ago | parent [-]

> how does a Forth interpreter work in a Harvard architecture microprocessor

You compile to "direct threaded code" in data memory; direct threaded code represents a sequence of calls as a sequence of addresses to call. So while "normal" threaded code (what Wikipedia calls "subroutine threading") would just have

    call word_a
    call word_b
    call word_c
And then executing that means jumping to the first instruction, direct threaded code would have

    &word_a
    &word_b
    &word_c
And then there's a suuuuper tiny runtime (like four of five instructions, literally) that has a "runtime instruction pointer" or whatever you want to call it, and just increments that and does an indirect call through to the next word whenever it's returned to.