In a CPU, there are 2 kinds of branch predictors, predictors for conditional branches, which must predict only whether the branch is taken or not taken, and predictors for indirect branch instructions, which must predict the address of the next instruction that will be executed after the indirect branch instruction.
The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.
When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.
When another load instruction is executed later, the load address is used to query the cache memory and if the query is successful the data value is returned by the query and it is used instead of accessing again the main memory.
Similarly, (simplifying a lot) when an indirect branch instruction is executed the address where the execution continues after the indirect branch is stored together with the address of the branch instruction in an associative memory similar to a cache memory.
When an indirect branch instruction is executed later, the address of the branch instruction is used to query the associative memory and if the query is successful it will return the address of the next instruction to be executed and execution will continue there speculatively, until it is confirmed by reading from the main memory that this is the correct jump address. If the jump address is wrong then all the work done is discarded and execution continues at the right address.
This description is simplified, because modern branch descriptors may store for a given branch instruction address not only the last jump address, but a sequence of past jump addresses, together with a pattern about how they have been used (e.g. alternating between jumping twice to the first address and jumping once to the second address). Thus successive queries for the branch instruction address will retrieve different jump addresses, based on their activation pattern.
The point of TFA is that if the dispatch loop contains a single indexed branch instruction, then the associative memory of the indirect branch predictor contains a single record, keyed by the address of that instruction. Depending on the CPU, the record will contain one or more past addresses, but it can predict correctly only a short periodic pattern of jump addresses, i.e. of opcodes used in dispatching.
This could work to predict correctly a short loop in the interpreted program, but typically most of the predictions will be wrong and each branch misprediction takes much more time than the execution of any instruction. In smaller and cheaper CPUs, which might store a single jump address per record, the correct predictions would be extremely rare, as they would happen only if the interpreted program contains repeated instructions.
In the optimized program, the indexed branch instruction is replicated into each "case" so now the associative memory will store separate records, one for each "case", containing the address or addresses of the next cases where execution has continued in the past.
Because after each interpreted instruction the probabilities of the next instructions are not equal, but some instructions are more likely to follow, that greatly increases the chances that the indirect branch predictor will make correct predictions from time to time.