bob1029 5 days ago

Still plugging away on my linear genetic programming experiments.

The big debate in my head right now is whether a next-byte prediction architecture is better or worse than full-sequence prediction.

The benefit of next-byte prediction is that we only expect 1 byte of information to be produced per execution of the UTM program. The implication is that the program probably doesn't need many interpreter cycles to figure out this single byte each time (given a reasonable context size). However, the downside is that you only get 256 levels of signal to work with at tournament-selection time. There isn't much gradient when comparing candidates on a specific task.
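Concretely, the scoring I have in mind looks something like this. A minimal sketch, where `run_utm` is just a stand-in for the interpreter (not my actual API), and the per-sample error is capped at 255, which is the coarse signal I'm worried about:

```python
def next_byte_fitness(program, samples, run_utm, cycle_budget=1_000):
    """Score a candidate by how close its predicted byte is to the target.

    samples: iterable of (context: bytes, target: int) pairs.
    run_utm: hypothetical interpreter that executes `program` against a
    context and returns a single predicted byte (0-255).
    Returns total error; lower is better.
    """
    error = 0
    for context, target in samples:
        predicted = run_utm(program, context, max_cycles=cycle_budget)
        error += abs(predicted - target)  # at most 255 levels per sample
    return error
```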

The full-sequence prediction architecture is expected to produce the entire output (i.e., up to the context-window size) for each UTM program invocation. This implies that we may need a much larger number of interpreter cycles each time. However, we get a far richer gradient to work with at fitness-comparison time (100-1000 bytes instead of one).
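The full-sequence version of the same sketch, with the same hypothetical `run_utm` now returning a whole output buffer per invocation. Summing byte-wise error across hundreds of positions is where the finer-grained ranking comes from, at the cost of the much larger cycle budget:

```python
def full_sequence_fitness(program, samples, run_utm, cycle_budget=100_000):
    """Score a candidate over the whole expected output sequence.

    samples: iterable of (context: bytes, target: bytes) pairs.
    run_utm: hypothetical interpreter returning an output buffer (bytes).
    Returns total error; lower is better.
    """
    error = 0
    for context, target in samples:
        output = run_utm(program, context, max_cycles=cycle_budget)
        # Truncate/pad to target length so short outputs are penalized too.
        output = output[: len(target)].ljust(len(target), b"\x00")
        for got, want in zip(output, target):
            error += abs(got - want)
    return error
```

With, say, a 512-byte target, two near-equal candidates can differ by thousands of error levels here instead of at most 255, which is the richer gradient I mean.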

Other options could involve bringing BPE (byte-pair encoding) into my life, but I really want to try to think different for now. If I take the bitter lesson as strongly as possible, tokenization and related techniques (next-token prediction) could be framed as clever tricks that a different computational model could avoid.