apparently you can straight up duplicate/add/rearrange layers without changing any of the weights and get better results as well - https://dnhkng.github.io/posts/rys/

▲

quotemstr 6 hours ago | parent | next [-]

Neat!

> This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’

One of the craziest LLM hacks that doesn't get love is https://polymathic-ai.org/blog/xval/

xVal basically says "tokenizing numbers is hard: what if instead of outputting tokens that combine to represent numbers, we just output the numbers themselves, right there in the output embedding?"

It works! Imagine you're discussing math with someone. Instead of saying "x is twenty five, which is large" in words, you'd say "x is", then switch to making a whistling noise in which the pitch of your whistle, in its position within your output frequency range, communicated the concept of 25.00 +/- epsilon. Then you'd resume speech and say "which is large".

I think the sentiment is that today's models are big and well-trained enough that receiving and delivering quantities as tokens representing numbers doesn't hurt capabilities much, but I'm still fascinated by xVal's much more elegant approach.

	▲	khalic 6 hours ago \| parent [-]
		I was having some issues with IP addresses representation, this might solve it

▲

khalic 9 hours ago | parent | prev [-]

This is crazy, thank you for the link!