Rather than relying on positional encodings alone, we absolutely should add speaker encodings on top of the token embeddings, so that each token also carries who said it.
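To make the idea concrete, here is a minimal sketch of what this could look like, assuming the speaker encoding is a per-speaker vector summed with the token and positional vectors (much like segment embeddings). The table sizes, the names `make_table` and `embed`, and the choice of sinusoidal positions are all assumptions for illustration. The random tables stand in for parameters that would be learned in a real model.

```python
import math
import random


def sinusoidal_position(pos, dim):
    """Sinusoidal positional encoding for a single position."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def make_table(num_rows, dim, rng):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def embed(token_ids, speaker_ids, token_table, speaker_table):
    """Sum token, positional, and speaker vectors at every position."""
    if len(token_ids) != len(speaker_ids):
        raise ValueError("need exactly one speaker id per token")
    dim = len(token_table[0])
    out = []
    for pos, (tok, spk) in enumerate(zip(token_ids, speaker_ids)):
        pe = sinusoidal_position(pos, dim)
        out.append(
            [t + p + s for t, p, s in zip(token_table[tok], pe, speaker_table[spk])]
        )
    return out


rng = random.Random(0)
DIM = 8
token_table = make_table(100, DIM, rng)  # hypothetical vocabulary of 100 tokens
speaker_table = make_table(2, DIM, rng)  # hypothetical: two speakers

token_ids = [5, 17, 42, 5]    # token 5 appears twice
speaker_ids = [0, 0, 1, 1]    # first two tokens from speaker 0, rest from speaker 1
embedded = embed(token_ids, speaker_ids, token_table, speaker_table)
```

Summing keeps the model dimension unchanged. Concatenating the speaker vector instead would be another option, at the cost of a wider input.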