bionhoward 2 days ago
How does this compare with the Byte Latent Transformer [1]? As I read it, this applies convolution to attention scores after embedding, while BLT applies attention at embedding (tokenization) time?

1. https://ai.meta.com/research/publications/byte-latent-transf...
janalsncm 2 days ago | parent
As I understand it, BLT uses a small neural network to tokenize (it segments the byte stream into patches) but doesn't change the attention mechanism, while MTA keeps traditional BPE tokenization but changes the attention mechanism itself. In principle you could use both (latency be damned!)
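To make the distinction concrete, here's a minimal PyTorch sketch of the MTA side, assuming MTA here means Meta's Multi-Token Attention, which convolves pre-softmax attention logits across nearby query/key positions and heads. The kernel shape and the simplified causal handling are my assumptions, not the paper's exact recipe (the paper does extra masking along the query axis for decoding):

```python
import torch
import torch.nn.functional as F

def mta_scores(q, k, kernel):
    """Toy MTA-style attention weights: convolve the attention logits
    so each score can depend on neighboring (query, key) pairs.

    q, k:    (batch, heads, seq, dim)
    kernel:  (heads, heads, cq, ck) conv weights mixing logits across
             nearby query positions, key positions, and heads
             (cq, ck assumed odd so padding preserves shape).
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale        # (b, h, s, s)
    seq = logits.shape[-1]
    causal = torch.triu(
        torch.ones(seq, seq, dtype=torch.bool, device=logits.device),
        diagonal=1,
    )
    # Zero out future keys *before* the convolution so it can't
    # smuggle future information into past positions...
    logits = logits.masked_fill(causal, 0.0)
    cq, ck = kernel.shape[-2:]
    logits = F.conv2d(logits, kernel, padding=(cq // 2, ck // 2))
    # ...then mask again with -inf before the softmax.
    logits = logits.masked_fill(causal, float("-inf"))
    return F.softmax(logits, dim=-1)  # multiply by v as usual
```

The BLT side wouldn't touch this function at all: there, a small byte-level model decides patch boundaries and the main transformer attends over patch embeddings, which is why the two ideas are orthogonal and could stack.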