Remix clone Hacker News

	▲	jayd16 3 hours ago
		So split the difference and start encoding input at the word or phrase level?
	▲	calebkaiser 3 hours ago
		Lots of researchers have done just this! There's a really rich history of research + lots of contemporary work on different encoding/representation strategies. This might be interesting to you: https://sbert.net/ What makes the DeepSeek-OCR and related results exciting to some researchers is less about the fact that you could devise a tokenization scheme that has fewer tokens, and more about how well it works.
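
For a rough sense of the "fewer tokens" part, here is a minimal sketch that counts how many units the same text yields at character, word, and sentence granularity. The function name `count_units` and the regex-based splitting are assumptions made for illustration; this is not how DeepSeek-OCR or Sentence-BERT tokenizes anything, and it says nothing about how well each representation works downstream.

```python
import re


def count_units(text: str) -> dict[str, int]:
    """Count the units one text yields at three granularities (illustrative only)."""
    # Character level: every character, including spaces, is its own unit.
    chars = list(text)
    # Word level: runs of word characters, plus each punctuation mark on its own.
    words = re.findall(r"\w+|[^\w\s]", text)
    # Sentence level: naive split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {"char": len(chars), "word": len(words), "sentence": len(sentences)}


counts = count_units("Tokens are cheap. Meaning is not!")
print(counts)
```

Coarser units shrink the sequence, but each unit then has to carry more meaning, which is why the quality of the representation matters more than the raw count.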