adaptit 6 hours ago
Probably a naive question, but: couldn't you precompute some vector representation of the string once, and reduce collation to a vector comparison? Basically move the cost upfront and get back to the "fast" byte-comparison case?
theobeers 5 hours ago | parent
Well, that's sort of like what the "sort key" of a given string is: a vector of its nonzero primary-level collation weights, then secondaries, then tertiaries... And once you know what collation options you're using (tailoring etc.), you could just compute the sort key of each string as you encounter it and cache that. Then you would be able to reach a collation decision rapidly for any two strings whose sort keys you already have. It does boil down to vector comparison at that point.

I experimented with adding an LRU cache to my Rust UCA implementation, but I saw essentially no performance benefit (on the workloads I had), and I decided the feature wasn't worth the complexity and removed it.

Something I found about Unicode collation is that, once the fast paths are added, they get hit a surprisingly large percentage of the time. I'm thinking in particular of the way that performant UCA implementations build sort keys lazily, stopping once a collation decision is reached. The average "point of difference" is at the primary level and within a few characters of the start of each string. Only a small portion of the sort keys ever get built.

I am definitely interested in finding more ways to avoid work in the collation routine. Many times, I've had what I thought was a clever idea and found that it didn't pan out in benchmarks. Thank you for your comment!
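To make the two approaches concrete, here is a heavily simplified Rust sketch. The weights() function is a made-up stand-in (it just uses the code point as the primary weight); a real UCA implementation looks weights up in the DUCET/CLDR tables and handles contractions, expansions, normalization, variable weighting, and so on. All function names here are mine, for illustration only.

    use std::cmp::Ordering;
    use std::collections::HashMap;

    // Toy per-character weights: (primary, secondary, tertiary).
    // A real collator would do table lookups here, not this.
    fn weights(c: char) -> (u32, u32, u32) {
        (c as u32, 1, 1)
    }

    // Full sort key: primaries, a 0 level separator, secondaries,
    // another 0, then tertiaries. Two precomputed keys compare with
    // a plain lexicographic Vec comparison.
    fn sort_key(s: &str) -> Vec<u32> {
        let mut key: Vec<u32> = s.chars().map(|c| weights(c).0).collect();
        key.push(0);
        key.extend(s.chars().map(|c| weights(c).1));
        key.push(0);
        key.extend(s.chars().map(|c| weights(c).2));
        key
    }

    // Cached comparison: compute each key at most once, then it's
    // just vector comparison. (The clones keep this sketch simple;
    // a serious version would avoid them.)
    fn compare_cached<'a>(
        cache: &mut HashMap<&'a str, Vec<u32>>,
        a: &'a str,
        b: &'a str,
    ) -> Ordering {
        let ka = cache.entry(a).or_insert_with(|| sort_key(a)).clone();
        let kb = cache.entry(b).or_insert_with(|| sort_key(b)).clone();
        ka.cmp(&kb)
    }

    // Lazy comparison at the primary level: weights are produced on
    // demand and we stop at the first difference, so full sort keys
    // are rarely built.
    fn compare_lazy(a: &str, b: &str) -> Ordering {
        let mut wa = a.chars().map(|c| weights(c).0).filter(|&w| w != 0);
        let mut wb = b.chars().map(|c| weights(c).0).filter(|&w| w != 0);
        loop {
            match (wa.next(), wb.next()) {
                (Some(x), Some(y)) if x == y => continue,
                (None, None) => break,
                (x, y) => return x.cmp(&y), // None sorts before Some
            }
        }
        Ordering::Equal // tied at primary; a real collator would fall
                        // through to secondary and tertiary weights here
    }

    fn main() {
        let mut cache = HashMap::new();
        assert_eq!(compare_cached(&mut cache, "apple", "apricot"), Ordering::Less);
        assert_eq!(compare_lazy("apple", "apricot"), Ordering::Less);
    }

The point about lazy building shows up in compare_lazy: for typical inputs it returns within the first few weights, so most of the work a cache would save never happens in the first place.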