marginalia_nu 4 days ago

It's part of the problem domain in index lookups that issuing multiple requests at the same time is not possible, except as part of some entirely guess-based readahead scheme that may indeed drive up disk utilization but is unlikely to do much else. Large blocks are a solution with that constraint as a given.

That paper seems to mostly focus on throughput via concurrent independent queries, rather than single-query performance. It's arriving at a different solution because it's optimizing for a different variable.

throwaway81523 4 days ago | parent | next [-]

In most search engines the top few tree layers are in a RAM cache and can also hold the disk addresses for the next levels, so maybe that lets you start some concurrent requests.
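
A minimal sketch of what that could look like, assuming liburing and a made-up index file layout (the file name, node size, offsets and fan-out are all illustrative, not any particular engine's format): once the cached upper levels hand you the disk offsets of several lower-level nodes, you can submit reads for all of them in one batch.

    /* Hypothetical sketch: batch-fetch several lower-level index nodes whose
     * offsets are known from the RAM-resident upper tree levels. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NODE_SIZE 4096
    #define FANOUT    8   /* number of lower-level nodes fetched concurrently */

    int main(void) {
        int fd = open("index.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* These offsets would come from the cached upper layers. */
        off_t node_offsets[FANOUT] =
            {0, 4096, 8192, 12288, 16384, 20480, 24576, 28672};
        char (*bufs)[NODE_SIZE] = malloc(sizeof(char[FANOUT][NODE_SIZE]));

        struct io_uring ring;
        io_uring_queue_init(FANOUT, &ring, 0);

        for (int i = 0; i < FANOUT; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], NODE_SIZE, node_offsets[i]);
            sqe->user_data = i;
        }
        io_uring_submit(&ring);          /* one syscall, FANOUT in-flight reads */

        for (int i = 0; i < FANOUT; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("node %llu: read %d bytes\n",
                   (unsigned long long)cqe->user_data, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        free(bufs);
        close(fd);
        return 0;
    }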

Veserv 4 days ago | parent | prev [-]

Large block reads are just a readahead scheme where you prefetch the next N small blocks. So you are just stating that contiguous readahead is close enough to arbitrary readahead, especially if you tune your data structure to optimize for larger regions of locality.

marginalia_nu 4 days ago | parent [-]

Well, I mean yes, you can use io_uring to read each 128 KB block as 32 parallel 4 KB reads, but that's a very roundabout way of doing it that doesn't significantly improve performance, since the operation time is more or less the same either way. If a 128 KB read takes roughly the same time as a 4 KB read, 32 parallel 4 KB reads issued through io_uring aren't going to be faster.
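
For concreteness, a sketch of the two equivalent fetches (assuming liburing; the file name, base offset and error handling are placeholders): one contiguous 128 KB pread versus the same bytes as 32 concurrent 4 KB reads.

    /* Sketch: same 128 KB of index data fetched two ways. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK      (128 * 1024)
    #define SUB_BLOCK  (4 * 1024)
    #define N_SUB      (BLOCK / SUB_BLOCK)   /* 32 */

    int main(void) {
        int fd = open("index.dat", O_RDONLY);
        char *buf = malloc(BLOCK);
        off_t base = 0;                      /* start of the 128 KB index block */

        /* Option A: one contiguous read. */
        pread(fd, buf, BLOCK, base);

        /* Option B: the same bytes as 32 concurrent 4 KB reads. */
        struct io_uring ring;
        io_uring_queue_init(N_SUB, &ring, 0);
        for (int i = 0; i < N_SUB; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf + i * SUB_BLOCK, SUB_BLOCK,
                               base + i * (off_t)SUB_BLOCK);
        }
        io_uring_submit(&ring);
        /* Wait for all 32 completions; the point above is that this tends to
         * finish in roughly the same time as option A. */
        for (int i = 0; i < N_SUB; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
    }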

Also, an index with larger block sizes is not equivalent to a structure with smaller block sizes plus readahead. The index structure isn't the same: larger coherent blocks give you better precision in the indexing structure for the same total number of forward pointers. Since there's no need to index within each 128 KB block, the pointer resolution that would have gone to distinguishing between 4 KB blocks can instead help you rapidly find the next relevant 128 KB block.
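
To put rough numbers on the forward-pointer argument (the entry layout and pointer count here are hypothetical, just to show the arithmetic):

    /* Back-of-the-envelope sketch: the same pointer budget spans 32x more
     * data when each pointer addresses a 128 KB block instead of a 4 KB one. */
    #include <stdio.h>
    #include <stdint.h>

    struct fwd_ptr {            /* one entry: first key in a block + its offset */
        uint64_t first_key;
        uint64_t block_offset;
    };

    int main(void) {
        const uint64_t n_ptrs = 1 << 20;          /* same pointer budget either way */
        const uint64_t small  = 4   * 1024;       /* 4 KB blocks   */
        const uint64_t large  = 128 * 1024;       /* 128 KB blocks */

        printf("4 KB blocks:   %llu GiB indexed\n",
               (unsigned long long)(n_ptrs * small >> 30));    /* 4 GiB   */
        printf("128 KB blocks: %llu GiB indexed\n",
               (unsigned long long)(n_ptrs * large >> 30));    /* 128 GiB */
        printf("memory for pointers: %llu MiB either way\n",
               (unsigned long long)(n_ptrs * sizeof(struct fwd_ptr) >> 20));
        return 0;
    }

With larger blocks, every pointer in the in-RAM structure narrows the search by 32x more data, which is where the "better precision for the same number of forward pointers" comes from.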