Remix.run Logo
jrockway 2 hours ago

sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.

Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/

It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.

EdSchouten 2 hours ago | parent | next [-]

It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reasons being that most modern ARM64 CPUs have native SHA-256 instructions, and lack an equivalent of AVX-512.

Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.

For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.

grumbelbart2 2 hours ago | parent | prev [-]

Is that even when using the SHA256 hardware extensions? https://en.wikipedia.org/wiki/SHA_instruction_set