jiggawatts 4 days ago:
That will probably never happen because of the fundamental nature of blob storage. Individual objects are split into multiple blocks, each of which can be stored independently on a different underlying server. Each server can see its own block, but not any of the others. Calculating a hash like SHA-256 therefore requires a sequential scan through all of the blocks. This could be done with a minimum of network traffic if, instead of streaming the bytes to a central server to hash, the hash state were forwarded from block server to block server in sequence. Even so, it would be a very slow serial operation, and fairly chatty too if there are many tiny blocks. What could work is a Merkle tree hash construction where some of the subdivision boundaries match the block boundaries.
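A minimal sketch of that construction, assuming block boundaries line up with the tree's leaf boundaries (all names here are illustrative, not any storage system's actual API):

```python
import hashlib

def leaf_hash(block: bytes) -> bytes:
    # Domain-separate leaves from interior nodes (0x00 vs. 0x01 prefix).
    return hashlib.sha256(b"\x00" + block).digest()

def merkle_root(leaf_digests: list[bytes]) -> bytes:
    # Combine per-block digests pairwise until one root remains. Only
    # these 32-byte digests cross the network; the block bytes stay put.
    nodes = list(leaf_digests)
    while len(nodes) > 1:
        if len(nodes) % 2 == 1:
            nodes.append(nodes[-1])  # odd count: duplicate the last node
        nodes = [
            hashlib.sha256(b"\x01" + nodes[i] + nodes[i + 1]).digest()
            for i in range(0, len(nodes), 2)
        ]
    return nodes[0]

# Each block server hashes only the block it holds, in parallel:
blocks = [b"block-0" * 512, b"block-1" * 512, b"block-2" * 512]
print(merkle_root([leaf_hash(b) for b in blocks]).hex())
```

The per-block hashing parallelizes across servers, and only fixed-size digests need to move, which avoids both the serial scan and the chattiness.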
texthompson 4 days ago:
Why would you PUT an object, then download it again to a central server in the first place? If a service is accepting an upload, it is already doing a pass over all the bytes anyway. It doesn't seem like a ton of overhead to calculate SHA-256 over 4096-byte chunks as the upload progresses. I suspect that sort of calculation would happen anyway.
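A hedged sketch of that idea: fold each chunk into a running SHA-256 as it arrives, so the digest is ready the moment the last byte lands (receive_upload and store_chunk are made-up names standing in for the service's ingest path):

```python
import hashlib
import io

CHUNK_SIZE = 4096  # 4 KiB chunks, as in the comment above

def receive_upload(stream, store_chunk) -> str:
    # Single pass: every chunk is handed to storage and folded into the
    # running hash state at the same time; no second read is needed.
    hasher = hashlib.sha256()
    while chunk := stream.read(CHUNK_SIZE):
        store_chunk(chunk)
        hasher.update(chunk)
    return hasher.hexdigest()

# Usage with an in-memory stream standing in for the network socket:
digest = receive_upload(io.BytesIO(b"payload" * 10_000), store_chunk=lambda c: None)
print(digest)
```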
| ||||||||||||||||||||||||||||||||||||||
flakes 4 days ago:
You have just re-invented IPFS! https://en.m.wikipedia.org/wiki/InterPlanetary_File_System | ||||||||||||||||||||||||||||||||||||||
losteric 4 days ago:
Why does the architecture of blob storage matter? The hash can be calculated as the data streams in on the first write, before it gets dispersed into multiple physically stored blocks.
| ||||||||||||||||||||||||||||||||||||||
Salgat 4 days ago:
Isn't that the point of the metadata? Calculate the hash ahead of time and store it in the metadata as part of the blob's atomic commit (at least on S3).
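For illustration, here is roughly what that flow looks like with boto3, assuming a recent SDK with S3's additional-checksums support (bucket and key are placeholders):

```python
import base64
import hashlib
import boto3

# Compute the digest up front, then let S3 verify it atomically on PUT.
body = b"example object contents"
digest_b64 = base64.b64encode(hashlib.sha256(body).digest()).decode()

s3 = boto3.client("s3")
s3.put_object(
    Bucket="example-bucket",
    Key="example-key",
    Body=body,
    ChecksumSHA256=digest_b64,  # S3 rejects the PUT if the body doesn't match
)

# The stored checksum can later be read back without downloading the body:
head = s3.head_object(Bucket="example-bucket", Key="example-key", ChecksumMode="ENABLED")
print(head.get("ChecksumSHA256"))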