| ▲ | woodruffw 7 hours ago |
| I think the framing in the post is that it's specific to Rust, relative to what Python packaging tools are otherwise written in (Python). It's not very easy to do zero-copy deserialization in pure Python, from experience. (But also, I think Rust can fairly claim that it's made zero-copy deserialization a lot easier and safer.) |
|
| ▲ | zahlman 5 hours ago | parent | next [-] |
| I can't even imagine what "safety" issue you have in mind. Given that "zero-copy" apparently means "in-memory" (a deserialized version of the data necessarily cannot be the same object as the original data), that's not even difficult to do with the Python standard library. For example, `zipfile.ZipFile` has a convenience method to write to file, but writing to in-memory data is as easy as:
    import io
    import zipfile
    def read_member(archive_name, file_name):
        # read one archive member into an in-memory buffer
        with zipfile.ZipFile(archive_name) as a:
            with a.open(file_name) as f, io.BytesIO() as b:
                b.write(f.read())
                return b.getvalue()
(That does, of course, copy data around within memory, but.) |
| |
| ▲ | woodruffw 5 hours ago | parent | next [-] |
| > Given that "zero-copy" apparently means "in-memory" (a deserialized version of the data necessarily cannot be the same object as the original data), that's not even difficult to do with the Python standard library
This is not what zero-copy means. Here's a working definition[1]. Specifically, it's not just about keeping things in memory; copying in memory is normal. The goal is to not make copies (or more precisely, what Rust would call "clones"), but to instead convey the original representation/views of that representation through the program's lifecycle where feasible.
> a deserialized version of the data necessarily cannot be the same object as the original data
rust-asn1 would be an example of a Rust library that doesn't make any copies of data unless you explicitly ask it to. When you load e.g. a Utf8String[2] in rust-asn1, you get a view into the original input buffer, not an intermediate owning object created from that buffer.
> (That does, of course, copy data around within memory, but.)
Yes, that's what makes it not zero-copy.
[1]: https://rkyv.org/zero-copy-deserialization.html
[2]: https://docs.rs/asn1/latest/asn1/struct.Utf8String.html |
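To make "a view into the original input buffer" concrete, here is a minimal Python sketch. The tag/length/value layout and the name `parse_records` are invented for illustration; this is not ASN.1 DER and not how rust-asn1 is implemented. It only shows a parser handing out `memoryview` slices instead of new owning `bytes` objects.

```python
import struct

def parse_records(buf: memoryview):
    """Yield (tag, value) pairs from a hypothetical tag/length/value layout.

    Each record is 1 tag byte, a 2-byte big-endian length, then the value.
    This is an invented toy format, not ASN.1 DER.
    """
    offset = 0
    while offset < len(buf):
        tag, length = struct.unpack_from(">BH", buf, offset)
        offset += 3
        # Slicing a memoryview yields another view, not a copy of the bytes.
        yield tag, buf[offset:offset + length]
        offset += length

data = bytearray(b"\x01\x00\x03abc\x02\x00\x02hi")
records = list(parse_records(memoryview(data)))

# The parsed value is a window onto `data`, so it observes later changes.
data[3] = ord("X")
first_value = bytes(records[0][1])  # an explicit copy, made only on request
```

The only copy happens at the explicit `bytes(...)` call, which roughly mirrors the "no copies unless you ask" behavior described above.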
| ▲ | zahlman 4 hours ago | parent [-] |
| > Yes, that's what makes it not zero-copy.
Yeah, so you'd have to pass around the `BytesIO` instead. I know that zero-copy doesn't ordinarily mean what I described, but that seemed to be how TFA was using it, based on the logic in the rest of the sentence. |
| ▲ | woodruffw 4 hours ago | parent [-] |
| > Yeah, so you'd have to pass around the `BytesIO` instead.
That wouldn’t be zero-copy either: BytesIO is an I/O abstraction over a buffer, so it intentionally masks the “lifetime” of the original buffer. In effect, reading from the BytesIO creates new copies of the underlying data by design, in new `bytes` objects.
(This is actually a great capsule example of why zero-copy design is difficult in Python: the Pythonic thing to do is to make lots of bytes/string/rich objects as you parse, each of which owns its data, which in turn means copies everywhere.) |
| ▲ | zahlman 4 hours ago | parent [-] | | Fair. (You can `.getbuffer` but you still have to keep the underlying BytesIO object "open" somehow.) I'm not convinced this is going to bottleneck things, though. (On the flip side, I guess the OS is likely to cache any disk write in memory anyway.) |
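The difference between reading from a `BytesIO` and using `.getbuffer()` can be sketched with the standard library alone. The sample data is made up; the calls (`read`, `getbuffer`, `getvalue`, `memoryview.release`) are the documented `io` and `memoryview` APIs.

```python
import io

b = io.BytesIO(b"abcdef")

# read() hands back a new bytes object that owns its own copy of the data.
b.seek(0)
chunk = b.read(3)

# getbuffer() exposes the underlying buffer as a memoryview without copying.
view = b.getbuffer()
view[0:3] = b"XYZ"            # writes through to the BytesIO's storage

after_write = b.getvalue()    # reflects the write made through the view

# While the view is alive the BytesIO cannot be resized or closed,
# so release it before closing.
view.release()
b.close()
```

`chunk` still holds `b"abc"` after the write because it was a separate copy, which is the behavior that makes plain `read()` calls not zero-copy.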
|
|
| |
| ▲ | SpaceNugget 3 hours ago | parent | prev [-] |
| As a quick and somewhat oversimplified example of what zero-copy means, imagine you read the following JSON string from a file/the network/whatever:
    json = '{"user":"nugget"}' // from somewhere
A simple way to extract json["user"] to a new variable would be to copy the bytes. In Python-ish/C pseudocode:
    let user = allocate_string(6 characters)
    for i in range(0, 6)
        user[i] = json["user"][i]
    // user is now the string "nugget"
Instead, a zero-copy strategy would be to create a string pointer to the address of json offset by 9, with a length of 6:
    {"user":"nugget"}
             ^    ]end
The reason this can be tricky in C is that when you call free(json), since user is a pointer into the same string that was json, you have effectively done free(user) as well. So if you use user after calling free(json), you have written a classic _memory safety_ bug called a "use after free" or UAF. Search around a bit for the insane number of use-after-free bugs there have been in popular software and the havoc they have wreaked. In Rust, when you create a variable referencing the memory of another (user pointing into json), the compiler keeps track of that (as a "borrow"; that's what the borrow checker does, if you have read about that) and won't compile if json is freed while you still have access to user. That's the main memory safety issue involved with zero-copy deserialization techniques. |
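The same offset-9, length-6 idea can be tried in Python with `memoryview`. This is only a sketch of the concept: it assumes CPython's buffer protocol behavior, where the checks happen at runtime rather than at compile time as with Rust's borrow checker, and the variable names are illustrative.

```python
json_bytes = bytearray(b'{"user":"nugget"}')

# Copying strategy: build a new, independent bytes object.
user_copy = bytes(json_bytes[9:15])

# Zero-copy strategy: a view at offset 9 with length 6, sharing json_bytes' memory.
user_view = memoryview(json_bytes)[9:15]

# Resizing the buffer while a view is exported is refused at runtime...
try:
    json_bytes.extend(b"!")
    resize_blocked = False
except BufferError:
    resize_blocked = True

# ...and dropping our own name for the buffer does not free it,
# because the view holds a reference that keeps it alive.
del json_bytes
still_readable = bytes(user_view)
```

So the C-style use-after-free cannot happen here, but the price is reference counting and runtime errors instead of a guarantee checked before the program runs.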
|
|
| ▲ | stefan_ 6 hours ago | parent | prev [-] |
| I suppose it can fairly claim that now every other library and blog post invokes "zero-copy" this and that, even in the most nonsensical scenarios. It's a technique for when you literally cannot afford the memory bandwidth, because you are trying to saturate a 100Gbps NIC or handle 8K 60Hz video, not for compromising your data serialization scheme's portability for marketing purposes while all applications hit the network first, disk second and memory bandwidth never. |
| |
| ▲ | vlovich123 5 hours ago | parent | next [-] | | You’ve got this backward. Due to spatial and temporal locality, in practice the vast majority of the time any application is actually hitting CPU registers first, cache second, memory third, disk fourth, network cache fifth, and network origin sixth. So this stuff does actually matter for performance. Also, aside from memory bandwidth, there’s a latency cost inherent in traversing object graphs: zero-copy techniques ensure you traverse that graph minimally, touching only what actually needs to be accessed, which is huge when you scale up. There’s a difference between making one network request to fetch 1 MB and making 100 requests to fetch 10 KiB each, and this difference also appears in memory access patterns unless they’re absorbed by your cache (not guaranteed for the object graph traversal that a package manager would be doing). | |
| ▲ | woodruffw 6 hours ago | parent | prev | next [-] | | Many of the hot paths in uv involve an entirely locally cached set of distributions that need to be loaded into memory, very lightly touched/filtered, and then sunk to disk somewhere else. In those contexts, there are measurable benefits to not transforming your representation. (I'm agnostic on whether zero-copy "matters" in every single context. If there's no complexity cost, which is what Rust's abstractions often provide, then it doesn't really hurt.) | |
| ▲ | zahlman 5 hours ago | parent | prev [-] | | The point is that the packaging tool can analyze files from within the archives it downloads, without writing them to disk. |
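Analyzing a file inside a downloaded archive without writing it to disk might look like the following. It is a sketch using only `io` and `zipfile`; the helper name `read_member`, the member path, and the metadata contents are invented for the example and are not taken from any packaging tool.

```python
import io
import zipfile

# Build a small archive in memory to stand in for a downloaded wheel.
# The member name and contents are invented for this example.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("pkg-1.0.dist-info/METADATA", "Name: pkg\nVersion: 1.0\n")

def read_member(archive_bytes: bytes, member: str) -> bytes:
    """Return one member's decompressed contents without writing it to disk."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(member)

metadata = read_member(archive.getvalue(), "pkg-1.0.dist-info/METADATA")
name_line = metadata.decode().splitlines()[0]
```

Note that this is in-memory but not zero-copy: `getvalue()`, `BytesIO(...)`, `zf.read()`, and `decode()` each produce a new object holding its own copy of the data.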
|