Veserv 10 hours ago

The solution is that buffered accesses should almost always flush after a threshold number of bytes or after a period of time if there is at least one byte, “threshold or timeout”. This is pretty common in hardware interfaces to solve similar problems.

In this case, the library that buffers in userspace should set appropriate timers when it first buffers the data. Good choices of timeout parameter are: passed in as an argument; slightly below human scale (e.g. 1-100 ms); proportional to {threshold / bandwidth} (i.e. some multiple of the time it would take to reach the threshold at a given access rate); or proportional to a target flushing overhead (e.g. spend no more than 0.1% of time in syscalls).
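
A minimal sketch of the "threshold or timeout" policy in C (the 64 KiB threshold and 10 ms timeout are illustrative; error handling elided). The deadline is armed when the buffer goes non-empty, and flush_if_due() is meant to be driven by a timer or event loop as discussed downthread:

```c
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Illustrative parameters: flush at 64 KiB, or 10 ms after the first byte. */
#define THRESHOLD  (64 * 1024)
#define TIMEOUT_MS 10

static char   buf[THRESHOLD];
static size_t used;
static struct timespec deadline;   /* set when the buffer goes non-empty */

static void flush_buf(int fd) {
    for (size_t off = 0; off < used; ) {
        ssize_t n = write(fd, buf + off, used - off);
        if (n <= 0) { used = 0; return; }   /* error handling elided */
        off += (size_t)n;
    }
    used = 0;
}

static int past_deadline(void) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec > deadline.tv_sec ||
          (now.tv_sec == deadline.tv_sec && now.tv_nsec >= deadline.tv_nsec);
}

/* Call this from a timer callback or the top of an event loop. */
void flush_if_due(int fd) {
    if (used > 0 && past_deadline())
        flush_buf(fd);
}

void buffered_write(int fd, const char *p, size_t len) {
    if (used == 0) {                          /* arm timeout on empty -> non-empty */
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_nsec += TIMEOUT_MS * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }
    }
    while (len > 0) {
        size_t chunk = len < THRESHOLD - used ? len : THRESHOLD - used;
        memcpy(buf + used, p, chunk);
        used += chunk; p += chunk; len -= chunk;
        if (used == THRESHOLD)                /* threshold flush; a stale deadline
                                                 only causes an early flush, which
                                                 is safe */
            flush_buf(fd);
    }
    flush_if_due(fd);                         /* timeout flush */
}
```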

Also note this applies to both writes and reads. If you do batched/coalesced reads, you likely want to do something similar, though this is usually more dependent on your data channel: you need some way to query or be notified of "pending data" efficiently, which your channel may not provide if it was not designed for this use case. Again, it is pretty common in hardware to do interrupt coalescing and the like.

toast0 9 hours ago | parent | next [-]

I think this is the right approach, but any libc setting automatic timers would lead to a lot of tricky problems because it would change expectations.

I/O errors could occur at any point, instead of only when you write. Syscalls everywhere could be interrupted by a timer, instead of only where the program set timers or when a signal arrives. There's also a reasonable chance of confusion when the application and libc both set a timer, depending on how the timer is set (although maybe this isn't relevant anymore... kernel timer APIs look better than I remember). If the application specifically blocks signals for critical sections, that impacts the I/O timers, etc.

There's a need to be more careful in accessing I/O structures because of when and how signals get handled.

Veserv 9 hours ago | parent | next [-]

You will generally only stall indefinitely if you are waiting for new data. So, you will actually handle almost every use case if your blocking read/wait also respects the timeout and does the flush on your behalf. Basically, do it synchronously at the top of your event loop and you will handle almost every case.
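
A rough sketch of that pattern (flush_if_due and next_flush_deadline_ms are hypothetical hooks a buffering layer would expose): cap the blocking wait by the flush deadline, then flush at the top of the loop.

```c
#include <poll.h>

/* Hypothetical helpers from the buffering layer:
 *   next_flush_deadline_ms()  -> ms until buffered output must be flushed,
 *                                or -1 if the buffer is empty
 *   flush_if_due()            -> flushes if that deadline has passed
 */
int  next_flush_deadline_ms(void);
void flush_if_due(void);

void event_loop(struct pollfd *fds, nfds_t nfds) {
    for (;;) {
        /* Don't sleep past the flush deadline. */
        int timeout = next_flush_deadline_ms();   /* -1 = block indefinitely */
        int n = poll(fds, nfds, timeout);

        flush_if_due();     /* top of the loop: bounds the flush latency */

        if (n > 0) {
            /* ... handle readable/writable fds, which may buffer more output ... */
        }
    }
}
```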

You could also relax the guarantee and set a timeout that is only checked during your next write. This still allows unbounded latency, but as long as you do one more write it will flush.

If neither of these works, then your program issues a write and then enters an unbounded or unreasonably long loop/computation. At that point you can manually flush what is likely the last write your program is ever going to make, which is trivial overhead: a single flush compared to a ridiculously long computation. That, or you probably have bigger problems.

toast0 7 hours ago | parent [-]

Yeah, these are all fine to do, but a libc can really only do the middle one. And then, at some cost.

If you're already using an event loop library, I think it's reasonable for that to manage flushing outputs while waiting for reads, but I don't think any of the utilities in this example do; maybe tcpdump does, but I don't know why grep would.

Veserv an hour ago | parent [-]

Sure, but the article is talking about grep, not write() or libc implementations.

grep buffers writes with no flush timeout, resulting in the problem described in the article.

grep should probably not suffer from the problem and can use a write primitive/library/design that avoids such problems with relatively minimal extra complexity and dependencies while retaining the performance advantages of userspace buffering.

Most programs that are minimizing dependencies and so cannot pull in a large framework (like grep or other simple utilities) would benefit from using such modestly more complex primitives instead of bare buffered writes/reads. Such primitives are relatively easy to use and understand, being largely a drop-in replacement in most common use cases, and they resolve most of the remaining problems with buffered accesses.

Essentially, this sort of primitive should be your default and you should only reach for lower level primitives in your application if you have a good reason for it and understand the problems the layers were designed to solve.

nine_k 9 hours ago | parent | prev [-]

I don't follow. Using a pipe sets an expectation of some amount of asynchronicity, because we only control one end of the pipe. I don't see a dramatic difference between an error occurring because the process on the other end is having trouble and one occurring because a timeout handler is trying to push the bytes.

On the reading end, the error may occur at the attempt to read the pipe.

On the writing end, the error may be signaled at the next attempt to write to or close the pipe.

In either case, a SIGPIPE can be sent asynchronously.

What scenario am I missing?

toast0 7 hours ago | parent [-]

> In either case, a SIGPIPE can be sent asynchronously.

My expectation (and I think this is an accurate expectation) is that a) read does not cause a SIGPIPE; read on a widowed pipe returns a zero-count read as an indication of EOF. b) write on a widowed pipe raises SIGPIPE before the write returns. c) a write to a pipe that is still valid will not raise SIGPIPE if the pipe is later widowed without being read from.
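
A tiny demo of (a) and (b), with SIGPIPE ignored in the second half so the write error shows up as EPIPE instead of a signal (illustrative, not exhaustive):

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    char c;

    /* (a) read on a widowed pipe: EOF (0), not SIGPIPE */
    if (pipe(fds) != 0) return 1;
    close(fds[1]);                       /* widow the pipe */
    ssize_t n = read(fds[0], &c, 1);
    printf("read returned %zd (0 = EOF)\n", n);
    close(fds[0]);

    /* (b) write on a widowed pipe: SIGPIPE before write returns,
       or EPIPE if SIGPIPE is ignored (ignored here so we can print it) */
    if (pipe(fds) != 0) return 1;
    close(fds[0]);                       /* widow the pipe */
    signal(SIGPIPE, SIG_IGN);
    n = write(fds[1], "x", 1);
    printf("write returned %zd, EPIPE: %d\n", n, errno == EPIPE);
    return 0;
}
```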

Yes, you could get a SIGPIPE from anywhere at any time, but unless someone is having fun on your system with random kills, you won't actually get one except immediately after a write to a pipe. With a timer-based asynchronous write, this changes to potentially happening at any time.

This could be fine if it was well documented and expected, but it would be a mess to add it into the libcs at this point. Probably a mess to add it to basic output buffering in most languages.

asveikau 9 hours ago | parent | prev | next [-]

I think doing those timeouts transparently would be tricky under the constraints of POSIX and ISO C. It would need some cooperation from the application layer.

jart 8 hours ago | parent [-]

The only way you'd be able to do it is by having functions like fputc() call clock_gettime(CLOCK_MONOTONIC_COARSE), which imposes ~3ns of overhead on platforms like x86-64 Linux that have a vDSO implementation. So it can be sort of practical, although it'd probably be smarter to just use line-buffered or unbuffered stdio. In practice even unbuffered I/O isn't that bad; it's the default for stderr. And it actually is buffered in practice, since even in unbuffered mode, functions like printf() still buffer internally. You just get the assurance that whatever it prints will be flushed by the end of the call.
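
A sketch of what that check could look like (my_putc and the 10 ms deadline are made-up names/values for illustration; as the reply below notes, this only bounds latency while the program keeps calling into stdio):

```c
#include <stdio.h>
#include <time.h>

#define FLUSH_AFTER_NS (10 * 1000 * 1000)   /* 10 ms, illustrative */

static struct timespec last_flush;

int my_putc(int c, FILE *f) {
    int r = fputc(c, f);

    /* CLOCK_MONOTONIC_COARSE is a cheap vDSO read on Linux;
       fall back to CLOCK_MONOTONIC elsewhere. */
    struct timespec now;
#ifdef CLOCK_MONOTONIC_COARSE
    clock_gettime(CLOCK_MONOTONIC_COARSE, &now);
#else
    clock_gettime(CLOCK_MONOTONIC, &now);
#endif
    long long elapsed = (now.tv_sec - last_flush.tv_sec) * 1000000000LL
                      + (now.tv_nsec - last_flush.tv_nsec);
    if (elapsed >= FLUSH_AFTER_NS) {
        fflush(f);
        last_flush = now;
    }
    return r;
}
```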

asveikau 8 hours ago | parent [-]

That's just for checking the clock. You'd also need to have a way of getting called back when the timeout expires, after fputc et al are long gone from the stack and your program is busy somewhere else, or maybe blocked.

Timeouts are usually done with signals (a safety nightmare, so no thanks) or an event loop. Hence my thought that you can't do it really transparently while keeping current interfaces.

jart 7 hours ago | parent [-]

Signals aren't a nightmare; it's just that fflush() isn't defined by POSIX as being async-signal-safe. You could change all your stdio functions to block signals while running, but then you'd be adding something like two system calls to every fputc() call. The smart thing to do would probably be creating a thread with a for (;;) { usleep(10000); fflush(stdout); } loop.

asveikau 7 hours ago | parent [-]

Signals are indeed a nightmare. Your example of adding tons of syscalls to make up for lack of safety shows that you understand that to be true.

And no, creating threads to solve this fringe problem in a spin loop with a sleep is not what I'd call "smart". It's unnecessary complexity and in most cases, totally wasted work.

jart 5 hours ago | parent [-]

The smartest thing to do is still probably not buffering. What's wrong with the thread? It would take maybe 15 lines of code to implement. It would be correct, without rarely-occurring bugs. It doesn't need signals or timers. It wouldn't add overhead to stdio calls. It's a generalized abstraction. You won't need to change your program's event loop code. Create the thread with a tiny 64kb stack and what's not to like? Granted, it would rub me the wrong way if libc did this by default, since I wouldn't want mystery threads appearing in htop for my hello world programs. But for an app developer, this is a sure-fire solution.
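
For what it's worth, a sketch of that thread along the lines described above (the 10 ms interval and 64 KiB stack are illustrative; error handling mostly elided):

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Periodically flush stdout from a small background thread.
   stdio locks its streams internally, so calling fflush() from another
   thread is safe. */
static void *flusher(void *arg) {
    (void)arg;
    for (;;) {
        usleep(10000);        /* 10 ms */
        fflush(stdout);
    }
    return NULL;
}

int start_stdout_flusher(void) {
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);              /* tiny stack */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&tid, &attr, flusher, NULL);
    pthread_attr_destroy(&attr);
    return rc;    /* 0 on success */
}
```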

YZF 2 hours ago | parent | prev | next [-]

I would tend to disagree. The buffering here is doing what it's supposed to do. The mix of something that's supposed to be interactive with a contract that's not meant to be interactive is the source of the problem (tail following into a pipe). There's no real problem to solve here. The "hardware" analog is a tank of water that accumulates rainwater, and you only move it somewhere when it fills up. I'm not sure what examples you have in mind, but time-based flushes aren't common in hardware AFAIK.

The proposed fix makes the contract a lot more complicated.

dullcrisp 2 hours ago | parent | next [-]

How is there no problem to solve when the post clearly identifies a problem, as well as why the behavior is confusing to a user?

“The system is working as it was designed,” is always true but unhelpful.

YZF 7 minutes ago | parent [-]

I get that. But it's like complaining that your pliers aren't good at unscrewing a bolt. I'm willing to accept the UX isn't great, but `tail -f | grep` is like `vim | grep` or `rogue | grep` (I'm exaggerating to try and make a point). `tail -f` is also, if I'm not mistaken, a much more recent development than the origins of Unix command-line pipes.

So sure, it would maybe be a better UX to be able to combine things and have them work, but there is a fundamental tension between building something that's optimized for moving chunks of data and building something that's interactive. And trying to force one into the other, in my humble opinion, is not the solution.

Dylan16807 2 hours ago | parent | prev [-]

What are you saying is "not meant to be interactive"? That's not true of pipes in general, or of grep in general.

Or, even if it is true of pipes, then we need an alternate version of a pipe that signals not to buffer, and can be used in all the same places.

It's a real problem either way, it just changes the nature of the problem.

> The proposed fix makes the contract a lot more complicated.

How so? Considering programs already have to deal with terminals, I'm skeptical a way to force one aspect of them would be a big deal.

YZF 5 minutes ago | parent [-]

Sure. An alternative for combining interactive terminal applications might be interesting. But I think there is tension between the Unix mechanisms and interactive applications that's not easy to resolve. What's `less | grep` or `vim | grep`... do we need to send input back through the pipe now?

It's one of those things you get used to when you've used Unix-like systems long enough. Yes, it's better if things just work the way someone who is not a power user expects them to, but that's not always possible, and I'd say it's not worth trying to always meet that goal, especially if it leads to more complexity.

vlovich123 9 hours ago | parent | prev | next [-]

Typical Linux alarms are based on signals and are very difficult to manage, and rescheduling them may have a performance impact since it requires thunking into the kernel. If you use io_uring with userspace timers things can scale much better, but it still requires tricks if you want to support a lot of fast small writes (e.g. above ~1 million writes per second, timer management starts to show up more and more, and you have to do some crazy tricks I figured out to get up to 100M writes per second).

Veserv 9 hours ago | parent [-]

You do not schedule a timeout on each buffered write. You only schedule one timeout on the transition from empty to non-empty that is retired either when the timeout occurs or when you threshold flush (you may choose to not clear on threshold flush if timeout management is expensive). So, you program at most one timeout per timeout duration/threshold flush.

The point is to guarantee data gets flushed promptly which only fails when not enough data gets buffered. The timeout is a fallback to bound the flush latency.
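
A sketch of that arming policy using a Linux timerfd (names and parameters are illustrative; the fd would be watched by whatever loop performs the timeout flush):

```c
#include <string.h>
#include <sys/timerfd.h>

/* One timer for the whole buffer: armed only on the empty -> non-empty
   transition, cleared (or simply left to fire harmlessly) on a threshold
   flush. At most one timer operation per timeout/threshold cycle. */

static int    timer_fd = -1;   /* poll this; readable means "timeout flush" */
static size_t buffered;

int init_timeout(void) {
    timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    return timer_fd;
}

static void arm_timeout(long timeout_ms) {
    struct itimerspec its;
    memset(&its, 0, sizeof its);                 /* one-shot, no interval */
    its.it_value.tv_sec  = timeout_ms / 1000;
    its.it_value.tv_nsec = (timeout_ms % 1000) * 1000000L;
    timerfd_settime(timer_fd, 0, &its, NULL);
}

static void disarm_timeout(void) {
    struct itimerspec its;
    memset(&its, 0, sizeof its);                 /* all-zero = disarm */
    timerfd_settime(timer_fd, 0, &its, NULL);
}

void on_buffer_append(size_t nbytes, size_t threshold, long timeout_ms) {
    if (buffered == 0)
        arm_timeout(timeout_ms);     /* only on the empty -> non-empty transition */
    buffered += nbytes;
    if (buffered >= threshold) {
        /* flush_buffer(); */
        buffered = 0;
        disarm_timeout();            /* optional: skip if timer ops are expensive */
    }
}
```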

vlovich123 9 hours ago | parent [-]

Yes that can work but as I said that has trade offs.

If you flush before the buffer is full, you're sacrificing throughput. Additionally, the timer firing causes further performance degradation, especially if you're in libc land and only have SIGALRM available.

So when an additional write is added, you want to push out the timer. But arming the timer requires, among other things, reading the current time, and at rates of 10-20 MHz and up reading the current wall clock gets expensive. Even rdtsc approaches start to struggle at 20-40 MHz. You obviously don't want to do it on every write, but you want to make sure that you never actually trigger the timer if you're producing data at a fast enough clip to otherwise fill the buffer within a reasonable time.

Source: I implemented write coalescing in my NoSQL database, which can operate at a few gigahertz for 8-byte writes into an in-memory buffer. Once the buffer is full or a timeout occurs, a flush to disk is triggered, and I net out at around 100M writes/s (sorting the data for the LSM is one of the main bottlenecks). By comparison, DBs like RocksDB can do ~2M writes/s and SQLite ~800k.

Veserv 8 hours ago | parent [-]

You are not meaningfully sacrificing throughput because the timeout only occurs when you are not writing enough data; you have no throughput to sacrifice. The threshold and timeout should be chosen such that high throughput cases hit the threshold, not the timeout. The timeout exists to bound the worst-case latency of low access throughput.

You only lose throughput in proportion to the handling cost of a single potentially spurious timeout/timeout clear per timeout duration. You should then tune your buffering and threshold to cap that at an acceptable overhead.
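
For example (illustrative numbers): with a 10 ms timeout and a flush that costs on the order of 10 µs of syscall time, a spurious timeout adds at most about 10 µs / 10 ms = 0.1% overhead, in line with the "no more than 0.1% of time in syscalls" target mentioned upthread.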

You should only really have a problem if you want both high throughput and low latency, at which point general solutions are probably not fit for your use case, but you should remain aware of the general principle.

vlovich123 8 hours ago | parent [-]

> You should only really have a problem if you want both high throughput and low latency at which point general solutions are probably not not fit for your use case, but you should remain aware of the general principle.

Yes you’ve accurately summarized the end goal. Generally people want high throughput AND low latency, not to just cap the maximum latency.

The one-shot timer approach only solves a livelock risk. I'll also note that your throughput does actually drop at the same time as the latency spike, because your buffer stays the same size but you took longer to flush to disk.

Tuning correctly turns out to be really difficult to accomplish in practice which is why you really want self healing/self adapting systems that behave consistently across all hardware and environments.

worksonmymach 2 hours ago | parent | prev [-]

I prefer a predictable footgun. Your idea is good but it would need to be another flag, so you have to know it exists. Not knowing the semantics is the issue, rather than the semantics themselves.