| ▲ | Jepsen: NATS 2.12.1 (jepsen.io) |
| 187 points by aphyr 4 hours ago | 60 comments |
| |
|
| ▲ | stmw 2 hours ago | parent | next [-] |
| Every time someone builds one of these things and skips over "overcomplicated theory", aphyr destroys them. At this point, I wonder if we could train an AI to look over a project's documentation and predict whether it's likely to lose committed writes just based on the marketing / technical claims. We probably can. |
| |
| ▲ | awesome_dude 15 minutes ago | parent [-] | | /me strokes my long grey beard and nods. People always think "theory is overrated" or "hacking is better than having a school education", and then proceed to shoot themselves in the foot with "workarounds" that break well-known, well-documented, well-traversed problem spaces
|
|
| ▲ | johncolanduoni an hour ago | parent | prev | next [-] |
| Wow. I’ve used NATS for best-effort in-memory pub/sub, which it has been great for, including getting subtle scaling details right. I never touched their persistence and would have investigated more before I did, but I wouldn’t have expected it to be this bad. Vulnerability to simple single-bit file corruption is embarrassing. |
|
| ▲ | rishabhaiover 10 minutes ago | parent | prev | next [-] |
| NATS be trippin, no CAP. |
| |
|
| ▲ | vrnvu 3 hours ago | parent | prev | next [-] |
| Sort of related. Jepsen and Antithesis recently released a glossary of common terms which is a fantastic reference. https://jepsen.io/blog/2025-10-20-distsys-glossary |
|
| ▲ | merb 3 hours ago | parent | prev | next [-] |
| > 3.4 Lazy fsync by Default Why? Why do some databases do that? To have better performance in benchmarks? It's not like it's ok to do that unless you have a safer default or at least write a lot about it. But especially when you run stuff in a small cluster, you get bitten by things like that. |
| |
| ▲ | aaronbwebber 2 hours ago | parent | next [-] | | It's not just better performance on latency benchmarks; it likely improves throughput as well because the writes will be batched together. Many applications do not require true durability, and many likely benefit from lazy fsync. Whether it should be the default is a lot more questionable, though. | | |
| ▲ | johncolanduoni an hour ago | parent [-] | | It's like using a non-cryptographically secure RNG: if you don't know enough to check whether fsync is off yourself, it's unlikely you know enough to evaluate the impact of durability on your application. |
| |
| ▲ | mrkeen an hour ago | parent | prev | next [-] | | One of the perks of being distributed, I guess. The kind of failure that a system can tolerate with strict fsync but can't tolerate with lazy fsync (i.e. the software 'confirms' a write to its caller but then crashes) is probably not the kind of failure you'd expect to encounter on a majority of your nodes all at the same time. | | |
| ▲ | johncolanduoni 9 minutes ago | parent [-] | | It is if they’re in the same physical datacenter. Usually the way this is done is to wait for at least M replicas to fsync, but only require the data to be in memory for the rest. It smooths out the tail latencies, which are quite high for SSDs. |
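The "ack once M of N replicas have fsynced" idea is easy to see in miniature. A minimal Go sketch, with replicas simulated by goroutines and sleeps standing in for fsync latency (everything here is invented for illustration, not how NATS or any real replication protocol is implemented): the commit waits for the m fastest acknowledgements, so one slow disk no longer sits on the commit path.

    package main

    import (
        "fmt"
        "time"
    )

    // replicate sends data to n simulated replicas and returns once m of them
    // report that they are durable. The remaining replicas only need the data
    // in memory; their slower fsyncs finish off the commit path.
    func replicate(data []byte, n, m int) time.Duration {
        start := time.Now()
        acks := make(chan int, n)
        for i := 0; i < n; i++ {
            go func(id int) {
                // A real replica would append `data` to its log and call
                // File.Sync; the sleep stands in for that (tail) latency.
                time.Sleep(time.Duration(id+1) * 5 * time.Millisecond)
                acks <- id
            }(i)
        }
        for i := 0; i < m; i++ {
            <-acks // wait for the m fastest replicas only
        }
        return time.Since(start)
    }

    func main() {
        // Acknowledge after 2 of 3 replicas are durable: the slowest replica's
        // fsync latency no longer delays the acknowledgement.
        fmt.Println("acked after", replicate([]byte("payload"), 3, 2))
    }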
| |
| ▲ | millipede 2 hours ago | parent | prev | next [-] | | I always wondered why the fsync has to be lazy. It seems like the fsyncs can be bundled together, and the notification messages held for a few millis while the write completes. Similar to TCP corking. There doesn't need to be one fsync per consensus round. | | |
| ▲ | aphyr an hour ago | parent [-] | | Yes, good call! You can batch up multiple operations into a single call to fsync. You can also tune the number of milliseconds or bytes you're willing to buffer before calling `fsync` to balance latency and throughput. This is how databases like Postgres work by default--see the `commit_delay` option here: https://www.postgresql.org/docs/8.1/runtime-config-wal.html | | |
| ▲ | to11mtm 29 minutes ago | parent [-] | | > This is how databases like Postgres work by default--see the `commit_delay` option here: https://www.postgresql.org/docs/8.1/runtime-config-wal.html I must note that the default for Postgres is that there is NO delay, which is a sane default. > You can batch up multiple operations into a single call to fsync. I've done this in various messaging implementations for throughput, and it's actually fairly easy to do in most languages. Basically, set up 1-N writers (depending on how you are storing data) that take items containing the data to be written alongside a TaskCompletionSource (a Promise, in Java terms). When your code wants to write, it sends the item to that local queue; the worker(s) on the queue write out messages in batches (tuned for write size, number of records, etc., for both throughput and guaranteeing forward progress), and when the write completes you either complete or fail the TCS/Promise. If you've got the right 'glue' with your language/libraries it's not that hard; this example [0] from Akka.NET's SQL persistence layer shows how simple the actual write processor's logic can be. Yeah, you have to think about queueing a little bit, but I've found this basic pattern very adaptable (i.e. the queueing op can just send a bunch of ready-to-go bytes and work off that for the threshold instead, add framing if needed, etc.) [0] https://github.com/akkadotnet/Akka.Persistence.Sql/blob/7bab... | | |
| ▲ | aphyr 24 minutes ago | parent [-] | | Ah, pardon me, spoke too quickly! I remembered that it fsynced by default, and offered batching, and forgot that the batch size is 0 by default. My bad! | | |
| ▲ | to11mtm a minute ago | parent [-] | | Well the write is still tunable so you are still correct. Just wanted to clarify that the default is still at least safe in case people perusing this for things to worry about, well, were thinking about worrying. Love all of your work and writings, thank you for all you do! |
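What to11mtm describes above fits in a page of code. A minimal Go sketch of that group-commit pattern, under invented names (`writeReq`, `runWriter`) and a plain append-only file rather than anything NATS or Akka.Persistence.Sql actually do: callers enqueue a payload plus a completion channel (the TCS/Promise role), one writer drains whatever has accumulated, writes it, calls Sync once, and only then acknowledges the whole batch. Requests that arrive while a Sync is in flight simply become the next batch.

    package main

    import (
        "fmt"
        "os"
    )

    // writeReq pairs a payload with a completion channel (the TCS/Promise role).
    type writeReq struct {
        data []byte
        done chan error
    }

    // runWriter drains whatever is queued, writes it all, fsyncs once, and only
    // then acknowledges every caller in that batch. Requests that arrive while
    // Sync is blocking pile up in the channel and become the next batch.
    func runWriter(f *os.File, reqs chan writeReq) {
        for first := range reqs {
            batch := []writeReq{first}
        drain:
            for {
                select {
                case r := <-reqs:
                    batch = append(batch, r)
                default:
                    break drain
                }
            }
            var err error
            for _, r := range batch {
                if _, werr := f.Write(r.data); werr != nil && err == nil {
                    err = werr
                }
            }
            if err == nil {
                err = f.Sync() // one fsync covers the whole batch
            }
            for _, r := range batch {
                r.done <- err // ack (or fail) only after the fsync
            }
        }
    }

    func main() {
        f, err := os.CreateTemp("", "group-commit-*.log")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        reqs := make(chan writeReq, 1024)
        go runWriter(f, reqs)

        done := make(chan error, 1)
        reqs <- writeReq{data: []byte("hello\n"), done: done}
        if err := <-done; err != nil {
            panic(err)
        }
        fmt.Println("acknowledged after fsync")
    }

Latency/throughput tuning then reduces to how much you let accumulate before each Sync, which is the knob aphyr's commit_delay comparison points at.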
|
|
|
| |
| ▲ | thinkharderdev 3 hours ago | parent | prev | next [-] | | > To have better performance in benchmarks Yes, exactly. | |
| ▲ | dilyevsky 2 hours ago | parent | prev [-] | | Massively improves benchmark performance. Like 5-10x | | |
|
|
| ▲ | rdtsc 2 hours ago | parent | prev | next [-] |
| > By default, NATS only flushes data to disk every two minutes, but acknowledges operations immediately. This approach can lead to the loss of committed writes when several nodes experience a power failure, kernel crash, or hardware fault concurrently—or in rapid succession (#7564). I am getting strong early MongoDB vibes. "Look how fast it is, it's web-scale!". Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too. Coordinated failures shouldn't be a novelty or a surprise these days. I wouldn't trust a product that doesn't default to the safest options. It's fine to provide relaxed modes of consistency and durability, but just don't make them the default. Let the user configure those themselves. |
| |
| ▲ | KaiserPro 2 hours ago | parent | next [-] | | NATS is very upfront in that the only thing that is guaranteed is the cluster being up. I like that, and it allows me to build things around it. For us when we used it back in 2018, it performed well and was easy to administer. The multi-language APIs were also good. | | |
| ▲ | traceroute66 an hour ago | parent [-] | | > NATS is very upfront in that the only thing that is guaranteed is the cluster being up. Not so fast. Their docs make some pretty bold claims about JetStream.... They talk about JetStream addressing the "fragility" of other streaming technology. And "This functionality enables a different quality of service for your NATS messages, and enables fault-tolerant and high-availability configurations." And one of their big selling points for JetStream is the whole "store and replay" thing. Which implies the storage bit should be trustworthy, no? | | |
| ▲ | KaiserPro an hour ago | parent [-] | | Oh sorry, I was talking about NATS core, not JetStream. I'd be pretty sceptical about persistence. | | |
| ▲ | billywhizz an hour ago | parent [-] | | the OP was specifically about jetstream so i guess you just didn't read it? | | |
|
|
| |
| ▲ | lubesGordi an hour ago | parent | prev | next [-] | | I don't know about Jetstream, but redis cluster would only ack writes after replicating to a majority of nodes. I think there is some config on standalone redis too where you can ack after fsync (which apparently still doesn't guarantee anything because of buffering in the OS).
In any case, understanding what the ack implies is important, and I'd be frustrated if jetstream docs were not clear on that. | |
| ▲ | Thaxll an hour ago | parent | prev | next [-] | | I don't think there is a modern database that has the safest options all turned on by default. For instance, the default transaction model for PG is read committed, not serializable. One of the most used DBs in the world is Redis, and by default it fsyncs every second, not on every operation. | |
| ▲ | hobs an hour ago | parent [-] | | Pretty sure SQL Server won't acknowledge a write until it's in the WAL (you can go the opposite way and turn on delayed durability, though). |
| |
| ▲ | 0xbadcafebee 2 hours ago | parent | prev | next [-] | | Not flushing on every write is a very common tradeoff of speed over durability. Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted. You can often prevent this by enabling an option or tuning a parameter. > I wouldn't trust a product that doesn't default to safest options This would make most products suck, and require a crap-ton of manual fixes and tuning that most people would hate, if they even got the tuning right. You have to actually do some work yourself to make a system behave the way you require. For example, Postgres' isolation level is weak by default, leading to race conditions. You have to explicitly enable serialization to avoid it, which is a performance penalty. (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...) | | |
| ▲ | TheTaytay 33 minutes ago | parent | next [-] | | > Filesystems, databases, all kinds of systems do this. They have some hacks to prevent it from corrupting the entire dataset, but lost writes are accepted. Woah, those are _really_ strong claims. "Lost writes are accepted"? Assuming we are talking about "acknowledged writes", which the article is discussing, I don't think it's true that this is a common default for databases and filesystems. Perhaps databases or K/V stores that are marketed as in-memory caches might have defaults like this, but I'm not familiar with other systems that do. I'm also getting MongoDB vibes from deciding not to flush except once every two minutes. Even deciding to wait a second would be pretty long, but two minutes? A lot happens in a busy system in 120 seconds... | |
| ▲ | zbentley an hour ago | parent | prev | next [-] | | I think “most people will have to turn on the setting to make things fast at the expense of durability” is a dubious assertion (plenty of systems, even high-criticality ones, do not have a very high data rate and thus would not necessarily suffer unduly from e.g. fsync-every-write). Even if most users do turn out to want “fast_and_dangerous = true”, that’s not a particularly onerous burden to place on users: flip one setting, and hopefully learn from the setting name or the documentation consulted when learning about it that it poses operational risk. |
| ▲ | to11mtm an hour ago | parent | prev [-] | | In defense of PG, for better or worse, as far as I know the "what is the RDBMS default" question falls into two categories:
- Read Committed with MVCC (Oracle, Postgres, Firebird versions with MVCC; I -think- SQLite with WAL falls under this)
- Read Committed with write locks one way or another (MSSQL default, SQLite default, Firebird pre-MVCC, probably Sybase given MSSQL's lineage...)
I'm not aware of any RDBMS that treats 'serializable' as the default transaction level OOTB (I'd love to learn, though!).
All of that said, 'inconsistent read because you don't know your RDBMS and did not pay attention to the transaction model' has a very different blame direction than 'we YOLO fsync on a timer to improve throughput'. If anything, it scares me that there are no other tuning options involved, such as number of bytes or number of events. If I get a write-ack from a middleware I expect it to be written one way or another, not 'it is written within X seconds'. AFAIK there's no RDBMS that will just 'lose a write' unless the disk happens to be corrupted (or, IDK, maybe someone YOLOing with chaos mode on DB2?) | |
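To make the "explicitly enable serialization" point concrete, here is a minimal Go sketch with database/sql and the lib/pq driver (the connection string and the `orders` table are placeholders): the server keeps its read committed default, and only the transaction that asks for it runs serializable.

    package main

    import (
        "context"
        "database/sql"
        "log"

        _ "github.com/lib/pq" // Postgres driver; other drivers work the same way
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Opt in to SERIALIZABLE for this transaction only; the database default
        // (read committed) still applies to every transaction that doesn't ask.
        tx, err := db.BeginTx(context.Background(), &sql.TxOptions{
            Isolation: sql.LevelSerializable,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer tx.Rollback()

        if _, err := tx.Exec(`UPDATE orders SET status = 'paid' WHERE id = $1`, 42); err != nil {
            log.Fatal(err) // serialization failures can surface here or at Commit
        }
        if err := tx.Commit(); err != nil {
            log.Fatal(err) // callers are expected to retry on serialization failure
        }
    }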
| |
| ▲ | gopalv an hour ago | parent | prev | next [-] | | > Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too. The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake. The middle ground of multi-transaction group-commit fsync seems not to exist anymore because of the massive IOPS you can pull off on SSDs in general; now it is about syscall context switches. Two minutes is a bit too much (also fdatasync vs fsync). |
| ▲ | CuriouslyC 2 hours ago | parent | prev [-] | | NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here. If you wanted something fully durable with a stronger persistence story you'd probably use Kafka anyhow. | | |
| ▲ | nchmy 2 hours ago | parent | next [-] | | Core NATS is ephemeral. JetStream is meant to be persistent, and is presented as a replacement for Kafka. | |
| ▲ | traceroute66 an hour ago | parent | prev | next [-] | | > NATS data is ephemeral in many cases anyhow, so it makes a bit more sense here Dude ... the guy was testing JetStream. Which, I quote from the first sentence of the first paragraph on the NATS website: NATS has a built-in persistence engine called JetStream which enables messages to be stored and replayed at a later time.
| |
| ▲ | petre 2 hours ago | parent | prev [-] | | So is MQTT, why bother with NATS then? | | |
| ▲ | KaiserPro an hour ago | parent [-] | | MQTT doesn't have the same semantics. https://docs.nats.io/nats-concepts/core-nats/reqreply Request/reply is really useful if you need low latency but reasonably efficient queuing (making sure to mark your workers as busy when processing, otherwise you get latency spikes). | | |
| ▲ | RedShift1 an hour ago | parent [-] | | You can do request/reply with MQTT too, you just have to implement more bits yourself, whilst NATS has a nice API that abstracts that away for you. | | |
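A minimal sketch of what that looks like with the nats.go client for core NATS request/reply (server URL, subject, and payload are placeholders; this is core NATS, not JetStream): the library creates the ephemeral reply subject and correlates the response for you, which is the part you would otherwise hand-roll over MQTT.

    package main

    import (
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Drain()

        // Responder: any subscriber on the subject can answer; the client
        // handles the reply subject, so no manual topic plumbing is needed.
        _, err = nc.Subscribe("time.now", func(m *nats.Msg) {
            _ = m.Respond([]byte(time.Now().UTC().Format(time.RFC3339)))
        })
        if err != nil {
            panic(err)
        }

        // Requester: blocks until a reply arrives or the timeout fires.
        msg, err := nc.Request("time.now", nil, 2*time.Second)
        if err != nil {
            panic(err)
        }
        fmt.Println("reply:", string(msg.Data))
    }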
|
|
|
|
|
| ▲ | maxmcd 2 hours ago | parent | prev | next [-] |
| > > You can force an fsync after each messsage [sic] with always, this will slow down the throughput to a few hundred msg/s. Is the performance warning in the NATS docs possible to improve on? Couldn't you still run fsync on an interval and queue up a certain number of writes to be flushed at once? I could imagine latency suffering, but batch throughput could be preserved to some extent? |
| |
| ▲ | scottlamb 2 hours ago | parent [-] | | > Is the performance warning in the NATS docs possible to improve on? Couldn't you still run fsync on an interval and queue up a certain number of writes to be flushed at once? I could imagine latency suffering, but batch throughput could be preserved to some extent? Yes, and you shouldn't even need a fixed interval. Just queue up any writes while an `fsync` is pending; then do all those in the next batch. This is the same approach you'd use for rounds of Paxos, particularly between availability zones or regions where latency is expected to be high. You wouldn't say "oh, I'll ack and then put it in the next round of Paxos", or "I'll wait until the next round in 2 seconds then ack"; you'd start the next batch as soon as the current one is done. |
|
|
| ▲ | clemlesne 3 hours ago | parent | prev | next [-] |
| NATS is a fantastic piece of software, but the docs are impractical and half-baked. It's a shame to have to reverse-engineer the software from GitHub to figure out the auth schemes. |
| |
| ▲ | belter 3 hours ago | parent [-] | | > NATS is a fantastic piece of software.
- ACKed messages can be silently lost due to minority-node corruption.
- Single-bit corruption can erase up to 78% of stored messages on some replicas.
- Snapshot corruption may trigger full-stream deletion across the cluster.
- Default lazy fsync wipes minutes of acknowledged writes on crash.
- Crash + delay can produce persistent split-brain and divergent logs.
Are you the Mother? Because only a Mother could love such an ugly baby.... | |
| ▲ | mring33621 2 hours ago | parent | next [-] | | NATS was originally made for simple, fast, ephemeral messaging. The persistence stuff is kinda new and it's not a surprise that there are limitations and bugs. You should see this report as a good thing, as it will add pressure for improvements. | | | |
| ▲ | Thaxll 2 hours ago | parent | prev | next [-] | | "PostgreSQL used fsync incorrectly for 20 years" https://archive.fosdem.org/2019/schedule/event/postgresql_fs... It did not prevent people from using it. You won't find a database that has perfect durability, ease of use, performance, etc. It's all about tradeoffs. | |
| ▲ | dijit 2 hours ago | parent [-] | | Realistically speaking, PostgreSQL wasn't handling a failed call to fsync, which is wrong, but materially different from a bad design or errors in logic stemming from many areas. PostgreSQL was able to fix their bug in 3 lines of code; how many for the parent system? I understand your core thesis (sometimes durability guarantees aren't as needed as we think), but in PostgreSQL's case the edge was incredibly thin. It would have had to have been a failed call to fsync and then a system-level failure of the host before another call to fsync (which are reasonably common). It's far too apples-to-oranges to be meaningful to bring up, I'm afraid. | |
| ▲ | Thaxll 2 hours ago | parent [-] | | NATS allows you to fsync on every call; it's just not the default value. |
|
| |
| ▲ | hurturue 2 hours ago | parent | prev | next [-] | | do you have a better solution? as they would say, NATS is a terrible message bus system, but all the others are worse | | | |
| ▲ | cedws 2 hours ago | parent | prev | next [-] | | Interested to know if you found these issues yourself or from a source. Is Kafka any more robust? | | | |
| ▲ | tptacek an hour ago | parent | prev | next [-] | | This is just a tl;dr of the article with a mean-spirited barb added. | |
| ▲ | KaiserPro an hour ago | parent | prev [-] | | NATS is ephemeral. If you can accept that, then you'll be fine. |
|
|
|
| ▲ | dzonga an hour ago | parent | prev | next [-] |
| nats jetstream vs say redis streams - which one have people found easier to work with? |
|
| ▲ | gostsamo 3 hours ago | parent | prev [-] |
| Thanks, those reports are always a quiet pleasure to read even if one is a bit far from the domain. |