I work in infosec, and several popular platforms use Elasticsearch for log storage and analysis.

I would never. Ever. Bet my savings on ES being stable enough to always be online to take in data, or predictable in retaining the data it took in.

It feels very best-effort. As a consultant, I recommend orgs use some other system for retaining their logs, even a raw filesystem with rolling zips, before relying on ES, unless they have a dedicated team constantly monitoring it.
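For illustration, the "raw filesystem with rolling zips" fallback could look roughly like the sketch below. The class name `RollingGzipLog`, the file names, and the size threshold are all hypothetical choices, and this is a minimal single-writer version, not a hardened log shipper.

```python
import gzip
import os
import shutil


class RollingGzipLog:
    """Append lines to a plain file; gzip it into a numbered segment past max_bytes."""

    def __init__(self, directory, max_bytes=1_000_000):
        os.makedirs(directory, exist_ok=True)
        self.directory = directory
        self.max_bytes = max_bytes
        self.active = os.path.join(directory, "current.log")
        # Continue numbering after any segments left by a previous run.
        self.segment = len([n for n in os.listdir(directory) if n.endswith(".log.gz")])

    def write(self, line):
        with open(self.active, "a", encoding="utf-8") as f:
            f.write(line.rstrip("\n") + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to persist the line before returning
        if os.path.getsize(self.active) >= self.max_bytes:
            self._roll()

    def _roll(self):
        self.segment += 1
        target = os.path.join(self.directory, f"segment-{self.segment:06d}.log.gz")
        with open(self.active, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(self.active)  # only after the compressed copy is fully written

    def read_all(self):
        """Yield every retained line, oldest segment first, then the active file."""
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".log.gz"))
        for name in names:
            path = os.path.join(self.directory, name)
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")
        if os.path.exists(self.active):
            with open(self.active, encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")
```

Calling `fsync` on every line trades throughput for durability; a real deployment would probably batch writes.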

kentm 11 hours ago | parent | next [-]

Do you happen to know if ES was the only storage? It's been almost 8 years, but if I were building a log storage and analysis system, I'd push the logs to S3 or some other object store and build an ES index off of that S3 data. From the consumer's perspective, it may look like we're using ES to store the data, but we have a durable backup to regenerate ES if necessary.
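A rough sketch of that layout, with in-memory stand-ins so it runs anywhere: `ObjectStore` plays the role of S3 and `SearchIndex` plays the role of ES. Both classes, and the `ingest` and `rebuild` helpers, are hypothetical names for illustration and do not reflect any real client API.

```python
import json
from collections import defaultdict


class ObjectStore:
    """Stand-in for S3 or another object store: blobs keyed by name."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def keys(self):
        return sorted(self._objects)

    def get(self, key):
        return self._objects[key]


class SearchIndex:
    """Stand-in for the ES index: a disposable token -> document-id map."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, doc):
        for token in doc["message"].lower().split():
            self.postings[token].add(doc_id)

    def search(self, token):
        return sorted(self.postings.get(token.lower(), ()))


def _index_batch(index, batch_key, events):
    for n, event in enumerate(events):
        index.index(f"{batch_key}#{n}", event)


def ingest(store, index, batch_key, events):
    # Durable write first: the object store is the source of truth.
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    store.put(batch_key, payload)
    # Indexing second: if this step is lost, it can be replayed later.
    _index_batch(index, batch_key, events)


def rebuild(store):
    """Regenerate a fresh index purely from what the object store holds."""
    index = SearchIndex()
    for key in store.keys():
        lines = store.get(key).decode("utf-8").splitlines()
        _index_batch(index, key, [json.loads(line) for line in lines])
    return index
```

The point of the ordering in `ingest` is that losing the index never loses data: `rebuild(store)` returns an index that answers the same queries.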

toenail 12 hours ago | parent | prev | next [-]

Dunno, I've had three-node clusters running very stable for years. Which issues did you have that require a full team?

PedroBatista 11 hours ago | parent | next [-]

Even most toy databases "built in a weekend" can be very stable for years if:

- No edge case is thrown at them

- No part of the system is stressed (software modules, OS, firmware, hardware)

- No plug is pulled

Crank the requests to 11 or import a billion rows of data with another billion relations and watch what happens. The main problem isn't the system refusing to serve a request or throwing "No soup for you!" errors; it's data corruption and/or wrong responses.

toenail 11 hours ago | parent [-]

I'm talking about production loads, but thanks.

	pixl97 10 hours ago | parent [-]
		Production loads mean a lot of different things to a lot of different people.

unethical_ban 11 hours ago | parent | prev [-]

To be fair, I think it is chronically underprovisioned clusters that get overwhelmed by log forwarding. I wasn't on the team that managed the ELK stack a decade ago, but I remember our SOC having two people whose full-time job was curating the infrastructure to keep it afloat.

Now I work for a company whose log storage product has ES inside, and it seems to shit the bed more often than it should - again, could be bugs, could be running "clusters" of 1 or 2 instead of 3.

	xeraa 10 hours ago | parent | next [-]
		There are no 2-node clusters (it needs a quorum). If your setup has 2-node clusters, someone is doing this horribly wrong.
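The quorum point comes down to majority arithmetic, sketched below. This is plain majority math with hypothetical function names, not Elasticsearch's actual voting-configuration logic, which has more nuance.

```python
def majority(voting_nodes: int) -> int:
    """Smallest number of nodes that is more than half of the voters."""
    if voting_nodes < 1:
        raise ValueError("need at least one voting node")
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voting_nodes - majority(voting_nodes)


for n in (1, 2, 3, 5):
    print(f"{n} voters: majority={majority(n)}, can lose {tolerated_failures(n)}")
```

Under this model two voters tolerate zero failures, the same as one, while three tolerate one.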
	toenail 11 hours ago | parent | prev [-]
		I'm not even sure "get overwhelmed" is a problem, unless you need real time analytics. But yeah, sounds like a resources issue.

1_1xdev1 5 hours ago | parent | prev | next [-]

You have to slap something durable and a queue in front of it.

Elastic’s own consultants will tell you this …
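As a minimal sketch of "something durable and a queue in front", assuming a single local producer and consumer: events are appended to a spool file on disk, and a checkpoint only advances once the downstream `send` callable (standing in for the ES bulk writer) succeeds. `DiskQueue` and its file names are hypothetical; real setups would more likely use an existing message broker.

```python
import json
import os


class DiskQueue:
    """Append-only spool file plus a checkpoint counting delivered lines."""

    def __init__(self, directory):
        os.makedirs(directory, exist_ok=True)
        self.spool = os.path.join(directory, "spool.jsonl")
        self.checkpoint = os.path.join(directory, "delivered.count")

    def enqueue(self, event):
        with open(self.spool, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the event is on disk before we acknowledge it

    def _delivered(self):
        try:
            with open(self.checkpoint, encoding="utf-8") as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def _commit(self, count):
        tmp = self.checkpoint + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(str(count))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.checkpoint)  # atomic swap of the checkpoint file

    def drain(self, send):
        """Send undelivered events in order; stop at the first failure.

        Returns the number of events delivered during this call.
        """
        if not os.path.exists(self.spool):
            return 0
        done = self._delivered()
        sent = 0
        with open(self.spool, encoding="utf-8") as f:
            for position, line in enumerate(f):
                if position < done:
                    continue  # already delivered in an earlier drain
                try:
                    send(json.loads(line))
                except Exception:
                    break  # downstream is unhealthy; keep the rest spooled
                sent += 1
                self._commit(done + sent)
        return sent
```

This gives at-least-once delivery: a crash between `send` and `_commit` means the event is sent again on the next drain, so the downstream writer should be idempotent (for example by using deterministic document IDs).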

cyberpunk 11 hours ago | parent | prev [-]

Meh, I run hundreds of ES nodes. It's gotten a lot more friendly these days, but yes, it can be a bit unforgiving at times.

Turns out running complicated large distributed systems requires a bit more than a ./apply, who would have guessed it?