cj 3 hours ago

When you’re in the middle of a production down event and your whole team is diagnosing the issue, and your log server is unresponsive, who do you contact for support?

No one. You pull an engineer off the production issue to debug the log server, because you need the log server to debug the production servers.

See the problem?

Edit: to be clear I’m no fan of Datadog and I wish self hosting were an option. I want this path for our company, but at least on our team we just don’t have enough (redundant) expertise to deploy and manage these systems. We’d have to hire an extra FTE.

phil21 3 hours ago | parent | next [-]

If you’re having a correlated outage like that, then you’ll likely fix the prod issue before the cloud engineers at some giant cloud company even respond to an internal escalation, much less fix the issue. More than likely your prod issue is causing the logging problem.

If you mean you are experiencing two totally unrelated issues at the same time, then I don’t think that’s a reasonable thing to really assign much value to as it’s incredibly unlikely.

If you truly need that level of peace of mind, half of $30k/mo trivially pays for an engineer hired only to manage such a cluster, who works an hour a week unless a pager goes off. If you’re hiring for such a position I have a few rock star level folks who would love such a job.

The hypothetical problems people imagine for on-prem infrastructure get really strange to me. I could come up with the same sort of scenarios for cloud based SaaS infrastructure just as easily.

cj 2 hours ago | parent | next [-]

> I don’t think that’s a reasonable thing to really assign much value to as it’s incredibly unlikely.

In my experience the systems/tools needed to debug production issues are often only used when they’re needed.

Which now means you need health and uptime monitoring on your log server since without that, it might break randomly and no one notices until you need it.

> The hypothetical problems people imagine for on-prem infrastructure get really strange to me

It really comes down to the people and whether you have the expertise on the team. And whether the team can realistically manage the system long term. It’s typically safer to spend more money for the managed service.

(It’s a safer decision, not necessarily better)
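To make the health and uptime monitoring point above concrete, here is a minimal sketch of a probe for a self-hosted log server. It assumes the server exposes some HTTP health endpoint; the function name, the URL, and the `opener` parameter are all hypothetical and not taken from any specific logging product.

```python
import urllib.error
import urllib.request


def log_server_is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the (assumed) health endpoint answers with HTTP 2xx.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; replace with whatever your log server exposes.
    ok = log_server_is_healthy("http://logs.internal.example:8080/health")
    print("log server healthy" if ok else "log server DOWN, page someone")
```

Run on a schedule from a host that does not share fate with the log server, this would at least surface a silent failure before the next production incident.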

_heimdall 2 hours ago | parent | prev [-]

100% agree. If I am using a cloud log provider I wouldn't expect them to solve my logging issue(s) as fast as I need, and I have no real way to put more resources on that fix.

More importantly, with a third party service I'd be very surprised if both went down at the same time and it wasn't a further upstream issue like AWS. If it's my own logging service and it went down during a prod outage, I likely didn't properly isolate my logging service in the first place.

sgustard 3 hours ago | parent | prev | next [-]

The old argument for being locked in to legacy software costing 6-8 figures a year was that you had no choice. Now you have a choice! Clearly that is better, and everyone should evaluate that choice on its merits, and the stock market sees that people are voting with their dollars. If your whole sales pitch is "good luck when it breaks!" you might want to reevaluate your business model.

camdenreslink an hour ago | parent [-]

The stock market is trying to predict that people will vote with their dollars in the future. I’m not quite sure people are really replacing enterprise SaaS at large corporations yet. It’s more of a projection.

mschild 3 hours ago | parent | prev | next [-]

Fair, but at some point in a company's size/spending, the complexity of integrating with a SaaS becomes as large as that of running your own open source tool.

Beyond that, and I'm aware this is very much application/company dependent, there are plenty of SaaS companies that offer horrendous or no support no matter what you pay. We used to use Splunk for monitoring and logging. Paid a ton of money because we were handling financial data and needed traceability and reliability. We constantly had to put out fires that were caused by their unreliable platform. It was not a good experience.

Ultimately, we jumped ship to Prometheus. We pay a fraction of the price and spend less time on it.

dyauspitr 3 hours ago | parent | prev | next [-]

You don’t, you just look at the log like us old timers and solve the problem. It’s literally no different than solving the problem on the cloud.

eagsalazar2 3 hours ago | parent | prev | next [-]

Boogeyman

iamleppert 3 hours ago | parent | prev [-]

Have you ever tried to contact their support?

The problem is all these SaaS companies have cut costs so much that all their support has been reduced to useless offshore staff at best and a chatbot at worst. They do go down and don't work, and oftentimes there's simply nothing you can do. The worst offenders will seize upon the moment and force you to upgrade a support plan before they will even talk to you, even if the issue is of their own making.

Unless you're a huge customer and already paying them tons of money, expect to receive no support. Your only line of defense if something happens and you're not a whale is that some whale is upset and they actually have their people working on the problem. If you're a small company, startup, or even mid-size, good luck on getting them to care. You'll probably be sent a survey when you don't renew and may eventually be a quotient in their risk calculus at some point in the distant future, but only if you represent a meaningful mass of customers they lost.