Remix.run Logo
jcgrillo 5 hours ago

> This can be a ton of data though, so we're trying to figure out what to compress and how. We also have the challenge of figuring out how to scrub logs of any potentially sensitive information.

This is fundamentally a data modeling problem. Currently computer telemetry data are just little bags of utf-8 bytes, or at best something like list<map<bytes, bytes>>. IMO this needs to change from the ground up. Logging libraries should emit structured data, conforming to a user supplied schema. Not some open-ended schema that tries to be everything to everyone. Then it's easy to solve both problems--each field is a typed column which can be compressed optimally, and marking a field as "safe" is something encoded in its type. So upon export, only the safe fields make it off the box, or out of the VPC, or whatever--note you can have a richer ACL structure than just "safe yes/no".

I applaud the industry for trying so hard for so long to make everything backwards compatible with the unstructured bytes base case, but I'm not sure that's ever really been the right north star.

quesera 4 hours ago | parent [-]

Grand solutions require broad coordination, and they often devolve back into a modified-but-equivalent version of the previous problem. :(

Stream-of-bytes is classically difficult model to escape. Many have tried.

jcgrillo 2 hours ago | parent [-]

Yeah. There are good reasons things are bad. But there's also a foolish consistency. Like, you can just do things! If you decide monitoring is important you can decide not to outsource it. Most everyone doesn't, though. Probably because they don't think it's very important, and the existing tools get it done well enough, and it's the muscle memory of the subjectively familiar (if objectively fantastically overpriced).

quesera 2 hours ago | parent [-]

Well, in the early days of infrastructure growth, when designing bespoke monitoring systems and protocols would be relatively low-cost, it's still nowhere near the highest-ROI way to spend your tech team's time and energy.

And to do it right (i.e. low-risk of of having it blow up with negative effects on the larger business goals), you need someone fairly experienced or maybe even specialized in that area. If you have that person, they are on the team because of their other skills, which you need more urgently.

SaaS, COTS, and open source monitoring tools have to cater to the existing customers. The sales pitch is "easy to integrate". So even they are not incentivized to build something new.

It boils down to the fact that stream-of-bytes is extremely well-understood, and almost always good enough. Infinitely flexible, low-ceremony, no patents, and comes preinstalled on everything (emitters and consumers). It's like HTTP in that way.

And the evolution is similar too. It'll always be stream-of-bytes, but you can emit in JSON or protobuf etc, if it's worth the cognitive overhead to do so. All the hyperscalers do this, even when the original emitter (web servers, etc) is just blindly spewing atrocious CLF/quirky-SSV text.