dietr1ch 2 days ago
Analytics is done in "batches" (daily, weekly) anyway, right? We know you can't get both row and column order at the same time, and that continuously maintaining both means duplication and ensuring you get the worst case from both worlds. Local, row-wise writing is the way to go for write performance. Column-oriented reads are the way to do analytics at scale. It seems alright to have a sync process that does the order re-arrangement (maybe with extra precomputed statistics, and sharding to allow many workers if necessary) to let queries over now-historical data run fast.
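A minimal sketch of such a sync step, assuming the hot data lives in a SQLite table events(ts, user_id, value) and the columnar copy is a Parquet file (the table, file names, and cutoff are made up for illustration):

    import sqlite3
    import time

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical row-oriented source: a SQLite table events(ts, user_id, value)
    # that the application writes to locally.
    cutoff_ts = int(time.time()) - 86_400   # sync everything older than a day
    src = sqlite3.connect("app.db")
    rows = src.execute(
        "SELECT ts, user_id, value FROM events WHERE ts < ?", (cutoff_ts,)
    ).fetchall()

    # Re-arrange into column order and write a Parquet file. Parquet keeps
    # per-column min/max statistics, which covers the "precomputed statistics"
    # part; sharding would mean partitioning this step by key or time range.
    table = pa.table({
        "ts":      pa.array([r[0] for r in rows]),
        "user_id": pa.array([r[1] for r in rows]),
        "value":   pa.array([r[2] for r in rows]),
    })
    pq.write_table(table, f"events_{cutoff_ts}.parquet")

    # Once the columnar copy is durable, the synced rows can be dropped from
    # the hot row store (or kept, at the cost of duplication).
    src.execute("DELETE FROM events WHERE ts < ?", (cutoff_ts,))
    src.commit()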
delaminator 2 days ago | parent
It's not just about row versus column. OLAP stores are potentially denormalised as well, and sometimes pre-aggregated, such as rolling up by day or by customer. If you really need performance you'll be building a star schema.
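As a rough illustration of the rollup part, using DuckDB's Python API (the sales and sales_by_day_customer tables are hypothetical; a full star schema would add the dimension tables around them):

    import duckdb

    con = duckdb.connect("warehouse.duckdb")

    # Hypothetical fact table; in a full star schema this would reference
    # dimension tables (customers, products, dates) by surrogate keys.
    con.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            sold_at TIMESTAMP, customer_id BIGINT, amount DOUBLE
        )
    """)

    # Pre-aggregated rollup: one row per (day, customer) instead of one row per
    # sale, so reporting queries scan far fewer rows than the raw fact table.
    con.execute("""
        CREATE OR REPLACE TABLE sales_by_day_customer AS
        SELECT date_trunc('day', sold_at) AS day,
               customer_id,
               count(*)    AS n_sales,
               sum(amount) AS total_amount
        FROM sales
        GROUP BY 1, 2
    """)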
sanderjd 2 days ago | parent
Not all OLAP-like queries are for daily reporting. I agree that the basic architecture should be row order -> delay -> column order, but the question (in my mind) is balancing the length of that delay against the usefulness of column-order queries for a given workload. I keep running into workloads that do inserts very quickly and then batch reads on a slower cadence (either in lockstep with the writes, or concurrently), but not on the extremely slow cadence of the typical OLAP reporting flow. Essentially, building up state and then querying the results.

I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it duplicates storage, but is that such a big deal? And if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, wouldn't that actually be the best case from both worlds?

This isn't that different conceptually from the architecture of log-structured merge trees, which have this same kind of "duplication" but for good purpose. (Indeed, RocksDB has been the closest thing to what I want for this workload that I've found; I just think it would be neat if I could use sqlite+duckdb instead, accepting some tradeoffs.)
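A rough sketch of what that sqlite+duckdb split could look like, reusing the hypothetical app.db / events_*.parquet layout from the sync sketch above; the use of DuckDB's sqlite extension and the ATTACH syntax here are my assumptions about the tooling:

    import duckdb

    con = duckdb.connect()

    # Assumed layout: recent writes land in app.db (SQLite, row order), while
    # the synced history lives in events_*.parquet files (column order).
    con.execute("INSTALL sqlite")
    con.execute("LOAD sqlite")
    con.execute("ATTACH 'app.db' AS hot (TYPE sqlite)")

    # One logical table over both copies: point writes and lookups stay on the
    # SQLite side, batch analytical reads mostly scan the columnar Parquet side.
    con.execute("""
        CREATE VIEW events_all AS
        SELECT ts, user_id, value FROM hot.events
        UNION ALL
        SELECT ts, user_id, value FROM read_parquet('events_*.parquet')
    """)

    print(con.execute(
        "SELECT user_id, sum(value) FROM events_all GROUP BY user_id"
    ).fetchall())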