willvarfar 12 hours ago

I had a great euphoric epiphany feeling today. Doesn't come along too often, will celebrate with a nice glass of wine :)

Am doing data engineering for some big data (yeah, big enough) and thinking about the efficiency of data enrichment. There's a classic trilemma with data enrichment: good write efficiency, good read efficiency, good storage cost; pick two.

E.g. you have a 1TB table and you want to add a column that, say, will take 1GB to store.

You can create a new table that is 1.1TB and then delete the old table, but this is both write-inefficient and often breaks how normal data lake orchestration works.

You can create a new wide table that is 1.1TB and keep it alongside the old table, but this is both write-inefficient and expensive to store.

You can create a narrow companion table that holds just a join key and the 1GB of new data. This is efficient to write and store, but inefficient to query, because you force every reader to do a join on read (sketched below).
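To make that third option concrete, every consumer ends up paying for something like this at read time (a toy sketch in pandas; the table and column names are made up):

    import pandas as pd

    # stand-ins for the 1TB base table and the 1GB companion table,
    # both keyed by event_id (names are hypothetical)
    base = pd.DataFrame({"event_id": [1, 2, 3], "payload": ["a", "b", "c"]})
    enrichment = pd.DataFrame({"event_id": [1, 2, 3], "score": [0.1, 0.9, 0.5]})

    # every query has to do this join on read to see the new column
    wide = base.merge(enrichment, on="event_id", how="left")
    print(wide)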

And I've come up with a cunning fourth way where you write a narrow table and read a wide table, so it's literally the best of all worlds! Kinda staggering :) Still on a high.

Might actually be a conference paper, which is new territory for me. Let's see :)

/off dancing

Fazebooking 10 hours ago | parent | next [-]

Sounds off to me tbh.

Where your table is stored shouldn't matter that much if you have proper indexes, which you need anyway, and if you change anything, your DB rebuilds the indexes anyway.

nurettin 12 hours ago | parent | prev | next [-]

You mean you discovered parallel arrays?

willvarfar 12 hours ago | parent | next [-]

Specifically, I've discovered how to 'trick' mainstream cloud storage and mainstream query engines, using mainstream table formats, into reading parallel arrays that are stored outside the table without a classic join, and treating them as new columns or as schema evolution. It'll work on Spark, BigQuery, etc.
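Roughly the effect, with the table-format machinery left out (file names made up; this is just the positional zip it amounts to, not the actual mechanism):

    import pyarrow.parquet as pq

    # base.parquet (the 1TB table) and scores.parquet (the 1GB companion)
    # are written in the same row order, so row i lines up in both files
    base = pq.read_table("base.parquet")
    extra = pq.read_table("scores.parquet")

    # attach the companion column positionally: no join key, no shuffle
    wide = base.append_column("score", extra.column("score"))
    print(wide.schema)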

hahahahhaah 11 hours ago | parent | prev [-]

What's a good place to see parallel arrays defined? I have no data lake experience. I know how relational DBs work.

nurettin 11 hours ago | parent [-]

I mean,

    # two "parallel arrays": row i in one lines up with row i in the other
    Table1 = {"col1": [1, 2, 3]}
    Table2 = {"epiphany": [1, 1, 1]}
    # read them as one wider table by matching position, no join key
    for i, r in enumerate(Table1["col1"]):
        print(r, Table2["epiphany"][i])

He's really happy he found this (Edit: actually it seems Chang She talked about this at a conference in 2024 while discussing the Lance data format [1] (at 12:00), calling it "the fourth way") and will present it at a conference.

[1] https://youtu.be/9O2pfXkCDmU?si=IheQl6rAiB852elv

willvarfar 10 hours ago | parent | next [-]

Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming at different use cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic: it works on the tech stack my company uses, but it would also work on all the mainstream alternative stacks (slightly slower on AWS). But anyway, I guess HN just wants to see an industrial-track paper.

hahahahhaah 9 hours ago | parent | prev [-]

That code is for in-memory data, right? I see no storage access.

What is really happening? Are these streaming off two servers and being zipped into one? Is this just columnar storage or something else?

anonu 11 hours ago | parent | prev [-]

Look into vector databases. For most representations, a column is just another file on disk.
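For example (a toy column-per-file layout; file names made up):

    import numpy as np

    # each column is its own file; adding a column is just adding a file
    np.save("col_event_id.npy", np.array([1, 2, 3]))
    np.save("col_score.npy", np.array([0.1, 0.9, 0.5]))

    # reading the "wider" table means reading one more file alongside the rest
    event_id = np.load("col_event_id.npy")
    score = np.load("col_score.npy")
    print(event_id, score)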