nurettin 14 hours ago

You mean you discovered parallel arrays?

willvarfar 14 hours ago | parent | next [-]

Specifically, I've discovered how to 'trick' mainstream cloud storage and mainstream query engines, using mainstream table formats, into reading parallel arrays that are stored outside the table, without a classic join, and treating them as new columns or schema evolution. It'll work on Spark, BigQuery, etc.

hahahahhaah 12 hours ago | parent | prev [-]

What's a good place to see parallel arrays defined? I have no data lake experience. I know how a relational DB works.

nurettin 12 hours ago | parent [-]

I mean,

    Table1 = {"col1": [1, 2, 3]}
    Table2 = {"epiphany": [1, 1, 1]}
    # Rows are matched purely by position; there is no join key.
    for i, r in enumerate(Table1["col1"]):
        print(r, Table2["epiphany"][i])

He's really happy he found this and will present it at a conference. (Edit: actually, it seems Chang She talked about this at a conference in 2024 while discussing the Lance data format [1] @12:00, calling it "the fourth way".)

[1] https://youtu.be/9O2pfXkCDmU?si=IheQl6rAiB852elv

willvarfar 12 hours ago | parent | next [-]

Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming for different use-cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic and works on the tech stack my company uses but would also work on all the mainstream alternative stacks. Slightly slower on AWS. But anyway. I guess HN just wants to see an industrial track paper.
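To make "column groups of the same wider logical table" concrete, here is a minimal in-memory sketch. It is not the trick itself and touches no storage or query engine; it only shows the positional zip that the comment says distributed engines lack as a primitive. All names (`base`, `extra`, `zip_column_groups`, `rows`) are hypothetical, and it assumes both groups hold the same rows in the same order.

```python
# Hypothetical illustration: two column groups stored separately,
# assumed to contain the same rows in the same order.
base = {"id": [10, 11, 12], "col1": [1, 2, 3]}
extra = {"epiphany": [1, 1, 1]}  # no key column; aligned purely by position


def zip_column_groups(*groups):
    """Combine column groups positionally into one wider logical table."""
    lengths = {len(col) for g in groups for col in g.values()}
    if len(lengths) > 1:
        raise ValueError("column groups must have the same number of rows")
    wide = {}
    for g in groups:
        overlap = wide.keys() & g.keys()
        if overlap:
            raise ValueError(f"duplicate column names: {sorted(overlap)}")
        wide.update(g)
    return wide


def rows(table):
    """Iterate a columnar table as row dicts."""
    names = list(table)
    for values in zip(*(table[n] for n in names)):
        yield dict(zip(names, values))


wide = zip_column_groups(base, extra)
for row in rows(wide):
    print(row)
```

Note that `extra` carries no key at all: the only thing tying it to `base` is row order, which is why both groups have to be written and read in exactly the same order for this to be safe.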

hahahahhaah 10 hours ago | parent | prev [-]

That code is for in-memory data, right? I see no storage access.

What is really happening? Are these streamed off 2 servers and zipped into 1? Is this just columnar storage or something else?