hahahahhaah 11 hours ago

What's a good place to see parallel arrays defined? I have no data lake experience. I know how a relational DB works.

nurettin 11 hours ago | parent [-]

I mean,

    # Two "tables" held as parallel arrays: row i of one lines up with row i of the other.
    Table1 = {"col1": [1, 2, 3]}
    Table2 = {"epiphany": [1, 1, 1]}
    for r, e in zip(Table1["col1"], Table2["epiphany"]):
        print(r, e)

He's really happy he found this (Edit: actually it seems Chang She talked about this in a 2024 conference talk on the Lance data format [1], at around 12:00, calling it "the fourth way") and will present it at a conference.

[1] https://youtu.be/9O2pfXkCDmU?si=IheQl6rAiB852elv

willvarfar 10 hours ago | parent | next [-]

Seriously, this is not what big data does today. Distributed query engines don't have the primitives to zip through two tables and treat them as column groups of the same wider logical table. There's a new kid on the block called LanceDB that has some of the same features but is aiming for different use-cases. My trick retrofits vertical partitioning into mainstream data lake stuff. It's generic and works on the tech stack my company uses but would also work on all the mainstream alternative stacks. Slightly slower on AWS. But anyway. I guess HN just wants to see an industrial track paper.
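To make "column groups of the same wider logical table" concrete, here is a minimal sketch of the general idea only, not the actual trick described above (which isn't spelled out in this thread). It assumes two local CSV files stand in for separately stored vertical partitions, and that both files hold the same rows in the same order. The names `write_column_group`, `read_wide_table`, and `demo` are hypothetical.

```python
import csv
import os
import tempfile
from itertools import zip_longest

_MISSING = object()  # sentinel used to detect column groups of unequal length


def write_column_group(path, header, rows):
    """Write one column group (a vertical slice of the table) to its own file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_wide_table(paths):
    """Zip several column-group files row by row into one logical wide table.

    Assumption: every file holds the same rows in the same order. There is
    no join key; alignment is purely positional.
    """
    files = [open(p, newline="") for p in paths]
    try:
        readers = [csv.DictReader(f) for f in files]
        wide = []
        for parts in zip_longest(*readers, fillvalue=_MISSING):
            if any(part is _MISSING for part in parts):
                raise ValueError("column groups have different row counts")
            row = {}
            for part in parts:
                row.update(part)  # merge the columns of each group
            wide.append(row)
        return wide
    finally:
        for f in files:
            f.close()


def demo():
    """Store two column groups separately, then read them back as one table."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "group_a.csv")
        b = os.path.join(d, "group_b.csv")
        write_column_group(a, ["col1"], [[1], [2], [3]])
        write_column_group(b, ["epiphany"], [[1], [1], [1]])
        return read_wide_table([a, b])
```

Calling `demo()` returns three rows, each carrying both `col1` and `epiphany` (as strings, since CSV has no types). The point of the sketch is that no join on a key happens: the files are walked in lockstep, which is the primitive a distributed engine would need in order to treat them as one wider table.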

hahahahhaah 9 hours ago | parent | prev [-]

That code is for in-memory data, right? I see no storage access.

What is really happening? Are these streaming off 2 servers and being zipped into 1? Is this just columnar storage or something else?