▲ | xnorswap | 6 days ago
How about you enlighten us rather than just taunt us with your superior knowledge? | ||||||||
▲ | bilbo-b-baggins | 6 days ago | parent | next [-]
If the “Task” is outputting the JSON for terms to a file, it can be streamed one row at a time, with memory reused after each row is read and its output written. That could be done with a few KB of program space, assuming you're parsing the CSV and emitting the JSON manually instead of pulling in a larger library.

The problem isn't well constrained: it seems to imply that for some reason the data all needs to be accessible in memory, it doesn't specify the cardinality of terms, and it doesn't specify whether Get(i) is used in a way that actually requires that particular row-by-number interface.

If I were to do it, I'd parse a Page at a time and maintain a metadata index saying "Page P contains entries starting at record N". The output file could be memory-mapped with only the metadata loaded, letting a lookup index directly into the correct Page, which can then be quickly scanned for the record. That would use maybe 1-2 MB of RAM for the metadata plus whatever Pages are actually being touched (rough sketch below).

But like I said, the problem isn't constrained well enough for even a solution like that to be optimal, since it would suffer under full-dataset sequential or random access, as opposed to a workload of hot Pages and a long tail. /shrug specs matter if you're in the optimization phase
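To make the Page-index idea concrete, here's a rough Go sketch under a pile of assumptions that aren't in the thread: the converted output is JSON Lines (one record per line) in a hypothetical terms.jsonl, the Page size is arbitrary, and plain ReadAt calls stand in for a real mmap (the OS page cache gives a similar only-hot-Pages-resident effect).

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "os"
        "sort"
    )

    const pageSize = 64 * 1024 // bytes per Page (arbitrary choice)

    // pageMeta says: the Page at byte Offset holds records starting at
    // index FirstRecord. Metadata stays tiny relative to the data:
    // 16 bytes per Page.
    type pageMeta struct {
        FirstRecord int
        Offset      int64
    }

    // buildIndex scans the file once, recording the first record index
    // of each Page (records are newline-delimited).
    func buildIndex(f *os.File) ([]pageMeta, error) {
        var index []pageMeta
        buf := make([]byte, pageSize)
        record, offset := 0, int64(0)
        for {
            n, err := f.ReadAt(buf, offset)
            if n > 0 {
                index = append(index, pageMeta{FirstRecord: record, Offset: offset})
                record += bytes.Count(buf[:n], []byte{'\n'})
                offset += int64(n)
            }
            if err == io.EOF {
                return index, nil
            }
            if err != nil {
                return nil, err
            }
        }
    }

    // Get finds record i: binary-search the metadata for the right Page,
    // then scan forward from that Page's offset, counting newlines.
    func Get(f *os.File, index []pageMeta, i int) ([]byte, error) {
        p := sort.Search(len(index), func(k int) bool {
            return index[k].FirstRecord > i
        }) - 1
        if p < 0 {
            return nil, fmt.Errorf("record %d not found", i)
        }
        skip := i - index[p].FirstRecord // records to skip within the Page
        offset := index[p].Offset
        buf := make([]byte, pageSize)
        var line []byte
        for {
            n, err := f.ReadAt(buf, offset)
            chunk := buf[:n]
            for skip > 0 { // still seeking the start of record i
                nl := bytes.IndexByte(chunk, '\n')
                if nl < 0 {
                    break
                }
                chunk = chunk[nl+1:]
                skip--
            }
            if skip == 0 {
                if nl := bytes.IndexByte(chunk, '\n'); nl >= 0 {
                    return append(line, chunk[:nl]...), nil
                }
                line = append(line, chunk...) // record spans into the next Page
            }
            offset += int64(n)
            if err != nil {
                if err == io.EOF && skip == 0 && len(line) > 0 {
                    return line, nil // last record, no trailing newline
                }
                return nil, fmt.Errorf("record %d not found: %v", i, err)
            }
        }
    }

    func main() {
        f, err := os.Open("terms.jsonl") // hypothetical converted output
        if err != nil {
            panic(err)
        }
        defer f.Close()
        index, err := buildIndex(f)
        if err != nil {
            panic(err)
        }
        rec, err := Get(f, index, 12345)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s\n", rec)
    }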
▲ | userbinator | 6 days ago | parent | prev [-]
Apparently you're not interested in thinking either, which is another thing I've noticed with many developers these days... The sibling comment provided a good hint already. All you need to store are some file offsets, amounting to a few dozen bytes. | ||||||||
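To spell the hint out with a sketch (Go, purely illustrative; the file names and the JSON Lines output format are my assumptions): for a streamed conversion, the only long-lived state is the reused row buffer plus the read and write positions the OS is already tracking.

    package main

    import (
        "encoding/csv"
        "encoding/json"
        "io"
        "log"
        "os"
    )

    func main() {
        in, err := os.Open("terms.csv") // hypothetical input file
        if err != nil {
            log.Fatal(err)
        }
        defer in.Close()
        out, err := os.Create("terms.jsonl") // hypothetical output, one JSON object per line
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        r := csv.NewReader(in)
        r.ReuseRecord = true // reuse one backing slice for every row read
        enc := json.NewEncoder(out)

        header, err := r.Read()
        if err != nil {
            log.Fatal(err)
        }
        cols := append([]string(nil), header...) // copy: the record slice gets reused

        for {
            rec, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            row := make(map[string]string, len(cols))
            for i, c := range cols {
                row[c] = rec[i]
            }
            if err := enc.Encode(row); err != nil { // emits one line of JSON
                log.Fatal(err)
            }
        }
    }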