userbinator 6 days ago

The given task can be accomplished with not more than a few kilobytes of RAM, a constant independent of the input and output sizes, but unfortunately I suspect the vast majority of programmers now have absolutely no idea how to do so.

shoo 6 days ago | parent | next [-]

i can see how it'd be possible to transform from the input tabular format to the json format, streaming record by record, using a small constant amount of memory, provided the size of input records is bounded independently of the record count. you'd need to maintain a position offset into the input across records, but that's about it
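
something like this C sketch, for instance (the tab-separated layout and the field names are guesses on my part, not taken from the post); the only working memory is one reused line buffer:

    #include <stdio.h>
    #include <string.h>

    /* constant-memory sketch: one reused line buffer, each record streamed
       straight to stdout as JSON. assumes "term<TAB>count" rows; both
       field names are made up, not taken from the post. */
    int main(void) {
        char line[4096];              /* bounded record size, reused every row */
        long n = 0;

        fputs("[", stdout);
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
            char *term  = strtok(line, "\t");
            char *count = strtok(NULL, "\t");
            if (!term || !count)
                continue;                         /* skip malformed rows */
            if (n++)
                fputs(",", stdout);
            /* real code would JSON-escape term; omitted to keep the sketch short */
            printf("{\"term\":\"%s\",\"count\":%s}", term, count);
        }
        fputs("]\n", stdout);
        return 0;
    }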

but, maybe we'd need to know more about how the output data is consumed to know if this would actually help much in the real application. if the next stage of processing wants to randomly access records using Get(int i), where i is the index of the item, then even if we transform the input to JSON with a constant amount of RAM, we still have to store this output JSON somewhere so we can Get those items.

the blog post mentioned "padding"; i didn't immediately understand what that was referring to (padding in the output format?), but i guess it must be talking about struct padding: the items were previously stored as an array of structs, while the code in the article transposed everything into homogeneous arrays, eliminating the overhead of padding

vrnvu 6 days ago | parent [-]

Padding in the post refers to memory alignment.

If we had an "array of structs" instead of a "struct of arrays", each element would be: string(8) + long(8) + int(4) + padding(4) = 24 bytes
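
In C terms (the field names are placeholders, purely for illustration), the two layouts look like this; the compiler rounds each "array of structs" element up to 24 bytes to preserve 8-byte alignment, while the "struct of arrays" version pays no per-row padding:

    #include <stdio.h>

    /* "array of structs": each element carries 4 bytes of tail padding
       so the next element's pointer and long stay 8-byte aligned */
    struct term_row {
        char *term;   /* 8 bytes (pointer) */
        long  count;  /* 8 bytes */
        int   id;     /* 4 bytes + 4 bytes padding = 24 bytes per element */
    };

    /* "struct of arrays": one homogeneous array per field, no per-row padding */
    struct term_columns {
        char **term;
        long  *count;
        int   *id;
    };

    int main(void) {
        printf("%zu\n", sizeof(struct term_row));   /* prints 24 on a typical LP64 system */
        return 0;
    }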

xnorswap 6 days ago | parent | prev | next [-]

How about you enlighten us rather than just taunt us with your superior knowledge?

bilbo-b-baggins 6 days ago | parent | next [-]

If the “Task” is outputting the JSON for terms to a file, it can be streamed one row at a time - with memory reused after each row is read and the output written. That could be done with a few KB of program space assuming you’re parsing the CSV and outputting the JSON manually instead of using a larger library.

The problem isn’t well constrained: it seems to imply that for some reason it all needs to be accessible in memory, it doesn’t specify the cardinality of terms, and it doesn’t specify whether Get(i) is used in a way that actually requires that particular row-by-number interface.

If I were to do it, I’d just parse a Page at a time and update a metadata index saying Page P contains entries starting at N. The output file could be memmapped and only the metadata loaded, allowing direct indexing into the correct Page, which could then be quickly scanned for a record; that would maybe use 1-2MB of RAM for the metadata plus whatever Pages are actually being touched.
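
Roughly, in C (every name here is invented for illustration, and error handling is minimal): a small resident array of (offset, first record) pairs, with the data file memory-mapped so only the Pages actually touched get faulted in.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* metadata: "Page P contains entries starting at record N" */
    struct page_entry {
        off_t offset;         /* byte offset of the Page in the data file */
        long  first_record;   /* index of the first record in that Page */
    };

    struct page_index {
        struct page_entry *pages;  /* the only thing kept resident in RAM */
        size_t npages;
        const char *data;          /* memory-mapped data file */
        size_t len;
    };

    /* map the data file read-only; the OS pages in only what gets touched */
    int index_open(struct page_index *ix, const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        ix->len = (size_t)st.st_size;
        ix->data = mmap(NULL, ix->len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        return ix->data == MAP_FAILED ? -1 : 0;
    }

    /* binary search the metadata for the Page holding record i;
       the caller then scans that Page for record (i - first_record) */
    const char *page_for(const struct page_index *ix, long i) {
        size_t lo = 0, hi = ix->npages;
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (ix->pages[mid].first_record <= i) lo = mid;
            else hi = mid;
        }
        return ix->data + ix->pages[lo].offset;
    }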

But like I said, the problem is not well constrained enough for even a solution like that to be optimal, since it would suffer under full-dataset sequential or random access, as opposed to a workload with hot Pages and a long tail.

/shrug specs matter if you’re in the optimization phase

userbinator 6 days ago | parent | prev [-]

Apparently you're not interested in thinking either, which is another thing I've noticed with many developers these days...

The sibling comment provided a good hint already. All you need to store are some file offsets, amounting to a few dozen bytes.
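
In C terms, the entire persistent state could be something like this (the names are mine, just to show the size):

    #include <sys/types.h>

    /* everything the converter has to remember between records */
    struct stream_state {
        off_t in_offset;     /* how far into the input we've read */
        off_t out_offset;    /* how far into the output we've written */
        long  records_done;  /* e.g. for comma placement and progress */
    };                       /* 24 bytes on LP64: well within "a few dozen" */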

userbinator 6 days ago | parent [-]

Thank you for demonstrating your ignorance.

01HNNWZ0MV43FF 6 days ago | parent | prev | next [-]

Okay Fermat

Radle 6 days ago | parent | prev [-]

Only real programmers know how to do that.