torginus 6 hours ago

I used Newtonsoft.Json, which takes in a stream, and while it can give you whole objects, it can also expose the input as a stream of tokens.

The bulk of the data was in big JSON arrays, so you basically consumed the array start token, then let the deserializer consume one entire object at a time and turn it into a C# object, then advanced the reader to the next element or the end-array token, until you ran out of tokens.
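
Roughly, the pattern looked like this (a minimal sketch from memory; Entry and its fields are placeholders, not the real schema):

    using System.Collections.Generic;
    using System.IO;
    using Newtonsoft.Json;

    // Placeholder record type standing in for the real data.
    public class Entry
    {
        public string Id { get; set; }
        public double Value { get; set; }
    }

    public static class StreamingReader
    {
        // Walks a huge top-level JSON array one element at a time,
        // so only one Entry is materialized in memory at once.
        public static IEnumerable<Entry> ReadEntries(string path)
        {
            var serializer = new JsonSerializer();
            using var file = File.OpenText(path);
            using var reader = new JsonTextReader(file);

            while (reader.Read())
            {
                // Deserialize each object as we hit its start token;
                // the reader is left positioned at the matching end token.
                if (reader.TokenType == JsonToken.StartObject)
                    yield return serializer.Deserialize<Entry>(reader);
            }
        }
    }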

I had to do it like this because the data scientists (DSes) were running into files that didn't fit into memory. The previous approach took an hour and involved reading the whole file into memory and parsing it as JSON; when some of the files got over 10GB, even 64GB of memory wasn't enough and the system started swapping.

It wasn't fast even before the swapping (I learned just how slow Python can be), but with swapping it basically took a day to run a single experiment. The data then got turned into a dataframe.

I replaced that part of the Python processing code and output a CSV, which Pandas could read without a round trip through Python code (I believe its CSV reader is an optimized C implementation).
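
Continuing the sketch above, the output side then reduces to streaming rows straight to disk (file names and columns are again placeholders):

    using System.IO;

    // Stream entries straight to CSV so nothing accumulates in memory.
    using var csv = new StreamWriter("out.csv");
    csv.WriteLine("id,value"); // placeholder header
    foreach (var e in StreamingReader.ReadEntries("big.json"))
        csv.WriteLine($"{e.Id},{e.Value}");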

The preprocessor was able to run on the build machines, and the DSes consumed the CSV directly.

zahlman an hour ago | parent

Would, for example, https://pypi.org/project/json-stream/ have met your needs?

torginus an hour ago | parent

I'm going to go out on a limb and say no - this library seems to do the parsing in Python, and Python is slow, many times slower than Java, C#, or other languages in that class, which you find out if you try to do heavy data processing with it. That's one of the reasons I dislike the language. It's also very hard to parallelize: in C#, if you feed data into LINQ and the entries are independent, you can make the work parallel with PLINQ very quickly, while threads aren't really a thing in Python because of the GIL (or at least they weren't back then).
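
For instance, a sequential LINQ pipeline over independent entries parallelizes with essentially one extra call (ProcessEntry here is a made-up stand-in for the per-entry work):

    using System;
    using System.Linq;

    class Demo
    {
        // Placeholder for the per-entry work; independent across entries.
        static long ProcessEntry(int e) => (long)e * e;

        static void Main()
        {
            var entries = Enumerable.Range(0, 1_000_000);

            // Sequential LINQ:
            long sum = entries.Select(ProcessEntry).Sum();

            // PLINQ: AsParallel() fans the work out across cores;
            // add AsOrdered() if the output order matters.
            long parallelSum = entries.AsParallel().Select(ProcessEntry).Sum();

            Console.WriteLine($"{sum} == {parallelSum}");
        }
    }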

I've seen data processing in Python become a bottleneck and a source of frustration so many times, with the code eventually needing to be rewritten, that I stopped bothering to write this kind of thing in Python in the first place.

You can make Python fast by relying on NumPy and pandas with array programming, but formatting and massaging the data so that what you want can be expressed as array operations is often challenging enough that it became too much of a burden for me.

I wish Python were at least as fast as Node (which can also have its own share of performance cliffs).

It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.

briHass 5 hours ago | parent

This sounds similar to how in C#/.NET there are (at least) three ways to read XML: XmlDocument, XPathDocument, or XmlReader. The first two are in-memory object models that must parse the entire document to build up an object hierarchy, through which you then access object-oriented representations of XML constructs like elements and attributes. XmlReader is stream-based: you handle tokens in the XML as they are read (forward-only).
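
A minimal sketch of the contrast (the file name and element name are placeholders):

    using System;
    using System.Xml;

    class XmlDemo
    {
        static void Main()
        {
            // In-memory model: parses the whole document into a tree first,
            // so memory use scales with document size.
            var doc = new XmlDocument();
            doc.Load("data.xml");
            int treeCount = doc.SelectNodes("//item").Count;

            // Stream-based: forward-only, one node at a time, constant memory.
            int streamedCount = 0;
            using var reader = XmlReader.Create("data.xml");
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                    streamedCount++;
            }

            Console.WriteLine($"{treeCount} vs {streamedCount}");
        }
    }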

Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET's built-in JSON parsing) has a similar token-based reader (Utf8JsonReader) in addition to the standard object (de)serialization approach.
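
A rough sketch of that System.Text.Json equivalent (again with a placeholder file; for genuinely huge inputs you would feed Utf8JsonReader buffers incrementally rather than reading the whole file as shown here):

    using System;
    using System.IO;
    using System.Text.Json;

    class JsonTokenDemo
    {
        static void Main()
        {
            // Utf8JsonReader is a forward-only token reader over UTF-8 bytes,
            // much like XmlReader is for XML.
            byte[] json = File.ReadAllBytes("data.json");
            var reader = new Utf8JsonReader(json);

            int objects = 0;
            while (reader.Read())
            {
                if (reader.TokenType == JsonTokenType.StartObject)
                    objects++;
            }
            Console.WriteLine($"Saw {objects} objects");
        }
    }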