torginus 6 hours ago
I used Newtonsoft.Json, which takes in a stream; while it can give you objects, it can also expose the input as a stream of tokens. The bulk of the data was in big JSON arrays, so you basically consumed the array start token, then used the parser to consume an entire object, which the deserializer could turn into a C# object, then consumed a comma or end-array token, and repeated until you ran out of tokens. I had to do it like this because the DSes were running into the problem that some of the files didn't fit into memory. The previous approach took 1 hour and involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10GB, even 64GB of memory wasn't enough and the system started swapping). It wasn't fast even before swapping (I learned just how slow Python can be), but with swapping it basically took a day to run a single experiment. Then the data got turned into a dataframe. I replaced that part of the Python processing code and output a CSV instead, which Pandas could read without having to trip through Python code (I guess it has an internal optimized C implementation). The preprocessor was able to run on the build machines and the DSes consumed the CSV directly.
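For readers who want to see the shape of that loop, here is a minimal sketch of the same idea in Python using only the standard library. It is not the C#/Newtonsoft.Json code described above: the function names `iter_array_items` and `json_array_to_csv`, the `fields` list, and the chunk size are all hypothetical, and it assumes the file is a single top-level array whose elements are JSON objects.

```python
import csv
import io
import json


def iter_array_items(stream, chunk_size=65536):
    """Yield the elements of one top-level JSON array, reading text in chunks.

    Assumes the elements are objects; it is lenient about separators and
    re-parses from the start of an element whenever it spans a chunk boundary.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    while True:
        buf = buf.lstrip()
        if not buf:
            chunk = stream.read(chunk_size)
            if not chunk:
                raise ValueError("unexpected end of input")
            buf = chunk
            continue
        if not started:
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            started = True
            buf = buf[1:]  # consume the array start
            continue
        if buf[0] == "]":
            return  # consume the array end
        if buf[0] == ",":
            buf = buf[1:]  # consume the separator between elements
            continue
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # Probably an element cut off at the chunk boundary: read more, retry.
            chunk = stream.read(chunk_size)
            if not chunk:
                raise  # input is exhausted, so the data really is malformed
            buf += chunk
            continue
        yield item
        buf = buf[end:]  # drop the element we just handed out


def json_array_to_csv(src, dst, fields, chunk_size=65536):
    """Write one CSV row per array element and return the row count."""
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in iter_array_items(src, chunk_size):
        writer.writerow(record)
        count += 1
    return count


# Tiny in-memory demo; a real run would pass open file handles instead.
demo_out = io.StringIO()
demo_rows = json_array_to_csv(
    io.StringIO('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'),
    demo_out,
    ["id", "name"],
    chunk_size=8,
)
```

Only one element plus at most a chunk of lookahead is held in memory at a time, which is the property that matters for the 10GB files. The retry-on-error approach is simple rather than fast; a real token-level reader avoids re-parsing.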
zahlman an hour ago
Would, for example, https://pypi.org/project/json-stream/ have met your needs?
briHass 5 hours ago
This sounds similar to how in C#/.NET there are (at least) 3 ways to read XML: XmlDocument, XPathDocument, or XmlReader. The first 2 are in-memory object models that must parse the entire document to build up an object hierarchy, through which you then access object-oriented representations of XML constructs like elements and attributes. The XmlReader is stream-based: you handle tokens in the XML as they are read (forward-only). Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET's built-in JSON parsing) has a similar token-based reader in addition to the standard (de)serialization-to-objects approach.
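The same in-memory versus forward-only split exists outside .NET. As a rough Python analogue (not the XmlReader API itself), the standard library's `xml.etree.ElementTree.iterparse` emits events as the document is read. The helper name `iter_records` and the `row` element layout below are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield one dict per <tag> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # The element is complete here: copy out what we need.
            yield {child.tag: child.text for child in elem}
            # Drop already-processed children so memory use stays flat.
            root.clear()


# Tiny in-memory demo; a real run would pass a file path or binary file object.
demo_xml = b"<rows><row><id>1</id><name>a</name></row><row><id>2</id><name>b</name></row></rows>"
demo_records = list(iter_records(io.BytesIO(demo_xml), "row"))
```

Skipping the `root.clear()` call would still parse correctly, but the tree would keep growing as in the in-memory models, which is exactly the failure mode described above for large documents.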