commandersaki 7 hours ago
How do you stream-parse JSON? I thought you needed to ingest it whole to ensure it is syntactically valid, and most parsers don't work with incomplete or invalid JSON. Or at least it doesn't seem trivial.
torginus 6 hours ago
I used Newtonsoft.Json, which takes in a stream and, while it can give you whole objects, can also expose the input as a stream of tokens. The bulk of the data was in big JSON arrays, so you basically consumed the array-start token, then used the parser to consume one entire object at a time, which the deserializer could turn into a C# object, then consumed a comma or the array-end token, until you ran out of tokens.

I had to do it like this because the DSes were running into the problem that some of the files didn't fit into memory. The previous approach took an hour and involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10GB, even 64GB of memory wasn't enough and the system started swapping). It wasn't fast even before the swapping (I learned just how slow Python can be), and at that point it basically took a day to run a single experiment. The data then got turned into a dataframe.

I replaced that part of the Python processing and output a CSV, which Pandas could read without a round trip through Python code (I guess it has an internal optimized C implementation). The preprocessor was able to run on the build machines, and the DSes consumed the CSV directly.
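(Not the original C# code, but a rough Python analogue of the same token-at-a-time pattern, using the third-party ijson package; the file name and columns are made up:)

    import csv
    import ijson  # incremental JSON parser: pip install ijson

    # Stream one element of the top-level array at a time instead of loading
    # the whole multi-GB file, then write a flat CSV for pandas to read.
    with open("big.json", "rb") as src, open("out.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["id", "value"])          # hypothetical columns
        for obj in ijson.items(src, "item"):      # "item" = each element of the array
            writer.writerow([obj.get("id"), obj.get("value")])

Memory use stays bounded by the size of a single array element rather than the whole file.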
giovannibonetti 7 hours ago
You assume it is valid until it isn't, and you can have different strategies for handling that, like just skipping the broken part and carrying on. Anyway, you write a state machine that processes the string in chunks – as a regular parser would – but the difference is that it eagerly emits a stream of results matching the query as soon as they are found. The objective is to keep memory consumption as low as possible, so that your program can handle an unbounded JSON string while only keeping track of where in the structure it currently is – like a jQuery selector.
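(A minimal sketch of such a state machine in Python – simplified to emit the elements of one top-level array rather than matching an arbitrary selector:)

    import json

    def stream_array_items(chunks):
        """Yield parsed elements of one top-level JSON array fed in as text chunks."""
        depth = 0           # nesting depth relative to the outer array
        in_string = False
        escaped = False
        started = False     # have we seen the opening '[' yet?
        buf = []            # characters of the element currently being read

        for chunk in chunks:
            for ch in chunk:
                if not started:
                    if ch == '[':
                        started = True
                    continue
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                    continue
                if ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch in '{[':
                    depth += 1
                    buf.append(ch)
                elif ch in '}]':
                    if depth == 0:                  # this ']' closes the outer array
                        if ''.join(buf).strip():
                            yield json.loads(''.join(buf))
                        return
                    depth -= 1
                    buf.append(ch)
                elif ch == ',' and depth == 0:      # element boundary
                    yield json.loads(''.join(buf))
                    buf = []
                else:
                    buf.append(ch)

Feed it chunks read from a file or socket and memory stays bounded by the largest single element.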
rented_mule 7 hours ago
I don't know what the GP was referring to, but often this is about "JSONL" / "JSON Lines" - files containing one JSON object per line. This is common for things like log files. So, you process the data as each line is deserialized rather than deserializing the entire file first.
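(For example, in Python – the file name and field are made up:)

    import json

    def iter_jsonl(path):
        # One JSON document per line: memory stays bounded by a single record.
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    for record in iter_jsonl("app.log.jsonl"):      # hypothetical log file
        if record.get("level") == "ERROR":          # hypothetical field
            print(record)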
shakna 7 hours ago
There's a whole heap of approaches, each with its own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid JSON. You can buffer data, or yield items as they become available before discarding them, or use the visitor pattern, among others. One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream
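(Basic usage looks roughly like this, going from memory of that library's README – treat the details as approximate; the file name and keys are made up:)

    import json_stream

    with open("huge.json") as f:
        data = json_stream.load(f)        # lazy, transient view of the document
        for result in data["results"]:    # keys must be consumed in document order
            print(result["name"])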
bob1029 3 hours ago
https://devblogs.microsoft.com/dotnet/the-convenience-of-sys...
https://learn.microsoft.com/en-us/dotnet/standard/serializat...