commandersaki 7 hours ago

How do you stream-parse JSON? I thought you needed to ingest it whole to ensure it is syntactically valid, and most parsers don't work with incomplete or invalid JSON? Or at least it doesn't seem trivial.

torginus 6 hours ago | parent | next [-]

I used Newtonsoft.Json, which takes in a stream, and while it can give you whole objects, it can also expose the input as a stream of tokens.

The bulk of the data was in big JSON arrays, so you basically consumed the array-start token, then used the parser to consume one entire object, which the deserializer could turn into a C# object, then consumed a comma or array-end token, and repeated until you ran out of tokens.
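
A minimal sketch of that token-level loop with Newtonsoft.Json; the Entry type, property names, and file path below are placeholders, not the actual data:

    using System;
    using System.IO;
    using Newtonsoft.Json;

    // Placeholder element type standing in for whatever the real records were.
    class Entry
    {
        public string Id { get; set; }
        public double Value { get; set; }
    }

    class Program
    {
        static void Main()
        {
            var serializer = new JsonSerializer();
            using var file = File.OpenText("big-array.json"); // placeholder path
            using var reader = new JsonTextReader(file);

            // Advance to the '[' that opens the top-level array.
            while (reader.Read() && reader.TokenType != JsonToken.StartArray) { }

            // Deserialize one element at a time; only the current object is ever
            // held in memory, so the total file size no longer matters.
            while (reader.Read() && reader.TokenType == JsonToken.StartObject)
            {
                var entry = serializer.Deserialize<Entry>(reader);
                Console.WriteLine($"{entry.Id}: {entry.Value}");
            }
        }
    }

Each Deserialize call consumes exactly one object's worth of tokens, so the loop stops naturally when it reaches the array-end token.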

I had to do it like this because the data scientists (DSes) were running into the problem that some of the files didn't fit into memory. The previous approach took an hour and involved reading the whole file into memory and parsing it as JSON (once some of the files got over 10 GB, even 64 GB of memory wasn't enough and the system started swapping).

It wasn't fast even before the swapping (I learned just how slow Python can be), but after that it basically took a day to run a single experiment. The data then got turned into a dataframe.

I replaced that part of the Python processing code and output a CSV, which pandas could read without a round trip through Python code (I guess it has an optimized internal C implementation).

The preprocessor was able to run on the build machines and DSes consumed the CSV directly.

zahlman an hour ago | parent | next [-]

Would for example https://pypi.org/project/json-stream/ have met your needs?

torginus an hour ago | parent [-]

I'm going to go out on a limb and say no: this library seems to do the parsing in Python, and Python is slow, many times slower than Java, C#, or other languages in that class, which you find out when you try to do heavy data processing with it, and which is one of the reasons I dislike the language. It's also very hard to parallelize. In C#, if you feed stuff into LINQ and the entries are independent, you can make the work parallel with PLINQ very quickly, while threads aren't really a thing in Python (or at least they weren't back then).
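
For what it's worth, the PLINQ change is roughly this; Transform is a made-up stand-in for whatever independent per-entry work the pipeline does:

    using System;
    using System.Linq;

    class Demo
    {
        // Placeholder for the per-entry work; anything independent across entries works.
        static long Transform(int x) => (long)x * x;

        static void Main()
        {
            var entries = Enumerable.Range(0, 1_000_000);

            // Sequential LINQ pipeline.
            long sequential = entries.Select(Transform).Sum();

            // The same pipeline spread across cores: AsParallel() is the only change.
            long parallel = entries.AsParallel().Select(Transform).Sum();

            Console.WriteLine(sequential == parallel);
        }
    }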

I've seen data processing in Python become a bottleneck and a source of frustration so many times, with the code eventually needing a rewrite, that I came to not bother writing that kind of thing in Python in the first place.

You can make Python fast by relying on NumPy and pandas with array programming, but formatting and massaging the data so that what you want can be expressed as array-programming ops can be quite challenging, and it usually became too much of a burden for me.

I wish Python were at least as fast as Node (which can also have its own share of performance cliffs).

It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.

briHass 5 hours ago | parent | prev [-]

This sounds similar to how in C#/.NET there are (at least) three approaches to reading XML: XmlDocument, XPathDocument, or XmlReader. The first two are in-memory object models that must parse the entire document to build up an object hierarchy, which you then access through object-oriented representations of XML constructs like elements and attributes. XmlReader is stream-based: you handle tokens in the XML as they are read (forward-only).

Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET's built-in JSON parsing) has a similar token-based reader in addition to the standard (de)serialization-to-objects approach.
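
A minimal forward-only XmlReader loop looks roughly like this; the file path and element/attribute names are placeholders:

    using System;
    using System.Xml;

    class Program
    {
        static void Main()
        {
            // XmlReader never builds a tree; it walks the document node by node,
            // so memory use stays flat no matter how large the file is.
            using var reader = XmlReader.Create("items.xml"); // placeholder path
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    Console.WriteLine(reader.GetAttribute("id"));
                }
            }
        }
    }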

giovannibonetti 7 hours ago | parent | prev | next [-]

You assume it is valid until it isn't, and you can have different strategies to handle that, like just skipping the broken part and carrying on.

Anyway, you write a state machine that processes the string in chunks, as you would with a regular parser, but the difference is that this parser eagerly spits out a stream of data matching the query as soon as it is found.

The objective is to reduce memory consumption as much as possible, so that your program can handle an unbounded JSON string while only keeping track of where in the structure it currently is, something like a jQuery selector.
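
As a rough illustration (my own sketch, not the parent's code): a tiny state machine that can be fed arbitrary chunks and emits the raw text of each top-level array element as soon as it closes, remembering only the current nesting depth and string state:

    using System;
    using System.Collections.Generic;
    using System.Text;

    class ArrayElementSplitter
    {
        private readonly StringBuilder current = new StringBuilder();
        private int depth;       // nesting depth inside the current element
        private bool inString;   // inside a quoted string?
        private bool escaped;    // previous character was a backslash?
        private bool started;    // seen the '[' that opens the top-level array?

        // Feed one chunk; yields the text of every element completed within it.
        public IEnumerable<string> Feed(string chunk)
        {
            foreach (char c in chunk)
            {
                if (!started)
                {
                    if (c == '[') started = true;
                    continue;
                }

                if (inString)
                {
                    current.Append(c);
                    if (escaped) escaped = false;
                    else if (c == '\\') escaped = true;
                    else if (c == '"') inString = false;
                }
                else if (c == '"')
                {
                    inString = true;
                    current.Append(c);
                }
                else if (c == '{' || c == '[')
                {
                    depth++;
                    current.Append(c);
                }
                else if ((c == '}' || c == ']') && depth > 0)
                {
                    depth--;
                    current.Append(c);
                }
                else if ((c == ',' || c == ']') && depth == 0)
                {
                    // Boundary between top-level elements (or the final ']'):
                    // emit what has accumulated and start over.
                    if (current.Length > 0)
                    {
                        yield return current.ToString().Trim();
                        current.Clear();
                    }
                }
                else
                {
                    current.Append(c);
                }
            }
        }
    }

A real implementation would also report malformed input and hand each emitted element to a conventional parser, but the memory profile is the point: it only ever holds one element plus a few booleans and an integer.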

rented_mule 7 hours ago | parent | prev | next [-]

I don't know what the GP was referring to, but often this is about "JSONL" / "JSON Lines": files containing one JSON object per line. This is common for things like log files. So you process the data as each line is deserialized, rather than deserializing the entire file first.
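
A minimal sketch of that pattern; the LogEntry shape, field names, and file name are made up:

    using System;
    using System.IO;
    using System.Text.Json;

    // Placeholder record for one log line.
    class LogEntry
    {
        public string Level { get; set; }
        public string Message { get; set; }
    }

    class Program
    {
        static void Main()
        {
            var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

            // File.ReadLines streams the file, so only one line/record is ever
            // materialized at a time regardless of the file's size.
            foreach (var line in File.ReadLines("app.log.jsonl"))
            {
                if (string.IsNullOrWhiteSpace(line)) continue;
                var entry = JsonSerializer.Deserialize<LogEntry>(line, options);
                Console.WriteLine($"{entry.Level}: {entry.Message}");
            }
        }
    }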

shakna 7 hours ago | parent | prev | next [-]

There's a whole heap of approaches, each with their own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid JSON.

You can buffer data, yield it as it becomes available before discarding it, use the visitor pattern, and so on.

One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream

bob1029 3 hours ago | parent | prev [-]

https://devblogs.microsoft.com/dotnet/the-convenience-of-sys...

https://learn.microsoft.com/en-us/dotnet/standard/serializat...