torginus 8 hours ago

When I worked as a data engineer, I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON at tens of MB/s, creating a huge bottleneck.

By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1 GB/s on an SSD back then).

Just how much data do you need when these sorts of clustered approaches really start to make sense?

embedding-shape 7 hours ago | parent | next [-]

> I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON

Hah, incredibly funny. I remember doing the complete opposite about 15 years ago: some beginner developer had set up a whole interconnected system with multiple processes and whatnot in order to process a bunch of JSON, and it took forever. Got replaced with a Bash script + Python!

> Just how much data do you need when these sorts of clustered approaches really start to make sense?

I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.

toast0 an hour ago | parent | prev | next [-]

> Just how much data do you need when these sorts of clustered approaches really start to make sense?

You really need an enormous amount of data (or data processing) to justify a clustered setup. Single machines can scale up quite a lot.

It'll cost money, but you can order a 24x128 GB RAM, 24x30 TB SSD system that will arrive in a few days and give you 3 TB of RAM and 720 TB of (fast) disk. You can go bigger, but it'll be a little exotic and the ordering process might take longer.

If you need more storage/RAM than around that, you need clustering. Likewise, if the processing power you can get in a single system isn't enough, you would need to cluster, but ~256 CPU cores is enough for a lot of things.

noufalibrahim 6 hours ago | parent | prev | next [-]

I remember a panel once at a PyCon where we were discussing, I think, the Anaconda distribution in the context of packaging, and a respected data scientist (whose talks have always been hugely popular) made the point that he doesn't like Pandas because it's not Excel. The latter was his go-to tool for most of his exploratory work. If the data were too big, he'd sample it, and things like that, but his work ultimately ended up in Excel.

Quick Python/Bash to clean up data is fine too, I suppose, and with LLMs it's easier than ever to write a quick throwaway script.

acomjean 4 hours ago | parent | next [-]

I took a biostatistics class. The tools were Excel, R, or Stata.

I think most people used R: free, with great graphing. Though the interactivity of Excel is great for what-ifs. I never got R till I took that class, though RStudio makes R seem like scriptable Excel.

R/Python are fast enough for most things, though a lot of genomics work (BLAST alignments, etc.) is done in compiled languages.

dapperdrake 5 hours ago | parent | prev [-]

Whenever I had to use Anaconda, it was slow as molasses. Was that ever fixed?

zahlman 2 hours ago | parent [-]

What tasks were slow?

jtbaker 5 minutes ago | parent | prev | next [-]

You didn't need to rewrite to C# to do that; Python should be able to handle streaming that amount/velocity of data fine, at least through a native extension like msgspec or pydantic. Additionally, you made it much harder for other data engineers who need to maintain/extend the project in the future to do so.

rented_mule 7 hours ago | parent | prev | next [-]

I like the peer comment's answer about a processing-time threshold (e.g., a day). Another obvious threshold is data that doesn't conveniently fit on local disks. Large-scale processing solutions can often process directly from/to object stores like S3. And if it's running inside the same provider (e.g., AWS in the case of S3), data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.

betaby 4 hours ago | parent [-]

> data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.

In practice, most AWS instances are capped at 10 Gbps. I have seen ~5 Gbps consistent reads from GCS and S3. Nitro-based instances are in theory capable of 100 Gbps; in practice, I've never seen that.

sgarland 3 hours ago | parent [-]

Also, anything under 16 vCPUs generally has baseline/burst network bandwidth, with the burst being best-effort and lasting roughly 5-60 minutes.

At multiple companies, this has been the cause of surprise incidents: people were unaware of this behavior and were caught off guard when bandwidth suddenly plummeted by 50% or more under sustained load.

KolmogorovComp 7 hours ago | parent | prev | next [-]

> Just how much data do you need when these sorts of clustered approaches really start to make sense?

I did not see your comment earlier, but to stay with chess, see https://news.ycombinator.com/item?id=46667287, with ~14 TB uncompressed.

It's not humongous and it can certainly fit on disk(s), but not on a typical laptop.

commandersaki 7 hours ago | parent | prev | next [-]

How do you stream-parse JSON? I thought you needed to ingest it whole to ensure it is syntactically valid, and most parsers don't work with incomplete or invalid JSON? Or at least it doesn't seem trivial.

torginus 6 hours ago | parent | next [-]

I used Newtonsoft.Json, which takes in a stream, and while it can give you whole objects, it can also expose the input as a stream of tokens.

The bulk of the data was in big JSON arrays, so you basically consumed the array-start token, then used the parser to consume an entire object (which the deserializer could turn into a C# object), and repeated that until you hit the array-end token and ran out of tokens.
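
A minimal sketch of that token-level loop with Newtonsoft.Json could look roughly like the following; the Record type, its properties, and the file name are made up for illustration, not the actual schema from that pipeline.

    // Sketch: stream a huge top-level JSON array with Newtonsoft.Json's token reader,
    // deserializing one element at a time so memory use stays flat.
    using System.IO;
    using Newtonsoft.Json;

    class Record
    {
        public string Id { get; set; }
        public double Value { get; set; }
    }

    class Program
    {
        static void Main()
        {
            var serializer = new JsonSerializer();

            using var file = File.OpenText("huge.json");
            using var reader = new JsonTextReader(file);

            while (reader.Read())
            {
                // Each time the reader lands on the start of an object inside the
                // top-level array, hand just that one object to the deserializer.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    var record = serializer.Deserialize<Record>(reader);
                    // ...process record here, e.g. write one CSV line...
                }
            }
        }
    }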

I had to do it like this because the data scientists (DSes) were running into the problem that some of the files didn't fit into memory. The previous approach took an hour and involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10 GB, even 64 GB of memory wasn't enough and the system started swapping).

It wasn't fast even before the swapping (I learned just how slow Python can be), but at that point it basically took a day to run a single experiment. Then the data got turned into a dataframe.

I replaced that part of the Python processing and output a CSV, which Pandas could read without the data having to pass through Python code (I guess it has an optimized internal C implementation).

The preprocessor was able to run on the build machines and DSes consumed the CSV directly.

zahlman 2 hours ago | parent | next [-]

Would for example https://pypi.org/project/json-stream/ have met your needs?

torginus an hour ago | parent [-]

I'm going to go out on a limb and say no. This library seems to do the parsing in Python, and Python is slow, many times slower than Java, C#, or other languages in that class, which you find out if you try to do heavy data processing with it; that's one of the reasons I dislike the language. It's also very hard to parallelize: in C#, if you feed stuff into LINQ and the entries are independent, you can make the work parallel with PLINQ very quickly, while threads aren't really a thing in Python (or at least they weren't back then).
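
As a rough illustration of the PLINQ point, here is a hedged sketch; the file names and the Transform step are placeholders, not anything from the original pipeline.

    // Sketch: parallelize independent per-record work by dropping AsParallel()
    // into an ordinary LINQ pipeline (.NET 6+ top-level program).
    using System.IO;
    using System.Linq;

    string Transform(string line)
    {
        // Placeholder for the real per-record work (parse, reshape, serialize, ...).
        return line.ToUpperInvariant();
    }

    var results = File.ReadLines("records.jsonl")
        .AsParallel()   // fan the work out across cores
        .AsOrdered()    // optional: preserve input order
        .Select(Transform);

    File.WriteAllLines("out.csv", results);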

I've seen data processing in Python become a bottleneck and a source of frustration that forced a rewrite so many times that I came to not bother writing this kind of thing in Python in the first place.

You can make Python fast by relying on NumPy and pandas with array programming, but formatting and massaging the data so that what you want can be expressed as array-programming ops can be challenging enough that it usually became too much of a burden for me.

I wish Python were at least as fast as Node (which can also have its own share of performance cliffs).

It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code; I haven't used Python professionally in quite a few years.

briHass 5 hours ago | parent | prev [-]

This sounds similar to how in C#/.NET there are (at least) three ways of reading XML: XmlDocument, XPathDocument, or XmlReader. The first two are in-memory object models that must parse the entire document to build up an object hierarchy, which you then access through object-oriented representations of XML constructs like elements and attributes. XmlReader is stream-based: you handle tokens in the XML as they are read (forward-only).

Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET built-in parsing) has a similar token-based reader in addition to the standard (de)serialization to objects approach.
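
For reference, a minimal forward-only XmlReader sketch might look like the following (the file name and element name are invented for illustration).

    // Sketch: stream-read XML with XmlReader; nodes are handled as they are read,
    // so the whole document never has to fit in memory (.NET 6+ top-level program).
    using System;
    using System.Xml;

    using var reader = XmlReader.Create("huge.xml");
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
        {
            // Grab an attribute as we pass each element, then keep moving forward.
            Console.WriteLine(reader.GetAttribute("id"));
        }
    }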

giovannibonetti 7 hours ago | parent | prev | next [-]

You assume it is valid until it isn't, and you can have different strategies to handle that, like just skipping the broken part and carrying on.

Anyway, you write a state machine that processes the string in chunks, as you would with a regular parser, but the difference is that the parser eagerly spits out a stream of data matching the query as soon as it is found.

The objective is to reduce memory consumption as much as possible, so that your program can handle an unbounded JSON string while only keeping track of where in the structure it currently is, like a jQuery selector.

rented_mule 7 hours ago | parent | prev | next [-]

I don't know what the GP was referring to, but often this is about "JSONL" / "JSON Lines": files containing one JSON object per line. This is common for things like log files. So, process the data as each line is deserialized rather than deserializing the entire file first.
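
A small hedged sketch of that pattern, using System.Text.Json in C# (the LogEvent type and file name are made up):

    // Sketch: process a JSON Lines file one line (one JSON object) at a time
    // instead of deserializing the whole file (.NET 6+ top-level program).
    using System;
    using System.IO;
    using System.Text.Json;

    var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    foreach (var line in File.ReadLines("events.jsonl"))
    {
        if (string.IsNullOrWhiteSpace(line)) continue;

        var evt = JsonSerializer.Deserialize<LogEvent>(line, options);
        // ...handle evt as soon as its line is parsed...
    }

    record LogEvent(DateTime Timestamp, string Message);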

shakna 7 hours ago | parent | prev | next [-]

There's a whole heap of approaches, each with its own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid JSON.

You can buffer data, yield values as they become available before discarding them, use the visitor pattern, and so on.

One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream

bob1029 3 hours ago | parent | prev [-]

https://devblogs.microsoft.com/dotnet/the-convenience-of-sys...

https://learn.microsoft.com/en-us/dotnet/standard/serializat...

zjaffee 7 hours ago | parent | prev | next [-]

It's not only about how much data you have, but also about the sorts of things you are running on your data. Joins and group-bys scale much faster than any aggregation. Additionally, you get a unified platform where large teams can share code in a structured way for all data processing jobs. It's similar to how companies use k8s as a way to manage the human side of software development, in that sense.

I can, however, say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key areas where we saw rapid improvement was simply that fewer machines with vertically scaled hardware almost always outperformed any sort of distributed system (albeit not always from a price-performance perspective).

The real value often comes from the ability to do retries, leverage leftover underutilized hardware (i.e. spot instances, or your own data center at times when load is lower), handle hardware failures, etc., all while the full suite of tools above keeps working.

dapperdrake 5 hours ago | parent [-]

Other way around. Aggregation is usually faster than a join.

sgarland 3 hours ago | parent [-]

Disagree, though in practice it depends on the query, the cardinality of the various columns across tables, indices, and the RDBMS implementation (so, everything).

A simple equijoin with high cardinality and indexed columns will usually be extremely fast. The same join in a 1:M relationship might be fast, or it might result in a massive fanout. In the latter case, if your RDBMS uses a clustering index and you've designed your schemata to exploit that fact (e.g. a table called UserPurchase with a PK of (user_id, purchase_id)), it can still be quite fast.

Aggregations often imply large amounts of data being retrieved, though this is not necessarily true.

dapperdrake 2 hours ago | parent [-]

That level of database optimization is rare in practice. As soon as a non-database person gets decision-making authority, there goes your data model and disk layout.

And many important datasets never make it into any kind of database like that. Very few people provide "index columns" in their CSV files. Or they use long variable-length strings as their primary key.

The OP pertains to that kind of data: some stuff in text files.

dapperdrake 5 hours ago | parent | prev [-]

Adam Drake's example (OP) also streams from disk, and the Unix pipeline is task-parallel.