cHaOs667 5 days ago

That's what you call a DOM parser - the problem with them is that, since they deserialize all the elements into objects, bigger XML files tend to eat up all of your RAM. And this is where SAX2 parsers come into play, where you define event-based callbacks that are fired as the parser streams through the data.
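
A minimal sketch of that callback style, using Python's stdlib xml.sax (the element name "record" and the file name are made up); the parser streams the document and fires callbacks instead of building a tree:

    import xml.sax

    class RecordCounter(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.count = 0

        def startElement(self, name, attrs):
            # Fired once per opening tag as the parser streams the file.
            if name == "record":
                self.count += 1

    handler = RecordCounter()
    xml.sax.parse("huge.xml", handler)  # memory use stays flat regardless of file size
    print(handler.count)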

mort96 5 days ago | parent

The solution is simple: don't have XML files that are many gigabytes in size.

iberator 5 days ago | parent | next

A lot of telco equipment dumps multi-GB XML hourly, per BTS. We were processing a few TB of XML files on one server daily.

It's doable, just use the right tools and hacks :)

Processing schema-less or broken schema stuff is always hilarious.

Good times.

senorrib 5 days ago | parent

Lol I love the upbeat tone here. Helps me deal with my PTSD after working with XML files.

cHaOs667 5 days ago | parent | prev | next

Depending on the XML structure and the server's RAM, it can already happen once you approach 80-100 MB file sizes. And to be fair, in an enterprise context you are quite often not in a position to decide how big another system's export is. But yes, back in 2010 we built preprocessing systems that checked XMLs and split them into smaller chunks if they exceeded a certain size.
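
Not our actual code, but a rough sketch of that kind of splitter, assuming the export is a flat list of repeated child elements (the names "item", "export", the chunk size and output paths are all hypothetical), using lxml's iterparse so the input is never held in memory at once:

    from lxml import etree

    CHUNK_SIZE = 10_000  # records per output file (made up)

    def write_chunk(records, index):
        # Wrap the serialized records in a fresh root element per chunk.
        with open(f"chunk_{index:04d}.xml", "wb") as out:
            out.write(b"<export>")
            out.writelines(records)
            out.write(b"</export>")

    def split_export(path):
        records, index = [], 0
        for _, elem in etree.iterparse(path, tag="item"):
            records.append(etree.tostring(elem))
            # Free the parsed subtree and its already-processed siblings
            # so memory stays flat no matter how big the input is.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
            if len(records) >= CHUNK_SIZE:
                write_chunk(records, index)
                records, index = [], index + 1
        if records:
            write_chunk(records, index)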

lyu07282 5 days ago | parent | prev | next

Tell that to Wikimedia; I've used libxml's SAX parser in the past to parse 80 GB+ XML dumps.
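
For the curious, a rough sketch of that kind of streaming pass over a MediaWiki dump, here via lxml (Python bindings over libxml2) rather than raw SAX callbacks; the dump filename is just a placeholder:

    from lxml import etree

    def iter_titles(dump_path):
        # Stream <page> elements one at a time; the dump is never fully in RAM.
        for _, elem in etree.iterparse(dump_path, events=("end",)):
            if etree.QName(elem).localname == "page":
                yield elem.findtext("{*}title")  # {*} = any namespace (lxml >= 3.0)
                # Discard the finished page and anything before it.
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]

    # Placeholder path; a real dump is the 80 GB+ file mentioned above.
    for title in iter_titles("pages-articles.xml"):
        print(title)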

stuaxo 5 days ago | parent | prev

Some formats are just like this, and they're historical formats.