cogman10 | 4 days ago
I agree with all of this and will just add: CSV is slow. Like, really slow. Partly due to the bloat, but also partly because the format doesn't allow for speed. And because CSV is untyped, you either have to trust the producer or put in mountains of guards to make sure you can handle the weird garbage that might come through.

My company deals with a lot of CSV, and we have literally built tools and hired multiple full-time employees whose entire job is handling CSV sucking in new and interesting ways. Parquet literally eliminates half of our data ingestion pipeline simply by being typed, consistent, and fast to query.

One example of a problem we constantly run into: nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation only when a number is big enough, for example). There's a small sketch of what guarding against this looks like at the end of this comment. Escaping is also all over the board.

CSV SEEMS simple, but the lack of a real standard means it's anything but. I'd take xml over CSV.
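To make the number-format point concrete, here is a minimal sketch (Python, purely illustrative; the parse_messy_number helper and the sample values are invented for the example, not our actual tooling) of the kind of guard an untyped numeric column forces on you:

    import csv
    import io

    def parse_messy_number(raw: str) -> float:
        """Best-effort parse of a numeric CSV cell.

        Handles scientific notation, thousands separators, and
        European-style decimal commas. Real feeds need far more cases.
        """
        s = raw.strip().replace(" ", "")
        if not s:
            raise ValueError("empty numeric field")
        if "," in s and "." in s:
            # Whichever separator appears last is the decimal point.
            if s.rfind(",") > s.rfind("."):
                s = s.replace(".", "").replace(",", ".")   # 1.234,56 -> 1234.56
            else:
                s = s.replace(",", "")                     # 1,234.56 -> 1234.56
        elif "," in s:
            s = s.replace(",", ".")  # guess decimal comma; genuinely ambiguous
        return float(s)  # float() also accepts scientific notation like 3.2E+05

    # Files that use decimal commas usually ship with ';' as the field
    # delimiter -- yet another variation the reader has to be told about.
    rows = csv.reader(io.StringIO("qty\n1.234,56\n1,5\n3.2E+05\n"), delimiter=";")
    next(rows)  # skip header
    print([parse_messy_number(r[0]) for r in rows])  # [1234.56, 1.5, 320000.0]

With Parquet none of this guessing exists: the column's type is stored in the file, so the reader either hands you a proper float column or fails loudly, instead of silently turning one value into the wrong number.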
IanCal | 4 days ago | parent
> My company deals with a lot of CSV, and we have literally built tools and hired multiple full-time employees whose entire job is handling CSV sucking in new and interesting ways.

Truly shocking how many ways people manage to construct these files. I don't think people really get this if they've mostly been moving files from one system to another and haven't dealt with basically the union of horrors that all the various systems writing CSV can produce.

> Parquet literally eliminates half of our data ingestion pipeline simply by being typed, consistent, and fast to query.

Parquet has been a huge step forwards. It's not perfect, but it is really good. Most of the improvements I'd like are actually better served by stepping up from single files to groups of them in larger tables.

> One example of a problem we constantly run into: nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation only when a number is big enough, for example).

That's a new one on me, but it makes loads of sense. Dates were my issue: you hit 13/2/2003 and 5/16/2001 in the same file. So what date is 1/2/2003? (There's a tiny sketch of this at the end of this comment.)

For anyone who's never dealt with this before, let me paint a picture. You have a programming language you're working in. You import new packages every single day, written by people you start to consider adversarial after a few weeks in the job. You need to keep your system running, with new imports added every day. There are only string types. Nothing more. You must interpret them correctly. That is an understatement of what CSV files coming from random customers actually mean. I did it for a decade and was constantly shocked at the new and inventive ways people found to mess up a simple file.

> I'd take xml over CSV.

Not to rag on xml, but since it sounds like you're in (or have been in) the same boat as me and it's nice to share horror stories: I once had to manually dig through a multi-gigabyte xml file to deal with parsing issues, because somewhere, somehow, someone had managed to combine files of different encodings and we had control characters embedded in it. Just a random ^Z here and there. It's been years, so I don't remember exactly how we reconstructed what had happened, but something about the encodings and the way things were mixed together broke it. This isn't xml's fault, and it was a smaller example, but since then I've had a strong mistrust of anything that lets humans manipulate files outside of a tool that can validate them as parsable.

Also would take XML over CSV.
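Since the date thing is hard to believe until you've hit it, here is a tiny illustration (Python, made-up values, nothing from a real pipeline) of why a file that mixes the two conventions can't be repaired after the fact: some rows pin the format down, and the rest are simply ambiguous.

    from datetime import datetime

    CANDIDATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y")  # day-first vs month-first

    def possible_dates(raw: str) -> list:
        """Return the parse under every candidate format that accepts the string."""
        out = []
        for fmt in CANDIDATE_FORMATS:
            try:
                out.append(datetime.strptime(raw, fmt))
            except ValueError:
                pass
        return out

    print(possible_dates("13/2/2003"))  # one parse: only day-first is valid
    print(possible_dates("5/16/2001"))  # one parse: only month-first is valid
    print(possible_dates("1/2/2003"))   # two parses: Feb 1 or Jan 2 -- ambiguous

The best you can do is scan the whole column hoping some row disambiguates the writer's convention, and even that assumes a single writer used one convention consistently, which is exactly the assumption these files keep breaking.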