IanCal 4 days ago:
> My company deals with a lot of CSV and we literally built tools and hire multiple full time employees whose entire job is handling CSV sucking in new and interesting ways.

Truly shocking how many ways people manage to construct these files. I don't think people really get this if they've mostly been moving files from one system to another and not dealing with basically the union of horrors that all the various systems writing CSV can produce.

> Parquet literally eliminates 1/2 of our data ingestion pipeline simply by being typed, consistent, and fast to query.

Parquet has been a huge step forward. It's not perfect, but it is really good. Most of the improvements I'd like would actually come from stepping up from single files to groups of them in larger tables.

> One example of a problem we constantly run into is that nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation when a number is big enough, for example).

That's a new one on me, but it makes loads of sense. Dates were my issue - you hit 13/2/2003 and 5/16/2001 in the same file. What date is 1/2/2003? The little sketch below shows how both readings parse cleanly.

For anyone who's never dealt with this before, let me paint a picture. You have a programming language you're working in. You import new packages every single day, written by people you start to consider adversarial after a few weeks on the job. You need to keep your system running, with new imports added every day. There are only string types. Nothing more. You must interpret them correctly.

That picture is an understatement of what CSV files coming from random customers actually mean. I did it for a decade and was constantly shocked at the new and inventive ways people found to mess up a simple file.

> I'd take xml over CSV.

Not to rag on xml, but it feels like you're in (or have been in) the same boat as me, and it's nice to share horror stories: I once had to manually dig through a multi-gig XML file to deal with parsing issues, because somewhere, somehow, someone had managed to combine files of different encodings and we had control characters embedded in it. Just a random ^Z here and there. It's been years, so I don't remember exactly how we reconstructed what had happened, but it was something to do with encodings and files being spliced together. That isn't XML's fault, and it was a smaller example, but since then I've had a strong mistrust of anything that lets humans manipulate files outside of something that can validate them as parsable.

Also would take XML over CSV.
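To make the date and number ambiguity above concrete, here's a minimal sketch in plain Python (the format list and values are illustrative only, not anyone's real ingestion code):

    from datetime import datetime

    DATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]   # day-first vs month-first

    def possible_dates(cell):
        """Every calendar date a raw CSV string could plausibly mean."""
        hits = []
        for fmt in DATE_FORMATS:
            try:
                hits.append(datetime.strptime(cell, fmt).date().isoformat())
            except ValueError:
                pass
        return hits

    for cell in ["13/2/2003", "5/16/2001", "1/2/2003"]:
        print(cell, "->", possible_dates(cell))
    # 13/2/2003 -> ['2003-02-13']                 only day-first parses
    # 5/16/2001 -> ['2001-05-16']                 only month-first parses
    # 1/2/2003  -> ['2003-02-01', '2003-01-02']   both parse; the file alone can't tell you which

    # Numbers have the same problem: the bytes don't say which locale wrote them.
    print(float("1.234"))                     # 1.234  if '.' is the decimal separator
    print(float("1.234".replace(".", "")))    # 1234.0 if '.' was a thousands separator
    print(float("1e3"))                       # 1000.0 scientific notation mixed into the same column

Without out-of-band knowledge about the producer, no parser can resolve the third date or decide which way to read "1.234"; that's the guessing game a typed format removes.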
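And on the ^Z story: part of what makes XML tolerable is that a strict parser refuses that kind of corruption up front instead of silently ingesting it. A tiny illustration with the Python stdlib (the document here is made up):

    import xml.etree.ElementTree as ET

    clean = "<row><name>ACME</name></row>"
    dirty = "<row><name>ACME\x1a</name></row>"   # 0x1A is the stray ^Z control character

    print(ET.fromstring(clean).find("name").text)   # ACME

    try:
        ET.fromstring(dirty)
    except ET.ParseError as err:
        # A raw 0x1A is not a legal XML 1.0 character, so the parse fails
        # loudly here rather than letting the junk flow downstream.
        print("rejected:", err)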
cogman10 4 days ago (in reply):
Yeah, the xml thing wasn't about how good xml is, just how terrible CSV is :). Particularly for tabular data, parquet is really good. Even a SQLite database isn't a terrible way to send that sort of data.

At least with XML, all the problems of escaping are effectively solved. And since it's (usually) tool-generated, it's likely valid. That means I can feed it into my favorite xml parser and pound the data out of it (usually). There are still a lot of the encoding issues I mentioned with CSV, though.
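Since parquet keeps coming up, here's a minimal sketch of the "typed, consistent" difference (assumes pandas with pyarrow installed; the column names and file name are just illustrative). The same frame round-tripped through CSV comes back with re-guessed types, while Parquet carries its schema with it:

    import io
    import pandas as pd

    df = pd.DataFrame({
        "id": ["00042", "00108"],                           # leading zeros matter
        "when": pd.to_datetime(["2003-02-01", "2001-05-16"]),
        "value": [1.5e3, 2.0],
    })

    # CSV: everything on disk is text, so the reader has to guess types again.
    csv_back = pd.read_csv(io.StringIO(df.to_csv(index=False)))
    print(csv_back.dtypes)   # id is now int64 (leading zeros gone), when is object

    # Parquet: the schema travels with the data.
    df.to_parquet("roundtrip.parquet")
    pq_back = pd.read_parquet("roundtrip.parquet")
    print(pq_back.dtypes)    # id stays a string column, when stays datetime64[ns]

Whether the declared schema matches what the producer intended is a separate fight, but at least it's stated in the file instead of inferred from strings.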