chaps 4 hours ago
"What does it mean to clean the data?" This isn't possible to answer generally, but I'm sure you know that. Look -- I've been in nonstop litigation for data through FOIA for the past ten years. During litigation I can definitely push back on messy data and I have, but if I were to do that on every little "obviously wrong" point, then my litigation will get thrown out for me being a twat of a litigant. Again, I'd rather have the data and publish it with known gotchas. Here's an example: https://mchap.io/using-foia-data-and-unix-to-halve-major-sou... Should I have told the Department of Finance to fuck off with their messy data? No -- even if I want to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here -- once me and others started publishing stuff about tickets data and more journalists got involved, the data became cleaner over time. | ||
torginus 2 hours ago
Sorry, I meant to say that it's not always possible to clean the data if it was corrupt in the first place because it was collected in a buggy manner. And having a few inexplicable outliers in a dataset can often erode confidence in the rest. Since this is not data you collected, I understand you have to work with what you have. By the way, very interesting post, and nice job!