chaps 5 hours ago

I have mixed feelings about this. On one hand, yeah, stop publishing garbage data. But as a FOIA nerd... I'll take the data in whatever state it's in. I'm not personally able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data knowing it contains garbage? Hell no. Instead, we should learn and cultivate techniques for working with shit data. Should I attempt to clean it? Sure. But that becomes a liability problem very, very quickly.

torginus 4 hours ago | parent | next

What does it mean to clean the data?

Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?

If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?

Working with data like this means unknown error bars. I've had weird shit happen where I fixed the tracing pipeline, only for the metrics people to complain that they had been correcting for the errors downstream — and now, because of those corrections, the whole thing looked out of shape.
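The "where do you draw the line?" question above can be made concrete. A minimal sketch (my own illustration, not from the thread) of one common heuristic, Tukey's IQR fences: the point is that the cutoff multiplier is pure convention, so what counts as "garbage" shifts the moment you move it.

```python
def iqr_outliers(values, k=1.5):
    """Return (kept, flagged) using Tukey fences with multiplier k.

    k=1.5 is the textbook default, but nothing in the data dictates it;
    widen k and yesterday's "implausible outlier" becomes a kept point.
    """
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]          # crude quartiles, fine for a sketch
    q3 = xs[(3 * n) // 4]
    fence = k * (q3 - q1)
    lo, hi = q1 - fence, q3 + fence
    kept = [v for v in values if lo <= v <= hi]
    flagged = [v for v in values if v < lo or v > hi]
    return kept, flagged

data = [10, 11, 12, 11, 10, 13, 12, 11, 500]  # one "implausible" reading
kept, flagged = iqr_outliers(data)            # flags 500 at k=1.5
kept2, flagged2 = iqr_outliers(data, k=1000)  # same data, nothing flagged
```

Same dataset, two defensible thresholds, two different answers about which points are real — which is exactly why the error bars stay unknown.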

chaps 4 hours ago | parent

"What does it mean to clean the data?"

This isn't possible to answer generally, but I'm sure you know that.

Look -- I've been in nonstop litigation over FOIA data for the past ten years. During litigation I can definitely push back on messy data, and I have, but if I did that for every little "obviously wrong" point, my litigation would get thrown out for me being a twat of a litigant.

Again, I'd rather have the data and publish it with known gotchas.

Here's an example: https://mchap.io/using-foia-data-and-unix-to-halve-major-sou...

Should I have told the Department of Finance to fuck off with their messy data? No -- even if I wanted to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here -- once I and others started publishing stuff about the ticket data and more journalists got involved, the data got cleaner over time.

torginus 2 hours ago | parent

Sorry, I meant that it's often not possible to clean data that's corrupt in the first place, because it was collected in a buggy manner. And a few inexplicable outliers in a dataset can erode confidence in the rest of it.

Since this isn't data you collected, I understand you have to work with what you have. By the way, very interesting post, and nice job!

hermitcrab 5 hours ago | parent | prev

So you expect the thousands of people trying to use the fuel price data to each clean and validate it individually, rather than the supplier doing it?

yorwba 4 hours ago | parent | next

One of those people can republish their cleaned and validated version, and the 999 others can compare it to the original to decide whether they agree with how it was cleaned.
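That comparison is cheap to do. A minimal sketch (the column names and values are hypothetical, not from any real fuel-price dataset) of diffing a republished cleaned release against the original, so each downstream user can audit exactly what the cleaner changed, dropped, or added:

```python
import csv
import io

original = """id,price
1,1.52
2,152.0
3,1.49
"""

cleaned = """id,price
1,1.52
2,1.52
3,1.49
"""

def rows(text):
    """Index CSV records by their id column."""
    return {r["id"]: r for r in csv.DictReader(io.StringIO(text))}

orig, clean = rows(original), rows(cleaned)

# Records present in both releases but with different values.
changed = {k: (orig[k], clean[k])
           for k in orig
           if k in clean and orig[k] != clean[k]}
# Records the cleaner removed or invented.
dropped = sorted(set(orig) - set(clean))
added = sorted(set(clean) - set(orig))
```

Here `changed` shows record `2` went from `152.0` to `1.52` — a plausible decimal-point fix, but now it's a documented editorial decision the other 999 users can accept or reject rather than an invisible one.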

chaps 5 hours ago | parent | prev [-]

What...?