stared 6 hours ago

I dislike the premise. I mean, good data is wonderful.

But if institutions are expected to release clean data or nothing, it is almost always the latter.

What matters is to provide as much methodology and as many caveats as possible, even informally. There is a difference between being told "data covers 72% of companies registered in..." and assuming the data is complete and authoritative when records are actually missing.

(Source: 10 years ago I worked a lot with official data. All data requires cleaning.)

Mordisquitos 5 hours ago | parent | next [-]

But surely we should expect some basic sanity checks on published data? This isn't some petrol stations being placed in the middle of a field due to minor typos or bad rounding, or some petrol stations' prices being listed as all 1.00 £/l out of laziness, or even a case of all unknown locations being listed as 0°0'0" N, 0°0'0" E by default. What the author reports appear to be mistakes which should be rather trivially detectable on input.
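A minimal sketch of what such input-time checks could look like, assuming a hypothetical `Station` record and made-up price bounds (neither comes from the article or any real dataset). It flags problems rather than rejecting records:

```python
from dataclasses import dataclass


@dataclass
class Station:
    # Hypothetical record layout, for illustration only.
    name: str
    lat: float
    lon: float
    price_per_litre: float  # assumed to be in GBP


def sanity_flags(s, min_price=0.5, max_price=5.0):
    """Return a list of flags for a record; an empty list means nothing obvious.

    The price bounds are illustrative assumptions, not official thresholds.
    """
    flags = []
    # Coordinates outside the valid range for any point on Earth.
    if not (-90 <= s.lat <= 90) or not (-180 <= s.lon <= 180):
        flags.append("coords_out_of_range")
    # 0, 0 is a common default for unknown locations.
    if s.lat == 0 and s.lon == 0:
        flags.append("null_island")
    # A price far outside the plausible range suggests a unit or decimal error.
    if not (min_price <= s.price_per_litre <= max_price):
        flags.append("implausible_price")
    return flags
```

For example, `sanity_flags(Station("x", 0, 0, 1.45))` reports only the default-location flag, while a correctly entered record returns an empty list.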

ZiiS 4 hours ago | parent | next [-]

The problem is that stats can actually do more with all the data, including the obvious errors. If you start filtering out records where someone mis-entered the lat/long, you might introduce a new bias.

chaps 5 hours ago | parent | prev [-]

Sure we should indeed expect that they do that. But look at enough data and you'll learn that those expectations are a path towards never-ending frustration. I've been there, spending >100 hours cleaning data... that never got published because I was too damn focused on the dozens of years of errors that many, many people created.

To be clear, I'm not saying that we should accept messy data. Just, reality is messy and it's naive to think we can catch and remove all of reality's messiness -- which includes the bureaucratic slop that led to the data being published in the first place.

freehorse 5 hours ago | parent | prev | next [-]

I don't think these issues are close to the issues the article talks about. The author does not talk about data coverage, data collection methodologies, missing values or whatever, but about data that is actually wrong, i.e. location coordinates, prices, and numbers that make no sense, including swapped latitude/longitude and misplaced decimal points.

On the other hand, I agree that bad (but usually fixable) data is better than no data.

stared 4 hours ago | parent [-]

Yep, in real data expect genuinely confusing columns, NaNs cast to values like 1673, duplicates, etc.

I prefer getting data with swapped lat/lng (a trivial fix), or prices labeled as dollars but actually in cents, to getting no data.
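To illustrate why swapped lat/lng is a trivial fix, here is a rough sketch. It assumes the data should fall inside a known region; the UK bounding box below is an approximate, illustrative assumption, and `fix_swapped` is a hypothetical helper name:

```python
# Approximate bounding box for the UK (assumed values, for illustration only).
UK_LAT = (49.0, 61.0)
UK_LON = (-9.0, 2.0)


def in_box(lat, lon):
    """True if the point lies inside the assumed bounding box."""
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]


def fix_swapped(lat, lon):
    """Return (lat, lon, was_swapped).

    The pair is swapped only when it is outside the box as given
    but inside the box once swapped; otherwise it is left untouched.
    """
    if in_box(lat, lon):
        return lat, lon, False
    if in_box(lon, lat):
        return lon, lat, True
    return lat, lon, False
```

This only works when the region's latitude and longitude ranges do not overlap, which holds for the box assumed here. Points that fit neither way are returned unchanged so they can be reviewed by hand.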

sd9 6 hours ago | parent | prev | next [-]

Agreed, pretty much all data is flawed. I still want my hands on it.

readthenotes1 5 hours ago | parent | prev [-]

I read the premises as "1. at least look at it 2. Have a way to fix it"

Those seem reasonable asks.

Edit to add: the tragedy of the school in Minab is an example of how badly things can go, and it just hints at how much worse bad data can be.