freehorse 4 hours ago

> omit the rows that are obviously wrong

This can skew the dataset and lead to misinterpreted results if the errors are not randomly distributed across rows.

E.g. if all the data from a specific location (or year, etc.) is wrong, then this kind of cleaning would exclude that location entirely, which, depending on the context, may or may not be a problem. Or values might be wrong only above a specific threshold. Or the errors might be non-random in any other way.

Removing data is never a neutral choice, and which data is removed should always be taken into consideration.

hermitcrab 4 hours ago | parent [-]

>Removing data is never a neutral choice, and it should always be taken into consideration (which data is removed).

Absolutely. If you have obviously wrong data your choices are generally:

1. Leave the bad data in.

2. Leave the bad data in and flag it as suspect.

3. Omit the bad data.

4. Correct the bad data.

Which choice is best depends on context and requires judgement. But I find it hard to imagine any situation where option 1 is the right choice.
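Option 2 might look something like this in pandas: keep every row, but add a column marking the suspect ones. This is a hypothetical sketch; the column names and the rough UK bounding box are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "B", "C"],
    "lat": [51.5, 52.1, -12.3],   # -12.3 is nowhere near the UK
    "lon": [-0.1, 0.5, 96.8],
})

# Assumed rough UK bounding box; real checks would be more careful.
valid = df["lat"].between(49.0, 61.0) & df["lon"].between(-8.7, 2.0)

# Flag rather than drop: downstream analysis can decide what to do,
# and the original data is still there for audit purposes.
df["suspect"] = ~valid
```

Filtering with `df[~df["suspect"]]` then gives the cleaned view without ever destroying the original rows.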

Obviously the best solution is to do basic validation as the data is entered, so that people can't add a location in the Indian ocean to a UK dataset. It seems rather negligent that they didn't do this.
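An entry-time check along those lines could be as simple as the sketch below. The function names, bounds, and record shape are all hypothetical, just to show the idea of rejecting bad coordinates before they reach the dataset.

```python
# Hypothetical entry-time validation: a record with coordinates outside
# a rough UK bounding box is rejected before it enters the dataset.
def within_uk(lat: float, lon: float) -> bool:
    return 49.0 <= lat <= 61.0 and -8.7 <= lon <= 2.0

def add_record(dataset: list, record: dict) -> None:
    if not within_uk(record["lat"], record["lon"]):
        raise ValueError(f"coordinates outside expected region: {record}")
    dataset.append(record)

records = []
add_record(records, {"site": "A", "lat": 51.5, "lon": -0.1})      # accepted
try:
    add_record(records, {"site": "B", "lat": -6.2, "lon": 71.4})  # Indian Ocean
except ValueError:
    pass  # rejected at entry; the dataset is never polluted
```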

chaps 3 hours ago | parent [-]

Like I said in a different post, there are legal reasons for why you would want the original data. Deleting the data from the dataset is negligent.

If you want something to blame, blame the system that allowed the data to be bad in the first place. You're pointing your finger at the wrong people and it's unreasonable of you to call them negligent.