hermitcrab · 4 hours ago
Or just omit the rows that are obviously wrong (and document the fact).
freehorse · 3 hours ago
> omit the rows that are obviously wrong

This can skew the dataset and lead to misinterpreted results if which rows are wrong is not completely random. E.g. if all data from a specific location (or year, etc.) is wrong, this kind of cleaning would completely exclude that location, which, depending on the context, may or may not be a problem. Or values might be wrong only above a specific threshold. Or the errors might be non-random in some other way. Removing data is never a neutral choice, and which data is removed should always be taken into consideration.
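A minimal sketch of the skew being described, with invented data: suppose a faulty feed makes every reading from one location "obviously wrong", so a naive sanity-check filter silently drops that location entirely.

```python
# Hypothetical sensor readings: (location, value). Suppose a faulty feed
# made every reading from "north" negative, which looks "obviously wrong".
rows = [
    ("north", -3.0), ("north", -2.5), ("north", -3.2),
    ("south", 4.0), ("south", 4.4), ("south", 3.8),
]

# Naive cleaning: drop anything that fails the sanity check.
cleaned = [(loc, v) for loc, v in rows if v >= 0]

# The bias: "north" has vanished entirely, so any per-location analysis
# now excludes it without anyone having decided that on purpose.
surviving_locations = {loc for loc, _ in cleaned}
print(surviving_locations)  # → {'south'}
```

Because the errors correlate with location, "remove the bad rows" is equivalent to "remove a location", which is a much bigger decision than it looks.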
| |||||||||||||||||
chaps · 4 hours ago
"Obviously wrong" is a never-ending rabbit hole, and you'll never, ever be satisfied, because there will always be something "obviously wrong" with the data. Messy data is a signal. You're wrong to omit signal.
| |||||||||||||||||
GMoromisato · 4 hours ago
Deleting the row loses information, such as the existence of that gas station. A better solution is to add a field indicating that "the row looks funny to the person who published the data". Which, I guess, is useful to someone? But deleting or changing data effectively corrupts the source data, and now I can't trust it.
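A sketch of the flag-don't-delete approach, with invented field names (`station`, `price`, `looks_funny`): every row survives, and downstream consumers decide what to do with the flag.

```python
# Instead of deleting suspect rows, keep them and record a quality flag.
rows = [
    {"station": "A", "price": 3.49},
    {"station": "B", "price": -1.00},  # a negative price looks funny
]

def flag_suspect(row):
    """Return a copy of the row with a quality flag; never mutate the source."""
    annotated = dict(row)
    annotated["looks_funny"] = annotated["price"] < 0
    return annotated

annotated_rows = [flag_suspect(r) for r in rows]

# Every station still exists in the output; consumers filter on the flag.
print([r["station"] for r in annotated_rows if not r["looks_funny"]])  # → ['A']
```

Copying rather than mutating keeps the source data intact, which is the whole point: the published data stays trustworthy, and the publisher's judgment is recorded separately.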