pavel_lishin 21 hours ago

Anonymizing data is incredibly difficult to do: https://www.theguardian.com/technology/2014/jun/27/new-york-...

> New York City has released data of 173m individual taxi trips – but inadvertently made it "trivial" to find the personally identifiable information of every driver in the dataset.
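
A rough sketch of why that kind of release is "trivial" to reverse, assuming (as was widely reported for this dataset) that the identifiers were hashed with unsalted MD5. The digit-letter-digit-digit format and the sample value 7B42 below are illustrative assumptions, not taken from the data. When the space of valid identifiers is small, you can hash every candidate once and then look any published value up:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(value: str) -> str:
    """Unsalted MD5 of an identifier, as a hex string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup() -> dict[str, str]:
    """Hash every identifier in the assumed digit-letter-digit-digit format."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        ident = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(ident)] = ident
    return table


lookup = build_lookup()          # 10 * 26 * 10 * 10 candidates in total
published = md5_hex("7B42")      # what an "anonymized" row would contain
recovered = lookup[published]    # the original identifier, recovered by lookup
```

The whole table here has only 26,000 entries, so building it takes a fraction of a second. A hash only hides the input when the input space is too large to enumerate.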

afarah1 20 hours ago

Interesting read, thanks. The related article shows that even more robust anonymization techniques may still be insufficient (in the case of the taxi rides, spatiotemporal analysis could still lead to de-anonymization). All the more reason to reduce data collection. Unfortunately, the trend is the opposite for governments all around the world.

wtallis 20 hours ago

That example only demonstrates leaked information about the drivers, not the passengers/customers. And the "anonymized" driver and license data wouldn't need to be released in any form at all to produce a dataset useful for public transportation planning: approximate time of day and approximate location are sufficient to estimate demand, and there's no need to keep track of who is making which trips.

jadyoyster 19 hours ago

Exactly, all you need to start is "a significant number of people from this area want to go to those areas".
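
A minimal sketch of what that could look like, assuming trip records arrive as dicts with the hypothetical field names used below. The idea is to publish only counts per (hour, origin zone, destination zone); the driver identifier is never read, and coordinates are coarsened by rounding to two decimal places (roughly a 1 km grid, an arbitrary choice for illustration):

```python
from collections import Counter
from datetime import datetime


def zone(lat: float, lon: float) -> tuple[float, float]:
    """Coarsen a coordinate to a grid cell by rounding to 2 decimals."""
    return (round(lat, 2), round(lon, 2))


def demand_counts(trips: list[dict]) -> Counter:
    """Count trips per (hour of day, origin zone, destination zone)."""
    counts = Counter()
    for trip in trips:
        hour = datetime.fromisoformat(trip["pickup_time"]).hour
        origin = zone(trip["pickup_lat"], trip["pickup_lon"])
        dest = zone(trip["dropoff_lat"], trip["dropoff_lon"])
        counts[(hour, origin, dest)] += 1  # "medallion" is deliberately ignored
    return counts


# Made-up sample records, for illustration only.
trips = [
    {"medallion": "7B42", "pickup_time": "2013-01-01T08:05:00",
     "pickup_lat": 40.7512, "pickup_lon": -73.9934,
     "dropoff_lat": 40.7614, "dropoff_lon": -73.9776},
    {"medallion": "3C17", "pickup_time": "2013-01-01T08:40:00",
     "pickup_lat": 40.7498, "pickup_lon": -73.9911,
     "dropoff_lat": 40.7589, "dropoff_lon": -73.9812},
    {"medallion": "7B42", "pickup_time": "2013-01-01T17:15:00",
     "pickup_lat": 40.7128, "pickup_lon": -74.0060,
     "dropoff_lat": 40.7306, "dropoff_lon": -73.9866},
]

counts = demand_counts(trips)
```

Cells with very few trips can still single out an individual, so a real release would probably also need to merge or drop sparse cells.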

the_sleaze_ 19 hours ago

It's really not, unless of course you are disincentivized from providing anonymous data. The ground is thick with prior art and existing solutions.

https://www.hhs.gov/hipaa/for-professionals/special-topics/d...

wtallis 19 hours ago

Well, there are pitfalls, and it's easy for an "anonymization" scheme to leak more detail than it would seem at first glance. But I agree that motive plays a big role. If your purpose in sharing data is to make money off it, then you'll be trying to share as much data as possible, and will try to convince yourself that your anonymization is "good enough".

If you're sharing data for a specific purpose, then it's much easier to limit the data sharing to suit that purpose: omit irrelevant data, aggregate where possible, and anonymize individual data points only when you actually need to share that level of detail.