saithir 12 hours ago

Because unlike the authors of this set, who went and stripped usernames and permalinks out of the posts to anonymize it, that set you mention just grabbed data out of the API as-is (at least based on what's left of its Hugging Face description).

That's the difference.

spiffytech 10 hours ago

Just a reminder that anonymization is much harder than merely removing metadata:

Every time I hear "anonymous data", I think of that time AOL published anonymized search logs (for academic research). The anonymization was negligent, and an NYT reporter de-anonymized and tracked down one of the users with the local & personal info present in the search queries.

https://en.wikipedia.org/wiki/AOL_search_log_release

https://web.archive.org/web/20130404175032/http://www.nytime...