To be clear, we are making a clear concession here that the people weren't truly anonymous. But we did use an LLM to remove any identifying information from HN making them quasi-anonymous, this is more described in the appendix Table 2.

We do also make a more real world like test in section 2. There we use the anthropic interviewer dataset which Anthropic redacted, from the redacted interviews our agent identified 9/125 people based on clues.

The blog post might be more approachable for a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...

▲

dang 4 hours ago | parent | next [-]

Thanks for that link! I'll put in the top text.

Edit: actually I've re-upped your submission of that link and moved the links to the paper to the toptext instead. Hopefully this will ground the discussion more in the actual study.

▲

ranger_danger 4 hours ago | parent | prev [-]

But you also relied on people giving away too much personal information about themselves... which won't always be the case.

▲

majorchord 4 hours ago | parent | next [-]

Yeah my first thought was "of course an LLM can do that, we didn't need a paper to tell us". I would be more impressed if it could do it without that information, such as by analyzing writing styles and other cues that aren't direct PII.

	▲	intended 4 hours ago \| parent [-]
		It’s the same thing as theft and locks. Any motivated attacker will overcome any rudimentary obstacle. We still use locks because most opportunistic attackers are the most prevalent. Even the paper on improved phishing showed that LLMs reduce the cost to run phishing attacks, which made previously unprofitable targets (lower income groups), profitable. The most common deterrent is inconvenience, not impossibility.

▲

DalasNoin 4 hours ago | parent | prev | next [-]

I agree that these accounts probably on average still contain more information than the average pseudonymous account. I think we could try to use the LLM to increasingly ablate more information and see how it performance decays – to be clear we already heavily remove such information, see Table 2 appendix. But I don't expect that to change the basic conclusions.

	▲	ranger_danger 2 hours ago \| parent [-]
		I also wonder how well the LLM would do with less direction e.g. just ask it to analyze someone's posts and "figure out what city they live in based on everything you know about how to identify someone from online posts".

▲

famouswaffles 4 hours ago | parent | prev [-]

Over a large enough timeframe (often a couple years at most), almost everyone online gives too much information about themselves. A seemingly innocuous statement can pin you to an exact city and so on.

	▲	ranger_danger 2 hours ago \| parent [-]
		I would be quite impressed if someone could figure out what city I live in from my 4.5 year old account, but I highly doubt it.