LLMs Can Unmask Pseudonymous Users at Scale With Surprising Accuracy
upstart writes:
Burner accounts on social media sites can increasingly be analyzed to identify the pseudonymous users who post to them using AI in research that has far-reaching consequences for privacy on the Internet, researchers said.
The finding, from a recently published research paper, is based on experiments that correlated specific individuals with accounts or posts across more than one social media platform. The success rate was far greater than in prior deanonymization work, which relied on humans assembling structured datasets suitable for algorithmic matching, or on manual effort by skilled investigators. Recall (the share of users successfully deanonymized) was as high as 68 percent, and precision (the share of guesses that correctly identified the user) was up to 90 percent.
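Precision and recall here are the standard retrieval metrics. A minimal sketch, using invented pseudonyms and guesses (not data from the paper), shows how such figures are computed:

```python
# Toy illustration of precision/recall for a deanonymization attack.
# All accounts and identities below are invented for the example.
ground_truth = {          # pseudonym -> true identity
    "throwaway1": "alice",
    "burner_xyz": "bob",
    "anon42":     "carol",
    "lurker9":    "dave",
    "gh0st":      "erin",
}
guesses = {               # the attacker's guesses (only 4 of 5 users)
    "throwaway1": "alice",   # correct
    "burner_xyz": "bob",     # correct
    "anon42":     "carol",   # correct
    "lurker9":    "erin",    # wrong
}

correct = sum(1 for p, ident in guesses.items()
              if ground_truth.get(p) == ident)
precision = correct / len(guesses)        # fraction of guesses that are right
recall = correct / len(ground_truth)      # fraction of all users identified

print(precision, recall)  # 0.75 0.6
```

With 3 correct guesses out of 4 made, precision is 0.75; out of 5 total users, recall is 0.6. The paper's 90 percent precision at 68 percent recall is the same trade-off at scale.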
The findings have the potential to upend pseudonymity, an imperfect but often sufficient privacy measure used by many people to post queries and participate in sometimes sensitive public discussions while making it hard for others to positively identify the speakers. The ability to cheaply and quickly identify the people behind such obscured accounts opens them up to doxxing, stalking, and the assembly of detailed marketing profiles that track where speakers live, what they do for a living, and other personal information. That protection, the findings suggest, no longer holds.
"Our findings have significant implications for online privacy," the researchers wrote. "The average online user has long operated under an implicit threat model where they have assumed pseudonymity provides adequate protection because targeted deanonymization would require extensive effort. LLMs invalidate this assumption."
The researchers collected several datasets from public social media sites to test the techniques while preserving the privacy of the speakers. The first paired posts from Hacker News with LinkedIn profiles, linked through cross-platform references that appeared in user profiles; the researchers then stripped all identifying references from the posts and ran a large language model on them. The second came from a Netflix release of micro-identities, such as individual preferences, recommendations, and transaction records, which a 2008 research paper showed could identify users and reveal their political affiliations and other personal information. The last dataset was built by splitting a single user's Reddit history.
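The split-history setup resembles a classical linkage task: given one half of a user's posts, find the matching half among candidates. A minimal sketch of the older, structured-matching style the paper improves on, using simple bag-of-words cosine similarity over invented post fragments (not the paper's LLM-based method):

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector for a blob of posts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented post histories, split in half per hypothetical user.
first_halves = {
    "user_a": "rust borrow checker lifetimes cargo build errors",
    "user_b": "sourdough starter hydration crumb bake oven temps",
}
second_halves = {
    "acct_1": "baking sourdough oven spring crumb scoring tips",
    "acct_2": "cargo clippy rust async lifetimes traits generics",
}

# Link each first half to its most similar candidate second half.
links = {
    user: max(second_halves,
              key=lambda acct: cosine(bow(text), bow(second_halves[acct])))
    for user, text in first_halves.items()
}
print(links)  # {'user_a': 'acct_2', 'user_b': 'acct_1'}
```

Lexical overlap suffices in this toy case; the point of the new work is that LLM agents can perform such linkage from unstructured free text, without two datasets sharing a schema.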
"What we found is that these AI agents can do something that was previously very difficult: starting from free text (like an anonymized interview transcript) they can work their way to the full identity of a person," Simon Lermen, a co-author of the paper, told Ars. "This is a pretty new capability, previous approaches on re-identification generally required structured data, and two datasets with a similar schema that could be linked together."
Read more of this story at SoylentNews.