Last summer, data researchers published an open-access paper on their “white hat” hacking—they were trying to see how easy it is to “de-anonymize” data that we think is anonymous. It was disturbingly easy, according to study co-author Carlo Ratti:
First, they combined two anonymized datasets of people in Singapore, one of mobile phone logs and the other of transit trips, each containing “location stamps” detailing just the time and place of each data point. Then they used an algorithm to match users whose data overlapped closely between each set (in other words, they had phone logs and transit logs with similar time and location stamps) and tracked how closely those stamps matched up over time, eliminating false positives as they went. In the end, it took a week to match up 17% of the users and 11 weeks to get to a 95% rate of accuracy. (With the added GPS data from smartphones, it took less than a week to hit that number.) Kelsey Campbell-Dollaghan, “Sorry, your data can still be identified even if it’s anonymized” at Fast Company (December 10, 2018)
As the researchers explain, “We show that with higher frequency data collection becoming more common, we can expect much higher success rates in even shorter intervals.”
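The matching technique the researchers describe can be illustrated with a minimal sketch. The datasets, names, and stamps below are hypothetical, and real studies use probabilistic matching over weeks of data with false-positive elimination; this toy version just pairs each phone user with the transit user whose (time, location) stamps overlap the most:

```python
# Toy "anonymized" datasets: user id -> list of (time slot, location) stamps.
# All identifiers and records here are invented, for illustration only.
phone_logs = {
    "phone_A": [(1, "Orchard"), (2, "Raffles"), (5, "Changi")],
    "phone_B": [(1, "Jurong"), (3, "Raffles"), (6, "Bedok")],
}
transit_logs = {
    "card_X": [(1, "Jurong"), (3, "Raffles"), (6, "Bedok")],
    "card_Y": [(1, "Orchard"), (2, "Raffles"), (5, "Changi")],
}

def match_users(logs_a, logs_b):
    """Pair each user in logs_a with the user in logs_b whose
    (time, location) stamps overlap the most."""
    matches = {}
    for user_a, stamps_a in logs_a.items():
        set_a = set(stamps_a)
        best_match, best_overlap = None, 0
        for user_b, stamps_b in logs_b.items():
            overlap = len(set_a & set(stamps_b))
            if overlap > best_overlap:
                best_match, best_overlap = user_b, overlap
        matches[user_a] = best_match
    return matches

print(match_users(phone_logs, transit_logs))
# {'phone_A': 'card_Y', 'phone_B': 'card_X'}
```

Neither dataset contains a name, yet overlapping location stamps link the two pseudonyms; if either account is ever tied to a real identity, both trails are exposed.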
The more we live online, the more “anonymous” information creates a complete picture, as another recent study showed:
Now researchers from Belgium’s Université catholique de Louvain (UCLouvain) and Imperial College London have built a model to estimate how easy it would be to deanonymise any arbitrary dataset. A dataset with 15 demographic attributes, for instance, “would render 99.98% of people in Massachusetts unique”. And for smaller populations, it gets easier: if town-level location data is included, for instance, “it would not take much to reidentify people living in Harwich Port, Massachusetts, a city of fewer than 2,000 inhabitants”. Alex Hern, “‘Anonymised’ data can never be totally anonymous, says study” at The Guardian
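The intuition behind figures like that 99.98% can be shown by counting how many records in a dataset are unique given a chosen set of attributes. The records below are invented for illustration; the real study modeled 15 attributes over whole state populations:

```python
from collections import Counter

# Hypothetical records: (ZIP code, birth year, gender).
# Even three attributes make most records unique in a small sample.
records = [
    ("02646", 1980, "F"),
    ("02646", 1980, "M"),
    ("02646", 1975, "F"),
    ("02646", 1980, "F"),  # shares all three attributes with the first record
    ("02650", 1990, "M"),
]

def fraction_unique(records, attrs):
    """Fraction of records whose attribute combination appears exactly once."""
    key = lambda r: tuple(r[i] for i in attrs)
    counts = Counter(key(r) for r in records)
    unique = sum(1 for r in records if counts[key(r)] == 1)
    return unique / len(records)

print(fraction_unique(records, [0]))        # ZIP alone: 0.2
print(fraction_unique(records, [0, 1, 2]))  # all three attributes: 0.6
```

Each attribute added partitions the population into smaller and smaller groups; with enough attributes, almost every group has exactly one member, and “anonymous” becomes a fingerprint.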
The authors of the open-access Belgian study warn, “Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.” The General Data Protection Regulation (GDPR) is a European Union policy on data protection.
This data, which can easily be “de-anonymized,” is often for sale, they point out:
“Modern datasets contain a large number of points per individuals,” write the authors. “For instance, the data broker Experian sold [data science and analytics company] Alteryx access to a de-identified dataset containing 248 attributes per household for 120M Americans.”
That anonymized data sets can be de-anonymized isn’t itself news. In 2018, researchers at the DEF CON hacking conference demonstrated how they were able to legally and freely acquire the apparently anonymous browsing history of 3 million Germans and then quickly de-anonymize portions of it. The researchers were able to uncover, for example, the porn habits of a specific German judge. Jack Morse, “Sorry, your ‘anonymized’ data probably isn’t anonymous” at Mashable (July 23, 2019)
People used to think that Big Data would solve our problems, but this particular problem is a direct result of having so many records easily accessible in a few places. And there is no one simple answer, as a Cambridge professor of security engineering explains:
Knowing that people in Acacia Avenue are more likely to buy big cars, and that forty-three year olds are too, is of almost no value compared with knowing that the forty-three year old who lives in Acacia Avenue is looking to buy a new car right now. Knowing how much he’s able to spend opens the door to ever more price discrimination, which although unfair is both economically efficient and profitable. We know of no technological silver bullet, no way to engineer an equilibrium between surveillance and privacy; boundaries will have to be set by other means. Ross Anderson, “2017: What scientific term or concept ought to be more widely known?” at Edge
One proposed remedy is an approach to privacy known as differential privacy:
Differential privacy is a rigorous mathematical definition of privacy. In the simplest setting, consider an algorithm that analyzes a dataset and computes statistics about it (such as the data’s mean, variance, median, mode, etc.). Such an algorithm is said to be differentially private if by looking at the output, one cannot tell whether any individual’s data was included in the original dataset or not. In other words, the guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset — anything the algorithm might output on a database containing some individual’s information is almost as likely to have come from a database without that individual’s information. Most notably, this guarantee holds for any individual and any dataset. Therefore, regardless of how eccentric any single individual’s details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This gives a formal guarantee that individual-level information about participants in the database is not leaked. “Differential Privacy” at Harvard University Privacy Tools Project
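One standard way to achieve this guarantee is the Laplace mechanism: add carefully calibrated random noise to a statistic before releasing it. The sketch below is a minimal illustration with invented data, not a production implementation (real deployments handle the privacy budget, clamping bounds, and secure randomness far more carefully):

```python
import random

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism.
    Each value is clamped to [lower, upper], so one individual's record
    can shift the mean by at most (upper - lower) / n; that bound (the
    "sensitivity") calibrates how much noise is needed."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    sensitivity = (upper - lower) / n
    # A difference of two exponentials with rate epsilon/sensitivity
    # is Laplace noise with scale sensitivity/epsilon.
    rate = epsilon / sensitivity
    noise = random.expovariate(rate) - random.expovariate(rate)
    return true_mean + noise

# Hypothetical ages; the true mean is 40.375, but each call returns
# a noisy value, so no single person's age can be inferred from the output.
ages = [23, 35, 41, 52, 29, 44, 38, 61]
print(dp_mean(ages, lower=0, upper=100, epsilon=1.0))
```

The parameter epsilon tunes the trade-off: smaller epsilon means more noise and stronger privacy, larger epsilon means more accurate answers and weaker privacy.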
That is probably only one of many tools we will need in decades to come.
Further reading on why we are not nearly as “anonymous” as we think: Our anonymity may be an illusion. Because we talk about ourselves so much online, only a few leaked pieces of information may be required to identify us.
Many parents ignore the risks of posting their kids’ data online. The lifelong digital footprint, which starts before birth, makes identity theft much easier.
Ad exec quit the industry over Big Tech’s relentless snooping. He was shocked by the brazen attitude to invasion of privacy.
Featured image: Deleting data/ freshidea, Adobe Stock