de Montjoye: In 2010 I joined the Santa Fe Institute in New Mexico, a complex systems research institute. I was working with large datasets and, at the same time, I was really interested in the topic of privacy. I was reading studies that raised privacy concerns, and I was fascinated by the constant rebuttal to those studies: that you shouldn’t worry because the data is anonymous. This surprised me because I was working with location data, and, seeing users moving around, my intuition was that there was a disconnect between claims that the data was anonymous and what was probably possible in terms of re-identification.
This is where the idea for the Unique in the Crowd paper came from. We wanted to develop a statistical approach to quantify what it would take, on average, to identify someone from an anonymous location dataset. We were able to show that four approximate data points of where and when someone was were enough, in a dataset of 1.5 million people, to uniquely identify that person 95 percent of the time.
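To convey the idea behind that measurement, here is a minimal sketch of how one might estimate such a unicity figure on a toy dataset. It is only an illustration, not the paper’s actual methodology: the data layout (a set of coarse antenna-and-hour points per user) and every name in it are assumptions.

```python
import random

def estimate_unicity(traces, p=4, trials=1_000):
    """Estimate the fraction of users uniquely pinned down by p random
    (place, time) points drawn from their own trace.

    traces: dict mapping user_id -> set of (antenna_id, hour) tuples,
    each a coarse approximation of where someone was and when.
    Assumes every trace contains at least p points.
    """
    users = list(traces)
    unique = 0
    for _ in range(trials):
        target = random.choice(users)
        points = random.sample(sorted(traces[target]), p)
        # How many users' traces contain all p points? If only the
        # target's trace does, those p points identify them uniquely.
        matches = sum(1 for u in users
                      if all(pt in traces[u] for pt in points))
        if matches == 1:
            unique += 1
    return unique / trials
```

On the real mobility traces studied in the paper, four points were already enough to reach roughly 95 percent unicity; the sketch only shows the shape of the computation.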
Angwin: One thing I took away from this study was that location data is a special category of data that is inherently sensitive. Is that what you took away?
de Montjoye: What I find fascinating about location is that I see it as a universal identifier. Data on where you are exists simultaneously in a large number of datasets. If I wanted to start reconciling identities across datasets, location is how I would do it, because it is very rich and exists in so many datasets collected through completely different modalities, from my phone to my credit card.
When we thought of data in the past, we used to think of Excel spreadsheets with thousands of people and a handful of columns. Now data means where you were every 10 minutes for over a year. In these kinds of high-dimensional datasets, a few pieces of information are going to be sufficient to identify someone with high likelihood.
Angwin: The 2013 study stirred up some debate, specifically over whether an attacker would or would not have access to those four data points about someone. How realistic do you think this is?
de Montjoye: Yes, some people thought that gathering four points was completely unrealistic. To that I say, “Have you been on social media at all?” I do not think it is going to be that hard to find four or six or eight points about someone.
I think we were hitting on a bit of an inconvenient truth. We were showing that de-identification techniques just didn’t really scale to the new world of big data that we are in.
Angwin: It is now 2022. Have there been any new developments in re-identification?
de Montjoye: What I am interested in at the moment is the potential for what we call profiling attacks. The vast majority of re-identification attacks have been based on matching: I know where you were at a given time, say on a specific Sunday at 4 p.m., and I match what I know about you against the pseudonyms in the dataset.
In my opinion, the next frontier is harnessing machine learning to develop a model of how someone usually behaves and using this model to identify them a month from now. This is something we started to do in collaboration with Michael Bronstein. We show that the way you communicate on WhatsApp, for example, is specific enough that we can learn how you, as a person, communicate with other people, without knowing who those people are, and use this to identify you even six months later.
Essentially, the way you behave is so specific (how quickly you answer messages, how many people you exchange with, and so on) that we can actually learn a profile from one period of time. What we are showing is that the way we behave is very stable over time. I think this will be the next type of re-identification attack.
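As a rough illustration of the idea (the published work with Michael Bronstein used much richer interaction data and more sophisticated models; the feature choices and names below are invented for the sketch), one could summarize each user’s behavior in one time window as a feature vector and then match profiles from a later window by nearest neighbor:

```python
import numpy as np

def profile(features_by_user):
    """features_by_user: user_id -> 1-D feature vector summarising one
    time window (e.g. median reply delay, number of distinct contacts,
    messages per day). Returns aligned ids and a feature matrix."""
    ids = sorted(features_by_user)
    return ids, np.vstack([features_by_user[u] for u in ids])

def reidentify(train_window, test_window):
    """Match each profile from a later window to its nearest neighbour
    among the earlier profiles; return the re-identification accuracy.
    Assumes both windows cover the same users."""
    train_ids, train_X = profile(train_window)
    test_ids, test_X = profile(test_window)
    # Standardise features so no single feature dominates the distance.
    mu, sigma = train_X.mean(0), train_X.std(0) + 1e-9
    train_X, test_X = (train_X - mu) / sigma, (test_X - mu) / sigma
    hits = 0
    for uid, x in zip(test_ids, test_X):
        dists = np.linalg.norm(train_X - x, axis=1)
        if train_ids[int(dists.argmin())] == uid:
            hits += 1
    return hits / len(test_ids)
```

The point of the sketch is the structure of the attack, learn a behavioral profile in one period and look it up in another, rather than any particular feature set or model.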
Angwin: You’ve been quoted saying, “Anonymity is not a property of a data set, but is a property of how you use it.” What do you mean by that?
de Montjoye: There is this false but very convenient notion that you just need to take a dataset, modify it one way or the other, and then it’s anonymous forever. This is not true. We need to move to a system in which we honestly acknowledge that the data is pseudonymous, and that it’s not going to be super difficult for someone with access to the data to re-identify someone.
There are some pretty good solutions out there to mitigate these risks, but they’re not perfect. We need to start seeing how we can rely on these solutions while acknowledging the remaining risks, including the fact that the data still exists in pseudonymous format. We must combine hard technical solutions with access control, logging, and governance mechanisms for how this data is being used.
Angwin: What is your preferred model for mitigating privacy risks while allowing scientists to answer important research questions?
de Montjoye: The notion of using data anonymously has a lot of value if protections are applied properly and correctly. To me, this means there is an ethics committee on top of strong technical guarantees. If this is the case, I think we can get a lot of scientifically valuable information out of this data while limiting the risks. The question, then, is how to do it best technically, and there is no silver bullet.
One option is what Google has been doing with COVID: putting out “on average” mobility, probably with differential privacy applied, but that limits what you can do with the data. Another option is allowing researchers to formulate their hypotheses on synthetic data, what’s known as test data or fake data. Once your scientific hypothesis is well developed, you send that piece of code to a server where the data exists in pseudonymized form; after it runs, only the relevant anonymized, aggregate results are sent back to you. That’s how I would technically imagine the system working, but then the hard question is the ethics on top of it: how do you build a system that will prevent abuse?
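For the first option, the standard building block is a differentially private release such as the Laplace mechanism. The sketch below is a minimal illustration of that mechanism, not Google’s actual pipeline, and the numbers are made up:

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count under epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon. `sensitivity` is
    the most one person can change the count (here, one visit)."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: publish a daily visit count for a place category.
noisy_visits = dp_count(true_count=1284, epsilon=0.5)
```

Smaller epsilon means more noise and stronger privacy; aggregates released this way can be published broadly, at the cost of precision, which is the trade-off being described here.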
We conduct our research to demonstrate what is possible. To me, it’s crucial to generate all the evidence so that we can have an honest, informed conversation and examine, as a society, how we want to balance the potential and the risks. We have to make sure we’re standing on solid scientific ground for that conversation.