The article by Jules Polonetsky (Executive Director, Future of Privacy Forum), "Advice to White House on Big Data", published today (April 1, 2014), brought home an important point to me.
The right conversation is not being had. An honest discussion is not taking place about the difficulty of the issues that Big Data privacy raises. When leaders and policy makers who are not technical experts in the space are given guidance that obfuscates the real issues, the outcome will not be good for the general public. Thus, I feel compelled to speak up.
The segment of Jules' article that forced me to comment was "While the Federal Trade Commission (FTC) has acknowledged that data that is effectively de-identified poses no significant privacy risk, there remains considerable debate over what effective de-identification requires." There is so much nuanced truth and falsity in that statement. It makes me wonder why it was specifically phrased that way. Why lead with the FTC's assertion? Why not simply state the truth? Is it more important to be polite and show deference than it is to have an honest conversation?
The current Chief Technologist of the FTC, Latanya Sweeney, demonstrated over a decade ago that re-identification was possible for upwards of 80 percent of the records in supposedly safe, de-identified data sets (read more). This is a fact that I am highly confident most members of the Future of Privacy Forum are well aware of. So, why lead with a statement that has limited to no validity? This confusion led me to comment on the article. However, let me restate that comment here and provide a bit more detail.
What is De-Identification?
Simply put, de-identification is the process of stripping identifying information from a data collection. ALL the current techniques for de-identification leverage the same basic principle: hide an arbitrary individual (in the data set) in a (much larger) crowd of data, such that it is difficult to re-identify that individual. This is the foundation of the two most popular techniques.
k-anonymity (and its various improvements) uses generalization and suppression techniques to make multiple data descriptors share the same values. Differential privacy (and its enhancements) adds noise to the results of a data mining request made by an interested party.
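To make the two techniques concrete, here is a minimal sketch in Python, using invented toy records (the ZIP codes, ages, generalization rules, and epsilon value are all illustrative assumptions, not a recommendation):

```python
import random

# Hypothetical toy records: (ZIP code, age) are the quasi-identifiers.
records = [("02139", 34), ("02141", 37), ("02142", 33), ("02139", 36)]

# k-anonymity sketch: generalize ZIP to a prefix and age to a decade range
# so that every record shares its quasi-identifier values with >= k-1 others.
def generalize(zip_code, age):
    decade = (age // 10) * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}")

anonymized = [generalize(z, a) for z, a in records]
# All four records now fall into a single equivalence class ("021**", "30-39"),
# making this tiny data set 4-anonymous on those attributes.

# Differential privacy sketch: answer a counting query with Laplace noise
# scaled to the query's sensitivity (1 for a count) divided by epsilon.
def noisy_count(data, predicate, epsilon=0.5):
    true_count = sum(1 for row in data if predicate(row))
    # The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

The key point for the discussion that follows: both mechanisms only protect the attributes you decided in advance to treat as identifying.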
At this point, you are probably saying to yourself "This sounds good so far. What is your problem, Tyrone?"
The problem is the fundamental assumption upon which de-identification algorithms are built: that you can separate the world's data into two distinct groups, private and not-private*.
Once you have made this categorization, you simply apply a clever algorithm to the data and you are safe. Voilà, there is nothing to worry about. Unfortunately, this produces a false sense of safety/privacy, because you are not really safe from risk.
Go to Google Scholar and search on any of these terms: "Re-identification risk", "De-identification", "Re-identification". Read Nate Anderson's article from 2009,
“Anonymized” data really isn’t—and here’s why not. Even better, get Ross Anderson's slides on "Why Anonymity Fails" from his talk at the Open Data Institute on April 4th, 2014.
In Big Data sets (and in data sets generally), the attributes/descriptors that are private change depending on the surrounding context. Thus, things that were thought to be not-private today may become private a second after midnight, when you receive new information or context.
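This is exactly how Sweeney-style linkage attacks work. The toy sketch below uses invented names and records (nothing here comes from a real data set): neither collection alone reveals who has which diagnosis, but joining them on shared "not-private" attributes re-identifies everyone.

```python
# Hypothetical "de-identified" medical records: no names, only attributes
# that were judged not-private (ZIP, birth year, sex) plus a diagnosis.
deidentified_medical = [
    {"zip": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1972, "sex": "M", "diagnosis": "asthma"},
]

# Hypothetical public voter roll: names alongside those same attributes.
public_voter_roll = [
    {"name": "Alice Smith", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "Bob Jones", "zip": "02139", "birth_year": 1972, "sex": "M"},
]

def link(medical, voters):
    """Join the two data sets on their shared quasi-identifiers."""
    matches = []
    for m in medical:
        for v in voters:
            if all(m[k] == v[k] for k in ("zip", "birth_year", "sex")):
                matches.append((v["name"], m["diagnosis"]))
    return matches

# link(...) pairs each name with a diagnosis, even though the medical
# data set contained no names at all. Context changed; privacy evaporated.
```

The arrival of the second data set is the "new information or context" above: attributes that were harmless in isolation became identifying the moment something could be joined against them.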
For Big Data sets (assuming you are merging information from multiple sources), no de-identification algorithm will do anything more than provide people with a warm and fuzzy feeling that "we did something". There is no real protection.
Let's have an honest discussion about the topic for once. De-identification as a means of offering protection, especially in the context of Big Data, is a myth.
I would love to hear about the practical de-identification techniques/algorithms that the Future of Privacy Forum recommends that would provide a measurably strong level of privacy.
I am somewhat happy that this article was published (assuming it is not an April Fools' joke), because it provides us with the opportunity to engage in a frank discourse about de-identification. Hopefully.
*I am describing the world in the simplest terms possible, for a general audience. No need to post comments or send hate mail about quasi-identifiers and/or other sub-categories of data types in the not-private category.
NB. You will also notice that I have not brought up the utility of de-identified data sets, which is still a thorny subject for computer science researchers.