Wired reports a great new opportunity to make money online, namely suing internet companies for revealing user data:
An in-the-closet lesbian mother is suing Netflix for privacy invasion, alleging the movie rental company made it possible for her to be outed when it disclosed insufficiently anonymous information about nearly half-a-million customers as part of its $1 million contest to improve its recommendation system.
I'm not sure whether the litigators have read this particular section of the Netflix prize rules:
To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates.
So yes, you can match a set of reviews to a named person, but how will you know that it's really that person and not a random coincidence? The Netflix dataset contains almost half a million anonymous users, so there is plenty of opportunity for a false positive match (the birthday paradox is one example of how easily such coincidences arise). Netflix learned from AOL's data release disaster, which resulted in a few people getting fired.
But this theme is important. Many internet companies provide free services in return for the ability to use their users' data for profit. Andrew Parker looked at which companies profit from user data. Usually the data is never given away, just used to make other people's lives easier. Let's say that you bookmark a particular page: others won't see directly that you've done it, but they will see indirectly that there are people who find that page worth saving. Because it's worth saving, it can be listed on the first page of search results.
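To get a feel for how easily coincidental matches arise, here is a minimal sketch in Python. The function names, the per-user match probability, and the figure of 480,000 users are illustrative assumptions (the post only says "almost half a million"), not numbers from Netflix or from the lawsuit.

```python
import math


def prob_some_chance_match(n_users: int, p_match: float) -> float:
    """Chance that at least one of n_users matches a given profile by accident.

    Assumes users are independent and 0 <= p_match < 1 (hypothetical inputs).
    """
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_users * math.log1p(-p_match))


def prob_any_two_collide(n_users: int, n_profiles: int) -> float:
    """Birthday-paradox chance that some two users share the same profile.

    Assumes every profile is equally likely, which real ratings are not.
    """
    if n_users > n_profiles:
        return 1.0
    log_no_collision = sum(math.log1p(-i / n_profiles) for i in range(n_users))
    return -math.expm1(log_no_collision)


if __name__ == "__main__":
    # Classic birthday paradox: 23 people, 365 possible birthdays
    print(round(prob_any_two_collide(23, 365), 3))
    # Illustrative: 480,000 users, one-in-a-million accidental match per user
    print(round(prob_some_chance_match(480_000, 1e-6), 3))
```

Even a one-in-a-million chance of an accidental match per user adds up quickly across a dataset of this size, which is why a single match is weak evidence on its own.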
A more problematic area is medicine. Wired reports that there is a market out there for medical records, and that the anonymization protecting them isn't very strong.
Making medical data public would allow massive advances in medicine. For example, the Personal Genome Project seeks to analyze a number of volunteers in a lot of detail (see, for example, Steven Pinker's medical record). If a few million people did that, we'd know so much more about disease, risks and the factors affecting them, the effectiveness of drugs, diet, and the effects of the genome.
One-sided disclosure gets many people worried: their insurance rates might go up, or they might not get a job. It would help if everyone disclosed: nobody feels comfortable being naked when others wear swimsuits.
But we should also ask ourselves as a society: what is insurance? Is insurance a protection against uncontrollable risk, or is it an instrument of equality? Is the genome our destiny or an uncontrollable risk?
The line between users sharing information about themselves and feeling their privacy violated is indeed thin. I have often felt the same conflict myself, though my own personal standard (it is a very personal standard) is that I am OK with sharing the information if it results in greater value for me as a consumer or if it adds to the overall well-being of society. But a necessary condition is that the safeguards are present to anonymize me as an individual.
A large problem is that technology is evolving to eliminate anonymity even where someone has guaranteed it. Technology evolves quickly to use the most trivial of data in ways we never thought of when the data was collected. Combining multiple datasets can reveal amazing information, and new data mining techniques keep being developed.
Krish, anonymization can have varying levels of strength - and in every case it's giving up statistical power for privacy. So, 100% anonymized can easily mean that there will be no benefit for society :)
Roger, you're right.
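The trade-off between anonymization strength and statistical power mentioned above can be made concrete with a small sketch. It uses randomized response, a technique chosen here purely for illustration and not one named in the post or the comments: each yes/no answer is replaced by a fair coin flip with probability flip_prob. The function name and all numbers are hypothetical.

```python
import math


def randomized_response_std_error(true_rate: float, flip_prob: float, n: int) -> float:
    """Standard error of the estimated 'yes' rate under randomized response.

    Each of n answers is kept with probability 1 - flip_prob and otherwise
    replaced by a fair coin flip. Higher flip_prob means more privacy.
    """
    if not 0 <= flip_prob < 1:
        raise ValueError("flip_prob must be in [0, 1); at 1 nothing can be estimated")
    # Rate of 'yes' answers actually observed after the random replacement
    observed = (1 - flip_prob) * true_rate + flip_prob * 0.5
    # Undoing the noise divides by (1 - flip_prob), which inflates the error
    return math.sqrt(observed * (1 - observed) / n) / (1 - flip_prob)


if __name__ == "__main__":
    for flip_prob in (0.0, 0.5, 0.9, 0.99):
        err = randomized_response_std_error(true_rate=0.3, flip_prob=flip_prob, n=10_000)
        print(f"flip_prob={flip_prob:.2f}  std_error={err:.4f}")
```

As flip_prob approaches 1 the error grows without bound, which is the "100% anonymized, no benefit" end of the scale.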
Sometimes because some of us...
Don't have the money but we can make it on our own way!
We don't need No Politicians, No Governance nor Lawyers
To point us in a better way.
Because sometimes it takes the best of us,
To do just what we can!