One of the truisms in data curation is “well, of course we don’t let sensitive data out into the wild woolly world.” We hold sensitive data internally. If we must let it out, we anonymize it; sometimes we anonymize it just on general principles. We’re not as dumb as the Google engineers, after all.
Only it turns out that data anonymization can be frighteningly easy to reverse-engineer. We’ve had some high-profile examples, such as the AOL search-data fiasco and the ongoing brouhaha over Netflix data. Paul Ohm’s working paper on the topic is a great way to get up to speed.
We librarians are fairly dogmatic about this sort of thing, owing to our professional-ethics commitment to your freedom to read. We wipe your checkout record clean after you turn your items back in. We do keep passive-voice usage records on our materials: “this book has been checked out X times since Y date.” But that’s it. (And no, we don’t keep track of when you visit the library, so it’s not possible to connect a formerly checked-out book with you based on the date of checkout.)
This long-standing design decision is being challenged on social-media grounds; it’s hard to build Web 2.0-ish applications around your library behavior if we don’t keep records of your library behavior! I used to be on the Web 2.0 side of this particular controversy, but as I’ve been reading about reidentification, my mind has changed. Information about which local public library one goes to isn’t precisely “zip code,” but it’s awfully, awfully close.
Anyway, the application to human-subjects data of all stripes is, I hope, obvious. It’s not as simple as anonymizing data; even aggregating it and only permitting queries may not solve the problem. Certain data breakdowns (e.g. from survey data) may be problematic, since a breakdown fine enough to leave only one or two people in a cell effectively points at those people.
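To make the breakdown problem concrete, here is a minimal Python sketch. The records, the choice of fields (branch, birth year, sex), and the `small_cells` helper are all made up for illustration; none of it comes from a real dataset or library system. It counts how many rows share each combination of attributes and flags the combinations held by fewer than `k` rows.

```python
from collections import Counter

# Hypothetical "anonymized" survey rows: (library_branch, birth_year, sex).
# No names or ID numbers are present.
records = [
    ("Northside", 1975, "F"),
    ("Northside", 1975, "F"),
    ("Northside", 1982, "M"),
    ("Eastgate", 1990, "F"),
    ("Eastgate", 1990, "F"),
    ("Eastgate", 1961, "M"),
]


def small_cells(rows, k=2):
    """Return the attribute combinations shared by fewer than k rows."""
    counts = Counter(rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Combinations that describe exactly one respondent.
risky = small_cells(records, k=2)
for combo, n in risky.items():
    print(combo, n)
```

In this toy data, two of the four combinations describe exactly one person each, so publishing the breakdown would single those people out even though no names appear anywhere.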
Taking heed of the problem is the first step to solving it, but only the first. The sooner we have data-release guidelines that take reidentification into account, the happier I will feel about open data in the social sciences and medicine.
Incidentally, are you as sanguine about governments providing “linked data” as you were? Because I’m not.