My larger institution's (so not my place of work, but our parent org's) libraries had a fabulous get-together on Friday with a session on data curation. The speakers were Clifford Lynch of the Coalition for Networked Information, Carole Palmer from UIUC, and Joel Bader from JHU and JHMI.
I tweeted, but there wasn't a hashtag, so there goes retrieval. These notes weren't live-blogged; I reconstructed them from handwritten notes. They are my reconstructions of the speakers' points - so not my points and maybe not quite theirs.
Lynch spoke about institutions while Palmer spoke more about librarians. Bader spoke about his own experiences and some trickiness in his area.
Lynch pointed to several trends leading towards data curation:
- big science, lots o' data
- simulation and modeling, lots o' data as well as model output
- distributed sensor systems
- digital instruments for observations and experiments, lots o' data
- data provides evidence and is also the product of scholarly work
- within the scholarly communication system, data and databases are becoming intertwined with the literature (traditional journal articles)
Throughout scholarship we see movement towards data and data curation; it's just not uniformly distributed. The issues are pervasive across the board, arising not only with large data sets but also with the many more smaller ones. Big science projects have budgets and data scientists and resources to deal with this stuff - it's the little guys who don't. The large data sets with established communities might also have more standards, and the best solution for them might be disciplinary repositories at the national or international level. The small projects might fall to the institutions. That means that they fall to the library (not 'round MPOW).
Funders and journals are also becoming aware of data access issues and are starting to require archiving. (The question is also about compliance, imho.) We need good IT - security, reliability, and backups - so this shouldn't mean a surplus computer kicked under a desk, but enterprise machines that are professionally maintained. Better IT and metadata at the point of capture will make everything easier later. Libraries and other institutional partners need to work with scholars throughout the data lifecycle and can help with required data management plans.
But we can't keep everything. Lynch's suggested questions for deciding what to keep:
- is it replaceable?
- were human or animal subjects involved (so not ethically replaceable)?
- does it have personally identifiable information?
Palmer had nice quotes from Taylor (1986), on adding value to information to improve current use and potential future use, and from Shera (1972), on coordinating and integrating information in alignment with complex social structures and practices... (What I like about her is that she doesn't give up the L and our proud tradition and values to chase the IS.) Data curation does work with our core areas (hah! she said we have core areas, hah!): information behavior, collection development and management, and information organization and retrieval. She asks whether data are the new special collections. More work will be needed on use of and searching for data, as well as on dealing with collections of collections.
I have very few notes on Bader's talk because I was utterly entranced. They have some tricky issues with their data:
- an analysis is 100GB to 1TB
- human subjects
- can't anonymize - even if aggregated
- has to be kept on secure servers
There are lots and lots of data standards, with various slots and no way to do a database join across them. Who marks up data? Who annotates? It has to be the authors because they are the ones who know - at the time of publication, but not a year later. Who assesses the annotation? The editor and reviewer for publication - but they are busy, too. It has to be made easier for the reviewer. Who enforces? Journals and funding agencies.