Data Curation - notes from a local meeting

My larger institution's (so not my place of work, but our parent org's) libraries had a fabulous get-together Friday with a session on data curation. The speakers were Clifford Lynch of the Coalition for Networked Information, Carole Palmer from UIUC, and Joel Bader from JHU and JHMI.

I tweeted, but there wasn't a hashtag, so there goes retrieval. These notes weren't live blogged but reconstructed from handwritten notes. They are my reconstructions of the speakers' points - so not my points and maybe not theirs.

Lynch spoke about institutions while Palmer spoke more about librarians. Bader spoke about his own experiences and some trickiness in his area.

Lynch described several trends leading towards data curation:

  • big science, lots o' data
  • simulation and modeling, lots o' data as well as model output
  • distributed sensor systems
  • digital instruments for observations and experiments, lots o' data
  • data provides evidence and is also the product of scholarly work
  • within the scholarly communication system, data and databases are becoming intertwined with the literature (traditional journal articles)

Throughout scholarship we see movement towards data curation; it's just not uniformly distributed. The issues are pervasive: they arise not only with large data sets but also with the many more small ones. Big science projects have budgets and data scientists and resources to deal with this stuff - it's the little guys who don't. The large data sets with established communities might also have more standards, and the best solution for them might be disciplinary repositories at the national or international level. The small projects might fall to the institutions. That means that they fall to the library (not 'round MPOW).

Funders and journals are also becoming aware of data access issues and are starting to require archiving (the question is about compliance, too, imho). We need good IT - security, reliability, and backups - so this shouldn't be a surplus computer kicked under a desk, but enterprise machines that are professionally maintained. Better IT and metadata at the point of capture will make everything easier later. Libraries and other institutional partners need to work with scholars throughout the data lifecycle and can help with required data management plans.

But we can't keep everything. Lynch's suggested questions for deciding what to keep:

  • is it replaceable
  • were human or animal subjects involved (so not ethically replaceable)
  • does it have personally identifiable information

Palmer had nice quotes from Taylor (1986), on adding value to information to improve current use and potential future use, and from Shera (1972), on coordinating and integrating information in alignment with complex social structures and practices... (what I like about her is that she doesn't give up the L and our proud tradition and values to chase the IS)... Data curation does work with our core areas (hah! she said we have core areas, hah!): information behavior, collection development and management, and information organization and retrieval. She asks whether data are the new special collections. There will be more work needed on use of and searching for data, as well as on dealing with collections of collections.

I have very few notes on Bader's talk because I was utterly entranced. They have some tricky issues with their data:

  • an analysis is 100GB to 1TB
  • human subjects
  • can't anonymize - even if aggregated
  • has to be kept on secure servers

There are lots and lots of data standards - various slots and no way to do a database join across them. Who marks up data? Who annotates? It has to be the authors because they are the ones who know - at the time of publication, but not a year later. Who assesses the annotation? Editor and reviewer for publication - but they are busy, too. It has to be made easier for the reviewer. Who enforces? Journals and funding agencies.
