Category: Tidbits
Have some Friday tidbits!
- An important biology dataset is losing NSF funding and may fold. Nor (as the article explains) is it the only one. It is impossible to overstate the desperate gravity of the data-sustainability question. Academic libraries, if we are not the white knights here—and we certainly have been in the past; witness arXiv—who is?
- On a similar theme, Yahoo pulls the plug on GeoCities. O ye researchers relying on consumer-grade web services, or new startups, have an exit strategy! Consumer-grade services die when they lose money. Jason Scott may not come charging to your rescue.
- H1N1 science depends on a public database of flu immunity data. "As the researchers acknowledge in their paper, the work couldn't have taken place if it weren't for extensive data sharing within the community of flu virus researchers." Data sharing makes possible better, faster science.
- Data and the journal article. First: if you are saving your data as PDF, stop it. Second: as I suggested to Chris on FriendFeed, there's a serious structural issue with expecting journal publishers to cope with appropriate data archiving: by the time a researcher chooses a journal to publish in, all the decisions about data gathering and representation have already been made—and they may well have been made badly. The poor journal publisher can't go back in time and fix bad decisions! In our not-yet-standardized data age, early data interventions have to happen close to the researcher, which to me means they need to happen at the institution where the research happens.
- The need for clear data licenses. I haven't talked about data licensing here, partly because the current state of intellectual-property law makes me sick at heart, but there's no question that it's an important piece of the data puzzle.
- Peer-to-peer technology used for the forces of good: BioTorrents. Datasets vary in size; for the large ones, network bandwidth becomes a sharing problem. Torrenting won't precisely solve the problem, but it certainly increases the size range within which datasets are portable.
- Fascinating data project of the week: National Center for Ecological Analysis and Synthesis. What caught my attention is that as I read the project description, it takes public data sharing for granted. NCEAS researchers are not generating data; they are mining existing data. I'm inordinately curious about the disciplinary culture that makes this a feasible thing: what price scooping?
Whew. I have a lot more, but it's Friday.
0 Comments • 0 TrackBacks
Category: Tactics
It can be difficult to convince present-focused researchers to give a long-term perspective, such as that of a librarian or archivist, the time of day. (So to speak.) Here's my favorite way to do it: the "… and then what?" game.
You have digital data. You think it's important. We'll start from there.
- Your grant runs out… and then what?
- The graduate student who's been doing all the data-management chores leaves with Ph.D. in hand… and then what?
- Your favorite grant agency institutes a data-sustainability requirement for all grants… and then what?
- Your lab's PI retires… and then what?
- Your instrument manufacturer or favorite software's developer goes out of business… and then what?
- Your whomped-up next-door data center burns up, falls down, then sinks into the swamp… and then what?
You get the idea. No far-fetched catastrophizing, just all-too-plausible scenarios that researchers really ought to have thought about already but usually haven't. If your service can position itself as the "… and then what," you're on to something.
0 Comments • 0 TrackBacks
Category: Metablogging
I got a very nice email the other day thanking me for being a clearinghouse for e-research information. I'm not quite sure I am that, but just in case I've become it without noticing…
What I read in the area and think is worthwhile enough to keep around ends up in a few places, all of which have RSS feeds:
Happy to share these, and also happy to start up a Zotero group if anyone else is interested in contributing items thereto!
(By the way, one rather annoying thing about the Zotero feed—I almost always save copies of the item along with the item record, and Zotero dumps both into the RSS feed, which from the consuming end looks like a lot of unnecessary duplication. I apologize for this, and wish Zotero would fix it.)
3 Comments • 0 TrackBacks
Category: Tactics
BMC Bioinformatics published this article describing a "data publishing framework" for biodiversity data.
Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward.
I'm calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information.
(For once, I feel good about institutional repositories: they swing two out of three at the minimum, and some manage all three!)
Persistent identifiers seem simple but aren't, necessarily. For example, does a constantly changing dataset get a persistent identifier? How does that identifier know what it's identifying, in that case? Should a persistent identifier be just a URL? What if the domain name goes away or changes? (This is not an idle concern; the University of Illinois, for example, just changed its domain name, and the institutional repository I run is eventually going to lose its separate domain entirely.) What, exactly, gets a persistent identifier? The entire dataset? Files within it? Should a query performable on that dataset also be persistently identifiable? How does that work, exactly? And when does something get its persistent identifier? As soon as it hits the system? Or after it's done and blessed, if it ever is?
Anyway. All of this needs to be hashed out (so to speak). It's not optional, system designers.
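One hedged way to picture the URL question: keep the identifier opaque and let a resolver map it to wherever the item currently lives, so that a domain change means updating one table rather than every citation. The Python sketch below is illustration only. The `Resolver` class, the `dataset/0001` identifier, and both example URLs are invented and do not describe any particular repository or identifier scheme. It also dodges the harder questions (changing datasets, files within datasets, queries).

```python
from dataclasses import dataclass, field


@dataclass
class Resolver:
    """Maps opaque identifiers to whatever URL currently serves the item."""

    locations: dict = field(default_factory=dict)

    def register(self, identifier: str, url: str) -> None:
        # An identifier is assigned once and never reused for another item.
        if identifier in self.locations:
            raise ValueError(f"{identifier} is already assigned")
        self.locations[identifier] = url

    def move(self, identifier: str, new_url: str) -> None:
        # The item changed address; the identifier itself stays the same.
        if identifier not in self.locations:
            raise KeyError(identifier)
        self.locations[identifier] = new_url

    def resolve(self, identifier: str) -> str:
        return self.locations[identifier]


# Hypothetical identifier and URLs, made up for this example.
resolver = Resolver()
resolver.register("dataset/0001", "https://repository.old-domain.example/items/1")
resolver.move("dataset/0001", "https://library.new-domain.example/repo/items/1")
print(resolver.resolve("dataset/0001"))
```

Anything that cited `dataset/0001` before the move still resolves afterwards; only the resolver's table had to change.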
Once that's sorted out, citation isn't actually a huge hurdle from where I'm sitting. It's not a technical problem; it's a matter of kicking the style manuals into acknowledging data and making citation formats for them.
Usage, now, that's a hurdle. It, too, is utterly necessary for cultural reasons, however. The culture of academia looks kindly on impact measurements, even hopelessly faulty ones. Somehow or other, research impact has to be measured for researchers' careers to advance. Data are no exception.
(In my professional neck of the woods, systems designers ignored the need for usage documentation for entirely too long, which has made my life as an IR manager extraordinarily difficult. I make this post in hopes of avoiding the same mistake in this new arena.)
What counts as a "use" exactly? How does "use" get harmonized over different kinds of access schemes? How does an API "use" compare with an entire download?
I don't know. I encourage systems designers not to get too hung up on such questions. Record all accesses and make the best decisions you can right now about how to present them. Yes, you'll have to rewrite the event-analysis code, probably more than once, so comment it well.
Do not, however, wait until you have all the answers to write an analyzer. If you do that, you're strangling the open-data movement in its crib. BMC Bioinformatics explains why.
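A minimal sketch of "record everything, analyze later," in Python. The idea is to store raw access events with no opinion about what a "use" is, and to keep the counting rule in a separate, replaceable function. Every name here (`record_access`, `count_uses_v1`, the `channel` labels, the item IDs) is made up for the example; a real system would write to durable storage rather than a list.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def record_access(log, item_id, channel, detail=""):
    """Append one raw access event; no judgement about what counts as a use."""
    log.append(json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "item": item_id,
        "channel": channel,  # hypothetical labels such as "download" or "api"
        "detail": detail,
    }))


def count_uses_v1(log):
    """First-draft analyzer: every event counts as one use, whatever the channel.

    This is the piece expected to be rewritten as the definition of "use"
    firms up. The raw log does not change when it is.
    """
    return Counter(json.loads(line)["item"] for line in log)


log = []
record_access(log, "dataset-42", "download", "full archive")
record_access(log, "dataset-42", "api", "single-record query")
record_access(log, "dataset-7", "api")
print(count_uses_v1(log))
```

Because the events are kept whole, a later analyzer that weights an API call differently from a full download can be run over the same history.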
2 Comments • 0 TrackBacks
Category: Metablogging • Tidbits
By way of amplifying the signal: the 5th International Digital Curation Conference is coming up in London in December. I will be there in spirit only, I fear, but I hope there will be a Twitter hashtag I can follow?
Chris Rusbridge has blogged the program.
(If I seem more scatterbrained than usual, it's because most of my spare time and brainspace is currently devoted to building a course I will be teaching online in the spring for Illinois's GSLIS. It's a "Topics in Collection Development" course, which means I have to view things through a lens I'm almost completely unfamiliar with—I don't do normal collection development, and most of what I know about it is that it scares me to death! I am designing my version to be "how coll-dev is currently changing and may continue to change." Data curation will be included, as will scholarly communication and the serials crisis, institutional repositories, digital collections, digital preservation, and similar things that I actually do know something about. Wish me luck. I will need it.)
I've gotten some good comments on yesterday's poll. Please keep them coming. I know there's more out there!
0 Comments • 0 TrackBacks
Category: Miscellanea
This is a pushmi-pullyu post. I need some help with an environmental scan, so I'll get us started and the rest of you smart folks can amplify my knowledge.
I want to understand what's going on where with data curation specifically at the institutional level (no NOAA, no ICPSR, none of that) Stateside. Grant-funded is fine, though I'm doubly curious about programs that have been weaned (or are weaning themselves) off the grant money. Here are the programs I know about offhand:
Tell me what I'm missing, please and thank you.
6 Comments • 0 TrackBacks
Category: Miscellanea
So here's an interesting problem I ran into today. You have metadata in an XML file. You want to make the file self-describingly self-verifying, so you want to embed its checksum inside it. The problem is, you can't add the checksum to the XML file without changing the file's checksum!
Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.
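One common way out of the tail-chase, sketched here in Python with only the standard library: compute the checksum over the document with the checksum element left out, and do the same when verifying. This is similar in spirit to the "enveloped signature" transform in XML Signature, though the sketch below is not an implementation of that standard. The `checksum` element name and the sample metadata are assumptions made for the example, and `ET.canonicalize` needs Python 3.8 or later.

```python
import copy
import hashlib
import xml.etree.ElementTree as ET

CHECKSUM_TAG = "checksum"  # hypothetical element name, a direct child of the root


def digest_without_checksum(root):
    """SHA-256 of the canonical XML with any checksum element removed."""
    stripped = copy.deepcopy(root)
    for node in stripped.findall(CHECKSUM_TAG):
        stripped.remove(node)
    canonical = ET.canonicalize(ET.tostring(stripped, encoding="unicode"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def embed_checksum(root):
    """Replace any existing checksum element with a freshly computed one."""
    for node in root.findall(CHECKSUM_TAG):
        root.remove(node)
    digest = digest_without_checksum(root)
    ET.SubElement(root, CHECKSUM_TAG).text = digest


def verify_checksum(root):
    """True if the embedded checksum matches the rest of the document."""
    node = root.find(CHECKSUM_TAG)
    return node is not None and node.text == digest_without_checksum(root)


root = ET.fromstring("<metadata><title>Field notes</title></metadata>")
embed_checksum(root)
stored = ET.tostring(root, encoding="unicode")

intact = verify_checksum(ET.fromstring(stored))
tampered = verify_checksum(
    ET.fromstring(stored.replace("Field notes", "Field n0tes"))
)
print(intact, tampered)
```

Hashing the canonical form rather than the raw bytes means attribute order and similar serialization quirks do not trip the check. The trade-off is that a generic file-checksum tool will not agree with the embedded value, so the verifier has to know the convention.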
7 Comments • 0 TrackBacks
Category: Tactics
Libraries do collaborative collection development, through consortia and increasingly via direct institution-to-institution arrangements. Reference and instruction are collaborative endeavors—look at any social-networking service with lots of librarians and you'll see on-the-spot crowdsourced reference responses.
Perhaps this collaboration instinct will help libraries respond to the challenge of domain expertise for data curation. Do I need to know cheminformatics, or do I just need to buy a cheminformaticist conference potations until I secure her business card?
Formalizing expertise-sharing arrangements strikes me as rather difficult. Nobody wants to be the person everybody across the country calls with questions about ChemML; when would there be time to get any work done? Still, I would have thought that collaborative collection development had too many moving parts to be practical, and it's being done.
In any case… I have "develop network of domain experts" in the back of my head as a wise thing to do.
0 Comments • 0 TrackBacks
Category: Tidbits
Starting off the week with some juicy tidbits:
That should keep everyone out of trouble a while…
0 Comments • 0 TrackBacks