When Steve Hitchcock says that "sustainability must precede preservation for institutional repositories," what does he mean? Not to put words in Steve's mouth (Steve has plenty of words, all well-chosen), but here's my one-sentence take on it: A service is sustainable as long as it has a constituency both willing to fight to keep it going and able to make that fight count. This is, I grant you, a somewhat cynical assessment; I welcome less cynical ones in the comments. The corollaries for nascent data-curation efforts I leave to readers.
The tidbits folder is out of control, so this linklist may be a bit epic. My apologies! There's a lot of great discussion in this area of late.
Data repositories: the next new wave
Steve Hitchcock is sensible, as usual. The answer to "are repositories changing?" is "they already changed," if one asks Carole Palmer. What's lagging, still, is institutional recognition and approval of those changes. See also ERIS's initial thoughts about repositories for researchers. Free the humanities data! says Adam Crymble. Ainsworth and Meredith describe e-Science for Medievalists, but do take a look even…
Some people watch football over Thanksgiving weekend; I get into discussions of disciplinary data regimes with fellow SciBling Christina and others on FriendFeed. Judge me if you must! Another common truism in both the repository and data-management fields is that disciplinary affiliation accounts for a lot of the variation in observed researcher behavior. For once, I have no quarrel with the truism; it is unassailably the case. The wise data curator, then, knows some things about disciplinary practices going in. But what things, exactly? I don't believe that taxonomy exists yet; it'd be an…
Book of Trogool has just been added to Planet Code4Lib, a library-technology blog reader. I am of course honored to be in some very fine company. I have a mixed readership here: librarians, technology pros, researchers from several disciplines. I encourage all my readers to pop over to take a look at Planet Code4Lib. If you're not a librarian, chances are that your image of the library and the librarians who staff it is… well, a bit fusty and out-of-date. Planet Code4Lib will open your eyes in a hurry. Do we do the things you think we do? Well, yes, probably. But that's not all we do. If you…
Another case of things connecting up oddly in my head— "How do we know whether a dataset is any good?" is a vexed question in this space. Because the academy is accustomed to answering quality questions with peer review, peer review is sometimes adduced as part of the solution for data as well. As Gideon Burton trenchantly points out, peer review isn't all it's cracked up to be, viewed strictly from the quality-metering point of view. It's known to be biased along various axes, fares poorly on consistency metrics, is game-able and gamed (more by reviewers than reviewees, but even so), and…
Some interesting ferment happening in repository-land, notably this discussion of various types and scales of repositories and how successful they can expect to be given the structural conditions in which they are embedded. I don't blog repositories per se any more, so I'm not going to address the paper in detail (though I do think it contains serious oversights). What I'm curious about in the Trogool context is the case of institutionally-hosted services aimed not specifically at the institution, but at a particular discipline. arXiv. ARTFL. PERSEUS. DRYAD. There's any number of these. One…
Have some Friday tidbits! An important biology dataset is losing NSF funding and may fold. Nor (as the article explains) is it the only one. It is impossible to overstate the desperate gravity of the data-sustainability question. Academic libraries, if we are not the white knights here—and we certainly have been in the past; witness arXiv—who is? On a similar theme, Yahoo pulls the plug on GeoCities. O ye researchers relying on consumer-grade web services, or new startups, have an exit strategy! Consumer-grade services die when they lose money. Jason Scott may not come charging to your rescue…
It can be difficult to convince present-focused researchers to give a long-term perspective, such as that of a librarian or archivist, the time of day. (So to speak.) Here's my favorite way to do it: the "… and then what?" game. You have digital data. You think it's important. We'll start from there. Your grant runs out… and then what? The graduate student who's been doing all the data-management chores leaves with Ph.D. in hand… and then what? Your favorite grant agency institutes a data-sustainability requirement for all grants… and then what? Your lab's PI retires… and then what? Your…
I got a very nice email the other day thanking me for being a clearinghouse for e-research information. I'm not quite sure I am that, but just in case I've become it without noticing… What I read in the area and think is worthwhile enough to keep around ends up in a few places, all of which have RSS feeds: the Data Curation folder in my Zotero (you may also be interested in the Digital Humanities or Digital Preservation folders), and the toblog and datacuration tags in my del.icio.us (items in the "toblog" tag end up in tidbits posts here, usually). Happy to share these, and also happy to start up a…
BMC Bioinformatics published this article describing a "data publishing framework" for biodiversity data. Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward. I'm calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information. (For once, I feel good about institutional repositories: they swing…
By way of amplifying the signal: the 5th International Digital Curation Conference is coming up in London in December. I will be there in spirit only, I fear, but I hope there will be a Twitter hashtag I can follow? Chris Rusbridge has blogged the program. (If I seem more scatterbrained than usual, it's because most of my spare time and brainspace is currently devoted to building a course I will be teaching online in the spring for Illinois's GSLIS. It's a "Topics in Collection Development" course, which means I have to view things through a lens I'm almost completely unfamiliar with—I don't…
This is a pushmi-pullyu post. I need some help with an environmental scan, so I'll get us started and the rest of you smart folks can amplify my knowledge. I want to understand what's going on where with data curation specifically at the institutional level (no NOAA, no ICPSR, none of that) Stateside. Grant-funded is fine, though I'm doubly curious about programs that have been weaned (or are weaning themselves) off the grant money. Here are the programs I know about offhand: Institutional data curation: San Diego Supercomputer Center (right? I'm not entirely sure what they offer vis-a-vis…
So here's an interesting problem I ran into today. You have metadata in an XML file. You want to make the file self-describingly self-verifying, so you want to embed its checksum inside it. The problem is, you can't add the checksum to the XML file without changing the file's checksum! Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.
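One common way out of the tail-chase, sketched here under assumptions rather than offered as a known tool: compute the digest over the document with the checksum slot blanked out, and have the verifier blank the slot the same way before re-hashing. This is similar in spirit to the enveloped-signature transform in W3C XML Signature. The element name `checksum` and the functions `embed_checksum` and `verify_checksum` are hypothetical names made up for this illustration. The sketch needs Python 3.8 or later for `ET.canonicalize`, and it assumes the checksum element is a direct child of the root.

```python
import copy
import hashlib
import xml.etree.ElementTree as ET

CHECKSUM_TAG = "checksum"  # hypothetical element name, not from any standard


def _digest(root: ET.Element) -> str:
    """SHA-256 of the canonical XML with the checksum slot emptied."""
    clone = copy.deepcopy(root)  # leave the caller's tree untouched
    node = clone.find(CHECKSUM_TAG)
    if node is not None:
        node.text = ""  # blank the slot so the digest cannot depend on itself
    # Canonicalization smooths over serialization quirks such as attribute order.
    canonical = ET.canonicalize(ET.tostring(clone, encoding="unicode"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def embed_checksum(xml_text: str) -> str:
    """Return the document with a checksum element added (or refreshed)."""
    root = ET.fromstring(xml_text)
    node = root.find(CHECKSUM_TAG)
    if node is None:
        node = ET.SubElement(root, CHECKSUM_TAG)
    node.text = ""
    node.text = _digest(root)
    return ET.tostring(root, encoding="unicode")


def verify_checksum(xml_text: str) -> bool:
    """True if the embedded checksum matches the rest of the document."""
    root = ET.fromstring(xml_text)
    node = root.find(CHECKSUM_TAG)
    if node is None or not node.text:
        return False
    return node.text.strip() == _digest(root)


if __name__ == "__main__":
    stamped = embed_checksum("<metadata><title>Book of Trogool</title></metadata>")
    print(verify_checksum(stamped))
    print(verify_checksum(stamped.replace("Trogool", "Trogoul")))
```

The catch is that this does not make an ordinary whole-file checksum stable: a generic checksum utility run over the stamped file will still report a different value than it did before stamping. The verifier has to know the blank-the-slot convention, and whitespace changes such as pretty-printing will still alter the digest in this sketch.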
Libraries do collaborative collection development, through consortia and increasingly via direct institution-to-institution arrangements. Reference and instruction are collaborative endeavors—look at any social-networking service with lots of librarians and you'll see on-the-spot crowdsourced reference responses. Perhaps this collaboration instinct will help libraries respond to the challenge of domain expertise for data curation. Do I need to know cheminformatics, or do I just need to buy a cheminformaticist conference potations until I secure her business card? Formalizing expertise-sharing…
Starting off the week with some juicy tidbits: An extremely nerdy but (for nerds) fascinating examination of XML and its implications for data modeling. Do we have to reduce everything to a relational model? Really? Perhaps not… Notably, it seems to me, this article describes fairly nicely how Fedora works. (For more beating on the humble RDBMS, see this blog post.) White Dielectric Substance in Library Metadata. "Understanding the noise turned out to be more important than understanding the signal." What does that mean for efforts to decide which data to preserve? "I've observed that most…
I read the RIN report on life-sciences data with interest, a little cynicism, and much appreciation for the grounded and sensible approach I have come to expect from British reports. If you're interested in data services, you should read this report too. A warning to avoid preconceptions: If you pay too much attention to all the cyberinfrastructure and e-science hype, it's very easy to fall prey to the erroneous notion that most of science is crunching massive numbers via grid computing and throwing out terabytes of data per second. It ain't so. It never was so. Will it be so in future? Not…
There is a certain kind of digital project that strikes terror and dismay into the hearts of digital preservationists everywhere. Not a one of us hasn't seen many exemplars. They make me myself feel sad and tired. They're projects that, no matter their scholarly or design merit, are completely unpreservable because they were built from unsustainable tools, techniques, and materials. What's worse, even a cursory examination with an eye to sustainability would have at least signaled a problem. It's not the unpreservability so much. It's the obliviousness that makes me hurt inside. For various…
One phenomenon that will be—indeed, already is—utterly unavoidable in the data-curation space is the creation of standards. I once heard Andrew Pace say that standards are like toothbrushes: everybody thinks they're great, but nobody wants to use anybody else's. Be that as it may, standards development and compliance is one way to make everybody's data play nicely with everybody else's data. It's not the only way, to be sure; one very important way that I'm sure we'll also see more of is Being The Only Game In Town. ICPSR manages this quite successfully, and so does the Digital Sky Survey. If…
If you're not reading comments here, you're missing out. For reasons I don't entirely understand, some of the best in the business are seeing fit to comment here. They have more to teach than I do! Chris Rusbridge (of, among other things, this thought-provoking meditation on digital preservation) has been spotted here, and whenever he pops up he makes me think about things. This time, I was thinking about disciplinary expertise, and how I need to make a better case that less of it is necessary for data curation than generally admitted. I hope we can at least admit that data curators don't…
I pointed out Mike Lesk's slideshow in my last tidbits post, finding it a good critical précis of the data problem. It's pleasantly aware of human problems, human problems many treatments of cyberinfrastructure (including, unfortunately, this otherwise useful call to action from Educause) wholly ignore. So wince and flinch at the design (black Arial on white? really? in 2009?), but read the slideshow anyway. I do want to pick apart the slide from which I took the title of this post. I reproduce the said slide's text in full: Can we just give the problem to the libraries? As a professor in a…