ETDs as the data-curation wedge?

By dsalo on September 11, 2009.

Many doctoral institutions now accept and archive (or are planning to accept and archive) theses and dissertations electronically. Virginia Tech pioneered this quite some time ago, and it has caught on slowly but steadily for reasons of cost, convenience, access, and necessity.

Necessity? Afraid so. Some theses and dissertations are honest digital artifacts, unable to be faithfully represented in ink on paper or in other analog fashion. Others might be flattened into analog, but that wouldn't be their (or their author's) preference. Still others contain digital artifacts of various sorts. Source code. Multimedia. Data.

ETDs don't pose any special digital-preservation challenges over and above the usual. (I got into an exchange on Twitter yesterday about a dissertation presented with a web content-management system, raising the issue of the artifact's sustainability given the CMS dependency. But any CMS with any content involves those same issues.) What they do present, given their popularity among faculty, students, administrators, and even (some) librarians, is an opportunity.

Institutions consider dissertations to be vital institutional history. (Master's theses—well, that varies from institution to institution, and even within institutions.) There can be no question of throwing away a dissertation simply because it's digital; an institution receiving digital dissertations has no choice but to do something about them.

Now, a lot of institutions, it seems to me, aren't doing much or are doing the wrong things. (If your institution has an unaudited pile of CD-ROMs, that's the wrong thing. Perfectly understandable given the circumstances, but still wrong in today's technology environment.) This shouldn't be surprising or terrifying, nor is it an excuse to excoriate the institutions. We all do our best with what we have and what we know at the time.

However… the tools now exist for us to step up our digital-preservation game, and ETDs give us an unassailable, mission-critical reason to. Remember, the problems aren't specific to ETDs, so if we solve them for ETDs, we've solved them for a wide swathe of other kinds of documents and data as well.

Perhaps instead of spinning jargon-laden webs of words such as "cyberinfrastructure," we should start with an easy-to-recognize problem that we already know we have.


Virginia Tech might require an ETD for the university, but in my department they still wanted a hard copy. And my advisor wanted 2 copies, 1 for him and 1 for the lab.

That was expensive and painful :p

I'm disappointed, but not surprised. There is still a widespread belief in academia that digital files are not to be trusted, and paper and microfilm are the only appropriate archival formats.

This is also an unfortunate side-effect of faculty governance; nobody can tell your department NOT to require print.

Sorry that happened to you. I agree it was excessive and unnecessary.

At the University of Southern California, the libraries no longer archive print copies of TDs. Starting in 2006, we decided to only archive digital copies. Most of these are PDF files, but every once in a while we get a movie file, image collection, ppt, or even a .exe.

We catalog them according to Dublin Core and ISBD standards and keep them all on locally managed servers. Considering we receive just under 1k TDs a year, this is a huge space saver. Of course, as JohnV mentioned above, some of the individual schools still require print copies for their own collections.

McGill University has now mandated that all theses be submitted electronically to the Graduate & Postdoctoral Studies Office. The GPSO then sends them on to the library to have the PDF/As (and soon - hopefully - data sets, a/v material, what have you) deposited in our institutional repository. At the same time, the library is actively digitizing earlier theses and uploading those to the IR.
The ultimate goal is to have all graduate, doctoral, and honours undergraduate theses (the last when sponsored by a supervising faculty member) available in our IR - eScholarship@McGill.

John, Amy -- any thoughts of taking the data work you're doing beyond ETDs?

Dorothea - indeed! We're only at the early evaluation stages, but the dream is to create a VRE linked to the IR. VRE for live data, IR for archiving. Ideally looking into one system to manage the whole thing (VRE-IR shift), tag data with researcher info, etc etc etc.

Dorothea, I'm very happy to find you again after suspending CavLec, which I always enjoyed. Your blog post mentions data in ETDs but the other comments seem to relate mostly to ETDs themselves, not the data. I strongly agree with you that ETDs (and theses in general) are an opportunity for data. I'm particularly concerned about the envelope glued into the back cover of a printed thesis, containing disks or CDs of data. Or the links to the candidate's web site, sometimes based on a username that will shortly be deleted automatically once the candidate graduates and leaves.

Many theses come with data, often simple things like Excel spreadsheets, or videos or sound recordings, or SPSS datasets, or... Getting to grips with these, rather than just leaving them as a pile of CD-ROMs, is definitely the Right Thing, as you suggest. Just trying, and sharing the results, would be extremely helpful. Finding somewhere reasonably permanent (even in the repository itself), and giving it a URL to link from the thesis, and some independent metadata, would be great. So, thanks for this post, and I guess I'll need to find some time to have a look at your local ETD repository to see how you've done it!

We haven't done it yet, Chris! We're still very much in the planning stages where I am... and also amidships of a total technology-platform shift.

I can say, though, that the possibility of archiving data alongside theses immediately captured the goodwill of the engineering professor on the ETD committee. We've had the experience of sending those CDs merrily off to ProQuest, where they appear to... drop into an oubliette. We're less than pleased about that.

The backlog of CDs is a problem in its own right. I'd like to go back and deal with it, but we'll see how it goes... it's always easier to do the right thing going forward than retrospectively.

(Which is a topic that may deserve a post in its own right!)
