<?xml version="1.0"?>
<rss version="2.0">
   <channel>
      <title>The Book of Trogool</title>
      <link>http://scienceblogs.com/bookoftrogool/</link>
      <description>E-research, cyberinfrastructure, data curation... an academic librarian confronts the way computers are changing academic research.</description>
      <language>en</language>
      <copyright>Copyright 2009</copyright>
      <lastBuildDate>Fri, 20 Nov 2009 16:18:16 -0600</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/?v=4.261</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

      
      <item>
         <title>Tidbits, 20 November 2009</title>
          <description><![CDATA[<p>Have some Friday tidbits!</p>

<ul><li><a href="http://www.nature.com/nature/journal/v462/n7271/full/462252a.html">An important biology dataset</a> is losing NSF funding and may fold. Nor (as the article explains) is it the only one. It is impossible to overstate the desperate gravity of the data-sustainability question. Academic libraries, if we are not the white knights here&#8212;and we certainly have been in the past; witness arXiv&#8212;who is?</li>
<li>On a similar theme, <a href="http://www.time.com/time/business/article/0,8599,1936645,00.html">Yahoo pulls the plug on GeoCities</a>. O ye researchers relying on consumer-grade web services, or new startups, <em>have an exit strategy!</em> Consumer-grade services die when they lose money. Jason Scott may not come charging to <em>your</em> rescue.</li>
<li><a href="http://arstechnica.com/science/news/2009/11/partial-h1n1-immunity-can-come-without-exposure-to-virus.ars">H1N1 science depends on a public database of flu immunity data.</a> "As the researchers acknowledge in their paper, the work couldn't have taken place if it weren't for extensive data sharing within the community of flu virus researchers." Data sharing makes possible better, faster science.</li>
<li><a href="http://digitalcuration.blogspot.com/2009/11/data-and-journal-article.html">Data and the journal article</a>. First: if you are saving your data as PDF, <em>stop it</em>. Second: as I suggested to Chris on FriendFeed, there's a serious structural issue with expecting journal publishers to cope with appropriate data archiving: by the time a researcher chooses a journal to publish in, all the decisions about data gathering and representation have already been made&#8212;and they may well have been made badly. The poor journal publisher can't go back in time and fix bad decisions! In our not-yet-standardized data age, early data interventions have to happen close to the researcher, which to me means they need to happen at the institution where the research happens.</li>
<li><a href="http://highearthorbit.com/the-need-for-clear-data-licenses/">The need for clear data licenses.</a> I haven't talked about data licensing here, partly because the current state of intellectual-property law makes me sick at heart, but there's no question that it's an important piece of the data puzzle.</li>
<li>Peer-to-peer technology used for the forces of good: <a href="http://biotorrents.net/index.php">BioTorrents</a>. Datasets vary in size; for the large ones, network bandwidth becomes a sharing problem. Torrenting won't precisely solve the problem, but it certainly increases the size range within which datasets are portable.</li>
<li>Fascinating data project of the week: <a href="http://www.nceas.ucsb.edu/">National Center for Ecological Analysis and Synthesis</a>. What caught my attention is that as I read the project description, it takes public data sharing for granted. NCEAS researchers are not <em>generating</em> data; they are mining existing data. I'm inordinately curious about the disciplinary culture that makes this a feasible thing: what price scooping?</li></ul>

<p>Whew. I have a lot more, but it's Friday.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/tidbits_20_november_2009.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/tidbits_20_november_2009.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/tidbits_20_november_2009.php</guid>
         <category>Tidbits</category>
         
         <pubDate>Fri, 20 Nov 2009 16:18:16 -0600</pubDate>
      </item>
      
      <item>
         <title>... and then what?</title>
          <description><![CDATA[<p>It can be difficult to convince present-focused researchers to give a long-term perspective, such as that of a librarian or archivist, the time of day. (So to speak.) Here's my favorite way to do it: the "&#8230; and then what?" game.</p>

<p>You have digital data. You think it's important. We'll start from there.</p>

<ul><li>Your grant runs out&#8230; and then what?</li>
<li>The graduate student who's been doing all the data-management chores leaves with Ph.D. in hand&#8230; and then what?</li>
<li>Your favorite grant agency institutes a data-sustainability requirement for all grants&#8230; and then what?</li>
<li>Your lab's PI retires&#8230; and then what?</li>
<li>Your instrument manufacturer or favorite software's developer goes out of business&#8230; and then what?</li>
<li>Your whomped-up next-door data center burns up, falls down, then sinks into the swamp&#8230; and then what?</li></ul>

<p>You get the idea. No far-fetched catastrophizing, just all-too-plausible scenarios that researchers really ought to have thought about already but usually haven't. If your service can position itself as the "&#8230; and then what," you're on to something.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/_and_then_what.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/_and_then_what.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/_and_then_what.php</guid>
         <category>Tactics</category>
         
         <pubDate>Tue, 17 Nov 2009 17:20:51 -0600</pubDate>
      </item>
      
      <item>
         <title>Tracking my eyes</title>
          <description><![CDATA[<p>I got a very nice email the other day thanking me for being a clearinghouse for e-research information. I'm not quite sure I am that, but just in case I've become it without noticing&#8230;</p>

<p>What I read in the area and think is worthwhile enough to keep around ends up in a few places, all of which have RSS feeds:</p>

<ul><li>the <a href="http://www.zotero.org/cavlec/items/collection/59824">Data Curation</a> folder in my Zotero (you may also be interested in the <a href="http://www.zotero.org/cavlec/items/collection/1203383">Digital Humanities</a> or <a href="http://www.zotero.org/cavlec/items/collection/1126975">Digital Preservation</a> folders)</li>
<li>the <a href="http://delicious.com/cavlec/toblog">toblog</a> and <a href="http://delicious.com/cavlec/datacuration">datacuration</a> tags in my del.icio.us (items in the "toblog" tag end up in tidbits posts here&#8212;usually)</li></ul>

<p>Happy to share these, and also happy to start up a Zotero group if anyone else is interested in contributing items thereto!</p>

<p>(By the way, one rather annoying thing about the Zotero feed&#8212;I almost always save copies of the item along with the item record, and Zotero dumps both into the RSS feed, which from the consuming end looks like a lot of unnecessary duplication. I apologize for this, and wish Zotero would fix it.)</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/tracking_my_eyes.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/tracking_my_eyes.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/tracking_my_eyes.php</guid>
         <category>Metablogging</category>
         
         <pubDate>Mon, 16 Nov 2009 17:11:59 -0600</pubDate>
      </item>
      
      <item>
         <title>The basic carrot: usage statistics</title>
          <description><![CDATA[<p>BMC Bioinformatics published <a href="http://www.biomedcentral.com/1471-2105/10/S14/S2">this article</a> describing a "data publishing framework" for biodiversity data.</p>

<p>Stripped to its essentials, this article is about carrots for data sharing. Acknowledging that cultural inertia (some of it well-founded) militates against spontaneous data sharing, the authors suggest a way forward.</p>

<p>I'm calling this one out because it has implications for storage-system design. The authors want three things for their public data: persistent identifiers, citation mechanisms, and data usage information.</p>

<p>(For once, I feel good about institutional repositories: they swing two out of three at the minimum, and some manage all three!)</p>

<p>Persistent identifiers seem simple but aren't, necessarily. For example, does a constantly changing dataset get a persistent identifier? How does that identifier know what it's identifying, in that case? Should a persistent identifier be just a URL? What if the domain name goes away or changes? (This is not an idle concern; the University of Illinois, for example, just changed its domain name, and the institutional repository I run is eventually going to lose its separate domain entirely.) What, exactly, gets a persistent identifier? The entire dataset? Files within it? Should a query performable on that dataset also be persistently identifiable? How does that work, exactly? And <em>when</em> does something get its persistent identifier? As soon as it hits the system? Or after it's done and blessed, if it ever is?</p>

<p>Anyway. All of this needs to be hashed out (so to speak). It's not optional, system designers.</p>

<p>Once that's sorted out, citation isn't actually a huge hurdle from where I'm sitting. It's not a technical problem; it's kicking the style manuals into acknowledging data and making citation formats for it.</p>

<p>Usage, now, that's a hurdle. It, too, is utterly necessary for cultural reasons, however. The culture of academia looks kindly on impact measurements, even hopelessly faulty ones. Somehow or other, research impact has to be measured for researchers' careers to advance. Data are no exception.</p>

<p>(In my professional neck of the woods, systems designers ignored the need for usage documentation for entirely too long, which has made my life as an IR manager extraordinarily difficult. I make this post in hopes of avoiding the same mistake in this new arena.)</p>

<p>What counts as a "use" exactly? How does "use" get harmonized over different kinds of access schemes? How does an API "use" compare with an entire download?</p>

<p>I don't know. I encourage systems designers not to get too hung up on such questions. Record all accesses and make the best decisions you can <em>right now</em> about how to present them. Yes, you'll have to rewrite the event-analysis code, probably more than once, so comment it well.</p>
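
<p>A minimal sketch of that advice, assuming nothing about any particular repository platform: keep the raw access events in one place and the current guess at what counts as a "use" in another, so the guess can be rewritten without touching the record. The function names, the event fields, and the "download" and "api" labels are all made up for illustration.</p>

<pre>
```python
import json
import time
from collections import Counter


def record_access(log, dataset_id, kind, nbytes):
    """Append one raw access event; keep every detail, interpret nothing."""
    log.append(json.dumps({
        "time": time.time(),
        "dataset": dataset_id,
        "kind": kind,      # hypothetical labels such as "download" or "api"
        "bytes": nbytes,
    }))


def count_uses(log, counted_kinds=("download", "api")):
    """Today's guess at what a 'use' is; rewrite freely, the raw log stays put."""
    uses = Counter()
    for line in log:
        event = json.loads(line)
        if event["kind"] in counted_kinds:
            uses[event["dataset"]] += 1
    return uses
```
</pre>

<p>Because every access is logged whether or not it currently counts, changing your mind later (say, by no longer counting API calls) means re-running a new analyzer over old events, not regretting data you never collected.</p>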

<p>Do not, however, wait until you have all the answers to write an analyzer. If you do that, you're strangling the open-data movement in its crib. <a href="http://www.biomedcentral.com/1471-2105/10/S14/S2">BMC Bioinformatics explains why</a>.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/the_basic_carrot_usage_statist.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/the_basic_carrot_usage_statist.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/the_basic_carrot_usage_statist.php</guid>
         <category>Tactics</category>
         
         <pubDate>Mon, 16 Nov 2009 16:44:28 -0600</pubDate>
      </item>
      
      <item>
         <title>International Digital Curation Conference</title>
          <description><![CDATA[<p>By way of amplifying the signal: the <a href="http://www.dcc.ac.uk/events/dcc-2009/">5th International Digital Curation Conference</a> is coming up in London in December. I will be there in spirit only, I fear, but I hope there will be a Twitter hashtag I can follow?</p>

<p><a href="http://digitalcuration.blogspot.com/2009/11/5th-international-digital-curation.html">Chris Rusbridge has blogged the program.</a></p>

<p>(If I seem more scatterbrained than usual, it's because most of my spare time and brainspace is currently devoted to building a course I will be teaching online in the spring for Illinois's GSLIS. It's a "Topics in Collection Development" course, which means I have to view things through a lens I'm almost completely unfamiliar with&#8212;I don't do normal collection development, and most of what I know about it is that it scares me to death! I am designing my version to be "how coll-dev is currently changing and may continue to change." Data curation will be included, as will scholarly communication and the serials crisis, institutional repositories, digital collections, digital preservation, and similar things that I actually <em>do</em> know something about. Wish me luck. I will need it.)</p>

<p>I've gotten some good comments to yesterday's poll. Please keep them coming. I <em>know</em> there's more out there!</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/international_digital_curation.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/international_digital_curation.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/international_digital_curation.php</guid>
         <category>Tidbits</category>
         
         <pubDate>Fri, 13 Nov 2009 17:37:11 -0600</pubDate>
      </item>
      
      <item>
         <title>Poll: Where are the institutional programs?</title>
          <description><![CDATA[<p>This is a pushmi-pullyu post. I need some help with an environmental scan, so I'll get us started and the rest of you smart folks can amplify my knowledge.</p>

<p>I want to understand what's going on where with data curation specifically at the institutional level (no NOAA, no ICPSR, none of that) Stateside. Grant-funded is fine, though I'm doubly curious about programs that have been weaned (or are weaning themselves) off the grant money. Here are the programs I know about offhand:</p>

<ul><li>Institutional data curation: <a href="http://www.sdsc.edu/">San Diego Supercomputer Center</a> (right? I'm not entirely sure what they offer vis-a-vis long-term data stewardship), <a href="http://d2c2.lib.purdue.edu/">Purdue's D2C2</a>, <a href="http://datastar.mannlib.cornell.edu/">Cornell's DataStaR</a>.</li>
<li>Subject-specific but still (mostly?) institution-focused: Cornell's <a href="http://cugir.mannlib.cornell.edu/">CUGIR</a> (there must be a lot more GIS out there, mustn't there?), North Carolina's <a href="http://datadryad.org/repo">DRYAD</a>.</li>
<li>Data-curation training: <a href="http://www.lis.illinois.edu/programs/ms/data_curation.html">Illinois</a>, <a href="http://ils.unc.edu/digccurr/aboutI.html">North Carolina</a>.</li></ul>

<p>Tell me what I'm missing, please and thank you.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/poll_where_are_the_institution.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/poll_where_are_the_institution.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/poll_where_are_the_institution.php</guid>
         <category>Miscellanea</category>
         
         <pubDate>Thu, 12 Nov 2009 05:21:15 -0600</pubDate>
      </item>
      
      <item>
         <title>Heisenberg&apos;s Uncertainty Checksum</title>
          <description><![CDATA[<p>So here's an interesting problem I ran into today. You have metadata in an XML file. You want to make the file self-describingly self-correcting, so you want to embed its checksum inside it. The problem is, you can't add the checksum to the XML file without changing the file's checksum!</p>
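
<p>One way out of the tail-chase, sketched here under stated assumptions, is to agree that the checksum covers the document <em>with the checksum field blanked</em>; this is roughly the idea behind the enveloped-signature transform in XML Signature. The element name "checksum" and the three function names are hypothetical, and re-serializing through Python's ElementTree is only a crude stand-in for real XML canonicalization.</p>

<pre>
```python
import hashlib
import xml.etree.ElementTree as ET

CHECKSUM_TAG = "checksum"  # hypothetical element name, not from any standard


def digest_without_checksum(root):
    """Hash the serialized tree with every checksum element's text blanked."""
    clone = ET.fromstring(ET.tostring(root))  # work on a copy, not the original
    for node in clone.iter(CHECKSUM_TAG):
        node.text = ""
    return hashlib.sha256(ET.tostring(clone)).hexdigest()


def embed_checksum(root):
    """Store the digest inside the very document it describes."""
    node = root.find(CHECKSUM_TAG)
    if node is None:
        node = ET.SubElement(root, CHECKSUM_TAG)
    node.text = digest_without_checksum(root)
    return root


def verify_checksum(root):
    """True if the embedded digest matches the rest of the document."""
    node = root.find(CHECKSUM_TAG)
    return node is not None and node.text == digest_without_checksum(root)
```
</pre>

<p>The digest never sees its own value, so writing it into the file doesn't invalidate it. The price is that a plain command-line checksum of the file will no longer match; only a verifier that knows the blanking rule can check it.</p>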

<p>Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/heisenbergs_uncertainty_checks.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/heisenbergs_uncertainty_checks.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/heisenbergs_uncertainty_checks.php</guid>
         <category>Miscellanea</category>
         
         <pubDate>Tue, 10 Nov 2009 18:39:55 -0600</pubDate>
      </item>
      
      <item>
         <title>Collaborative domain-expertise development?</title>
          <description><![CDATA[<p>Libraries do collaborative collection development, through consortia and increasingly via direct institution-to-institution arrangements. Reference and instruction are collaborative endeavors&#8212;look at any social-networking service with lots of librarians and you'll see on-the-spot crowdsourced reference responses.</p>

<p>Perhaps this collaboration instinct will help libraries respond to the challenge of domain expertise for data curation. Do I need to know cheminformatics, or do I just need to buy a cheminformaticist conference potations until I secure her business card?</p>

<p>Formalizing expertise-sharing arrangements strikes me as rather difficult. Nobody wants to be the person everybody across the country calls with questions about ChemML; when would there be time to get any work done? Still, I would have thought that collaborative collection development had too many moving parts to be practical, and it's being done.</p>

<p>In any case&#8230; I have "develop network of domain experts" in the back of my head as a wise thing to do.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/collaborative_domain-expertise.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/collaborative_domain-expertise.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/collaborative_domain-expertise.php</guid>
         <category>Tactics</category>
         
         <pubDate>Tue, 10 Nov 2009 17:38:32 -0600</pubDate>
      </item>
      
      <item>
         <title>Tidbits, 9 November 2009</title>
          <description><![CDATA[<p>Starting off the week with some juicy tidbits:</p>

<ul><li><a href="http://www.xmltoday.org/content/thoughts-mods-and-restful-services-0">An extremely nerdy but (for nerds) fascinating examination of XML</a> and its implications for data modeling. Do we <em>have</em> to reduce everything to a relational model? Really? Perhaps not&#8230; Notably, it seems to me, this article describes fairly nicely how <a href="http://fedora-commons.org/">Fedora</a> works. (For more beating on the humble RDBMS, see <a href="http://cacm.acm.org/blogs/blog-cacm/32212-the-end-of-a-dbms-era-might-be-upon-us/fulltext">this blog post</a>.)</li>
<li><a href="http://go-to-hellman.blogspot.com/2009/09/white-dielectric-substance-in-library.html">White Dielectric Substance in Library Metadata.</a> "Understanding the noise turned out to be more important than understanding the signal." What does that mean for efforts to decide which data to preserve? "I've observed that most people trying to collect metadata go through an early period of thinking it's easy, and then gradually gain understanding of the real challenges." So have I observed this, Mr. Hellman, so have I.</li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0007078">An empirical study of data sharing by authors publishing in PLoS journals.</a> My reaction split neatly in half: half "data are doomed because no one who makes them will lift a finger to save them" and half "surely this could be easier?"</li>
<li><a href="http://blogs.ecs.soton.ac.uk/keepit/2009/09/23/data-repositories-the-next-new-wave/">Steve Hitchcock on how data will change library-managed repositories.</a> My best wishes to Steve as he makes his vision real.</li>
<li><a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf">The Fourth Paradigm</a> (PDF). Microsoft's toe-dip into the data waters. Useful case studies for those thrashing about in planning processes.</li>
<li><a href="http://www.americanscientist.org/issues/pub/2009/3/writing-math-on-the-web/1">Writing math on the web.</a> Because it still makes my head explode that the Web was nominally designed to exchange physics papers, but the math-display problem didn't seem to occur to any of its architects until years later.</li>
<li>And finally, an amazing data project: <a href="http://www.oceanleadership.org/programs-and-partnerships/ocean-observing/ooi/">the Ocean Observatories Initiative</a>.</li></ul>

<p>That should keep everyone out of trouble a while&#8230;</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/tidbits_9_november_2009.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/tidbits_9_november_2009.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/tidbits_9_november_2009.php</guid>
         <category>Tidbits</category>
         
         <pubDate>Mon, 09 Nov 2009 17:48:31 -0600</pubDate>
      </item>
      
      <item>
         <title>No, you can&apos;t have a pony</title>
          <description><![CDATA[<p>I read the <a href="http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/disciplinary-case-studies-life-sciences">RIN report on life-sciences data</a> with interest, a little cynicism, and much appreciation for the grounded and sensible approach I have come to expect from British reports. If you're interested in data services, you should read this report too.</p>

<p>A warning to avoid preconceptions: If you pay too much attention to all the cyberinfrastructure and e-science hype, it's very easy to fall prey to the erroneous notion that most of science is crunching massive numbers via grid computing and throwing out terabytes of data per second.</p>

<p>It ain't so. It never was so. Will it be so in future? Not any time soon, I'm thinking.</p>

<p>The report-writers don't try to soften their error (and much love to them for it): "There is much talk of ‘big science’, and our initial research design presumed that we would be studying large-scale formal collaborations. But we found that most research groups in the life sciences continue to operate on a relatively small scale, and we revised our plans accordingly."</p>

<p>Again, we don't have hard evidence for numbers or weight of small science versus Big Science. If we plan for nothing but Big Science, however, we're making an enormous error in judgment.</p>

<p>There's a good bit of attitude mining in the report; I have little to add to it, so I will merely recommend that you read it. The lack of carrots for data-sharing is a deal-breaker, just as it was for self-archiving, and I agree with RIN that using sticks only will cause fairly serious backlash.</p>

<p>Skipping to the end of the report, then, we find out what researchers want by way of data-curation support. Namely, <em>everything and the kitchen sink</em>. At zero cost to them or their grant agencies, of course. I don't know why any other response would have been expected; it costs researchers nothing to say in a focus group that they want a pony, so why would they <em>not</em> say they want a pony?</p>

<p>At some point, someone will have to tell them <a href="http://www.declan.net/2005/07/22/no-pony/">they can't have a pony</a>. I don't envy that person or agency one bit. Even so.</p>

<p>I believe that individuals and institutions planning data-curation services should take researchers' wants as expressed in this report with a generous dash of salt. No institution can give them what they want, because what they <em>really</em> want is for the problem to be taken care of for them without their involvement. What should be aimed for is giving them what they <em>need</em>.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/no_you_cant_have_a_pony.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/no_you_cant_have_a_pony.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/no_you_cant_have_a_pony.php</guid>
         <category>Tactics</category>
         
         <pubDate>Thu, 05 Nov 2009 17:02:30 -0600</pubDate>
      </item>
      
      <item>
         <title>Stepping away from the shiny</title>
          <description><![CDATA[<p>There is a certain kind of digital project that strikes terror and dismay into the hearts of digital preservationists everywhere. Not a one of us hasn't seen many exemplars. They make me myself feel sad and tired.</p>

<p>They're projects that, no matter their scholarly or design merit, are completely unpreservable because they were built from unsustainable tools, techniques, and materials. What's worse, even a <em>cursory</em> examination with an eye to sustainability would have at least signaled a problem.</p>

<p>It's not the unpreservability so much. It's the obliviousness that makes me hurt inside.</p>

<p>For various reasons, the digital humanities are particularly prone to this sort of thing. Scientists do use unsustainable tools, but often they haven't a choice (thank you for the lock-in, instrument manufacturers) and most times they're at least <em>aware of the problem</em>.</p>

<p>Humanists, on the other hand, will pick up whatever tool seems good to them without even asking themselves whether the result will last past the lifespan of the tool. Then they bring the resulting binary CD-ROM or Flash-based website or whatever to the library with beaming smiles, and are shocked to find out that the library can't help them.</p>

<p>Proprietary tools and formats are often quite shiny. I remember HyperCard well, and so may you. In its day, there wasn't anything shinier. The problem is, following the shiny to the exclusion of all other considerations dooms a shiny project to be less shiny a year later, hardly shiny at all five years later, and <em>completely inaccessible and unusable</em> five years after that.</p>

<p>(I do not kid. Historians and sociologists of early digital culture are deeply distressed at how much "HyperCard art" nearly fell out of reach forever, though there are now emulators capable of dealing with much of it.)</p>

<p>There are better ways to proceed. They may well be less shiny at first, but the secret is that shiny can almost always be added to solid sustainable data later on, through mashups or interface redesign or whatever takes your fancy. Once its platform is thoroughly obsolete, though, a project may well not be rescuable in any form. Worse yet, piling otherwise-sustainable raw materials into an unsustainable platform destroys the sustainability of those raw materials, too. I've seen it happen!</p>

<p>So please, step away from the shiny and think.</p>

<p>(Thanks to <a href="http://twitter.com/pseudonymTrevor">@pseudonymTrevor</a> and other Twitter friends for inspiring this post, and possibly other ones&#8212;I am still pondering the intersection of "never done"-ness and sustainability.)</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/stepping_away_from_the_shiny.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/stepping_away_from_the_shiny.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/stepping_away_from_the_shiny.php</guid>
         <category>Praxis</category>
         
         <pubDate>Wed, 04 Nov 2009 17:15:13 -0600</pubDate>
      </item>
      
      <item>
         <title>Making standards that work</title>
          <description><![CDATA[<p>One phenomenon that will be&#8212;indeed, already is&#8212;utterly unavoidable in the data-curation space is the creation of standards. I once heard <a href="http://community.oclc.org/hecticpace/">Andrew Pace</a> say that standards are like toothbrushes: everybody thinks they're great, but nobody wants to use anybody else's.</p>

<p>Be that as it may, standards development and compliance is one way to make everybody's data play nicely with everybody else's data. It's not the only way, to be sure; one very important way that I'm sure we'll also see more of is Being The Only Game In Town. ICPSR manages this quite successfully, and so does the Digital Sky Survey. If you want to be important in the data spaces dominated by either of these large players, you play by their rules, just that simple.</p>

<p>When there's no big player to lay down the law, though, standards development becomes more attractive. How do you make a standard, then? More to the point, how do you make a <em>good</em> standard, a standard that works, a usable standard, a standard that will last?</p>

<p>I liked <a href="http://adambosworth.net/2009/10/29/talking-to-dc/">this blog post by Adam Bosworth about standards development</a> very much. I think it captures much of the excellence that goes into successful standards as well as the dysfunction attending failed ones. I do want to add a fillip of my own, though, based on my own experience helping to build standards and trying to use standards built by other people.</p>

<p><em>When you're in a roomful of people tasked with building a standard, make sure the room contains representation from every group of people who will be asked or required to use it.</em> That emphatically includes the non-technical and the non-specialist. It goes double or triple if the standard will affect existing technology installations: you <em>must</em> have someone in that standards room who uses the existing technology! No, a developer of the existing technology does not fulfill this requirement, because the distance between developers' understanding and users' understanding is often vast.</p>

<p>If the non-technical, non-specialist representative in the room can't understand the standard, it will fail. If that representative can't produce data that fit the standard, likewise. I agree with Bosworth's reservations about RDF; I myself have trouble understanding it and putting it to use, despite a decade's experience with markup, and I believe the tribulations such folk as I face when trying to deal with it have retarded its adoption significantly.</p>

<p>What happens when this rule about representation is flouted, but standards are published anyway, is standards that fall apart under real-world use. I will adduce <a href="http://www.openarchives.org/pmh/">OAI-PMH</a> as an example. It follows quite a few of Bosworth's recommendations: it's simple (I have explained it in twenty minutes to library-school students), largely human-readable, focused, precise about encodings, in possession of real implementations, and free on the web.</p>

<p>It is also flawed. Huge projects built on it have found its flaws impossible to bypass and expensive to work around (see <a href="http://arxiv.org/abs/cs.DL/0601125">Lagoze et al. 2006</a> for how NSDL ran aground on OAI-PMH's inadequacies). </p>

<p>The major flaw, to my mind, isn't difficult to explain or to understand: OAI-PMH has no way built in to report errors in the metadata itself. Its error responses cover bad protocol requests, not bad records. In a protocol standard <em>built for communication of and about metadata</em>, nobody in the standards-design process ever seems to have asked the (to me) simple and obvious question, "What happens if the metadata is malformed or otherwise wrong?"</p>

<p>Anyone who's worked on the ground with repositories of any stripe knows that metadata problems, sometimes gross problems, are par for the course. For that matter, any librarian can explain the pitfalls of metadata and citation creation at great length. I honestly can't tell you why OAI doesn't seem to have on-the-ground repository managers and other librarians capable of raising such practical issues working on its standards bodies.</p>

<p>I can, however, tell you that they should. The latest OAI development, <a href="http://www.openarchives.org/ore/">OAI-ORE</a>, contains exactly the same no-error-reporting weakness I just pointed out in OAI-PMH. Yes, some of the underlying technologies that OAI-ORE is built on contain certain kinds of error reporting, but the aggregation of those errors that can be reported is only a subset of the errors that I believe will crop up.</p>

<p>To make standards that work, include people on the standard-design team who work with the processes underlying the standard. Now that you know this&#8212;go forth and standardize!</p> <a href="http://scienceblogs.com/bookoftrogool/2009/11/making_standards_that_work.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/11/making_standards_that_work.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/11/making_standards_that_work.php</guid>
         <category>Praxis</category>
         
         <pubDate>Mon, 02 Nov 2009 17:17:35 -0600</pubDate>
      </item>
      
      <item>
         <title>L&apos;esprit d&apos;escalier</title>
          <description><![CDATA[<p>If you're not reading comments here, you're missing out. For reasons I don't entirely understand, some of the best in the business are seeing fit to comment here. They have more to teach than I do!</p>

<p><a href="http://digitalcuration.blogspot.com/">Chris Rusbridge</a> (of, among other things, <a href="http://www.ariadne.ac.uk/issue46/rusbridge/">this thought-provoking meditation on digital preservation</a>) has been spotted here, and whenever he pops up he makes me think about things. This time, I was thinking about disciplinary expertise, and how I need to make a better case that less of it is necessary for data curation than generally admitted.</p>

<p>I hope we can at least admit that data curators don't have to be researchers themselves. Do researchers have to be involved in the curation of their own data? Absolutely! Data curation starts at the beginning of the study-design process, and continues all the way through <em>and past</em> publication. But that doesn't mean that researchers have to do everything. The exact division of labor is still being sorted out; that's partly what this blog is about. That the labor must and will be divided appears to be beyond dispute.</p>

<p>The corollary to this is that a data curator will almost always know less about the data, viewed from certain axes, than the researcher does. She may well know <em>more</em> about it viewed from some other axes&#8212;file format details, metadata crosswalking, whatever. Some things, though, she won't know and presumably won't have to.</p>

<p>So what <em>does</em> she have to know about the research and the discipline in order to be a responsible data steward? And does she have to walk into the process with that knowledge pre-existing, or can she learn it as she works on the research project? How much of what she needs to know will transfer from other projects she's worked on? </p>

<p>Cards on the table: in the absence of much evidence either way, I think that someone with the intelligence, disciplinary background, and intellectual curiosity of a good subject-specialist librarian can learn enough "on the job" to hit the 80/20 point pretty easily&#8212;and 80/20 is more than good enough for a successful campus data-curation program in my book. For the other 20% of edge cases, a program can hire specially.</p>

<p>I'll use a True Story about myself as an anecdote. Feel free to quarrel with me (civilly, please) in the comments.</p>

<p>Some years ago I did a small contract job for the <a href="http://www.humanitiesebook.org/">ACLS E-book project</a>. They were working on rekeying and marking up an art-history book with extended segments of polytonic Greek text. Their keying vendor took one look and said "no way do we key polytonic Greek." So ACLS told them to key the rest of it and leave placeholders for the Greek. They came to me asking whether I could key the Greek in proper Unicode without snarling up the markup.</p>

<p>I have never studied Greek. I do not speak Greek. I do not write Greek. I do not <em>read</em> Greek, except in the sense that I recognize the letters and can laboriously sound them out. Don't ask me what in the world the accents and squiggly bits in polytonic Greek mean; I haven't the slightest clue.</p>

<p>Not snarling up markup? <em>That</em> I can manage. After an hour or so of research, I found fonts and tools that could enable me to do the keying job correctly and with reasonable efficiency. ACLS and I agreed on a price, and off I went. I didn't know what the squiggles meant, but I could reproduce them, and that was plenty good enough.</p>

<p>When it came time to proof my work, I didn't rely just on my own eyes; that would have been stupid. I called in my classics-major husband. He found typos and the odd homeoteleuton, which I duly fixed up. I sent the result back to ACLS, and they were happy enough to pay me, so there that is.</p>

<p>And there we have it: a partnership between a tech geek and a reasonably well-trained domain specialist (kindly note that my husband was an <em>undergraduate</em> classics major) took care of a data job. I think this can happen more often in more fields. </p>

<p>The chief barrier is the belief that it can't.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/10/lesprit_descalier.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/10/lesprit_descalier.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/10/lesprit_descalier.php</guid>
         <category>Praxis</category>
         
         <pubDate>Thu, 29 Oct 2009 20:19:32 -0600</pubDate>
      </item>
      
      <item>
         <title>Can we just give the problem to the libraries?</title>
          <description><![CDATA[<p>I pointed out <a href="http://lib.stanford.edu/files/pasig2009sf/pasig2009sf_lesk.pdf">Mike Lesk's slideshow</a> in my last tidbits post, finding it a good critical pr&#233;cis of the data problem. It's pleasantly aware of human problems, human problems many treatments of cyberinfrastructure (including, unfortunately, <a href="http://net.educause.edu/ir/library/pdf/EPO0906.pdf">this otherwise useful call to action from Educause</a>) wholly ignore.</p>

<p>So wince and flinch at the design (black Arial on white? really? in 2009?), but read the slideshow anyway.</p>

<p>I do want to pick apart the slide from which I took the title of this post. I reproduce the said slide's text in full:</p>

<blockquote><h3>Can we just give the problem to the libraries?</h3><p>As a professor in a library school, I wish I could say that libraries were the obvious organization to take care of data. They understand keeping things for a long time and arranging to find them later. It would be a sensible new activity to balance a decrease in foot traffic into book collections. But...</p><ul><li>They have not been ambitious in this area; libraries feel under budget pressure and don't want new tasks.</li><li>They lack the subject area knowledge to deal with complex data sets in scientific areas.</li><li>They often lack the technical skills for advanced data handling.</li></ul></blockquote>

<p>I have no quarrel whatever with Lesk's first point. Libraries have absolutely been timid about this, and they still are&#8212;not without reason, either! This, to me, is the buck-stopper, the Berlin Wall, the concrete bollards. If library administrators shy away from this, or give it lip-service only, Lesk is right and there's nothing to be done. It won't matter how many <em>librarians</em> are ready and willing to do this work, if they're not allowed to or not given sufficient resources and authority to.</p>

<p>How likely is this outcome? In my estimation, <em>more likely than not</em>. My estimation is admittedly colored by this being very early days yet, but as I've remarked before, <a href="http://scienceblogs.com/bookoftrogool/2009/08/if_not_now_when.php">the longer any interested group dithers, the more likely it is that the action will be elsewhere</a>. The more the action moves away from libraries, the more likely library administrators are to breathe a quiet sigh of relief and turn away from the problem altogether.</p>

<p>So what is a librarian who wants this work to do? Well, one answer is to keep an eye on discipline-specific projects, those that are larger than any single institution, the up-and-coming ICPSRs and Sloan Digital Sky Surveys. For those interested in data curation inside an institution, I think the answer may well be to learn enough to insinuate oneself onto research teams directly through their in-house IT arms. I may revisit this answer later; in-house IT is starting to become just cost-ineffective enough that some recentralization may happen. In that case, the would-be data curator has more options. Either way, though&#8212;the wise data curator does not attach himself limpetlike to the library. The action may well be elsewhere.</p>

<p>What is a researcher or funding agency or think-tank that wants libraries to take on this work to do? Researchers need to <em>ask</em>. Nothing gets library priority so fast as a well-articulated request from faculty; that goes double in disciplines where physical library spaces are waning in importance. Agencies and think-tanks: I'd recommend being an awful lot clearer about what the services provided look like and how they need to be staffed. Laundry-lists of skills are useless without an estimate of FTE and budget; such an estimate is noticeably lacking in every single discussion of this problem I've ever read.</p>

<p>I half-agree, half-disagree with Lesk's second point. There's a lot of disciplinary knowledge in academic librarianship. We don't select books blindly! We do it by taking heed of what our local researchers are doing. Many selectors and liaisons assigned to particular disciplines have degrees, sometimes advanced degrees, in that discipline. In the social sciences, by the way, data librarians with appropriate disciplinary knowledge <em>already exist</em>.</p>

<p>The problem isn't the non-existence of disciplinary knowledge; it's the uneven spread of it. For any given discipline at a research university, I'd guess it's a better-than-even bet that the library has a librarian somewhere with appropriate disciplinary expertise&#8212;but it's <em>not</em> a certainty.</p>

<p>Of course, there's also a question of how much disciplinary expertise is actually <em>necessary</em> for this work. <a href="http://managemetadata.org/blog/">Diane Hillmann</a> remarked to me at ALA this summer that "[researchers] all think they're special snowflakes," but in her experience the basic sustainability questions don't differ all that much from dataset to dataset. That's what I think, too, with the added wrinkle that disciplinary specialists may actually be <em>too close</em> to their data to have a good read on how others will want to use and query it. An outsider perspective may well be useful!</p>

<p>(<a href="http://scienceblogs.com/christinaslisrant/2009/10/classic_post_from_the_archivet.php">The real problem is one of first impressions and secret handshakes</a>, as my SciBling Christina adroitly points out in the context of reference interviews.)</p>

<p>I could very nearly recycle the answers I just gave to Dr. Lesk's second point for his third. In aggregate, research libraries have quite a lot of technology expertise. How much any given library has isn't predictable, and may well not be sufficient.</p>

<p>If we cross the answer to the second point with the answer to the third, we approach the real conundrum: sufficient disciplinary expertise and sufficient technical expertise tend not to coexist within the same librarian. Take me, for example: if it's textual or linguistic data, I'm your librarian&#8212;that's my educational background! I can apply common sense and well-honed data-management expertise to numeric or instrument data, but I can't apply disciplinary knowledge because I don't have it. Selectors and liaisons, conversely, likely understand quite a lot about local research in the disciplines they serve, but they mostly don't sling Python and XSLT, nor do they tend to have the digital-preservation knowhow that I do.</p>

<p>John Saylor of Cornell gave what I believe to be the appropriate answer to this problem in his talk at ALA Annual: a technical team dedicated to data needs to work <em>with</em> librarians who have disciplinary expertise in order to solve problems. The disciplinary coverage achievable with this staffing model won't reach 100%, but it'll get as close as seems feasible. Nota bene: without broad participation by disciplinary specialists across the library, a data-curation service suffers and may well fail!</p>

<p>Lesk's objections are serious, pertinent, and pointed. They are not, I believe, unanswerable, but answering them will take considerable vision and will on the part of research-library administrators. Time will tell.</p> <a href="http://scienceblogs.com/bookoftrogool/2009/10/can_we_just_give_the_problem_t.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/10/can_we_just_give_the_problem_t.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/10/can_we_just_give_the_problem_t.php</guid>
         <category>Tactics</category>
         
         <pubDate>Wed, 28 Oct 2009 17:12:01 -0600</pubDate>
      </item>
      
      <item>
         <title>Classification and a bit of subject analysis</title>
          <description><![CDATA[<p>It's been a while since I did anything on my series about library ways of knowing. If you'd like to refresh your memory:</p>

<ul><li><a href="http://scienceblogs.com/bookoftrogool/2009/08/the_classical_librarian.php">The classical librarian</a></li>
<li><a href="http://scienceblogs.com/bookoftrogool/2009/08/the_humble_index.php">The humble index</a></li>
<li><a href="http://scienceblogs.com/bookoftrogool/2009/09/classification.php">Classification</a></li></ul>

<p>Today I'll finish my discussion of classification, and distinguish it from subject analysis, since that distinction often seems to confuse, especially in our digital age.</p>

<p>So if we'll recall, the goal we set for ourselves was to <em>collocate</em> physical books on shelves in such fashion that their arrangement would be useful to information-seekers. With most non-fiction, that means collocation by subject, by what the books are <em>about</em>.</p>

<p>(There are lengthy philosophical discussions of "aboutness" in the information science literature. I recommend avoiding them with all your strength. They make my eyes bleed.)</p>

<p>To make this work, we have to map knowledge-space onto physical space: divide up human knowledge into convenient slots to assign books to. This is, you might say, a tall order: an ontology of infinite domain, but where each item can only fit in one place.</p>

<p>In the States, most libraries use one of two such maps: the <a href="http://www.oclc.org/dewey/">Dewey Decimal System</a> or the <a href="http://www.loc.gov/catdir/cpso/lcco/">Library of Congress Classification</a>. About the kindest thing one can say for Dewey Decimal is that it was a product of its peculiar time; for today's purposes, it is heavily overnumbered in religion, for example, and undernumbered in science. Perhaps worse, its sense of the world is not exactly immediately intuitive to the modern eye: why the long separation of geography from the so-called "social sciences," of which psychology is apparently not one?</p>

<p>This is one danger of any would-be universal classification. Our sense of the world and its knowledge changes over time, sometimes quite a lot and quite suddenly. If our ontology doesn't keep up, it serves its purposes less and less well. How easy <em>is</em> it, really, to find the right shelf in a library of any size organized by Dewey Decimal? Considerations such as these no doubt informed the shift of <a href="http://www.libraryjournal.com/article/CA6698264.html">one library (and later others)</a> to the <a href="http://www.bisg.org/committee-cat-2-bisac-committees.php">BISAC codes</a> typically found in large bookstores.</p>

<p>Another danger of the universal classification is that its specificity is of necessity somewhat limited. Many medical libraries, for example, ditch Library of Congress Classification because it just doesn't drill down far enough into medical minutiae for their needs. The <a href="http://www.nlm.nih.gov/class/">NLM Classification</a> fills the gap.</p>

<p>With physical books, we cannot escape the constraint that each book must go in one and only one place on the shelf. Once we're away from the physical item, that constraint disappears. The card catalogue was the first desperately clever escape from the tyranny of the physical item: in a card catalogue, the same book could be "shelved" by author, title, and one or more (usually three to five, to avoid overproliferation of cards) subjects assigned to it by the cataloguer.</p>

<p>This meant the addition of a subject-heading system to the classification vocabulary. You can't just add more classification numbers to the physical item; you then imply that it goes in more than one place! This is the difference between Library of Congress Classification and <a href="http://id.loc.gov/authorities/about.html#lcsh">Library of Congress Subject Headings</a>. Under most circumstances, the LCC number assigned to a book will correspond closely in meaning to the first LCSH assigned in the book's catalogue record. They are still distinct systems, however! Don't confuse them. Librarians chuckle behind their hands when you do.</p>
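
<p>For the programmers in the audience, the distinction can be sketched as a data structure. This is a toy model in Python, not how any real catalogue system works: the <code>Book</code> class, the call numbers, and the headings are all placeholders I made up, not real LCC or LCSH assignments. Each book carries exactly one call number, which fixes its place on the shelf, and any number of subject headings, each of which files it again in the catalogue.</p>

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Book:
    title: str
    call_number: str  # classification: exactly one shelf location
    subjects: list = field(default_factory=list)  # headings: any number


# Illustrative records only; call numbers and headings are placeholders.
books = [
    Book("Birds of the Midwest", "QL 100", ["Birds", "Natural history"]),
    Book("Painting Birds", "ND 200", ["Birds", "Painting"]),
    Book("A History of Painting", "ND 100", ["Painting"]),
]

# Classification: one linear shelf order in which each book appears once.
shelf = sorted(books, key=lambda b: b.call_number)

# Subject headings: an index in which one book may be filed many times.
catalogue = defaultdict(list)
for book in books:
    for heading in book.subjects:
        catalogue[heading].append(book.title)

print([b.title for b in shelf])
print(dict(catalogue))
```

<p>"Painting Birds" sits in one spot on the shelf, next to the other painting book, yet the catalogue finds it under both "Birds" and "Painting".</p>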

<p>Of course, digital items don't have to live in just one space. Classification is therefore slowly giving way to subject analysis and similar ways of relating items to each other as digital libraries develop.</p>

<p>And that, in a remarkably simplified nutshell, is how books are arranged on shelves in libraries. It doesn't happen by magic!</p> <a href="http://scienceblogs.com/bookoftrogool/2009/10/classification_and_a_bit_of_su.php#commentsArea">Read the comments on this post...</a>]]></description>
         <link>http://scienceblogs.com/bookoftrogool/2009/10/classification_and_a_bit_of_su.php</link>
         <guid>http://scienceblogs.com/bookoftrogool/2009/10/classification_and_a_bit_of_su.php</guid>
         <category>Jargon</category>
         
         <pubDate>Mon, 26 Oct 2009 17:20:55 -0600</pubDate>
      </item>
      
   </channel>
</rss>
