The Book of Trogool

I see a lot of metadata out there in the wild woolly world of repositories. Seriously, a lot. Thesis metadata, article metadata, learning-object metadata, image metadata, metadata about research data, lots of metadata.

And a lot of it is horrible. I’m sorry, it just is, and amateur metadata is, on the whole, worse than most. I clean up the metadata I have cleaning rights to as best I am able, but I am one person and the metadata ocean is frighteningly huge even in my tiny corner of the metadata universe.

So here’s a bit of advice that would save me a lot of frustration and effort, and is likely to help the people who really need to read your stuff find it.

When you’re doing keywords? Anything that shows up elsewhere in your record is not a keyword, okay?

Authors and other creators are not keywords, save for the rare case that the item is somehow autobiographical. Titles are not keywords. (Really. They’re not. They may contain a keyword or two, but that’s not the same thing.) Any search engine is going to turn up authors and titles that don’t appear in the keyword field; trust me on this one. Likewise, if every single word in the full text of the item is a keyword, then nothing is.
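
For the script-inclined, the rule above can be sketched as a small cleanup pass. This is only a rough illustration, not how any particular repository does it: the record layout, the field names (`title`, `creators`, `keywords`), and the sample thesis are all made up for the example. It drops any keyword that merely repeats the title or a creator's name, and collapses duplicates that differ only in case or spacing.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def prune_keywords(record: dict) -> list[str]:
    """Return the record's keywords minus any that repeat the title or a creator.

    The field names used here are hypothetical, chosen for illustration only.
    """
    # Values that already appear elsewhere in the record.
    already_present = {normalize(record.get("title", ""))}
    already_present.update(normalize(name) for name in record.get("creators", []))

    kept = []
    seen = set()
    for keyword in record.get("keywords", []):
        key = normalize(keyword)
        # Skip blanks, repeats of the title or a creator, and duplicate keywords.
        if not key or key in already_present or key in seen:
            continue
        seen.add(key)
        kept.append(keyword.strip())
    return kept


# A made-up record, purely for demonstration.
record = {
    "title": "Soil Moisture Sensing in Vineyards",
    "creators": ["A. Example"],
    "keywords": [
        "Soil Moisture Sensing in Vineyards",  # just the title again
        "a. example",                          # just the author again
        "capacitance probes",
        "Capacitance  Probes",                 # duplicate, different case and spacing
        "irrigation scheduling",
    ],
}

print(prune_keywords(record))
```

A pass like this only catches exact repeats. Deciding which of the surviving keywords actually distinguish the item is still a human job.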

The point of keywording is not to shovel in every single word that someone might conceivably search for. Leave that kind of indexing to Google and other full-text indexing engines. The point of keywording in this day and age is to distinguish this item from all the other items that look vaguely like it, to help folks who arrive there make the snappiest judgment possible about whether this item is what they need.

When you add keywords with a backhoe instead of an eyedropper, you are not raising the chance your item will be read or used. You are lowering it, because most people who arrive at the item will roll their eyes at the lengthy list of keywords and bounce right back out looking for something more targeted.

Keep your keywords to-the-point and as few as possible. This metadata librarian thanks you for it. So will your readers.

Comments

  1. #1 Markk
    August 18, 2009

    Should there be some kind of standard taxonomy of keywords? I always kind of thought that there should be hierarchies you could look up so that if you have an article about, say, electrical outlets in the world, there would be some EE and some IEE standard keywords that might help put it in its place. In this case I am thinking about categories that deal with who makes them vs something like safety regulations. That is where the keywords could be handy.

    Still, I have this feeling that keywords are not going to be human generated in the future anyway, or rather not directly by the authors but by some smart auto indexer with human rating and adjustment. Beyond the obvious anyway, like in the arXiv with HEP or other field of study tags. But I guess that is all keywords should do.

  2. #2 Dorothea Salo
    August 18, 2009

    Hi, Markk; thanks for the great comment.

    What you’re asking about is called a “controlled vocabulary” in librarian parlance. There’s not just one — there are thousands. Some, like Library of Congress Subject Headings, are broad but shallow, covering many disciplines and fields of endeavor, but not to inordinate depth.

    Some are narrow but deep; often, they are associated with article databases in a specific discipline. I’m not an engineer or an engineering librarian, but a quick trawl turned up this list of engineering-related vocabularies. (As usual with standards, the fun part is that there’s so many of ‘em!)

    A problem with using these vocabularies in self-help environments like IRs — well, there are several problems. One is that current-generation software is not terribly friendly to them. (I’ll spare you the sob story of what it takes to integrate a controlled vocabulary with DSpace.) Another is that many of these vocabularies are not free to use; they are often copyrighted and can be used only under license.

    I think you’re right about automated keywording. It’s certainly an active research area! (Joint Conference on Digital Libraries and ASIST Annual often turn up work on the subject.) Honestly, all we’re still waiting for is an actual (and preferably pluggable) production system rather than speculative research projects.

    It won’t solve all our problems — try automatic-keywording an image or sound file! — but it’s a start.

  3. #3 LibraryGuy
    August 18, 2009

    Heh. So I’m sitting here trying to make the catalog we use cough up a title that I know we have in the collection just by using the keyword search option. The book: Wired for War by P.W. Singer. Eye opening to see what works.
    “unmanned vehicles” nothing.
    “unmanned aerial vehicles” A few hits but not the book – even though there is like one on the cover.
    “robot war” That did it.

  4. #4 Markk
    August 18, 2009

    Yes, there are a bunch of categorizations. That is the problem, right? The reason I think automated plus annotation will have to be the way to go is based on perhaps outdated experience.

    My MS thesis (from longer ago every second… let’s just say before HTML existed but after SGML) was on a method to try to categorize Numeric Control (NC) (what an old acronym) programs back in the day. These are (were) the things that drove cutting machines, lathes and the like. There were (and still are, I am sure) zillions of NC programs sitting around for the exact same part that someone started at a different location, so they don’t look the same. It ought to have been easy to slap part numbers on these, no problem metadata-wise, the simplest kind of identifier! Then we could have combined them to produce nice clean BOMs … well no, not in the world of contractors with different schemes, outsourced projects that used supposedly US Govt approved, standardized lists of parts, all differently, and BOMs that even in the best networked (proto Object Oriented) databases ended up being so weirdly custom that it didn’t matter. It was amazing what the most crude automatic categorization would produce – in a good way. So that has colored my thinking ever since in terms of how messy even the most straightforward labeling can be.

    Anyway, on a related topic, I saw that on this year’s Beloit College list of things incoming freshmen never dealt with in their lives were =card catalogs=. The literal cards I guess! They were the first great “metadata” I ever dealt with and somehow still seem efficient in my memory.

  5. #5 Mona Albano
    August 19, 2009

    I try to aim for a few categories with a few defined values in each one. For example, is it about hardware or software? What is the operating system (or manufacturer)? What is the product name? Is it historical info such as old purchase orders (not to be changed) or is it current (and improvable)? Is it concepts or procedures? What year does it apply to?

    That sort of limited set of metadata helps people to know how to treat the information. M

  6. #6 Dorothea Salo
    August 19, 2009

    “Helping people know how to treat the information” is absolutely the right way to think about this problem.

    Markk, the catalogue card is still with us in the form of the MARC and AACR2 standards. It is, shall we say, an extremely mixed blessing. I hope to talk about why in a little while.
