Please don't do this! A word about keywords

By dsalo on August 18, 2009.

I see a lot of metadata out there in the wild woolly world of repositories. Seriously, a lot. Thesis metadata, article metadata, learning-object metadata, image metadata, metadata about research data, lots of metadata.

And a lot of it is horrible. I'm sorry, it just is—and amateur metadata is, on the whole, worse than most. I clean up the metadata I have cleaning rights to as best I am able, but I am one person and the metadata ocean is frighteningly huge even in my tiny corner of the metadata universe.

So here's a bit of advice that would save me a lot of frustration and effort, and is likely to help the people who really need to read your stuff find it.

When you're doing keywords? Anything that shows up elsewhere in your record is not a keyword, okay?

Authors and other creators are not keywords, save for the rare case that the item is somehow autobiographical. Titles are not keywords. (Really. They're not. They may contain a keyword or two, but that's not the same thing.) Any search engine is going to turn up authors and titles that don't appear in the keyword field; trust me on this one. Likewise, if every single word in the full text of the item is a keyword, then nothing is.
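For the script-inclined, here is a rough sketch of that rule in Python. It is only an illustration, not a feature of any repository software: the function name `prune_keywords`, its parameters, and the sample record are all made up for this example, and it only catches keywords that repeat the whole title or a whole creator name.

```python
def prune_keywords(keywords, title, creators, autobiographical=False):
    """Drop keywords that merely repeat the title or a creator's name.

    Hypothetical helper for illustration only; real records would need
    fuzzier matching (name order, initials, subtitles, and so on).
    """
    def norm(text):
        # Compare case-insensitively and ignore extra whitespace.
        return " ".join(text.lower().split())

    repeated = {norm(title)}
    if not autobiographical:
        # Creators only count as keywords when the item is about them.
        repeated.update(norm(name) for name in creators)

    kept = []
    seen = set()
    for keyword in keywords:
        key = norm(keyword)
        if not key or key in repeated or key in seen:
            continue  # empty, already elsewhere in the record, or a duplicate
        seen.add(key)
        kept.append(keyword.strip())
    return kept


# Made-up record, purely for demonstration.
cleaned = prune_keywords(
    ["A Study of Widgets", "Jane Doe", "widgets", "Widgets ", "manufacturing"],
    title="A Study of Widgets",
    creators=["Jane Doe"],
)
print(cleaned)
```

Passing `autobiographical=True` leaves creator names alone, which covers the rare case mentioned above.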

The point of keywording is not to shovel in every single word that someone might conceivably search for. Leave that kind of indexing to Google and other full-text indexing engines. The point of keywording in this day and age is to distinguish this item from all the other items that look vaguely like it, to help folks who arrive there make the snappiest judgment possible about whether this item is what they need.

When you add keywords with a backhoe instead of an eyedropper, you are not raising the chance your item will be read or used. You are lowering it, because most people who arrive at the item will roll their eyes at the lengthy list of keywords and bounce right back out looking for something more targeted.

Keep your keywords to the point and as few as possible. This metadata librarian thanks you for it. So will your readers.

Should there be some kind of standard taxonomy of keywords? I always kind of thought that there should be hierarchies you could look up so that if you have an article about, say, electrical outlets in the world, there would be some EE and some IEE standard keywords that might help put it in its place. In this case I am thinking about categories that deal with who makes them vs something like safety regulations. That is where the keywords could be handy.

Still I have this feeling that keywords are not going to be human generated in the future anyway, or rather not directly by the authors but by some smart auto indexer with human rating and adjustment. Beyond the obvious anyway like in the arXiv with HEP or other field of study tags. But I guess that is all keywords should do.

Hi, Markk; thanks for the great comment.

What you're asking about is called a "controlled vocabulary" in librarian parlance. There's not just one -- there are thousands. Some, like Library of Congress Subject Headings, are broad but shallow, covering many disciplines and fields of endeavor, but not to inordinate depth.

Some are narrow but deep; often, they are associated with article databases in a specific discipline. I'm not an engineer or an engineering librarian, but a quick trawl turned up this list of engineering-related vocabularies. (As usual with standards, the fun part is that there's so many of 'em!)

A problem with using these vocabularies in self-help environments like IRs -- well, there are several problems. One is that current-generation software is not terribly friendly to them. (I'll spare you the sob story of what it takes to integrate a controlled vocabulary with DSpace.) Another is that many of these vocabularies are not free to use; they are often copyrighted and can be used only under license.

I think you're right about automated keywording. It's certainly an active research area! (Joint Conference on Digital Libraries and ASIST Annual often turn up work on the subject.) Honestly, all we're still waiting for is an actual (and preferably pluggable) production system rather than speculative research projects.

It won't solve all our problems -- try automatic-keywording an image or sound file! -- but it's a start.

Heh. So I'm sitting here trying to make the catalog we use cough up a title that I know we have in the collection just by using the keyword search option. The book: Wired for War by P.W. Singer. Eye-opening to see what works.
"unmanned vehicles" nothing.
"unmanned aerial vehicles" A few hits but not the book - even though there is like one on the cover.
"robot war" That did it.

Yes, there are a bunch of categorizations. That is the problem right? The reason I think automated plus annotation will have to be the way to go is based on perhaps outdated experience.

My MS thesis (from longer ago every second... let's just say before HTML existed but after SGML) was on a method to try to categorize Numeric Control (NC) (what an old acronym) programs back in the day. These are (were) the things that drove cutting machines, lathes and the like. There were (and still are, I am sure) zillions of NC programs sitting around for the exact same part that someone started at a different location, so they don't look the same. It ought to have been easy to slap part numbers on these, no problem meta-data wise, the simplest kind of identifier! Then we could have combined them to produce nice clean BOMs ... well no, not in the world of contractors with different schemes, outsourced projects that used supposed US Govt approved, standardized lists of parts, all differently, and BOMs that even the best networked (proto Object Oriented) databases ended up being so weirdly custom that it didn't matter. It was amazing what the most crude automatic categorization would produce - in a good way. So that has colored my thinking ever since in terms of how messy even the most straightforward labeling can be.

Anyway, on a related topic, I saw that on this year's Beloit College list of things incoming freshmen never dealt with in their lives were =card catalogs=. The literal cards I guess! They were the first great "metadata" I ever dealt with and somehow still seem efficient in my memory.

I try to aim for a few categories with a few defined values in each one. For example, is it about hardware or software? What is the operating system (or manufacturer)? What is the product name? Is it historical info such as old purchase orders (not to be changed) or is it current (and improvable)? Is it concepts or procedures? What year does it apply to?

That sort of limited set of metadata helps people to know how to treat the information. M
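A tiny sketch of what "a few categories with a few defined values" could look like in Python. The facet names (`subject`, `status`, `kind`, `year`) and the function `check_record` are invented here for illustration; only the values come from the questions in the comment above.

```python
# Each facet has a short, closed list of permitted values (assumed names).
FACETS = {
    "subject": {"hardware", "software"},
    "status": {"historical", "current"},
    "kind": {"concepts", "procedures"},
}


def check_record(record):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for facet, allowed in FACETS.items():
        value = record.get(facet)
        if value not in allowed:
            problems.append(
                f"{facet}: {value!r} is not one of {sorted(allowed)}"
            )
    # The year is free-form but should at least be a whole number.
    year = record.get("year")
    if not isinstance(year, int):
        problems.append(f"year: {year!r} is not a whole number")
    return problems


good = {"subject": "software", "status": "current",
        "kind": "procedures", "year": 2009}
bad = {"subject": "firmware", "status": "current", "kind": "procedures"}

print(check_record(good))
print(check_record(bad))
```

Keeping the permitted values this short is the point: anyone filling in a record can see every option at a glance.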

"Helping people know how to treat the information" is absolutely the right way to think about this problem.

Markk, the catalogue card is still with us in the form of the MARC and AACR2 standards. It is, shall we say, an extremely mixed blessing. I hope to talk about why in a little while.
