Authority control, then and now

Since the end of the year is a fairly quiet time for my particular professional niche, I've taken the opportunity to do some basic name authority control on author name-strings in the repository.

Some basic what on what, now? Welcome back to my series on library information management and jargon.

The problem is simple to understand. Consider me as an author. I took my husband's surname upon marriage; fortunately, I hadn't published anything previously, but I might have done—and if I had, how would you go about finding everything I've written, if it was published under two different names? "Dorothea" is a fairly distinctive given name, especially in my age cohort, but I do share it with other creators.

Now consider creators whose names are not written in Roman characters. The many and varied romanizations of the composer Tchaikovsky may give pause, though my personal favorite example is a certain Libyan leader who wrote a book or two. (Click over and then hit the plus beside "400's: Alternate Name Forms.")

Libraries confronted this problem when the search technology of choice was the card catalogue. The outline of a solution emerges: to avoid wasteful duplication of cards, all the cards representing titles by a given author should be in one place under one name, but it should also be possible to pop in a single card for each additional name variant so that searchers know which variant is hiding the good stuff. ("Chaikowsky, Peter Ilich: see Tchaikovski, Piotr Ilyich, 1840-1893.")

This means choosing a preferred name variant, of course. Ideally, we'd like this to be consistent across libraries, so that the devotee of Russian music who learns the preferred variant in her home library will easily find what she needs at any other library.

There are additional wrinkles as well: it does happen that different authors wind up with the same name, and for library purposes, that's no good. My husband David, for example, shares his name with a book-writing swimming coach. Libraries chose to use birth years—and, only if necessary, death years—to disambiguate.

Aha, you say. This is why not all author names in library catalogues have attached dates. This is why not all authors with listed birth dates have death dates, even when they'd have to be older than Methuselah to be living still. Yes, this is why. Dates in author headings started strictly as a disambiguation measure; the swim coach didn't have his birth year beside his name until my husband turned up and wrote a book. Of late, there have been raucous arguments among cataloguers in libraryland about adding death dates as a matter of course.
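For the programmers in the audience, the scheme so far (one preferred heading per author, see-references from the variants, dates only when two headings would otherwise collide) can be sketched in a few lines of Python. This is a toy, not a description of any real library system: the `CardCatalogue` class and its method names are invented for illustration, and the headings are the ones from the see-reference quoted above.

```python
class CardCatalogue:
    """Toy card catalogue: one preferred heading per author, plus see-references."""

    def __init__(self):
        self.titles = {}  # preferred heading -> titles filed under it
        self.see = {}     # variant heading -> preferred heading

    def _check_unused(self, heading):
        # Every heading must identify exactly one author.
        if heading in self.titles or heading in self.see:
            raise ValueError(f"heading already in use, add dates to disambiguate: {heading}")

    def add_author(self, preferred, variants=()):
        self._check_unused(preferred)
        self.titles[preferred] = []
        for variant in variants:
            self._check_unused(variant)
            self.see[variant] = preferred

    def resolve(self, heading):
        """Follow a see-reference, if any, to the preferred heading."""
        return self.see.get(heading, heading)

    def add_title(self, heading, title):
        self.titles[self.resolve(heading)].append(title)

    def titles_by(self, heading):
        return list(self.titles[self.resolve(heading)])


catalogue = CardCatalogue()
catalogue.add_author(
    "Tchaikovski, Piotr Ilyich, 1840-1893",
    variants=["Chaikowsky, Peter Ilich"],
)
catalogue.add_title("Tchaikovski, Piotr Ilyich, 1840-1893", "Swan Lake")

# Searching under the variant still lands on the one place the cards are filed.
print(catalogue.resolve("Chaikowsky, Peter Ilich"))
print(catalogue.titles_by("Chaikowsky, Peter Ilich"))
```

Note that the heading itself is doing duty as the identifier here, exactly as on the cards. Trying to add a second author under a heading already in use raises an error, which is the code's way of saying "go find a birth year."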

All of this activity—choosing preferred name variants such that each name listing remains unique, listing other name variants with the preferred, organizing by-author displays accordingly, coping with name changes—is called "name authority control." (It has an analogue for subject work, sensibly enough called "subject authority control." This verges on the topic of controlled vocabularies, which is definitely one for another post. Or six.) For catalogue cards, this solution is remarkably elegant and entirely functional. For computer-based record management—well.

The trouble is that in the library scheme, the preferred heading is itself the identifier, and a heading changes whenever a date is added or an author changes names. Relational-database experts are howling right now at the idea that a primary key (what's used to identify a particular row of information, a particular item, in a database) would ever change. The whole point of a primary key is its immutability! Ask for record number 91346342, always get the same record. You never, ever, ever change that record ID. Ever. Really, not ever. If a particle of information can change, it shouldn't be used as a primary key!
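What the database people would do instead looks roughly like the sketch below, which uses Python's built-in `sqlite3` module. The table layout, the author heading, and the book title are all made up for illustration; this is not how any particular ILS stores its data. The point is only that works refer to the author by an opaque number, so the human-readable heading is free to change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author (id INTEGER PRIMARY KEY, heading TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE work (title TEXT NOT NULL, "
    "author_id INTEGER NOT NULL REFERENCES author(id))"
)

# Hypothetical author and title, invented for this example.
cur = conn.execute(
    "INSERT INTO author (heading) VALUES (?)", ("Example, Author, 1900-",)
)
author_id = cur.lastrowid
conn.execute(
    "INSERT INTO work (title, author_id) VALUES (?, ?)",
    ("An Example Book", author_id),
)

# The cataloguers decide to add a death date. Only the heading changes;
# the key, and every work pointing at it, stays put.
conn.execute(
    "UPDATE author SET heading = ? WHERE id = ?",
    ("Example, Author, 1900-1980", author_id),
)

row = conn.execute(
    "SELECT a.id, a.heading, w.title "
    "FROM work w JOIN author a ON a.id = w.author_id"
).fetchone()
print(row)
```

With the heading as the key, that one `UPDATE` would instead have to touch every work record that mentions the author.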

Linked-data experts are howling as well: why don't all these people have URIs? (If you remember your analogies from the SAT, database:primary key::RDF:URI. Roughly, anyway.) Well, they do, now, thanks to VIAF. Here's my VIAF URI (no, I have no idea why my birth year is included in my authority string, as my name by itself is unique in authority data; ask a cataloguer) to look at. Feel free to hunt for your own URI.

To some librarians, all this business of immutable identifiers may sound like specious wrangling, but it's not: it's actually a major disjunction among cataloguing practice, the databases underlying ILSes, and the perennially-emerging world of linked-data mashups via RDF. Inexpert programmer that I am, the idea of programming around library methods of authority control makes my head hurt. It leads to real problems making online catalogues work well (never mind library systems that aren't tied into authority control, such as digital-library platforms and institutional repositories), and making library data play nicely with other people's data. When gearhead librarians and other technologists say "library data is siloed," this is exactly the sort of thing they mean.

You may, particularly if you are a hard scientist, have noticed another hole in this system: you don't get into it unless you have written a book. (Exceptions, yes, for editors and composers and book illustrators and whatnot. However.) I, for example, had two or three articles and book chapters come out before co-authoring a book published in 2008. I didn't have an authority record until the book was catalogued. If all you've published are articles, you don't have an authority record, sorry.

This is becoming a serious problem! If it were just people like me struggling with it, that wouldn't signify; as a librarian, I'm supposed to struggle with this sort of thing. I learned hotshot DIALOG-searching tricks in library school to get around article databases' lack of name authority control, for instance. Right now, I've built up a strategy for finding physicists' and engineers' first names that mostly works, though I do wish whatever weird graduate-school midnight hazing ceremony that deprives these worthy people of their given names in favor of their initials would wither away and die. (I am joking. Mostly. This phenomenon, though of course it isn't the result of hazing, can be maddeningly difficult to rectify, especially when the author in question is a graduate student who either doesn't graduate or doesn't go on to an academic career.)

No, the real problem concerns the changing nature of performance measurement in academia, mostly in the sciences to date. As journal impact factors wane in importance (not nearly fast enough for me!), the importance of measuring the impact of individual articles and other publications via citations and download counts rises. How are we to measure this anything like correctly for a given author if we can't reliably match articles to authors?
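To make the matching problem concrete, here is a small Python sketch of the best one can do with bare name-strings. The function names and the sample names are hypothetical, and the rule (same surname, given names that agree either in full or by initial) is a deliberately naive assumption rather than anyone's production algorithm.

```python
def parse(name):
    """Split 'Surname, Given Names' into (surname, [given-name parts])."""
    surname, _, given = name.partition(",")
    parts = given.replace(".", " ").split()
    return surname.strip().lower(), [p.lower() for p in parts]


def could_match(a, b):
    """True if two name-strings might refer to the same person.

    Surnames must be equal. Given names are compared pairwise: if either
    side is a bare initial, only the first letter has to agree.
    """
    surname_a, given_a = parse(a)
    surname_b, given_b = parse(b)
    if surname_a != surname_b:
        return False
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


# Invented names: one initials-only byline, two fuller candidates.
print(could_match("Smith, J.", "Smith, Jane"))
print(could_match("Smith, J.", "Smith, John"))
print(could_match("Smith, Jane", "Smith, John"))
```

The initials-only byline is compatible with both fuller names, even though those two are clearly different people. Any citation or download count attached to "Smith, J." cannot be credited to either one on the strength of the name alone.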

In an article published earlier this year, I wrote that there was a ferment of activity around the question of author authority, and what would come of it all was far from clear. I'm happy to say that clarity is emerging, in the form of ORCID: the Open Researcher and Contributor ID initiative. This effort looks to me to have critical mass and brainpower to make a difference: publishers, libraries, technologists, and research funders are all involved.

In the meantime, I plod through the repo's author listings, making what minimal order I may, very desirous of a better solution.



Seems like you could just as well have titled this "Anatomy of a Clusterfuck". The obvious thing, from an outsider's POV, is to recognize that systems of organization change over time, and while you can't predict the future, you can at least graciously accept it when it comes. Making author identity systems backwards-compatible with earlier systems is good, but there's absolutely no reason to port ugly but necessary hacks from earlier systems forward into systems that don't need them, and yes, I'm talking about citation styles.

As someone who is a database wiz with an Information Science degree, and who is considering a Masters in Information and Library Science, this doesn't shake me. It's simple, really.

Allow aliases for authors. Each alias points back to the original author and you're good. They can change their name and their new works will still reference back to the original author name. No violation of primary keys.

Mr Gunn: I wouldn't care about citation standards so much if the actual papers had full names! But often (I am looking at you, AIP) those too are initialisms.

Tony P: Yes, that is the solution, and it's unquestionably where ORCID will be going.

The Open Researcher and Contributor ID sounds like a great initiative. However, I wonder if it will help you out that much without local authority control in the system that you use? If the system you were using had name authority control, it would be a relatively easy change to add an identifier to the top level authority record and then display the id in every record. However, since it doesn't have this ability, it would seem you are in the same position of trying to identify all of the records so you can update them with the Open Researcher and Contributor ID. Over time you would also be depending on users to remember to add the ID to every new author entry in the repository. It would seem that for any system to take full advantage of the new initiative, it would need to have local name authority control. Good luck with everything.

By Nate Sarr on 19 Dec 2009

Yes, implementing whatever ORCID comes up with is a distinctly non-trivial proposition. I would guess that there will be some sort of query mechanism to at least come up with candidate IDs for a large batch of names based on article titles or DOIs. As large datastores -- arXiv, OAIster -- start to be reduced to order, it will become easier to match up unknown entities in smaller datastores.

Even so, I shouldn't be at all surprised if ORCID finds out what I'm finding out: viz. and to wit, that there are not a few one-shot wonders whose identities are so obscure as to be impossible to resolve.

Over time, presumably, this will get easier, as author IDs are assigned to articles as matter-of-factly as DOIs are now.
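A guess at what such a batch lookup might look like, in Python. To be clear, this is not ORCID's API or anyone else's: the reference datastore is a plain dictionary, and every DOI, name, and author ID in it is invented for illustration. The only idea being shown is that a DOI narrows an ambiguous name-string down to the people known to have written that particular article.

```python
# Hypothetical reference datastore: DOI -> list of (name-string, author ID).
REFERENCE = {
    "10.5555/example.1": [("Smith, Jane", "id-0001"), ("Jones, Amir", "id-0002")],
    "10.5555/example.2": [("Smith, John", "id-0003")],
}


def surname(name):
    """Return the lower-cased surname from 'Surname, Given Names'."""
    return name.partition(",")[0].strip().lower()


def candidate_ids(local_records, reference):
    """For each (DOI, name-string) in the local repository, collect the IDs of
    authors recorded on the same DOI with the same surname."""
    candidates = {}
    for doi, names in local_records:
        for name in names:
            candidates[(doi, name)] = {
                ref_id
                for ref_name, ref_id in reference.get(doi, [])
                if surname(ref_name) == surname(name)
            }
    return candidates


# Invented local records: the same initials-only string on two articles,
# plus one article the reference datastore has never heard of.
local = [
    ("10.5555/example.1", ["Smith, J."]),
    ("10.5555/example.2", ["Smith, J."]),
    ("10.5555/example.9", ["Obscure, O."]),
]
result = candidate_ids(local, REFERENCE)
for key, ids in result.items():
    print(key, sorted(ids))
```

The two "Smith, J." bylines come back with different candidate IDs because the articles differ, while the third record comes back empty: a one-shot wonder that no amount of querying will resolve.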

"I wonder if it will help you out that much without local authority control in the system that you use? If the system you were using had name authority control, it would be a relatively easy change to add an identifier to the top level authority record and then display the id in every record."

Heh. I AM local authority control in the system I use. :)

But when ORCID really gets going, I expect that will change.