So anyone who’s spent any time at all with Google Books (henceforth GB) has probably noticed some really bizarre – I mean truly strange – metadata: messed-up titles, authors, publication years, and categories that are totally hit or miss. I tend to take for granted that everyone has seen the memes that go around in the library web 2.0 circles, but of course that’s crazy. So I’ll just throw this at you scattershot.
At a meeting at the Berkeley iSchool on the GB settlement (another thing I should blog about, but I don’t have time for the research needed), linguist Geoff Nunberg tore into GB over this. See his blog post and then the PDF of his slides.
Google responded. One of their arguments is that their data providers gave them bad data – or at least conflicting, individually-okay data. Heh. If you mashed up every US and UK academic library catalog, you would still have better metadata than they have, and they only had to pick one library (the originator) for each scan and then map the LCSH to the BISAC. Seriously. Certain fields would be weird, but we’ve had machine-readable standardized records for decades, and decent cataloging for decades before that. And they have the whole load from our massive union catalog, WorldCat.
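To be clear about how simple the step I’m describing is: a subject crosswalk is basically a lookup table from an LCSH heading (taken from the originating library’s MARC record) to a BISAC category code. Here’s a minimal sketch of that idea – the table itself is hypothetical, not Google’s actual mapping, and the specific BISAC codes shown are illustrative examples that may not match the current official list exactly:

```python
# Hypothetical LCSH-to-BISAC crosswalk table. The headings are real LCSH-style
# headings; the BISAC codes are illustrative and not guaranteed to be current.
LCSH_TO_BISAC = {
    "Cookery": "CKB000000",               # COOKING / General
    "World War, 1939-1945": "HIS027100",  # HISTORY / Military / World War II
    "Libraries": "LAN025000",             # LANGUAGE ARTS / Library Science
}

def map_heading(lcsh_heading, default="NON000000"):
    """Return the BISAC code for an LCSH heading, falling back to a
    NON-CLASSIFIABLE-style default when the crosswalk has no entry."""
    return LCSH_TO_BISAC.get(lcsh_heading, default)

print(map_heading("Cookery"))
print(map_heading("Some heading nobody cataloged"))
```

A real crosswalk would need thousands of entries and rules for subdivided headings, but the point stands: given one clean originating record per scan, the mapping itself is not the hard part.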
I mean, that doesn’t stop me from using it, but I’m just using it for natural-language full-text searching and linking out from my library’s catalog, which is cool. Linguists, apparently, thought they could rely on the metadata when using GB as a corpus for analysis.