Data, Copyrights, And Slogans, Oh My.

I got drawn into a debate about copyrights and factual data this week that felt like it merited its own blog post. It was kind of surreal new media debating - I was going back and forth with a smart guy from the UC Berkeley school of information on a friend's Facebook wall for most of a day on the topic. It was definitely a change from the typical FB chatter and in some ways the character count constraints of a wall post were formative to the debate. But some of the questions raised deserved long answers, and the issues involved are complicated and subtle and non-obvious. Hopefully moving the conversation here will permit more complete airing...

So, the whole thing started when Jon Philips, a dear friend and running-dog creative commoner, posted that we need to have a slogan-level campaign about data. He suggested "data is not copyrightable" and the comments started to fly. Some jumped in and said it was an empty slogan. Being a pedantic wonk, I jumped in to point out that this was a technically correct and truthful statement. And then it got interesting, at least, from my perspective.

We got embroiled in the weeds of the issue - the definitions of data, and importance of compilation and selection and arrangement, the funky international regimes around data protection, and so forth. I'm going to try to untangle them a little bit here.

To the first point, about data and copyright. It's important to remember that copyright protects "creative expressions" - not facts of nature, or factual observations about nature. So the picture of Mt Everest is copyrighted, not the measurements of its height. And no matter how many explorers perished to get the measurement, that doesn't make the measurement creative or confer an automatic copyright to it.

Nor does copyright accrue to a lot of similar measurements. Taking a whole batch of measurements doesn't change the inherent non-creative nature of measuring compared to authoring. Datum, data, it's all the same thing from a copyright perspective. That's why I got all pedantic and pointed out that no matter how much data you collect, it doesn't become copyrighted by nature of being in a collection.

Dictionaries will give you definitions of data. Here are three. One of the weird things about the law though is that the law doesn't really care what common definitions are - the words are defined in cases, not dictionaries, or in legislation itself...

In the US at least, the most important case is called "Feist v Rural". From St. Wikipedia:

Since facts are purely copied from the world around us, O'Connor concludes, "the sine qua non of copyright is originality". However, the standard for creativity is extremely low. It need not be novel, rather it only needs to possess a "spark" or "minimal degree" of creativity to be protected by copyright.

In regard to collections of facts, O'Connor states that copyright can only apply to the creative aspects of collection: the creative choice of what data to include or exclude, the order and style in which the information is presented, etc., but not on the information itself. If Feist were to take the directory and rearrange them it would destroy the copyright owned in the data.

The court ruled that Rural's directory was nothing more than an alphabetic list of all subscribers to its service, which it was required to compile under law, and that no creative expression was involved. The fact that Rural spent considerable time and money collecting the data was irrelevant to copyright law, and Rural's copyright claim was dismissed.

So the key here is that the compilation of facts didn't make it creative. One telephone entry, 10,000 telephone entries, no difference. The key is to get the facts out without touching the creative stuff: the look and feel, the style sheets, and so forth. It's kind of like playing Operation - don't touch the sides, and you don't get a shock (or in this case, a cease and desist letter).

This was good news for telephone book competition. It's fantastic news for science. It means that at least in the US, there is a right granted to us as users to extract, republish, integrate, federate, query, mash, mix, fold, spindle, and mutilate data to our own ends. It is an essential legal component of the emerging web of data. If copyrights traveled with either an individual datum or a data set, we'd have attribution stacking problems that make the miserable 27 pages of illegible Wikipedia attribution look like a walk in the park, and that's just for today. In 30 years, which is less than halfway towards the end of a copyright whose death-clock began ticking today, it'd be a nightmare.

So "Data isn't copyrightable" might be a poor slogan. But it's an essential truth. It sits at the basis of a lot of really important legal aspects around data. If data were copyrightable it might be easier to understand, but it'd be a lot worse to use.

This creates what a lot of folks seem to think is an incentive problem. How can we create incentives for people to create data collections if there's no protection? I'll come back to this in another post this weekend to avoid writing a book-length one here, and a third post that looks at the funky problems that bad contracts create on data and the funky ways that people try to deal with them, for better and for worse...


OK, I'll ask the obvious (to me) question. How do charts and graphs fit into this?

1) Is arranging data in a chart different from arranging it alphabetically?

2) Does it make a difference when charts and graphs use selected data rather than the whole set?

3) Is there really any difference between copying somebody else's chart and using their data to generate your own, virtually identical, chart?

4) Sorry if you've covered this before--I just discovered your blog. :-)

Hi - I will cover this in the next set of posts, but briefly stated, the chart itself is likely to be arranged in a way that gets copyright. However, there isn't a lot of coverage against your extracting the data and publishing a chart with different colors, for example, or legends. And if there is only one way that the data can be displayed, the copyright coverage gets even fainter (this is usually called the "merger" doctrine - as in, there's only one way to do it, therefore it's not creative to do it).

no matter how much data you collect, it doesn't become copyrighted by nature of being in a collection

This gets icky in Europe, where databases can be protected by copyright under some circumstances... perhaps a topic for another future post?

Yes, I'm writing a second post, hopefully online tonight, that addresses the weird situation in the EU. There is also the "sweat of the brow" issue in the British Commonwealth tradition to address, and "Crown copyright" to boot.

You're assuming the "data" is made up of facts. The use of the term data continues to expand, and its common meaning includes more and more situations in which data is not facts but copyrightable works in themselves. I work in publishing, for example, and the slogan you mentioned is laughable in a context where the data are articles and other such works. I know you really mean that a data format doesn't, by itself, make something copyrightable, but the slogan doesn't convey that at all.

All very true - that's why I tried to restrict myself to discussing facts. Databases of photographs - like flickr.com - are another good example of this. This is why I note in the post that the slogan isn't a good one, but that the essential idea of "data is not copyrightable" in the sciences is an important truth. Data doesn't equal database - databases can contain lots of things that aren't data, like articles.

Of course, in the modern scientific world, we have so many articles that we desperately need to feed them as data to software for processing, which is why I support Open Access, but that's off topic to today's ranting.

And don't forget Australia - our leading court case on the issue (Desktop Marketing, which is about telephone books no less) goes almost directly against this. It says that the labour of arranging the data alone is enough to confer copyright on the final work. Even if the labour would have happened anyway, and you've just arranged the data in the most logical way.

Which means that yes, our telephone book is copyrighted in Australia. And you can't create competing products that arrange the same data in a similar manner (eg with the same headings, ordered alphabetically etc).

There's a case (ICE TV) before our High Court at the moment that we're all hoping will overturn this, or at least narrow the circumstances in which mere labour (and not creativity) will be enough to give you protection. It's a real problem having our law out of line with the rest of the world's, and does create some ugly monopolies over public data. The case is basically to do with TV guides, and TV networks using copyright over the guides to prevent TIVO-like services from being developed by third parties.

Hi Jessica!

Yes, I'll get to the UK and Australia regimes as well. What started as a set of short FB wall posts is rapidly morphing into a book chapter. But Jessica is, as usual, correct and well spoken! This is why the role of the Creative Commons Australia team is so important...
