I got drawn into a debate about copyrights and factual data this week that felt like it merited its own blog post. It was kind of surreal new media debating – I was going back and forth with a smart guy from the UC Berkeley School of Information on a friend’s Facebook wall for most of a day on the topic. It was definitely a change from the typical FB chatter, and in some ways the character count constraints of a wall post were formative to the debate. But some of the questions raised deserved long answers, and the issues involved are complicated, subtle, and non-obvious. Hopefully moving the conversation here will permit a more complete airing…
So, the whole thing started when Jon Philips, a dear friend and running-dog creative commoner, posted that we need to have a slogan-level campaign about data. He suggested “data is not copyrightable” and the comments started to fly. Some jumped in and said it was an empty slogan. Being a pedantic wonk, I jumped in to point out that this was a technically correct and truthful statement. And then it got interesting, at least from my perspective.
We got embroiled in the weeds of the issue – the definitions of data, the importance of compilation, selection, and arrangement, the funky international regimes around data protection, and so forth. I’m going to try to untangle them a little bit here.
To the first point, about data and copyright. It’s important to remember that copyright protects “creative expressions” – not facts of nature, or factual observations about nature. So the picture of Mt Everest is copyrighted, not the measurements of its height. And no matter how many explorers perished to get the measurement, that doesn’t make the measurement creative or confer an automatic copyright on it.
Nor does copyright accrue to a lot of similar measurements. Taking a whole batch of measurements doesn’t change the inherent non-creative nature of measuring compared to authoring. Datum, data, it’s all the same thing from a copyright perspective. That’s why I got all pedantic and pointed out that no matter how much data you collect, it doesn’t become copyrighted by virtue of being in a collection.
Dictionaries will give you definitions of data. Here are three. One of the weird things about the law, though, is that the law doesn’t really care what the common definitions are – the words are defined in cases or in the legislation itself, not in dictionaries…
In the US at least, the most important case is called “Feist v Rural”. From St. Wikipedia:
Since facts are purely copied from the world around us, O’Connor concludes, “the sine qua non of copyright is originality”. However, the standard for creativity is extremely low. It need not be novel, rather it only needs to possess a “spark” or “minimal degree” of creativity to be protected by copyright.
In regard to collections of facts, O’Connor states that copyright can only apply to the creative aspects of collection: the creative choice of what data to include or exclude, the order and style in which the information is presented, etc., but not on the information itself. If Feist were to take the directory and rearrange them it would destroy the copyright owned in the data.
The court ruled that Rural’s directory was nothing more than an alphabetic list of all subscribers to its service, which it was required to compile under law, and that no creative expression was involved. The fact that Rural spent considerable time and money collecting the data was irrelevant to copyright law, and Rural’s copyright claim was dismissed.
So the key here is that the compilation of facts didn’t make it creative. One telephone entry, 10,000 telephone entries, no difference. The key is to get the facts out without touching the creative stuff: the look and feel, the style sheets, and so forth. It’s kind of like playing Operation – don’t touch the sides, and you don’t get a shock (or in this case, a cease and desist letter).
This was good news for telephone book competition. It’s fantastic news for science. It means that at least in the US, there is a right granted to us as users to extract, republish, integrate, federate, query, mash, mix, fold, spindle, and mutilate data to our own ends. It is an essential legal component of the emerging web of data. If copyrights traveled with either an individual datum or a data set, we’d have attribution stacking problems that make the miserable 27 pages of illegible Wikipedia attribution look like a walk in the park, and that’s just for today. In 30 years, which is less than halfway towards the end of a copyright whose death-clock began ticking today, it’d be a nightmare.
So “Data isn’t copyrightable” might be a poor slogan. But it’s an essential truth. It sits at the basis of a lot of really important legal aspects around data. If data were copyrightable it might be easier to understand, but it’d be a lot worse to use.
This creates what a lot of folks seem to think is an incentive problem. How can we create incentives for people to create data collections if there’s no protection? To avoid writing a book-length post here, I’ll come back to this in another post this weekend, and follow it with a third post that looks at the funky problems that bad contracts create for data and the funky ways that people try to deal with them, for better and for worse…