Documents and Data...

Last month I was on Dr. Kiki's Science Hour. Besides being a lot of fun (despite my technical problems, which were part of my recent move to GNU/Linux and away from Mac!), I also discovered that at least one person I went to high school with is a fan of Dr. Kiki, because he told everyone about the show at my recent high school reunion. Good stuff.

In the show, I did my usual rant about the web being built for documents, not for data. And that got me a great question by email. I wrote a long answer that I decided was a better blog post than anything else. Here goes.

Although I'm familiar with the Creative Commons & Science Commons, the interview really helped me understand the bigger picture of the work you do. Among many other significant and timely anecdotes, I received the message that the internet is built around document search and not data search. This comment intrigued me immensely. I want to explore that a little more to understand exactly what you meant. Most importantly, I want to understand what you believe the key differences between the documents and the data are. From one perspective, the documents contain the data, from another, the data forms the documents.

True, in some cases. But in the case of complex adaptive systems - like the body, the climate, or our national energy usage - the data are frequently not part of a document. They exist in massive databases which are loosely coupled, and are accessed by humans not through search engines but through large-scale computational models. There are so many layers of abstraction between user and data that it's often hard to know where the actual data at the base of a model reside.

This is at odds with the fundamental nature of the Web. The Web is a web of documents. Those documents are all formatted the same way, using a standard markup language, and copies of them are all sent around using the same protocol. Because the language allows for "links" between documents, we can navigate the Web of documents by linking and clicking.
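To make the "links in a standard markup language" point concrete, here is a minimal sketch of how a program can pull the links out of a web document. It uses Python's standard `html.parser` module; the `LinkCollector` class, the `extract_links` function, and the sample page are hypothetical names made up for this illustration, not part of any real crawler.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag in a document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the link targets found in one HTML document, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# A made-up page, assumed only for this example.
page = '<p>See <a href="http://example.org/a">A</a> and <a href="/b">B</a>.</p>'
print(extract_links(page))
```

The point of the sketch is that the same few lines work on any web page, because every page uses the same markup. Nothing comparable exists for most scientific databases.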

There's more fundamental stuff to think about. Because the right to link is granted to creators of web pages, we get lots of links. And because we get lots of links (and there aren't fundamental restrictions on copying the web pages) we get innovative companies like Google that index the links and rank web pages, higher or lower, based on the number of links referring to those pages. Google doesn't know, in any semantic sense, what the pages are about, what they mean. It simply has the power to do clustering and ranking at a scale never before achieved, and that turns out to be good enough.
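As a rough illustration of ranking by links, the sketch below counts how many distinct pages link to each page and sorts by that count. It is a deliberate simplification, not Google's actual method (which, among other things, weights a link by the importance of the page it comes from), and the `toy_web` graph and the `rank_by_inbound_links` function are invented for this example.

```python
from collections import Counter


def rank_by_inbound_links(link_graph):
    """Rank pages by the number of distinct other pages that link to them.

    link_graph maps each page to the list of pages it links to.
    Returns (page, inbound_count) pairs, highest count first.
    """
    # Start every known page at zero so unlinked pages still appear.
    inbound = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):  # count each linking page only once
            if target != source:     # ignore a page linking to itself
                inbound[target] += 1
    # Sort by count (descending), then by name so ties are stable.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))


# A made-up four-page web, assumed only for this example.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

for page, count in rank_by_inbound_links(toy_web):
    print(page, count)
```

Note that the ranking never looks at what any page says. It only needs the links, which is why it works without understanding meaning, and why it has so little to work with in a world of data that has almost no links.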

But in the data world, very little of this applies. The data exist in a world almost without links. There is no accepted standard language, though some are emerging, to mark up data. And even if you had one, all you would get is another problem - the problem of semantics and meaning. So far at least, the statistics aren't good enough to help us really structure data the way they structure documents.

From what you posited and the examples you gave, I envision a search engine which has the capacity to form documents out of data using search terms, e.g. enter two variables and get a graph as a result instead of page results. Not too far from what 'Wolfram Alpha' is working on, but indexing all the data rather than pre-tabulated information from a single server/provider. Perhaps I'm close but I want to make sure we're on the same sheet of music.

I'm actually hoping for some far more basic stuff. I am less worried about graphing and documents. If you're at that level, you've a) already found the data you need and b) know what questions you want to ask about it.

This is the world in which one group of open data advocates lives. It's the world of apps that help you catch the bus in Boston. It's one that doesn't worry much about data integration, or data interoperability, because it's simple data - where is the bus and how fast is it going? - and because it's mapped against a grid we understand, which is...well, a map.

But the world I live in isn't so simple. Doing deeply complex modeling of climate events, of energy usage, of cancer progression - these are not so easy to turn into iPhone apps. The way we treat them shouldn't be with the output of a document. It's the wrong metaphor. We don't need a "map" of cancer - we need a model that tells us, given certain inputs, what our decision matrix looks like.

I didn't really get this myself until we started playing around with massive-scale data integration at Creative Commons. But since then, in addition to what we do here, I've been to the NCBI, I've been to Oak Ridge National Lab, I've been to CERN...and the data systems they maintain are monstrous. They're not going to be copied and maintained elsewhere, at least, not without lots of funding. They're not "webby" like mapping projects are. There aren't a lot of hackers who can use them, nor is there a vast toolset to use with them.

So I guess I'm less interested in search engines for data than I am in making sure that people who are building the models can use crawlers to find the data they want, and that they are legally allowed to harvest that data and integrate it. Doing so is not going to be easy. But if we don't design for that world, for model-driven access, then harvest and integration will quickly approach NP levels of complexity. We cannot assume that the tools and systems that let us catch the bus will let us cure cancer. They may, someday, evolve into a common system, and I hope they do - but for now, the iPhone approach is using a slingshot against an armored division.

This is an incredibly important post, and very clearly written as well. Thank you.

PZ is making another mistake here. He is assuming that O'Keefe's boat isn't always stocked like this. More than likely, O'Keefe is a not unusual GOP operative, someone like Limbaugh who is into some weird BDSM stuff.

And where in the hell did he get a huge boat anyway, so young? I didn't know being a GOP slime mold paid so well.

Very informative! Thanks for a well-written post. I often think of data, how easy it is to obtain and how it flows. Your post has given me a great deal more to think about. Thank you!

I am interested in so many of the same things you are, and you have given me some real directions I can explore. Again I really appreciate your active presence on the Internet.

I mean, really?? I'm a scientist, and just reading that even made *my* eyes glaze over. If one thing they're trying to convey is the importance and relevance of the scientist's research to GQ readers, what percentage of the readers are really going to walk away with a deeper understanding of what Dr. Jamieson does by reading that description? It would have been a small thing to ask each participant to submit a layman-friendly version of their research (their "elevator talk" description, for example) for GQ to include.

Finally--one of the "scientists" is Dr. Oz. What is he doing in there? One, I would think he's already well-known enough; why not save that spot for another scientist? Two, yes, I know he's actually done research and published, and is on the faculty at Columbia. Fantastic. He's also a serious woo peddler, who has even featured everyone's favorite "alternative" doc, Joseph Mercola, on his talk show, and discussed how vaccines may be playing a role in autism and allergies (despite mounds of evidence to the contrary). This seems to completely contradict their goal of "research funding as a national priority," since Oz is often (and Mercola is always) highly critical of "mainstream medicine." I really don't understand his inclusion, and think it's to the detriment of the rest of the campaign.

But the world I live in is not so simple. Deeply complex modeling of climate events, of energy usage, of cancer progression - these are not so easy to turn into iPhone apps. They should not be treated as the output of a document.

Just happening upon this post now. It is a great point that you make, thank you. Since it has been a while since you wrote this, I'm wondering whether you have anything to add to your thoughts. If so, where can I read it?

John Wilbanks,

I heard your interview on Dr. Kiki's Science Hour. I am so happy you blog and use twitter, and I appreciate it. I am interested in so many of the same things you are, and you have given me some real directions I can explore. Again I really appreciate your active presence on the Internet.

I hate that I didn't spot this when you first wrote it, but it fits with my sense that there's something quietly building around these ideas this fall and winter. Perhaps 2011 will be more interesting.