Seed Media Group

November 5, 2009

Distributed Science, Part 2

I got a lot of feedback on my last post, in which I argued that open source is the wrong metaphor for science, because it ties us too closely to the artifact that is open source software. The core of my argument remains the same - science is not software, and we shouldn't treat it the way we treat software. But I got a few comments, here on the blog and in email, that are worth looking at.

Here's comment #1.

You cite openwetware and the biobricks registry, but if you look closer, openwetware is a wiki, not a website about open source wetware tech. To my knowledge, other than the people over at diybio, there have been no signs of anyone with an understanding of free and open source software infrastructure (not the legalese- the toolchains) applying the concepts to the world of open source science.

This comment illustrates my point by missing it, which is that we should not be applying the understanding of software to science. In software, we the humans are in charge. We write the code. We compile it. Everything exists inside a system that we built, that is at least somewhat intelligently designed. Bringing this "understanding" to science means we shove a science peg into a software slot. The idea that "open source science" should be a site about wetware tech betrays a focus on the construction of tech, which is indeed the point of software.

But science isn't like software. Science is about extending the boundaries of our ignorance, not making technology. The difference between making technology (which is the point of software) and making discoveries (the point of science) is the root of the failure of the "open source science" metaphor. Science is about creating knowledge that doesn't exist and exposing ignorance that does exist, not about writing source code that we control.

In honor of his recent passing, here's Claude Lévi-Strauss: "The scientist is not a person who gives the right answers, he's one who asks the right questions." (from Le Cru et le cuit, 1964)

This is precisely why I want to take us up a layer in the ontology. Open source software is an example of distributed innovation, and as an inspiration to make distributed innovation happen in science, it's lovely. But it's an inspiration, not a map.

We should absolutely have distributed innovation in science. Open WetWare (which I am well aware is a wiki) contains many protocols, crafts, and techniques that are shared openly. This is a locally relevant form of distribution, even if it doesn't fit into an open source software box. So is the registry of standard biological parts. Both matter because control over protocols and craft is at the core of one of the biggest resistors to distribution in science: competitive withholding. These are resources and toolchains that absolutely support distribution of capability and increase capacity, which are fundamental to early-stage distributed innovation.

They're just not what we expect when we wear open source glasses.

Here's comment #2:

The "Open Gel Box" project is an initiative to bring biotech equipment into the 21st century. We need innovation in "established" tools to make them intuitive and accessible for anyone who wants to work with DNA. To that end, a group of users from the DIYbio list got together and designed a better, faster gel system than what exists today.

Pearl Biotech is now manufacturing a complete gel electrophoresis system according to the Open Gel Box design. The Pearl Gel Box is available for $199 at http://www.pearlbiotech.com. We're advocating for better equipment on all fronts, such as an Open Thermal Cycler.

I think this is awesome. It's not "open source" though. It's not even what I'd call "distributed innovation" - the innovation theorists call this kind of thing User-Driven Innovation. This is about as clear a case of UDI as I know, right down to the fact that it's designed by the DIY folks and then made pretty and sold by a company. This again gets to the paucity of the open source software example. It simply isn't big enough to fit science into it.

Distributed science, user-driven science, open innovation science, we need ALL of them, not a narrow idea that comes from software. It's about hardware for science. It's about data for science. It's about laboratories for science. It's about research departments and funders and promotion and tenure. It's about paradigms, and paradigm shifts.

It's not software.

We control software. We don't control science. DIY Biology is one of the absolute leading examples of how, when we have a critical mass of open craft and protocols, users can lead the way. But it's not something that's enabled by an open source license, a code version repository, and other hallmarks of open source software. It's users saying, "screw this, I can do better" - and doing it. It's users who know the problem best and design the best solutions.

The business school folks call this "stickiness." The knowledge of how to make the solution is localized - sticks - to the user. The dumb firms in the sector only make products their marketing departments tell them about, and the smart ones find ways to take user inventions and turn them into their product lines. Like Pearl.

Comment #3:

(from my post: Stem cells, mice, vectors, plasmids, and more will need to be available outside the old boys' club that dominates modern life sciences.)

This is simply never gonna happen, because of the huge irreducible expense of maintaining and manipulating these reagents.

See: Personal Genome Project, Coriell Cell Culture Repository, Jackson Laboratories, StrainInfo. I could link a dozen more. The nodes are emerging. What's missing is the network that connects them. What's missing is an impact factor for materials.

We're headed straight towards a future where scientists will need to publish their tools, data, and narratives, instead of compressing everything into a "paper" that is constrained by the cost of printing and mailing. I for one can't wait. It's going to be a key to distributing democratized access to tools, which is fundamental for both distributed innovation *and* user-driven innovation.

Comment #4:

I believe your historical facts are a little skewed. Open Biology perhaps began on the internet back with BIONET, which functioned well through the late 80's and early 90's, until the network apparently failed to grab sufficient interest for funding. [...] There have been efforts to create biology software repositories (similar to sourceforge.net except for Biology software) and these have largely failed to attract a majority of Bio-scientists too.

This comment's talking about software. I'm not. It again illustrates the way that the open source metaphor comes with code-centric blinders.

It would be great to accelerate this process even further, for example by expanding PLoS, encouraging all scientists to publish their working software (for example, MATLAB scripts) into open source repositories

Now this is talking about the foundations for distributed science. When there is software in science, it should be published. Just like stem cells. Into repositories. Couldn't agree more.

encouraging the people-in-the-middle (hobbyists, engineers) to publish in an intermediate form which isn't as strict as a scientific journal yet maintains some level of technological standard and legitimacy -- similar to the Internet RFC's, which started as simple technical memo's.

Now here's where the comment truly shines, IMO. This is thinking broadly about breaking open the central metaphor of knowledge governance in science. This is not about "open source" - the internet RFCs aren't "open source software" - they are protocols, distributed for implementation and comment. Sort of like that stuff on the Open WetWare wiki, huh?

Coming back to my point.

Let's take off the open source glasses. Making science isn't like making software. Engineering foundations for distribution, for user hacking, for bringing more people into the system: these are the things that allowed open source to emerge in software. Good design choices, like separation of concerns, led us to the world of open source software. Let's learn from those lessons and build the foundations first, and let the science surprise us with the way it localizes distributed and user-driven innovation.

October 30, 2009

Open Source Science? Or Distributed Science?

I was asked in an interview recently about "open source science" and it got me thinking about the ways that, in the "open" communities of practice, we frequently over-simplify the realities of how software like GNU/Linux actually came to be. Open Source refers to a software worldview. It's about software development, not a universal truth that can be easily exported. And it's well worth unpacking the worldview to understand it, and then to look at the realities of open source software as they map - or more frequently do not map - to science.

The foundations of open source software are relatively easy to track. In the beginning, there was free software and Richard Stallman. RMS didn't just invent the GPL as a legal tool; he wrote crucial foundational software for writing software, notably the GNU Compiler Collection, the GNU Debugger, and the original Emacs. So from the beginning, there was not only a free legal tool, but tools for coding that were better than other systems at the time.

Simultaneously, we can see that the emergence of microcomputers and ubiquitous access to the internet expanded the number (and interconnectivity) of potential programmers. Suddenly there were tens of thousands of programmers with computers at home and at work. The explosion of the Web saw the creation of infrastructure like code repositories, version control systems, and coding communities. Thanks to object-orientation, software was also very amenable to being broken into defined, modular chunks and tasks. One coder could work on a kernel function, another on a user interface function, a third on an application, and they could be reasonably sure that as long as they all followed the standards, their work would snap together into the growing distribution. The phrase "open source" can sort of be a shorthand for this kind of innovation, which we also see in Wikipedia and other community-built projects.

Open source, if we view it through a different lens, is really more about a distributed methodology for software development. The burden of creation is widely distributed across a massive community with more-or-less equal access to tools and systems. In this context, the role of the legal tool is more akin to an enzyme. It was an essential piece of a puzzle, but it was not the only piece. In fact, without the rest of the infrastructure (connectivity, tools, and people) the legal tool on its own would not have led us to GNU/Linux.

Yet far too often the focus on "porting" open source to science focuses on the legal aspects rather than performing an analysis of the infrastructure for science. Science is actually not very similar to modern software at this point. In science, especially life science, many of these factors don't exist. There isn't democratic access to tools. You tend to need a lab, which means you tend to need to work at a place big enough to afford a lab, which tends to mean you need an advanced degree, which means there is no crowd - thus the fundamentals for distributed science development aren't there. And when we try to force open source on a knowledge space that is fundamentally poorly structured for distributed development, we'll not only be frustrated by our failures to replicate the GNU/Linux and Wikipedia successes, we'll risk discrediting the idea of distribution itself.

Another problem: the open source approach, which is based on the open licensing of a powerful, moderately internationally harmonious property right, doesn't really apply very well to science, in which the IP situation is far more often patents v trade secret instead of copyright v copyleft. Copyrights are free to acquire, and thus easy to license at no cost as well. No one's losing an investment they made of $50,000 or more to acquire their copyright when they license code under copyleft. Patents are not so amenable to legal aikido. And they can kill a great idea in the cradle by tying up all the rights in a tangle of patent thickets and expensive licenses.

A third problem is that science is a long, long, long, long, long way from being a modular knowledge construction discipline. Whereas writing code forces the programmer to compile the code, and the standard distribution forces a certain amount of interoperability, scientists typically write up their knowledge as narrative text. It's written for human brains, not silicon compilers. Scientists are taught to think in a reductionist fashion, asking smaller and smaller questions to prove or disprove specific hypotheses. This system almost guarantees that the tasks fail to achieve modularity like software, and also binds scientists through tradition into a culture of writing their knowledge in a word processor rather than a compiler. Until we can achieve something akin to object-orientation in scientific discourse, we're unlikely to see distributed innovation erupt as it does in culture and code.

A fourth problem is the collective action congestion created by heavy institutional participation: research institutions, tech transfer offices, venture capital, startups, and so forth. Software isn't subject to these constraints, at least, not most software. But science is like writing code in the 1950s - if you didn't work at a research institution then, you probably couldn't write code, and if you did, you were stuck with punch cards. Science is in the punch cards stage, and punch cards aren't so easy to turn into GNU/Linux.

None of this is meant to discourage open approaches. We need to try. The problems we face, from neglected diseases to climate change to earthquake analysis to sustainability, are so complex that they'll probably overwhelm any approach that is not inherently distributed. Distributed systems scale much better than non-distributed, closed systems. But we should always understand the foundations, and closely examine our work to see if we need to work on building those foundations.

In the sciences, the first foundation is access to the narrative texts that form the canon of the sciences. Tens of thousands of papers are published a year. They need object-orientation - semantics - so that we can begin to treat that information as a platform, not a consumable product. Licensing is a part of this, but so is technology and scientific culture. Better ontologies, buy-in to technical standards, publisher participation in integration and federation, and more will be foundational to the establishment of content-as-platform. As the data deluge intensifies, this foundation becomes more and more important, as the literature provides the context for the data. Moving to a linked web or semantic web without a powerful knowledge platform at the base is building a castle made of sand - close to the water line.

Another foundation is access to tools and the creation of fundamental open tools. We need the biological equivalent of the C compiler, of Emacs. Stem cells, mice, vectors, plasmids, and more will need to be available outside the old boys' club that dominates modern life sciences. We need access to supercomputers that can run massive simulations for earth sciences and climate sciences. These tools need to be democratized to bring the beginning of distributed knowledge creation into labs, with the efficiencies we know from eBay and Amazon (of course, these tools should perhaps be restricted to authenticated research scientists, so that we don't get garage biologists accidentally creating a super-virus).

The legal aspects weave through these foundations. The license has power to create freedoms but the improper application of a license approach carries significant risks. The "open source" meme can often feel a little religious about licenses, but it's good to remember that the GPL was invented not in the desire to write a license, but in a desire to return programming to a free state. With data and tools, we have the chance to avoid the intellectual property trap completely - if we have the nerve for it.

There is some distributed innovation happening in new fields of science, like DIY biology, and in non-science communities, like patients sharing treatments and outcomes with each other. A quick examination of the foundations reveals they are ripe for distribution: DIY biology can build on Open WetWare, the registry of standard biological parts, and the availability of equipment and tools. Patients can connect using Web 2.0 and talk to each other without intermediaries. But this doesn't scale across into traditional science.

I propose that the point of this isn't to replicate "open source" as we know it in software. The point is to create the essential foundations for distributed science so that it can emerge in a form that is locally relevant and globally impactful. We can do this. But we have to be relentless in questioning our assumptions and in discovering the interventions necessary to make this happen. We don't want to wake up in ten years and realize we missed an opportunity by focusing on the software model instead of designing an open system out of which open science might emerge on its own.

September 2, 2009

Story Time

This post was prompted by the combination of three events: a visit with the founder of PubGet, an invitation to keynote at a conference on publishing, and an interview with Bora about the Science Online 2009 conference last January in RTP.

The past year has seen an explosion of talk about the future of the scientific article. It's wonderful to see, even if the results are either depressingly complicated to achieve or depressingly incremental innovation. Both of those results are better than when I got into this - I remember at a conference in Sweden in 2006 hearing a grand high priest of the publishing industry argue that they'd gotten this whole digital publishing thing sorted right out...that attitude was the first thing that needed to change. Glad it has.

I've been hammering for years now on the need to enrich articles with semantics. My talk at that conference in Sweden was probably the first good one I gave on the topic, and it's been a leitmotif for me going back to the mid-1990s, when I was studying epistemology and getting my first real exposure to networked computers. For years I was convinced it was right around the corner.

That semantic publishing future now feels closer than it ever has. But I'm actually less convinced it's around the corner than in years past, and the reasons for that are human, not technical.

To be clear: in the following, I'm going to be talking about narratives and text, not about databases. The semantic future for databases and data is already here, but to paraphrase William Gibson, it's just unevenly distributed. Those who argue that the Semantic Web isn't going to work have already lost the argument. You just don't see it, because it's an infrastructure upgrade to the back-end of the Web to make it work for data.

But the impact of formal semantics on text, which is what humans interface with, has been negligible. It's had nowhere near the impact of tagging and folksonomy. That's driven me, and many others who like formal semantics, crazy.

The benefits to a formal semantic approach to text are so obvious: we can start to treat knowledge as a graph, and we can even maybe start to get some network externality benefits to that knowledge. Make it more valuable via the network...one fact is like one fax machine, but many facts build a hypothesis, etc. etc. etc.

Beautiful dream. Not going to happen anytime soon.

The problem is that people are the writers. Humans. Not machines. Machines luuuuuv semantics. Otherwise they can't tell the difference between a picture and a pitcher (or between a pitcher of water and a baseball pitcher). This is why one should never send one's mother to buy jewelry via Google without the safe browsing mode enabled.

And people don't like formal semantics. I majored in formal semantics, and it's a topic that still gives me headaches.

People like stories.

Scientists are people.

Scientists like stories.

A paper is a story. It tells, in its own way, the story of years of work. Of building expertise. Of designing falsifiable hypotheses. Of the results found in the lab. Of the search to balance those results against the canon and dogma. Of the potential ramifications of the results.

It's a story of science. And the telling of it is an important part of being a human who does science.

A recent article in PLoS Genetics states that "Fission Yeast Tel1ATM and Rad3ATR Promote Telomere Protection and Telomerase Recruitment" - now, those are the key "facts" asserted. They could be written into machine-readable format. I will spare you what that would look like. Suffice to say it's eye-bleedingly ugly, and requires lots of agreement about unique identifiers. It's doable. It's being done for the databases, and that will eventually make it possible for the literature. It's just not fun. And it ignores the story.

It reduces the research tale to a few assertions, nested into a massive graph of stuff other people asserted. While this is great for machines, it is lousy for people.
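For anyone who would rather not be spared, here is a minimal sketch of what "a few assertions, nested into a graph" might look like. Everything in it is an assumption made for illustration: the `ex:` identifiers and the `promotes` and `foundIn` predicates are invented, and a real system would depend on community-agreed unique identifiers rather than names I made up.

```python
from collections import defaultdict

# Hypothetical identifiers and predicates, invented for illustration only.
# A real system would use agreed-upon unique identifiers from shared databases.
triples = [
    ("ex:Tel1", "ex:promotes", "ex:TelomereProtection"),
    ("ex:Tel1", "ex:promotes", "ex:TelomeraseRecruitment"),
    ("ex:Rad3", "ex:promotes", "ex:TelomereProtection"),
    ("ex:Rad3", "ex:promotes", "ex:TelomeraseRecruitment"),
    ("ex:Tel1", "ex:foundIn", "ex:FissionYeast"),
    ("ex:Rad3", "ex:foundIn", "ex:FissionYeast"),
]


def build_graph(triples):
    """Index triples by subject so the assertions can be walked as a graph."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def subjects_with(triples, predicate, obj):
    """Return, sorted, every subject asserted to have the given predicate and object."""
    return sorted({s for s, p, o in triples if p == predicate and o == obj})


graph = build_graph(triples)
promoters = subjects_with(triples, "ex:promotes", "ex:TelomeraseRecruitment")
print(promoters)  # ['ex:Rad3', 'ex:Tel1']
```

A machine can answer "what promotes telomerase recruitment?" from this in one line. What it cannot recover is the years of work behind the answer, which is the part that lives in the story.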

This is all leading up to an idea I'm working on for the talk later this month. Publishers need to be in the business of providing the service that translates the stories for the machines to understand. The Web makes it trivial to publish stories in human readable form. All the beautiful layout services and print services that used to be worth paying for...aren't. Peer review isn't free, but it's nowhere near as expensive as it's made out to be - and it's going to get transformed by the Web, too. The Web makes peer review massively more powerful as it makes it massively more democratic. The Web kills a lot of things that used to drive value in content, especially controlled content.

After all, I can't remember the last time I used a Zagat's guide. Not when I have Chowhound. It's going to come to science. Don't know exactly how, but it's coming.

But this only covers one piece of science - the telling of the story. There's another key, which is the ability to use the information to write a new tale. The ability to take this massive corpus of story and turn it into something that can be modeled, that can be used by humans and machines together to draft new stories...that ability is going to require the emergence of publishers who understand their role in the new content economy. It's not as printers who use bits rather than ink. It's as translators between the human stories and the machines who have to take those stories, integrate them into a web of linked data, and make it possible for humans to ask questions, dream dreams, and tell new stories.

The semantic article isn't going to come from individual scientists rebelling and marking up their own text. It's going to be a publisher value-added service - "let us make your article integrated, and comprehensible, so that you maximize your citation count and potential collaboration."

Sounds good, doesn't it?

Focusing on the control of copies of the article, of the story, isn't just a losing strategy because of the open access movement, although it is that as well. It's the wrong concept entirely. Translation is a service for which authors would gladly pay. For which searchers would gladly pay. And it's a market that is going to get more valuable as a result of open systems, not less valuable, as the cost of controlled scientific published content drops thanks to green and gold open access.

Think about Clayton Christensen's law of conservation of attractive profits: "When attractive profits disappear at one stage in the value chain because a product becomes commoditized, the opportunity to earn attractive profits with proprietary products usually emerges at an adjacent stage."

Publishers are trying to fight the commoditization of the story. They shouldn't. The vast majority of the stories are bought and paid for by the public one way or the other. Publishers should be looking at the place where they can compete on proprietary services, and taking over those markets before their competitors - or startups - beat them to it. There is enormous opportunity in the emerging open access world to make money without needing to vigilantly police the movement of content.

Help the scientists tell their stories in a way that lets those stories integrate into the digital web. Don't just gussy up a paper version of a story with hyperlinks. Don't focus on controlling the movement of stories. They're sand in your hands once they're on the network. Embrace that fact. Find the value in the next layer, the service layer.

Be a guide. Be a search engine. Be a translator.

September 1, 2009

Ignore this post

Seriously. Just getting around to technorati claiming. Move along, nothing to see here. Watch for a lengthy post on scientific publishing later tonight or tomorrow.

59tbcg4wsi

August 20, 2009

Open Data: It's About Interoperability, Not Property

I wrote this up on the request of a colleague who heard my talk recently on open data. I'm posting it here for comment and adding some hyperlinks...

Moving from a Web of documents to a Web of data (or of Linked Open Data) is an oft-cited goal in the sciences. The Web of data would allow us to link together disparate information from unrelated disciplines, run powerful queries, and get precise answers to complex, data-driven questions. It's an undoubtedly desirable extension of the way that the existing networks increase the value of documents and computers through connectivity - Metcalfe's Law applied to more complex information and systems.
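As a rough illustration of why connectivity matters so much here, Metcalfe's Law ties the value of a network to the number of possible connections among its nodes, which grows roughly with the square of the node count. The sketch below only counts possible pairwise links; treating each dataset as a node, and each link as equally valuable, are simplifying assumptions of mine.

```python
def possible_links(n: int) -> int:
    """Number of distinct pairwise connections among n nodes: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("node count must be non-negative")
    return n * (n - 1) // 2


# Ten times the nodes yields roughly a hundred times the possible links.
for n in (10, 100, 1000):
    print(n, possible_links(n))
# 10 45
# 100 4950
# 1000 499500
```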

However, making the Web of data turns out to be a deeply complex endeavor. Data - here, a catchall word covering databases and datasets, and generally meaning information that is gathered in the sciences as a result of either experimental work or environmental observation - require a much more robust and complete set of standards to achieve the same "web" capabilities we take for granted in commerce and culture.

Unlike documents, the ultimate intended reader of most data is a machine. Some classic examples include search engines, analytic software, database back ends, and more. There is simply too much data in production to place people on the front lines of analysis. When data scales easily into the petabytes, we just can't keep up using the existing systems.

This machine-readability requirement is very different from the Web of documents, which was designed to standardize the way information is shown to people. Machine readability means we have to think, early and often, about the level of interoperability in any given chunk of data. "How 'connectable' is it to other data?" should be the first question we ask of new data, because the level of effort required to make data connectable post-hoc is significant - frequently unbearable.

The connectability quotient creates significant pressures to build interoperability deep into the Web of data. It implies a level of rigor in the design of data, grounded in the understanding that the intended use of that data is in a network context. Thus, we need to turn ourselves to the concept of interoperability and examine what it means in a data context.

There are three interlocking dimensions to interoperability in data: legal, technical, and semantic. By legal, we mean the contractual and intellectual property rights associated with the data; by technical, the standard systems (especially the computer languages) in which the data is published; and by semantic, the actual meaning of the data itself - what it describes, and how it relates to the broader world.

Each of these dimensions is complex on its own. Taken together, the three represent unsolvable complexity. The semantic layer alone requires an almost miraculous level of agreement on "what things mean," and anyone who has witnessed argument among scientists, be they economists or physicists, knows that even apparently simple topics turn contentious over matters as basic as definitions. Consensus on the technical layer is somewhat easier - the Web and the Semantic Web "stack" of standard technologies have begun to take a leadership position in data networking - but still difficult, long, and open to argument. One of the only opportunities we have is in the legal layer, where we can look to a broad set of successes in legal interoperability through the use of a simple, flat standard: the public domain.

The public domain is a very simple concept - no rights are reserved to owners, and all rights are granted to users. The public domain exists as a counterweight to copyright in the creative space, but in some countries - especially the United States - it also serves as a first option for data that is not considered "creative."

The public domain option currently underpins a wide variety of linked data that is already well on its way to achieving Web scale. From the International Virtual Observatory, whose members build an international data net on norms of "acknowledgment" rather than contracts of "attribution", to the world of genomics, where entire genomes and related data are harmonized nightly across multiple countries, the public domain creates complete interoperability at the legal layer of the data network, and serves as a foundation for the next layer of technical interoperability.

Interestingly, we have yet to observe similar network effects emerging in cases where the underlying data is treated in a more conservative "intellectual property" context by using copyright licenses or database licenses inspired by copyright. Indeed, in the case of the international consortium mapping human genomic variation, the implementation of a "click through" license was found in practice to impede integration of that mapped variation with other public domain data, limiting the value of the map. The license was removed, the public domain option instated, and the database was immediately technically integrated with the rest of the international web of gene data.

The legal element is of course just the beginning. The entities inside the databases themselves must be named and linked, in a standard way. Consensus on a dizzying array of technical standards must be achieved through working groups and hard won agreement. Semantic agreement - or disagreement - must be enabled where possible, and managed through savvy technology where not possible. But if the entire system must begin with a complex set of legal terms and conditions, and be subject to the kinds of injunctions and property claims so familiar from the creative world, it is inherently unstable and unlikely to interoperate.

We have seen the public domain option work, again and again, across the scientific disciplines. Implementing the public domain as the interoperability standard for the legal dimension of the web of data holds the greatest promise for scalability and long-term achievement of the network effect for data, as it permits the widest range of experimentation and development at the technical and semantic layers.

August 5, 2009

May All Your Standards Be Simple and Evolvable

I was in a roundtable yesterday talking about Health IT with a bunch of very smart people in the Bay Area. It was sort of a briefing of ourselves and others about the real issues underpinning what it would take to generate real disruptive innovation in health technology and health costs. The vast majority of the conversation centered on payment reform, which is outside my ambit.

But we did spend some time talking about health data standards, and the problem of getting standards that are so geared to the existing market-dominant companies that they actually froze out new market entrants. My contribution in all this was pretty small, and to me seemed obvious. The standard that works best tends to be the least powerful solution to the problem, especially if it's an openly released solution. This can be counterintuitive - why wouldn't we want the most powerful one? - but it's been proven again and again.

In technology, standards propagate like kudzu. Most of them go nowhere, representing an enormous sunk cost of time and money. And that's because most of them are way too complex. The more powerful they are, the more brittle they are, the more expensive they are to implement, and the more they restrict the re-use of the system.

Tim Berners-Lee calls this the Rule of Least Power, and it's one of the most important lessons I learned working at the W3C. There's a simple reason for this - the more basic the markup of the content, the easier it is to write applications that process the content.
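To make that concrete, here is a minimal sketch, assuming nothing beyond Python's standard library. Because HTML markup is plain and declarative, a few lines are enough to pull every link out of a page. The `LinkCollector` class, the `extract_links` helper, and the sample page are invented for illustration; they are not taken from any W3C material.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return all link targets found in a string of HTML."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Hypothetical page content, made up for this example.
SAMPLE_PAGE = """
<p>See the <a href="http://example.org/data">data set</a> and the
<a href="http://example.org/paper">paper</a>.</p>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))
```

The same page could be read by an indexer, a link checker, or a citation counter, none of which the page's author needed to anticipate. That is the practical payoff of keeping the markup basic.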

Thus TCP/IP, created simply to move bits between computers, begat a variety of new protocols like FTP, Gopher, Finger, and many others that layered atop the basic bits standard. Complexity from simplicity. Attempting to embed file transfer into the bits protocol would have made this whole process a lot harder.
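The layering idea can be sketched in a few lines. This is a toy, not real TCP/IP or FTP: `BytePipe` stands in for a transport that only moves bytes, and the one-verb `GET` protocol, the `FILES` table, and every function name are hypothetical, invented for illustration. The point is that the file-fetching logic lives entirely above the transport, which never has to know what a file is.

```python
from collections import deque


class BytePipe:
    """The simple layer: it moves bytes and knows nothing about files."""

    def __init__(self):
        self._chunks = deque()

    def send(self, data: bytes) -> None:
        self._chunks.append(data)

    def recv(self) -> bytes:
        # Return the oldest chunk, or empty bytes if nothing is waiting.
        return self._chunks.popleft() if self._chunks else b""


# Hypothetical file store for the toy server.
FILES = {"readme.txt": b"hello, world"}


def client_request(pipe, name):
    """Upper layer, client side: encode a request as plain bytes."""
    pipe.send(b"GET " + name.encode("ascii") + b"\n")


def server_handle(request_pipe, reply_pipe):
    """Upper layer, server side: parse the request and answer it."""
    line = request_pipe.recv().decode("ascii").strip()
    verb, _, name = line.partition(" ")
    if verb == "GET" and name in FILES:
        reply_pipe.send(b"OK " + FILES[name])
    else:
        reply_pipe.send(b"ERR not found")


def fetch(name):
    """Run one request/response exchange over two byte pipes."""
    to_server, to_client = BytePipe(), BytePipe()
    client_request(to_server, name)
    server_handle(to_server, to_client)
    return to_client.recv()


if __name__ == "__main__":
    print(fetch("readme.txt"))
    print(fetch("missing.txt"))
```

A second protocol with different verbs could reuse `BytePipe` unchanged, which is the property the simple transport buys.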

And of course HTML/HTTP begat the entire Web, all the way to YouTube and Amazon and everything else. Writing video codecs into HTML wouldn't have worked nearly as well as writing a standard that was simple enough to be extended by smart users coming along ten years later.

To the rule of least power we can add the rule of openness - the standards process should be as open as is feasible, and the standards themselves must be open. Users have to be able to read a standard, and to have the freedom to implement the standard, to be able to innovate atop it with new systems.

There's a lesson here. Gathering the relevant powers that be to figure out a standard is an important task. The W3C, the IETF, the OMG (that's Object Management Group, not the internet acronym, for you younguns), and what feels like every different data discipline on earth do standards this way.

But there are a lot of fingers on the scale for most of this work. That's because data standards tend to get created by well-meaning, overworked, and underpaid people who are making a real sacrifice to work on the standards. And those people are going to depend on a lot of in-kind work from the interested parties, who are always going to try to bend the standards to their will.

That can go multiple ways. The paranoid conclusion is that the for-profits involved will try to use the standard to increase stock prices, which is why smart standards efforts include patent policies to prevent enclosure. But there's a bigger problem out there, which is much less visible but much more of a force in the creation of standards that don't get used, or that don't do what we want them to do.

It's what I call the problem of standards completeness. Experts in the field, interested parties, impassioned volunteers - these people by their nature tend to want to make the standard they build as complete as possible. They want to cover the most ground with the standard. They understand the space so well that they want to build standards that address vast swaths of work.

But that violates the Rule of Least Power. And as we move towards a web of data, even a web of patient data, we'll do well to make our standards by solving real problems with the simplest possible solutions, then releasing those solutions for others to build on.

The impact of the simple evolvable standard in the short term is probably less than that of a more complete, perfect standard. Certainly TCP/IP didn't scare the systems integrators at its inception. But it's the power of the crowd that can build on the open standard that breaks open the market. Thanks to simple standards, two talented programmers can start a company in a garage that changes the world.

If we're going to bring that level of innovation potential to health IT, we need to keep the lessons of the simple standard in mind. Because right now, if you're a bright young entrepreneur, you don't get into health IT. And the lack of not just standards, but the right kinds of standards, is the first barrier we have to knock down to change that reality.

July 31, 2009

Integrate. Annotate. Federate.

Following on to yesterday's post, where I wrote about the four functions that traditional publishers claim as their space (registration, certification, dissemination, preservation), I want to revisit an argument I made last week at the British Library.

In my slides, I argued that the web brings us at least three additional functions: integration, annotation, and federation. I wanted to get this argument out onto the web and get some feedback...

Let's start with integration. The article no longer sits on a piece of dead tree, inside a journal formatted by date and volume and page number. It exists as a digital entity, capable of dense integration into other digital entities. One way to think of this is to think of how the citation is truly weak tea compared to the hyperlink - an individual citation carries more weight than an individual hyperlink, but the hyperlink is so easy to create, and carries so much power in aggregate, that we get Google. Citations are the only way most articles are integrated with other articles, and that simply has to change.

Articles need to be integrated with lots of other digital information. Media is an obvious one, and the Elsevier-Cell "article of the future" seems to start here with an interview with the authors. To me this is absurd, and the height of how a "big company" thinks "the users" use the web. I don't want to hear an author interview with a reporter. I assume the author is going to say his or her work is sweets and sparkles and Nobel prizes. I'd rather see an embedded high-resolution video of all protocols necessary to replicate the experiment like the ones you get from JoVE (I'd like them to actually be open access too, but that's a different blog post).

If you want to make the article of the future, start with integration and work backwards. Don't start with the article and work forward, because you'll be trapped in document mentality instead of the network mentality.

We don't just want the data downloadable, we want to be able to run the same algorithms the author ran on the data, and adjust the variables ourselves, to see if the results are the output of statistical foul play or negligence. We want to be able to hide all the boring language that recapitulates past canon and focus on the new assertions, unless of course the author is trying to game the past canon and shade the facts. And we want to be able to effortlessly click out and get data about the assertions in the paper from other databases - when there's a gene mentioned, we should be able to one-click and run any number of core queries against the sequence and the ontological classifications, order genetic materials from biobanks, and so forth.

Annotation is the second new essential function. The old method of annotation is to write a new paper that validates, invalidates, extends, or otherwise affects the assertions made in an old paper. Or, if something is really wrong, there might be a letter to the editor or a retraction. In a wiki world, this is fundamentally insane. The paper is a snapshot of years of incremental knowledge progress. We have much better technology to use than dead trees.

Of course, there isn't any incentive to take the wiki that is science and actually use a wiki to create and edit it. Scientists get tenure for papers, and egoboo is cold comfort. Annotation needs to be provided by publishers, and is being provided, but the next step is to create an open platform that actually tracks the kind of annotation-relationships that the web enables. Bloggers use trackback to create a formal hyperlink between blog posts, and the protocol can and should be extended to let us connect all sorts of things: articles, wiki pages, database entries, catalog pages for biological materials, data sets, and on and on. By making these link transactions - which exist anyway - explicit and trackable, and most importantly reportable, we'll create a currency that scientists will gladly spend. It won't be about "sharing" but instead about "publishing" more of the intermediate knowledge that currently gets left on the lab floor when the paper gets written.
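Here is a rough sketch of what "explicit, trackable, and reportable" link transactions might look like as data. It is not the trackback protocol itself. The `Link` record, the relation names ("discusses", "extends", "uses"), and the identifiers are all hypothetical, invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One explicit link transaction between two digital objects."""
    source: str    # the thing doing the linking
    target: str    # the thing being linked to
    relation: str  # how the source relates to the target


def inbound_report(links):
    """Count inbound links per target, broken down by relation type."""
    report = {}
    for link in links:
        report.setdefault(link.target, Counter())[link.relation] += 1
    return report


# Hypothetical link records spanning posts, wiki pages, and data sets.
LINKS = [
    Link("blog:post-1", "article:A", "discusses"),
    Link("wiki:protocol-7", "article:A", "extends"),
    Link("dataset:D", "article:A", "extends"),
    Link("article:B", "dataset:D", "uses"),
]

if __name__ == "__main__":
    for target, counts in inbound_report(LINKS).items():
        print(target, dict(counts))
```

Note that the data set in this example collects an inbound link just as an article does, so the report gives credit for intermediate work as well as for papers.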

Federation is the last essential new function I'll deal with here (I have some theories on other long term essential ones, but they're poorly formed in comparison). By federation I mean the ability to take a set of articles and federate them into a corpus with other materials. There are a lot of reasons one might want to do this: text mining, semantic indexing, integration with information that is private, and so forth. It's great to be able to read articles on the web. But if we're going to really explode the way we communicate, the ability to cache local copies (or cloud copies) in new formats for new kinds of analysis, and the right to then distribute the resulting corpus for follow-on innovation and exploration, is going to be central.
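As a minimal sketch of federation, assuming local copies of the text are available: merge published articles with private material into one corpus, then build a word index over the whole thing for text mining. The document identifiers, the sample texts, and the function names are invented for illustration.

```python
import re
from collections import defaultdict


def federate(*collections):
    """Merge several {doc_id: text} collections into one corpus."""
    merged = {}
    for collection in collections:
        merged.update(collection)
    return merged


def build_index(corpus):
    """Map each lowercase token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


# Hypothetical published articles and private lab notes.
ARTICLES = {
    "article:1": "BRCA1 variants and DNA repair",
    "article:2": "Sensor networks for field data",
}
LAB_NOTES = {
    "note:7": "brca1 knockdown protocol failed twice",
}

if __name__ == "__main__":
    index = build_index(federate(ARTICLES, LAB_NOTES))
    print(sorted(index["brca1"]))
```

The query crosses the line between public and private material, which is only possible if the articles can be copied into the same corpus as the notes.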

Publishers are so focused on the prevention of copying that they don't see the central business opportunity here: the human-readable, copyrighted version of the article is the least federation-friendly. Charge a fee to make the article beautifully machine-readable and give away the text - because the service of improving the technical aspects of the article is clearly a value-add that shouldn't be subject to a funder mandate.

Integration, Annotation, Federation. It's what the Web is all about. And if we can get to the point where publishers feel these as core responsibilities, the Open Access debate will have made a major leap. All of these create a world in which the text of the article itself is lower in economic value than the connectivity of that article into a larger web of information, and thus easily distributable. OA is the beginning, not the end game, of making the web work for science the way it works for culture. Step two is all about the connectivity, and it's time to start arguing - loudly - for the right to start wiring the science together.

July 30, 2009

Publishing science on the web

I spoke last week at an event at the British Library about the future of the scientific article. It was a lively event - lots of friendfeed and twitter reactions - and it got me thinking a lot about the way we use publication in science.

In my conversations with research staff and leaders at the BL, I ran across this statement. Publishers frequently claim four functions: registration (when was an idea stated?), certification (is the idea original, has it been "proved" to satisfactory peer review?), dissemination (delivery), and preservation of the record. The journal thus provides for both the claiming of ideas by scientists and for the "memory" of the sciences.

But the Web does a lot of this for us outside of science. It's become easy to write and read, and to use Google as a memory cache. The ability to rapidly find relevant information is part of daily life for us outside of science. But inside of science there is complaint that even within one's own specialized discipline, there is too much to read, too many journals, too little time. This doesn't even begin to include the coming deluge of data wrought by the relentless miniaturization and parallelization of a world where data is generated by robotic lab machinery and captured by tiny, ubiquitous sensors.

Wikis and blogs provide almost costless registration and dissemination of new scientific communication. But resistance to wikis and blogs is a feature of science - Nature's web efforts have yet to make significant revenue despite significant individual use. Is it a matter of certification? Preservation? Cultural aspects related to the way we fund and reward scientists?

Another thought on science communication - science is already a wiki if you look at it a certain way. It's just a really, really inefficient one - the incremental edits are made in papers instead of wikispace, and significant effort is expended to recapitulate the existing knowledge in a paper in order to support the one-to-three new assertions made in any one paper. And the papers are written in a highly specialized form of text that demonstrates the expertise of the writer in the relevant domain, but can form a language barrier to scientists outside the domain understanding the key facts.

In places where the local knowledge is sufficient to create falsifiable hypotheses and experiments, the time required to learn the language of others doesn't get rewarded by results - gene sequencing doesn't need a physicist, for example. How can we get to enough technical standards so that this kind of science can be harvested, aggregated, and mashed up by people and machines into a higher level of discipline traversal? Right now the problem is we still think about cross disciplinarity as a function of people choosing to work together. But the internet and the Web give us a different model. What's more cross disciplinary than Google? But the language barrier among scientists is preserved - indeed, made worse - by the lack of knowledge interoperability at the machine level. It's the Tower of Babel made digital. Until we can get past that one, we're going to be stuck doing human speed knowledge construction on machine speed data generation...

July 11, 2009

WisconsinView converts to CC0

Just a quick hit - I'm digging out after a wonderful break from work - but this deserves notice...

Since 2004, WisconsinView has made aerial photography and satellite imagery of Wisconsin available to the public for free over the web. As part of the AmericaView consortium, WisconsinView supports access and use of these imagery collections through education, workforce development, and research. Starting June 30, 2009, WisconsinView is making available all of its more than 6 Terabytes of imagery data under the new CC0 Protocol provided by Creative Commons. The CC0 (pronounced CC-Zero) Protocol waives any rights in a dataset, ensuring that all of the dataset is available to anyone without encumbrance of any kind.

Thanks to Puneet Kishor, our SC Fellow for geospatial, for his tireless advocacy on behalf of the public domain for data!

© 2006-2009 Seed Media Group LLC. ScienceBlogs is a registered trademark of Seed Media Group. All rights reserved.
