Unpacking "the cloud"

By dsalo on August 8, 2009.

I hear talk about "the cloud" as the solution to research data curation. Data will waft softly up into "the cloud," and "the cloud" will look after it and give it back on demand, and there will be unicorns and rainbows and rainbow-colored unicorns, and—well, you get the idea.

I think this is bosh. Balderdash. Bunkum. But I also think it's worth unpacking why this is a popular and recurring idea, because there's the germ of a service design in there.

"The cloud" means a lot of things to a lot of people, but for the sake of argument, let's call it "third-party data-storage services" such as Amazon's S3. S3 is not a solution for data curation. The service-level agreement amounts to "we can lose any of your data any time, and your only recourse might be a refund of what you paid us." For unique, irreplaceable data, this is beyond unacceptable. Think it can't happen? It already has.

As part of a well-managed storage and backup system, S3 might do. Might. But do you really want to design around its limitations?

However. Look up at the sky, if you're lucky enough to be near a window. I'm guessing you see either no clouds at all, or a lot of them. More than one, at any rate. How many skies contain just one cloud?

Cost questions aside, what is it that "the cloud" promises that people want? Could those of us interested in data build that?

"The cloud" promises to make data storage secure, safe, and above all easy. Yes, I think we can do this, and I think we should. Fedora, IRODS, pick your poison—but big disk, taken care of invisibly behind the scenes, with lots and lots of ways to get data in and out?

We can do this. We should.

We have. But we're doing it commercially:

http://www.caringo.com/

I'm not seeing your service-level agreements.

"Think it can't happen? It already has."

This is a bit misleading. The lost data was live instance data, not stored S3 data. Running virtual machines were erroneously terminated, and as a consequence any active data in them that had not been saved externally, to S3 for example, was lost.

That's not to say that S3 won't lose data in the future, or may already have lost data that went unreported. But unless I'm misunderstanding, the referenced incident was a very different failure mode than a storage cloud losing data.

However, as someone who's been working with cloud storage in general and S3 in particular for over three years now, I can completely agree with your core premise. Though a professionally managed cloud runs a much lower risk of data loss than your typical smaller setup, it's still a single provider and a single point of failure.

"But do you really want to design around its limitations?"

When the alternative is hosting your own multi-site datacenter infrastructure, working around the limitations of various cloud providers isn't such a bad deal.

"We can do this. We should."

I'd better stop reading blogs and get back to work then ;-)

A cloud like S3 is infrastructure. An individual institution may or may not be able to provide infrastructure as good or better, with the same or different functionality.

The cloud provides no curation, as you point out. It helps diddly with your metadata or workflow.

It might just be, however, as reliable as, or more reliable than, whatever one's local institution provides in terms of backup and recovery.

You mention Fedora and iRODS -- these provide metadata, so they can help people do curation. My intuition, however, is that there is a class of users whose curation needs are small enough and broad enough that it will never be feasible to write software to help them. For these people, the best bet may be guidelines for what types of metadata to store and a pointer to large, relatively secure, inexpensive digital storage.

Does that mean a ton of data will likely get lost over the centuries? Yes, but perhaps it's unreasonable to expect that humanity will find a way to save everything.
