Unpacking "the cloud"

I hear talk about "the cloud" as the solution to research data curation. Data will waft softly up into "the cloud," and "the cloud" will look after it and give it back on demand, and there will be unicorns and rainbows and rainbow-colored unicorns, and—well, you get the idea.

I think this is bosh. Balderdash. Bunkum. But I also think it's worth unpacking why this is a popular and recurring idea, because there's the germ of a service design in there.

"The cloud" means a lot of things to a lot of people, but for the sake of argument, let's call it "third-party data-storage services" such as Amazon's S3. S3 is not a solution for data curation. The service-level agreement amounts to "we can lose any of your data any time, and your only recourse might be a refund of what you paid us." For unique, irreplaceable data, this is beyond unacceptable. Think it can't happen? It already has.

As part of a well-managed storage and backup system, S3 might do. Might. But do you really want to design around its limitations?

However. Look up at the sky, if you're lucky enough to be near a window. I'm guessing you see either no clouds at all, or a lot of them. More than one, at any rate. How many skies contain just one cloud?

Cost questions aside, what is it that "the cloud" promises that people want? Could those of us interested in data build that?

"The cloud" promises to make data storage secure, safe, and above all easy. Yes, I think we can do this, and I think we should. Fedora, IRODS, pick your poison—but big disk, taken care of invisibly behind the scenes, with lots and lots of ways to get data in and out?

We can do this. We should.

Think it can't happen? It already has.

This is a bit misleading. The lost data was live instance data, not stored S3 data. Running virtual machines were erroneously terminated, and as a consequence any active data in them that had not been saved externally, to S3 for example, was lost.

That's not to say that S3 won't lose data in the future, or may already have lost data that went unreported. But unless I'm misunderstanding, the referenced incident was a very different failure mode than a storage cloud losing data.

However, as someone who's been working with cloud storage in general and S3 in particular for over three years now, I can completely agree with your core premise. Though a professionally managed cloud runs a much lower risk of data loss than your typical smaller setup, it's still a single provider and a single point of failure.

But do you really want to design around its limitations?

When the alternative is hosting your own multi-site datacenter infrastructure, working around the limitations of various cloud providers isn't such a bad deal.

We can do this. We should.

I'd better stop reading blogs and get back to work then ;-)

A cloud like S3 is infrastructure. An individual institution may or may not be able to provide infrastructure as good or better, with the same or different functionality.

The cloud provides no curation, as you point out. It helps diddly with your metadata or workflow.

It might just be, however, as reliable as, or more reliable than, whatever one's local institution provides in terms of backup and recovery.

You mention Fedora and IRODS -- these provide metadata, so can help people do curation. My intuition, however, is that there is a class of users who have curation needs that are small enough and broad enough that it will never be feasible to write software to help them. For these people, their best bet may be guidelines for what types of metadata to store and a pointer to large, relatively secure, inexpensive digital storage.

Does that mean a ton of data will likely get lost over the centuries? Yes, but perhaps it's unreasonable to expect that humanity will find a way to save everything.