Since people know I work for Google, I get lots of mail from folks with odd questions, or with complaints about some Google policy, or questions about the way that Google does some particular thing. Obviously, I can’t answer questions about Google. And even if I could, I wouldn’t. This isn’t a Google blog; this is my blog, which I write as a hobby in my free time.
But there are intersections between my work life and my hobby. One of the ideas that underlies many of the questions I receive, and which also touches on both my work and my hobby, is the idea of scale. Scale is computer-science talk for how things change as they get bigger. In particular, I'm talking about the scale of information: the amount of information that we use on a daily basis has increased dramatically, and the amount of dramatic, fundamental change that has resulted is both amazing, and amazingly unnoticed by most people.
Take a bunch of information, and think about how you would analyze it. Suppose you have a list of, say, music tracks. On my computer, I've got 4,426 tracks of music. Now, suppose I wanted to go through those tracks and summarize what's there. If all you want to look at is simple information – the name of the track, the artist, the length, the genre, etc. – it's trivial. You could do it in seconds on any modern computer. Heck, you could do it in seconds on any modern cellphone.
Now, suppose that instead of 4,000 tracks, it were 4 million. That’s still
not a big deal. I can still do all sorts of interesting analysis of 4 million
records in minutes. With a modern computer, it’s no big deal.
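To make that concrete, here's a quick sketch in Python. The record layout and field names are my own invention, not any real music library's format; it streams a few hundred thousand fake track records and tallies the simple metadata. Bumping the count up to four million still only takes seconds to minutes.

```python
import random
import time
from collections import Counter

GENRES = ["rock", "jazz", "classical", "folk", "electronic"]

def fake_tracks(n):
    """Yield made-up (title, artist, length_seconds, genre) records."""
    for i in range(n):
        yield ("track%d" % i, "artist%d" % (i % 500),
               random.randint(90, 600), random.choice(GENRES))

start = time.time()
by_genre = Counter()
total_seconds = 0
count = 0
# Stream the records once, tallying the simple metadata as we go.
for _title, _artist, length, genre in fake_tracks(400_000):
    by_genre[genre] += 1
    total_seconds += length
    count += 1
elapsed = time.time() - start

print("summarized %d records in %.1f seconds" % (count, elapsed))
print("hours of music: %.0f" % (total_seconds / 3600))
```

The point is that each record is a handful of short strings and numbers, so the whole pass is bounded by how fast you can walk the list.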
But what if I wanted to do something more interesting, like look at the files and try to compute the tempo? Now, instead of dealing with four million sets of strings, I'd need to actually look at the full content of the songs to compute the tempo. In my music collection, the average track takes up a couple of megabytes; let's say it's just one megabyte per song. Analyzing a library of 4,000 songs at one megabyte each is doable on a modern PC, but it's going to take a while. Doing it to a library of 4 million tracks is simply not feasible. I can buy a disk for my computer large enough to store that many tracks, but there's no way I'd ever be able to run an interesting sonic analysis and finish it.
Services with music databases like LastFM, Pandora, and iTunes maintain
information on tens of millions of tracks, and many of us play with those
every day without ever thinking about what's going on. LastFM and Pandora
analyze millions of tracks of music, so that when you tell them what you like, they
can suggest similar things based on the musical structure. Ten years ago, having a computer do that would have been impossible. Today it's routine.
That’s scale – the way that things change as you increase the size of the problem.
Things get bigger and bigger, and eventually the size of the problem crosses a line so that the increase in size fundamentally changes the problem.
Again, let me try to put the idea of quantities of data into some kind of terms
that intuitively give you an idea. I bought my first computer in 1982. It was a
Franklin Ace 1000 – an Apple II clone. It had a floppy disk drive, which stored
140,000 bytes on a 5 1/4 inch floppy disk. My next computer was a big step up – an
Amiga, which stored around 800,000 bytes on a 3 1/2 inch disk drive. Today, my
cellphone has a removable flash drive which can store 8,000,000,000 bytes, on
a little tiny card about 1/4 inch on a side, and about the thickness of a greeting
card. If I were to convert the storage on my cellphone to floppy disks on my old Apple
clone, it would require about 57,000 disks. That would be a stack of floppy disks
almost 500 feet high.
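For what it's worth, the arithmetic checks out. Here's the back-of-the-envelope version; the 0.1-inch per-disk thickness is my assumption, everything else comes from the numbers above:

```python
# Back-of-the-envelope check of the floppy-disk comparison.
FLOPPY_BYTES = 140_000          # Apple II-era 5 1/4" disk capacity
FLASH_BYTES = 8_000_000_000     # 8 GB flash card
DISK_THICKNESS_IN = 0.1         # assumed thickness per sleeved disk, in inches

disks = FLASH_BYTES / FLOPPY_BYTES
stack_feet = disks * DISK_THICKNESS_IN / 12

print("floppies needed: %d" % disks)         # about 57,000
print("stack height: %d feet" % stack_feet)  # just under 500 feet
```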
We’re routinely carrying around quantities of data that we simply don’t have
the ability to intuitively understand. A gigabyte is a billion letters. What does that mean? We just don’t think about it. The closest we come is
to break it down some other way; instead of calling it a billion characters, we’ll call it a hundred songs. But the availability of that quantity of data has dramatically changed the way we work, and the things we can do. I used to
have a little briefcase full of cassettes for my car stereo. I carried this stupid thing around, with 100 little plastic cassettes in it, just to have a decent selection of music to listen to. Now I’ve got 887 tracks on my Android, and I complain about how I don’t have enough space for a decent selection! That’s scale – the
amount of data I can afford to carry with me has changed dramatically, and as a result, my expectations about what’s normal have changed.
The effects of scale are very counter-intuitive. By making the problem bigger,
in some cases, you make it easier. There are things that you can do
easily with data at massive scale that you couldn’t do at all with
small quantities of data; and there are things that you can easily do with small amounts of data that become impossible when you scale upwards.
We’ll start with the upside. Scale means that you can do some amazing things
that used to be impossible. For example, scale means that you can record the entire genomes of large numbers of organisms. Last week, my fellow ScienceBlogger Tara Smith at Aetiology wrote an article about a recent Ebola outbreak; doctors studying the outbreak wanted to see what strain of Ebola they were dealing with, so they sequenced it, compared it to the existing library of Ebola genomes, and identified it as a new strain, related to one of the known strains. Scientists now routinely sequence the genomes of new infectious diseases, and compare them with a range of known variants. In fact, they do more than that; to quote one website:
“Due to the explosive growth of the field of Genomics, several
databases containing fully sequenced genomes of many viruses have been created and
placed online for easy access. By taking the newly sequenced Ebola genome data,
scientists were able to compare its composition to that of other viruses. When the
translated genome of the Ebola virus and its respective subtypes was compared to that
of other viruses, something of particular interest stood out. There was an almost
identical match between the immunosuppressive motifs of the oncogenic retroviruses,
murine leukemia virus (MuLV) and feline leukemia virus (FeLV), and a region of the
Ebola genome.”
To translate that into simpler terms: they didn’t just compare the genome of the new Ebola variant to other Ebola viruses; that wouldn’t be particularly difficult, since the Ebola genome is only about 19K of data. But they compared it to the sequenced genomes of every previously sequenced virus, and found nearly identical sequences in Ebola and feline leukemia virus!
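Real genome comparison uses specialized tools over enormous sequence libraries, but the core idea, finding near-identical regions shared between a new sequence and everything in a library, can be sketched with a toy k-mer intersection. All the sequences below are made up for illustration:

```python
def shared_kmers(query, reference, k=12):
    """Return the set of length-k substrings common to both sequences."""
    qset = {query[i:i + k] for i in range(len(query) - k + 1)}
    rset = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return qset & rset

# Toy sequences: one 'library' entry shares an embedded motif with the query.
motif = "ACGTTGCAACGGTCA"
query = "TTGAC" + motif + "GGA"         # stand-in for a newly sequenced virus
library = {
    "virus_A": "CCGTA" + motif + "TT",  # shares the motif
    "virus_B": "GATTACAGATTACAGATTAC",  # unrelated
}

# Scan the whole library; a nonzero hit count flags a near-identical region.
for name, seq in library.items():
    hits = shared_kmers(query, seq, k=10)
    print(name, len(hits))
```

At scale, the trick is that the library side of this intersection covers every sequenced virus, not just the close relatives you'd think to check.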
That’s the good side of scale. Things that we wouldn’t have imagined just a few years ago are not just possible, but they’re downright routine – because carrying
around and manipulating massive quantities of data is now routine. Scientists now casually work with massive quantities of information, and by doing that, they're finding completely new ways of doing science. Ten years ago, given state-of-the-art equipment, the scientists looking at the Ebola genome could probably have sequenced it, but they would never have dreamt of comparing it to feline leukemia virus. At the scale we can now work at, it's harder to exclude some probably-irrelevant virus than it is to go ahead and compare it, just on the incredible off-chance that there might be something there.
In some ways, it’s almost like something out of Douglas Adams’ “Hitchhiker’s Guide
to the Galaxy” made real: we can now generate answers to questions that we didn’t even
ask; there are answers lying in the data we have available, to questions that
we don't even know to pose. By searching for apparently meaningful patterns in large
quantities of data, we can find answers before we even know what the questions are.
(Of course, just like in HHGtG, we still need to figure out what the question is.)
Scale has dramatically changed science: we can routinely look for things and discover
correlations that we don’t expect, which opens up whole new questions that we never
thought of before.
That’s the happy side of scale: the ability to mine huge quantities of data means
that we can look at things and discover things in a whole new way. But there’s also
a downside: there are things that would be easy with small-scale data that become unmanageable at large scale.
In computer science, we’ve traditionally drawn a line between things that are really computable, and things that while theoretically computable, simply take too long on any real computer to be doable. We’ve drawn that line in terms of something called algorithmic complexity. The line has traditionally been that anything whose
complexity can be bounded by some polynomial computed from the size of its input is computable, while anything whose complexity is bounded by some exponential in the size of its input is not. But scale changes that. When you’re working with large
scale data, even polynomial complexity is not really doable. Even something
that grows very slowly, like sorting's n log n, becomes close to impossible at scale.
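A quick back-of-the-envelope calculation shows why. Assuming a single machine that can do a hypothetical hundred million comparisons per second (my number, chosen just to make the arithmetic easy), an n log n sort looks like this as n grows:

```python
import math

# Time for an n log n sort on one machine, at an assumed comparison rate.
RATE = 1e8  # hypothetical comparisons per second

# 4,000 tracks; 4 million tracks; 10 trillion records (the petabyte sort).
for n in [4_000, 4_000_000, 10**13]:
    ops = n * math.log2(n)
    seconds = ops / RATE
    print("n=%g: %.2e comparisons, about %.2e seconds" % (n, ops, seconds))
```

For 10 trillion records, that last line works out to roughly fifty days on one machine, which is why the real petabyte sort needed thousands of machines to finish in hours.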
A great example of how even people who deal with massive quantities of data
haven’t really grasped what a deep fundamental change we’re seeing in the scale of
things comes from the database world. A while ago, a database magazine published an
article critiquing Google’s MapReduce for not being sufficiently
relational-database-like. I wrote a post responding to it
(http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php),
explaining why that's wrong. I still get mail about that
old post from database folks who think that the stuff that’s typically done by
MapReduce should be done by relational databases.
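For readers who haven't seen it, the map-reduce pattern itself is tiny; what makes it interesting is that each phase shards cleanly across thousands of machines. Here's a single-process toy version (the function names and structure here are mine, not Google's):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy, single-process sketch of the map-reduce pattern."""
    # Map phase: turn each record into (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: combine all the values that share a key.
    return {key: reducer(key, values)
            for key, values in intermediate.items()}

# The classic example: counting words across a pile of documents.
docs = ["scale changes everything", "scale changes the problem"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["scale"])  # 2
```

In the real system, the map calls run on different machines near the data, and the intermediate pairs are shuffled over the network to the reducers; that's the part a single relational database node can't do for you.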
Databases were designed to deal with what were (at the time) considered massive
quantities of data. And as database people keep reminding me, modern databases support clustering – meaning using multiple computers in parallel to make things faster – just like Google, so that they’re more scalable than I give them credit for.
BUt there’s massive, and then there’s massive. Databases are great
at handling large quantities of data. But there are collections of data that are simply too large for current database technology. And much of the reason
for that isn’t just because databases haven’t been designed to work with network-scale datasets, but because many of the fundamental database algorithms and representations simply don’t work on the massive scale.
It's easiest to explain this through an example. This past weekend,
a team at Google (a team which I have no connection to; I know no one involved in
the task) announced the results of trying to sort one petabyte of data,
stored in 10 trillion 100-byte records. Doing this required 4,000
computers and 48,000 hard drives working together for just over six hours.
When database people talk about clustering, they’re usually talking about up to a
dozen or so machines. (I just did a quick search of IBM's database documentation on the
web, and couldn't find any reference to using more than 12 CPUs.)
There’s a big difference between dealing with 12 CPUs and gigabytes of data, and
4,000 CPUs and a petabyte of data stored on 48,000 disk drives! Modern large scale
systems are increasingly more like the petabyte scale of large map-reduce than the
gigabyte scale of databases. According to the post about the petabyte sort results,
Google routinely processes 20 petabytes of data every single day. I’m sure
that other large internet companies are seeing similar quantities of data. Both Yahoo and Microsoft are certainly maintaining search indices that have to be in that
rough size range; ISPs need to log traffic information for millions of users;
cellphone companies must be receiving billions of data packets per hour providing them
with location information about cellphones; and so on. And it’s not just
internet companies: maintaining genomic data on thousands of human beings (which is
something medical researchers want to do); or maintaining infectious disease tracking
information for large populations; or even managing auctions and bids on eBay: all of
these involve storing and processing data at this kind of scale.
That’s scale, and that’s downright scary. Just sorting one
twentieth of the data processed by Google every day takes a team of brilliant engineers
to figure out how to do it, and the result requires thousands of computers and tens of thousands of disks!
Imagine how many computers it must take to encrypt all of that information! Companies like Google, Yahoo, Amazon, and eBay are receiving millions
to billions of requests per day, and they’re not storing the information about those requests in text files that any bozo can read. Just think about what it means to
encrypt 20 petabytes of data per day.
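Just as a rough illustration: if we assume (my assumption, not a published figure) around 100 megabytes per second of encryption throughput per CPU core, then sustaining 20 petabytes per day means:

```python
# Rough arithmetic on encrypting 20 petabytes per day, assuming a
# hypothetical 100 MB/s of encryption throughput per CPU core.
PETABYTE = 10**15
daily_bytes = 20 * PETABYTE
seconds_per_day = 86_400

required_throughput = daily_bytes / seconds_per_day  # bytes per second
per_core = 100 * 10**6                               # assumed bytes/sec/core
cores = required_throughput / per_core

print("sustained throughput: %.0f GB/s" % (required_throughput / 1e9))
print("cores needed just for encryption: %d" % cores)
```

That's a couple of thousand CPU cores doing nothing but encryption, around the clock, before you've stored, indexed, or analyzed a single byte.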
There’s another important downside to scale. When we look at large quantities of information, what we’re really doing is searching for patterns. And being the kind of creatures that we are, and given the nature of the laws of probability, we are going to find patterns. Distinguishing between a real legitimate
pattern and something random that just happens to look like a pattern can be
somewhere between difficult and impossible. Using things like Bayesian methods
to screen out the false positives can help, but scale means that scientists need to learn new methods – both the new ways of doing things that they couldn’t do before, and the new ways of recognizing when they’ve screwed up.
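Here's a small simulation of the problem: generate a thousand datasets of pure coin-flip noise, test each one at roughly the p < 0.05 level, and count how many "significant patterns" show up anyway.

```python
import random

# Multiple-comparisons demo: test many streams of pure noise and watch
# 'significant' patterns appear by chance.
random.seed(1)

def looks_biased(flips, threshold=59):
    """Call 100 fair coin flips 'significant' if heads >= threshold.
    (59+ heads out of 100 happens by chance with probability ~0.04.)"""
    return sum(flips) >= threshold

false_positives = 0
trials = 1000
for _ in range(trials):
    flips = [random.randint(0, 1) for _ in range(100)]
    if looks_biased(flips):
        false_positives += 1

print("datasets that 'show a pattern':", false_positives, "of", trials)
```

Every single one of those datasets is noise, and yet dozens of them pass the test. Scale a search up to millions of comparisons, and spurious "discoveries" become a certainty, not a risk.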
That's the nature of scale. Tasks that were once simple have become hard
or even impossible, because they can't be done at scale. Tasks that were once
impossible have become easy, because scale makes them possible. Scale changes everything.