Databases are hammers; MapReduce is a screwdriver.

A bunch of people have sent me links to an article about MapReduce. I've hesitated to write about it, because the currently hyped MapReduce stuff was developed, and is extensively used, by Google, my employer. But the article is really annoying, and deserves a response. So I'm going to be absolutely clear. I am not commenting on this in my capacity as a Google employee. (In fact, I've never actually used MapReduce at work!) This is strictly on my own time, and it's purely my own opinion. If it's the dumbest thing you've ever read, that's my fault, not Google's. If it's the most brilliant thing you've ever read, that's to my credit, not Google's. I wasn't asked to write this by Google, and I didn't ask their permission, either. This is just me, the annoying geek behind this blog, writing solely
on my own behalf, speaking for no one but me. Got it?

The reason that I'm interested in this is because it's related to my PhD work. Back in
grad school, what I worked on was applying parallelism to structured data in
non-scientific applications. That's pretty much what MapReduce does. And the solution
that I proposed was a kind of hierarchical scatter/gather operation - which is, very nearly, the
way that MapReduce works. The big difference? MapReduce beats what I did. The guys who designed MapReduce noticed something that I didn't, and took advantage of it, which made M/R programs a lot cleaner and easier to write. The article that I'm going to discuss criticizes M/R for exactly that key idea.

Let's start at the beginning. What is MapReduce? What does it do?

Suppose you're at work, and you need to do something that's going to take a long time to
run on your computer. You don't want to wait. But you don't want to go out and spend a couple
of million dollars buying a supercomputer. How do you make it run faster? One way is to buy a
whole bunch of cheap machines, and make it run on all of them at once. Another is to notice
that your office has lots of computers - pretty much every office has a computer on the desk of every employee. And at any given moment, most of those computers aren't doing much. So why not take advantage of that? When your machine isn't doing much, you let your coworkers borrow the capacity you're not using; when you need to do something, you can borrow their machines. So when you need to run something big, you can easily find a pool of a dozen machines.

The problem with that approach is that most programs aren't written to run on a dozen machines. They're written to run on one machine, and splitting a task among a bunch of computers is hard.

MapReduce is a library that lets you adopt a particular, stylized way of programming that's easy to split among a bunch of machines. The basic idea is that you divide the job into two parts: a Map, and a Reduce. Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines - so all the pieces run at the same time. Reduce takes the results from the sub-parts and combines them back together to get a single answer.

The key to how MapReduce does things is to take input as, conceptually, a list of records. The records are split among the different machines by the map. The result of the map computation is a list of key/value pairs. Reduce takes each set of values that has the same key, and combines them into a single value. So Map takes a set of data chunks, and produces key/value pairs; Reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can't tell whether the job was split into 100 pieces or 2 pieces; the end result looks pretty much like the result of a single map. (The key idea that differentiates M/R from my PhD work is the realization that by treating results as key/value maps, you get a very clean, uniform reduction process. I got the idea of the scatter/gather pattern, but my reduces were really ugly in comparison to M/R.)
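To make that concrete, here's a minimal, single-machine sketch of the pattern in Python. This is just an illustration of the programming model, not Google's library, and the word-count example is the standard toy problem rather than anything from the article:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy, single-machine illustration of the MapReduce programming pattern.
    A real implementation ships the map and reduce work out to many machines."""
    grouped = defaultdict(list)
    # Map phase: every input record becomes zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Reduce phase: each key's list of values is collapsed into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}

# The standard toy example: counting word occurrences across a pile of documents.
def count_words(document):
    for word in document.split():
        yield word, 1

def sum_counts(word, counts):
    return sum(counts)

documents = ["the cat sat", "the dog sat"]
print(map_reduce(documents, count_words, sum_counts))
# -> {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real MapReduce, the mapper calls run on many machines at once, and the per-key reducer calls can be spread out too - but the program you write still looks essentially like those two little functions.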

The beauty of MapReduce is that it's easy to write. M/R programs are really as easy as parallel programming ever gets.

So, getting back to the article. They criticize MapReduce for, basically, not being based on the idea of a relational database.

When I first wanted to spend some time learning about relational databases, my then boss told me a story about database people, which I've found to be remarkably true. The story is, RDB people have found the most beautiful, wonderful, perfect hammer in the whole world. It's perfectly balanced - not too heavy, not too light, and swings just right to pound in a nail perfectly every time. The grip is custom-made, fitted to the shape of the owner's hand, so that they can use it all day without getting any blisters. It's also beautifully decorated - encrusted with gemstones and gold filigree - but only in places that won't detract from how well it works as a hammer. It really is the greatest hammer ever. Relational database guys love their hammer. It's just such a wonderful tool! And when they make something with it, it really comes out great. In fact, they like it so much that they think it's the only tool they need. If you give them a screw, they'll just pound it in like it's a nail. And when you point out to them that dammit, it's a screw, not a nail, they'll say "I know that. But you can't expect me to use a crappy little screwdriver when I have a magnificent hammer like this!"

That's exactly what's going on here. They've got their relational databases. RDBs are
absolutely brilliant things. They're amazing tools, which can be used to build amazing
software. I've done a lot of work using RDBs, and without them, I wouldn't have been able
to do some of the work that I'm proudest of. I don't want to cut down RDBs at all: they're truly great. But not everything is a relational database, and not everything is naturally suited towards being treated as if it were relational. The criticisms of MapReduce all come down to: "But it's not the way relational databases would do it!" - without ever realizing that that's the point. RDBs don't parallelize very well: how many RDBs do you know that can efficiently split a task among 1,000 cheap computers? RDBs don't handle non-tabular data well: RDBs are notorious for doing a poor job on recursive data structures. MapReduce
isn't intended to replace relational databases: it's intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines. That's all it was intended to do.

To be specific, the criticisms from the DB guys were that MapReduce is:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

Point one is rubbish. M/R is a great way of programming some large-scale data
intensive applications. If you've got a large computation whose data is naturally described
relationally, and which is feasible to run on a single really big machine, and you have access
to said really big machine, then no question - you should be using an RDB. And for many
computations that seem too complex to run in a reasonable amount of time on a single machine, working out the proper layout in tables, with the proper indexing scheme, can make what seemed non-feasible downright easy. But that's not every computation. In fact, that's not even most computations. MapReduce isn't intended to replace the relational database.

Let me give an example. When I was writing about Fractals, one reader was fascinated by the BuddhaBrot, and decided to try generating an image of a related fractal, called the NebulaBrot. The basic idea of that is to take the set of points which are on the border of the Mandelbrot set, and trace their paths as you perform the Mandelbrot iteration. The more times a given point is crossed by one of those paths, the brighter you make that point. The end result is a very cool image. The problem with it is, it's not zoomable. The paths are erratic, and there's no way of figuring out which initial points are the roots of paths that are going to pass through a particular sub-area of the figure. So what the very bright reader did was build a huge NebulaBrot - the idea being that instead of starting small and then zooming, you start big and then squish. That gives you the ability to look at very high resolution views of small parts. Sort of like pre-zooming.

The way that he implemented the NebulaBrot was as a MapReduce. He needed to trace through paths of millions of points. That can take quite a bit of time. So what he did was divide up the initial points into subsets, and farm them out with a map. Each map instance computed the paths for some set of points. The result of the map was a set of pairs: points, and the number of times that paths had crossed those points. The points were the keys, the number of crossings were the values. The reduce took the set of point/crossing pairs, and summed up the crossings for each point. Then the result was translated into an image. Presto! Giant NebulaBrot, very quickly.
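I don't have the reader's actual code, but the structure he describes would look roughly like the sketch below. The path-tracing function, the pixel quantization, and every constant here are stand-ins I made up for illustration; they're not his implementation:

```python
from collections import defaultdict

def trace_path(c, max_iter=1000):
    """Hypothetical stand-in: iterate z -> z^2 + c from z = 0 and return the
    pixels the orbit passes through before escaping (crudely quantized)."""
    z = 0j
    visited = []
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return visited           # escaped: this orbit gets painted
        visited.append((round(z.real, 2), round(z.imag, 2)))
    return []                        # never escaped: contributes nothing here

def nebulabrot_map(chunk_of_initial_points):
    # Map: each worker traces the orbits for its chunk of starting points,
    # emitting one (pixel, 1) pair for every pixel an orbit crosses.
    for c in chunk_of_initial_points:
        for pixel in trace_path(c):
            yield pixel, 1

def nebulabrot_reduce(pixel, crossing_counts):
    # Reduce: the brightness of a pixel is just how many orbits crossed it.
    return sum(crossing_counts)

if __name__ == "__main__":
    # Tiny single-machine driver, standing in for the distributed framework.
    chunk = [complex(-0.75 + 0.02 * i, 0.1) for i in range(50)]
    grouped = defaultdict(list)
    for pixel, one in nebulabrot_map(chunk):
        grouped[pixel].append(one)
    image = {p: nebulabrot_reduce(p, counts) for p, counts in grouped.items()}
```

The structurally important part is that all the expensive work - tracing millions of orbits - lives in the map, while the reduce is just a per-key sum.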

You can't do that in a relational database. It's not a relational application. It doesn't really start with row data - MapReduce just calls its sets of input data "rows" because it's a nice metaphor; it doesn't mean that they're really tabular data. In the NebulaBrot, the
data isn't really rows, it's points. No relational tool is going to help with that - there's no scheme of careful, elegant table layout and indexing that's going to make that computation tractable on a single small machine. And relational tools aren't going to help make it run in parallel.

Moving on to point two, which was that MapReduce is primitive and sub-optimal, because it doesn't use indexing. Same basic response: indexing is a great tool if your data is tabular, and you have a central index that you can work with. But if your task isn't fundamentally relational, and what you really need is computation - like the massive numbers of floating point multiplications in the NebulaBrot - then indexes aren't going to help. The problem that MapReduce is trying to solve is making it easy to write a program that does a huge amount of computation in a reasonable amount of time. Indexes don't help if the essential task is computationally intense.

Point three: "it's not novel." No, stop, you must be kidding! It never claimed to be
novel. It's not supposed to be novel. In fact, the initial publication which
describes MapReduce
specifically states that it was inspired by the Map and Reduce primitives from functional programming languages! MapReduce is, basically, a form of data
parallel computing - that is, a scalable implementation of a well known, well understood,
easy-to-use model of parallel programming. How on earth is that a problem?

Point four: "missing features in RDBMS". Isn't this really just a different way of saying
the initial criticism? It's really just another version of "It's not relational, so it's not
good". Once again: that's deliberate. MapReduce isn't a database application. If you've got a
relational database problem, use a relational database. If you've got a massive computational
task that you want to distribute among dozens, or hundreds, or thousands of small machines,
then use MapReduce. Those wonderful DBMS tools? They weren't designed for massively parallel
cluster/cloud computing. In fact, they don't work in the massive cluster
environment.

Point five: "incompatible with tools that DB developers have come to depend on". Are
you starting to see why I told the hammer story? This is just another restatement of the same old original criticism. It's not a relational database, so it's not good. No, database-oriented tools and processes don't work for MapReduce programming. In fact, what
makes for a good DB design will often make for a really piss-poor M/R design. They're not trying to do the same thing.

So in the end, is there anything worth taking from the database guys critique of
MapReduce? Frankly, I don't think so. They really don't seem to understand what M/R is
designed for, or why it's designed the way it is. They seem to have homed in on the fact that
M/R uses the terms "table" and "row", and concluded that it's intended to be something like
the kinds of tasks for which you'd use an RDBMS. But aside from the superficial resemblance in
terminology, there's very little connection between the two ideas. M/R is a great tool for a
particular family of computational tasks. That doesn't detract from RDBMS - relational
databases are also an amazing tool for a large family of computational tasks. The fact that
the relational database approach isn't the ideal solution to every task isn't the
fault of M/R. The fact that M/R uses similar terminology to RDBMS despite being intended for a very different family of tasks than RDBMS doesn't mean that M/R is wrong. M/R isn't an alternative to relational databases. It's something entirely different.

Just because you've got the best hammer in the entire world doesn't make everything a nail. If you've got a screw, even a cheap, old, rusty screwdriver is going to do a better
job. And MapReduce is a lot better than a cheap, old, rusty screwdriver.


Thanks for clearing that up, Mark.

"RDBs don't handle non-tabular data well: RDBs are notorious for doing a poor job on recursive data structures."

And it's somewhat surprising how often this sort of structure comes up in real-world data. Our brains naturally try to sort things into hierarchical structures, but that's not actually how most naturally-occurring data is most accurately laid out.

By Scott Simmons (not verified) on 22 Jan 2008 #permalink

As computers are starting to invade more and more space, do you think that we'll end up writing a significant portion of our programs in a way that they can make use of available computing devices? It'd be cool, if we actually could employ idle office computers, mobile phones or toasters just like that, but personally I don't see that in the very near future.

Flaky:

No, I don't.

The problem is, it's damned hard to program things to run in parallel. That's the biggest issue. No one is going to write parallel programs until performance really becomes a serious issue.

When it does become an issue, it's hard enough to make parallel programs run effectively when you've got a pretty
uniform collection of processors, which is the basic assumption of both M/R and of my dissertation work. If you're talking about distributing computation among a large collection of random processors - from the minuscule processor in your toaster to the decent processor in your DVR to the heavy iron in a gaming PC - the difficulty of making things work, getting a decent load balance, and having the end result actually be fast enough to justify the trouble is just too much. It just won't pay off in the end.

I think this kind of parallelism is going to be limited to mostly homogeneous environments: an office full of identical PCs, racks full of servers, data centers, etc.

On the other hand, I also think that you're going to see an increase in the number of places where you can effectively rent a bunch of CPUs to do a job. So when you have something massive to run, you'll be able to wander to sourceforge or IBM or someone like that, with a big datacenter, and rent enough time to run your system on their hardware.

"RDBs don't handle non-tabular data well: RDBs are notorious for doing a poor job on recursive data structures."

RDBs or SQL-based DBs? Recursive joins have been around since the earliest days of relational theory - but SQL cannot describe them; it seems pathologically opposed to recursivity.

By gwenhwyfaer (not verified) on 22 Jan 2008 #permalink

(...or recursion, even)

By gwenhwyfaer (not verified) on 22 Jan 2008 #permalink

Very well put Mark. One of my favorite tag lines is "if your only tool is a hammer, every problem looks like a nail": Drs. DeWitt and Stonebraker have made their reputations in RDBMS -- DeWitt in particular with parallel databases. It's their only tool.

All refutations of the original article have valid points, but they miss this one issue: when you're talking about petabytes, you might want to look at some of the advancements in research by people who've been dealing with petabytes all this while.

It's not about relational data. That is not the only thing databases do. Saying that database research is about RDBMS is like saying OS Research is about hacking Linux.

I agree that MapReduce is a tried and tested programming model, and a great "screwdriver". However, what is happening right now is that the research community is getting excited about what is an excellent piece of *engineering*. Soon, the research communities (OSDI, etc.) are going to start spending millions of man hours researching "MapMergeReduce" and "Data Structures for Map Cache Locality", completely forgetting that parallel joins and index structures already exist - you just have to look outside of your OSDI proceedings and read SIGMOD papers from 20 years ago.

So it's basically a plea from one research community to another: Please look at our work; we think you'll reinvent the wheel for the next 5 years if you don't.

The original blog has nothing to do with engineering. Google's MapReduce is an excellent implementation. I have yet to see an open source equivalent that can scale beyond 1000 nodes and still be stable. (And let's not even talk about how efficient Hadoop is.) Hence, unless Google releases something tangible, any "research" in MapReduce will have NO impact on the world, other than

a) proving that a 20 year old paradigm still holds
b) they have some magic sauce that you will never get hold of

Given these points, how does Google's MapReduce, and the publicity it gets, benefit the budding computer scientist, engineer, or blogger? It doesn't. You're better off using PBS for computing; you'll be more efficient. This is why the DB community is asking: "if you want to innovate beyond mapreduce, which you should, take a look at our work, it's useful."

All of this being said, we should look at who's getting this right. The Pig project at Yahoo creates a declarative layer OVER Hadoop. Not only do you benefit from the M/R paradigm, you also have a compiler that generates tweakable M/R routines from terse, manageable lines of simple declarative code. This can potentially save thousands of hours of coding time, where you were writing m/r after m/r, doing the same old sort, merge, group, filter operations. This is what people will want once MapReduce is done. Have the OS folks thought of this? No. The DB people did, because they've done similar stuff before. That's the basic point.

[disclosure: unlike the author, i have written some goog m/r code.]

I have to agree, it was a stupid article, and MapReduce really is a very old technique. I used it already in the '70s - yes, there were systems with multiple CPUs and attached I/O even then, so threading is nothing new or difficult. The most concrete example from that time is a monthly report and microfilm system which, as designed by our database people, took 31 days to run (if I had let it - but not every month has 31 days!) It took me two days to rebuild the whole system to use MapReduce-style processing - run time less than 6 hours after the changes. The other benefit was that once I had the skeleton, we were able to build any reporting data just by parameterizing the run, instead of having one system per report. By the way, the system is still in use today.

Unfortunately this is something which is not in most CS/DBA education. Even last summer I had to evaluate a (new) system which had the same problem on a smaller scale: some terminal sessions starting slowly because of some indexed access. After switching it to MapReduce-style processing, the start time went from 120-160 seconds to around 2 seconds. There have been many more cases over the years. It actually is not too difficult to build a model which shows the benefits, but once you start talking about seek times, latencies, and channel/controller/memory congestion with people who only know how to write some vendor SQL, you get empty stares and arguments. It can be very frustrating. The art of performance has been forgotten for too long.

By Tuomo Stauffer (not verified) on 22 Jan 2008 #permalink

A question from a non-computer scientist who is nonetheless interested in this discussion. Isn't Google a distributed collection of massive databases? In my mind's eye I imagined the data that originates from applications like Gmail, Docs etc to be somehow stored in databases - or am I missing the point, and is this MapReduce primarily about search only?

By Grant Sundin (not verified) on 22 Jan 2008 #permalink

gwenhwyfaer:

No, recursive joins (which ARE supported in SQL with vendor extensions - see Oracle's 'connect by' extension for an example) are not very efficient; they can't use indexes well enough.

There are tricks to accelerate tree structures, but they require certain trade-offs.

Network (object) databases are much better for hierarchical structures, but they suck at ad-hoc relational queries.

By Alex Besogonov (not verified) on 22 Jan 2008 #permalink

Grant:

I don't work on search or docs, so I don't actually know specifically how they're implemented. But even if I did, I wouldn't be able to comment on it.

From my own point of view, I don't see MapReduce as a search thing. I see it as a processing thing. That is, the point of MapReduce isn't to be able to, say, get a fast result when you type something into a search box. It's to do something that takes a long time - something that involves a lot of computation.

One of the things that I spent a lot of time on in grad school is network latency. Spreading something over a network has a significant cost in communication time. It's not worth doing something on more than one machine unless it's going to take a long time, because otherwise, the time spent just getting stuff onto the wire to send it to other machines will outweigh the cost of just doing the computation.

I haven't checked the latencies recently. But as of about five years ago, the latency was high enough that you could send something close to 1/4 of a megabyte of data effectively for free - the cost of just getting something onto the network meant that there was no measurable difference between sending 300 bytes and 300,000 bytes. In terms of computing cycles, you're talking about millions of cycles in the time it takes to send a message and receive a response. So unless you have something that takes millions of cycles to compute, sending it to another machine will slow you down, not speed you up.
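To put rough numbers on that - and these are illustrative assumptions on my part, not measurements - the break-even point falls out of a one-line calculation:

```python
# Back-of-envelope: how much computation does one network round trip "cost"?
# The constants below are illustrative assumptions, not measured values.
ROUND_TRIP_SECONDS = 1e-3        # assume roughly a millisecond to send a request and get a reply
CYCLES_PER_SECOND = 2.5e9        # assume a ~2.5 GHz core

break_even_cycles = ROUND_TRIP_SECONDS * CYCLES_PER_SECOND
print(f"A chunk of work should cost well over {break_even_cycles:,.0f} cycles "
      "before shipping it to another machine can possibly pay off.")
# -> A chunk of work should cost well over 2,500,000 cycles ...
```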

So while I don't know, I'd be shocked if MapReduce had any connection with something like Google Docs. I just can't imagine that there's any task in Docs that needs enough computation to justify the immense cost of communicating.

As a computer programmer, there's a couple of things I've noticed:

  1. People who do research-like projects that have good reasons for not using relational databases (like the ones above) will still insist on not using relational databases even when they would be useful (especially when the same data needs to be accessed outside their black box).
  2. People who have always used relational databases question things like MapReduce because they trust so totally in SQL query optimization, prefetching, indexing and caching, and think... WOW, that would be hard to beat with any other data store, and I certainly wouldn't be bright enough to write it. Thus people who've adopted relational databases want to use them even when they are not the best fit.
  3. As you've said above, it's really a matter of picking the right tool for the right job, which is always easier said than done (see 1 and 2).

    Kailden:

    You're right.

    One of my own pet frustrations is how many people use programming languages that are incredibly awful fits for the program they're writing. But it's what they know, so they stick with it, even though it's a rotten tool. Human inertia is always a problem.

    To me, the entire article seemed pointless. It was just like saying:

    "This apple is awfully bad and not tasty at all. Why is it a bad apple? Well, because it's not an orange. Let us explain: see, this apple clearly lacks the orange colour so characteristic of oranges. The skin is totally too thin, and after peeling it off, you can't really split of with your fingers. This points prove that this apple is not an orange, therefore it must be a bad apple."

    Parallel/distributed computation and databases are totally different things - even if databases can be run on distributed systems. How can you even try to criticize one for not being the other?

    Mark, the "wrong tool for the job" is in a way a parallel to the comment you had above on networked computing not being worth it unless you outweigh the cost of communication: unless the problem is large and complex enough (and we actually realize it), learning a new language, or even switch from the language you've been using lately to one you're not quite current in, may simply take more effort and time than just using the one you're up to speed in even if it takes a little longer.

    Another non-programmer IT person (who went to CMU and used Andrew in the 80's) with a question: What say you about OS-level parallel processing support like Xgrid? I've got 2500 machines on the network with the capability, but we honestly don't have that many BIG computing tasks to handle - but all I've got to do is flip a switch on the client side and it's available to any Xgrid serving machine we might have. Seems attractive, I'd think, being supported by the OS.

    Also, is it based on M/R, or is it completely different?

    grahamsw wrote:
    "and my favorite version is... when all you have is a hammer everything looks like a baby seal"

    Mine is "when all you have is a nailgun, every problem looks like a messiah."

    Nice article BTW...

    While I don't understand the technical aspects here as I'm not in computer programming, this analysis is beautifully written and brings to my mind the frustrating meetings I've attended where people objected to things because the proposed changes didn't fit their personal views of the world/work habits/power agendas, as opposed to whether the proposed changes would really provide benefits to those of us affected by the change.

    By Texas Reader (not verified) on 22 Jan 2008 #permalink

    What I find particularly humorous about the article's point No. 2 "It doesn't use indexes!" is that the original mission of MapReduce was to...create indexes, specifically, an enormous index of the web.

    My impression is that MapReduce is used today to create the very sort of indexes for fast searching that the authors insist MapReduce should be using. Do they think enormous data structures organizing the internet are just delivered in the night by the index fairy or something?

    By Matthew L. (not verified) on 22 Jan 2008 #permalink

    I'm wondering if there's talking past each other going on between the RDB people and the M/R people.

    You said that "RDBs don't handle non-tabular data well' but you also said that "The result of the map computation is a list of key/value pairs." - isn't that a table?!

    Is there an argument to be made that processing the Reduce portion of the endeavor might be better handled by a database after the Map portion is completed, or doesn't that make any sense?

    The other amusing thing about the article was its claim that M/R would not scale well. Umm, isn't Google already proving otherwise? It's like staring at the Earth from the surface of the moon and asserting it isn't round.

    Some semi-offtopic comments:

    The n_{1/2} (the byte size in a point-to-point transfer where the bandwidth reaches half the realized maximum) is around 1-2kB for Infiniband and around 8-10kB IIRC for GigE. So, the latencies have come way down. (The latter means, for example, that it takes 160 microseconds to send 10KB to another computer. Relatively long by some standards, since a single 2.4GHz core can do at least 300K operations on that data during that time. But since Google (according to the standard MapReduce paper) is using the filesystem for the push part of its data transfer (the result being, presumably, much smaller) of 64KB blocks, I don't think it's the latency that Google's folks are worried about. There must be enough work to make it worth distributing.)

    Thanks for mentioning the generally non-scientific nature of MapReduce's current uses. It's not generally applicable for the sort of tightly-coupled algorithms done in, say, large-scale fluid dynamics simulations. Though, someone out there has done a matrix-matrix multiply in MapReduce. They don't give the performance of the underlying serial code, so it's hard to evaluate the algorithm in comparison to popular threaded implementations like GotoBLAS.

    It should be pointed out that for Google's well-touted application (re-indexing all of their web data) MapReduce's scheduler relies on the distributed redundancy in GoogleFS to ensure that the data to be processed is on the machine that will process it when it needs it and to handle any failure modes on that machine (including poor performance). I.e., there can be more to using MapReduce than just the gains enabled by the algorithm itself.

    Finally, Mark, though you can be forgiven since you're a functional-language geek, I'd hoped you would point out that Google's MapReduce implementation isn't just some "library" but, in fact, a C++ class library! :)

    Billb:

    I didn't say anything about how MapReduce is implemented. As I mentioned, I don't use MapReduce in my work, so I'm not too familiar with the details. I *think* there's more than one implementation, with different performance related tradeoffs. The original M/R, according to the published paper, is in C++. I would guess that the others are in either C++ or Java, but I don't know for sure.

    I'm an old DB guy and a tech programmer, and I'm confused. Map/reduce sounds very similar to the parallel index creation that Oracle, and probably others, has had for years. I remember investigating index creation using multiple CPUs about 10-15 years ago on a newfangled 64-CPU n-cube; nowadays you build an index by saying parallel = n and, tada, it does it.

    If you don't have a multi processor box with an expensive database installed or don't want to, then map-reduce sounds like a nice way to build indexes.

    When I started, mainframes were on the way out and minis were all the rage; many of the mini folk were rediscovering techniques the mainframe guys had used for years. Now the new parallel PC people are saying "look at this, how cool", and the mini people are saying "we've been doing that for years"... and so on.

    Davo: That's a common thing to happen in computing. When I was in grad school for computer science (early 80s), I recall someone who'd been to a computer science conference and decided to drop in on a PC conference next door. He came out of it saying "My god! These people are rediscovering every mistake we ever made!"

    Davo:

    If all that MapReduce did was create indices for a table in a database, you'd be right. But that's not what it does - that's the point. Look at the example I gave of generating the NebulaBrot. How can you reduce that to building a database index?

    MapReduce isn't a database system. It's not a storage system, or an indexing system, or a query system. It's an execution strategy, for distributing a complex computation among a collection of machines. You can use MapReduce as the implementation strategy for index generation for a database. You can use a database as a storage system for managing data for a M/R computation. They're different things.

    The fundamental idea of MapReduce is to take advantage of the computational capability of cheap computers, instead of getting expensive multiprocessor machines.

    As an example: if I were still in grad school, my advisor would be able to buy 20 cheap dual-core PCs and put them in a rack for $10,000 or so, and I could use them as a cluster with 40 processors. How much would it cost me to buy a 40 processor parallel system? With M/R, I could easily write parallel programs for running on that cluster of 20 machines. Are there any database systems that I can buy that can distribute a computation among those 20 machines? Are there even any databases that I can buy that will distribute index generation among those 20?

    The point is, M/R and RDBMS are totally disjoint things.

    So it's basically a plea from one research community to another: Please look at our work; we think you'll reinvent the wheel for the next 5 years if you don't.
    Diablo, that's one of the better comments I've seen on this topic, on this blog, and others, where you are actually cutting through the hype and distilling this for the masses. This excitement reminds me of the SOAP/Web Services hype somewhat, with the old guard, shaking their heads, saying "um, what about transactions? security?", the Young Turks responding "What are you talking about, Pops?? We've got XML, it's a whole new world! See ya!"

    By Robert Sullivan (not verified) on 24 Jan 2008 #permalink

    Thank you for the response to the other article. It helps me understand more about the benefits and drawbacks to MapReduce.

    I'll bet this is something that can be used for quicker rendering of complex 3D images (movies). And with things like Amazon's elastic compute cloud, you've already got an infinite (kind of) number of processors for doing all that parallel work.

    The database "guru" that my company hired insisted that SQL Server would be able to do game battle simulations faster than our native C code. Yes, some RDB guys really drink their own kool aid!

    Great article, thanks.

    Nice article, and I agree wholeheartedly. I didn't read the original article as a plea at all, it read more like an angry nonsensical rant. Sure relational databases are wonderful, but they aren't the right solution for heaps of chaotic data. M/R is sooo soo much more than a query technology.

    Mark:
    Someone proposed to me that if you replaced every instance of "MapReduce" with "BigTable" in that article it almost made sense. Can you comment on whether their criticisms might apply to BigTable?

    David:
    lol

    Mark:
    Have a look here http://en.wikipedia.org/wiki/Merge_sort - turn the diagram on the right sideways and it looks very much like the map/reduce algo. With merge sort, processes are forked and the merge is performed. The same scheme is used many times in data warehouses (where I'm from): multiple processes split the data set, change it, and merge it back together. So the concept of map/reduce has been around a long time.

    What map/reduce seems to do which is new (after a very brief read) is, as you say, provide the plumbing for looking after the processes on disparate machines. This has been done before in RDBMSs, and imho a bit more elegantly, but as you say, to use map/reduce you don't need an RDBMS.

    So imho it is a nice implementation of a system combining a couple of old concepts. Is it revolutionary? Well, I s'pose that's a matter of semantics: if you come from Java/web land then perhaps it's new; if you come from big database land it's a nice implementation, but one with some flaws.

    There is one thing I don't see in the paper (which I probably missed), and that is: once you transform your data, what do you do with it? How do you look it up without some sort of indexing system, be it an RDBMS or a flat file?

    cheers,
    Davo

    Aren't there a couple of dynamics missing from the discussion here that explain the chasm between DB and M/R? Good-enough and economics. I had some Japanese researchers come to me surprised about the "fault-acceptance" computing of BigTable. It blew their minds that faults were expected and designed into the system. But it makes sense when you realize that there is no way to evaluate the quality of the results (who will notice entry #6 missing? Nobody.). So maybe M/R does fine in the world of "good enough", which makes sense when the premier implementation is based on a free service model. "Hey, what do you want for advertising-supported?" Second, M/R is a response to the significant economics required for an enterprise-class DB. M/R is (as you put it in your example) something that came out of grad schools, by engineers smart enough to understand that the cost of the x86 platform makes a lot of sense if you can parallelize the compute. If mainframe compute pricing falls, then guess what, the next great compute system will be a big unified box. Regardless, as grads and open source programmers realize that they can hook up 20 x86 boxes from Fry's for a lot cheaper than a big ole honkin' dbase, guess which system they'll use?

    There are commercial RDBs that allow index generation over 20 machines. They aren't cheap, and it is rare that a company would need such a system for their standard business processes (which is where you would use an RDB system). That doesn't mean such systems are a replacement for MapReduce; different tools for a different set of problems.

    Hmm. On the other end of the spectrum, there are a lot more situations where RDBs are part of the problem instead of the solution. If it fits in RAM on a single machine, it should probably not be in an RDB. A transaction log and a memory dump are a lot less code and allow developers to stay on the OO side of things. Smalltalk is somewhat more powerful than SQL for ad-hoc querying and handles hierarchical data just fine.

    By Stephan Eggermont (not verified) on 02 Feb 2008 #permalink

    As an RDB guy, I find that this "rebuttal" of the M/R criticisms by DeWitt and Stonebraker really misses most of the points of the original article. It's not really so much that RDB folks have a hammer that they want to use for every problem, but rather that those who have to deploy extremely large data archives do not usually have the luxury of choosing one technology (M/R) for some problems and another (RDB) for others.

    As a general purpose tool for data intensive applications, it is hard to beat RDBs, and it's perfectly valid to say, "if you want to use M/R instead, you better have really good reasons for doing so, and be warned that you will be giving up all these advantages of RDBs". Having had experience for over 10 years with deploying a multi-terabyte scientific archive, I can assure you that there are tons of other considerations beyond just parallel query performance. The advantages that DeWitt & Stonebraker list are very important factors in choosing one technology over the other.

    So while M/R may be fantastic at handling a certain kind of computational problem, RDBs remain excellent overall solutions for very large scientific data archives, and those considering M/R seriously would serve themselves well by remembering that.

    Has anyone else noticed the Database Column article has disappeared?

    By Chris Gonnerman (not verified) on 19 Jan 2010 #permalink

    I have not done much research on parallel programming, but I still have a question: don't multi-core processors do the same thing as MapReduce? I have worked with some 32-core processors. Or will MapReduce result in even faster computation?

    hyperslug, thanks for the link

    By Anonymous (not verified) on 20 Jul 2010 #permalink
