Moneyball: What Really Determines a Player's Salary?


Nick Swisher, the prospect the traditional scouts and statisticians agreed upon.

After watching "Moneyball" with my young son, a film about professional baseball and the struggle to build a strong team on a shoestring budget, I wondered: "What really determines a player's salary?"

Is it how telegenic he is? How provocative or reckless? How reliable a hitter he is? How many walks he draws?

What do you think? A research group at MIT took on this question from, of course, a mathematical and computer science point of view. To address this and many other questions, they developed an entirely new way to mine enormous data sets. Essentially, they found a new magnet to pull the proverbial needle out of a giant haystack.

They call it "MIC," just described in this week's issue of Science: {with my emphasis}


The maximal information coefficient.

Intuitively, MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship. Thus, to calculate the MIC of a set of two-variable data, we explore all grids up to a maximal grid resolution, dependent on the sample size (Fig. 1A), computing for every pair of integers (x,y) the largest possible mutual information achievable by any x-by-y grid applied to the data. We then normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1. We define the characteristic matrix M = (m_{x,y}), where m_{x,y} is the highest normalized mutual information achieved by any x-by-y grid, and the statistic MIC to be the maximum value in M.
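For readers who like to tinker, here is a rough Python sketch of the idea in that passage. It is not the authors' MINE software. The function names (equal_frequency_bins, mutual_information, approx_mic, pearson) are my own, and the sketch makes two loud simplifications: it tries only one equal-frequency grid per x-by-y size instead of searching every possible grid placement, so it can at best give a lower bound on the score the definition describes, and it assumes a cap of n**0.6 grid cells, whereas the passage says only that the maximal resolution depends on the sample size.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Label each value with one of k bins of roughly equal size, by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length label lists."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in count_ab.items()
    )


def approx_mic(xs, ys, exponent=0.6):
    """Crude MIC-style score: best normalized MI over equal-frequency grids.

    ASSUMPTIONS: only equal-frequency grids are tried (no search over grid
    placement), and the grid-size cap n**exponent is an assumed setting.
    Returns 0.0 when the sample is too small to allow even a 2-by-2 grid.
    """
    max_cells = len(xs) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= max_cells:
        ny = 2
        while nx * ny <= max_cells:
            mi = mutual_information(
                equal_frequency_bins(xs, nx), equal_frequency_bins(ys, ny)
            )
            # Divide by log(min(nx, ny)) so every grid size is scored on 0..1.
            best = max(best, mi / math.log(min(nx, ny)))
            ny += 1
        nx += 1
    return best


def pearson(xs, ys):
    """Pearson correlation coefficient, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up for illustration): a perfect parabola, symmetric about 0.
xs = [(i - 99.5) / 100 for i in range(200)]
ys = [x * x for x in xs]
print("Pearson:", pearson(xs, ys))
print("Approximate MIC:", approx_mic(xs, ys))
```

On this toy parabola the Pearson correlation is essentially zero, because the relationship is not linear, while the grid-based score comes out at essentially 1. That is the kind of difference the two measures can show, though real baseball data will of course be far noisier.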

We used MINE to explore four high-dimensional data sets from diverse fields.
...(iii) performance statistics from the 2008 Major League Baseball (MLB) season

In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged.
On the basis of ρ, the strongest three correlates with player salary are:
• walks,
• intentional walks,
• and runs batted in.

By contrast, the strongest three associations according to MIC are:
• hits,
• total bases,
• and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12).

We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary.

Judge for yourself: Are they right?
