
Nick Swisher, the prospect the traditional scouts and statisticians agreed upon.

After my young son and I watched “Moneyball,” the film about professional baseball and the struggle to build a strong team on a shoestring budget, I wondered: “What really determines a player’s salary?”

Is it how telegenic they are? Provocative, reckless? A reliable hitter? How many walks?

What do you think? A research group at MIT took on this question from, of course, a mathematical and computer science point of view. To address this and many other questions, they developed an entirely new way to mine through enormous sets of data. Essentially, they found a new magnet to pull out the proverbial needle from a giant haystack.

They call it “MIC,” and they describe it in this week’s issue of Science (emphasis mine):


The maximal information coefficient.

Intuitively, MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship. Thus, to calculate the MIC of a set of two-variable data, we explore all grids up to a maximal grid resolution, dependent on the sample size (Fig. 1A), computing for every pair of integers (x,y) the largest possible mutual information achievable by any x-by-y grid applied to the data. We then normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1. We define the characteristic matrix $M = (m_{x,y})$, where $m_{x,y}$ is the highest normalized mutual information achieved by any x-by-y grid, and the statistic MIC to be the maximum value in M.
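
To make the grid idea concrete, here is a rough Python sketch of my own, not the authors' code. It simplifies heavily: instead of searching for the best grid of each size, it tries only equal-frequency grids, so it may miss the best grid and should be read as an illustration rather than a real MIC implementation. The names `approx_mic`, `equal_frequency_bins`, and `mutual_information` are hypothetical. The cap of n^0.6 on the number of grid cells is an assumed setting (the excerpt says only that the resolution depends on the sample size), and dividing by log2 of the smaller grid dimension is an assumed normalization that keeps scores between 0 and 1.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Label each value with one of k bins holding roughly equal counts.

    Simplification: tied values are split by rank, which a true grid
    drawn on the scatterplot could not do.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information, in bits, between two label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def approx_mic(xs, ys, alpha=0.6):
    """Best normalized mutual information over equal-frequency grids.

    alpha is an assumed setting: grids are limited to x * y <= n ** alpha.
    """
    budget = len(xs) ** alpha
    best = 0.0
    x = 2
    while x * 2 <= budget:
        bx = equal_frequency_bins(xs, x)
        y = 2
        while x * y <= budget:
            by = equal_frequency_bins(ys, y)
            # Assumed normalization: divide by log2 of the smaller dimension.
            score = mutual_information(bx, by) / math.log2(min(x, y))
            best = max(best, score)
            y += 1
        x += 1
    return best
```

Calling `approx_mic(xs, ys)` on a perfectly linear relationship such as y = 2x + 1 gives a score of essentially 1. A symmetric parabola also scores high, even though it has no straight-line trend.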

We used MINE to explore four high-dimensional data sets from diverse fields.
…(iii) performance statistics from the 2008 Major League Baseball (MLB) season

In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged.
On the basis of ρ, the strongest three correlates with player salary are:
• walks,
• intentional walks,
• and runs batted in.

By contrast, the strongest three associations according to MIC are:
• hits,
• total bases,
• and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12).

We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary.
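
Why might the two rankings disagree? Assuming ρ in the excerpt denotes the Pearson correlation coefficient, it rewards only straight-line trends. Here is a small Python sketch with made-up numbers (not the MLB data; the function name `pearson` is my own) comparing a line with a symmetric curve in which y is still completely determined by x.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic, illustrative data only (not the MLB statistics).
xs = list(range(-100, 101))
line = [2 * x + 1 for x in xs]   # straight-line relationship
parabola = [x * x for x in xs]   # symmetric, nonlinear relationship

rho_line = pearson(xs, line)
rho_parabola = pearson(xs, parabola)
print(f"line:     rho = {rho_line:.3f}")
print(f"parabola: rho = {rho_parabola:.3f}")
```

The line gets a ρ of essentially 1 and the parabola a ρ of essentially 0. So a statistic that tracks salary in a curved or uneven way could rank low by ρ and still rank high by a measure like MIC.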

Judge for yourself: Are they right?
