The Terminator said:

Excluding the United States and Switzerland would make this worse.

Further, do you have any justification for excluding them?

Eliminating data points, simply because they don’t “fit” isn’t very

good methodology.

Because with least squares estimation, outlying values bias the

results. In this case they make the correlation higher, giving a

value that would be quoted by a politician, and not a statistician.

I made the same mistake on an oral presentation a semester ago.

I was computing a linear regression for data taken in the Millikan

Oil Drop Experiment. I “threw out” outlying points because I thought

that since they were much farther away than the rest of the clustered

points, they were erroneous. Henry Kendall, a Nobel laureate, who was

the instructor for Junior Lab, turned a shade of crimson, and explained

to me in no uncertain terms that throwing out those points was UNDESIRABLE.

When you discarded outlying points it was wrong. When I did, it was

right. The difference is that I knew what I was doing.

Here is a scatter plot of handguns vs homicide rate.

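[Scatter plot, layout lost in transcription. Vertical axis: homicide rate (per 100,000), ticks at 1, 2, 3, 5, 9. Horizontal axis: per cent households with handguns, ticks at 1, 7, 15, 29. Points are labelled by country initial: U near 9; N near 5; C, F near 3; S, A, N, F, W, S near 2; E, S near 1.]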

11 of the countries form a cluster in the bottom left hand corner of

the plot. Linear regression effectively fits a line connecting this

cluster with the US in the upper right corner of the plot. This gives

a reasonably large correlation coefficient since all the cluster

points are close to the line. It is reasonable to do another

regression just within the cluster.
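
To make the point concrete, here is a rough Python sketch of how one far-off point can manufacture a large correlation. The numbers are made up for illustration (they are NOT the survey figures), and `pearson_r` is just a helper name chosen here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a tight cluster with no upward trend of its own...
cluster_x = [1, 2, 3, 4, 5, 6]
cluster_y = [2, 1, 2, 1, 2, 1]
# ...plus one point far away in the upper right corner.
outlier_x, outlier_y = 29, 9

r_cluster = pearson_r(cluster_x, cluster_y)
r_all = pearson_r(cluster_x + [outlier_x], cluster_y + [outlier_y])

print("cluster only:", round(r_cluster, 2))
print("with outlier:", round(r_all, 2))
```

In this toy case the cluster on its own shows a weak negative correlation, while adding the single distant point pushes the coefficient above 0.9, which is why a second regression within the cluster is worth doing.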

It is also a good idea to think of reasons why the other countries do not

fit in the cluster. NI presumably because of the political violence

not present in the other countries, Switzerland because half of the

handguns were army weapons, something not occurring in other countries,

and the US because (fill in your own reason here).

He asked me, “Many years ago, it was taught that you should take three

measurements when weighing a compound, and throw out the maverick point.

Do you know what’s wrong with this?” I did. “Say that you have two

low measurements and one high, so you toss out the high. Your computed

average would be too low.” He concurred.

Let’s see, we weigh our compound three times and get weights of 26, 27

and 106 grams. One thing we can be sure of is that the true weight is

NOT 53 grams (the mean). If we have a 20% chance of getting a bogus

weight, then there is a 50% chance that the mean of three measurements

will be bogus. In this situation, a robust statistic such as the

median is called for.
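
The arithmetic can be checked with a few lines of Python. This is only a sketch; it assumes the three weighings are independent and that a single bogus weight is enough to spoil the mean.

```python
from statistics import mean, median

weights = [26, 27, 106]  # grams, from the example above

print("mean:  ", mean(weights))    # dragged upward by the 106
print("median:", median(weights))  # stays with the two agreeing values

# Chance that at least one of three independent weighings is bogus,
# given a 20% chance for each weighing (assumption: independence).
p_bogus = 0.20
p_mean_spoiled = 1 - (1 - p_bogus) ** 3
print("P(at least one bogus):", round(p_mean_spoiled, 3))
```

The mean comes out at 53 and the median at 27, and the probability is 1 - 0.8^3 = 0.488, which is the "50% chance" quoted above.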

I append an extract from comp.risks that may be relevant.

----------------------

Date: Mon 17 Mar 86 11:43:53-PST
From: JAGAN@SRI-CSL.ARPA
Subject: A Stitch in Time
To: Neumann@SRI-CSL.ARPA

This is the probable sequence of events that led us back in time on CSLA:

1. A power glitch (late night SUNDAY) caused the F4 to hard boot.

2. During a hard boot, the TIME is retrieved from eleven independent sources (which are assumed to be correct!)

3. One of these sources had the incorrect time of some warm day in 1972 causing the average to be wrongly computed resulting in Dec 6th/1985.

Suggestion:

1. Change the statistical measure from MEAN to something less sensitive to one or two abnormal times; for example the average of the 5th, 6th, and 7th largest times.

[IT IS ABSOLUTELY INCREDIBLE THAT UNSAFE ALGORITHMS continue to be used. This problem is as old as the hills. Statisticians routinely throw out the absurd values before computing the mean. Dorothy Denning pointed out the pun in their terminology (applicable to Byzantine agreement algorithms, where you don't trust anyone): the OUT-LIERS are really the OUT-LIARS. EVEN WORSE, Jagan points out that if the clock had been accidentally set INTO THE FUTURE, things could also get very sticky. We also have a problem of nonunique clock readings during the hour at 2AM when Daylight Savings Time ends. A good time to be asleep. PGN]
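
The suggestion in the extract (average the 5th, 6th and 7th of the eleven times) might look something like the Python sketch below. The clock readings are invented, expressed as seconds since an arbitrary epoch, and `middle_three_mean` is a hypothetical name; nothing here is taken from the actual CSLA boot code.

```python
from statistics import mean

def middle_three_mean(readings):
    """Average of the 5th, 6th and 7th of eleven readings, after sorting."""
    if len(readings) != 11:
        raise ValueError("expected exactly eleven readings")
    ordered = sorted(readings)
    return mean(ordered[4:7])  # zero-based indices 4, 5, 6

# Hypothetical readings: ten clocks agree to within a few seconds,
# one clock is stuck many years in the past.
good_clocks = [500_000_000 + i for i in range(10)]
stuck_clock = 70_000_000
readings = good_clocks + [stuck_clock]

plain_mean = mean(readings)
robust_time = middle_three_mean(readings)

print("plain mean: ", plain_mean)
print("middle three:", robust_time)
```

With these made-up values the plain mean lands more than a year before the true time, while the middle-three average stays within the spread of the ten good clocks. The same holds if the bad clock is set into the future instead, since the sort pushes it to the other end.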