This post contains some more notes on a reply to the badly flawed “Main Street Bias” paper.

In my previous post I showed that the MSB paper was wrong to claim that it was plausible that the unsampled region was 10 times as large as the sampled region. In this post I look at their model. Their model is wrong because it assumes that there is no main street bias within the sampled region, and as a result they massively overestimate any bias in the Lancet sampling.

Let’s start with a correct model of the situation. I’ve adopted their terminology where possible.

We have a population of size N divided into Ni people inside the survey space (Si) and No people outside the survey space (So). The death rate is Bi for people living in Si and Bo for people living in So. The overall death rate B is just the weighted average of Bi and Bo:

B = (Ni*Bi + No*Bo)/(Ni + No)

The bias introduced by using Bi instead of B is

R = Bi/B

If we let n = No/Ni and b = Bo/Bi, after a little algebra we find that

R = (1+n) / (1+n*b)

If you plot this function, you will see that to get significant bias you must have **both** n significantly bigger than zero and b significantly different from 1. Even in an extreme case where n and b are both 2 (i.e. two-thirds of the houses are missed and the death rate is twice as high in the unsampled region), the bias factor is only 0.6. To get the large R=3 bias that the MSB authors propose, they need implausibly extreme values for both n (10) and b (0.27): that is, the unsampled region is ten times as big and its residents suffer just one quarter the risk of violent death.
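
If you would rather check the numbers than plot the function, here is a minimal Python sketch. The function names `bias_factor` and `bias_from_rates` are mine, not anything from the MSB paper; the second one recomputes R straight from the weighted-average definition as a sanity check on the algebra.

```python
def bias_factor(n: float, b: float) -> float:
    """R = Bi/B, where n = No/Ni and b = Bo/Bi."""
    return (1 + n) / (1 + n * b)


def bias_from_rates(n_in: float, n_out: float,
                    rate_in: float, rate_out: float) -> float:
    """The same R, computed directly from the weighted-average definition of B."""
    overall = (n_in * rate_in + n_out * rate_out) / (n_in + n_out)
    return rate_in / overall


# No bias if nothing is missed (n = 0) or if the two rates match (b = 1).
print(bias_factor(0, 0.27))   # 1.0
print(bias_factor(10, 1))     # 1.0

# The two cases discussed in the text.
print(round(bias_factor(2, 2), 2))      # 0.6
print(round(bias_factor(10, 0.27), 2))  # 2.97
```

Either n = 0 or b = 1 on its own is enough to make R exactly 1, which is why both have to be extreme before the bias gets large.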

To get their implausible value for b, the MSB folks use their own model. Its parameters are q, the risk of violent death while you are in Si divided by the risk while you are in So; fi, the fraction of time residents of Si spend in Si; and fo, the fraction of time residents of So spend in So.

The formula they derive for R is equivalent to saying that

b = (q - (q-1)*fo) / (1 + (q-1)*fi)
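
To see where that expression comes from, here is a small Python sketch that builds b from time-weighted risks. It assumes, as their model does, that the risk per unit time is q anywhere in Si and 1 anywhere in So (only the ratio matters). The name `msb_b` is my own label for illustration.

```python
def msb_b(q: float, fi: float, fo: float) -> float:
    """b = Bo/Bi implied by the MSB model.

    Risk per unit time is taken as q in Si and 1 in So.
    """
    # Si residents spend fi of their time in Si and the rest in So.
    risk_si_resident = fi * q + (1 - fi) * 1   # equals 1 + (q-1)*fi
    # So residents spend fo of their time in So and the rest in Si.
    risk_so_resident = fo * 1 + (1 - fo) * q   # equals q - (q-1)*fo
    return risk_so_resident / risk_si_resident


print(msb_b(1, 0.5, 0.5))  # 1.0: equal risk everywhere, so no bias
print(msb_b(5, 1, 1))      # 0.2: nobody leaves their own region, so b = 1/q
```

The two limiting cases show what the model is doing: b only moves away from 1 when q differs from 1, and it is most extreme when residents never leave their own region.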

To see what’s wrong here, look at their argument for a high value of q:

> It is likely that the streets that define the samplable region Si are sufficiently broad and well-paved for military convoys and patrols to pass, are highly suitable for street-markets and concentrations of people and are, therefore, prime targets for improvised explosive devices, car bombs, sniper attacks, abductions, and drive-by shootings. Given the extent and frequency of such attacks, a value of q=5 is plausible.

Where do they think that the people at street markets and those forming concentrations of people come from? The people in the unsampled region have to go to markets as well and there is no reason to suppose that they spend less time there than people from the sampled region. This means that attacks on markets and concentrations of people produce no main street bias.

Let’s make this concrete with an example that reflects the pattern of violence that the MSB authors think leads to main street bias and has exactly the parameters in their model that the authors claim are plausible.

We have 3,000 people in So and 300 people in Si, so n = 10. We have 30 violent deaths occurring in So and 15 in Si, so q = 5. There is a market in Si where folks from So spend 1/14 of their time; people from Si also spend 1/14 of their time in the market, and another 1/14 in So.

11 of the 15 violent deaths in Si happened in the market. This gives the “plausible” parameters used in the paper and their formula says that R=3.0.

Because the market draws people equally from Si and So, 1 of the deaths at the market was a resident of Si and the other 10 were from So. So residents of So suffered 40 violent deaths and Bo = 40/3000 = 1.3%. Residents of Si suffered 5 violent deaths, so Bi = 5/300 = 1.7% and b = Bo/Bi = 0.8. Plugging n=10 and b=0.8 into my formula, we get R=1.2. So in this example, despite a huge value of n and deaths tending to occur on main streets, the bias was negligible and the MSB model wrongly suggested that the bias was large.
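
The arithmetic in this example can be replayed with a short Python sketch. It is only a check under stated assumptions: fi = fo = 13/14 is read off the time fractions given above, the market deaths are split in proportion to the two resident populations (both groups spend the same 1/14 of their time there), and the 30 deaths in So and 4 non-market deaths in Si are assigned to local residents, which matches the totals of 40 and 5. All variable names are mine.

```python
from fractions import Fraction as F

n_so, n_si = 3000, 300
n = F(n_so, n_si)                      # 10

# What the MSB model predicts from q, fi and fo.
q = 5
fi = fo = F(13, 14)                    # assumed from the 1/14 time fractions
b_model = (q - (q - 1) * fo) / (1 + (q - 1) * fi)
r_model = (1 + n) / (1 + n * b_model)

# What the tally of deaths by residence gives.
so_deaths, market_deaths, other_si_deaths = 30, 11, 4
market_si = market_deaths * F(n_si, n_si + n_so)   # 1 market victim from Si
market_so = market_deaths - market_si              # 10 market victims from So
deaths_si_residents = other_si_deaths + market_si  # 5
deaths_so_residents = so_deaths + market_so        # 40
b_actual = (deaths_so_residents / n_so) / (deaths_si_residents / n_si)
r_actual = (1 + n) / (1 + n * b_actual)

print(round(float(b_model), 2), round(float(r_model), 2))    # 0.27 2.95
print(round(float(b_actual), 2), round(float(r_actual), 2))  # 0.8 1.22
```

Exact fractions are used so that the split of the 11 market deaths comes out as whole people rather than floating-point approximations.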

The reason their model gets it wrong is that it assumes that the risk is the same, on average, everywhere in Si. That means that even though Si residents only spend 1/14 of their time at the market, the model assumes that they are exposed to the risk from the market 24 hours a day.