Now on ScienceBlogs: The Galaxy's Biggest Valentine

ScienceBlogs Book Club: Inside the Outbreaks

Good Math, Bad Math

Finding the fun in good math; Shredding bad math and squashing the crackpots who espouse it.

Search

Profile

markcc.jpg
Mark Chu-Carroll (aka MarkCC) is a PhD Computer Scientist, who works for Google as a Software Engineer. My professional interests center on programming languages and tools, and how to improve the languages and tools that are used for building complex software systems.

Donors Choose

Other Information

Add this blog to my Technorati Favorites!

Recent Posts

Recent Comments

Categories

Blogroll

Old Topic Indices

Great Online Books

« Random Variables | Main | Ben Stein and Darwin: Truth is what matters. »

Probability Distributions

Category: statistics
Posted on: April 14, 2008 1:19 PM, by Mark C. Chu-Carroll

I'd like to start with a quick apology. Sorry that both the abstract algebra and the new game theory posts have been moving so slowly. I've been a bit overwhelmed lately with things that need doing right away, and by the time I'm done with all that, I'm too tired to write anything that requires a lot of care. I've known the probability theory stuff for so long, and the parts I'm writing about so far are so simple that it really doesn't take nearly as much effort.

With that out of the way, today, I'm going to write about probability distributions.

Take a probabilistic event. That's an event where we don't know how it's going to turn out. But we know enough to describe the set of possible outcomes. We can formulate a random variable for the basic idea of the event, or we can consider it in isolation, and use our knowledge of it to describe the possible outcomes. A probability distribution describes the set of outcomes of some probabilistic event - and how likely each one is. A probability distribution always has the property that if S in the set of possible outcomes, and ∀s∈S:P(s) is the probability of outcome s, then Σs∈SP(s)=1 - which is really just another way of saying that the probability distribution covers all possible outcomes.

In the last post, about random variables, I described a formal abstract version of what we mean by "how likely" a given outcome is. From here on, we'll mainly use the informal version, where an outcome O with a probability P(O)=1/N means that if watched the event N times, we'd expect to see O occur once.

Probability distributions are incredibly important to understand statistics. If we know the type of distribution, we can make much stronger statements about what a given statistic means. We can also do all sorts of interesting things if we understand what the distributions look like.

For example, when people do work in computer networks, some of the key concerns are called bandwidth and utilization. One of the really interesting things about networks is that they're not generally as fast as we think they are. The bandwidth of a channel is the maximum amount of information that can be transmitted through that channel in a given time. The utilization is the percentage of bandwidth actually being used at a particular moment. The utilization virtually never reaches 100%. In fact, on most networks, it's much lower. The higher utilization gets, the harder it is for anyone else to use what's left.

For different network protocols, the peak utilization - the amount of the bandwidth that can be used before performance starts to drop off - varies. There's often a complex tradeoff. To give two extreme cases, you can build a protocol in which there is a "token" associated with the right to transmit on the wire, and the token is passed around the machines in the network. This guarantees that everyone will get a turn to transmit, and prevents more than one machine from trying to send at the same time. On the other hand, there's a family of protocols (including ethernet) called "Aloha" protocols, where you just transmit whenever you want to - but if someone else happens to be transmitting at the same time, both of your messages will be corrupted, and will need to be resent.

In the token-ring case, you're increasing the amount of time it takes to get a message onto the wire during low utilization, and you're eating up a slice of the bandwidth with all of those token-maintenance messages. But as capacity increases, in theory, you can scale smoothly up to the point where all of the bandwidth is being used by either the token maintenance or by real traffic. In the case of Aloha protocols, there's no delay, and no token maintenance overhead - but when the utilization goes up, so does the chance of two machines transmitting simultaneously. So in an Aloha network, you really can't effectively use all of the bandwidth, because as utilization goes up, the amount of bandwidth dedicated to actual messages decreases, because so much is being used by messages that had to be discarded due to collisions.

From that brief overview, you can see why it's important to be able to accurately assess what the effective bandwidth of a network is. To do that, you get into something called queueing theory, which uses probability distributions to determine how much of your bandwidth will likely be taken up by collisions/token maintenance under various utilization scenarios. Without the probability distribution, you can't do it; and if the probability distribution doesn't match the real pattern of usage, it will give you results that won't match reality.

There are a collection of basic distributions that we like to work with, and it's worth taking a few moments to run through and describe them.

  • Uniform: the simplest distribution. In a uniform distribution, there are N outcomes, o1,...,on, and the probability of any one of them, P(oi)=1/N. Rolling a fair die, picking a card from a well-shuffled deck, or picking a ticket in a raffle are all events with uniform probability distributions.
  • Bernoulli: the Bernoulli distribution covers a binary event - that is, and event that has exactly two possible outcomes, "1" and "0", and there's a fixed probability "p" of an outcome of "1" - so P(1)=p, P(0)=1-p.
  • Binomial: the binomial distribution describes the probability of a particular number of "1" outcomes in a limited series of independent trials of an event with a Bernoulli distribution. So, for example, a binomial probability distribution will say something like "Given 100 trials, there is a 1% probability of one success, a 3% probability of 2 successes, ...", etc. The binomial distribution is one of the sources of the perfect bell curve. As the number of data points increases, a histogram of the Bernoulli distribution becomes closer and closer to the perfect bell.
  • Zipf distribution. As a guy married to a computational linguist, I'm familiar with this one. It's a power law distribution: given an ordered set of outcomes, oi, ..., on, the probabilities work out so that P(o1)=2×P(o2); P(o2)=2×P(o3), and so on. The most example of this is if you take a huge volume of english text, you'll find that the most common word is "the"; randomly picking a work from english text, the probability of "the" is about 7%. The number 2 word is "of", which has a probability of about 3.5%. And so on. If you plot a Zipf distribution on a log-log scale, with the vertical axis descending powers of 10, and the horizontal axis is ordered ois, you'll get a descending line.
  • Poisson. The poisson distribution is the one used in computer networks, as I described above. The poisson distribution describes the probability of a given number of events happening in a particular length of time, given that (a) the events have a fixed average rate of occurence, and (b) the time to the next event is independent of the time since the previous event. It turns out that this is a pretty good model of network traffic: on a typical network, taken over a long period of time, each computer will generate traffic at a particular average rate. There's an enormous amount of variation - sometimes it'll go hours without sending anything, and sometimes it'll send a thousand messages in a second. But overall, it averages out to something regular. And there's no way of knowing what pattern it will follow: just because you've just seen 1,000 messages per second for 3 seconds, you can't predict that it will be 1/1000th of a second before the next message.
  • Zeta. The zeta distribution is basically the Zipf distribution over an infinite number of possible discrete outcomes.
Share on Facebook
Share on StumbleUpon
Share on Facebook
Find more posts in: Physical ScienceTechnology

Comments

1

A good start!

My only complaint is about an engineering detail. While you are writing about computer networking, you only cover one subtype, local area networks, and limited to link layer protocols.

In wide area networks the Poisson distribution begins to fall apart. One possible culprit is in the higher protocol layers, especially Transmission Control Protocol (TCP). Since it resends missing packets after timeout, there will be duplicate packets in flight, and as load approaches capacity, the distributions will have longer tails than Poisson predicts.
http://en.wikipedia.org/wiki/Long-tail_traffic

Real life measurements show that the Internet has a self-similar behaviour over wide time ranges, i.e. it is chaotic in nature. Therefore it can't be modelled accurately using any classical distribution.

Posted by: Lassi Hippeläinen | April 14, 2008 2:31 PM

2

Lassi:

Thanks! I wasn't actually aware of that. I haven't studied queueing theory since grad school, and on the networks of the time, I think that poisson still worked pretty well. But the class where I studied that came before the web or the mainstreaming of the internet.

Do you have any references I could look at about the self-similar nature of the internet? I'd like to read more about it.

Posted by: Mark C. Chu-Carroll | April 14, 2008 2:46 PM

3

Aloha based networks are not actually "transmit when you want to", they are "transmit when you want to, if nobody else is talking". This makes a big difference, since a collision is virtually guaranteed at high utilizations in the first case, but allows a much higher utilization in the second. In fact, the limit of utilization becomes dependent on the overall physical size of the network in the second case.

Posted by: Brian Utterback | April 14, 2008 4:00 PM

4

I second Lassi's commendation of this is as a good start. My only quibble would be that it seems a bit odd to list basic distributions -- even referring to a 'bell' curve as the limit of the binomial -- and yet decline to at least mention the normal, even in a "then there's this other thing that turns out to be rather important in ways we'll go into next time" sort of way. If it's because you're being picky about distributions as discrete, then you should say so -- especially given your less pedantic approach in the previous post...

Posted by: matt | April 14, 2008 5:44 PM

5

geometric (exponential decay, roughly, describing the distribution of how many games one needs to play to achieve the first "win")?

(rough) example: my team wins about .40 of its games, based on past performance. How many games will we play this year before we win our first league game?

Posted by: Wry Mouth | April 14, 2008 10:53 PM

6

"Do you have any references I could look at about the self-similar nature of the internet? I'd like to read more about it."

I tried to Google for something that is on-line (using "self-similar distribution internet"), but all relevant articles seem to be behind a paywall. Of course many readers of your blog have access to IEEE or ACM libraries, so they can read them.

Sorry for a terse answer, but I'll be in Dubai for the rest of the week, and the plane leaves pretty soon...

Posted by: Anonymous | April 15, 2008 12:43 AM

7

They just wrote an interesting post about visualizing higher dimensional distributions over at Overcoming Bias:
http://www.overcomingbias.com/2008/04/conf-space.html

Posted by: Benoit Essiambre | April 15, 2008 9:20 AM

8

Citeseer seems to have a number of papers. The one below seems pretty good though it doesn't discuss the "hurst" parameter which is an interesting way of gauging the level of self-similarity in traffic.

http://citeseer.ist.psu.edu/uhlig01understanding.html

Posted by: Anonymous | April 15, 2008 11:43 AM

9

what you quoted as zipf is actually exponential
p(x)~(1/C)^x
with C const

a power law is p(d)~d^(-C)
where C is a constant

for words i think it is C=3/2 or 5/2 but too lazy to google

Posted by: avishalom | April 15, 2008 2:40 PM

10

Nothing about the normal distribution?
Or the t-distribution?
Or the F-distribution?
or Chi-square?

I'd say those are the four I use most as a statistician.

And the Poisson, at least for regression problems, usually needs to be abandoned for a negative binomial distribution, since the conditional variance almost always increases with the conditional mean.

Posted by: Peter | April 15, 2008 9:40 PM

11
On the other hand, there's a family of protocols (including ethernet) called "Aloha" protocols, where you just transmit whenever you want to - but if someone else happens to be transmitting at the same time, both of your messages will be corrupted, and will need to be resent.

Minor nitpick: Ethernet actually uses CSMA/CD, which means that nodes wait until the line is free before transmitting. Collisions occur only when two nodes start transmitting at the same time.

Love your blog as always.

Posted by: secondclass | April 22, 2008 2:49 PM

12

secondclass:

Yeah, I know that my description was a bit oversimplified, but I was already feeling like I was giving too much detail; try sending and look for collision" is close enough to give people a sense of how it works, and I was debating whether or not to even include that much detail.

(As an interesting aside, when I worked as a sysadmin, we actually had some trouble with an ethernet that got too long. We'd run an ether as a coil through four floors of the EE building. Turned out that it got long enough that at Ethernet propagation speed, the machines at the far ends of the network could collide without detecting, because ether only looks for a collision when it starts sending. Took us a while to figure out that that was the problem, because we couldn't believe that the signal propagation could take long enough on a cable in one building to allow missed collisions! Once we finally figured it out, we split the building into two subnets with a repeater between them, and that took care of it.)

Posted by: Mark C. Chu-Carroll | April 22, 2008 3:00 PM

13

http://www.ams.org/notices/199808/paxson.pdf
is a very good introduction to the failure of poisson modeling due to the self-similar structure of network traffic

Posted by: Paddy | December 2, 2008 1:18 AM

ScienceBlogs

Search ScienceBlogs:

Go to:

Advertisement
Follow ScienceBlogs on Twitter

© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.