Back when I first started this blog on blogspot, one of the first things I did was write an introduction to information theory. It’s not super deep, but it was a decent overview – enough, I thought, to give people a good idea of what it’s about, and to help understand why the common citations of it are almost always misusing its terms to create bizzare conclusions, like some ["law of conservation of information",][conservation]

[conservation]: http://en.wikipedia.org/wiki/Law_of_conservation_of_information “wikipedia on Dembski’s law of CoI”

This is a slight revision of that introduction. For the most part, I’m just going to clean up the formatting, but once I’m going through it, I’ll probably end up doing some other polishing.

****

So what is information theory?

================================

Information theory is a relatively new field of mathematics that tries to characterize what information is in a quantifiable way. It’s an area of math that was almost totally obscure until not too long ago, and one which is frequently misunderstood, even by people who mean well.

A bit of history

—————–

Modern information theory comes from two very different sources: Shannon’s information theory, and Kolmogorov/Chaitin algorithmic information theory. Fortunately, they both converge, mostly, on the same place.

### Shannon’s Information Theory

The more famous branch of IT is Claude Shannon’s information theory, as presented in his infamous paper, [Communication in the Presence of Noise][shannon]. Shannon’s interest came from his work at Bell Labs for AT&T, which wanted a way of being able to figure out how much wire they needed to lay for the telephone network.

AT&T was interested because for a telephone company, laying wire in an incredibly expensive operation. The wire itself (especially back in the ways when it was bundles of copper) is quite expensive; the process of running and burying the cables is even more expensive. So ideally, as a phone company, you want to run the cables once; you want to lay enough that you’ll never have to come back and do it again; and you want to lay the smallest amount that will do the job, since the materials are so expensive.

The problem that they wanted Shannon to help solve was, how could they determine how much wire did they actually need to lay? It’s actually an astonishingly tricky question. The less they laid, the less it cost them in the short term. But they *knew* that the number of phone lines was going to increase dramatically – so if they were cheap, and laid too little wire, when the number of phones exceeded the capacity of what they had already laid, they’d have to go back and lay more. So they wanted to be able to make a good prediction of the minimum amount of wire they could lay that would meet their needs both at the time it was laid, and in the future.

But there’s a big trick to that. First: how much information can a single wire carry? And when you bundle a bunch of wires together, how much can you pump over a single wire without it interfering with signals on it’s neighbors in the bundle?

That’s what Shannon was trying to understand. He was trying to find a mathematical model that he could use to describe information transmission, including the effects of imperfections in transmission, including the introduction of noise and interference. He wanted to be able to quantify information in a meaningful way that would allow him to ask questions like: “At what point does noise in the line start to eliminate the information I want to transmit?”.

The answer is just fascinating: a given communication channel has a capacity for carrying information; and a given message has a quantity of information in it. Adding noise to the signal *adds* information to message *until* the capacity of the channel is reached, at which point information from the noise will start to *replace* information from the message; at that point, you can say that information from the message will start to be lost.

Shannon called the information content of a signal it’s *entropy*, because he saw a similarity between his information entropy and thermodynamic entropy: in a communicating system, entropy *never* decreases: it increases until the capacity of the channel is reached, and then it stays content. You can’t reduce the amount of information in a message. (There are various rumours that in fact this choice of terminology was actually suggested by von Neumann himself.)

That naming led directly to the most common misuse of information theory. But that’s a post for another day.

### Algorithmic Information Theory

The other main branch of information theory came from a very different direction. The two pioneers were Andrey Kolmogorov, and Greg Chaitin. Kolmogorov and Chaitin didn’t know each other at the time they independently invented the same mathematical formalization (in fact, Chaitin was a teenager at the time!), a fact which led to some friction.

In the interests of honesty, I’ll say that Greg Chaitin works in the same building that I do, and I’ve met him a number of times, and been greatly impressed by him, so my description is heavily influenced by Greg.

Anyway – algorithmic information theory grew out of some of the fundamental mathematics being done at the beginning of the 20th century. There was an effort to create a single, fundamental, set of mathematical axioms and inference rules, which was both complete (every true statement was provably true), and consistent (every well-formed statement was either true or false). This effort failed, miserably; the nail in the coffin was Gödel’s incompleteness theory. Gödel presented an extremely complicated proof that showed, essentially, that no formal system could be both complete and consistent. (See my recent article about [Extreme Math][extreme] for more about this.)

[extreme]: http://scienceblogs.com/goodmath/2006/06/extreme_math_1_1_2.php “1+1=2″

Most people saw Gödel’s proof, did the equivalent of saying “Oh, shit!”, and then pretended that they hadn’t seen it. It does mean that there are fundamental limits to what you can do using mathematical formalization, but for the most part, they don’t affect you in normal day to day math. (Sort of the way that car designers didn’t change the way they build cars because of relativity. Yes, it showed that the fundamental physical model that they were using was wrong – but at the speeds that cars move, that wrongness is so small that it just doesn’t matter.)

But for people in the early stages of what would become computer science, this was a big deal. One of the early hopes was that the mathematical system would produce a “decision procedure”, a mechanical process by which a computing device could check the truth or falsehood of any mathematical statement. Gödel showed that it couldn’t be done.

But the early computer scientists – in particular, Alan Turing – embraced it. It led directly to two of the fundamental rules of computer science:

1. The Church-Turing Thesis: all mechanical computing systems are basically the same: there is a fundamental limit to what is computable, and any “acceptable” system can compute anything up to that limit – so if you can find a way of doing it on any computing device, then you can, ultimately, do it on every acceptable computing device.

2. The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop.

The halting problem turns out to say *exactly* the same thing as the Gödel incompleteness theorem. Only you can write a proof of it that a freshman college student can understand in ten minutes, instead of a proof that the average math grad student will need a semester to plow through.

Chaitin and Kolmogorov saw this as a big deal: using an algorithmic approach to how to process information, you could prove something very simply, which would require a much more complicated proof using other methods.

K-C information theory grew out of this. According to Kolmogorov and Chaitin, the fundamental amount of information contained in a string (called the string’s information entropy after Shannon), is the shortest string of program + data for a computing device that can generate that string.

(If you’re interested in Gödel’s incompleteness theorem, then Hofstadter’s

[Godel, Escher, Bach: an Eternal Golden Braid][geb] is a really fun introduction. For K-C theory, Greg Chaitin has written a bunch of really [great][chaitin1] [books][chaitin2] for mathematical laymen.

[geb]: http://www.amazon.com/gp/search/ref=br_ss_hs/002-3998406-1014458?search-alias=aps&keywords=godel%20escher%20bach “GEB amazon link”

[chaitin1]: http://www.amazon.com/gp/product/9814021725/sr=8-4/qid=1141849782/ref=pd_bbs_4/002-3998406-1014458?%5Fencoding=UTF8

[chaitin2]: http://www.amazon.com/gp/product/1852336684/sr=8-3/qid=1141849782/ref=pd_bbs_3/002-3998406-1014458?%5Fencoding=UTF8″

Information and Entropy

————————-

Now that I’ve got a bit of background and origins done, it’s time to get to some of the fun stuff.

As I said yesterday, both of the big branches of information theory have a concept of *entropy*. While the exact presentations differ, because they’re built on somewhat different theoretical foundations, the basic meaning of entropy in both branches is pretty much the same. I’m going to stick with the terminology of the Kolmogorov-Chaitin branch, but the basic idea is the same in either.

In K-C theory, we talk about the information content of *strings*. A string is an ordered sequence of characters from a fixed alphabet. It doesn’t really matter what alphabet you choose; the point of this kind of theory isn’t to come up with a specific number describing the complexity of a string, but to be able to talk about what information means in a formal algorithmic sense.

The K-C definition of the entropy of a string is the total quantity of information encoded in that string.

Here’s where it gets fun.

Another way of saying exactly the same thing is that the entropy of a string is a measure of the *randomness* of the string.

Structured strings are structured *precisely* because they contain *redundancy* – and redundancy does not add information.

Which string has more information: “XXXXYYYYY” or “4X5Y”? They have the same amount. One is a lot smaller than the other. But they contain the same amount of information.

Here’s a third way of saying exactly the same thing: the entropy of a string is the length of the smallest compressed string that can be decompressed into the original.

Here’s a better example. Go back and look at the first two paragraphs of yesterday’s post. It took 514 characters.

Here’s the same information, compressed (using gzip) and then made

readable using a unix utility called uuencode:

M’XL(“)E8$$0“VIU;FL`19$Q=JPP#$7[6≤7KIN',`M+]≤HHLPH“)\8BMOPY

M[#Y/GCDG#

MLY2EI8$9H5GLX=*R(_+ZP/,-5-1#\HRJNT`77@LL,MZD,”H7LSUKDW3@$#V2

MH(KTO$Z^$%CN1Z>3L*J^?6ZW?^Y2+10;\+SOO’OC”/;7T5QA%987SY02I3I$

MLKW”W,VZ-(J$E”[$;'2KYI^\-_L./3BW.+WF3XE8)≤@D8X^U59DQ7IA*F+X/

MM?I!RJ*%FE%])Z+EXE+LSN*,P$YNX5/P,OCVG;IK=5_K&CK6J7%’+5M,R&J]

M95*W6O5EI%G^K)8B/XV#L=:5_`=5ELP#Y#\UJ??>[[DI=J''*2D];K_F230″

$`@(““`

`

That’s only 465 characters; and if I didn’t have to encode it to stop

it from crashing your web-browser, it would only have been 319

characters. Those characters are a lot more random than the original –

so they have a higher information density. The closer we get

to the shortest possible compressed length, the higher the information

density gets.

Actually, I’m lying slightly; there’s an additional factor that needs to be considered in K-C theory.

Remember that K-C comes from the foundational mathematics of computer science, something called recursive function theory. We’re talking about strings being processed algorithmically by a program. So really, we get to talk about programs that generate strings. So it’s not just strings: you have a program and a data string. Really, you measure information content of a string as the total size of the smallest pairing of program and data that generates that string. But for most purposes, that doesn’t make that much difference: the combination of the program and the data can be encoded into a string.

In the two examples above, the program to follow is implied by the contents of the string; if we wanted to really model it precisely, we’d need to include the programs:

* For the string “XXXXYYYYY”, the program is roughly:

while there are more characters in the data:

read a character

output the character that you read

end

* For the string “4X5Y”, the program is roughly:

while there are more characters in the data:

read a number n from the data

read a character c from the data

output c n times

end

Now, specifics aside: the definitions I gave earlier are pretty much correct in both K-C and Shannon information theory: information content is proportional to randomness is proportional to minimum compressed length.

So – what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?

The answer is: *gaining* information. Introducing randomness into the string is *adding* information.

“AAAABBBB” contains less information than “AAAABxBB”.

The way this is used to refute bozos who claim things like “You can’t create information” should be obvious.