Cryptography 101

This week's NOVA Science NOW on PBS has an interesting piece on the Kryptos sculpture in front of CIA headquarters. The segment does a decent job of showing some of the basic techniques used such as substitution and transposition, in just a few minutes.

I am not a cryptographer but it is an area I have studied a little. It's a great topic to introduce to my first and second year programming students. Some of them really perk up when we start talking about it. Invariably, someone will ask if I can show them how to "crack" protected software. I always tell them that, although I have the knowledge, it would not be ethical. Some of them give me strange looks at this point.

I usually introduce a basic substitution scheme first, pretty much what the NOVA piece explains. I tend to approach this from the ASCII code angle rather than from the idea of arbitrary abstract symbols in order to get into some code fairly quick. So, we'll enter a string to be encoded and then add one to each character. Thus, an "a" turns into a "b", a "b" turns into a "c" and so on. For example, suppose we start with the string "cat". A simple +1 substitution would yield "dbu" (i.e., add one to the numeric value of each letter to get the replacement letter). We then move on to any arbitrary shift. This is all well and good but it is clearly limited. From here I like to introduce the concept of a look-up table to make arbitrary swaps instead of a simple offset. This also offers the opportunity to examine the efficiency of using a look-up table instead of the neophyte's "very, very long set of if-else clauses", both in terms of code size and execution speed. In the simple look-up table, to get the substitution for "c", you'd go to the third item in the table because "c" is the third letter in your alphabet. That table entry could be any letter in your alphabet. The only thing to remember is that the mapping must be distinct one to one. That is, two different characters cannot map to the same encoded symbol because when you went to decode it, you wouldn't know which of the two it came from. Thus, each source character will result in a single corresponding and distinct encoded character. The look-up table is the same size as the original alphabet, it's just that its ordering is jumbled.

One of the weaknesses of a simple substitution is that it won't hide symbol frequency; that is, the tendency of some symbols to be used more than others. Anyone who has ever played Scrabble knows that there's a good reason why "j" and "q" have higher point values than "e" or "a". I usually end the section showing how you can create a variable substitution to help hide this. For example, the position of the letter can be used as the index into a table of offsets. Note that the table doesn't include the replacement symbols, just the numeric value used to compute the new symbol. This idea is a little more challenging. Using the "cat" example, to encode the "c", we go to the third entry in our table. There, we do not find a replacement symbol, but rather, a number. We will use this number as an offset. So, if that number is 4, our "c" turns into "g" (four characters past "c"). Note that the table entries may be randomly placed; they do not have to follow some sort of pattern. When we decode, that "g" will be in the third position, so the table tells us the original offset was four. "g" minus four yields the original "c". This technique helps to hide the symbol frequency problem because you no longer have a one to one correspondence between original and encoded symbols. Rather, the encoding is a function of both the original character and its position in the string.

It should be noted that if all the numbers in this look-up table were the same, you'd have a simple substitution cipher as explained originally. I also like this technique as a way of showing the usefulness of modulo math. Without it, either your table would have to be as big as your message or you'd have to do some stupid code tricks to recompute an effective index into the table once you went beyond the table size. BTW, there is nothing that says the table must be the same size as your alphabet. It could be larger or smaller.

FYI, my first year students learn Python and in their second year they learn C (not traditional C, but something that is more geared toward embedded controllers).

More like this

I usually introduce a basic substitution scheme first, pretty much what the NOVA piece explains. I tend to approach this from the ASCII code angle rather than from the idea of arbitrary abstract symbols in order to get into some code fairly quick.

What's funny there is that ASCII is itself a substitution cipher mapping letters, numbers, etc. to hex and, ultimately, binary.

I've actually made my own stego program, burying data into the LSBs of graphics. It's fun to play around with stuff like that.

I used to teach Assembly and Computer architecture and I always gave a morals and "classyness" lecture after studying executable formats. You can write nice viruses and such when you understand what these files look like and what is done with them. I tried to give examples of people with class vs idiots and that the "community" of programmers who were professional would look down on you if you did stupid things like viruses. Also pointed out you didn't have to be really smart to do this stuff.