# Hash Week! (Part 1)

Last week NIST anounced the winner of its Cryptographic Hash Function Competition. After five years of review and many rounds of discussion and elimination, the winner is a hash function called Keccak, and its developers deserve many congratulations. It’s a shame hash functions aren’t better known in the general public, because not only are they a vital part of keeping data safe online, they’re one of the most interesting bits of applied math. Better still, their basic concept is not complicated at all.

Hash functions are so cool, in fact, that I want to spend several posts this week discussing them. Today we’ll just talk about what they are, and why they might be useful.

A hash function is just a mathematical function like any other. You put in a number, and the function spits out another number. In high school you dealt with functions like $f(x) = x^2$. Put in 2, it spits out 4. Put in 3, it spits out 9. A hash function is less predictable. It’s designed to be less predictable. In fact, a working definition of “hash function” is a function whose output is a number with no obvious relation to the input.

Computers internally represent text as numbers, so the first letter of my name – “M” – could be represented by its numeric ASCII code – 77 – and fed into a function. Here we won’t worry about the details of the text-to-number conversion. We’ll just say that it’s easy for a computer do so we’ll just write the input of the function as text (It could also be an image, movie, or any other kind of digital data.). One hash function is called CRC-32, and if we denote it c(x) some examples of its output are:

c(“Matt Springer”) = 2690866847

c(“ScienceBlogs”) = 385760650

c(“Science Blogs”) = 531647807

Even very similar inputs can result in very different outputs, as expected. It’s also important to notice that CRC-32 is a 32 bit hash, which just means every output is a number between 0 and 2^32 (which is 4,294,967,296). Of course there’s more than four billion possible pieces of data in the world, and therefore some identical inputs which will correspond to the same outputs. While two random inputs will only have a 1 in 2^32 chance of producing identical hashes, the birthday paradox means that if you have a bunch of random inputs, odds are you only need on average 2^(32/2) = 65,536 inputs before you end up with a duplicate hash. That’s not so many. There are even single words in the dictionary that have the same CRC-32 hash. Both “plumless” and “buckeroo” hash to 1306201125, for instance.

If you really really need to avoid duplicates, there no alternative but to use a hash function with a longer output. 128, 256, and 512 bits are common sizes. With a 256 bit hash, there’s 2^256 possible outputs (more than 10^77) and odds are you’d need about 2^(256/2) random messages (about 10^38) before you end up with a duplicate hash. Thee are astronomical numbers, and unless something is mathematically wrong with the particular hash function in question we will never live to see see two different inputs produce the same 256-bit hash. One 256 bit hash is called SHA-512, and my name hashed using that function is

SHA-256(“Matt Springer”) = 20005913487026535327234686971684230532576326699423435238922925367385304409606

Which is a gigantic number, and vanishingly unlikely to recur by chance for a different input.