Yesterday we looked at hash functions. As you recall, they’re functions which take an input and generate a random-seeming output. As a quick example, here’s the output of the SHA-256 hash function for the name of the Scottish physicist James Maxwell and a misspelling thereof:

SHA256(“James Clerk Maxwell”) = 2667629603913530690117759428994407894024237387971995154086108064226397\

5353322

SHA256(“James Clark Maxwell”) = 9129664885155451589341762461551711693832872424126676652783015499131718\

4589063

A tiny change in the input generates a wildly different output, so tt looks like SHA256 is a pretty good hash function. For every input, it dumps out some 256-bit number that looks entirely random. For cryptographic purposes it’s not enough that the digits look random, they need to satisfy three specific properties, which we’ll go through one at a time.

**1. Preimage resistance.**

If I give you a *hash value*, you should not be able to find a message whose hash is that value. In other words, if I say SHA(x) = 1402163220222678497648226475128810495847235325536749812516677580084870\

9608774, you should not be able to come up with some x that works. Of course you could always just start hashing random strings and odds are after about 2^256 of them you’d hit a string that hashes to that value just by chance. But 2^256 is a gigantic number and in practice you’ll never be able to do it.

Why care about preimage resistance? With digital signature algorithms, it is possible to make mathematical versions of statements like “The owner of this cryptographic key asserts that the message with hash [some value] did in fact originate with them.” If you can generate a distinct message with the same hash, this authentication can be compromised.

**2. Second-preimage resistance.**

If I give you a *message*, and you compute its hash, you should not able to generate a different message with the same hash. This is a slightly more difficult test for a hash algorithm to pass. Here the attacker effectively has two pieces of information – the original message, and its hash. If a hash algorithm is weak, the attacker might be able to tweak the original message in such a way that it still has the same hash value. You’d hate for an attacker to be able to take “Operation Overlord to commence at midnight” and generate a message like “Operation Overlord to be cancelled” and cause the replacement to have the same hash by judicious arrangements of wording or typos.

This might seem somewhat academic. If the attacker has access to the original message, aren’t you already in deep trouble? Not always. Sometimes the message is meant to be public, and the sender is using a digital signature algorithm to sign the hash and thus verify the authenticity of the message. Your online banking website, for instance, has a public cryptographic key whose validity is checked by some certificate authority, and the certificate authority uses the hash of that public key to validate its authenticity to your browser. If it were possible to generate a fake key that hashed to the right value, this authentication would be compromised.

**3. Collision resistance.**

This is the hard one. You should not be able to find *any* two messages with the same hash. You don’t care what the messages are, and you don’t care what the hashes are, you just care that you can find two messages that hash to the same value.

Unfortunately this is much easier in general. If I want to find someone who shares my birthday, April 8, odds are I’ll have to ask about 365 people before I find a match. But if I just start asking people their birthdays and I’m willing to settle for any match between any two people, I only have to ask about 26 people before I have an even chance of finding a match. In general the number of samples I need to find a birthday match – or a hash collision – is proportional to the square root of the number of different possible birthdays – or hashes. So if I have a 128-bit hash, there’s some 10^38 possible hashes, but I only have to hash some 2^64 = 10^19 strings before I find a collision. And while that’s a big number, it’s not inconceivable that a collision could be generated by brute force checking of 10^19 hashes. But it would be tough.

If I need more security than that, I can just use a bigger hash. If I use a 256-bit hash then I’d still have to check some 2^128 = 10^38 hashes before a collision. And that’s a ridiculously huge number even for computers. Or I could use a 512-bit hash and I’d have an inconceivably intractable 10^77 hashes to calculate before I’m likely to find a collision.

That is, if my hash function is collision-resistant. One of the most common cryptographic hash functions is called MD5, and cryptographers have figured out a way to easily generate collisions with that function. Even though it’s a 128-bit hash, it only takes a home computer a few seconds to generate a collision with the right algorithms.

While the ability to generate an a collision with possibly random-looking messages might not seem so bad, in practice a sufficiently clever attacker can use collisions to compromise security. For instance, the authors of the Flame malware that attacked Iran’s computer systems were able to use an MD5 collision to generate a fake “This software came from Microsoft and is trustworthy” certificate.

With the widely used MD5 hash comprehensively broken, its replacement SHA-1 showing serious theoretical weaknesses, and the newer SHA-2 possibly vulnerable to similar attacks, NIST decided to put out a call for proposals for a new hash function whose design takes into account the great advances in cryptography over the last decade or two. Tomorrow we’ll talk a about the just-announced winner of NIST’s competition.

[The SHA-2 hash comes in several variants with different bit sizes. The SHA-256 hash at the beginning of this post is in the SHA-2 family of hashes. This awkward naming convention has been the subject of a considerable dust-up in the SHA-3 mailing list, as interested parties debate various alternatives to long designations like SHA-3-256.]