Heisenberg's Uncertainty Checksum

So here's an interesting problem I ran into today. You have metadata in an XML file. You want to make the file self-describingly self-correcting, so you want to embed its checksum inside it. The problem is, you can't add the checksum to the XML file without changing the file's checksum!

Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.

A sufficiently clever CRC implementation could do the job.

You don't need the last 32 bits; *any* 32 linearly independent bits can be used to make a valid CRC. So just say 0x00000000, and then solve for the values that make the CRC work.

That would be easier if each hex character gave you 4 independent bits to fiddle with, rather than the jump between '9' and 'A', but it might still be solvable.

It's just a system of linear equations to be solved so the overall file checksum comes out to the desired value.
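Here is a rough Python sketch of that linear-algebra idea, purely for illustration: the file layout (a crc32 attribute plus a 32-character '0'/'1' pad whose low bits are the "linearly independent bits") is made up, and it relies on the fact that flipping any one bit XORs a fixed delta into zlib's crc32. If the chosen bit positions turn out not to be independent, the solver simply reports it.

    import zlib

    def crc_bit_deltas(data, positions):
        # CRC32 is affine over GF(2): flipping one bit always XORs the same
        # delta into the checksum, regardless of the rest of the message.
        base = zlib.crc32(data)
        deltas = []
        for byte_idx, bit_idx in positions:
            flipped = bytearray(data)
            flipped[byte_idx] ^= 1 << bit_idx
            deltas.append(zlib.crc32(bytes(flipped)) ^ base)
        return base, deltas

    def solve_gf2(deltas, target):
        # Gaussian elimination over GF(2): find a subset of deltas XORing to target.
        rows = [(d, 1 << i) for i, d in enumerate(deltas)]
        chosen = 0
        for bit in reversed(range(32)):
            idx = next((i for i, (v, _) in enumerate(rows) if (v >> bit) & 1), None)
            if idx is None:
                continue
            pv, pm = rows.pop(idx)
            if (target >> bit) & 1:
                target ^= pv
                chosen ^= pm
            rows = [(v ^ pv, m ^ pm) if (v >> bit) & 1 else (v, m) for v, m in rows]
        if target:
            raise ValueError("chosen bit positions are not linearly independent")
        return chosen  # bit i set => flip the i-th fiddle bit

    # Hypothetical layout: a stated CRC plus a 32-character pad we may toggle to '1'.
    xml = b'<doc crc32="deadbeef" pad="00000000000000000000000000000000">...</doc>'
    target = 0xDEADBEEF
    pad = xml.index(b'pad="') + 5
    positions = [(pad + i, 0) for i in range(32)]   # low bit toggles '0' <-> '1'

    base, deltas = crc_bit_deltas(xml, positions)
    mask = solve_gf2(deltas, base ^ target)
    patched = bytearray(xml)
    for i in range(32):
        if (mask >> i) & 1:
            patched[pad + i] ^= 1
    assert zlib.crc32(bytes(patched)) == target     # file now matches its stated CRC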

It all depends on exactly how strong the checks need to be and what they are for. If you really want self-correcting, then a mere crc32, or even a 2048-bit RSA signature, is not going to do the trick. I forget the exact relationship, but you will need about 3 or 4 bits paralleling every single character in the file to get enough redundancy to correct single-bit errors, plus longitudinal crc32 protection. OTOH, if you only need to verify that the document has not been changed, then just a crc32 could fit the bill. Maybe a crc16 will too, but a plain checksum will not - not enough redundancy, too simple a computation.

In the crc world, the typical algorithm is to run the original document, without the final CRC representation, through the crc generator and then append the result. Checking it means running the whole thing, including the result, through the same engine and expecting a particular number at the end. I would have to do the research to confirm the numbers, but this was the method for the IBM standard used in SNA, SDLC, HDLC, and X.25 - crc16(16,15,2,1) (the parens indicate the XOR feedback taps in the CRC generation shift register). A 16-bit crc was generally considered OK only up to a few KB of message size. Today's documents would probably need at least crc32. Encoding is an issue - the crc value needs to be appended in raw binary, not hex or ASCII encoding.
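For illustration, here is a small Python sketch of that append-then-check pattern using zlib's crc32. The residue is computed from a reference message rather than hard-coded, since its exact value depends on the byte-order and final-XOR conventions of the particular CRC.

    import struct, zlib

    def append_crc(data):
        # run the document through the generator, then append the result
        # (raw binary, little-endian -- not hex or ASCII)
        return data + struct.pack("<I", zlib.crc32(data))

    # a valid frame, CRC included, always grinds down to the same residue;
    # derive it once from an arbitrary reference message
    RESIDUE = zlib.crc32(append_crc(b"reference"))

    def verify(framed):
        # run the whole thing, including the appended CRC, through the same
        # engine and expect that particular number at the end
        return zlib.crc32(framed) == RESIDUE

    framed = append_crc(b"<doc>some metadata</doc>")
    assert verify(framed)
    corrupted = framed[:-1] + bytes([framed[-1] ^ 0x01])   # flip one bit
    assert not verify(corrupted)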

Today the preferred method is to use RSA signing, typically via gpg (or pgp), for change detection and origin verification. 2048 bits vs 32 makes for a pretty reliable test. Unlike crc engines, which are low-level data-comms tools usually implemented in IO hardware, gpg signatures are available as command-line tools, which also makes them scriptable.

to generate:

gpg --sign document

to verify:

gpg --verify document.gpg

By Gray Gaffer (not verified) on 10 Nov 2009 #permalink

A checksum won't help in correcting errors. For that you need a dedicated error-correcting code. The simplest is Hamming. The problem is that the application that uses the file must be aware of the code.
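As a concrete illustration (my sketch, not from the comment), here is the classic Hamming(7,4) code in Python: four data bits get three parity bits, and the decoder can locate and repair any single flipped bit.

    def hamming74_encode(nibble):
        # data bits d1..d4 (nibble bit 0 = d1)
        d = [(nibble >> i) & 1 for i in range(4)]
        p1 = d[0] ^ d[1] ^ d[3]          # parity over codeword positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]          # parity over positions 2,3,6,7
        p3 = d[1] ^ d[2] ^ d[3]          # parity over positions 4,5,6,7
        bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]    # codeword positions 1..7
        return sum(b << i for i, b in enumerate(bits))

    def hamming74_decode(code):
        bits = [(code >> i) & 1 for i in range(7)]
        s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
        s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
        s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
        syndrome = s1 | (s2 << 1) | (s3 << 2)   # 1-based position of the bad bit, 0 if clean
        if syndrome:
            bits[syndrome - 1] ^= 1             # repair the single-bit error
        return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

    word = hamming74_encode(0b1010)
    damaged = word ^ (1 << 4)                   # flip one bit in transit
    assert hamming74_decode(damaged) == 0b1010  # original data recovered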

If you only need to detect errors, you still need to decide whether the threat is accidental corruption or intentional modification. CRC can detect accidental errors, but it is easy to circumvent deliberately. To protect against forgeries you need a cryptographic hash, e.g. SHA-256 -- and since a forger who can edit the file can usually recompute a bare hash too, in practice that means a keyed MAC or a signature.
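A quick Python sketch of that distinction; the file name and key are placeholders.

    import hashlib, hmac

    data = open("document.xml", "rb").read()     # placeholder file name

    # accidental corruption: anyone can recompute the digest and compare
    digest = hashlib.sha256(data).hexdigest()

    # deliberate tampering: bind the digest to a secret key, so a forger who can
    # rewrite the file cannot also produce a matching tag (or use a gpg
    # signature, as suggested above)
    tag = hmac.new(b"placeholder-secret-key", data, hashlib.sha256).hexdigest()

    print(digest, tag)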

By Lassi Hippeläinen (not verified) on 10 Nov 2009 #permalink

For a plain additive checksum you can embed both the checksum and minus the checksum in the data. These two then sum to zero, so the value you embed does not affect the file's checksum!
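Here is a rough Python sketch of that trick; the 32-bit word-sum checksum and the document body are my own placeholders.

    import struct

    def add32(data):
        # toy checksum: sum of 32-bit little-endian words, modulo 2**32
        assert len(data) % 4 == 0, "keep the file word-aligned for this sketch"
        words = struct.unpack("<%dI" % (len(data) // 4), data)
        return sum(words) & 0xFFFFFFFF

    body = b"<doc>metadata goes here</doc>\n\n\n"    # padded to a multiple of 4 bytes
    c = add32(body)
    neg = (-c) & 0xFFFFFFFF

    # append the checksum and its additive inverse: together they add zero,
    # so the file's checksum still equals the value embedded in it
    stamped = body + struct.pack("<II", c, neg)
    assert add32(stamped) == c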

Thanks for the comments and suggestions! Just to clarify, I'm aware that a checksum without code to actually check it is pointless -- what I'm after here is a way to embed a file's digital fingerprint inside the file, so that anyone who looks at the file can do a basic integrity check.

I'm definitely taking a look at FITS. Thanks for the tip!