So here's an interesting problem I ran into today. You have metadata in an XML file. You want to make the file self-describing and self-correcting, so you want to embed its checksum inside it. The problem is that you can't add the checksum to the XML file without changing the file's checksum!
Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.
A sufficiently clever CRC implementation could do the job.
You don't need the last 32 bits; *any* 32 linearly independent bits can be used to make a valid CRC. So just say 0x00000000, and then solve for the values that make the CRC work.
That would be easier if each hex character gave you 4 independent bits to fiddle with, rather than the jump between '9' and 'A', but it might still be solvable.
It's just a system of linear equations to be solved so the overall file checksum comes out to the desired value.
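To make that concrete, here's a sketch in Python of the solve-for-the-bits approach. The document layout, slot, and function names here are made up for illustration. It reserves a 32-character run of '0's as the checksum slot, measures the CRC-32 delta caused by flipping the low bit of each slot character (which toggles '0' to '1', keeping the text printable), and then solves the resulting GF(2) linear system so the whole file's CRC comes out to a chosen value:

```python
import zlib

def solve_crc_slot(data: bytes, bit_positions, target: int) -> bytes:
    """Flip a subset of the candidate bits so zlib.crc32(result) == target.

    CRC-32 is affine over GF(2), so flipping a given bit always XORs the
    same 32-bit delta into the final CRC, independent of every other bit.
    Collect one delta per candidate bit, then Gaussian-eliminate to express
    the needed correction as an XOR of some of those deltas.
    """
    base = zlib.crc32(data)
    need = base ^ target                        # correction we must produce
    deltas = []
    for pos in bit_positions:                   # effect of flipping each bit
        flipped = bytearray(data)
        flipped[pos // 8] ^= 1 << (pos % 8)
        deltas.append(zlib.crc32(bytes(flipped)) ^ base)

    basis = {}                                  # pivot bit -> (vector, mask)
    for i, d in enumerate(deltas):
        mask = 1 << i
        while d:
            pivot = d.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (d, mask)
                break
            bd, bmask = basis[pivot]
            d, mask = d ^ bd, mask ^ bmask

    mask = 0                                    # which candidate bits to flip
    while need:
        pivot = need.bit_length() - 1
        if pivot not in basis:
            raise ValueError("candidate bits are not linearly independent")
        bd, bmask = basis[pivot]
        need, mask = need ^ bd, mask ^ bmask

    out = bytearray(data)
    for i, pos in enumerate(bit_positions):
        if mask >> i & 1:
            out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)

# Hypothetical layout: a 32-character all-'0' slot inside the document.
doc = b'<meta crc="' + b'0' * 32 + b'">important data</meta>'
slot = doc.index(b'0' * 32)
bits = [(slot + i) * 8 for i in range(32)]      # low bit of each slot byte

patched = solve_crc_slot(doc, bits, 0xDEADBEEF)
assert zlib.crc32(patched) == 0xDEADBEEF
```

Whether the 32 deltas really are independent depends on the polynomial and where the slot sits; for the standard CRC-32 polynomial and the low bits of 32 consecutive bytes the system works out, so the ValueError branch doesn't fire here.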
Put the checksum in the name of the file?
FITS files (an astronomy image/data format) can contain their own checksum. Check out http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/checksum23may02/
It all depends on exactly how strong the checks need to be and what they're for. If you really want self-correcting, then a mere crc32, or even a 2048-bit RSA signature, is not going to do the trick. I forget the exact relationship, but you will need roughly 3 or 4 check bits paralleling every single character in the file to get enough redundancy to correct single-bit errors, plus longitudinal crc32 protection. OTOH, if you only need to verify that the document has not been changed, then just a crc32 could fit the bill. Maybe a crc16 would too, but a simple additive checksum will not - not enough redundancy, too simple a computation.
In the crc world, the typical algorithm is to run the original document, without the final CRC field, through the crc generator and then append the result. Checking means running the whole thing, including the appended CRC, through the same engine and expecting a particular constant at the end. I would have to do the research to determine the numbers, but this was the method for the IBM standard used in SNA, SDLC, HDLC, and X.25 - crc16(16,15,2,1) (the parens indicate the XOR feedback taps in the CRC generation shift register). 16-bit crc was generally considered OK only up to a few KB of message. Today's documents would probably need at least crc32. Encoding is an issue - the crc value needs to be in raw binary, not hex or ASCII encoding.
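The append-and-check scheme is easy to demonstrate. Here's a sketch using the reflected form of that same IBM polynomial (the CRC-16/ARC variant: poly 0x8005, which is 0xA001 bit-reversed): append the CRC in raw binary, low byte first, and re-running the generator over message-plus-CRC comes out to zero.

```python
def crc16_arc(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16 with the IBM polynomial x^16 + x^15 + x^2 + 1,
    reflected form, so the feedback constant is 0xA001."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

message = b'Some document worth protecting'
crc = crc16_arc(message)

# Append the CRC in raw binary, low byte first, to match the reflected engine.
framed = message + crc.to_bytes(2, 'little')

# Running the whole frame through the same engine must yield zero.
assert crc16_arc(framed) == 0
```

The "particular number at the end" for this variant is simply 0; other CRC variants with a final XOR expect a different fixed residue.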
Today the preferred method is to use RSA signing, typically via gpg (or pgp), for change detection and origin verification. 2048 bits vs 32 makes for a pretty reliable test. Unlike crc engines, which are low-level data comms tools and usually implemented in IO hardware, gpg signatures are available as command-line tools, which also makes them scriptable.
gpg --sign document
gpg --verify document.gpg
A checksum won't help in correcting errors. For that you need a dedicated error-correcting code. The simplest is Hamming. The problem is that the application that uses the file must be aware of the code.
If you only need to detect errors, you still need to define if the threat is accidental or intentional modifications. CRC can detect accidental errors, but it is easy to circumvent. To protect against forgeries you need a cryptographic hash, e.g. SHA.
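As a tiny sketch of the distinction (document contents made up for illustration): a CRC can be made to match by construction, as shown earlier in the thread, while any tamper moves a cryptographic digest and no practical forgery can put it back.

```python
import hashlib

doc = b'<payment amount="100" payee="alice"/>'
digest = hashlib.sha256(doc).hexdigest()   # published alongside the file

# Verification: recompute and compare.
assert hashlib.sha256(doc).hexdigest() == digest

# Any modification, however small, changes the digest.
tampered = doc.replace(b'alice', b'mallory')
assert hashlib.sha256(tampered).hexdigest() != digest
```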
You need to embed both the checksum and the negation of the checksum in the data. These two then sum to zero, and consequently embedding them does not affect the (additive) checksum of the file!
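That cancellation trick really does work for additive checksums; it's essentially what the FITS scheme linked above relies on, though FITS proper uses ones'-complement arithmetic and an ASCII-encoded field. Here's a sketch with a plain mod-2^32 word sum and a made-up two-slot layout:

```python
import struct

MASK = 0xFFFFFFFF

def sum32(data: bytes) -> int:
    """Sum of big-endian 32-bit words, mod 2**32 (zero-padded to a word)."""
    data += b'\0' * (-len(data) % 4)
    return sum(struct.unpack('>%dI' % (len(data) // 4), data)) & MASK

payload = b'<metadata>whatever</metadata>'   # hypothetical file body
body = b'\0' * 8 + payload                   # two zeroed, word-aligned 4-byte slots
checksum = sum32(body)                       # checksum with the slots zeroed

# Slot 1: the checksum.  Slot 2: its negation.  They cancel mod 2**32.
stamped = struct.pack('>II', checksum, -checksum & MASK) + payload

# So the whole-file sum equals the embedded checksum.
assert sum32(stamped) == checksum
```

Note the slots must be word-aligned for the cancellation to survive the word-wise sum; this trick does not carry over to CRCs, where you'd solve the linear system described earlier instead.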
Thanks for the comments and suggestions! Just to clarify, I'm aware that a checksum without code to actually check it is pointless -- what I'm after here is a way to embed a file's digital fingerprint inside the file, so that anyone who looks at the file can do a basic integrity check.
I'm definitely taking a look at FITS. Thanks for the tip!