Like a lot of other bloggers, I often get annoying email from people. This week, I've been dealing with a particularly annoying jerk, who's been bothering me for multiple reasons. First, he wants me to "lay off" the Christians (because if I don't, God's gonna get me). Second, he wants to convince me to become a Christian. And third, he wants to sell me on his brilliant new compression scheme.
See, aside from the religious stuff, he's a technical visionary. He's invented a method where he can take a source document, and repeatedly compress it, making it smaller each time.
This is a stupid idea that I've seen entirely too many times. But instead of just
making fun of it, I thought it would be interesting to explain in detail why it doesn't work. It touches on a bunch of basic facts about how data compression works, and provides a nice excuse for me to write a bit about compression.
The basic idea of data compression is that you're eliminating redundancies in the original
text. You can't discard information. Mathematically, a compression function is an
invertible function C from an array of characters to an array of characters (or you
could use bits if you prefer), such that if y=C(x), then on the average input, the
length of y is smaller than the length of x.
An ideal compression system is one
where for all possible values of x, C(x) is shorter than x. C is your compressor; and since C
is reversible, it has a unique inverse function C⁻¹ such that C⁻¹(C(x)) = x. An illustration of this basic compression system is in the diagram to the side.
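To make the C / C⁻¹ contract concrete, here's a minimal sketch in Python. Run-length encoding is just a stand-in compressor chosen for brevity (the post isn't about any particular algorithm at this point), and the sketch assumes the input contains no digit characters.

```python
# A minimal sketch of an invertible compression function C and its inverse C⁻¹,
# using run-length encoding as a stand-in. Assumes the input has no digits.

def compress(text: str) -> str:
    """C: collapse runs of a repeated character into (count, char) pairs."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)

def decompress(encoded: str) -> str:
    """C⁻¹: expand (count, char) pairs back into the original string."""
    out = []
    i = 0
    while i < len(encoded):
        j = i
        while encoded[j].isdigit():
            j += 1
        out.append(encoded[j] * int(encoded[i:j]))
        i = j + 1
    return "".join(out)

x = "aaaabbbccccccd"
y = compress(x)            # '4a3b6c1d' -- shorter than x for this input
assert decompress(y) == x  # invertibility: C⁻¹(C(x)) == x
```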
An ideal compression system can't possibly exist, for a very simple reason. In an
ideal compression system, the compression function is mapping from a larger set to a smaller set. That means that for multiple inputs, it must produce
the same output. The better the ideal compressor - i.e., the smaller it manages
to make its compressed output - the more inputs must be compressed to the
same output. But if multiple inputs compress to the same output, then you
don't have an invertible compression function: you can't decompress your
text.
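The counting behind that argument is easy to check. Here's a quick sketch (my own illustration, using bit strings rather than character arrays):

```python
# The pigeonhole count: there are 2^n bit strings of length exactly n, but only
# 2^n - 1 bit strings of any length strictly less than n (1 + 2 + 4 + ... + 2^(n-1),
# counting the empty string). So no invertible compressor can map every length-n
# input to a shorter output -- at least two inputs would have to collide.

def strings_of_length(n: int) -> int:
    return 2 ** n

def strings_shorter_than(n: int) -> int:
    return sum(2 ** k for k in range(n))  # lengths 0 .. n-1

for n in range(1, 9):
    assert strings_shorter_than(n) == strings_of_length(n) - 1
    print(f"n={n}: {strings_of_length(n)} inputs, "
          f"only {strings_shorter_than(n)} possible shorter outputs")
```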
Repeated compression is really trivial mathematically. The basic idea of it is that you've
got your original compression function, C(x). For some set of inputs, C(x) is smaller than x.
You can't just re-apply C to its own output; if C is worth anything as a compression function,
it's eliminated all of the redundancy that its algorithm is capable of finding - so its
output is uncompressible by its own algorithm. So what you do is introduce an intermediate
function, I. You evaluate I on C(x), and then you evaluate C on the result of that: C(I(C(x))). The reason that this is mathematically trivial is that it just corresponds
to a better compression function, D. If there exists an intermediate function I, which renders
the output of C compressible by C, that means that there's an improved version of C which
incorporates I.
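In code, the point is almost a one-liner: whatever I is, the pipeline is itself just another compression function. A sketch, with zlib and a byte-reversal standing in as arbitrary placeholders for C and I:

```python
# The composition argument in code: "compress, transform, compress again" is
# still just one function D. C (zlib) and I (byte reversal) are arbitrary
# stand-ins here, chosen only to show the shape of the argument.

import zlib

def C(data: bytes) -> bytes:
    return zlib.compress(data)

def I(data: bytes) -> bytes:
    return data[::-1]            # some reversible intermediate transform

def D(data: bytes) -> bytes:     # the "clever" pipeline is itself a compressor...
    return C(I(C(data)))

def D_inverse(blob: bytes) -> bytes:   # ...with an ordinary inverse
    return zlib.decompress(zlib.decompress(blob)[::-1])

x = b"abracadabra " * 100
assert D_inverse(D(x)) == x
print(len(C(x)), len(D(x)))   # for typical inputs, D does no better than C
```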
So far, there's nothing controversial. In fact, there are variations on some of
the common compression algorithms that work in two-pass modes. For example, for appropriate
inputs, you can do a two-pass version of LZW that does a better job of finding the longest
repeated strings than the single pass version. It does better than standard LZW compression,
at the cost of a significant increase in execution time (more than double), and a significant increase in memory usage (you basically need to be able to hold the entire string to be
compressed in memory). We don't typically use those algorithms, because the improvement in
compression rates is modest at best - and for many inputs, there's no improvement at all.
The two-pass LZW could be written as LZW-compress, transform, LZW-compress - the
two-pass algorithm is just a more efficient version of that. The basic idea is illustrated in the diagram here: you can't decompress the document without having some extra information available. In the common case of delta compression, that extra information is the original version of the document.
What many clueless folks claim is something much stronger than the two-pass approach. That
is, they claim that they can repeatedly apply the intermediate function, and each
time, it will transform the original string into something which can be compressed more
by the original compression function.
That is, length(x) > length(C(x)) > length(C(I(C(x)))) > length(C(I(C(I(C(x)))))), and so
on.
The usual explanation of why this works (and the explanation provided by my correspondent)
is that the intermediate function is removing information; but that the reverse
transform on the decompression side re-inserts the information. So the amount of information
in the compressed string is being reduced by I; and the resulting string can be shorter, since
it needs to encode less information.
That's fine. In fact, that can work in theory - and it does work in practice. If you look at what's called delta compression, that's almost exactly
what it's doing: since you know an original text, you can create a compressed copy of
an altered version of that text by taking advantage of the fact that you know the original.
It works incredibly well.
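Here's a hedged sketch of the idea using Python's standard difflib module. Real delta formats (VCDIFF, bsdiff, git's pack deltas) are far more compact, but the principle is the same: the "compressed" form only records how the new text differs from an original that both sides already have.

```python
# Toy delta compression: encode a revised text as copy/data operations
# against a shared original.

import difflib

def delta_encode(original: str, revised: str) -> list:
    ops = []
    matcher = difflib.SequenceMatcher(None, original, revised)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reference into the original
        else:  # 'replace', 'insert', 'delete'
            ops.append(("data", revised[j1:j2]))  # literal new text (may be "")
    return ops

def delta_decode(original: str, ops: list) -> str:
    out = []
    for op in ops:
        if op[0] == "copy":
            out.append(original[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)

old = "The quick brown fox jumps over the lazy dog."
new = "The quick brown fox leaps over the lazy dog!"
patch = delta_encode(old, new)
assert delta_decode(old, patch) == new
```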
Of course, doing it with an intermediate function is really kind of silly; it's far more
efficient to encode it into the original compression function. It's pretty easy to encode
things like that into the function. For example, I can create an improved English-language LZW
compression system. The way that I'd do that is to pre-load the LZW dictionary: take common
English words and phrases, and assign them dictionary entries. Then when I go to compress,
I'll get a better compression rate than the standard LZW - because normally, LZW would take up
space in its compression dictionary building up those entries. By requiring my decompressor to
have the information about the pre-populated dictionaries, I can do a better job of
compressing, because I don't need to put all of the information into the compressed file.
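Here's a sketch of what pre-loading looks like. The word list is made up, and the seeding scheme (adding each phrase plus all of its prefixes, so that LZW's greedy match can actually reach it) is my own illustration rather than a description of any real LZW variant. Both sides have to agree on the preload out of band - that shared knowledge is exactly the information that's kept out of the compressed file.

```python
# LZW with a pre-seeded dictionary of (hypothetical) common English phrases.

COMMON_WORDS = ["the ", "and ", "that ", "of ", "to "]   # made-up preload list

def lzw_codes(text, preload=()):
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    for phrase in preload:               # seed each phrase and all its prefixes
        for k in range(2, len(phrase) + 1):
            if phrase[:k] not in dictionary:
                dictionary[phrase[:k]] = next_code
                next_code += 1
    current, codes = "", []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

sample = "the cat and the dog and the bird that sang to the moon "
print(len(lzw_codes(sample)), len(lzw_codes(sample, COMMON_WORDS)))
# The seeded run emits fewer codes for this input, because "the ", "and ", and
# friends are single dictionary hits from the very first character.
```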
You can, theoretically, do this as part of a multi-pass system. That is, you can
look at a large number of compressed documents, and find common structures in the compressed
file. Then you can populate a dictionary of those structures, and use that to transform
the file, rendering it compressible. It's harder than just incorporating it into
the original algorithm. But it's possible. The catch, of course, is that you have to
do that in advance. The intermediate function can't just look for structures
and set up a dictionary - the specific structures that it's going to look for have to
be incorporated into the design of the intermediate function. If they're not, then you need to
include that dictionary in the compressed text - in which case you're not
removing any information.
Can that work iteratively? No. You can only remove a given chunk of information once.
Then it's gone. To use the English example, you can write a compression system
for text that knows how to encode common words as short bit strings. (In LZW terms, you
can pre-populate the dictionary with common words.) But once you've run the compressor
with that once, re-using it isn't going to work.
No matter how you try to weasel around it, you'll come back to the same basic problem. There's a certain amount of information inherent in a text. Most texts contain a huge amount of redundancy, which is why they're compressible. But once you've eliminated a particular
kind of redundancy, it's gone. You can reduce the length of a text by removing
information from it - but to be able to decompress it, the decompressor needs to know exactly
what information was removed - so you can only remove fixed, pre-determined chunks
of information. And you can only do that once - then the information is gone; trying to remove it again doesn't do anything.
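A quick empirical illustration, with zlib standing in for "a halfway decent compressor": the first pass shrinks redundant English text dramatically, and running the same compressor on its own output makes it slightly bigger, not smaller.

```python
# Recompressing compressed output buys you nothing.

import zlib

text = (b"No matter how you try to weasel around it, you'll come back "
        b"to the same basic problem. " * 50)

once = zlib.compress(text)
twice = zlib.compress(once)

print(len(text), len(once), len(twice))
# The exact numbers vary with the input, but the pattern doesn't: the first
# pass removes the redundancy, and the second pass has nothing left to find,
# so it only adds header overhead.
```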
The takeaways from this saga?
- Most things aren't compressible.
- If you're using a halfway decent compression system, the output from it is pretty much guaranteed to fit into the category of non-compressible strings.
- There are ways to "tighten" a compressed text by removing information from it, but they're limited to fixed, predetermined sets of information.
- You can use an intermediate function to remove information from a compressed text, but it's easier to just incorporate the removal process into the original compression algorithm.
- Iterative information removal can't work.
You mean, Mark, I cannot zip and then re-zip this article until it is one single byte big?
That's quite a bummer.
Funny you should say that. I just spent a lot of money on a highly, highly compressed torrent of The Watchmen. Unfortunately, I haven't yet figured out how to uncompress it. Perhaps you can help. Here it is.
1
The Science Pundit:
I got a 0. Maybe we could work something out.
Ok Guys, don't be *too* nasty.
Even the bozo who bragged to me about his scheme doesn't claim anything like that. If I followed his rambling correctly, he doesn't claim it works for all input data, and the iterations have diminishing returns.
Dude, what a fascinating explanation of compression! One question: it seems like you are implying that LZW compression--which I use all the time for TIFFs--treats the data file as a one-dimensional string. Is that correct? If so, is it mathematically provable that treating the TIFF as a two-dimensional array--and thus making use of two-dimensional proximity--can or cannot lead to more efficient compression?
There's an interesting story on compression here: http://www.geocities.com/patchnpuki/other/compression.htm
Summary: a guy posted a challenge in the comp.compression FAQ offering $5,000 to anyone who could compress a single file of arbitrary data such that the decompressor and the compressed file together are smaller than the original. The webpage is the story of a man who took him up on that challenge, paid his $100 ante, received 3MB of data from random.org, and supplied a decompressor and a collection of files that together were smaller than the original and combined to produce the original.
He never did get his five grand, though.
Pi contains itself to an arbitrary number of significant figures, e.g., 314159265358 from the 1,142,905,318,634-th digit after the decimal. One can then discard all the preamble and simply write pi from itself. "8^>)
PhysioProf: no, making use of two-dimensional proximity can improve compression ratios. PNG does this, for example; it improves the compressibility of the data stream by "predicting" each byte based on the corresponding byte in the three pixels above, to the left, and above and to the left. In pictures that are not pure noise, this leads to more zeroes and other low numbers, increasing the chance of repetitions in the byte stream.
However, there is still a lower bound to this. More importantly, the two-dimensional relationships are not preserved when you use LZW compression, so the only way to exploit them would be to decompress your data and recompress it using a different algorithm. In other words, it's possible to beat LZW starting from the raw data, but highly unlikely that you can gain anything by recompressing the LZW-compressed data, which brings us back to the point of the post.
Uncle Al: you can represent any (finite-length) data stream as an offset and length in pi. The problem is that the offsets tend to be bigger than the data represented, and of course it's very slow to decompress if you don't have a look-up table handy. :-)
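To make the prediction idea concrete, here's a rough sketch in the spirit of PNG's Paeth filter, simplified to a single-channel (grayscale) image. Real PNG chooses a filter per scanline and then runs DEFLATE; this only shows how predicting each byte from its left / above / above-left neighbours turns smooth image regions into long runs of small numbers, which a later dictionary pass compresses well.

```python
# Simplified Paeth-style prediction filter for a grayscale image.

def paeth(left: int, up: int, up_left: int) -> int:
    p = left + up - up_left
    pa, pb, pc = abs(p - left), abs(p - up), abs(p - up_left)
    if pa <= pb and pa <= pc:
        return left
    return up if pb <= pc else up_left

def filter_image(rows):
    out = []
    for y, row in enumerate(rows):
        filtered = []
        for x, value in enumerate(row):
            left = row[x - 1] if x > 0 else 0
            up = rows[y - 1][x] if y > 0 else 0
            up_left = rows[y - 1][x - 1] if x > 0 and y > 0 else 0
            filtered.append((value - paeth(left, up, up_left)) % 256)
        out.append(filtered)
    return out

def unfilter_image(filtered):
    out = []
    for y, row in enumerate(filtered):
        restored = []
        for x, delta in enumerate(row):
            left = restored[x - 1] if x > 0 else 0
            up = out[y - 1][x] if y > 0 else 0
            up_left = out[y - 1][x - 1] if x > 0 and y > 0 else 0
            restored.append((delta + paeth(left, up, up_left)) % 256)
        out.append(restored)
    return out

image = [[10, 11, 12, 13], [11, 12, 13, 14], [12, 13, 14, 15]]
assert unfilter_image(filter_image(image)) == image
# filter_image(image) is mostly small numbers (lots of 0s and 1s) -- far more
# repetitive than the raw pixel values, so a later LZW/DEFLATE pass does better.
```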
Ahruman, thanks so much for that great explanation! And just to be sure I understand, everything we're talking about here is limited to lossless compression, correct?
Ahruman, the problem of whether in any given base any arbitrary finite string shows up in Pi is still open. I think this is known for base 2 though. In general, questions about what decimal or other expansions of irrational numbers look like are very difficult.
MarkCC, how would you take it if a right winger replaced Christian with Jew?
Of course, that is not politically correct, so the hypothetical is absurd.
I do wonder, though, do you have reason to believe that this Christian is a real person?
Clearly, the idea that you can losslessly and iteratively compress data is absurd from a Shannon perspective.
And the majority of your audience is no doubt already aware of this.
Just some observations...
Regarding #6, I just read the link, and it was interesting. I don't think Patrick won, but given that the challenge was for a particular file, it seems possible that an algorithm to compress that one file exists, such that the decompressor and compressed file together were smaller than the original. Especially if the random sequence held a latent compressible pattern. I'll think about it more.
Re #12:
William, in the reverse case, I would have absolutely no problem with a conservative Christian mocking a Jew. Of course, you won't believe me about that, because you're so sure that everyone who disagrees with you is a lying hypocrite.
And, of course, the reverse story is hard to imagine. Because Jews don't proselytize, and that's a key part of the story. The jerk who pestered me was pestering me with the specific goal of converting me to Christianity.
As for telling if he's real... How am I supposed to do that? If I get email from what appears to be a legitimate account; and it uses what appears to be a real name; and its content isn't particularly out of the ordinary - then why would I think it's fake?
The content wasn't exactly unusual. It's the typical self-righteous style of writing, with the typical proselytizing arguments. It was no different from the kind of thing that I've heard numerous times, both through email and in person, throughout my life.
Re physioprof:
LZW compression is stream-based - so yes, it's basically one-dimensional. You can improve compression by taking advantage of two-dimensional structure.
The most common techniques that I know of are based on two things:
(1) Neighbor hinting. In general, most pixels in an image are nearly the same color as at least one of their neighbors. Using that, you've found a bit more redundancy in the image, which you can eliminate with compression.
(2) Block coloring. Every pixellated image can be treated as a set of rectangles of different colors. Each rectangle can basically be stored using fewer bits than most compression methods would use for the individual pixels.
Block coloring is particularly interesting, because it can be used for what we call lossy compression. If you've got a rectangle of pixels that are almost the same color, or you've got a block of a single color that's almost a rectangle, you can
just treat it as a single-colored rectangle. When you uncompress the image, you don't get back exactly the same image that you had originally; but the compression ratios can be absolutely astonishing.
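A crude sketch of the lossy block idea, using fixed 2x2 tiles rather than the adaptively-chosen rectangles Mark describes (the tiling scheme here is purely my own illustration): each tile collapses to a single averaged value, and decompression gives back an image that is close to, but not exactly, the original.

```python
# Toy lossy "block" compression: 4 pixels become 1 averaged value per tile.

def compress_blocks(rows, tile=2):
    out = []
    for y in range(0, len(rows), tile):
        out_row = []
        for x in range(0, len(rows[0]), tile):
            block = [rows[yy][xx]
                     for yy in range(y, min(y + tile, len(rows)))
                     for xx in range(x, min(x + tile, len(rows[0])))]
            out_row.append(sum(block) // len(block))   # one value per tile
        out.append(out_row)
    return out

def decompress_blocks(tiles, height, width, tile=2):
    return [[tiles[y // tile][x // tile] for x in range(width)]
            for y in range(height)]

image = [[10, 10, 50, 52],
         [10, 11, 51, 50],
         [90, 91, 20, 20],
         [92, 90, 20, 21]]
small = compress_blocks(image)            # a 2x2 grid of averages
approx = decompress_blocks(small, 4, 4)   # close to, but not exactly, the input
```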
It doesn't really matter, you know, which particular strand of religious looniness a religious loony happens to believe in. Christian, Jew, Buddhist, Muslim, Scientologist or Amway distributor - it's all the same stuff.
As for compression - the simplest argument I ever heard was the pigeonhole argument. There are far more possible 10 megabyte files than there are possible 5 meg files, so no reversible compression can compress every 10-meg file down to a distinct 5-meg file (it has to be distinct, because otherwise the decompressor doesn't know what to decompress it to) - there just aren't enough to go around.
The trick is - almost all possible 10 meg files are random noise. The 10 meg files that humans want compressed (e.g., word documents) are only a tiny, tiny subset of the possible ones. Compressors are a mapping of those files into the smaller 5-meg space, swapping out some of the possible 5-meg files that also are random noise.
The other thing about real compressors is that when given data that they can't compress very well (random numbers, for instance), they usually don't make it MUCH worse - for instance, a dictionary-based scheme might make a single entry of the entire file, resulting in increasing the overall size by only a couple of bytes - the amount required to specify the size of the dictionary entry and then to reference back to it.
WW: You won't be able to accomplish compression on a single adversarially-selected file such that the decompressor+compressed file are smaller than the input, barring tricks such as Patrick used, which of course don't actually reduce the amount of storage space required. This is because you can't do it with any sort of general compressor, and specificity will require storage of the data somewhere.
(Now, I did specify adversarially-selected, which means we can presume it will be generated in a fashion with no underlying algorithm, and will be random - such as random.org data.)
Michael Ralston,
I tend to lean toward your view of the challenge, if the random data was also scanned to ensure that it did not, for example, randomly contain a compressible pattern.
I assume that random.org is using a true random number source, and not a pseudo-random number generator - if it were a PRNG, all you would need is the size, seed, and generator algorithm.
I still might take a look at the file used in the challenge because I am a glutton for punishment, and it would be interesting if the particular file in question has a latent compressible pattern. My guess is the challenger at least looked at bit, 2-bit, 3-bit, 4-bit, etc., entropy.
WW
Mark, I think you've been trolled. The suspicious thing is the guy preached to you, then switched to a ridiculous technical idea - it just sounds a little too unconvincing.
Re #19:
It's possible that it's a fake, but I don't think so.
The way that he introduced his brilliant compression scheme was part of his argument. It was a sort of combination of a bribe and a demo of the benefits of closeness with God.
He first used it as an example of the kinds of "out of the box" thinking that was possible with God's help; then he basically offered to let me in on the secret.
The thing is, the series of messages from him had the sincerity of the truly deluded. There's a sense of desperation to the loony preachers, which is really hard to fake. Like I said, it's possible that he's just a troll, but I still think he's a genuine nutcase.
The gzip people found a couple patents that were granted for ideal compression algorithms: http://gailly.net/05533051.html. As they note, it's like granting a patent on a perpetual motion machine.
Funny you should say that. I just spent a lot of money on a highly, highly compressed torrent of The Watchmen. Unfortunately, I haven't yet figured out how to uncompress it. Perhaps you can help. Here it is.
1
I recognize that algorithm. You take the original file and interpret it as one long binary number. To compress it, you subtract 1 from that number and remove leading zeroes. The compressed file will usually be the same length as the original, but it'll sometimes be one bit shorter. It will never be larger than the original.
This isn't very efficient, but you can run the compression process as many times as you want and the file will keep getting shorter.
To decompress the file, just reverse the process. Obviously you need to do exactly one decompression for every compression or you'll get bad results. To be safe, you should keep track of the number of times you ran the compression program and write it down on a piece of paper somewhere.
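Taken literally, the joke algorithm looks like this (assuming the file, read as a binary number, is at least 1 and starts with a 1 bit):

```python
# The subtract-one "compressor": interpret the file as one big binary number,
# subtract 1, and let leading zero bits vanish. You can run it as often as you
# like and the output never grows -- it just takes about 2^(8*filesize)
# iterations (plus a notebook recording the count) to get anywhere.

def joke_compress(bits: str) -> str:
    value = int(bits, 2) - 1
    return bin(value)[2:] if value > 0 else ""

def joke_decompress(bits: str) -> str:
    value = (int(bits, 2) if bits else 0) + 1
    return bin(value)[2:]

original = "100000000"             # 256
once = joke_compress(original)     # '11111111' -- one whole bit shorter!
assert joke_decompress(once) == original
```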
I think you are referring to http://en.wikipedia.org/wiki/Kolmogorov_complexity for dictionary compression.
Sarang: No, that's not what I meant by dictionary-based compression, since actual Kolmogorov complexity is essentially useless from a practical standpoint, as all we can do is approximate; it's an analytic tool, not a computational one.
What I meant is a scheme where you build a "dictionary" of repeated sequences and the like, then represent the actual file as a list of dictionary references. There are various tricks to how one does this, but in general it can provide quite good compression for any sort of repetitive input - such as language, which has many repeated words and phrases, hence the name 'dictionary'.
Actually, you can ensure that you never increase the size of the input by more than one bit. Just run the compression scheme, and compare the size of the output to the size of the input. If the output is smaller, prepend a 0 to the output and return it. If it's larger, prepend a 1 to the input, and return that instead.
When decompressing, just check the first bit. If it's a 1, strip it off and immediately return the input. If it's a 0, strip it off and run the standard decompressor.
This is very slight cheatery, but hey, it works.
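Here's a sketch of that wrapper, using zlib as the inner compressor and a whole marker byte rather than a single bit, just to keep things byte-aligned: worst-case expansion is one byte, at the cost of one byte of overhead on everything that does compress.

```python
# Escape-hatch wrapper: never expand the input by more than the marker.

import os
import zlib

def wrap_compress(data: bytes) -> bytes:
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return b"\x00" + packed     # flag 0: payload is compressed
    return b"\x01" + data           # flag 1: payload stored as-is

def wrap_decompress(blob: bytes) -> bytes:
    flag, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if flag == b"\x00" else payload

noise = os.urandom(1024)            # incompressible input: grows by just 1 byte
text = b"yada " * 1000              # compressible input: shrinks a lot
assert wrap_decompress(wrap_compress(noise)) == noise
assert wrap_decompress(wrap_compress(text)) == text
```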
You're talking linear maths, not non-linear. With non-linear representation you can compress repeatedly, because the input a is different to the output b, which you can then consider B in linear form, ready to become non-linear C. I know it can be done, as I have the non-linear twin bifurcation maths needed and have tested it already on random digit data, and yes, you can unpack this information too. Not to worry, I'm just getting ready to produce the compressor myself, and no, I'm not a Christian. There will of course, for a given data set, be limits as to how compressed you can make the data, but this will not be relative to linear laws of data compression.