Simple Lempel-Ziv Compression in Erlang

I decided to do a little bit of something useful with Erlang, both to have some code to show, and to get some sense of what it's like writing something beyond a completely trivial example. Because the capabilities of Erlang shine when you get into low-level bit-oriented things, I thought that writing a bit of data compression code would make for a good example. I'm going to present it in two parts: first a simple but stupid version of the algorithm; and then the encoding part, which gets into bit twiddling, and potentially gets more interesting.

I'm going to use a simple version of Lempel-Ziv compression. Before I get into the detail
of how LZ compression works, I've got an interesting story about how I learned about it. My first summer in grad school, I was looking for a job. One of my college friends worked for a subsidiary of Travellers insurance, and he got me an internship with them.

Our group (3 of us) worked on programs that ran on salespeople's laptops. Since this
was 1991, laptops were still pretty primitive. We had to run in 128K of memory, with the best machines having 10 megabytes of hard disk, and the worst machines having two floppies. So memory use was always a huge issue.

Being an insurance company, naturally things were pretty bureaucratic. They hired me to write a prototype of a new program. I don't remember the details of what I was supposed to write. But the way things worked, they wanted to build this new tool for the salespeople. But the company wouldn't let them propose a new project without having staffing for it. So they hired me to work on it. But because they hadn't proposed the project before they hired me,
I had nothing to do while they proposed it, and worked out their requirements. So I worked for them for a total of about 12 weeks; and it took them about 9 1/2 weeks to get to the point where they had an approved project for me to work on. So I had over two months with nothing
to do. So I kept pestering the other two guys about what I could do to help them.

One of the things they wanted was a way of doing a splash screen. They had a GIF image of the company logo, and they thought it would be nice to be able to splash it on the screen whenever the sales app loaded. But they didn't have an image viewer that they could call from inside their program. So they asked me to write one. GIF images are encoded using LZ. So I coded up an LZ decoder in C to get the bitmap out of a GIF, and they were happy. Then I decided, as long as I had an LZ decompressor, I should write an LZ compressor, and then we'd have some generic data compression code. So I went ahead and did that, and added a
nice, effective set of data compression routines to our libraries. But the manager was actually pissed at me: I'd added a feature to our library - data compression - without getting permission. The right way of doing things was to write a proposal, and pass it around the various levels of petty bureaucrats for two months while I sat and twiddled my thumbs on the payroll.

Anyway... Back to compression.

LZ is a remarkably simple compression algorithm, but it's quite effective as long as you've got enough data. It relies on a simple scheme for finding repeated sequences of
characters in a file. If you've got a file like text, which typically has lots of repeated
subsequences, then LZ can be quite effective - particularly if you're willing to go to the trouble of doing variable-length encodings, as you'll see in the next post. But even with simple encodings, you can see a nice benefit.

The key is, how do you find repeated sequences? Finding maximum repeated sequences is
extremely difficult - it's a pretty complicated dynamic programming problem. Dynamic
programming is nice, but in a case like this, it's slow: there are O((N/2)^2) possibly repeatable sub-sequences in a file of size N. You can use some tricks to trim that down to a smaller set of reasonable subsequences - but it's still a lot. For a really big file, you're talking about a lot of time computing those exhaustively to find maximum repetition. And for a really big file, you're also talking about a whole lot of memory to hold those sequences, and two full passes through the file to do the compression. Not good.

LZ takes a simple approach. You set a minimum length of sequence to encode, and every time
it sees a sequence of that length or longer, it assigns it a short compression code in a
compression dictionary. Next time it sees that sequence, it outputs the compression code, and
then adds that sequence plus the next character to the dictionary. In pseudo-code form, here's the algorithm:

Compress(chars):
   dict = new compression dictionary
   current = remove first char from chars
   for b in chars:
      if (current + b in dict):
         current = current + b
      else
         output dict.codeFor(current)
         dict.addNewCode(current + b)
         current = b
      end
   end
   output dict.codeFor(current)
end

It's easiest to understand with an example. Suppose you had a string "abcabcabcabcabcabcabcabc" (8 repetitions of "abc"), and suppose you were
generating codes for anything two characters or longer.

You'd start by outputting the "a". Then you'd look at the pair, "ab". It's not in the
compression dictionary, so you'd generate a compression code for it, CC1. Next you'd see "bc",
which isn't in the dictionary, so you'd output "b", and assign it CC2. Next you'd see "ca",
assign it CC3, and output "c". Then you'd see "ab". That's in the dictionary. So you'd add
a character to it: "abc". That's not in the dictionary. So you'd output CC1,
and add "abc" to the dictionary with code CC4. Then you'd see "ca", which is in the dictionary. So you'd get another character, giving you "cab". That's not in the dictionary, so you'd
output CC3, and give "cab" the code CC5. Now you'd look at "bc", which is in the dictionary; then "bca" which isn't, so output CC2, and assign CC6 to "bca". Then you'd see "ab", which is in the dictionary, so you'd add to it, getting "abc", which is in the dictionary, so you'd add to it, getting "abca", which isn't in the dictionary. So you'd output CC4, and then assign
"abca" the code CC7. And so on.

So let's try implementing that in Erlang. We'll start with an extremely naive implementation - we'll just try to translate the algorithm as directly as possible
into Erlang code, and we'll just use simple data structures. That's dumb, but it's
an easy way to start.

First, we need a compression dictionary. Taking the naive route, we'll assume that we're compressing a string of 8 bit characters. We'll assign compression codes that are integers
larger than the largest 8 bit character, so compression codes will start at 256. The way
that we'll represent our naive dictionary is by a tuple whose first element is the integer value of the next compression code to be assigned, and whose second element is a list of
pairs, where each pair is a compression code followed by the string assigned to that code.
Here's the implementation of that in Erlang:

%% A dictionary is a pair: the next code to assign, and a list of
%% [Code, String] pairs.
new_dict() ->
   {256, []}.

add_to_dict(V, {I, ES}) ->
   {I+1, [[I, V]|ES]}.

%% Map a code to its string; a code with no entry maps to itself.
find_value(K, {_, ES}) -> find_value_in_list(K, ES).

find_value_in_list(K, []) -> [K];
find_value_in_list(K, [[K, V]|_]) -> V;
find_value_in_list(K, [_|Rest]) ->
   find_value_in_list(K, Rest).

%% Map a string to its code; a string with no entry maps to its first character.
find_key(V, {_, ES}) -> find_key_in_list(V, ES).

find_key_in_list([V|_], []) -> V;
find_key_in_list(V, [[K, V]|_]) -> K;
find_key_in_list(V, [_|Rest]) ->
   find_key_in_list(V, Rest).

dict_contains(V, {_, ES}) -> list_contains(V, ES).

list_contains(_, []) -> false;
list_contains(V, [[_, V]|_]) -> true;
list_contains(V, [_|Rest]) -> list_contains(V, Rest).

%% Codes are assigned densely, so anything below the next code is known.
dict_contains_key(V, {Max, _}) ->
   V < Max.

It's pretty simple. We rely on the fact that Erlang treats strings as lists of characters,
so we can write it all as list code. There's one little bit of cleverness in it: when we
look up a code for a string, if there is no code for that string, we return the string itself. This is very standard looking functional code. As you can see, there's no need for any
control structures in this code - it's all pattern matching. It was extremely easy to write. (It had better have been, given how totally trivial this code is.)
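To make the behavior concrete, here's a quick (hypothetical) session exercising the dictionary; every value here follows directly from the definitions above, assuming they're compiled into a module:

```erlang
%% Assumes the dictionary functions above are in scope.
D0 = new_dict(),               % {256, []}
D1 = add_to_dict("ab", D0),    % {257, [[256, "ab"]]}
256 = find_key("ab", D1),      % known string: its code
"ab" = find_value(256, D1),    % known code: its string
[300] = find_value(300, D1),   % unknown code: returned as itself
true = dict_contains("ab", D1),
false = dict_contains("cd", D1).
```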

Now, let's move on to the basic compression algorithm.

compress(CurrentSequence, [B|Bytes], Dict) ->
   NewSeq = CurrentSequence ++ [B],
   InDict = dict_contains(NewSeq, Dict),
   if
      InDict -> compress(NewSeq, Bytes, Dict);
      true ->
         [find_key(CurrentSequence, Dict)] ++
            compress([B], Bytes, add_to_dict(NewSeq, Dict))
   end;
%% End of input: flush whatever sequence is still pending.
compress(CurrentSequence, [], Dict) ->
   InDict = dict_contains(CurrentSequence, Dict),
   if
      InDict -> [find_key(CurrentSequence, Dict)];
      true -> CurrentSequence
   end.

compress([B|Bytes]) ->
   compress([B], Bytes, new_dict()).

For the most part, this is a pretty straightforward translation of the algorithm into
functional form. Since the algorithm is written in terms of mutable state, we translate that into a recursion with a couple of carrier parameters replacing the state. Here, however, we need some control structures, and this illustrates one thing that I find horribly ugly and frustrating in Erlang, which I'm constantly tripping over.

Control statements like "if" in Erlang are horribly restrictive. You can't write "if dict_contains(CurrentSequence, Dict)" in Erlang - because you're
not allowed to call a function in the condition clause of a control statement! You've got to do a pattern match, assigning a name to the result of the function call, and then
use the name in the control statement: "InDict = dict_contains(CurrentSequence, Dict), if InDict". Gah! Hideous. What were they thinking? Why can't you just use
a function in a control statement?

My suspicion about the answer is that in Erlang, functions aren't necessarily pure. There is a hidden notion of state. There are I/O functions for file I/O, and there are message send and receive statements that can be hidden inside functions. But they want conditions and control flow to be absolutely side-effect free. So anything that could
potentially have a state effect - including function calls - gets bumped out of conditions.

Even though I can see the logic, I hate it. I really hate it. It's ugly, and
obfuscatory, and ugly, and counterintuitive, and, well, ugly. Given that rant, the code is still pretty clear and clean. Once you know a tiny bit of Erlang syntax, you should be able to read and understand it pretty easily. The "if" syntax is a bit odd - it's more like an infixy
version of Lisp "cond" than like a typical "if". The basic syntax is: if (cond -> value;)+ end. It's a list of alternatives, which are evaluated in order; if the condition part of one of them is true, then the expression following its "->" is evaluated, and returned as the result of the "if". There's no "else" at the end of an "if" statement; if you want one, you just use true as a condition.
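To make the complaint concrete, here's a tiny illustration of the bind-then-test dance; "classify" is just a name I've made up for the sketch:

```erlang
classify(Key, Dict) ->
   InDict = dict_contains(Key, Dict),   % the call has to happen out here...
   if
      InDict -> found;                  % ...so that only a bare variable is tested
      true   -> missing                 % 'true' plays the role of "else"
   end.
```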

It's not much use to be able to compress something if you can't decompress it later. (Someone made a whole lot of money on a patent based on that premise.) LZ decompression
is a bit harder to understand than compression. The trick is, you need to generate the
same dictionary that was generated by the compression phase, but the table is never stated
explicitly. You need to carefully keep track of stuff in order to be able to correctly
regenerate the dictionary as you decompress. Here's the rough algorithm (relying on
our hack above, that looking up a key which isn't in the dictionary returns the
key itself; effectively, that means that looking up a character which isn't a compression
code will return that character.)

Decompress(Chars):
   dict = new Dictionary
   old = dict.lookup(remove first code from Chars)
   output(old)
   for (new in Chars):
      str = dict.lookup(new)
      output(str)
      first = first character in str
      dict.add(old + first)
      old = str
   end

So, at each step, you take the string for the previous code, plus the first character of
the string for the next code, and add that to the dictionary. That will end up
generating the same dictionary as the compressor. Again, the easiest way to see it
is to try an example. If we run our Erlang compressor on the string "123123123123123123123" (7 repetitions of "123"), the result is: [49,50,51,256,258,257,259,262,261,257].

We start by outputting 49 (which is the numeric value of the character "1"), and setting old to [49]. Then we start the loop. The following table shows you the values of
the variables at each step; each line is one iteration of the loop.

old   new   remaining Chars                    str (output)   first   new entry
49    50    [51,256,258,257,259,262,261,257]   "2"            "2"     256 = "12"
50    51    [256,258,257,259,262,261,257]      "3"            "3"     257 = "23"
51    256   [258,257,259,262,261,257]          "12"           "1"     258 = "31"
256   258   [257,259,262,261,257]              "31"           "3"     259 = "12"+"3" = "123"

Once again, the translation into Erlang is very straightforward:

decompress(Prev, [B|Bytes], Dict) ->
    %% One subtle case: the compressor can emit a code on the very step
    %% after creating it, before the decompressor has had a chance to add
    %% it. When that happens, the string for the code must be Prev plus
    %% its own first character.
    Lookup = case dict_contains_key(B, Dict) of
                 true -> find_value(B, Dict);
                 false -> Prev ++ [hd(Prev)]
             end,
    NewEntry = Prev ++ [hd(Lookup)],
    Lookup ++ decompress(Lookup, Bytes, add_to_dict(NewEntry, Dict));
decompress(_, [], _) -> [].

decompress([B|Bytes]) ->
   [B] ++ decompress([B], Bytes, new_dict()).

This is almost identical to how I'd write it in Haskell. In Haskell, I'd use a where clause, which would alter the ordering slightly, but otherwise it would look exactly the same.
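As a quick sanity check (assuming compress/1 and decompress/1 are compiled into a module), a round trip should hand back the original string. "abab" is a safe test input, since it never emits a code on the same step the code was created:

```erlang
%% compress("abab") produces [97,98,256]: 'a', 'b', then the code for "ab".
[97,98,256] = compress("abab"),
"abab" = decompress([97,98,256]).
```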

This is getting quite long, so I'll hold off on cleaning up the implementation until the
next installment. There are two problems with this initial implementation that need to be
fixed. The first one is that this is recursive, but it's not tail recursive. That means that unless the compiler does something clever, this bugger is going to blow up on large inputs. It should be made properly tail recursive. (I suspect that the Erlang compiler, which has a reputation for its excellent compilation, would fix that, but even so, you should implement these things correctly, rather than expecting the compiler to fix your bug; and O(N) stack growth counts as a bug in my book.) The second problem is that
using a linear list for the dictionary is mind-bogglingly stupid. We should use an
efficient mechanism for the keys. It happens that Erlang has built-in support for
a really excellent mechanism that's practically tailor-made for this, called ETS - Erlang Term Storage. So next time, we'll look at making compression and decompression properly tail recursive, and implementing our compression dictionaries using ETS.
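To give a taste of the first fix, here's one way the compressor could be made tail recursive - a sketch only, using an accumulator that gets reversed once at the end. compress_acc and compress_tail are names I've made up, and the dictionary functions are the ones from above:

```erlang
compress_acc(Current, [B|Bytes], Dict, Acc) ->
   NewSeq = Current ++ [B],
   InDict = dict_contains(NewSeq, Dict),
   if
      InDict -> compress_acc(NewSeq, Bytes, Dict, Acc);
      true ->
         %% The recursive call is now in tail position; output codes are
         %% collected in reverse order in the accumulator.
         compress_acc([B], Bytes, add_to_dict(NewSeq, Dict),
                      [find_key(Current, Dict)|Acc])
   end;
compress_acc(Current, [], Dict, Acc) ->
   InDict = dict_contains(Current, Dict),
   if
      InDict -> lists:reverse([find_key(Current, Dict)|Acc]);
      true -> lists:reverse(Acc, Current)   % lists:reverse/2 appends Current
   end.

compress_tail([B|Bytes]) ->
   compress_acc([B], Bytes, new_dict(), []).
```

The ETS side of things uses ets:new/2 to create a table, and ets:insert/2 and ets:lookup/2 to access it - but that's for next time.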

I'll conclude with some commentary on my developing opinion of Erlang.

So far, I'm finding Erlang frustrating. There's a whole lot of goodness there - it's the
most real functional language I've seen. When you see things like ETS, and binary
types, you'll really get a sense of what I mean by that. It's not just powerful, it's
practical. This is a language for hacking real systems. For example, I've done a lot of work in SCM - I'd love to use Erlang for implementing an SCM system. There's
so much that would be incredibly useful and powerful - I'm pretty sure that I could build
a system in Erlang that just screamed.

But at the same time - it's all so frightfully ad-hoc. It's clear that it was implemented by taking whatever the Ericsson engineers needed on a given day, and finding some way to smush it in. It's sloppy. It's not a pure functional language. But it's not really a stateful language either. It's an awkward, sloppy mess that sits somewhere in between.

If you want to write a pure functional language, then you should write a real pure
functional language like Haskell. If you want to permit limited state to sneak in things like
IO, and make them work without needing to do anything fancy like monads, then deal with the fact that your language isn't pure, and make state explicit when you need it, and design your language to handle that cleanly - like OCaml. Instead of doing either of those, Erlang goes the half-assed route, and attempts to get some of the benefit of purity without being pure, but without really admitting that it's not pure. The result is awful stuff,
like the whole "no function calls in conditions" rubbish.

In the interests of full disclosure, I'm coming at this as someone who is an advocate for the OCaml approach. I think OCaml is a wonderful language, and to my mind, its limited state approach strikes a great balance. Haskell is often frustrating to me, because things start off beautifully, but then as you start to tie things together at the top with monads, it gets hairy. But it's a good approach, and I can accept that I've got those problems with it mostly because of inexperience. I prefer the ML approach, but Haskell is brilliant.

But to reiterate: if you're designing a functional language, you need to make
a choice. How are you going to handle state? Are you going to make a pure language, where state needs to be represented explicitly, using something like continuation passing or monads? Or are you going to be impure, and allow state into the system? If you decide the former, you need to design your language with the constructs that you'll need to handle passing around explicit state - like Haskell does with monads. If you decide the latter, then you need to design your language so that your language constructs behave properly in the face of mutable state. Erlang, by taking an ad-hoc route, does neither, and the result is ugly and sloppy.


I've been dabbling in Erlang too, but coming from a java background. I love the pattern matching and process management, but I've also become hugely frustrated with the horrible conditional syntax and that "can't call functions from conditions" gripe you have. Disgusting!

There are two LZ algorithms, this is LZ78. LZ77 doesn't require maintaining a dictionary and the decompressor is as simple as for RLE. Of course most of the time it's followed up by some sort of entropy coding, which negates these points somewhat, but if you skip that and you're clever you can even avoid dealing with bit streams.

I guess selecting the algorithm is beyond the scope, but I just think they're very neat algorithms. :)

But the manager was actually pissed at me: I'd added a feature to our library - data compression - without getting permission. The right way of doing things was to write a proposal, and pass it around the various levels of petty bureaucrats for two months while I sat and twiddled my thumbs on the payroll.

...and this is why the average corporate programmer writes only 6200 lines of code per year, and Windows programmers 1000. (Yes, I know it's a cruddy metric.)

Perhaps you've said this in an old post... but is there a place I could acquire a (free?) Erlang compiler, just to try some of this stuff out?

Eric:

Go to "erlang.org". They've got a free compiler available for most versions of Unix, MacOS, and Windows.

I've an idea for putting states into functional programming, one that I haven't had the time to work out in detail or experiment with. The idea is to combine Pi-calculus (or some other process calculus) and Lambda-calculus. The way it would work is that Pi-calculus processes and channels are values of lambda expressions. Expressions would be evaluated lazily into processes as necessary. A program would, of course, be an expression that evaluates (if it evaluates at all) into a process. The 'world' would be presented to the program through a set of special channels.

For example, a process receiving from a channel would be of the form c(x). E, where c is a channel and E an expression that evaluates to a process (or possibly does not terminate). If the process is paired with a process that sends the expression E' to the channel c, this process continues as E[E'/x], which is evaluated as needed.

Very little Erlang code uses "if". "case" is far more
common. The usual incantation is:

case dict_contains(NewSeq, Dict) of
   true ->
      compress(NewSeq, Bytes, Dict);
   false ->
      [find_key(CurrentSequence, Dict)] ++
         compress([B], Bytes, add_to_dict(NewSeq, Dict))
end;

/Joe

By joe armstrong (not verified) on 07 Jan 2008 #permalink

Mark: Your pseudocode is *really* confusing, because of a variable swap and the fact you never specify what gets outputted. It should look like this (changes italicized):

Compress(chars):
   dict = new compression dictionary
   current = remove first char from chars
   for b in chars:
      if (current+b in dict OR current+b shorter than minlength):
         current = current + b
      else
         output dict.codeFor(current)
         dict.addNewCode(current+b)
         current = b
      end
   end
end

Hmm. I've thrown this together in Lisp for myself, and I get a slightly different answer for the compression of "abcabcabcabcabcabc" (6x abc) than you do.

You have [49,50,51,256,258,257,259,262,261,257]. I, however, get (97 98 99 256 258 257 259 262 257). The difference is that I don't have the 261 that appears as the second to last element in your encoding (the difference in the first three elements is obviously insignificant and can be ignored). Hand-decoding mine shows that it's correct, and tracing the execution of my algorithm shows it be working correctly as well. Is this just a typo?

  • The 'if' keyword evaluates guard sequences which are a subset of BIFs, defined (mostly) in the erlang module. Implementation issues aside there is no sensible motivation why there must be no user defined guards -- but as Joe Armstrong pointed out you can substitute 'if' with the 'case' syntax which evaluates patterns.
  • Instead of introducing a side-effect by using ets you could e.g. implement a gen_server behaviour and store the dictionary in its state. Ets ought to be faster though.
  • IMHO the main benefits of Erlang are not term storages or binary pattern matching but the architectural patterns for building robust and distributed systems -- which is quite a serious problem.