up to Schedule & Notes

LZW Encoding

LZW encoding is named for its developers, Abraham Lempel, Jacob Ziv, and Terry Welch.

This is a method of encoding a stream of symbols and was used in the GIF image format and in an early file compression utility on Unix. Since then, there have been many related compression methods. One of those, the DEFLATE method (not discussed here), is used for PNG images and in the zip and gzip utilities.

LZW encoding was patented in 1985 by Unisys; that patent expired in 2004. The method was used by CompuServe from 1987 for the Graphics Interchange Format (GIF) image format. The Portable Network Graphics (PNG) image format was developed in 1996 (and later) to avoid the use of the patented GIF.

LZW Encoding Algorithm

The main idea is to dynamically build a dictionary of subsequences of the symbol stream while the symbol stream is being processed.

Then subsequences of the system stream are replaced by their index into the dictionary, which is usually shorter than the subsequence itself.

The encoding algorithm is shown below, with the notation s+x meaning "sequence s with symbol x appended".

            X = symbol sequence <x1, x2, ... xn>

            D = dictionary of symbol sequences, initially
	        containing all one-symbol sequences

            s = empty symbol sequence <>

            for x in x1, x2, ..., xn
                if s+x is in D
                    s = s+x
                else
                    output the dictionary index of s
                    add s+x to D
                    s = <x>

            output dictionary index of s

The algorithm builds up sequence s by appending (in the 'if' clause) as many symbols as possible until appending the next symbol, x, would result in a sequence that is not in the dictionary.

Then the algorithm outputs (in the 'else' clause) the index of s, which is the sequence known to be in the dictionary. Since s+x is not in the dictionary, s+x is added to the dictionary.

Since the index of s was output just before, it remains to output something to represent x. So x is used as the first element in a new sequence, s. That new sequence will eventually be output, so x will eventually be output.

After the 'for' loop terminates, there may still be a sequence, s, that has not been output. So output the dictionary index of s.

Encoding Example

The possible symbols are a, b, d, n, and _ (underscore).

The sequence to encode is X = "banana_bandana", consisting of 14 symbols.

The dictionary initially contains
indexsequence
0a
1b
2d
3n
4_

The encoding proceeds as follows:

s beforexs+x in Ds afteroutputdictionary
new entry
byesb
b ano a 15: ba
a nno n 06: an
n ano a 37: na
a nyesan
an ano a 68: ana
a _no _ 09: a_
_ bno b 410: _b
b ayesba
ba nno n 511: ban
n dno d 312: nd
d ano a 213: da
a nyesan
an ayesana
ana 8

After the last symbol, "a", is read, the final sequence is "ana", so index 8 is output.

The sequence "banana_bandana" is encoded as 1,0,3,6,0,4,5,3,2,8.

LZW Decoding Algorithm

For decoding a sequence of dictionary indices, we start with the same dictionary as the encoder.

As indices are processed, this dictionary is updated, but usually one step behind the updates of the encoder.

            I = index sequence <i1, i2, ... ik>

            D = dictionary of symbol sequences, initially
	        containing all one-symbol sequences

            s = D[i1]   # s stores the sequence of the
	                # most recently-received index
            output s

            for i in i2, i3, ..., ik

                # t stores the sequence to output

                if i is in D
                    t = D[i]
                    add s + first_symbol(t) to D
                else
                    t = s + first_symbol(s)
                    add t to D

                output t
                s = t

There are two cases in each iteration:

Case 1

For an index that is in the dictionary (i.e. "i is in D"): We look up that index's sequence from the dictionary and output that sequence. Then we add a new entry to the dictionrary, which is the previously output sequence plus the first symbol of the next sequence. That corresponds to

add s+x to D

in the encoder, where s is the previously output sequence and x is the first symbol of the next sequence; this next sequence is started on the

s = <x>

line of the encoder. At that point in the encoder, the sequence has only one symbol, x, but will later get more symbols. In the final sequence, x will be the first symbol, hence first_symbol(t) in the decoder.

Case 2

For an index that is not in the dictionary, the decoder still knows what that index should hold. Here's how this happens:

  1. Initially, the encoder and decoder both have the same dictionary with indices $0 \ldots k-1$.
  2. In the encoder: With the next x, sequence s+x is not in the dictionary, so the index of s is output, s+x is added to the dictionary at index $\mathbf{k}$, and x is used as the first symbol in the next sequence:

    previous seqence s, plus new symbol x

    So we know that the next index output from the encoder will be for a sequence that starts with x.

  3. If the next index output from the encoder is not $k$, the decoder will add s + first_symbol(t) to the decoder's dictionary at index $k$ and $k$ will henceforth be recognized. But that's not what happened.

    So $k$ is not recognized only if  it is the next index output immediately after s+x was added to the dictionary at index $k$.

  4. If the next index output from the encoder is $k$, the decoder knows that sequence s is immediately followed by sequence s+x. Since the first symbol in the next sequence is x (from step 2 above), the decoder can extract x as the first symbol of s. Then decoder can add s+x to its dictionary at index $k$.

    first symbol of s must be x

Here's an example:

The sequence is "abababab".

The dictionary initially contains
indexsequence
0a
1b

The encoding proceeds as follows:

s beforexs+x in Ds afteroutputdictionary
new entry
ayesa
a bno b 02: ab
b ano a 13: ba
a byesab
ab ano a 24: aba
a byesab
ab ayesaba
ababno b 45: abab
b 1

The sequence "abababab" is encoded as 1,0,2,4,1.

In the decoder, this sequence of indices is processed as follows:
s beforeii is in Dt afteroutputdictionary
new entry
0 a
a 1yesb b 2: ab
b 2yesab ab 3: ba
ab 4no abaaba4: aba
aba1yesb b

The "no" line above corresponds to an index that is not in the decoder's dictionary.


up to Schedule & Notes