CISC 457 - LZW Encoding

LZW encoding is named for its developers, Abraham Lempel, Jacob Ziv, and Terry Welch.

This is a method of encoding a stream of symbols and was used in the GIF image format and in an early file compression utility on Unix. Since then, there have been many related compression methods. One of those, the DEFLATE method (not discussed here), is used for PNG images and in the zip and gzip utilities.

LZW encoding was patented in 1985 by Unisys; that patent expired in 2004. The method was used by CompuServe from 1987 for the Graphics Interchange Format (GIF) image format. The Portable Network Graphics (PNG) image format was developed in 1996 (and later) to avoid the use of the patented GIF.

LZW Encoding Algorithm

The main idea is to dynamically build a dictionary of subsequences of the symbol stream while the symbol stream is being processed.

Then subsequences of the system stream are replaced by their index into the dictionary, which is usually shorter than the subsequence itself.

The encoding algorithm is shown below, with the notation s+x meaning "sequence s with symbol x appended".

            X = symbol sequence <x1, x2, ... xn>

            D = dictionary of symbol sequences, initially
	        containing all one-symbol sequences

            s = empty symbol sequence <>

            for x in x1, x2, ..., xn
                if s+x is in D
                    s = s+x
                else
                    output the dictionary index of s
                    add s+x to D
                    s = <x>

            output dictionary index of s

The algorithm builds up sequence s by appending (in the 'if' clause) as many symbols as possible until appending the next symbol, x, would result in a sequence that is not in the dictionary.

Then the algorithm outputs (in the 'else' clause) the index of s, which is the sequence known to be in the dictionary. Since s+x is not in the dictionary, s+x is added to the dictionary.

Since the index of s was output just before, it remains to output something to represent x. So x is used as the first element in a new sequence, s. That new sequence will eventually be output, so x will eventually be output.

After the 'for' loop terminates, there may still be a sequence, s, that has not been output. So output the dictionary index of s.

Encoding Example

The possible symbols are a, b, d, n, and _ (underscore).

The sequence to encode is X = "banana_bandana", consisting of 14 symbols.

The dictionary initially contains

index sequence

0 a

1 b

2 d

3 n

4 _

index	sequence
0	a
1	b
2	d
3	n
4	_

The encoding proceeds as follows:

s before x s+x in D s after output dictionary
new entry

b yes b

b a no a 1 5: ba

a n no n 0 6: an

n a no a 3 7: na

a n yes an

an a no a 6 8: ana

a _ no _ 0 9: a_

_ b no b 4 10: _b

b a yes ba

ba n no n 5 11: ban

n d no d 3 12: nd

d a no a 2 13: da

a n yes an

an a yes ana

ana 8

s before	x	s+x in D	s after	output	dictionary new entry
	b	yes	b
b	a	no	a	1	5: ba
a	n	no	n	0	6: an
n	a	no	a	3	7: na
a	n	yes	an
an	a	no	a	6	8: ana
a	_	no	_	0	9: a_
_	b	no	b	4	10: _b
b	a	yes	ba
ba	n	no	n	5	11: ban
n	d	no	d	3	12: nd
d	a	no	a	2	13: da
a	n	yes	an
an	a	yes	ana
ana				8

After the last symbol, "a", is read, the final sequence is "ana", so index 8 is output.

The sequence "banana_bandana" is encoded as 1,0,3,6,0,4,5,3,2,8.

LZW Decoding Algorithm

For decoding a sequence of dictionary indices, we start with the same dictionary as the encoder.

As indices are processed, this dictionary is updated, but usually one step behind the updates of the encoder.

            I = index sequence <i1, i2, ... ik>

            D = dictionary of symbol sequences, initially
	        containing all one-symbol sequences

            s = D[i1]   # s stores the sequence of the
	                # most recently-received index
            output s

            for i in i2, i3, ..., ik

                # t stores the sequence to output

                if i is in D
                    t = D[i]
                    add s + first_symbol(t) to D
                else
                    t = s + first_symbol(s)
                    add t to D

                output t
                s = t

There are two cases in each iteration:

Case 1

For an index that is in the dictionary (i.e. "i is in D"): We look up that index's sequence from the dictionary and output that sequence. Then we add a new entry to the dictionrary, which is the previously output sequence plus the first symbol of the next sequence. That corresponds to

add s+x to D

in the encoder, where s is the previously output sequence and x is the first symbol of the next sequence; this next sequence is started on the

s = <x>

line of the encoder. At that point in the encoder, the sequence has only one symbol, x, but will later get more symbols. In the final sequence, x will be the first symbol, hence first_symbol(t) in the decoder.

Case 2

For an index that is not in the dictionary, the decoder still knows what that index should hold. Here's how this happens:

Initially, the encoder and decoder both have the same dictionary with indices $0 \ldots k-1$.
In the encoder: With the next x, sequence s+x is not in the dictionary, so the index of s is output, s+x is added to the dictionary at index $\mathbf{k}$, and x is used as the first symbol in the next sequence:

So we know that the next index output from the encoder will be for a sequence that starts with x.
If the next index output from the encoder is not $k$, the decoder will add s + first_symbol(t) to the decoder's dictionary at index $k$ and $k$ will henceforth be recognized. But that's not what happened.
So $k$ is not recognized only if it is the next index output immediately after s+x was added to the dictionary at index $k$.
If the next index output from the encoder is $k$, the decoder knows that sequence s is immediately followed by sequence s+x. Since the first symbol in the next sequence is x (from step 2 above), the decoder can extract x as the first symbol of s. Then decoder can add s+x to its dictionary at index $k$.

Here's an example:

The sequence is "abababab".

The dictionary initially contains

index sequence

0 a

1 b

index	sequence
0	a
1	b

The encoding proceeds as follows:

s before x s+x in D s after output dictionary
new entry

a yes a

a b no b 0 2: ab

b a no a 1 3: ba

a b yes ab

ab a no a 2 4: aba

a b yes ab

ab a yes aba

aba b no b 4 5: abab

b 1

s before	x	s+x in D	s after	output	dictionary new entry
	a	yes	a
a	b	no	b	0	2: ab
b	a	no	a	1	3: ba
a	b	yes	ab
ab	a	no	a	2	4: aba
a	b	yes	ab
ab	a	yes	aba
aba	b	no	b	4	5: abab
b				1

The sequence "abababab" is encoded as 1,0,2,4,1.

In the decoder, this sequence of indices is processed as follows:

s before i i is in D t after output dictionary
new entry

0 a

a 1 yes b b 2: ab

b 2 yes ab ab 3: ba

ab 4 no aba aba 4: aba

aba 1 yes b b

s before	i	i is in D	t after	output	dictionary new entry
	0			a
a	1	yes	b	b	2: ab
b	2	yes	ab	ab	3: ba
ab	4	no	aba	aba	4: aba
aba	1	yes	b	b

The "no" line above corresponds to an index that is not in the decoder's dictionary.