LZW encoding is named for its developers, Abraham Lempel, Jacob Ziv, and Terry Welch.
This is a method of encoding a stream of symbols and was used in the GIF image format and in an early file compression utility on Unix. Since then, there have been many related compression methods. One of those, the DEFLATE method (not discussed here), is used for PNG images and in the zip and gzip utilities.
LZW encoding was patented in 1985 by Unisys; that patent expired in 2004. The method was used by CompuServe from 1987 for the Graphics Interchange Format (GIF) image format. The Portable Network Graphics (PNG) image format was developed in 1996 (and later) to avoid the use of the patented GIF.
The main idea is to dynamically build a dictionary of subsequences of the symbol stream while the symbol stream is being processed.
Then subsequences of the system stream are replaced by their index into the dictionary, which is usually shorter than the subsequence itself.
The encoding algorithm is shown below, with the notation s+x meaning "sequence s with symbol x appended".
X = symbol sequence <x1, x2, ... xn> D = dictionary of symbol sequences, initially containing all one-symbol sequences s = empty symbol sequence <> for x in x1, x2, ..., xn if s+x is in D s = s+x else output the dictionary index of s add s+x to D s = <x> output dictionary index of s
The algorithm builds up sequence s by appending (in the 'if' clause) as many symbols as possible until appending the next symbol, x, would result in a sequence that is not in the dictionary.
Then the algorithm outputs (in the 'else' clause) the index of s, which is the sequence known to be in the dictionary. Since s+x is not in the dictionary, s+x is added to the dictionary.
Since the index of s was output just before, it remains to output something to represent x. So x is used as the first element in a new sequence, s. That new sequence will eventually be output, so x will eventually be output.
After the 'for' loop terminates, there may still be a sequence, s, that has not been output. So output the dictionary index of s.
The possible symbols are a, b, d, n, and _ (underscore).
The sequence to encode is X = "banana_bandana", consisting of 14 symbols.
The dictionary initially contains
index | sequence |
---|---|
0 | a |
1 | b |
2 | d |
3 | n |
4 | _ |
The encoding proceeds as follows:
s before | x | s+x in D | s after | output | dictionary new entry |
---|---|---|---|---|---|
b | yes | b | |||
b | a | no | a | 1 | 5: ba |
a | n | no | n | 0 | 6: an |
n | a | no | a | 3 | 7: na |
a | n | yes | an | ||
an | a | no | a | 6 | 8: ana |
a | _ | no | _ | 0 | 9: a_ |
_ | b | no | b | 4 | 10: _b |
b | a | yes | ba | ||
ba | n | no | n | 5 | 11: ban |
n | d | no | d | 3 | 12: nd |
d | a | no | a | 2 | 13: da |
a | n | yes | an | ||
an | a | yes | ana | ||
ana | 8 |
After the last symbol, "a", is read, the final sequence is "ana", so index 8 is output.
The sequence "banana_bandana" is encoded as 1,0,3,6,0,4,5,3,2,8.
For decoding a sequence of dictionary indices, we start with the same dictionary as the encoder.
As indices are processed, this dictionary is updated, but usually one step behind the updates of the encoder.
I = index sequence <i1, i2, ... ik> D = dictionary of symbol sequences, initially containing all one-symbol sequences s = D[i1] # s stores the sequence of the # most recently-received index output s for i in i2, i3, ..., ik # t stores the sequence to output if i is in D t = D[i] add s + first_symbol(t) to D else t = s + first_symbol(s) add t to D output t s = t
There are two cases in each iteration:
For an index that is in the dictionary (i.e. "i is in D"): We look up that index's sequence from the dictionary and output that sequence. Then we add a new entry to the dictionrary, which is the previously output sequence plus the first symbol of the next sequence. That corresponds to
add s+x to D
in the encoder, where s is the previously output sequence and x is the first symbol of the next sequence; this next sequence is started on the
s = <x>
line of the encoder. At that point in the encoder, the sequence has only one symbol, x, but will later get more symbols. In the final sequence, x will be the first symbol, hence first_symbol(t) in the decoder.
For an index that is not in the dictionary, the decoder still knows what that index should hold. Here's how this happens:
So we know that the next index output from the encoder will be for a sequence that starts with x.
So $k$ is not recognized only if it is the next index output immediately after s+x was added to the dictionary at index $k$.
Here's an example:
The sequence is "abababab".
The dictionary initially contains
index | sequence |
---|---|
0 | a |
1 | b |
The encoding proceeds as follows:
s before | x | s+x in D | s after | output | dictionary new entry |
---|---|---|---|---|---|
a | yes | a | |||
a | b | no | b | 0 | 2: ab |
b | a | no | a | 1 | 3: ba |
a | b | yes | ab | ||
ab | a | no | a | 2 | 4: aba |
a | b | yes | ab | ||
ab | a | yes | aba | ||
aba | b | no | b | 4 | 5: abab |
b | 1 |
The sequence "abababab" is encoded as 1,0,2,4,1.
In the decoder, this sequence of indices is processed as follows:
s before | i | i is in D | t after | output | dictionary new entry |
---|---|---|---|---|---|
0 | a | ||||
a | 1 | yes | b | b | 2: ab |
b | 2 | yes | ab | ab | 3: ba |
ab | 4 | no | aba | aba | 4: aba |
aba | 1 | yes | b | b |
The "no" line above corresponds to an index that is not in the decoder's dictionary.