CISC/CMPE 365 - Huffman Coding

Compression is an important aspect of signal encoding, as a more compressed signal will take a shorter time to transmit and will require less storage. Streaming videos can be heavily compressed ... sometimes too heavily. There are many methods of signal compression. We will consider one of them called Huffman Coding.

The problem is to take a stream of "tokens", each with a certain probability of appearing in the stream, and to encode each token with a variable-length string of 1/0 bits such that the stream of encoded tokens has minimum expected length.

In image processing, the tokens might be pixel values, interspersed with other symbols used to represent an image.

Let the set of possible tokens be $\{ a_1, a_2, \ldots, a_n \}$. Let the probability that $a_i$ appears in the stream be $P(a_i)$.

A stream of tokens might look like $$a_7\ a_1\ a_7\ a_2\ a_{10}\ a_7\ a_3\ a_1\ a_9\ \ldots$$

where you see that $P(a_7) > P(a_{10})$, at least in this part of the stream.

Let $\ell(a_i)$ be the length, in bits, of the 1/0 encoding of token $a_i$. Then we want an encoding that minimizes the expected encoded token length, $L$: $$E( L ) = \sum_i P(a_i)\ \ell(a_i)$$

Furthermore, we want an encoding in which the encoding of one token is not a prefix of the encoding of another token. For example, if we encoded $a_1$ as 10 and encoded $a_7$ as 1001, the encoding of $a_1$ would be a prefix of the encoding of $a_7$. After seeing 10 in the stream of bits, we would not know whether we just saw $a_1$, or are waiting for more bits that might be the encoding of $a_7$.

Huffman Coding has this property: It is a "prefix code".

Suppose we have these tokens, probabilities, and encodings:

Token $a_1$ $a_2$ $a_3$ $a_4$ $a_5$

Probability 0.10 0.10 0.15 0.33 0.32

Encoding 000 001 01 10 11

Token	$a_1$	$a_2$	$a_3$	$a_4$	$a_5$
Probability	0.10	0.10	0.15	0.33	0.32
Encoding	000	001	01	10	11

Then the string 10010011100001 would represent $a_4\ a_3\ a_2\ a_5\ a_1\ a_3$.

This encoding can be thought of as a binary tree in which the leaves correspond to the tokens, and each node (leaf or internal) stores the sum of probabilities in its descendant leaves:

The edges of the binary tree are labelled with 1 or 0, and the root-to-leaf path provides the encoding of a leaf token.

Huffman developed a greedy algorithm to compute this encoding:

Start with only the leaves of the tree. Then combine the two nodes of minimum probability and assign 1 and 0 to the left and right child edges. Repeat until no more nodes can be combined.

Above, this algorithm combines 0.10 and 0.10 to get 0.20, then combines 0.20 and 0.15 to get 0.35, then combines 0.33 and 0.32 to get 0.65, and finally combines 0.35 and 0.65 to get 1.00.

Huffman proved that this computes the minimum expected length prefix code. Note there are other encodings, such as "arithmetic encoding", the provide a lower expected length; but these are not prefix codes.

The algorithm for this would need to maintain a priority queue in which the elements are the nodes remaining for consideration. A priority queue allows insertion of a new element and removal of the minimum element in time ${\cal O}( \log n )$ for $n$ elements, so the algorithm would run in time ${\cal O}(n \log n)$ for $n$ tokens.

But we won't prove this.