Entropy is a measure of information content in a stream of data, such as the pixels of an image.
A random event (such as the value of the next pixel in the stream) conveys information that was not known before the event occurred (i.e. before the next pixel was seen).
Suppose the next pixel value is in [0...255], so it can be stored in 8 bits.
Does the next pixel really give 8 bits of information? Not necessarily ...
Let $P(k)$ be the probability of a particular event. For example, $P(100)$ could be the probability that the next pixel in the stream has value 100.
Suppose $P(k) = {1 \over 256}$ for every $k$; in other words, all 256 pixel values are equally likely to appear next in the stream.
Then the information content of event $k$ is defined as
$I(k) = \log_2 \frac{1}{P(k)} = - \log_2 P(k) = - \log_2 \frac{1}{256} = 8 \textrm{ bits}$
The base 2 determines the unit in which information is measured (in this case, the bit).
Some examples: $P(k) = \frac12$ gives $I(k) = 1$ bit, $P(k) = \frac14$ gives $I(k) = 2$ bits, and $P(k) = 1$ gives $I(k) = 0$ bits.
So a higher-probability event conveys less information.
We will assume base-2 from now on.
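To make this concrete, here is a minimal Python sketch of the information-content formula (the function name information_content is just for illustration):

    import math

    def information_content(p):
        # Information content, in bits, of an event with probability p.
        return -math.log2(p)

    print(information_content(1 / 256))  # 8.0 bits: a uniformly random pixel in [0...255]
    print(information_content(1 / 2))    # 1.0 bit
    print(information_content(1.0))      # 0.0 bits: a certain event tells us nothing new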
Entropy, $H$, is the probability-weighted information content of a source of events:
$H = - \sum_k P(k) \log P(k)$
summed over all possible events, $k$.
The entropy is the average information content of the next event.
For example, consider three events, $A$, $B$, and $C$, with $P(A) = \frac18$, $P(B) = \frac14$, and $P(C) = \frac58$. Then the entropy of this set of events is
$\begin{array}{rl} H & = - \left( \frac18 \log_2 \frac18 + \frac14 \log_2 \frac14 + \frac58 \log_2 \frac58 \right) \\ & = - \left( \frac18 \times (-3) + \frac14 \times (-2) + \frac58 \times (-0.68) \right) \\ & \approx 1.3 \\ \end{array}$
Entropy is measured in the same units as the information content. In the example above, this is in bits.
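As a check on the example above, here is a minimal Python sketch of the entropy formula (the function name entropy is just for illustration):

    import math

    def entropy(probabilities):
        # Entropy, in bits, of a source whose events occur with the given probabilities.
        # Events with probability 0 contribute nothing, so they are skipped.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([1/8, 1/4, 5/8]))   # ~1.30 bits, matching the A, B, C example
    print(entropy([1/256] * 256))     # 8.0 bits: uniformly distributed pixel values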