Information Theory can tell us how well we have encoded an image by giving a lower bound on the average number of bits per pixel.
A random event, $k$, conveys information. $k$ could be the reading of the next pixel in a stream of pixels that encode an image. Until we read the next pixel, its value is not known, so we can think of the next value as random (perhaps with some non-uniform probability distribution).
Let $P(k)$ = probability of $k$ occurring, or the probability that the next pixel will have value $k$.
How much information does $k$ give us?
For example, consider that the next pixel is in [0...255]. If all pixel values are equally likely, $P(k) = {1 \over 256}$ for any pixel value, $k$, and reading the next pixel gives us 8 bits of information.
In general, the information gained from event $k$ is $$\begin{array}{rl} I(k) & = \log_2 {1 \over P(k)} \\ & = - \log_2 P(k) \\ \end{array}$$
If $P(k) = {1 \over 256}$ then $I(k) = 8$ bits.
The base 2 determines the unit of information (the bit in this case).
Example: Base 10 could have the decimal digit as the unit of information. $P(k) = {1 \over 256}$ gives $I(k) = - \log_{10} {1 \over 256} = 2.408$ decimal digits.
Example: Suppose the event always produces the same value (i.e. $P(k) = 1$). Then $I(k) = - \log_2 P(k) = 0$, so this event has no information content.
Example: Suppose the event has one of four values, equally likely. Then $I(k) = - \log_2 P(k) = 2$ bits of information.
Example: Suppose $P(1) = 0.1$, $P(2) = 0.4$, and $P(3) = 0.5$. Then $$\begin{array}{l} I(1) = - \log_2 0.1 = 3.322 \textrm{ bits} \\ I(2) = - \log_2 0.4 = 1.322 \textrm{ bits} \\ I(3) = - \log_2 0.5 = 1.000 \textrm{ bit} \\ \end{array}$$
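A minimal Python sketch that checks these numbers (the helper name `information` is purely illustrative):

```python
import math

def information(p, base=2):
    """Information conveyed by an event of probability p, in units set by the base."""
    return math.log(1.0 / p, base)

print(information(1 / 256))           # 8 bits: uniform pixel value in [0..255]
print(information(1 / 256, base=10))  # ~2.408 decimal digits
print(information(1.0))               # 0: a certain event carries no information
print(information(0.25))              # 2 bits: one of four equally likely values
for p in (0.1, 0.4, 0.5):
    print(p, information(p))          # ~3.322, ~1.322, 1.0 bits
```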
So a higher-probability event conveys less information than a lower-probability event.
We will assume base 2 from now on.
Entropy is the probability-weighted information content of a source of events.
Entropy can be thought of as the average information content of the next event, where "average" is weighted according to the probability distribution of the events.
Entropy is denoted $H$ and defined as $$H = \sum_k P(k)\ I(k)$$
and is almost always written as $$H = - \sum_k P(k)\ \log P(k) .$$
Example: Suppose the event has one of four values, equally likely. Then $I(k) = - \log_2 P(k) = 2$ and $$H = \sum_k P(k)\ I(k) = \sum_k 0.25 \times 2 = 2$$
Example: Suppose $P(1) = 0.1$, $P(2) = 0.4$, and $P(3) = 0.5$. Then $$H = \sum_k P(k)\ I(k) = 0.1 \times 3.322 + 0.4 \times 1.322 + 0.5 \times 1.000 = 1.361$$
So, on average, the event produces 1.361 bits of information: sometimes more (3.322 or 1.322 bits) and sometimes less (1.000 bit).
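A short sketch of the entropy formula itself, assuming Python's standard `math` module (the `entropy` helper is illustrative):

```python
import math

def entropy(probs):
    """Average information content, in bits per event, of a source with distribution probs."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equally likely values
print(entropy([0.1, 0.4, 0.5]))           # ~1.361 bits, matching the worked example above
```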
This interpretation assumes that the events are independent (i.e. that pixels occur randomly, with no spatial redundancy).
That is not true in general, so compression that exploits spatial redundancy can usually do better than $H$ bits per pixel.
An image's histogram can be used to compute the image's entropy.
Given $N$ pixels in all, and a histogram, $h$, in which $h(k)$ is the number of pixels with value $k$, the probability that the next pixel is $k$ is $$P(k) = {h(k) \over N}$$
So the image's entropy is $$H = - \sum_k P(k) \log {h(k) \over N}$$
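A sketch of that computation, assuming an 8-bit grayscale image held in a NumPy array (the function name `image_entropy` and the synthetic test image are assumptions for illustration):

```python
import numpy as np

def image_entropy(img):
    """Entropy in bits per pixel, estimated from the histogram P(k) = h(k) / N."""
    h, _ = np.histogram(img, bins=256, range=(0, 256))  # h(k) for k = 0..255
    p = h[h > 0] / img.size                             # P(k) = h(k) / N, skipping empty bins
    return -np.sum(p * np.log2(p))

# Synthetic test: uniformly random 8-bit pixels should give close to 8 bits/pixel.
img = np.random.randint(0, 256, size=(128, 128))
print(image_entropy(img))
```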
The entropy, $H$, is a lower bound on the average number of bits to represent a pixel, but only when the pixels are independent.
Which has higher entropy: a low-contrast image or a high-contrast image? Consider the image histograms.
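One way to explore the question numerically, reusing the `image_entropy` sketch above on two synthetic images (the pixel-value ranges below are arbitrary assumptions):

```python
import numpy as np

# Low contrast: values bunched near mid-gray.  High contrast: values spread across [0, 255].
low  = np.random.randint(118, 138, size=(128, 128))
high = np.random.randint(0, 256, size=(128, 128))
print(image_entropy(low), image_entropy(high))
```

Under these assumptions, the spread-out (high-contrast) histogram gives the higher entropy, since its pixel values are closer to equally likely.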