"MIP" is from the Latin "multum in parvo", meaning "much in little".
The problem addressed by mip-maps is that many texels might project to a single pixel; such a pixel should be coloured with the average colour of all the texels that project to it.
A mip-map consists of a $2^k \times 2^k$ base texture map, plus a set of reduced texture maps of sizes
$2^{k-1} \times 2^{k-1}, \; \; 2^{k-2} \times 2^{k-2}, \; \; \ldots \; \; 4 \times 4, \; \; 2 \times 2, \; \; 1 \times 1$
Each texel in the $2^{i-1} \times 2^{i-1}$ map is the average colour of four texels (a $2 \times 2$ block) in the larger $2^i \times 2^i$ map.
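These levels are usually built by the driver (see the glGenerateMipmap reference below), but one halving step is easy to sketch. The fragment shader below could build the $2^{i-1} \times 2^{i-1}$ level by rendering into it with the $2^i \times 2^i$ level bound as a texture; the shader and its uniform name prevLevel are hypothetical, not the course code.

    #version 330

    uniform sampler2D prevLevel;   // the larger 2^i x 2^i map

    out vec4 colour;

    void main() {
        // This fragment is texel (x,y) of the smaller map; average the
        // 2x2 block of the larger map that it covers.
        ivec2 p = 2 * ivec2( gl_FragCoord.xy );
        colour = 0.25 * ( texelFetch( prevLevel, p,              0 )
                        + texelFetch( prevLevel, p + ivec2(1,0), 0 )
                        + texelFetch( prevLevel, p + ivec2(0,1), 0 )
                        + texelFetch( prevLevel, p + ivec2(1,1), 0 ) );
    }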
The averaging of texels is noticeable in the smaller (coarser) levels:
When looking up a texture value, the texture unit picks the mip-map level at which a texel projects to approximately the same size as a screen pixel.
Let $T_x$ and $T_y$ be vectors on the texture map that are the projections of the $x$ and $y$ pixel edges onto the texture map, measured in texels. That is, if $|T_x| = 3.5$, the projection of the pixel's horizontal edge onto the texture map is 3.5 texels wide, as in the image below.
Let $p = \textrm{max}( |T_x|, |T_y| )$.
Then the mip-map level to use is $\log_2 p$, since that is the number of times $p$ has to be divided by $2$ to become the size of one texel.
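For example, if $p = 8$, the pixel edge spans 8 texels of the base map and the level is $\log_2 8 = 3$: one texel of level 3 covers $8 \times 8$ texels of the base map, so at that level the pixel edge spans approximately one texel.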
Let the pixel coordinates on the screen be $(x,y)$. The pixel above has coordinates $(x,y+1)$ and the pixel to the right has coordinates $(x+1,y)$.
Let $T(x,y)$ be the texture coordinates for pixel coordinates $(x,y)$. Here, $T(x,y)$ is not the texture value; it is the texture coordinates that project to pixel coordinates $(x,y)$.
Then $T_x$ is approximately the amount that $T(x,y)$ changes as the pixel coordinate goes from $(x,y)$ to $(x+1,y)$. (This is approximate because $T_x$ may differ if measured at the top or bottom pixel edge.) In other words,
$T_x = \frac{\partial T(x,y)}{\partial x}$
Similarly,
$T_y = \frac{\partial T(x,y)}{\partial y}$
Recall that $T_x$ and $T_y$ are 2D vectors.
The GPU provides two functions, available only in the fragment shader, to compute the partial derivative of a quantity with respect to the $x$ and $y$ pixel coordinates: dFdx( quantity ) and dFdy( quantity ).
Using those GPU functions,
$T_x$ = dFdx( texCoords )
$T_y$ = dFdy( texCoords )
So the GPU can calculate the level as
$\begin{array}{rl} \log_2 p & = \log_2( \textrm{max}( |T_x|, |T_y| ) ) \\ & = \log_2( \textrm{max}( \sqrt{ T_x \cdot T_x }, \sqrt{ T_y \cdot T_y } ) ) \\ & = \log_2( \sqrt{ \textrm{max}( \; T_x \cdot T_x, \; T_y \cdot T_y \; ) } ) \\ & = 0.5 \; \log_2( \textrm{max}( \; T_x \cdot T_x, \; T_y \cdot T_y \; ) ) \end{array}$
See getMipMapLevel in 04-textures/wavefront.frag in openGL.zip.
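As a sketch of what such a function might look like (the actual code may differ): assume texCoords lies in $[0,1]^2$, so it must first be scaled to texel units with textureSize.

    uniform sampler2D tex;

    in vec2 texCoords;   // in [0,1]^2

    float getMipMapLevel() {
        // Scale to texel units so that Tx and Ty are measured in texels.
        vec2 texelCoords = texCoords * vec2( textureSize( tex, 0 ) );

        vec2 Tx = dFdx( texelCoords );   // projection of the pixel's x edge
        vec2 Ty = dFdy( texelCoords );   // projection of the pixel's y edge

        // 0.5 log2( max( Tx.Tx, Ty.Ty ) ) = log2( max( |Tx|, |Ty| ) )
        return 0.5 * log2( max( dot(Tx,Tx), dot(Ty,Ty) ) );
    }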
The texture unit can also perform bilinear interpolation within a mip-map level.
Alternatively, the texture unit can pick two adjacent mip-map levels between which the screen pixel best fits. Those would be $\lfloor \log_2 p \rfloor$ and $\lfloor \log_2 p \rfloor + 1$.
Then look up the texel using bilinear interpolation in level $\lfloor \log_2 p \rfloor$, and look up the texel using bilinear interpolation in level $\lfloor \log_2 p \rfloor+1$.
Finally, linearly interpolate between the two texel values in proportion to the fractional part of $\log_2 p$. This is trilinear interpolation.
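The texture unit does this in hardware when the minification filter is GL_LINEAR_MIPMAP_LINEAR, but the computation can be sketched by hand in a fragment shader using textureLod, which looks up a specific level. Here tex is assumed to use bilinear filtering within each level.

    uniform sampler2D tex;

    vec4 trilinear( vec2 uv, float log2p ) {
        float base = floor( log2p );

        // Bilinear lookup in each of the two adjacent levels.
        vec4 finer   = textureLod( tex, uv, base       );
        vec4 coarser = textureLod( tex, uv, base + 1.0 );

        // Blend in proportion to the fractional part of log2 p.
        return mix( finer, coarser, log2p - base );
    }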
See glGenerateMipmap in 04-textures/wavefront.cpp in openGL.zip.
Below, a mip-mapped texture appears on a surface. A pixel projects to an area on the surface, but the closest-fitting texel of the mip-map does not have the same shape as the projection of the pixel.
This causes the colour of the pixel to be incorrect: The pixel colour will include some texel colours that shouldn't appear in the pixel.
The pixel and its neighbours will appear to be blurred in the side-to-side direction, because that's the direction from which the incorrect texel colours are added to the pixel colour.
Mip-maps cannot avoid this problem because they always use a square area of the texture, even when the pixel projects to a non-square area.
Anisotropic filtering (AF) solves this problem by evaluating the texture from many different positions within the pixel, then averaging the texture values to get the pixel colour.
Since the texture is evaluated from positions within the pixel, only texture values from within the pixel's projection will be used. Here is an example of "8x AF", where eight samples are used. The sampling pattern is just an example.
With anisotropic filtering, each of the individual texture evaluations (of which there are eight in the example above) can itself use one of the texture lookup methods, such as "nearest", "bilinear", or "mipmap". A mip-map lookup samples a larger area of the texture, but is more costly to compute.
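In practice, AF is enabled in the texture unit (for example, through the GL_EXT_texture_filter_anisotropic extension) rather than written by hand. Still, a hand-rolled fragment-shader sketch shows the idea: spread the samples along the longer of the two projected pixel edges, and choose the mip-map level from the shorter one. The function name and sample placement below are illustrative only.

    uniform sampler2D tex;

    in vec2 texCoords;

    vec4 aniso8() {
        vec2 texSize = vec2( textureSize( tex, 0 ) );

        vec2 Tx = dFdx( texCoords );   // projected pixel edges, in [0,1] texture units
        vec2 Ty = dFdy( texCoords );

        // The longer projected edge is the direction of anisotropy; the
        // shorter one determines the mip-map level of each sample.
        bool xIsLonger = dot(Tx,Tx) > dot(Ty,Ty);
        vec2 longAxis  = xIsLonger ? Tx : Ty;
        vec2 shortAxis = xIsLonger ? Ty : Tx;

        vec2  s     = shortAxis * texSize;      // shorter edge in texel units
        float level = 0.5 * log2( dot(s,s) );   // as in getMipMapLevel above

        // Eight samples spread across the pixel along the longer edge.
        vec4 sum = vec4( 0.0 );
        for (int i = 0; i < 8; i++) {
            float t = (float(i) + 0.5) / 8.0 - 0.5;   // in (-0.5, 0.5)
            sum += textureLod( tex, texCoords + t * longAxis, level );
        }
        return sum / 8.0;
    }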