The Rendering Equation describes how light reflects off a surface at a point $x$ with surface normal $N$:
$L(\oo,x) = \int_\Omega f(x,\oi,\oo) \; \cos\theta_\textrm{in} \; L(\oi,x) \; d\oi$
where $f(x,\oi,\oo)$ is the BRDF at $x$, $L(\oi,x)$ is the radiance arriving at $x$ from direction $\oi$, $\theta_\textrm{in}$ is the angle between $\oi$ and the normal $N$, and $\Omega$ is the hemisphere of directions above the surface.
Radiance is measured in Watts/m$^2$/steradian, where a steradian is a measure of solid (i.e. 3D) angle.
For more information, please see the supplementary notes on Introduction to Illumination.
Note that the position, $x$, and the outgoing direction, $\oo$, are constants inside the integral.
Ideally, this would be evaluated at each ray/surface intersection during raytracing. However, integrating over all of $\Omega$ by sampling many directions, $\oi \in \Omega$, is very expensive, since each sample requires a recursive call to the raytracer.
Below, we'll see how to do this more efficiently. That is, we'll see how to choose directions, $\oi$, to reduce the variance of our estimate of $L(\oo,x)$.
The central problem is to evaluate a generic integral like
$F = \int_D f(x) \; dx$
This $f$ has no relation to the BRDF above; it's just a generic function to be integrated over a domain, $D$.
Monte Carlo integration estimates $F$ by evaluating $f(x)$ at randomly chosen places, $x \in D$.
The simplest Monte Carlo estimator uses $N$ samples drawn uniformly at random from $D$:
$\widehat{F} = |D| \; {1 \over N} \sum_i f(x_i)$
where $|D| = \int_D \; dx$ is the size of the domain.
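As a concrete sketch, here is the uniform estimator in Python (the function name `mc_uniform` and the test integrand are ours, for illustration):

```python
import random

def mc_uniform(f, a, b, n):
    """Estimate the integral of f over [a, b] with n uniform samples."""
    total = 0.0
    for _ in range(n):
        x = random.uniform(a, b)
        total += f(x)
    return (b - a) * total / n   # |D| * (1/N) * sum_i f(x_i)

# Example: integrate x^2 over [0, 1]; the exact value is 1/3.
random.seed(0)
estimate = mc_uniform(lambda x: x * x, 0.0, 1.0, 100_000)
```

With $10^5$ samples the estimate is typically within a few thousandths of the true value; the error shrinks like $1/\sqrt{N}$.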
This estimator is unbiased because its expected value is the true integral:
$\begin{array}{rcll} E(\widehat{F}) & = & E( |D| \; {1 \over N} \sum_i f(x_i) ) \\ & = & |D| \; {1 \over N} \sum_i E( f(x_i) ) & \textrm{by linearity of expectation}\\ & = & |D| \; E( f(x) ) & \textrm{since all of these expected values are equal}\\ & = & |D| \; \int_D f(x) \; p(x) \; dx & \textrm{by the definition of expectation $E()$}\\ & = & |D| \; \int_D f(x) \; {1 \over |D|} \; dx & \textrm{since we are doing uniform sampling}\\ & = & \int_D f(x) \; dx \\ \end{array}$
With Importance Sampling, the samples are chosen according to a non-uniform distribution. The non-uniform distribution gives more probability to samples that contribute more to the integral. These samples are more important, hence the name.
If we sample $x_i$ according to a probability density function, $p(x)$, the estimator is:
$\widehat{F_p} = {1 \over N} \sum_i {f(x_i) \over p(x_i)}$
Note that uniform sampling has $p(x) = {1 \over |D|}$, so $\widehat{F_p}$ is a generalization of the uniformly sampled estimator, $\widehat{F}$, above.
This estimator is unbiased:
$\begin{array}{rcll} E(\widehat{F_p}) & = & E( {1 \over N} \sum_i {f(x_i) \over p(x_i)} ) \\ & = & {1 \over N} \sum_i E( {f(x_i) \over p(x_i)} ) & \textrm{by linearity of expectation}\\ & = & E( {f(x) \over p(x)} ) & \textrm{since all of these expected values are equal} \\ & = & \int_D {f(x) \over p(x)} \; p(x) \; dx & \textrm{from the definition of expectation $E()$} \\ & = & \int_D f(x) \; dx \\ \end{array}$
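A minimal Python sketch of this estimator, assuming the caller supplies a sampler and its PDF (the names `sample_p` and `pdf_p` are ours):

```python
import random

def mc_importance(f, sample_p, pdf_p, n):
    """Importance-sampled estimate: (1/N) * sum_i f(x_i) / p(x_i),
    where the x_i are drawn from the distribution with density pdf_p."""
    total = 0.0
    for _ in range(n):
        x = sample_p()
        total += f(x) / pdf_p(x)
    return total / n

# Example: integrate f(x) = 3x^2 over [0, 1]; the exact value is 1.
# We sample with p(x) = 2x, which puts more samples where f is large.
# Inverse-CDF sampling: CDF(x) = x^2, so x = sqrt(u) for u ~ U(0,1).
random.seed(0)
est = mc_importance(lambda x: 3 * x * x,
                    sample_p=lambda: random.random() ** 0.5,
                    pdf_p=lambda x: 2 * x,
                    n=10_000)
```

Here the ratio $f(x)/p(x) = 1.5x$ varies much less than $f(x)$ itself over $[0,1]$, so the estimate converges faster than uniform sampling with the same $N$.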
With importance sampling, the sampling PDF, $p(x)$, is chosen to reduce the variance of $\widehat{F_p}$ by reducing the variance of the integrand, ${f(x) \over p(x)}$.
For example, if $f(x)$ is known, we could choose $p(x)$ proportional to $f(x)$, i.e. $p(x) = {f(x) \over F}$. Then the integrand, ${f(x) \over p(x)} = F$, would be constant, so the variance would be zero.
In this case, a single sample would be sufficient to estimate $\int f(x) \; dx$ exactly. However, normalizing $p(x)$ this way already requires knowing $F$, and sampling according to $p(x)$ requires knowing the cumulative distribution function of $p(x)$. For a one-dimensional domain:
$\textrm{CDF}_p( x ) = \int_{-\infty}^x p(t) \; dt$
and this is as hard as the original problem, $\int f(x) \; dx$.
So we can only sample according to PDFs for which we can easily build the CDF.
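When the CDF has no closed-form inverse but the PDF is cheap to evaluate, a common practical approach is to tabulate the CDF numerically and invert it with a binary search. A Python sketch (function names are ours, not from the notes):

```python
import bisect
import random

def make_tabulated_sampler(pdf, a, b, m=1024):
    """Build an approximate inverse-CDF sampler for pdf on [a, b]
    by tabulating the CDF at m cells and inverting with binary search."""
    dx = (b - a) / m
    cdf = [0.0]
    for i in range(m):
        x_mid = a + (i + 0.5) * dx       # midpoint rule per cell
        cdf.append(cdf[-1] + pdf(x_mid) * dx)
    total = cdf[-1]                      # normalizing constant
    cdf = [c / total for c in cdf]       # now a proper CDF ending at 1

    def sample():
        u = random.random()
        i = max(1, bisect.bisect_left(cdf, u))  # first cell with CDF >= u
        lo, hi = cdf[i - 1], cdf[i]
        t = (u - lo) / (hi - lo) if hi > lo else 0.0
        return a + (i - 1 + t) * dx      # interpolate within the cell
    return sample

# Sample p(x) proportional to x on [0, 1]; this distribution's mean is 2/3.
random.seed(0)
draw = make_tabulated_sampler(lambda x: x, 0.0, 1.0)
mean = sum(draw() for _ in range(20_000)) / 20_000
```

Note that tabulation also sidesteps the normalization problem: the table only needs the PDF up to a constant, since the cumulative sums are normalized at the end.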
The integrand of the rendering equation is
$f(x,\oi,\oo) \; \cos\theta_\textrm{in} \; L(\oi,x)$
Recall that this is integrated over incoming directions, $\oi \in \Omega$, so the position, $x$, and outgoing direction, $\oo$, are constants.
Determining $L(\oi,x)$ for all $\oi$ requires many expensive raytracing steps, so it is not feasible to base $p(\oi)$ on $L(\oi,x)$.
But $\cos\theta_\textrm{in}$ is easy to evaluate, and $f(x,\oi,\oo)$ is sometimes easy to evaluate. For example, with purely diffuse shading,
$f(x,\oi,\oo) = {\rho \over \pi}$
for surface albedo $\rho$. This is constant because it does not depend on the incoming direction, $\oi$.
So we could choose $p(\oi) = {\cos\theta_\textrm{in} \over \pi}$ as the sampling PDF. (The factor ${1 \over \pi}$ normalizes the PDF, since $\int_\Omega \cos\theta_\textrm{in} \; d\oi = \pi$.) This is good because the $\cos\theta_\textrm{in}$ factor cancels in
${f(x,\oi,\oo) \; \cos\theta_\textrm{in} \; L(\oi,x) \over p(\oi)} = \pi \; f(x,\oi,\oo) \; L(\oi,x)$
which in general has less variance than the original integrand
$f(x,\oi,\oo) \; \cos\theta_\textrm{in} \; L(\oi,x)$
sampled uniformly (although this depends on $L(\oi,x)$).
Sampling according to $\cos\theta_\textrm{in}$ gives higher probability to directions near the surface normal. Radiance coming from these directions makes a larger contribution to the integral over $\Omega$ because of the $\cos\theta_\textrm{in}$ term in the integrand.
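Cosine-weighted directions can be generated by sampling a unit disk uniformly and projecting up onto the hemisphere (Malley's method). A Python sketch in the local frame where the normal is $+z$ (the function name is ours):

```python
import math
import random

def sample_cosine_hemisphere():
    """Draw a unit direction in the local frame (normal = +z) with
    pdf p(w) = cos(theta) / pi, by uniformly sampling a disk and
    projecting up onto the hemisphere (Malley's method)."""
    u1, u2 = random.random(), random.random()
    r = math.sqrt(u1)                    # disk radius
    phi = 2.0 * math.pi * u2             # disk angle
    x = r * math.cos(phi)
    y = r * math.sin(phi)
    z = math.sqrt(max(0.0, 1.0 - u1))    # cos(theta), since x^2 + y^2 = u1
    return (x, y, z)

# With p = cos/pi and a diffuse BRDF, each sample's contribution reduces to
# (rho/pi) * cos * L / (cos/pi) = rho * L : the cosine cancels entirely.
random.seed(0)
dirs = [sample_cosine_hemisphere() for _ in range(50_000)]
mean_cos = sum(d[2] for d in dirs) / len(dirs)  # E[cos theta] = 2/3 under this pdf
```

A quick sanity check on such a sampler: every returned direction should be unit length, and the sample mean of $\cos\theta_\textrm{in}$ should approach $\int_\Omega \cos\theta \; {\cos\theta \over \pi} \; d\oi = {2 \over 3}$.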