In preparation for the next lecture, which will discuss tool tracking with stereo camera, we'll look at how to establish correspondences between two cameras.
That is, given a point in the image of the first camera, where is that point in the image of the second camera?
The focal point of a pinhole camera is the origin of the camera's coordinate system and is denoted $c$. The camera's coordinate system has three basis vectors: $u$ points right, $v$ points up, and $n$ points forward out of the camera. All of $u$, $v$, and $n$ are unit vectors.
The camera's pose is denoted by $$\left[ R | t \right] = \left[ \begin{array}{cccc} u_x & u_y & u_z & -c \cdot u \\ v_x & v_y & v_z & -c \cdot v \\ n_x & n_y & n_z & -c \cdot n \\ 0 & 0 & 0 & 1 \end{array} \right] $$
where $R$ is the rotation matrix with $u$, $v$, and $n$ in its rows (i.e. the upper-left $3 \times 3$ matrix above) and $t = -R\ c$. Note that $\mathbf{t \neq c}$, so this notation does not tell you where the camera centre is, except indirectly.
This matrix transforms a 4D (homogeneous) point from the world coordinate system into the camera coordinate system. For more details, see The Viewing Transform, V section of these CISC 454 notes (where our $n$ is $-z$ in those notes).
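As a small numeric sketch (with made-up values for $c$, $u$, $v$, and $n$), we can build $[R|t]$ and confirm that it maps the camera centre to the origin of the camera's coordinate system:

```python
import numpy as np

# Hypothetical camera: centre c, with unit basis vectors u (right),
# v (up), n (forward).  These values are illustrative only.
c = np.array([2.0, 1.0, 5.0])
u = np.array([1.0, 0.0, 0.0])          # right
n = np.array([0.0, 0.0, 1.0])          # forward
v = np.cross(n, u)                     # up

R = np.vstack([u, v, n])               # rows of R are u, v, and n
t = -R @ c                             # note: t != c

# 4x4 world-to-camera transform [R|t]
Rt = np.eye(4)
Rt[:3, :3] = R
Rt[:3, 3] = t

# The camera centre, as a homogeneous point, maps to the origin.
c_h = np.append(c, 1.0)
print(Rt @ c_h)                        # [0. 0. 0. 1.]
```

Checking that the centre lands on $(0,0,0,1)$ is a quick sanity test that $t = -R\,c$ was assembled correctly.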
The camera's projection transform takes a 3D point in the camera coordinate system and projects it onto the camera's image plane. This transformation looks like $$K = \left[ \begin{array}{cccc} f & 0 & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{array} \right]$$
where $f$ is the camera's focal distance (measured in pixel units) and $(p_x, p_y)$ is the position on the image plane through which the vector $n$ passes (called the "principal point").
$K$ is called the camera calibration matrix. For more details, see Projection Transformations in these CISC 454 notes.
Applying $K$ to a 4D homogeneous point in the camera's coordinate system results in a 3D homogeneous point measured in pixel units on the camera's image plane. (Since these are homogeneous coordinates, we finally divide by the last coordinate to get a 2D point on the camera's image plane.)
Then $$x_\mathrm{image} = K \ [R|t]\ x_\mathrm{world}$$
maps a world point onto the corresponding image point.
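The whole pipeline can be sketched numerically. Here the focal length, principal point, and world point are made-up values, and the camera is placed at the world origin with its axes aligned with the world's, so $[R|t] = [I|0]$:

```python
import numpy as np

# Assumed intrinsics: focal distance f (pixels), principal point (px, py).
f, px, py = 800.0, 320.0, 240.0
K = np.array([[f, 0, px, 0],
              [0, f, py, 0],
              [0, 0,  1, 0]])               # 3x4 projection transform

Rt = np.eye(4)                              # camera at world origin, axes aligned

p_world = np.array([0.5, -0.25, 2.0, 1.0])  # homogeneous world point

x_h = K @ Rt @ p_world                      # 3D homogeneous image point
x_image = x_h[:2] / x_h[2]                  # divide by the last coordinate
print(x_image)                              # [520. 140.]
```

The final division by $x_h[2]$ is the homogeneous divide described above, which turns the 3D homogeneous point into a 2D pixel position.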
The figure below shows two cameras located at $c_1$ and $c_2$.
Each camera has its own world-to-camera transformation, so $$q_1 = [R_1|t_1]\ p \qquad \mathrm{and} \qquad q_2 = [R_2|t_2]\ p.$$
We will ignore, for now, the camera calibration matrices, $K_1$ and $K_2$.
To make things simpler, we will assume that camera 1 is located at the world origin with its axes aligned with those of the world coordinate system. Then we can say $$q_1 = [I|0]\ p \qquad \mathrm{and} \qquad q_2 = [R|t]\ p.$$
We need to find the correspondence between $q_1$ and $q_2$, which are both projections of the same point, $p$.
The line joining the camera centres (called the baseline) intersects the two image planes at the points $e_1$ and $e_2$ (called the epipoles), as shown below. Note that the points $p$, $c_1$, and $c_2$ define a plane in 3D.
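Concretely, each epipole is the projection of the other camera's centre. A short sketch, assuming (as above) that camera 1 is $[I|0]$ and camera 2 is $[R|t]$ with illustrative values for $R$ and $c_2$:

```python
import numpy as np

# Assumed pose of camera 2: centre c2, small rotation about the y axis.
c2 = np.array([1.0, 0.0, 0.0])
theta = np.radians(10.0)
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])
t = -R @ c2

# Epipole directions (before the homogeneous divide onto each image plane):
e1 = c2            # from c1 (the world origin) toward c2, in camera 1's frame
e2 = R @ (-c2)     # from c2 toward c1, in camera 2's frame

print(e2, t)       # e2 equals t
```

That $e_2$ comes out equal to $t = -R\,c_2$ is a useful fact: the second camera's translation component points along its epipole.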
From camera 1's viewpoint, $p$ projects to $q_1$, as does any point on the line through $c_1$ and $p$. So camera 1 does not know where $p$ is on that line.
However, as $p$ moves along the line through $c_1$ and $q_1$, its projection, $q_2$, onto camera 2's image plane moves along the line through $e_2$ and $q_2$.
Thus, to find the point in camera 2's image plane that corresponds to $\mathbf{q_1}$, we can restrict our search to the line through $\mathbf{e_2}$ and $\mathbf{q_2}$.
The lines through $e_1$ and $q_1$, and through $e_2$ and $q_2$, are called the epipolar lines.
If we are to search along the epipolar line through $e_2$ and $q_2$, we need first to define that line mathematically.
In the figure below, $n$ is perpendicular to the plane through $p$, $c_1$, and $c_2$ (this $n$ is not the camera axis $n$ defined earlier). $x$ is a point on the epipolar line of $q_1$ in camera 2's image; $x$ can also be thought of as a vector from $c_2$ to that point.
If $x$ is written in camera 2's coordinate system, then the corresponding vector in the world coordinate system is $R^T x$.
To find $n$, take the cross product of two vectors in the plane (again in the world coordinate system), such as the baseline $c_2 - c_1 = c_2$ (recall that $c_1$ is at the world origin) and $q_1$: $$n = c_2 \times q_1$$
Since $R^T x$ is a vector in the plane and $n$ is perpendicular to the plane, their dot product is zero: $$\begin{array}{rcll} (R^T x) \cdot (c_2 \times q_1) &=& 0 \\ (R\ (R^T x)) \cdot (R\ (c_2 \times q_1)) &=& 0 \\ x \cdot (R\ (\widetilde{c_2} q_1)) &=& 0 & \textrm{(see below)} \\ x^T\ \underbrace{(R\ \widetilde{c_2})}_\textrm{essential matrix}\ q_1 &=& 0 \\ \end{array}$$
where $\widetilde{c_\textrm{ }}$ is the skew-symmetric matrix $$\widetilde{c_\textrm{ }} = \left[\begin{array}{ccc} 0 & -c_z & c_y \\ c_z & 0 & -c_x \\ -c_y & c_x & 0 \end{array}\right]$$
which has the property that $\widetilde{c_\textrm{ }}\ x = c \times x$.
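These relations can be verified numerically. The sketch below (all poses and points are illustrative assumptions) builds $\widetilde{c_2}$, forms the essential matrix $E = R\,\widetilde{c_2}$, and checks that $x^T E\, q_1 = 0$ when $x$ and $q_1$ come from the same world point:

```python
import numpy as np

def skew(c):
    """Skew-symmetric matrix with skew(c) @ x == np.cross(c, x)."""
    return np.array([[    0, -c[2],  c[1]],
                     [ c[2],     0, -c[0]],
                     [-c[1],  c[0],     0]])

# Illustrative setup: camera 1 is [I|0]; camera 2 sits at c2, rotated
# by a small rotation R about the y axis.
c2 = np.array([1.0, 0.0, 0.0])
theta = np.radians(10.0)
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])

p = np.array([0.3, -0.2, 4.0])      # a world point seen by both cameras

q1 = p                              # direction c1 -> p in camera 1's frame
x  = R @ (p - c2)                   # direction c2 -> p in camera 2's frame

E = R @ skew(c2)                    # essential matrix
print(x @ E @ q1)                   # ~0: the epipolar constraint holds
```

The product is zero (up to floating-point rounding) for any choice of $p$, $R$, and $c_2$, which is exactly the constraint derived above.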
Note that $x$ and $q_1$ are in the coordinate systems of their respective cameras. To move them into pixel coordinates, apply the $K_1$ and $K_2$ camera calibration matrices: $$x_\mathrm{px} = K_2\ x \qquad \textrm{and} \qquad q_{1,\mathrm{px}} = K_1 q_1.$$
Then $$\begin{array}{rcl} x^T\ (R\ \widetilde{c_2})\ q_1 &=& 0 \\ (K_2^{-1}\ x_\mathrm{px})^T\ (R\ \widetilde{c_2})\ (K_1^{-1}\ q_{1,\mathrm{px}}) &=& 0 \\ x_\mathrm{px}^T\ \underbrace{(K_2^{-T} R\ \widetilde{c_2}\ K_1^{-1})}_\textrm{fundamental matrix}\ q_{1,\mathrm{px}} &=& 0 \\ x_\mathrm{px}^T\ F\ q_{1,\mathrm{px}} &=& 0 \\ \end{array}$$
Given two calibrated cameras, we can compute the fundamental matrix, $F$.
Then a point $q_{1,\mathrm{px}}$ in camera 1's image will correspond to any point $x_\mathrm{px}$ in camera 2's image that satisfies $$x_\mathrm{px}^T\ F\ q_{1,\mathrm{px}} = 0$$
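A self-contained sketch of the pixel-space version, with made-up intrinsics and pose (here only the invertible $3 \times 3$ part of each $K$ is needed, since $x$ and $q_1$ are 3-vectors):

```python
import numpy as np

def skew(c):
    """Skew-symmetric matrix with skew(c) @ x == np.cross(c, x)."""
    return np.array([[    0, -c[2],  c[1]],
                     [ c[2],     0, -c[0]],
                     [-c[1],  c[0],     0]])

# Assumed intrinsics (3x3) and pose; camera 1 is [I|0].
K1 = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1]])
K2 = np.array([[750.0, 0, 300.0], [0, 750.0, 220.0], [0, 0, 1]])
c2 = np.array([1.0, 0.0, 0.0])
theta = np.radians(10.0)
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])

# Fundamental matrix F = K2^{-T} R c2~ K1^{-1}
F = np.linalg.inv(K2).T @ R @ skew(c2) @ np.linalg.inv(K1)

# Project a world point into both images (homogeneous pixel coordinates).
p = np.array([0.3, -0.2, 4.0])
q1_px = K1 @ p                      # camera 1
x_px  = K2 @ (R @ (p - c2))         # camera 2

print(x_px @ F @ q1_px)             # ~0 up to rounding
```

For a fixed $q_{1,\mathrm{px}}$, the 3-vector $F\,q_{1,\mathrm{px}}$ gives the coefficients of the epipolar line in camera 2's image, which is the line the next lecture will search along.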
We'll see in the next lecture how to use this fact to search for correspondences between images.