We start with a 3D point, $\pocs$, defined in its own Object Coordinate System. This point gets transformed to move it into the coordinate system of the window:
Object coordinate system $\pocs$
$\big\downarrow$ Modeling transform
World coordinate system $\pwcs$
$\big\downarrow$ Viewing transform
Viewpoint coordinate system $\pvcs$
$\big\downarrow$ Projection transform
Clipping coordinate system $\pccs$
$\big\downarrow$ Perspective division
Normalized device coordinate system $\pndcs$
$\big\downarrow$ Viewport transform
Device coordinate system $\pdcs$
At the end of these transformations, the 2D point, $\pdcs$, is a pixel location in the window.
The object $O$ on which the point $\pocs$ lives has its own Object Coordinate System (OCS). Point $\pocs$ is defined within that coordinate system.
To position $O$ somewhere else in the world, we apply a transformation to its points. Typically, this consists of scaling, then rotating, then translating:
$\Owcs = (T \; R \; S) \; \Oocs$

where $T$, $R$, and $S$ are the transformations. Note that scaling ($S$) is applied first, then rotation ($R$), and finally translation ($T$). Scaling and rotation could be swapped only if the scaling is uniform, since a rotation and a non-uniform scaling do not, in general, commute.
Any point $\pocs$ on $O$ undergoes the same transformations:
$\pwcs = (T \; R \; S) \; \pocs$
Suppose the camera (or eye) is at location $e$ in the world, and its attached orthogonal coordinate system, $\langle x, y, z \rangle$, has $x$ pointing right, $y$ pointing up, and $z$ pointing backward along the direction of view.
Note that $x$, $y$, and $z$ are vectors, not scalars.
To convert $\pwcs$ into the Viewing Coordinate System (VCS), we first find the vector from the camera origin to the point:
$p' = \pwcs - e$
In other words:
$p' = \begin{bmatrix} 1 & 0 & 0 & -e_x \\ 0 & 1 & 0 & -e_y \\ 0 & 0 & 1 & -e_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \; \pwcs$
Second, we convert $p'$ (which is a vector in the WCS) into the VCS. To do so, we project $p'$ onto each of the three axes of the VCS:
$\pvcs = \begin{bmatrix} \lh & x & \rh & 0 \\ \lh & y & \rh & 0 \\ \lh & z & \rh & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \; p'$
Combining the two transforms above, we get the Viewing Transform:
$\pvcs = \begin{bmatrix} \lh & x & \rh & -e \cdot x \\ \lh & y & \rh & -e \cdot y \\ \lh & z & \rh & -e \cdot z \\ 0 & 0 & 0 & 1 \end{bmatrix} \; \pwcs$
where $-e \cdot u$ denotes the dot product of $-e$ and $u$.
The view volume, shown below, is the volume of space from which objects get rendered. Anything outside this volume is clipped. The view volume has left, right, top and bottom clipping planes which correspond to the edges of the window. The view volume also has "near" and "far" clipping planes, which limit the range of depths that visible objects can have.
In the diagram above, the near and far clipping planes are at distances $n$ and $f$, respectively, but their equations are $z = -n$ and $z = -f$ since they're on the negative $z$ axis.
The $\ell$, $r$, $t$, and $b$ are the locations of the left, right, top, and bottom clipping planes where they intersect the near clipping plane. For example, the top clipping plane intersects the near clipping plane at $y = t$, and the right clipping plane intersects the near clipping plane at $x = r$.
The Projection Transform takes a point $\pvcs$ in the view volume and transforms it to a point $\pndcs$ in the canonical view volume, shown below. (The strange labeling of the axes will be explained shortly.)
The canonical view volume ranges from $-1$ to $+1$ on each axis, and its coordinate system (called the Normalized Device Coordinate System, or NDCS) is "left handed": the $z'\over w'$ axis points in the opposite-to-usual direction.
Important Note: The NDCS is in 3D, but we're working with homogeneous points in 4D. So, transforming a point from the VCS to the NDCS involves two steps: first, a $4 \times 4$ matrix transforms the 4D point; second, a division brings the result down to 3D.
The Projection Transform does the first step. It transforms a point $\pvcs$ into a point $\pccs$. The point $\pccs$ is said to be in the Clipping Coordinate System (CCS). The Projection Transform looks like this (we'll fill in the details in a later section):
$\pccs = \begin{bmatrix} E & 0 & A & 0 \\ 0 & F & B & 0 \\ 0 & 0 & C & D \\ 0 & 0 & -1 & 0 \\ \end{bmatrix} \; \pvcs$
In OpenGL, using linalg.h, you can create a $4 \times 4$ projection matrix with
perspective( float fovy, float aspect, float n, float f )
where n and f are the distances to the near and far clipping planes, aspect is the aspect ratio of the window (i.e. $\textrm{width} \over \textrm{height}$), and fovy is the field of view in the $y$ direction (i.e. the angle between the top and bottom edges of the window, as seen from the eye, in radians).
The Model-View-Projection (MVP) transform is the $4 \times 4$ matrix that transforms a point $\pocs$ in the OCS to the corresponding point $\pccs$ in the CCS. The transform is
$\pccs = (P \; V \; M) \; \pocs$
because $M$ is applied first, then $V$, then $P$.
The $4 \times 4$ matrix $P \; V \; M$ is usually passed to the vertex shader, which applies it to all vertices.
Different objects will have different $M$ matrices, so $P \; V \; M$ will change for each object rendered.
Different viewpoints will have different $V$ matrices, so $V$ will change if there's a change in viewpoint. But $V$ is usually constant for one rendered frame.
$P$ doesn't change unless the projection does, which usually happens only if the view is zoomed (which changes the field of view, fovy) or the window is resized (which might change its aspect).
See 00-intro/demo8.cpp in openGL.zip for an example of the MVP used in 3D rendering.
The next step, the Perspective Division, transforms the 4D point $\pccs$ into the 3D point $\pndcs$:
$\begin{bmatrix} p_{\textrm{ndcs},x} \\[5mm] p_{\textrm{ndcs},y} \\[5mm] p_{\textrm{ndcs},z} \end{bmatrix} = \begin{bmatrix} \large p_{\textrm{ccs},x} \over \large p_{\textrm{ccs},w} \\[2mm] \large p_{\textrm{ccs},y} \over \large p_{\textrm{ccs},w} \\[2mm] \large p_{\textrm{ccs},z} \over \large p_{\textrm{ccs},w} \end{bmatrix}$
As a final step, we project the points in the canonical view volume onto the viewport (also called the window). The viewport is the area on the screen in which you're drawing. Its coordinates are pixel locations in the Device Coordinate System (DCS):
In the VCS, a line-of-sight from the camera origin corresponds, in the NDCS, to a line parallel to the $z' \over w'$ axis: All points on that line project to the same $(x,y)$ location on the image plane.
That means that we need to map $({x' \over w'}, {y' \over w'})$ in the NDCS to $(x,y)$ in the DCS. The $z' \over w'$ coordinate in the NDCS corresponds to depth, and is only used when the depth buffer ("z-buffer") is enabled. For convenience, we map the $z' \over w'$ coordinate to the range $[0,1]$.
The Viewport Transform does this:
$\begin{array}{rl} x & = 0.5 ({x' \over w'} +1) (R-L) + L \\ y & = 0.5 ({y' \over w'} +1) (T-B) + B \\ z & = 0.5 ({z' \over w'} +1) \end{array}$
For example, the ${x' \over w'}$ values in NDCS are in the range $[-1,+1]$. The Viewport Transform converts them (linearly) into $x$ values in DCS, which are in the range $[L,R]$.
In OpenGL, this can be set up as follows:
glViewport( x, y, width, height )
where (x,y) is the location of the lower-left corner of the viewport, and width and height are its dimensions. Note that (x,y) is relative to the origin of the OpenGL window and that everything is measured in pixels. You only use glViewport if you want to restrict drawing to a rectangular area inside your window. By default, the viewport is the entire window.