The Model-View-Projection Transform

We start with a 3D point v_ocs defined in the Object Coordinate System. This point gets transformed to move it into the coordinate system of the window:

v_ocs

| Modeling Transform (into World Coordinate System)

v_wcs

| Viewing Transform (into Viewing Coordinate System)

v_vcs

| Projection Transform (into Clipping Coordinate System)

v_ccs

| Perspective Division (into Normalized Device Coordinate System)

v_ndcs

| Viewport Transform (into Device Coordinate System)

v_dcs

At the end of these transformations, the point v_dcs is a 2D pixel location in the window.

The Modeling Transform

The object O on which the point v_ocs lives is, by default, positioned at the origin of the World Coordinate System (WCS). To position O somewhere else in the world, we apply a transformation to it. Typically, this consists of scaling, then rotating, then translating:
O_wcs = (T R S) O
where T, R, and S are the transformations. Note that scaling (S) is applied first, then rotation (R), and finally translation (T).
Any point v_ocs on O undergoes the same transformations:
v_wcs = (T R S) v_ocs
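The order of the factors in (T R S) is easy to get backwards, so here is a minimal sketch of the composition in C. The type Mat4 and the helper names (mat4_mul, mat4_apply, mat4_scale, mat4_translate, model_matrix) are hypothetical, chosen for this illustration only, and the row-major storage is an assumption of the sketch.

```c
/* 4x4 matrix stored row-major: m[row][col] (an assumption of this sketch). */
typedef struct { double m[4][4]; } Mat4;

/* Returns the matrix product a * b. */
Mat4 mat4_mul(Mat4 a, Mat4 b)
{
    Mat4 c;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            c.m[i][j] = 0.0;
            for (int k = 0; k < 4; k++)
                c.m[i][j] += a.m[i][k] * b.m[k][j];
        }
    return c;
}

/* Multiplies the homogeneous column vector in[4] by m, writing out[4].
   in and out must not be the same array. */
void mat4_apply(Mat4 m, const double in[4], double out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = 0.0;
        for (int k = 0; k < 4; k++)
            out[i] += m.m[i][k] * in[k];
    }
}

/* Scaling matrix S. */
Mat4 mat4_scale(double sx, double sy, double sz)
{
    Mat4 s = {{ {sx, 0, 0, 0}, {0, sy, 0, 0}, {0, 0, sz, 0}, {0, 0, 0, 1} }};
    return s;
}

/* Translation matrix T. */
Mat4 mat4_translate(double tx, double ty, double tz)
{
    Mat4 t = {{ {1, 0, 0, tx}, {0, 1, 0, ty}, {0, 0, 1, tz}, {0, 0, 0, 1} }};
    return t;
}

/* Builds T R S. S is the rightmost factor, so it acts on a point first,
   then R, then T. */
Mat4 model_matrix(Mat4 t, Mat4 r, Mat4 s)
{
    return mat4_mul(t, mat4_mul(r, s));
}
```

For example, scaling the point (1, 0, 0) by 2, rotating it by 90 degrees about the z axis, and then translating it by (10, 0, 0) gives (10, 2, 0).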

The Viewing Transform

Suppose the camera is at location p in the world, and its attached orthonormal coordinate system, < u,v,n >, has u pointing right, v pointing up, and n pointing backward along the direction of view. (That is, u, v, and n are mutually perpendicular unit vectors.)
To convert v_wcs into the Viewing Coordinate System (VCS) we first find the vector from the camera origin to the point:
v' = v_wcs - p

In other words:

     [ 1  0  0  -p_x ]
v' = [ 0  1  0  -p_y ] v_wcs
     [ 0  0  1  -p_z ]
     [ 0  0  0    1  ]

Second, we convert v' (which is a vector in the WCS) into the VCS. To do so, we project v' onto each of the three axes of the VCS:

        [ u_x  u_y  u_z  0 ]
v_vcs = [ v_x  v_y  v_z  0 ] v'
        [ n_x  n_y  n_z  0 ]
        [  0    0    0   1 ]

Combining the two transforms above, we get the Viewing Transform:

        [ u_x  u_y  u_z  -p.u ]
v_vcs = [ v_x  v_y  v_z  -p.v ] v_wcs
        [ n_x  n_y  n_z  -p.n ]
        [  0    0    0     1  ]

where -p.u denotes the dot product of -p and u.
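As a rough illustration, the combined matrix can be filled in directly from p, u, v, and n. The names Vec3, dot, view_matrix, and to_vcs below are hypothetical, and the sketch assumes u, v, and n are unit length and mutually perpendicular.

```c
typedef struct { double x, y, z; } Vec3;

/* Dot product of two 3D vectors. */
static double dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Fills m (row-major 4x4) with the Viewing Transform for a camera at p
   whose axes are u (right), v (up), and n (backward). */
void view_matrix(Vec3 p, Vec3 u, Vec3 v, Vec3 n, double m[4][4])
{
    Vec3 axes[3];
    axes[0] = u;
    axes[1] = v;
    axes[2] = n;
    for (int i = 0; i < 3; i++) {
        m[i][0] = axes[i].x;
        m[i][1] = axes[i].y;
        m[i][2] = axes[i].z;
        m[i][3] = -dot(p, axes[i]);   /* the -p.u, -p.v, -p.n column */
    }
    m[3][0] = m[3][1] = m[3][2] = 0.0;
    m[3][3] = 1.0;
}

/* Transforms a WCS point (with implied w = 1) into the VCS. */
Vec3 to_vcs(double m[4][4], Vec3 w)
{
    Vec3 r;
    r.x = m[0][0] * w.x + m[0][1] * w.y + m[0][2] * w.z + m[0][3];
    r.y = m[1][0] * w.x + m[1][1] * w.y + m[1][2] * w.z + m[1][3];
    r.z = m[2][0] * w.x + m[2][1] * w.y + m[2][2] * w.z + m[2][3];
    return r;
}
```

For a camera at p = (0, 0, 5) with u, v, n aligned to the world x, y, z axes, the world point (1, 2, 0) becomes (1, 2, -5) in the VCS: five units in front of the camera, on the negative z axis side.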

The Projection Transform and Perspective Division

The view volume, shown below, is the volume of space from which objects get rendered. Anything outside this volume is clipped. The view volume has left, right, top and bottom clipping planes which correspond to the edges of the window. The view volume also has "near" and "far" clipping planes, which limit the range of depths that visible objects can have.

In the diagram above, the near and far clipping planes are at distances n and f, respectively, but their locations are z = -n and z = -f since they're on the negative z axis. The l, r, t, and b are the locations of the left, right, top, and bottom clipping planes where they intersect the near clipping plane. For example, the top clipping plane intersects the near clipping plane at y = t.
The Projection Transform takes a point v_vcs in the view volume, and transforms it to a point v_ndcs in the canonical view volume, shown below. (The strange labeling of the axes will be explained shortly.)

The canonical view volume ranges from -1 to +1 in each of the axes, and the coordinate system (called the Normalized Device Coordinate System, or NDCS) is "left handed": See that the z'/w' axis points in the opposite-to-usual direction.
Important Note: The NDCS is in 3D, but we're working with homogeneous points in 4D. So, transforming a point from the VCS to the NDCS involves two steps:

Transform the point v_vcs = [ x y z 1 ]^T to a homogeneous point v_ccs = [ x' y' z' w' ]^T.
Transform v_ccs to a 3D point v_ndcs = [ x'/w' y'/w' z'/w' ]^T by dividing the first three coordinates by the last coordinate. This is why the axes in the diagram above are labeled x'/w', y'/w', and z'/w'.

The Projection Transform does the first step. It transforms a point v_vcs into a point v_ccs. The point v_ccs is said to be in the Clipping Coordinate System (CCS). The Projection Transform looks like this (we'll fill in the details in a later section):

        [ E  0   A  0 ]
v_ccs = [ 0  F   B  0 ] v_vcs
        [ 0  0   C  D ]
        [ 0  0  -1  0 ]

The next step, the Perspective Division, transforms the point v_ccs into v_ndcs:

[ v_x,ndcs ]   [ v_x,ccs / v_w,ccs ]
[ v_y,ndcs ] = [ v_y,ccs / v_w,ccs ]
[ v_z,ndcs ]   [ v_z,ccs / v_w,ccs ]
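The division itself is a one-liner, but it is undefined when the last coordinate is zero. A minimal sketch, with hypothetical names Vec4, Vec3, and perspective_divide, might report that case to the caller rather than divide:

```c
typedef struct { double x, y, z, w; } Vec4;
typedef struct { double x, y, z; } Vec3;

/* Converts a CCS point to an NDCS point by dividing by w.
   Returns 1 and writes *out on success, or 0 (leaving *out untouched)
   when w is zero and the division is undefined. */
int perspective_divide(Vec4 ccs, Vec3 *out)
{
    if (ccs.w == 0.0)
        return 0;
    out->x = ccs.x / ccs.w;
    out->y = ccs.y / ccs.w;
    out->z = ccs.z / ccs.w;
    return 1;
}
```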

The Viewport Transform

As a final step, we project the points in the canonical view volume onto the viewport (also called the window). The viewport is the area on the screen in which you're drawing. Its coordinates are pixel locations in the Device Coordinate System (DCS):

In the VCS, a line-of-sight from the camera origin corresponds, in the NDCS, to a line parallel to the z axis: All points on that line project to the same (x,y) location on the image plane.
That means that we need to map (x,y) in the NDCS to (x',y') in the DCS. The z coordinate in the NDCS corresponds to depth, and is only used when the depth buffer ("z-buffer") is enabled. For convenience, we map the z coordinate to the range [0,1].
The Viewport Transform does this:

x' = 0.5 * (x+1) * (R - L) + L

y' = 0.5 * (y+1) * (T - B) + B

z' = 0.5 * (z+1)

For example, the x values in NDCS are in the range [-1,+1]. The Viewport Transform converts them (linearly) into x' values in DCS, which are in the range [L,R].
In OpenGL, this can be set up as follows:
glViewport( x, y, width, height );
where (x,y) is the location of the lower-left corner of the viewport, and width and height are its dimensions. Note that (x,y) is relative to the origin of the OpenGL window and that everything is measured in pixels. You only need to call glViewport yourself if you want to restrict drawing to a rectangular area inside your window. Initially, the viewport is the entire window.
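The three formulas can be written out as a small function. This is only a sketch: the names Vec3 and viewport_transform are hypothetical, and it assumes the viewport limits relate to the glViewport arguments as L = x, R = x + width, B = y, T = y + height.

```c
typedef struct { double x, y, z; } Vec3;

/* Maps an NDCS point to the DCS for a viewport whose lower-left corner
   is (x0, y0) and whose size is width by height, all in pixels. */
Vec3 viewport_transform(Vec3 ndc, double x0, double y0,
                        double width, double height)
{
    /* Assumed correspondence with the L, R, B, T of the formulas. */
    double L = x0, R = x0 + width;
    double B = y0, T = y0 + height;

    Vec3 d;
    d.x = 0.5 * (ndc.x + 1.0) * (R - L) + L;   /* [-1,+1] -> [L,R] */
    d.y = 0.5 * (ndc.y + 1.0) * (T - B) + B;   /* [-1,+1] -> [B,T] */
    d.z = 0.5 * (ndc.z + 1.0);                 /* [-1,+1] -> [0,1] */
    return d;
}
```

With an 800 by 600 viewport at (0, 0), the NDCS origin lands at pixel (400, 300) with depth 0.5, and the NDCS corner (-1, -1, -1) lands at (0, 0) with depth 0.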

Details of the Projection Transform

In this section, we'll derive the Projection Transform, which was given above as a matrix with many unknown elements:

        [ E  0   A  0 ]
v_ccs = [ 0  F   B  0 ] v_vcs
        [ 0  0   C  D ]
        [ 0  0  -1  0 ]

Let v_ccs = [ x' y' z' w' ]^T and let v_vcs = [ x y z 1 ]^T.
First, consider how y' is calculated:

From the transformation matrix above, y' = F y + B z. Consider a 2D slice at x=0 through the view volume (shown on the left below) and the corresponding 2D slice at x'/w'=0 through the canonical view volume (shown on the right below):

Points on the top line of the view volume satisfy y/-z = t/n. That is, these points are of the form (-tz/n , z). These points correspond to points on the top line of the canonical view volume that satisfy y'/w' = 1.
Substitute a point (-tz/n , z) into the y' and w' lines of the transformation matrix:

y' = F y + B z

= F (-tz/n) + B z

w' = - z

As stated above, the VCS point (-tz/n , z) is transformed to a CCS point for which y'/w' = 1. So

y'/w' = (F (-tz/n) + B z) / (-z)

= t/n F - B

= 1

So we know that
t/n F - B = 1.

By a similar argument (using VCS points (-bz/n , z) on the bottom line which map to CCS points for which y'/w' = -1) we can show that
b/n F - B = -1.

We can then solve the two equations in F and B to get:

F = 2n / (t-b)

B = (t+b) / (t-b)

The mapping for x' is analogous:

Given l and r which are the left and right limits in the x direction of the view volume, we apply the same method as for y' and get

E = 2n / (r-l)

A = (r+l) / (r-l)

The mapping for z' is a bit different:

VCS points on the near plane (z = -n) map to CCS points on the line z'/w' = -1, and VCS points on the far plane (z = -f) map to CCS points on the line z'/w' = +1.
From the transformation matrix:

z' = C z + D

w' = - z

Substitute for z = -n and for z = -f to get two equations in the unknowns C and D:

z'/w' = (C (-n) + D) / n

= -C + D/n

= -1

z'/w' = (C (-f) + D) / f

= -C + D/f

= +1

Solving the equations yields:

C = -(f+n) / (f-n)

D = -2fn / (f-n)

All of the above work gives us the VCS-to-CCS transform matrix:

        [ 2n/(r-l)      0       (r+l)/(r-l)       0      ]
v_ccs = [    0       2n/(t-b)   (t+b)/(t-b)       0      ] v_vcs
        [    0          0       (f+n)/(n-f)   2fn/(n-f)  ]
        [    0          0           -1            0      ]

In OpenGL, this matrix can be set as follows:
glMatrixMode( GL_PROJECTION );
glLoadIdentity();
glFrustum( l, r, b, t, n, f );
where l, r, t, b, n, f are as defined above. Alternatively, you can do this:
glMatrixMode( GL_PROJECTION );
glLoadIdentity();
gluPerspective( fovy, asp, n, f );
where n and f are as above, fovy is the angle of the field of view in the y direction (vertical), in degrees, and asp is the aspect ratio of the view frustum: (r-l)/(t-b).
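To check the derivation numerically, the VCS-to-CCS matrix can be built by hand and applied to points on the edges of the view volume. The following is a sketch only; frustum_matrix and mat4_apply are hypothetical helper names, and the matrix is stored row-major as m[row][col].

```c
/* Fills m (row-major 4x4) with the VCS-to-CCS matrix derived in this
   section, for clipping planes l, r, b, t, n, f. */
void frustum_matrix(double l, double r, double b, double t,
                    double n, double f, double m[4][4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            m[i][j] = 0.0;

    m[0][0] = 2.0 * n / (r - l);       /* E */
    m[0][2] = (r + l) / (r - l);       /* A */
    m[1][1] = 2.0 * n / (t - b);       /* F */
    m[1][2] = (t + b) / (t - b);       /* B */
    m[2][2] = (f + n) / (n - f);       /* C */
    m[2][3] = 2.0 * f * n / (n - f);   /* D */
    m[3][2] = -1.0;                    /* w' = -z */
}

/* Multiplies the homogeneous column vector in[4] by m, writing out[4].
   in and out must not be the same array. */
void mat4_apply(double m[4][4], const double in[4], double out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = 0.0;
        for (int k = 0; k < 4; k++)
            out[i] += m[i][k] * in[k];
    }
}
```

With l = 0, r = 2, b = -1, t = 1, n = 1, f = 3, the VCS point (2, 1, -1) on the top-right edge of the near plane maps to the CCS point [ 1 1 -1 1 ]^T, so x'/w' = 1, y'/w' = 1, and z'/w' = -1, as the derivation requires. For the symmetric frustum that gluPerspective describes, the corresponding limits would be t = n tan(fovy/2), b = -t, r = asp * t, and l = -r.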

Clipping in the CCS

Points outside the view frustum must be clipped. The clipping could be done in almost any coordinate system, but it's best to do clipping in the CCS because:

The canonical view volume in the CCS is independent of camera parameters. That means that clipping in the CCS can be implemented in hardware which doesn't have to be parameterized ... so it's fast. There's no point earlier in the sequence of transformations where clipping can be done in a camera-independent manner.

After the perspective division (CCS-to-NDCS) some depth information is lost, which can result in improper clipping (as discussed below). So it doesn't make sense to clip after the CCS.

An example of improper clipping in the NDCS

In the figure below, the segment pq in the VCS is transformed to the segment p'q' in the NDCS. If we clip p'q', the segment will appear to be exiting the far plane of the canonical view volume. But it should really exit the top plane!

This occurs because the z' and w' coordinates of q' are both negative after the Projection Transform. Then, when we do the Perspective Division, the last coordinate (z'/w') is positive, and q' appears on the positive z axis.
For a better intuition, consider what happens to q' as q is moved along the z axis of the VCS:

As q approaches the origin, the segment pq approaches, then passes, the top-front corner of the VCS view volume. At the same time, the transformed q' in NDCS will move right (more distant) along the z'/w' axis.

When q touches the origin, the transformed q' in NDCS has w' = 0 (recall the matrix above, in which w' = -z), and q' lies at infinity on the z'/w' axis. That is, p'q' is parallel to the z'/w' axis.

As soon as q passes the origin, q' in NDCS has w' > 0 and z' < 0, so q' lies on the negative z'/w' axis, and moves in toward the NDCS origin as q continues inward.

After q passes the depth at which z' = 0, namely z = -2fn/(f+n), which maps to the centre of the canonical view volume, z' becomes positive (while w' remains positive), and q' moves to the right of the NDCS origin.
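The behaviour of q' can be traced numerically by evaluating z'/w' for a few depths. This is a small sketch using the C and D derived earlier; the function name ndc_depth is hypothetical.

```c
/* Returns z'/w' for a VCS point at depth z (z must be nonzero), given
   near and far distances n and f. Uses z' = C z + D and w' = -z. */
double ndc_depth(double z, double n, double f)
{
    double C = (f + n) / (n - f);
    double D = 2.0 * f * n / (n - f);
    double z_ccs = C * z + D;   /* z' */
    double w_ccs = -z;          /* w' */
    return z_ccs / w_ccs;
}
```

With n = 1 and f = 3, a point behind the camera at z = 1 gives z'/w' = 5, beyond the far plane of the canonical view volume even though the point is not beyond the far plane in the VCS. A point just in front of the camera at z = -0.5 gives -4, the near and far planes give -1 and +1, and z'/w' is 0 at z = -2fn/(f+n) = -1.5.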

Clipping in CCS

If the point were in NDCS, we would clip against the six planes that define the faces of the canonical view volume:

x'/w' = +1 x'/w' = -1

y'/w' = +1 y'/w' = -1

z'/w' = +1 z'/w' = -1

The corresponding planes in the CCS are:

x' - w' = 0 x' + w' = 0

y' - w' = 0 y' + w' = 0

z' - w' = 0 z' + w' = 0

Given a segment p'q' in the CCS, we determine the outcode of each endpoint (a string of six bits, one per plane, indicating on which side of that plane the point lies) by substituting the endpoint's [ x' y' z' w' ]^T coordinates into the six CCS plane equations and testing the signs. Then we can clip in exactly the same way as we did in 2D: plane by plane with the Sutherland-Hodgman algorithm.
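The outcode computation might look like the sketch below. The bit names and the functions outcode, trivially_inside, and trivially_outside are hypothetical, and the sketch assumes that "inside" means -w' <= x', y', z' <= w', which matches the six planes above for points with w' > 0.

```c
typedef struct { double x, y, z, w; } Vec4;

/* One bit per clipping plane (names are illustrative only). */
enum {
    OUT_LEFT   = 1,    /* x' + w' < 0 */
    OUT_RIGHT  = 2,    /* x' - w' > 0 */
    OUT_BOTTOM = 4,    /* y' + w' < 0 */
    OUT_TOP    = 8,    /* y' - w' > 0 */
    OUT_NEAR   = 16,   /* z' + w' < 0 */
    OUT_FAR    = 32    /* z' - w' > 0 */
};

/* Computes the six-bit outcode of a CCS point by substituting it into
   each plane equation and testing the sign. No division is performed. */
unsigned outcode(Vec4 p)
{
    unsigned code = 0;
    if (p.x + p.w < 0.0) code |= OUT_LEFT;
    if (p.x - p.w > 0.0) code |= OUT_RIGHT;
    if (p.y + p.w < 0.0) code |= OUT_BOTTOM;
    if (p.y - p.w > 0.0) code |= OUT_TOP;
    if (p.z + p.w < 0.0) code |= OUT_NEAR;
    if (p.z - p.w > 0.0) code |= OUT_FAR;
    return code;
}

/* Both endpoints are inside all six planes: nothing to clip. */
int trivially_inside(Vec4 p, Vec4 q)
{
    return (outcode(p) | outcode(q)) == 0;
}

/* Both endpoints are outside the same plane: the whole segment is outside. */
int trivially_outside(Vec4 p, Vec4 q)
{
    return (outcode(p) & outcode(q)) != 0;
}
```

Because the tests use x', y', z', and w' directly, a point behind the camera (for example [ 0 0 -5 -1 ]^T, where both z' and w' are negative) is flagged as outside the near plane rather than the far plane, which avoids the improper clipping described above.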
