Optimization

Optimization is the process of finding the set of parameters, $x$, that minimizes an ``objective function'', $F(x)$. (If the objective function is instead to be maximized, we can minimize $- F(x)$.)

For image registration, the parameters are those of the transformation that maps one image onto the other. The objective function is the similarity measure.

For 2D/2D image registration, the parameters would be $(x,y,\theta,s)$ for $(x,y)$ translation, $\theta$ rotation, and (perhaps) $s$ scaling. The objective function could be a similarity measure such as normalized cross correlation (NCC) or mutual information (MI); since these are maximized at the best alignment, we minimize their negatives.
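
As a concrete sketch (assuming NumPy and SciPy are available, and using hypothetical 2D image arrays named fixed and moving of the same shape), such an objective function might be written as the negative NCC of the fixed image and the transformed moving image:

  import numpy as np
  from scipy.ndimage import affine_transform

  def ncc(a, b):
      # Normalized cross correlation of two equal-sized images.
      a = (a - a.mean()) / (a.std() + 1e-12)
      b = (b - b.mean()) / (b.std() + 1e-12)
      return (a * b).mean()

  def objective(params, fixed, moving):
      # F(x) for x = (tx, ty, theta, s): negative NCC of the fixed image and
      # the moving image after rotation by theta, scaling by s, and
      # translation by (tx, ty) about the image centre.
      tx, ty, theta, s = params
      c, si = np.cos(theta), np.sin(theta)
      A = s * np.array([[c, -si],
                        [si,  c]])        # forward rotation and scale
      Ainv = np.linalg.inv(A)
      centre = (np.array(fixed.shape) - 1) / 2.0
      # affine_transform maps output coordinates to input coordinates,
      # so it is given the inverse of the forward transform.
      offset = centre - Ainv @ (centre + np.array([tx, ty]))
      warped = affine_transform(moving, Ainv, offset=offset, order=1)
      return -ncc(fixed, warped)          # negate: NCC is maximal at alignment

Any of the optimization methods below could then be applied to this four-parameter function.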

Most optimization methods are iterative, in that they start with an initial guess of the parameters, $x_0$, and iteratively refine that guess as $x_1, x_2, x_3, \ldots, x_i$ until $F(x_i) < \epsilon$ for some threshold $\epsilon$.

Gradient-Based Methods

Gradient-based methods require knowledge of the gradient, $\nabla F(x)$, of the objective function.

The gradient is the direction in the parameter space in which the objective function, $F(x)$, increases fastest. If we are trying to minimize the objective function, we move from $x$ in the direction of the negative gradient.

Gradient Descent

This method computes $x_{i+1}$ from $x_i$ as

$x_{i+1} = x_i - \lambda_i \; \nabla F(x_i)$

The step size, $\lambda_i$, must be large enough for the method to make progress efficiently, but not so large that a step skips over the minimum of $F$.
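
A minimal sketch of gradient descent with a fixed step size, assuming NumPy and a user-supplied gradient function grad_F:

  import numpy as np

  def gradient_descent(grad_F, x0, lam=0.1, tol=1e-6, max_iter=1000):
      # Step in the direction of the negative gradient until the step
      # (and hence the change in x) becomes smaller than 'tol'.
      x = np.asarray(x0, dtype=float)
      for _ in range(max_iter):
          step = lam * grad_F(x)
          x = x - step
          if np.linalg.norm(step) < tol:
              break
      return x

  # Example: F(x) = (x_0 - 3)^2 + (x_1 + 1)^2 has gradient 2 (x - (3,-1))
  # and minimum at (3,-1).
  x_min = gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), [0.0, 0.0])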

As the curvature, $\nabla^2 F(x)$, of the objective function increases, $\lambda$ should decrease so that $x$ does not stray off the path of steepest descent. The Barzilai–Borwein method does this by computing $\lambda_i$ from the previous step, $x_i - x_{i-1}$, and the corresponding change in gradient, $\nabla F(x_i) - \nabla F(x_{i-1})$, which together approximate $\nabla^2 F$:

$\lambda_i = { (x_i - x_{i-1})^T \; (\nabla F(x_i) - \nabla F(x_{i-1})) \over (\nabla F(x_i) - \nabla F(x_{i-1}))^T \; (\nabla F(x_i) - \nabla F(x_{i-1})) }$
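
In code, $\lambda_i$ can be computed directly from the last two estimates and their gradients (a sketch assuming NumPy arrays):

  import numpy as np

  def bb_step_size(x_i, x_prev, g_i, g_prev):
      # Barzilai-Borwein step size: lambda_i = (dx . dg) / (dg . dg),
      # where dx is the previous step and dg is the change in gradient.
      dx = np.asarray(x_i) - np.asarray(x_prev)
      dg = np.asarray(g_i) - np.asarray(g_prev)
      return float(dx @ dg) / float(dg @ dg)

The update is then $x_{i+1} = x_i - \lambda_i \; \nabla F(x_i)$, as before.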

Many other gradient-based methods exist, including the Levenberg-Marquardt algorithm, which applies when $F$ is a sum of squared residuals and uses the Jacobian of those residuals to compute each step's parameter changes.

Analytic evaluation of the gradient is often impossible. In such cases, gradient-based methods may approximate the gradient using a first-order finite difference, but this requires extra evaluations of $F$, which can be expensive.
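
For example, a forward-difference approximation perturbs each parameter in turn, costing $n+1$ evaluations of $F$ per gradient in an $n$-dimensional parameter space (a sketch assuming NumPy):

  import numpy as np

  def finite_difference_gradient(F, x, h=1e-6):
      # Approximate dF/dx_j as (F(x + h e_j) - F(x)) / h for each parameter.
      x = np.asarray(x, dtype=float)
      f0 = F(x)
      grad = np.zeros_like(x)
      for j in range(x.size):
          x_step = x.copy()
          x_step[j] += h
          grad[j] = (F(x_step) - f0) / h
      return grad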

Non-gradient Methods

Non-gradient methods avoid evaluating the gradient.

Powell's Method

This method is quite popular in image registration:

  1. The method starts with an initial guess, $x_0$, and a set of (typically unit-length and axis-aligned) direction vectors, $\{ s_1, s_2, \ldots, s_n \}$, in the parameter space.
  2. At each iteration, the method does a one-dimensional search along each direction in turn, each search starting from the point reached by the previous one, to find the distance, $\alpha_j$, at which $F$ is minimized along $s_j$.
  3. So $\{ \alpha_1, \alpha_2, \ldots, \alpha_n \}$ are the distances travelled along the directions, and the next estimate is $x_{i+1} = x_i + \sum_j \alpha_j s_j$.
  4. Finally, the direction $s_j$ with the largest $\alpha_j$ is removed. It is replaced with a unit vector in the direction of $\sum_j \alpha_j s_j$.
  5. The process is repeated until convergence or until an iteration limit is reached.

The search along each direction is a one-dimensional minimization problem, which can be solved efficiently with Brent's method, for example.
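
Below is a minimal sketch of the direction-set iteration described above, assuming NumPy and using SciPy's Brent line search (scipy.optimize.minimize_scalar); the safeguards and convergence tests of the full algorithm are omitted:

  import numpy as np
  from scipy.optimize import minimize_scalar

  def powell(F, x0, n_iter=20, tol=1e-6):
      x = np.asarray(x0, dtype=float)
      dirs = list(np.eye(x.size))            # initial axis-aligned directions
      for _ in range(n_iter):
          x_start = x.copy()
          alphas = []
          for s in dirs:
              # One-dimensional minimization of F along direction s (Brent's method).
              res = minimize_scalar(lambda a: F(x + a * s), method='brent')
              alphas.append(res.x)
              x = x + res.x * s
          net = x - x_start                  # net movement this iteration
          if np.linalg.norm(net) < tol:
              break
          # Replace the direction with the largest step by the net direction moved.
          dirs[int(np.argmax(np.abs(alphas)))] = net / np.linalg.norm(net)
      return x

  # Example: minimize a quadratic with minimum at (1, 2).
  x_min = powell(lambda p: (p[0] - 1)**2 + 10 * (p[1] - 2)**2, [0.0, 0.0])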

Downhill Simplex

Downhill Simplex (also known as the Nelder-Mead method) is not the ``simplex method'' of linear programming, which minimizes a linear cost function subject to linear constraints.

A simplex is a set of $d+1$ points in a $d$-dimensional space and forms the simplest object in that space. For example: In 1D, a simplex (2 points) is a line segment; in 2D, a simplex (3 points) is a triangle; and in 3D, a simplex (4 points) is a tetrahedron.

Downhill Simplex works by incrementally moving a simplex across the parameter space, much as an amoeba would move over a surface. With each move, the simplex moves closer to a local minimum.

The simplex has points $\{ x_0, x_1, \ldots, x_d \}$. With each step, the simplex is "moved" by modifying its points.

There are several rules that determine how the points are modified. One such rule is the reflection rule, which takes the point $x_i$ at which $F(x_i)$ is worst (i.e. maximal) and reflects $x_i$ through the centroid, $\bar{x} = {1 \over d+1} \sum_j x_j$, of all the points:

$x_i' = \bar{x} + \alpha \; (\bar{x} - x_i)$

$\alpha$ is typically chosen to be one, which places $x_i'$ at the mirror image of $x_i$ on the far side of the centroid.

Several other rules exist in the full algorithm.
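
As an illustration, here is a sketch of just the reflection rule (assuming NumPy), with the simplex stored as a $(d+1) \times d$ array of points:

  import numpy as np

  def reflect_worst(simplex, F, alpha=1.0):
      # Replace the worst point (maximal F) by its reflection through the
      # centroid of all the points: x' = c + alpha * (c - x_worst).
      values = np.array([F(p) for p in simplex])
      worst = int(np.argmax(values))
      centroid = simplex.mean(axis=0)
      new_simplex = simplex.copy()
      new_simplex[worst] = centroid + alpha * (centroid - simplex[worst])
      return new_simplex

  # Example: one reflection of a 2D simplex (a triangle) for
  # F(x) = (x_0 - 2)^2 + (x_1 - 2)^2.
  simplex = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
  simplex = reflect_worst(simplex, lambda p: (p[0] - 2)**2 + (p[1] - 2)**2)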
