\input fremtex
%\fullsize
\smallprint
\filename{ma305.tex}
\versiondate{1.9.03}
\def\zz#1{#1}
\Centerline{\bf MA305 Non-linear Optimisation
\footnote{This course was taught without proofs. I devised the proofs
in Chapter 4 here just to check that I was getting the theorems right. If you wish to use them for a class at this level you may need to refine the
details. In addition, I had second thoughts when writing parts of Chapter 3, and they don't perfectly match the problem sets.}{}}
\medskip
\noindent{\bf\zz{3}. Numerical approximation}
\medskip
I look at numerical methods for dealing with problems of the form
\inset{find $x$ such that $f(x)=c$ (one-dimensional),
find $x$ maximizing/minimizing $f(x)$ (one-dimensional),
find $\tbf{x}$ maximizing/minimizing $f(\tbf{x})$ (many-dimensional)}
\noindent where $f$ is a real-valued function of one or more real
variables. In each case it is supposed that we have effective methods
of calculating $f(x)$ or $f(\tbf{x})$, for given $x$/$\tbf{x}$, to any
required level of accuracy; but that the calculation may be `expensive'
(ordinarily, that just means `slow'), and is presented as the output of
an oracle with no hints as to what is really happening, so that we are
searching blindly. All the methods to be discussed will assume that
$f$ is, at the very least, continuous; some will require us to be able
to calculate derivatives of $f$ directly. Note that in practice we
shall wish to apply the methods to functions which are not defined
everywhere on $\Bbb R$ or $\BbbR^n$. The recipes below will mostly
ignore the problems which arise if a method demands the calculation of
$f$ at a point outside its domain, but you must remember that any
computer program you write for general use must include escape options
to deal with this possibility.
Since the problem
\Centerline{find $x$ such that $f(x)=c$}
\noindent can be changed at once to
\Centerline{find $x$ such that $f(x)-c=0$,}
\noindent and calculating $f(x)-c$ will be
virtually as fast as calculating $f(x)$, the discussion
below will consider only the case $c=0$. In the
same way, since the problems
\Centerline{find $x$ maximizing $f(x)$,
\quad find $\tbf{x}$ maximizing $f(\tbf{x})$}
\noindent are identical to the problems
\Centerline{find $x$ minimizing $-f(x)$,
\quad find $\tbf{x}$ minimizing $-f(\tbf{x})$,}
\noindent I will look only at minimization problems. I should also
note here that the methods I discuss will, at best, find {\it local}
minima; finding {\it global} minima is very much harder, and in my view
no blind-search method can be reliable unless it is supported by a real
understanding of the nature of the function $f$.
\bigskip
{\bf\zz{3A}. Binary search}
\medskip
\inset{\noindent To solve the equation $f(x)=0$ when we have found
initial points $a$, $b$ such that $a<b$ and $f(a)<0<f(b)$.}

\medskip

{\bf (a)}{\it The algorithm\/}: Set $x_0=a$, $y_0=b$. Given $x_n$,
$y_n$, set $z=\bover12(x_n+y_n)$; if $f(z)\le 0$ set $x_{n+1}=z$,
$y_{n+1}=y_n$; otherwise set $x_{n+1}=x_n$, $y_{n+1}=z$; continue.

{\it When should you stop\/}? In some cases, you will be satisfied when
you have found an $x$ such that $|f(x)|\le\epsilon$ for some
pre-determined $\epsilon>0$.
In other cases, you will be satisfied when you have a pair $x_n$, $y_n$
such that $f(x_n)<0<f(y_n)$ and $y_n-x_n\le\delta$ for some
pre-determined $\delta>0$.

{\it When does the method work\/}? Provided $f$ is continuous and
defined on the whole initial interval $[a,b]$, the algorithm will halt
in finitely many steps for either stopping criterion.
\medskip
{\bf (b)}{\it Advantages of the method} (i) It works for all continuous
functions $f$, provided we can find initial values $a$ and $b$ for which
$f(a)f(b)<0$ and $f$ is defined on the whole interval $[a,b]$.
\quad (ii) The values $f(x)$ need be calculated only to sufficient
accuracy to determine whether they are positive or negative. (Remember
that in many cases we shall be calculating $f(x)$ by a numerical method
of the kind described in this chapter. So it may be very much cheaper
to calculate $f(x)$ to two decimal places rather than four.)
\quad(iii) When the algorithm halts (other than in the exceptional case
in which we have found an $x$ for which we cannot distinguish $f(x)$
from zero), we shall have an explicit interval-of-uncertainty
$[x_n,y_n]$ with a guarantee that there is at least one solution in that
interval.
\quad(iv) If our stopping criterion is that the interval-of-uncertainty
$[x_n,y_n]$ should be of length at most $\delta$, then we can say at the
start how many times we shall have to calculate $f$, since
$y_n-x_n=2^{-n}(b-a)$ for every $n$, so that we shall need exactly
$\lceil\log_2\Bover{b-a}{\delta}\rceil$ further calculations after
finding $f(a)$ and $f(b)$.
\medskip
{\it Disadvantage of the method} The `geometric' reduction, by which the
interval-of-uncertainty $y_n-x_n$ halves each time we calculate a new
value of $f$, is slow by the standards of numerical methods in general.
\bigskip
{\bf\zz{3B} False position (one-dimensional case)}
\medskip
\inset{\noindent To solve the equation $f(x)=0$ starting from distinct
initial points $a$, $b$.}
\medskip
{\bf (a)}{\it The algorithm\/}: Set $x_0=a$, $x_1=b$. Given
$x_n\ne x_{n+1}$, set
\Centerline{$x_{n+2}
=\Bover{x_nf(x_{n+1})-x_{n+1}f(x_n)}{f(x_{n+1})-f(x_n)}$;}
\noindent continue.
{\it The picture\/}: imagine a straight line drawn through the points
$(x_n,f(x_n))$ and $(x_{n+1},f(x_{n+1}))$ on the graph of $f$; this
crosses the horizontal axis at $x_{n+2}$.
{\it What if $f(x_{n+1})=f(x_n)$\/}? There is a theoretical
possibility that we may come to points $x_n$, $x_{n+1}$ such that
$f(x_n)$ and $f(x_{n+1})$ are either equal, or cannot be distinguished
from each other by the method we have for calculating $f$. In this
case, the algorithm must be abandoned.
{\it When should you stop\/}? Normally you use this method only when
you expect to be satisfied when you have found an $x$ such that
$|f(x)|\le\epsilon$ for some pre-determined $\epsilon>0$.
There is however a serious danger that the method will break down
altogether, with the $x_n$ moving apart rather than together as the
algorithm proceeds. As a general rule, therefore, you should be
willing to abandon the method whenever $|f(x_{n+2})|\ge|f(x_{n+1})|$.
\medskip
{\bf (b)}{\it Advantages of the method} (i) When $f$ is differentiable,
and there is a solution $x$ with $f(x)=0\ne f'(x)$, we can hope that the
values $f(x_n)$ will converge rapidly to zero; typically, when the
method is successful, we find that, for large $n$, $|f(x_n)|$ is of the
order of $\alpha\exp(-\beta((1+\sqrt 5)/2)^n)$ for some $\alpha$,
$\beta>0$. In contrast, for similar functions, binary search will make
$|f(x_n)|$ about $\alpha\exp(-n\ln 2)$.
\quad(ii) Although the method cannot be expected to work for
non-differentiable functions $f$, it does not rely on our being able to
calculate the derivative of $f$.
\medskip
{\it Disadvantages of the method} (i) It is rarely effective unless
there is a solution $x$ for which $f$ is differentiable at $x$.
\quad(ii) Even if $f$ is very well behaved, if we start from the `wrong'
points $x_0$ and $x_1$ the method can fail.
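The recipe of \zz{3B}, with the abandon-if-not-improving safeguard, might be coded as follows; the names and tolerances are mine, for illustration only.

```python
def false_position(f, a, b, eps=1e-10, max_iter=100):
    """Seek x with |f(x)| <= eps by the rule of 3B:
    x_{n+2} = (x_n f(x_{n+1}) - x_{n+1} f(x_n)) / (f(x_{n+1}) - f(x_n)).

    Returns None if two f-values cannot be distinguished or if the
    values |f(x_n)| stop decreasing (the method is breaking down).
    """
    x0, x1 = a, b
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1) <= eps:
            return x1
        if f1 == f0:                 # chord is horizontal: give up
            return None
        x2 = (x0 * f1 - x1 * f0) / (f1 - f0)
        f2 = f(x2)
        if abs(f2) >= abs(f1):       # diverging: abandon the method
            return None
        x0, f0, x1, f1 = x1, f1, x2, f2
    return None

root = false_position(lambda x: x * x - 2.0, 1.0, 2.0)
```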
\bigskip
{\bf\zz{3C} Newton's method (one-dimensional case)}
\medskip
\inset{\noindent To solve the equation $f(x)=0$ starting from an initial
point $a$.}
\medskip
{\bf (a)}{\it The algorithm\/}: Set $x_0=a$. Given
$x_n$, set
\Centerline{$x_{n+1}=x_n-\Bover{f(x_n)}{f'(x_n)}$;}
\noindent continue.
{\it The picture\/}: imagine a tangent to the graph of $f$ drawn
through the point $(x_n,f(x_n))$; this crosses the horizontal axis at
$x_{n+1}$.
{\it What if $f'(x_n)=0$\/}? There is a theoretical possibility that
we may come to a point $x_n$ such that $f'(x_n)$ is either zero, or
cannot be distinguished from zero by the method we have for calculating
$f'$. In this case, the algorithm must be abandoned.
{\it When should you stop\/}? Normally you use this method only when
you expect to be satisfied when you have found an $x$ such that
$|f(x)|\le\epsilon$ for some pre-determined $\epsilon>0$.
As with the method of false position, the method can break down
altogether, with the $x_n$ moving apart rather than together as the
algorithm proceeds. You should therefore be willing to abandon the
method if you ever find that $|f(x_{n+1})|\ge|f(x_n)|$.
\medskip
{\bf (b)}{\it Advantage of the method} When $f$ is differentiable, and
there is a solution $x$ with $f(x)=0\ne f'(x)$, we can hope that the
values $f(x_n)$ will converge rapidly to zero.
\medskip
{\it Disadvantages of the method} (i) It requires calculation of $f'(x)$
as well as $f(x)$.
\quad(ii) Even if $f$ is very well behaved, if we start from the `wrong'
point $a$ the method can fail.
\quad(iii) Each step of the algorithm requires two calculations, since
we need to know both $f(x_n)$ and $f'(x_n)$. Accordingly, even though
the number of {\it iterations} is generally smaller (for a given
$\epsilon>0$) than it is for the method of false position, the number of
{\it calculations} is generally higher. For this reason, the method
can be recommended only when it is actually quicker to calculate $f'$
than to calculate $f$. (This may be the case if $f(x)$ is defined as
an integral $\int_0^xg(t)dt$, so that $f'(x)=g(x)$ but we need a
numerical integration to find $f(x)$.)
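A sketch of \zz{3C} in the same illustrative style, with the safeguard of abandoning the method when the values $|f(x_n)|$ stop decreasing:

```python
def newton(f, fprime, a, eps=1e-12, max_iter=50):
    """Newton's rule of 3C: x_{n+1} = x_n - f(x_n)/f'(x_n).

    Returns None if f' cannot be distinguished from zero or if the
    values |f(x_n)| stop decreasing.
    """
    x, fx = a, f(a)
    for _ in range(max_iter):
        if abs(fx) <= eps:
            return x
        d = fprime(x)
        if d == 0.0:                 # tangent is horizontal: abandon
            return None
        x_next = x - fx / d
        fx_next = f(x_next)
        if abs(fx_next) >= abs(fx):  # diverging: abandon
            return None
        x, fx = x_next, fx_next
    return None

# solve x^2 - 2 = 0, supplying the derivative 2x directly
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```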
\bigskip
{\bf\zz{3D} Minimization in one dimension} If we have a differentiable
real function $f$, with an effective method of calculating its
derivative, then the best route to a (local) minimum of $f$ is likely to
be to solve the equation $f'(x)=0$. If you seek to do this with binary
search, remember that you want $f'$ to be negative on the left and
positive on the right, so you must make sure that you start with
$f'(a)<0$ and $f'(b)>0$; this will necessarily (provided $f'$ is
defined and continuous in $[a,b]$) lead to a local minimum of $f$.
If you are using the method of false position or (when $f$ is twice
differentiable) Newton's method, again remember to check that
$f'(x_{n+1})-f'(x_n)$ always has the same sign as $x_{n+1}-x_n$, since
otherwise you may be chasing a maximum rather than a minimum.
I give two further methods which do not depend on calculation of $f'$.
\bigskip
{\bf\zz{3E} Three-point method}
\inset{\noindent To find a local minimum of $f$, starting from points
$a$, $b$, $c$ such that $a<b<c$ and $f(b)<\min(f(a),f(c))$.}

\medskip

{\bf (a)}{\it The algorithm\/}: Set $x_0=a$, $y_0=b$, $z_0=c$. Given
$x_n<y_n<z_n$ with $f(y_n)<\min(f(x_n),f(z_n))$, set
$x=\bover12(x_n+y_n)$ if $y_n-x_n\ge z_n-y_n$, $x=\bover12(y_n+z_n)$
otherwise;

\inset{-- if $x=\bover12(x_n+y_n)$ then

\inset{-- if $f(x)<f(y_n)$ set $x_{n+1}=x_n$, $y_{n+1}=x$,
$z_{n+1}=y_n$;

-- if $f(x)>f(y_n)$ set $x_{n+1}=x$, $y_{n+1}=y_n$, $z_{n+1}=z_n$;}

-- if $x=\bover12(y_n+z_n)$ then

\inset{-- if $f(x)<f(y_n)$ set $x_{n+1}=y_n$, $y_{n+1}=x$,
$z_{n+1}=z_n$;

-- if $f(x)>f(y_n)$ set $x_{n+1}=x_n$, $y_{n+1}=y_n$, $z_{n+1}=x$.}}
{\it What if $f(x)=f(y_n)$\/}? There is a theoretical possibility that
we may come to points $x$, $y_n$ such that $f(x)$ and $f(y_n)$ are
either equal, or cannot be distinguished from each other by the method
we have for calculating $f$. In this case, it may be worth looking at
the three points $x$, $\bover12(x+y_n)$ and $y_n$; but if you cannot
quickly find a convincing triplet in which the middle term gives the
least value of $f$, the algorithm becomes unsafe.
{\it When should you stop\/}? Normally the stopping criterion will be
when you have a triplet $(x_n,y_n,z_n)$ with $z_n-x_n\le\delta$ for some
pre-determined $\delta>0$.
{\it When does the method work\/}? Provided $f$ is continuous and
defined on the whole initial interval $[a,c]$, the algorithm will halt
in finitely many steps, and there will then be a local minimum in the
interval $[x_n,z_n]$.
\medskip
{\bf (b)}{\it Advantages of the method} (i) It works for all continuous
functions $f$, provided we can find initial values $a$, $b$ and $c$ for
which $f(b)<\min(f(a),f(c))$ and $f$ is defined on the whole interval
$[a,c]$.
\quad (ii) The values $f(x)$ need be calculated only to sufficient
accuracy to determine which of $f(x)$, $f(y_n)$ is least. If we set
our $f$-calculating routine to produce only approximate values, then we
may need to re-calculate some of the $f(y_n)$ to greater accuracy when
we find that the new $f(x)$ is at first sight indistinguishable.
\quad(iii) When the algorithm halts (other than in the exceptional case
in which we have found an $x$ for which we cannot distinguish $f(x)$
from $f(y_n)$), we shall have an explicit interval-of-uncertainty
$[x_n,z_n]$ with a guarantee that there is at least one solution in that
interval.
\quad(iv) If our stopping criterion is that the interval-of-uncertainty
$[x_n,z_n]$ should be of length at most $\delta$, then we can estimate
at the start how many times we shall have to calculate $f$, since every
step except perhaps one must reduce the length of the interval of
uncertainty by at least 25\%.
\medskip
{\it Disadvantages of the method} (i) The geometric reduction in the
interval-of-uncertainty is slow.
\quad(ii) Unless we have further information about our function $f$
(e.g., that its derivative is bounded, or that it is convex) we can
never be sure that our `best' value $f(y_n)$ is really close to the
minimum value of $f$.
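The three-point recipe, bisecting the longer of the two sub-intervals, might be coded as below. The details are mine and illustrative; ties in the comparison of $f$-values are resolved by keeping the old middle point, so the exceptional indistinguishable case is not treated separately.

```python
def three_point(f, a, b, c, delta=1e-8):
    """Shrink a triplet x < y < z with f(y) < min(f(x), f(z)) by
    probing the midpoint of the longer sub-interval (method 3E).

    Returns (x, y, z) with z - x <= delta; a local minimum lies in [x, z].
    """
    x, y, z = a, b, c
    fy = f(y)
    assert x < y < z and fy < min(f(x), f(z))
    while z - x > delta:
        if y - x >= z - y:           # probe the longer, left interval
            t = 0.5 * (x + y)
            ft = f(t)
            if ft < fy:
                y, z, fy = t, y, ft  # new triplet (x, t, y)
            else:
                x = t                # new triplet (t, y, z)
        else:                        # probe the right interval
            t = 0.5 * (y + z)
            ft = f(t)
            if ft < fy:
                x, y, fy = y, t, ft  # new triplet (y, t, z)
            else:
                z = t                # new triplet (x, y, t)
    return x, y, z

# bracket the minimum of f(t) = (t - 1)^2
x, y, z = three_point(lambda t: (t - 1.0) ** 2, -2.0, 0.5, 3.0)
```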
\bigskip
{\bf\zz{3F} Quadratic fit}
\inset{\noindent To find a local minimum of $f$, starting from distinct
points $a$, $b$ and $c$.}
\medskip
{\bf (a)} {\it The algorithm\/}: Set $x_0=a$, $x_1=b$ and $x_2=c$.
Given that $x_n$, $x_{n+1}$ and $x_{n+2}$ are distinct, set
\Centerline{$x_{n+3}
=\Bover{(x_{n+1}^2-x_{n+2}^2)f(x_n)+(x_{n+2}^2-x_n^2)f(x_{n+1})
+(x_n^2-x_{n+1}^2)f(x_{n+2})}
{2((x_{n+1}-x_{n+2})f(x_n)+(x_{n+2}-x_n)f(x_{n+1})
+(x_n-x_{n+1})f(x_{n+2}))}$;}
\noindent continue.
{\it The picture\/}: imagine a parabola drawn through the points
$(x_n,f(x_n))$, $(x_{n+1},f(x_{n+1}))$ and $(x_{n+2},f(x_{n+2}))$ on the
graph of $f$; this will be horizontal at $x_{n+3}$. (Its equation is
\Centerline{$g(x)
=\Bover{(x-x_{n+1})(x-x_{n+2})}{(x_n-x_{n+1})(x_n-x_{n+2})}f(x_n)
+\Bover{(x-x_n)(x-x_{n+2})}{(x_{n+1}-x_n)(x_{n+1}-x_{n+2})}f(x_{n+1})
+\Bover{(x-x_n)(x-x_{n+1})}
{(x_{n+2}-x_n)(x_{n+2}-x_{n+1})}f(x_{n+2})$.)}
{\it What if
$(x_{n+1}-x_{n+2})f(x_n)+(x_{n+2}-x_n)f(x_{n+1})+(x_n-x_{n+1})f(x_{n+2})
=0$\/}? There is a theoretical possibility that we may come to points
$x_n$, $x_{n+1}$ and $x_{n+2}$ such that the three points
$(x_n,f(x_n))$, $(x_{n+1},f(x_{n+1}))$ and $(x_{n+2},f(x_{n+2}))$ are
either not all distinct or are collinear, so that we cannot draw a
unique parabola through them.
Equally catastrophic, but not so obvious and not requiring such a
coincidence, is the possibility that the parabola is the wrong way up,
so that we are looking at a {\it maximum} of the approximating function
$g$ rather than a minimum. Any computer program for this algorithm
must therefore check that the coefficient of $x^2$ in $g$ is positive,
that is, that
\Centerline{$\Bover{f(x_n)}{(x_n-x_{n+1})(x_n-x_{n+2})}
+\Bover{f(x_{n+1})}{(x_{n+1}-x_n)(x_{n+1}-x_{n+2})}
+\Bover{f(x_{n+2})}{(x_{n+2}-x_{n+1})(x_{n+2}-x_n)}>0$.}
\noindent If this is not the case the algorithm must be abandoned at
once. It would also be sensible to abort the process if you ever find
that $f(x_{n+3})$ is not less than $f(x_{n+2})$.
{\it When should you stop\/}? There is no built-in criterion for
stopping. If, however, you are interested in finding an interval of
length at most $\delta$ with a local minimum in it, you can continue the
method until $|x_{n+3}-x_n|\le\bover12\delta$, and then look at the
values $f(x_{n+3})$, $f(x_n)$ and $f(z)$ where $z=x_{n+3}\pm\bover{\delta}2$
is on the other side of $x_{n+3}$ from $x_n$; if $f(x_{n+3})$ is the
least of these, then you will know (provided, as usual, that $f$ is
continuous and defined on the whole interval between $x_n$ and $z$) that
you have a local minimum in this interval.
\medskip
{\bf (b)}{\it Advantages of the method} (i) When $f$ is twice
differentiable, and there is a solution $x$ with $f'(x)=0\ne f''(x)$, we
can hope that the $x_n$ will converge rapidly to a stationary point.
\quad(ii) Although the method cannot be expected to work for
non-differentiable functions $f$, it does not rely on our being able to
calculate the derivative of $f$.
\medskip
{\it Disadvantages of the method} (i) It is rarely effective unless
there is a solution $x$ for which $f$ is twice differentiable at $x$.
\quad(ii) Even if $f$ is very well behaved, if we start from the `wrong'
points $x_0$, $x_1$ and $x_2$ the method can fail.
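A sketch of \zz{3F}, including the check that the coefficient of $x^2$ in the interpolating parabola is positive. The stopping rule here (stop when successive points agree to within a tolerance) is a simplification of mine, not the interval criterion described above.

```python
def quadratic_fit(f, a, b, c, tol=1e-10, max_iter=100):
    """Method 3F: fit a parabola through three points, jump to its vertex.

    Returns the latest point, or None if the parabola degenerates or
    opens downward.
    """
    x0, x1, x2 = a, b, c
    for _ in range(max_iter):
        f0, f1, f2 = f(x0), f(x1), f(x2)
        denom = 2.0 * ((x1 - x2) * f0 + (x2 - x0) * f1 + (x0 - x1) * f2)
        if denom == 0.0:             # points coincident or collinear
            return None
        # coefficient of x^2 in the parabola must be positive
        lead = (f0 / ((x0 - x1) * (x0 - x2))
                + f1 / ((x1 - x0) * (x1 - x2))
                + f2 / ((x2 - x0) * (x2 - x1)))
        if lead <= 0.0:              # wrong way up: abandon at once
            return None
        x3 = ((x1 ** 2 - x2 ** 2) * f0 + (x2 ** 2 - x0 ** 2) * f1
              + (x0 ** 2 - x1 ** 2) * f2) / denom
        if abs(x3 - x2) <= tol:
            return x3
        x0, x1, x2 = x1, x2, x3
    return None

# minimum of (t - 1)^2 + 3: an exactly quadratic test case
xmin = quadratic_fit(lambda t: (t - 1.0) ** 2 + 3.0, -1.0, 0.0, 2.0)
```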
\bigskip
{\bf\zz{3G} Remarks} Observe here that `binary search' and the
`three-point method' go together as slow-but-sure, while `false
position', `Newton's method' and `quadratic fit' are
fast-but-unsafe. If, therefore, you are interested in writing programs
which will be effective with totally unknown functions, you should be
willing to alternate between methods; e.g., using a `slow' method until
you seem to be within sight of an answer, then a `fast' method until it
shows signs of breaking down, and back to the `slow' method for a couple
of steps if necessary, or to confirm that the `fast' method has indeed
located a solution, as suggested in \zz{3F}.
\bigskip
{\bf\zz{3H} Many dimensions} When we come to optimization in more than
one dimension, we have difficulties of new kinds, and must expect to
move very much more slowly. I describe two methods, neither of them
either fast or safe.
\medskip
\noindent{\bf Line search: steepest descent}
\inset{\noindent To minimize a differentiable real function $f$ of many
real variables, starting from an initial point $\tbf{a}$.}
\medskip
{\it The algorithm\/}: Set $\tbf{x}^{(0)}=\tbf{a}$.
Given $\tbf{x}^{(n)}$, set $\tbf{u}^{(n)}=\nabla f(\tbf{x}^{(n)})$,
$g_n(t)=f(\tbf{x}^{(n)}-t\tbf{u}^{(n)})$ for $t\in\Bbb R$; seek a
minimum $t_n$ for the real function $g_n$ by any of the one-dimensional
methods \zz{3D}-\zz{3F}, or otherwise; set
$\tbf{x}^{(n+1)}=\tbf{x}^{(n)}-t_n\tbf{u}^{(n)}$.
{\it The idea\/}: At least for $\tbf{x}\bumpeq\tbf{0}$, we expect
$f(\tbf{x}^{(n)}+\tbf{x})$ to be approximately
$f(\tbf{x}^{(n)})+\tbf{u}^{(n)T}\tbf{x}$. So we can reduce $f$ fastest
by moving in the direction $-\tbf{u}^{(n)}$. Since one-dimensional
searching is enormously faster than anything we can do in more than one
dimension, it is worth investing a bit of time in estimating how far to go. But note that it is not worth while to locate $t_n$
particularly accurately, at least to begin with, since we expect to have
to repeat the process.
{\it When should you stop\/}? The natural criterion to impose is that
you terminate when $\|\nabla f(\tbf{x}^{(n)})\|$ seems to be
small enough. But unless you know a lot about the function $f$ there
is no way to be sure that you are close even to a local minimum.
{\it Remarks} (i) Note that since we can calculate $\nabla f$, we shall
also be able to calculate
\Centerline{$g_n'(t)
=-\nabla f(\tbf{x}^{(n)}-t\tbf{u}^{(n)})^T\tbf{u}^{(n)}$}
\noindent for any $t$, and when minimising $g_n$ we have the option of
seeking a solution of $g'_n=0$, as in \zz{3D}.
\quad(ii) It is helpful also to note that as
$g'_n(0)=-\|\tbf{u}^{(n)}\|^2<0$ (except in the case
$\tbf{u}^{(n)}=\tbf{0}$, in which case we expect to stop), we know that
we are looking for a minimum $t_n>0$; so if we want to use binary
search, we will try to start with a pair $(0,\hat t)$ where $\hat t>0$
and $g'_n(\hat t)>0$. But in many cases finding such a $\hat t$ is
already a non-trivial problem.
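Steepest descent might be sketched as follows. In place of the one-dimensional methods \zz{3D}-\zz{3F}, the line search here is a crude halving scheme of my own choosing (in the spirit of the remark above that $t_n$ need not be located accurately); names and the test function are illustrative.

```python
def steepest_descent(f, grad, x0, tol=1e-6, max_iter=500):
    """Method 3H: repeatedly minimise (roughly) along -grad f.

    The one-dimensional search simply halves t until f decreases.
    """
    x = list(x0)
    for _ in range(max_iter):
        u = grad(x)
        if max(abs(ui) for ui in u) <= tol:   # gradient small: stop
            return x
        t = 1.0
        fx = f(x)
        while t > 1e-20:
            trial = [xi - t * ui for xi, ui in zip(x, u)]
            if f(trial) < fx:
                x = trial
                break
            t *= 0.5
        else:
            return x                 # no descent step found: stop
    return x

# minimise f(x, y) = (x - 1)^2 + 10 (y + 2)^2
f = lambda v: (v[0] - 1.0) ** 2 + 10.0 * (v[1] + 2.0) ** 2
grad = lambda v: [2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)]
xmin = steepest_descent(f, grad, [0.0, 0.0])
```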
\bigskip
{\bf\zz{3I} Newton's method} If we can calculate the second derivative
$\tbf{H}f$, we have a further method available.
\medskip
\inset{\noindent To minimize a twice differentiable real function $f$ of
many real variables, starting from an initial point $\tbf{a}$.}
\medskip
{\it The algorithm\/}: Set $\tbf{x}^{(0)}=\tbf{a}$.
Given $\tbf{x}^{(n)}$, set
\Centerline{$\tbf{x}^{(n+1)}
=\tbf{x}^{(n)}-\tbf{H}f(\tbf{x}^{(n)})^{-1}\nabla f(\tbf{x}^{(n)})$;}
\noindent continue.
{\it The idea\/}: At least for $\tbf{x}\bumpeq\tbf{x}^{(n)}$, we expect
$\nabla f(\tbf{x})$ to be approximately
$\nabla f(\tbf{x}^{(n)})+\tbf{H}f(\tbf{x}^{(n)})(\tbf{x}-\tbf{x}^{(n)})$. So we can look for a solution of $\nabla f(\tbf{x})=\tbf{0}$ by solving
$\nabla f(\tbf{x}^{(n)})+\tbf{H}f(\tbf{x}^{(n)})(\tbf{x}-\tbf{x}^{(n)})
=\tbf{0}$, that is, $\tbf{x}
=\tbf{x}^{(n)}-\tbf{H}f(\tbf{x}^{(n)})^{-1}\nabla f(\tbf{x}^{(n)})$.
{\it When should you stop\/}? As before, the natural criterion to
impose is that you terminate when $\|\nabla f(\tbf{x}^{(n)})\|$ seems to
be small enough. Here we are supposing that we can calculate
$\tbf{H}f$, so it would be sensible to check that
$\tbf{H}f(\tbf{x}^{(n)})$ is positive definite, giving us a fair chance
that we are close to a minimum.
{\it Remark} As usual, Newton's method is unsafe in that it can easily
go astray if we start in the wrong place; steepest descent will at
least give $f(\tbf{x}^{(n+1)})\le f(\tbf{x}^{(n)})$. If our function
is indeed twice differentiable, and the second derivative $\tbf{H}f$ is
continuous, and there is a solution $\tbf{x}^*$ at which
$\nabla f(\tbf{x}^*)=\tbf{0}$ and $\tbf{H}f(\tbf{x}^*)$ is positive
definite,
and if we start close to $\tbf{x}^*$, then Newton's method can be
expected to converge rapidly to $\tbf{x}^*$.
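For two variables, the recipe of \zz{3I} can be sketched directly, solving the $2\times 2$ system by Cramer's rule; everything here (names, the quadratic test function) is illustrative only, and no safeguard against divergence is included.

```python
def newton_nd(grad, hess, x0, tol=1e-10, max_iter=50):
    """Method 3I in two variables: x <- x - Hf(x)^{-1} grad f(x)."""
    x, y = x0
    for _ in range(max_iter):
        g1, g2 = grad(x, y)
        if abs(g1) <= tol and abs(g2) <= tol:
            return x, y
        (a, b), (c, d) = hess(x, y)
        det = a * d - b * c
        if det == 0.0:               # Hessian not invertible: abandon
            return None
        # solve Hf(x) s = grad f(x) by Cramer's rule, then step x <- x - s
        s1 = (g1 * d - b * g2) / det
        s2 = (a * g2 - c * g1) / det
        x, y = x - s1, y - s2
    return x, y

# f(x, y) = x^2 + x y + y^2 - 3x has its minimum at (2, -1);
# for a quadratic f, one Newton step lands exactly on the minimum
grad2 = lambda x, y: (2.0 * x + y - 3.0, x + 2.0 * y)
hess2 = lambda x, y: ((2.0, 1.0), (1.0, 2.0))
res = newton_nd(grad2, hess2, (0.0, 0.0))
```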
\vfill\eject
\noindent{\bf\zz{4}. Convex Programs}
\medskip
{\bf\zz{4A} Definitions (a)} A {\bf convex program} is a problem of the
form
\quad (P) minimise $f(\tbf{x})$
\qquad subject to $\tbf{x}\in C$
\qquad and $g_1(\tbf{x}),\ldots,g_k(\tbf{x})\le 0$
\noindent where $C\subseteq\BbbR^n$ is a non-empty convex set and $f$,
$g_1,\ldots,g_k$ are continuous convex functions defined (at least) on
$C$.
\medskip
{\bf (b)} The {\bf dual} of the convex program (P) is now
\quad (D) maximise $h(\pmb{\lambda})
=\inf_{\tbf{x}\in C}L(\tbf{x},\pmb{\lambda})$
\qquad subject to $\lambda_1,\ldots,\lambda_k\le 0$,
\noindent where
\Centerline{$L(\tbf{x},\pmb{\lambda})
=f(\tbf{x})-\lambda_1g_1(\tbf{x})-\ldots-\lambda_kg_k(\tbf{x})
=f(\tbf{x})-\pmb{\lambda}^T\pmb{g}(\tbf{x})$}
\noindent for $\tbf{x}\in C$, $\lambda\in\BbbR^k$. (Here
$\tbf{g}(\tbf{x})$ is the vector
$\Matrix{g_1(\tbf{x})\\ \ldots \\ g_k(\tbf{x})}$.)
In this expression, I allow myself to write $h(\pmb{\lambda})=-\infty$
if $\{L(\tbf{x},\pmb{\lambda}):\tbf{x}\in C\}$ is unbounded below.
\medskip
{\bf (c)} Suppose that a convex program (P) is set out as in (a) above.
\medskip
\quad{\bf (i)} The {\bf feasible set} of (P) is
$F=\{\tbf{x}:\tbf{x}\in C$, $\tbf{g}(\tbf{x})\le\tbf{0}\}$
\medskip
\quad{\bf (ii)} (P) is {\bf consistent} if $F$ is not empty, that is,
there is an $\tbf{x}\in C$ such that $\tbf{g}(\tbf{x})\le\tbf{0}$.
\medskip
\quad{\bf (iii)} (P) is {\bf superconsistent} if there is an
$\tbf{x}\in C$ such that $g_j(\tbf{x})<0$ for every $j\le k$.
\bigskip
{\bf\zz{4B} Dual optimal pairs (a)} Suppose that a convex program (P)
and its dual (D) are set out as in \zz{4A}. If $\tbf{x}\in F$ and
$\pmb{\lambda}\le\tbf{0}$, then $\lambda_j\le 0$ and $g_j(\tbf{x})\le 0$
for every $j$, so $\lambda_jg_j(\tbf{x})\ge 0$ for every $j$ and
$\pmb{\lambda}^T\pmb{g}(\tbf{x})\ge 0$. Consequently
$$\eqalignno{h(\pmb{\lambda})
&\le L(\tbf{x},\pmb{\lambda})\cr
\displaycause{because $\tbf{x}\in C$}
&=f(\tbf{x})-\pmb{\lambda}^T\tbf{g}(\tbf{x})
\le f(\tbf{x}).\cr}$$
\noindent It follows that
\Centerline{$\sup_{\pmb{\lambda}\le\tbf{0}}h(\pmb{\lambda})
\le\inf_{\tbf{x}\in F}f(\tbf{x})$,}
\noindent where I allow myself to write
$\inf_{\tbf{x}\in F}f(\tbf{x})=\infty$ if $F$ is empty, and
$\inf_{\tbf{x}\in F}f(\tbf{x})=-\infty$ if $f$ is not bounded below on
$F$.
\medskip
{\bf (b)} The {\bf duality gap} of the pair (P), (D) is the interval
\Centerline{$\ooint{\sup_{\pmb{\lambda}\le\tbf{0}}h(\pmb{\lambda}),
\inf_{\tbf{x}\in F}f(\tbf{x})}$.}
\medskip
{\bf (c)} A {\bf dual optimal pair} for (P) and (D) is a pair
$(\tbf{x}^*,\pmb{\lambda}^*)$ such that $\tbf{x}^*\in F$,
$\pmb{\lambda}^*\le\tbf{0}$ and $h(\pmb{\lambda}^*)=f(\tbf{x}^*)$. Note
that in this case we must have
$(\pmb{\lambda}^*)^T\tbf{g}(\tbf{x}^*)=0$, so that
$\lambda^*_jg_j(\tbf{x}^*)=0$ for every $j$. At the same time,
\Centerline{$h(\pmb{\lambda})\le f(\tbf{x}^*)
=h(\pmb{\lambda}^*)\le f(\tbf{x})$}
\noindent whenever $\tbf{x}\in F$ and $\pmb{\lambda}\le\tbf{0}$, so that
$\tbf{x}^*$ is an optimum for (P) and $\pmb{\lambda}^*$ is an optimum
for (D). Observe that if there is a dual optimal pair then the optimal
values of (D) and (P) are the same, so that the duality gap is empty.
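A toy example (mine, not one from the course text) may help fix ideas: take $n=k=1$, $C=\Bbb R$, $f(x)=x^2$ and $g_1(x)=1-x$, so that (P) is `minimise $x^2$ subject to $x\ge 1$', with solution $x^*=1$ and $f(x^*)=1$. Then

```latex
L(x,\lambda)=x^2-\lambda(1-x),\qquad
h(\lambda)=\inf_{x\in\Bbb R}L(x,\lambda)
          =-\tfrac14\lambda^2-\lambda
  \quad\hbox{(attained at $x=-\tfrac12\lambda$),}\qquad
\sup_{\lambda\le 0}h(\lambda)=h(-2)=1=f(1),
```

so $(x^*,\lambda^*)=(1,-2)$ is a dual optimal pair, with $\lambda^*g_1(x^*)=0$ as required in (c).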
\bigskip
{\bf\zz{4C} Why we take the trouble} In the applications of this theory,
we expect to understand the convex set $C$; for instance, it might be
$\{\tbf{x}:\tbf{x}\ge\tbf{0}\}$, or
$\{\tbf{x}:\|\tbf{x}\|<1\}$. We are often very unclear about the set
$F$; we may not even know whether it is empty or not. The point of
the dual problem (D) is that it replaces the single optimisation in (P)
by two optimisations: find $h(\pmb{\lambda})$, for enough
$\pmb{\lambda}$, and then optimise. In each of these, although the
{\it function} $L(\tbf{x},\pmb{\lambda})$ or $h(\pmb{\lambda})$ which we
are minimising or maximising is likely to be more complicated than the
function $f$ involved in (P), the {\it sets} over which we are
optimising, $C$ or $\{\pmb{\lambda}:\pmb{\lambda}\le\tbf{0}\}$ are
likely to be simpler than the set $F$. For this reason, (D) can give
us two problems each a little easier than the original problem (P).
Almost the most important idea in the whole of mathematics is that it is
sometimes worth going the long way round.
Of course we don't like to set out on the longer path without some
assurance that the answer is to be found at the end of it. For this
reason we look for general theorems to assure us that the duality gap is
empty in a particular case; if possible, to guarantee the existence of
a dual optimal pair.
\bigskip
{\bf\zz{4D} Lemma} Let $C\subseteq\BbbR^n$ be a convex set, and $f$,
$g:C\to\Bbb R$ continuous convex functions. Suppose that
$V_f=\{x:x\in C,\,f(x)<0\}$ and $V_g=\{x:x\in C,\,g(x)<0\}$ are
disjoint. Then there is a $t\in[0,1]$ such that $tf+(1-t)g\ge 0$
everywhere on $C$.
\medskip
\noindent{\bf proof} For $t\in[0,1]$, set $W_t=\{x:tf(x)+(1-t)g(x)<0\}$.
Then $W_t$ is convex and $W_t\subseteq V_f\cup V_g$; we seek some $t$
such that $W_t=\emptyset$. Now $G=\{t:W_t\cap V_f\ne\emptyset\}$ and
$H=\{t:W_t\cap V_g\ne\emptyset\}$ are relatively open sets in $[0,1]$
(because $f$ and $g$ are continuous) and are disjoint. \Prf\Quer\ If
$x\in W_t\cap V_f$ and $y\in W_t\cap V_g$, then every point on the line
segment $[x,y]$ belongs to $W_t\subseteq V_f\cup V_g$; but $V_f$ and
$V_g$ meet $[x,y]$ in disjoint non-empty relatively open sets, which is
impossible because $[x,y]$ is connected.\ \Bang\QeD\ Since $W_0=V_g$ is
disjoint from $V_f$, and $W_1=V_f$ from $V_g$, we have $0\notin G$ and
$1\notin H$; as $[0,1]$ is connected, it follows that
$G\cup H\ne[0,1]$. Take any $t\in[0,1]\setminus(G\cup H)$; then
$W_t\subseteq V_f\cup V_g$ meets neither $V_f$ nor $V_g$, so
$W_t=\emptyset$, which is what we need to know.
\bigskip
{\bf\zz{4E} Lemma} Let $C\subseteq\BbbR^n$ be a convex set, and
$f_0,\ldots,f_k:C\to\Bbb R$
continuous convex functions. Suppose that $\max_{i\le k}f_i(x)\ge 0$
for every $x\in C$. Then there is a convex combination $g$ of
$f_0,\ldots,f_k$ such that $g(x)\ge 0$ for every $x\in C$.
\medskip
\noindent{\bf proof} Induce on $k$. For $k=0$ the result is trivial
and for $k=1$ it is the last lemma. For the inductive step to
$k+1\ge 2$, set $C'=\{x:x\in C,\,f_{k+1}(x)<0\}$. Then $C'$ is a
convex subset of $C$, and $\max_{i\le k}f_i(x)\ge 0$ for every
$x\in C'$; so by the inductive hypothesis there is a convex combination
$g_0$ of $f_0,\ldots,f_k$ such that $g_0(x)\ge 0$ for every $x\in C'$.
Now $\{x:x\in C,\,g_0(x)<0\}$ and $\{x:x\in C,\,f_{k+1}(x)<0\}=C'$ are
disjoint, so by Lemma 4D there is a $t\in[0,1]$ such that
$g=tg_0+(1-t)f_{k+1}\ge 0$ everywhere on $C$; and $g$ is a convex
combination of $f_0,\ldots,f_{k+1}$, as required.
\bigskip
{\bf\zz{4F} Theorem} Suppose we are given a superconsistent convex
program (P), set out as in \zz{4A}, which has a solution $\tbf{x}^*$.
Then (P) and its dual (D) have a dual optimal pair.
\medskip
\noindent{\bf proof} Set $\alpha=f(\tbf{x}^*)$. For any $\tbf{x}\in
C$, either $g_i(\tbf{x})\ge 0$ for some $i$, or $\tbf{x}\in F$ and
$f(\tbf{x})-\alpha\ge 0$; thus
$\max(f(\tbf{x})-\alpha,g_1(\tbf{x}),\ldots,g_k(\tbf{x}))\ge 0$ for
every $\tbf{x}\in C$. By Lemma 4E, there is a convex combination
$g=\beta(f-\alpha)+\sum_{i=1}^k\beta_ig_i$ of these functions such that
$g(\tbf{x})\ge 0$ for every $\tbf{x}\in C$. Because (P) is
superconsistent, there is an $\tbf{x}_0\in C$ with $g_i(\tbf{x}_0)<0$
for every $i$; if we had $\beta=0$, then $g(\tbf{x}_0)$ would be
$\sum_{i=1}^k\beta_ig_i(\tbf{x}_0)<0$, which is impossible; so
$\beta>0$ and $\beta_i\ge 0$ for every $i$. Set
$\lambda_i^*=-\beta_i/\beta$ for each $i$, so that
$L(\tbf{x},\pmb{\lambda}^*)=\alpha+\beta^{-1}g(\tbf{x})\ge\alpha$ for
every $\tbf{x}\in C$ and $h(\pmb{\lambda}^*)\ge\alpha$. But also
$h(\pmb{\lambda}^*)\le f(\tbf{x}^*)=\alpha$ so
$h(\pmb{\lambda}^*)=\alpha=f(\tbf{x}^*)$ and we have a dual optimal
pair.
\bigskip
{\bf\zz{4G} Theorem} Suppose we are given a consistent convex program
(P) in the form set out in \zz{4A} in which $C$ is closed and bounded.
Then the duality gap is empty.
\medskip
\noindent{\bf proof} Because
$F=\{\tbf{x}:\tbf{x}\in C$, $\tbf{g}(\tbf{x})\le 0\}$, $\tbf{g}$ is
continuous and $C$ is closed, $F$ is closed. Because $C$ is bounded,
$F$ is bounded; because (P) is consistent, $F$ is not empty; because
$f$ is continuous, it attains its minimum on $F$ at $\tbf{x}^*$ say,
which is now a solution of (P).
Set $\alpha=f(\tbf{x}^*)$. Let $\epsilon>0$. Then
$C'=\{\tbf{x}:\tbf{x}\in C$, $f(\tbf{x})\le\alpha-\epsilon\}$ is a
compact set not meeting the compact set
$\{\tbf{x}:\tbf{x}\in C$, $\max_{j\le k}g_j(\tbf{x})\le 0\}$. So there
is an $\eta>0$ such that $C'$ does not meet
$\{\tbf{x}:\tbf{x}\in C$, $\max_{j\le k}g_j(\tbf{x})\le\eta\}$.
Applying Lemma 4E to $f-\alpha+\epsilon$, $g_j-\eta$ we have $\beta$,
$\beta_j\ge 0$, not all $0$, such that
$\beta(f(\tbf{x})-\alpha+\epsilon)
+\sum_{j=1}^k\beta_j(g_j(\tbf{x})-\eta)\ge 0$ for every $\tbf{x}\in C$.
Since $g_j(\tbf{x}^*)\le 0$ for every $j$, $\beta>0$ and
$f(\tbf{x})-\alpha+\epsilon
+\sum_{j=1}^k\Bover{\beta_j}{\beta}(g_j(\tbf{x})-\eta)\ge 0$ for every
$\tbf{x}\in C$.
Set $\lambda_j=-\Bover{\beta_j}{\beta}$ for $j\le k$. Then
$$\eqalign{L(\tbf{x},\pmb{\lambda})
&=f(\tbf{x})+\sum_{j=1}^k\Bover{\beta_j}{\beta}g_j(\tbf{x})\cr
&\ge f(\tbf{x})+\sum_{j=1}^k\Bover{\beta_j}{\beta}(g_j(\tbf{x})-\eta)
\ge\alpha-\epsilon\cr}$$
\noindent for every $\tbf{x}\in C$. Taking the minimum over $\tbf{x}$,
$h(\pmb{\lambda})\ge\alpha-\epsilon$. As $\epsilon$ is arbitrary,
$\sup_{\pmb{\lambda}\le\tbf{0}}h(\pmb{\lambda})\ge\alpha$ and the
duality gap is $\ooint{\alpha,\alpha}=\emptyset$.
\bigskip
{\bf\zz{4H} Theorem} Suppose we are given a convex program (P) and its
dual (D) set out in the form given in \zz{4A}. Suppose moreover that
\inset{$f$ and every $g_j$ is differentiable;
we have a differentiable function $\tbf{x}^*(\pmb{\lambda})$, defined at
least for every $\pmb{\lambda}\le\tbf{0}$, such that
$\tbf{x}^*(\pmb{\lambda})\in C$ and
$h(\pmb{\lambda})=L(\tbf{x}^*(\pmb{\lambda}),\pmb{\lambda})$ for each
$\pmb{\lambda}\le\tbf{0}$;
we have a $\pmb{\lambda}^*\le\tbf{0}$ such that
$h(\pmb{\lambda})\le h(\pmb{\lambda}^*)$ for every
$\pmb{\lambda}\le\tbf{0}$;
$\tbf{x}^*(\pmb{\lambda}^*)\in\interior C$.}
\noindent Then $(\tbf{x}^*(\pmb{\lambda}^*),\pmb{\lambda}^*)$ is a dual
optimal pair for (P) and (D).
\medskip
\noindent{\bf Remark} Note that in this theorem we do not have to
examine the set $F$ at any point, even to confirm that it is not empty.
We need a comparatively strong hypothesis on the functions $f$ and $g_j$
(in the other theorems it is enough if they are continuous), and we need
to be able to solve (D) in a very strong sense, because we have to solve
the problems
\Centerline{minimise $L(\tbf{x},\pmb{\lambda})$ for $\tbf{x}\in C$}
\noindent in a `uniform' way, getting the solution as a (differentiable)
function $\tbf{x}^*(\pmb{\lambda})$, and then we need to compute the
optimum for $h$. But {\it if} we can do all this, and $C$ is open, or
for some other reason $\tbf{x}^*(\pmb{\lambda}^*)$ belongs to its
interior, we shall have a solution for (P).
\medskip
\noindent{\bf proof (a)} The function $\tbf{x}^*$ is continuous, so
\Centerline{$G=\{\pmb{\lambda}:\tbf{x}^*(\pmb{\lambda})\in\interior C\}$}
\noindent is a neighbourhood of $\pmb{\lambda}^*$. Now for any
$\pmb{\lambda}\in G$, $i\le n$ we have
\Centerline{$\Pd{f}{x_i}(\tbf{x}^*(\pmb{\lambda}))
=\sum_{l=1}^k\lambda_l\Pd{g_l}{x_i}(\tbf{x}^*(\pmb{\lambda}))$.}
\noindent\Prf\ Because all our functions are differentiable, so is
$\tbf{x}\mapsto L(\tbf{x},\pmb{\lambda})$. This function takes an
optimal value at $\tbf{x}^*(\pmb{\lambda})$; but as this belongs to the
interior of $C$, this must be a stationary point, that is,
$\nabla L(\tbf{x},\pmb{\lambda})$ must be zero at
$\tbf{x}^*(\pmb{\lambda})$. But this means that
$$\eqalign{0
&=\Pd{}{x_i}L(\tbf{x},\pmb{\lambda})\cr
&=\Pd{}{x_i}(f(\tbf{x})-\sum_{l=1}^k\lambda_lg_l(\tbf{x}))
=\Pd{f}{x_i}-\sum_{l=1}^k\lambda_l\Pd{g_l}{x_i}\cr}$$
\noindent at $\tbf{x}^*(\pmb{\lambda})$; rearranging, we have the given
formula.\ \Qed
\medskip
{\bf (b)} Next, we have
$h(\pmb{\lambda})=L(\tbf{x}^*(\pmb{\lambda}),\pmb{\lambda})$ for
$\pmb{\lambda}\le\tbf{0}$, so this also is differentiable. It does not
follow that $\nabla h$ is zero at $\pmb{\lambda}^*$, because it may be
that $\lambda_j^*=0$ for one or more $j$. But since we are always
allowed to shift $\lambda_j^*$ {\it downwards} without leaving the
feasible set $\{\pmb{\lambda}:\pmb{\lambda}\le\tbf{0}\}$, we must have
$\Pd{h}{\lambda_j}\ge 0$ at $\pmb{\lambda}^*$, with equality if
$\lambda_j^*<0$. Now this means that, at $\pmb{\lambda}^*$,
$$\eqalign{0
&\le\Pd{h}{\lambda_j}
=\Pd{}{\lambda_j}L(\tbf{x}^*,\pmb{\lambda})\cr
&=\Pd{}{\lambda_j}(f(\tbf{x}^*)-\sum_{l=1}^k\lambda_lg_l(\tbf{x}^*))\cr
&=\sum_{i=1}^n\Pd{f}{x_i}\Pd{x_i^*}{\lambda_j}
-g_j(\tbf{x}^*)-\sum_{l=1}^k\lambda_l\Pd{}{\lambda_j}g_l(\tbf{x}^*)\cr
&=\sum_{i=1}^n\Pd{f}{x_i}\Pd{x_i^*}{\lambda_j}
-g_j(\tbf{x}^*)
-\sum_{l=1}^k\sum_{i=1}^n\lambda_l\Pd{g_l}{x_i}\Pd{x_i^*}{\lambda_j}\cr
&=\sum_{i=1}^n\Pd{x_i^*}{\lambda_j}(\Pd{f}{x_i}
-\sum_{l=1}^k\lambda_l\Pd{g_l}{x_i})
-g_j(\tbf{x}^*)
=-g_j(\tbf{x}^*)\cr}$$
\noindent with equality when $\lambda^*_j\ne 0$. Accordingly
$g_j(\tbf{x}^*)\le 0$ for every $j$, and $\tbf{x}^*\in F$; moreover,
$\lambda_j^*g_j(\tbf{x}^*)=0$ for every $j$ so
\Centerline{$h(\pmb{\lambda}^*)=L(\tbf{x}^*,\pmb{\lambda}^*)
=f(\tbf{x}^*)$,}
\noindent and $\tbf{x}^*=\tbf{x}^*(\pmb{\lambda}^*)$, $\pmb{\lambda}^*$
are a dual optimal pair.
\end