OptCorner: The price of derivatives - Derivative-free Optimization

Posted on
22 Nov 2017

$ \newcommand{\R}{\mathbb{R}} \newcommand{\costR}{\mathcal{R}} $

This post is a part of The NAG Optimization Corner series.

In the previous post we discussed ways to provide derivatives and we focussed on a finite difference (FD) approximation. This time we address, in more detail, algorithms which neither require derivatives nor approximate them internally via finite differences. This class of optimization algorithms is usually referred to as Derivative-free Optimization (DFO).

We assume that the problems are of the form \begin{equation*} \min_{x\in \R^n}{f(x)} \end{equation*} where $f$ is a nonlinear smooth objective function which might also be

  • a black box in which the derivatives are unknown,
  • expensive to evaluate, or
  • potentially noisy.
Such problems are best tackled by DFO solvers. It should be emphasized that whenever the derivatives are available and are easy to compute, a derivative-based method should be preferred.

 

Comparison of DFO approaches

There are three main families of DFO algorithms:

  • stochastic (global) methods (genetic algorithms, Bayesian optimization, ...)
  • direct search methods (Nelder–Mead, Pattern-Search, ...)
  • model-based methods

Stochastic methods do not assume any properties of the objective function (not even the continuity) and explore the variable space by random function evaluations. The evaluations are driven by underlying heuristics which weight an attraction to the best point so far (a possible local minimum) and the global search of the space. As such they are not trapped to one local minimum so they provide an approximation for a global solution of the problem. They can be very successful on some applications (for example, non-smooth problems), however, they usually require a huge number of objective function evaluations and therefore are not suitable for expensive functions.

Direct search methods (including the very popular Nelder–Mead simplex algorithm) explore the space in a systematic fashion based on an heuristic from a given starting point. They are somewhat popular due to their ease of implementation but generally converge quite slowly towards a local minimum and model-based methods usually outperform direct search algorithms by a significant margin.

As an illustration, we minimized Rosenbrock's banana function \begin{equation*} f(x,y) = (1-x)^2 + 100(y-x^2)^2 \end{equation*} with four different derivative-free solvers from the NAG Toolbox for MATLAB®: a quasi-Newton algorithm relying on finite differences (e04jy), a particle swarm (PSO) algorithm (e05sa) as a representative of stochastic methods, the Nelder–Mead simplex algorithm (e04cb) and a model-based algorithm (e04jc).

As mentioned earlier, DFO solvers do not approximate derivatives which would normally be used to measure the convergence and to compare the solutions. One way to quantify progress of DFO solvers is to observe how many function evaluations were needed to reach a point with the function value in $\varepsilon$-proximity to the solution, in this case the optimal objective value is $0.0$ (reached at the solution $x=y=1.0$). The following table summarizes the number of objective evaluations needed for each $\varepsilon$ threshold for each solver. The starting point was $[-2,2]$ for all solvers except PSO where a box $-2\leq x,y \leq 2$ was provided instead. (The results for PSO are averaged over five runs.)

$\varepsilon$ $10^{-3}$ $10^{-5}$ $10^{-7}$ $10^{-9}$ $10^{-11}$
PSO, e05sa 947 3316 $\infty$ $\infty$ $\infty$
Quasi-Newton, e04jy 154 152 156 159 172
Nelder–Mead, e04cb 203 221 233 245 263
Model-based, e04jc 135 148 156 160 163

We can observe that stochastic methods as PSO are not a good choice for high precision solutions and require a significant number of function evaluations. In general derivative-based solvers tend to show a rapid convergence towards the solution thus quasi-Newton with FD approximation also demonstrates this behaviour and it easily reaches higher precisions with only a few extra evaluations. The model-based algorithm outperforms Nelder–Mead and in this case is an overall winner with a slightly lower number of function evaluations than quasi-Newton. We will see in the test below that the DFO method is much more robust than methods relying on finite differences for noisy problems, even for very low levels of noise.

Model-based DFO – algorithm description

There are two model-based derivative-free solvers in the NAG Library: e04jc, aimed at general bound-constrained nonlinear problems, and e04ff which is dedicated to least squares calibration problems. They rely on quadratic models that match the value of the objective function $f$ on a set of interpolation points and on a trust region method which monitors a region in which the model is deemed accurate and thus ensures the convergence. A brief description of such an algorithm can be stated as follows:

  1. Initialization
    • Choose a first iterate and an initial trust region around it
    • Choose a set of interpolation points inside the trust region
    • Build a quadratic interpolation model inside the trust region
  2. Loop until convergence is reached
    • Minimize the model in the trust region
    • Evaluate the objective on the new point
    • If the model predicted correctly the decrease of the objective:
      • replace a far away interpolation point by the new point
      • move the trust region around the new point
    • Otherwise
      • either replace some interpolation points if their geometry is not good
      • or shrink the trust region

 

Two iterations of the algorithm are illustrated in the following animation:
Illustration of two iterations of a model-based DFO algorithm

There are two main advantages of this kind of method over methods relying on finite differences of the derivatives. Firstly, the function evaluations are only discarded when they are far from the current iterate and can continue to be useful for many iterations, whereas the evaluations are discarded as soon as a estimation of the gradient is used in a finite-differences framework. Secondly, the algorithm tends to have well spread-out points in the trust region which helps alleviate problems arising when errors appear in the function evaluations.

Experiments with noisy functions

Often in practice, the function evaluations might be imprecise. For example, the evaluations are burdened with small errors because they involve an expensive simulation which is not computed to a full precision or because the problem is based on a stochastic model. Such functions are called noisy. In the previous post we showed that the precision of the finite difference approximation is very quickly lost with a growing error in the evaluation of $f$ (recall that typically it is assumed the precision of $f$ is up to the machine precision $\varepsilon$). It comes as no surprise that FD approximation applied on noisy functions usually do not work at all. In contrast, model-based DFO is more resilient as the evaluations are spread-out and thus small imperfections at each point are better mitigated by their respective distance.

We illustrate the behaviour on a small set of functions from the CUTEst test set: Rosenbrock, Genhumps, Arwhead and Dqdrtic. For each of these functions, we try to solve the problem: \begin{equation*} \min_{x\in \R^n}{\tilde f(x)} \end{equation*} where $\tilde f(x) = f(x) + \epsilon$ and where $\epsilon$ represents some noise following a Normal distribution $N(0,\sigma^2)$.

We solve the test problems again using e04jy, a quasi-Newton algorithm combined with finite differences, and e04jc, the model-based DFO solver. In the following table we show for varying levels of noise $\sigma$ the number of function evaluations needed by the solvers to reach a point that is within $10^3 \cdot \sigma$ of the actual solution (in average after five runs), $\infty$ indicates that the solver could not reach the desired accuracy.

Noise level $\sigma=10^{-10}$ $\sigma=10^{-9}$ $\sigma=10^{-8}$ $\sigma=10^{-7}$ $\sigma=10^{-6}$
Rosenbrock e04jc 373 382 402 345 321
e04jy 388 $\infty$ $\infty$ $\infty$ $\infty$
Genhumps e04jc 310 302 286 242 282
e04jy $\infty$ $\infty$ $\infty$ $\infty$ $\infty$
Arwhead e04jc 80 83 72 63 64
e04jy 221 1088 $\infty$ $\infty$ $\infty$
Dqdrtic e04jc 17 17 17 17 17
e04jy 26 54 1153 $\infty$ $\infty$

Things to remember

Derivative-free optimization might be a viable alternative to the finite difference approximation for problems where derivatives are not easily available. Particularly, model-based DFO algorithms, such as e04jc, represent the state-of-the-art in the field. As the finite differences are extremely sensitive to noise, methods relying on them might easily fail even for a very small amount of noise and should be avoided. It is therefore generally preferable to use dedicated derivative-free solvers.

The first experiment presented above was performed in MATLAB® with the NAG Toolbox with the following files, click here to download (1,771 bytes).