Abstract—This paper reviews the different gradient-based
schemes and the sources of gradient, their availability, precision
and computational complexity, and explores the benefits of
using gradient information within a memetic framework in the
context of continuous parameter optimization, which is labeled
here as Memetic Gradient Search. In particular, we considered
a quasi-Newton method with analytical gradient and finite
differencing, as well as simultaneous perturbation stochastic
approximation, used as the local searches. An empirical study on
the impact of using gradient information showed that Memetic
Gradient Search outperformed the traditional GA and that
an analytical, precise gradient brings considerable benefit to
gradient-based local search (LS) schemes. Though
gradient-based searches can sometimes get trapped in local
optima, memetic gradient searches were still able to converge
faster than the conventional GA.

I. INTRODUCTION

Optimization problems often involve finding one or
more optima of some objective function \(f: \mathbb{R}^D \to \mathbb{R}\),
\(D \ge 1\). In real-world engineering problems, the

objective function of interest is usually multimodal with high

dimensionality. The cost of one measurement of the function

value is not limited to computational effort, but may also

include human labor, time, fuel and many others. Any

attempts to search exhaustively will be prohibitively

expensive due to the huge domain space of these functions,

yet the multimodality implies that a search too narrow in

scope risks mistaking a local optimum for the global optimum.

The design of such optimization algorithms has consequently

become another optimization problem, minimizing the cost,

which is usually bounded by the number of objective function

evaluations, while maximizing chances of finding the global

optimum. Throughout the paper, we only consider

minimization problems as maximization can be trivially

converted to minimization.

Over the years, researchers aiming to tackle these problems

have developed various algorithms, of which the origin may

be traced back to the time of Isaac Newton. Generally,

optimization algorithms may be classified into deterministic

methods, or conventional numerical methods, and stochastic

methods, depending on whether or not the result is totally

determined at the inception of the optimization. Depending

on the scope of the search, they may also be roughly classified

into global searches and local searches.

Manuscript received December 15, 2007.

Boyang Li, Yew-Soon Ong, and Minh Nghia Le are with the School of

Computer Engineering, Nanyang Technological University, Singapore,

639798. (e-mail: libo0001@ntu.edu.sg, asysong@ntu.edu.sg,

lemi0005@ntu.edu.sg)

Chi Keong Goh is with the Data Storage Institute, Agency for Science,

Technology and Research, Singapore (e-mail: ckgoh@nus.edu.sg)

A. Deterministic Search

Deterministic methods can be further classified into direct

searches, or gradient-free algorithms, and gradient-based

methods. In many methods of both classes, a line search is

employed although the heuristics that determine the direction

and step size vary. The strengths of the conventional

numerical methods lie in fast convergence. However, their

scope of exploration is usually limited compared to the whole

objective space. Many methods can be “trapped” in the first

local optimum they find and will not be able to improve

further. Since these methods are completely determined, they

are destined to produce poor results if the starting position is

unfortunate. Thus they are commonly classified as local

searches.

Direct search methods do not depend on any information
other than measurements of the objective function. No

assumptions of differentiability are made. The very first

direct search is now known as the coordinate strategy, or the

Gauss-Seidel strategy among many other names [1-3]. This

particular strategy singles out one dimension at a time, and

performs a line search on that dimension. An important

variation of this strategy, the Davies-Swann-Campey method,

initially proposed by Swann[4], combined the idea of

coordinate strategy with Rosenbrock’s idea of rotating

coordinates[5] so that the line search is not limited to directions
parallel to the coordinate axes.

There are quite a number of gradient-based local searches,

each having numerous variants. The most widely used

include the method of steepest descent[6], conjugate

gradients[7], Newton’s method and a class of quasi-Newton

methods. While steepest descent and conjugate gradients

make use of only the gradient, Newton's method requires the
costly computation of the inverse of the Hessian matrix
\(\nabla^2 f(\mathbf{x})\). Quasi-Newton methods were introduced to
estimate the Hessian matrix based on first-order data and to
simplify computation. The assumptions of these methods are,
of course, that the objective function is twice or once
differentiable, for Newton's method and the other methods
respectively. The interested reader is referred to the excellent

books written by Schwefel[8] and Ruszczyński[9] for

detailed discussions.

B. Evolutionary Computing

Evolutionary computing, including genetic algorithms[10],

evolutionary strategies[11, 12], and genetic programming

[13], was inspired by the biological process that genes of

creatures are optimized by the force of natural selection. In

contrast to conventional numerical methods, genetic
algorithms are able to explore the objective space more
extensively, but they usually take more computational effort
and reach the optima with less precision. Hybrid methods


that combine genetic algorithms with local searches have

been proposed so as to perform efficient global searches and

take the merits from the two distinct approaches. These

methods are usually named memetic algorithms (MAs)[14]

due to the resemblance of local searches to the evolution of

culture in human societies, or the evolution of memes[15].

Memetic algorithms have been successfully applied in many
disparate engineering fields, ranging from microarray gene
selection[16], aerodynamic design[17], and drug therapy
design[18] to assignment problems[19].

In addition to gradient-based LSes in MAs, other means to

utilize gradient in traditional genetic algorithms or

evolutionary strategies have also been proposed. Salomon[20]

pointed out the similarities between finite differencing (FD)

and Evolutionary Strategies and consequently introduced a

new mutation operator based on approximated gradients in

his Evolutionary Gradient Search.

In this paper, we studied the impact of gradients on

memetic search performance. In Section II we review various

gradient-based local search methods, from steepest descent to

quasi-Newton methods. Section III discusses sources of

gradients as well as their possible influence in a memetic

algorithm. Finally, in Section IV, we present an empirical

study comparing four algorithms, including a simple genetic

algorithm, a MA utilizing gradients computed with FD, a MA

utilizing analytical gradients and a MA with Simultaneous

Perturbation Stochastic Approximation (SPSA) as the local

search.

II. GRADIENT-BASED LOCAL SEARCHES

A large number of widely-used deterministic algorithms

depend on gradient information of the objective function to

perform optimization. There are, for example, steepest

descent, Newton’s method, and a variety of quasi-Newton

methods. Different from direct searches, in a gradient-based

method a relationship between the input and output of the

function must be established or approximated. While some

methods, like steepest descent, rely on only first-order
derivatives, Newton's method requires second-order
information. Thus it must be assumed that the objective
function is at least twice differentiable for application of
Newton's method and once differentiable for the other methods.

One-dimensional line searches are usually employed
iteratively in deterministic methods. Given a pre-defined
search direction \(\mathbf{v}\), the search seeks the optimal step
length \(s^*\):
\[
s^* = \arg\min_{s} f(\mathbf{x} + s\mathbf{v})
\]
Common strategies include Fibonacci Division [21],
Golden Section[21, 22], and so forth. Although the methods of
steepest descent and conjugate gradients require precise
one-dimensional searches, Newton's method and
quasi-Newton methods only need imprecise searches that
guarantee sufficient decrease[23].
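For illustration, a precise one-dimensional search along a fixed direction can be sketched with golden-section search; the objective function, starting point, direction, and bracket below are hypothetical examples, not taken from the paper.

```python
import numpy as np

def golden_section(phi, lo, hi, tol=1e-8):
    """Minimize a unimodal 1-D function phi on [lo, hi] by golden-section search."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0   # ~0.618, the inverse golden ratio
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c                   # keep [a, d]; reuse c as the new d
            c = b - invphi * (b - a)
        else:
            a, c = c, d                   # keep [c, b]; reuse d as the new c
            d = a + invphi * (b - a)
    return (a + b) / 2.0

# Hypothetical objective, point, and direction: minimize f along x + s*v.
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
x = np.array([0.0, 0.0])
v = np.array([1.0, -2.0])
s_star = golden_section(lambda s: f(x + s * v), 0.0, 2.0)   # optimum at s = 1
```

Each iteration shrinks the bracket by the constant golden ratio, which is why the method needs only function comparisons and no derivatives.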

A. Steepest Descent

One of the simplest gradient-based LS is the method of

steepest descent introduced by Curry[6], which follows from

the fact that the gradient vector points to the direction of

fastest increase in the function value. The search points are

iteratively generated by stepping in the opposite direction of

the gradient. Formally, letting \(\mathbf{x}_k\) denote the current search
point and \(s_k\) the step length, we have
\[
\mathbf{x}_{k+1} = \mathbf{x}_k - s_k \nabla f(\mathbf{x}_k)
\]
Although steepest descent is easy to implement, it is

commonly considered slow as the locally optimal direction

often does not coincide with the optimal direction in the

bigger picture. A zigzag path can often be observed. The

convergence rate is no better than first order except for some

special cases like concentric contours[24].

Some variants[25, 26] of the basic steepest descent involve

deriving the current direction from historical search points in

order to reduce the zigzag behavior and accelerate

convergence.
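A minimal sketch of the basic update above, using a backtracking step instead of a precise line search for brevity; the ill-conditioned quadratic is a hypothetical example chosen to exhibit the zigzag behavior.

```python
import numpy as np

def steepest_descent(f, grad, x0, iters=200, tol=1e-10):
    """x_{k+1} = x_k - s_k * grad f(x_k), with a backtracking (Armijo) step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = 1.0
        # Halve s until sufficient decrease: f(x - s g) <= f(x) - 1e-4 * s * ||g||^2
        while f(x - s * g) > f(x) - 1e-4 * s * (g @ g):
            s *= 0.5
        x = x - s * g
    return x

# Hypothetical ill-conditioned quadratic: the locally optimal direction
# differs from the globally optimal one, producing the zigzag path.
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x_min = steepest_descent(f, grad, [5.0, 1.0])
```

The minimizer is the origin; the iterate approaches it only linearly because of the zigzagging discussed above.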

B. Conjugate gradient

In the method of conjugate gradients[7], the search
directions are a set of \(\mathbf{A}\)-conjugate vectors. If a symmetric
matrix \(\mathbf{A}\) is positive definite, non-zero vectors
\(\mathbf{d}_0, \mathbf{d}_1, \dots, \mathbf{d}_{n-1}\) are
said to be \(\mathbf{A}\)-conjugate if
\[
\mathbf{d}_i^{\mathsf{T}} \mathbf{A} \mathbf{d}_j = 0, \quad i \ne j, \quad i, j = 0, 1, \dots, n-1
\]
It can be shown that all these vectors are linearly independent.
With first-order derivatives, we may compute the direction
\(\mathbf{d}_k\) in the \(k\)th iteration with the following formula:
\[
\mathbf{d}_k = -\nabla f(\mathbf{x}_k) + \beta_k \mathbf{d}_{k-1}
\]
where
\[
\beta_k = \frac{\|\nabla f(\mathbf{x}_k)\|^2}{\|\nabla f(\mathbf{x}_{k-1})\|^2}
\]
with the initialization \(\mathbf{d}_0 = -\nabla f(\mathbf{x}_0)\). It is shown to be
beneficial to reinitialize the direction vector periodically so as
to avoid possible linear dependence in the generated direction
vectors [27].

The power of conjugate gradients lies in the fact that \(f\)
will probably become separable in the new coordinate system,

so each dimension can be minimized individually. For a

quadratic function of D dimensions, conjugate gradient needs

a maximum of D steps to converge.
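The D-step property can be sketched on a quadratic \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{\mathsf{T}}\mathbf{A}\mathbf{x} - \mathbf{b}^{\mathsf{T}}\mathbf{x}\), where the exact step length along each direction has a closed form; the matrix and vector below are hypothetical examples.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize 0.5 x^T A x - b^T x for symmetric positive-definite A.
    The directions are A-conjugate, so at most D steps are needed
    in exact arithmetic for a D-dimensional problem."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                         # gradient of the quadratic
    d = -g                                # initialization d_0 = -grad f(x_0)
    for _ in range(len(b)):
        if np.linalg.norm(g) < tol:
            break
        s = (g @ g) / (d @ A @ d)         # exact step length along d
        x = x + s * d
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves coefficient
        d = -g_new + beta * d
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = conjugate_gradient(A, b, np.zeros(2))   # solves A x = b in <= 2 steps
```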

C. Newton’s method

Newton’s method, or the Newton-Raphson method, was

first introduced as a method to solve equations iteratively. For

a simple equation \(g(x) = 0\) and the current position \(x_k\) in the
proximity of the solution, the next search point is taken as the
intercept of the tangent line at \(x_k\) and the \(x\)-axis. More
formally,
\[
x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}
\]
As an optimization method, the equation to be solved is
simply changed to \(\nabla f(\mathbf{x}) = 0\). Substituting corresponding
terms, we get:
\[
\mathbf{x}_{k+1} = \mathbf{x}_k - \left[\nabla^2 f(\mathbf{x}_k)\right]^{-1} \nabla f(\mathbf{x}_k)
\]

For a quadratic objective function with a positive-definite

Hessian, a local optimum can be reached with Newton’s

method in one step. Nonetheless, the fast convergence is
achieved at a high cost, mainly the effort needed to compute
the Hessian; moreover, the search direction can be distorted by
errors introduced by finite differencing. This inconvenience

inspired researchers to look for alternative ways to

approximate the Hessian or its inverse and led to the

quasi-Newton methods.
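The one-step behavior on a quadratic with a positive-definite Hessian can be checked directly; the quadratic below is a hypothetical example, and the linear system is solved rather than inverting the Hessian explicitly.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton iteration: solve H p = grad f(x) instead of forming H^{-1}."""
    return x - np.linalg.solve(hess(x), grad(x))

# Hypothetical quadratic f(x) = 0.5 x^T A x - b^T x with positive-definite A:
# a single Newton step lands exactly on the optimum A^{-1} b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x1 = newton_step(grad, hess, np.zeros(2))
```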

D. Quasi-Newton methods

Davidon[28] originated the first method in the

quasi-Newton family. It was later refined by Fletcher and

Powell[29] and is now called the Davidon-Fletcher-Powell

method (DFP). The search direction in DFP is given by:
\[
\mathbf{d}_k = -\mathbf{H}_k \nabla f(\mathbf{x}_k)
\]
where \(\mathbf{H}_k\) approximates the inverse of the Hessian and,
with \(\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k\) and
\(\Delta\mathbf{g}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)\), is updated as
\[
\mathbf{H}_{k+1} = \mathbf{H}_k
+ \frac{\Delta\mathbf{x}_k \Delta\mathbf{x}_k^{\mathsf{T}}}{\Delta\mathbf{x}_k^{\mathsf{T}} \Delta\mathbf{g}_k}
- \frac{\mathbf{H}_k \Delta\mathbf{g}_k \Delta\mathbf{g}_k^{\mathsf{T}} \mathbf{H}_k}{\Delta\mathbf{g}_k^{\mathsf{T}} \mathbf{H}_k \Delta\mathbf{g}_k}
\]

The Broyden-Fletcher-Goldfarb-Shanno or BFGS [30-33]
method is slightly different. However, both of the methods
belong to the quasi-Newton class characterized by the
formula
\[
\mathbf{H}_{k+1} = \mathbf{H}_{k+1}^{\mathrm{DFP}}
+ \omega \, \mathbf{v}_k \mathbf{v}_k^{\mathsf{T}},
\quad
\mathbf{v}_k = \left(\Delta\mathbf{g}_k^{\mathsf{T}} \mathbf{H}_k \Delta\mathbf{g}_k\right)^{1/2}
\left(
\frac{\Delta\mathbf{x}_k}{\Delta\mathbf{x}_k^{\mathsf{T}} \Delta\mathbf{g}_k}
- \frac{\mathbf{H}_k \Delta\mathbf{g}_k}{\Delta\mathbf{g}_k^{\mathsf{T}} \mathbf{H}_k \Delta\mathbf{g}_k}
\right)
\]
Obviously, the DFP formula is the special case where
\(\omega = 0\), while in BFGS \(\omega = 1\). It can be shown that
\(\mathbf{H}_{k+1}\) is positive definite as long as \(\mathbf{H}_k\) is
positive definite, for all \(\omega \in [0, 1]\), if all computation and
line searches are exact.
In real-world applications, it is beneficial to periodically
reset \(\mathbf{H}_k\) to \(\mathbf{I}\), so as to prevent loss of
positive-definiteness [34].
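The DFP iteration above can be sketched as follows; the quadratic test function is a hypothetical example, an imprecise backtracking search is used (as the text notes, sufficient decrease is enough for quasi-Newton methods), and a curvature guard stands in for the periodic reset that protects positive-definiteness.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])        # hypothetical SPD Hessian
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def dfp_minimize(f, grad, x0, iters=100, tol=1e-8):
    """Quasi-Newton (DFP): d_k = -H_k g_k, with the inverse-Hessian
    estimate H updated from displacement/gradient-change pairs."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                        # initial inverse-Hessian estimate
    g = grad(x)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g
        s = 1.0                               # imprecise (Armijo) backtracking
        while f(x + s * d) > f(x) + 1e-4 * s * (g @ d):
            s *= 0.5
        x_new = x + s * d
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        if dx @ dg > 1e-12:                   # curvature guard keeps H positive definite
            H = H + np.outer(dx, dx) / (dx @ dg) \
                  - (H @ np.outer(dg, dg) @ H) / (dg @ H @ dg)
        x, g = x_new, g_new
    return x

x_star = dfp_minimize(f, grad, np.zeros(2))
```

As the curvature pairs accumulate, H approaches the true inverse Hessian, so late iterations behave like Newton steps at first-order cost.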

III. MEMETIC ALGORITHM

Having seen how remarkably creatures adapt to their
environment, one may reasonably conclude that nature is

probably the most powerful optimizer. The astonishing

elegance of the ways natural creatures solve their daily

problems inspired researchers to mimic evolution, nature’s

way of optimization, in their algorithms. These algorithms

simulate the natural selection process and blind mutation and

crossover of genes in the hope that the genes will be

optimized, in a sense, “automatically”.

In the 1960s, Holland invented the genetic algorithm (GA)[10] and
Rechenberg and Schwefel invented evolutionary
strategies[11, 12]. In 1985, Cramer[13] conceived genetic
programming (GP), in which individuals are represented by
tree structures, and Koza[35] refined it later. The three

methods differ mostly in the representation of individuals and

use of operators such as crossover and mutation. Nevertheless

they share more or less a common algorithmic sequence. The

three methods are now sometimes grouped together under the

collective term evolutionary computing. More recent
developments include Ferreira's Gene Expression

Programming[36], which is alleged to be a closer metaphor to

real biological processes.

Evolutionary computing, like the natural processes it
mimics, is stochastic. It is able to reach the global optimum

with high probability and at lower cost than a multi-start

deterministic search. However, while evolutionary

computing is generally good at identifying promising regions

that are worth exploring further, it may take time to converge

within those regions and may not converge to precisely the

optimum. On the other hand, deterministic methods converge

rapidly near optima. Memetic algorithms that hybridize

evolutionary computing and deterministic LSes are thus

proposed to perform efficient search with good precision.

Based on the way results of local searches are used,

memetic algorithms can be categorized into Baldwinian and

Lamarckian. In a Baldwinian MA, the results are merely used

in the selection of individuals. In a Lamarckian MA, the

results are also translated back to the genotypes. The

Lamarckian approach is used in most recent works.

A. Sources of Gradients

While the exact search strategy may vary, the gradient

information many methods depend on does not differ, except

for truncation and rounding errors. In this section we look at

different sources of gradient, their availability and cost.

The expense we pay for gradients is generally determined
by our knowledge about the objective function. The most

general and widely used method is finite differencing, which

requires no other information than measurements of the

objective functions. However, FD can be expensive. If more

information of the objective function is available, for example,

the algorithm to derive the function value, it can be exploited

so that the gradient information can be obtained at a much

lower cost. In some special cases the gradient may even be

readily available. Techniques to compute gradients range
from automatic differentiation and adjoint Euler solvers to

symbolic differentiation. Even when little information is

known, we might still reduce the number of function

measurements at the expense of accuracy of the gradient,

thanks to stochastic gradient methods such as SPSA.

In the most special case, analytical gradient is readily

available. A good example is the Lennard-Jones cluster

problem, which aims to find the optimal positions of a cluster

of atoms so that the potential energy of the atoms is

minimized. The problem has been so well studied that one can

easily find analytical gradient in relevant literature. For some

other problems, the objective function may be completely

specified in formula form while the derivative is not; this

knowledge enables us to use computer-assisted symbolic

differentiation or even analytical derivation by hand.

In cases where the function is so complex that manual

derivation or symbolic differentiation is unbearably slow, or

the function is not known in exact formula form but only

algorithmically, i.e. as a series of steps, for instance, in

complex simulations, automatic differentiation may be used.

Automatic differentiation (AD) is superior to finite

differencing because it generates numerical results as precise

as one might obtain with analytical gradients. Besides, the

results may be generated with reasonable cost as the

computational complexity in the reverse mode is independent

of the dimensionality of the inputs.

Automatic differentiation is a transformation of the

algorithm deriving the function value to the algorithm

deriving its derivative. See [37] for a history of automatic
differentiation. AD includes two strategies, or modes: the
forward mode and the reverse mode. For our function of
interest \(f: \mathbb{R}^D \to \mathbb{R}\), the reverse mode requires less
computational cost than the forward mode and is preferred.

However, the reverse mode requires the storage of

intermediate results and may pose a challenge for memory

capacity when used with very complex algorithms.
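The core idea of AD, propagating derivatives through each elementary operation, can be illustrated with the simpler forward mode (the text prefers reverse mode for \(f: \mathbb{R}^D \to \mathbb{R}\), but a self-contained reverse-mode tape is longer). The `Dual` class and the function below are hypothetical illustrations, not part of any AD library.

```python
class Dual:
    """Forward-mode AD: carry a (value, derivative) pair through each operation."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)        # (u+v)' = u' + v'
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)       # product rule
    __rmul__ = __mul__

def partial(f, x, l):
    """df/dx_l at x, exact to machine precision (no step size h to tune)."""
    args = [Dual(v, 1.0 if i == l else 0.0) for i, v in enumerate(x)]
    return f(args).dot

f = lambda x: x[0] * x[0] * x[1] + 3.0 * x[1]   # hypothetical objective
g0 = partial(f, [2.0, 5.0], 0)                   # df/dx0 = 2*x0*x1 = 20
g1 = partial(f, [2.0, 5.0], 1)                   # df/dx1 = x0^2 + 3 = 7
```

Note that forward mode needs one pass per input dimension, which is exactly why the reverse mode, whose cost is independent of D, is preferred for scalar-valued objectives.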

There are also cases where gradient is not available in

analytical form, but still can be obtained at a much lower cost

compared to finite differencing. A class of problems directly

taken from engineering practice is the optimization of airfoil

design. The optimization of airfoil shapes can be viewed as a

constrained optimization problem, where the constraints

appear as Euler or Navier-Stokes flow equations. Here, an

adjoint Euler solver, first used in [38], may be utilized to

compute the gradients at only 1.5 times or 2 times the cost of

evaluating the objective function[39], which is, compared to

finite differencing, a cheap method.

Nevertheless there are cases that the function value is

generated as processes or experiments external to our

computer, such as outputs from external circuitry. Little is

known about the input-output relationship. In this case the

function becomes a black box. Generally, only finite

differencing is applicable in these situations.

Finite differencing is derived directly from the definition of

partial derivatives. There are several ways to approximate a

local derivative. For the objective function \(f: \mathbb{R}^D \to \mathbb{R}\), let
\(\mathbf{u}_\ell\) denote the unit vector of the \(\ell\)th dimension, and
\(\hat{g}_\ell(\cdot)\) denote the corresponding approximated partial derivative.
For a small positive perturbation \(h\), the one-sided estimate
(forward difference) at the point \(\mathbf{x}\) can be expressed as
\[
\hat{g}_\ell(\mathbf{x}) = \frac{1}{h}\left[f(\mathbf{x} + h\mathbf{u}_\ell) - f(\mathbf{x})\right]
\]
We might also use the central difference, or the two-sided
estimate:
\[
\hat{g}_\ell(\mathbf{x}) = \frac{1}{2h}\left[f(\mathbf{x} + h\mathbf{u}_\ell) - f(\mathbf{x} - h\mathbf{u}_\ell)\right]
\]
If it can be assumed that the function is twice differentiable,
from the expansion of the Taylor series we may obtain:
\[
\hat{g}_\ell(\mathbf{x}) = \frac{1}{2h}\left[-f(\mathbf{x} + 2h\mathbf{u}_\ell) + 4f(\mathbf{x} + h\mathbf{u}_\ell) - 3f(\mathbf{x})\right]
\]

Although the two-point estimate and the result yielded
from the Taylor series produce generally more accurate results,
the one-point estimate is usually preferred on the grounds of

economics. However, if the objective function is rough or

noise is present, using the central difference may reduce the

noise level and achieve faster convergence[40].
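The forward and central estimates above can be sketched in a few lines; the objective function and evaluation point are hypothetical examples.

```python
import numpy as np

def fd_gradient(f, x, h=1e-6, central=False):
    """Approximate grad f at x by finite differencing:
    D+1 evaluations for forward differences, 2D for central differences."""
    x = np.asarray(x, dtype=float)
    D = x.size
    g = np.empty(D)
    fx = f(x)                            # shared by all forward differences
    for l in range(D):
        u = np.zeros(D)
        u[l] = 1.0                       # unit vector of the l-th dimension
        if central:
            g[l] = (f(x + h * u) - f(x - h * u)) / (2.0 * h)
        else:
            g[l] = (f(x + h * u) - fx) / h
    return g

f = lambda x: x[0] ** 2 + np.sin(x[1])   # hypothetical objective
g = fd_gradient(f, [1.0, 0.0], central=True)   # analytical gradient is [2, 1]
```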

Despite being universally applicable, finite differencing
has its own drawbacks. Choosing the value of \(h\) can be tricky,
because too great an \(h\) will, by definition, yield an inaccurate
gradient, while too small an \(h\) may be obscured by the noise in
the external system and susceptible to rounding errors. In
addition, computing FD is expensive because it requires \(D+1\)
measurements for a \(D\)-dimensional objective function.

When gradients are not cheaply available, it may be

advantageous to use stochastic gradient methods that

stochastically approximate the gradient of the function rather

than the combination of a deterministic method with gradient

obtained from finite differencing[41]. Although the gradient

information acquired by stochastic approximation can be

tightly coupled with the strategy of its utilization, we still

classify these methods mainly as a source of gradient instead

of gradient-based methods. In the next few paragraphs we

review the simultaneous perturbation stochastic

approximation method (SPSA).

SPSA is due to Spall [42, 43]. The approximated gradient
can be expressed as follows:
\[
\hat{g}_\ell(\mathbf{x}_k) = \frac{f(\mathbf{x}_k + c_k\boldsymbol{\Delta}_k) - f(\mathbf{x}_k - c_k\boldsymbol{\Delta}_k)}{2 c_k \Delta_{k\ell}}
\]
where the components of the perturbation vector \(\boldsymbol{\Delta}_k\) are
drawn from the Bernoulli \(\pm 1\) distribution. The independent
variable is updated as
\[
\mathbf{x}_{k+1} = \mathbf{x}_k - a_k \hat{\mathbf{g}}(\mathbf{x}_k)
\]
The decreasing gain sequences \(a_k\) and \(c_k\) are
respectively determined by \(a_k = a/(A + k + 1)^{\alpha}\)
and \(c_k = c/(k + 1)^{\gamma}\).
For effective values of the constants \(A\), \(a\), \(\gamma\), \(\alpha\), and \(c\), see [44].
For each search point, the algorithm only requires two
evaluations of \(f(\cdot)\); this is only \(1/D\) of the computational
effort required by central finite differencing for a
D-dimensional problem.
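The SPSA recursion can be sketched as follows; the gain constants and the test function are hypothetical choices for illustration (see [44] for guidance on selecting them), and the division by the perturbation vector implements the per-component \(1/\Delta_{k\ell}\) factor, since each component is \(\pm 1\).

```python
import numpy as np

def spsa_minimize(f, x0, iters=2000, a=0.1, c=0.1, A=100.0,
                  alpha=0.602, gamma=0.101, seed=0):
    """SPSA: two f-evaluations per iteration, regardless of the dimension D."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        a_k = a / (A + k + 1) ** alpha                 # decreasing gain sequence
        c_k = c / (k + 1) ** gamma                     # decreasing perturbation size
        delta = rng.choice([-1.0, 1.0], size=x.size)   # Bernoulli +/-1 components
        g_hat = (f(x + c_k * delta) - f(x - c_k * delta)) / (2.0 * c_k * delta)
        x = x - a_k * g_hat
    return x

f = lambda x: np.sum(x ** 2)               # hypothetical 10-D objective
x_star = spsa_minimize(f, np.full(10, 2.0))
```

All components are perturbed simultaneously, so the cost per gradient estimate stays at two evaluations even as D grows, in contrast to the 2D evaluations of central finite differencing.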

IV. EMPIRICAL STUDIES

In this section we studied the characteristics of algorithms
based on the pseudo-code shown in Fig. 1, including three
memetic algorithms and one traditional genetic algorithm, in

order to illustrate the impact of gradient in MAs. In this study

our focus is on the steps marked (6) and (7) in Fig. 1. The first

memetic algorithm employed analytical gradient that is

assumed to be readily available (MA-AG). The cost of
evaluating the gradient is deemed negligible. It is worth

noting that gradients obtained by automatic differentiation are

as precise as analytical gradient, only at a slightly higher cost.

Consequently, characteristics exhibited by this particular

algorithm can easily be extended to cases where AD is used

by only multiplying the cost by a small constant.

The second algorithm employed gradient obtained from

one-sided finite differencing (MA-FD). The first and second