INF 581 Advanced Machine Learning and Autonomous Agents
Topics in Optimization
February 2024
$$
\newcommand{\vs}[1]{\boldsymbol{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\boldsymbol{#1}} % matrix symbol (\boldsymbol, \textbf)
$$
Mathematical optimization
Find the minimum (or the maximum) of a function $f$
$$\boldsymbol{x}^* = \arg\min_{\boldsymbol{x} \in \mathcal{X}} f(\boldsymbol{x})$$
Minimize or maximize ?
In the optimization literature, "optimize" usually means "minimize"
In reinforcement learning, "optimize" usually means "maximize"
Same algorithms in both cases
$$
\arg\min_{\boldsymbol{x}} f(\boldsymbol{x}) = \arg\max_{\boldsymbol{x}} \left(-f(\boldsymbol{x})\right)
$$
Optimization in AI
Optimization is one of the two main components of modern AI (the other is Learning)
Optimization and Learning are highly connected
E.g. optimization is used to:
- fit a model in supervised learning (e.g. the weights of a neural network)
- optimize a policy in Reinforcement Learning (Direct Policy Search)
Some optimization algorithms use learning techniques (especially surrogate models)
Applications beyond AI
Mathematical optimization can be applied everywhere...
- Industry: optimize power grids, air traffic, planning, logistics, …
- Robotics: optimize actuator control, trajectories / energy consumption, ...
- Finance: maximize long-term profits, minimize risks, ...
- Almost everything can be optimized…
- Life itself is the result of an optimization process (natural selection)
Mathematical optimization
A few application examples among thousands
The 2006 NASA ST5 spacecraft antenna.
This complicated shape was found by an evolutionary computer design program to create the best radiation pattern.
It is known as an evolved antenna.
Mathematical optimization
A few application examples among thousands
Production planning in electric power grid
Huge optimization problem with millions of dimensions!
Many different problems and algorithms
Problems
- Objective function shape:
- Linear / non-linear
- Convex / non-convex, …
- Continuous / discrete (combinatorial) / mixed
- Uni-modal / multi-modal
- Multi-valued objective function:
- Multi objective VS mono objective
- With / without gradient formulation
- Stochastic objective function
- Constrained or not
- Dynamical (multi-stage optimization)
- Optimize vectors VS optimize functions (using Hamiltonian)
Algorithms
- Global VS Local
- Gradient based / gradient free (black box)
- Gradient based -> first-order methods: gradient descent, conjugate gradient, Nesterov momentum, AdaGrad, Adam, …
- Gradient based -> second-order methods: Newton's method, quasi-Newton methods
- Population methods: genetic algorithms, differential evolution, particle swarm optimization, …
- Stochastic algorithms (random sampling)
- Multiobjective optimization
- Surrogate models
- Expression optimization: genetic programming,...
Gradient descent
Iteratively update $\boldsymbol{x}$ along the direction of steepest descent
$$\boldsymbol{x} \leftarrow \boldsymbol{x} - \alpha \nabla_{\boldsymbol{x}}f(\boldsymbol{x})$$
Requires the gradient of the objective function
- $\alpha$ is the step size, which influences the convergence rate
- How can it be adapted to converge faster?
Popular variants of SGD in machine learning:
- Momentum optimization: Nesterov Momentum / Nesterov Accelerated Gradient (NAG)
- Adaptive methods: AdaGrad, RMSProp, Adam
Momentum methods
"Classic" Gradient Descent:
$$\boldsymbol{x} \leftarrow \underbrace{\boldsymbol{x}}_{\text{position}} - \underbrace{\alpha ~ \nabla_{\boldsymbol{x}}f(\boldsymbol{x})}_{\text{step}}$$
Gradient Descent with momentum:
$$\boldsymbol{m} \leftarrow \underbrace{\beta \boldsymbol{m}}_{\text{momentum}} - \underbrace{\alpha ~ \nabla_{\boldsymbol{x}}f(\boldsymbol{x})}_{\text{acceleration}}$$
$$\boldsymbol{x} \leftarrow \underbrace{\boldsymbol{x}}_{\text{position}} + \underbrace{\boldsymbol{m}}_{\text{momentum}}$$
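Below is a minimal NumPy sketch of gradient descent with momentum on an illustrative quadratic objective $f(\boldsymbol{x}) = x_1^2 + 10 x_2^2$; the values of $\alpha$ and $\beta$ are arbitrary choices for this example.

```python
import numpy as np

def grad_f(x):
    # gradient of f(x) = x1^2 + 10 * x2^2
    return np.array([2 * x[0], 20 * x[1]])

x = np.array([2.0, 2.0])    # position
m = np.zeros(2)             # momentum
alpha, beta = 0.01, 0.9     # step size and momentum decay

for step in range(500):
    m = beta * m - alpha * grad_f(x)   # accumulate a decaying sum of gradients
    x = x + m                          # move along the momentum vector

print(x)  # converges to the minimum (0, 0)
```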
Gradient free methods
(“black box optimization”)
Why gradient-free methods?
- the gradient of the objective function is not always known
- or is difficult to compute
- convergence of gradient-based methods can be slow
- gradient-based methods search for local optima (which doesn't help on multimodal functions)
Part 2
Population methods
Population methods
Descent methods:
- A single solution is updated incrementally toward a local minimum
Population methods:
- Iterative methods where several candidate solutions (called individuals) are evaluated at each iteration
- Information from different points of the solution space can be shared between individuals
- This helps the algorithm avoid getting stuck in a local optimum
- Easy to parallelize
Evolutionary Algorithms
(a.k.a. Evolutionary Computing)
Evolutionary Algorithms
Biomimetic algorithms, inspired by natural selection and evolution
Fitter individuals are more likely to pass on their genes to the next generation.
Chromosomes of the fitter individuals are passed on to the next generation after undergoing the genetic operations of crossover and mutation
Families of Evolutionary Algorithms (1/3)
Evolution Strategies (ES)
- Ingo Rechenberg & Hans-Paul Schwefel [1965]
- Made for continuous solution spaces
Families of Evolutionary Algorithms (2/3)
Genetic Algorithms (GA)
- J.H. Holland [1975]
- Dissociate genotype and phenotype
Families of Evolutionary Algorithms (3/3)
Genetic Programming (GP)
- J. Koza [1988]
- Based on Genetic Algorithms, but individuals are programs
Past, present, future
- Old methods (1960s)
- Genetic Algorithms (and EAs in general) became very popular after the publication of Goldberg's book in 1989
- Evolution Strategies are very popular thanks to CMA-ES
- Still the state of the art nowadays
- Work up to a few hundred dimensions (solution space)
- Have recently regained attention in the RL community (e.g. on Atari or MuJoCo)
Generic Evolutionary Algorithm
Illustration on the Rosenbrock function
$$f(\boldsymbol{x}) = \sum_{i=1}^{n-1} \left[ 100(x_{i+1} - x_i^2)^2 + (x_i - 1)^2 \right]$$
In two dimensions ($n = 2$): $$f(\boldsymbol{x}) = (1 - x_1)^2 + 100 (x_2 - x_1^2)^2$$
Generic Evolutionary Algorithm
Input:
$\quad\quad$ $f$: the objective function
Algorithm parameter:
$\quad\quad$ $\mu$: number of parents per generation
$\quad\quad$ $\rho$: number of parents used to make each child
$\quad\quad$ $\lambda$: number of children per generation
$\quad\quad$ mutation operator
$\quad\quad$ mating selection operator
$\quad\quad$ survivor selection operator
$\quad\quad$ termination criterion
$\quad\quad$ an initialization procedure
Initialize a population of $\mu$ individuals
Evaluate each individual $\boldsymbol{x}$: $f(\boldsymbol{x})$
FOR EACH generation (until termination criterion)
$\quad$ FOR EACH child $\boldsymbol{x}$
$\quad\quad$ 1. select $\rho$ parents
$\quad\quad$ 2. $\boldsymbol{x} \leftarrow$ recombine the selected parents (crossover)
$\quad\quad$ 3. mutate $\boldsymbol{x}$
$\quad\quad$ 4. evaluate $f(\boldsymbol{x})$
$\quad$ Select the next generation of $\mu$ individuals
RETURN the best individual
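For concreteness, here is a minimal runnable sketch of this generic loop on the classic OneMax problem (maximize the number of 1s in a bit string, phrased here as minimizing the number of 0s since the pseudocode minimizes $f$). Population sizes and operators are illustrative choices.

```python
import random

M, MU, LAM, RHO = 30, 10, 20, 2   # bits, parents, children, parents per child

def f(x):
    return x.count(0)             # objective to minimize: number of 0 bits

population = [[random.randint(0, 1) for _ in range(M)] for _ in range(MU)]
for generation in range(100):
    children = []
    for _ in range(LAM):
        p1, p2 = random.sample(population, RHO)                 # mating selection
        child = [random.choice(pair) for pair in zip(p1, p2)]   # recombination
        child = [1 - b if random.random() < 1 / M else b for b in child]  # mutation
        children.append(child)
    population = sorted(population + children, key=f)[:MU]      # survivor selection
print(min(population, key=f))
```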
Solution encoding
The problem we want to solve is defined as a function to minimize (e.g. error, cost, ...)
$$\boldsymbol{x}^* = \arg\min_{\boldsymbol{x} \in \mathcal{X}} f(\boldsymbol{x})$$
But how do we define the solution space in EA?
The efficiency of mutation and recombination operators (a.k.a. search operators) greatly depends on this definition!
Solution encoding
In some evolutionary algorithms, the solution space is tailored for specific needs of search operators
Then a distinction has to be made between:
- The solution space used by the objective function $f$: the phenotype
- The solution space used by the algorithm: the genotype
The data structure used to store the genotype is a chromosome
Evolution Strategies, like classical optimization techniques, don't make this distinction and simply use vectors of real numbers
But Genetic Algorithms and Genetic Programming do!
Genotype-Phenotype Mapping
Can be convenient for combinatorial problems: Travelling salesman problem, Knapsack problem, …
Be careful
- A point in the genotype space might not represent a valid point in the phenotype space
- The genotype-phenotype mapping should avoid bias:
- Two individuals close in the genotype space should be close in the phenotype space, and vice versa
- Another example of bias is a mapping that maps most of the genotype space to a small set of phenotypes
Hamming cliff
Example of bias with the "classic" integer-to-binary mapping:

| Phenotype | Genotype |
|---|---|
| 7 | 0111 |
| 8 | 1000 |

Individuals close in the phenotype space but "far" apart (according to the Hamming distance) in the genotype space:
- To go from one individual to the other, many mutations are required despite their small distance in the phenotype space
- This disturbs the variation operators (crossover and mutation)
- Individuals are trapped by this distortion: that's the "Hamming cliff"
Solution: use the Gray code
Gray code
decimal values (phenotype) ↔ bit strings (genotype)
A binary encoding such that the binary representations of adjacent decimal numbers differ by only one bit flip (Hamming distance = 1)
```python
def bin_to_gray(n):
    # Gray code of n: XOR each bit with the next more significant bit
    return n ^ (n >> 1)

def gray_to_bin(n):
    # invert the encoding by cascading the XOR across all shifts
    mask = n >> 1
    while mask != 0:
        n = n ^ mask
        mask = mask >> 1
    return n
```
| Decimal | Binary | Gray |
|---|---|---|
| 0 | 0000 | 0000 |
| 1 | 0001 | 0001 |
| 2 | 0010 | 0011 |
| 3 | 0011 | 0010 |
| 4 | 0100 | 0110 |
| 5 | 0101 | 0111 |
| 6 | 0110 | 0101 |
| 7 | 0111 | 0100 |
| 8 | 1000 | 1100 |
| 9 | 1001 | 1101 |
| 10 | 1010 | 1111 |
| 11 | 1011 | 1110 |
| 12 | 1100 | 1010 |
| 13 | 1101 | 1011 |
| 14 | 1110 | 1001 |
| 15 | 1111 | 1000 |
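As a sanity check (assuming the `bin_to_gray` / `gray_to_bin` functions above), the following sketch verifies that adjacent integers map to Gray codes at Hamming distance 1 and that the mapping is invertible:

```python
for n in range(15):
    g1, g2 = bin_to_gray(n), bin_to_gray(n + 1)
    assert bin(g1 ^ g2).count("1") == 1  # adjacent codes differ by one bit flip
    assert gray_to_bin(g1) == n          # the mapping is invertible
```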
Genotype-Phenotype mapping
Genetic Algorithms
Input:
$\quad\quad$ $f$: the objective function
Algorithm parameter:
$\quad\quad$ $\mu$: number of parents per generation
$\quad\quad$ $\rho$: number of parents used to make each child
$\quad\quad$ $\lambda$: number of children per generation
$\quad\quad$ mutation operator
$\quad\quad$ mating selection operator
$\quad\quad$ survivor selection operator
$\quad\quad$ termination criterion
$\quad\quad$ an initialization procedure
Initialize a population of $\mu$ individuals
Genotype-Phenotype mapping
Evaluate each individual $\boldsymbol{x}$: $f(\boldsymbol{x})$
FOR EACH generation (until termination criterion)
$\quad$ FOR EACH child $\boldsymbol{x}$
$\quad\quad$ 1. select $\rho$ parents
$\quad\quad$ 2. $\boldsymbol{x} \leftarrow$ recombine the selected parents (crossover)
$\quad\quad$ 3. mutate $\boldsymbol{x}$
$\quad\quad$ 4. genotype-phenotype mapping
$\quad\quad$ 5. evaluate $f(\boldsymbol{x})$
$\quad$ Select the next generation of $\mu$ individuals
RETURN the best individual
Solution encoding
Genetic Programming: tree structure
- Symbolic representation
- An individual = a program = a tree of instructions
- Nodes = genes = instructions, operators, ...
Initial population
The initial population should be spread over the solution space, to increase the chances that some samples are close to the best regions
One should define a region of interest in which to sample the initial population:
- a hyper-rectangle to sample from, with a uniform distribution
- a point to sample around, with a multivariate normal distribution
For GAs using bit strings: a random combination of 0s and 1s is reasonable
Mating selection
Can either:
- Randomly select parents
- Select among an "elite" population (the best ones): apply a competition (e.g. tournament selection)
Survivor Selection
Select individuals for the next generation among:
- Children (“offspring”) only
- Children + parents (the best individuals can survive several generations)
Notation for Evolution Strategies:
| Notation | Survivor selection |
|---|---|
| $(\mu + \lambda)\text{-ES}$ | selection in $\{ \text{parents} \} \cup \{ \text{offspring} \}$ |
| $(\mu , \lambda)\text{-ES}$ | selection in $\{ \text{offspring} \}$ |
Crossover (recombination) in GA
Single-point crossover
- A single cut point is chosen at random; the child inherits the genes before the cut from one parent and the genes after the cut from the other
Two-point / k-point crossover
- Two (or $k$) cut points are chosen at random; the child inherits alternating segments from each parent
Uniform crossover
- Each gene or bit is chosen from either parent with equal probability
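Minimal sketches of these three operators on bit-string chromosomes (represented as Python lists of equal length); the helper names are illustrative:

```python
import random

def single_point_crossover(p1, p2):
    cut = random.randrange(1, len(p1))      # one random cut point
    return p1[:cut] + p2[cut:]

def k_point_crossover(p1, p2, k=2):
    cuts = sorted(random.sample(range(1, len(p1)), k))
    child, parents, start = [], [p1, p2], 0
    for i, cut in enumerate(cuts + [len(p1)]):
        child += parents[i % 2][start:cut]  # alternate parent segments
        start = cut
    return child

def uniform_crossover(p1, p2):
    return [random.choice(pair) for pair in zip(p1, p2)]  # coin flip per gene
```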
Crossover (recombination) in ES
Global intermediate recombination
- Each component of the child solution vector $x$ is the arithmetic mean of its parents' components
Discrete recombination
- Each component of the child solution vector $x$ is chosen from either parent with equal probability
Mutation
To explore more of the solution space
Can be:
- Bit-flip mutation: each bit in a binary-valued chromosome has a small probability of being flipped. For a chromosome with $m$ bits, this mutation rate is typically set to $1/m$ ⇒ one mutation per child chromosome on average
- For real-valued chromosomes, a common mutation is to add zero-mean Gaussian noise
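Minimal sketches of these two mutation operators; the default rates follow the typical choices mentioned above:

```python
import random

def bitflip_mutation(chromosome, rate=None):
    # flip each bit independently; the default rate 1/m gives ~1 flip per child
    rate = rate if rate is not None else 1.0 / len(chromosome)
    return [1 - bit if random.random() < rate else bit for bit in chromosome]

def gaussian_mutation(x, sigma=0.1):
    # add zero-mean Gaussian noise to each real-valued gene
    return [xi + random.gauss(0.0, sigma) for xi in x]
```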
Controlling mutation strength
In real-valued chromosomes (ES)
How to set the standard deviation $\sigma$ of the Gaussian noise?
- By a deterministic rule: Rechenberg's 1/5th rule
- By evolutionary rules (Schwefel): Self-Adaptation Evolution Strategies
Rechenberg’s 1/5th rule:
- Increase the step size $\sigma$ if more than 20% of the new solutions (individuals) are "successful" (i.e. better than the previous generation)
- Decrease otherwise
Self-Adaptation:
- $\sigma$ becomes part of each individual's genes
- Each individual has a different $\sigma$
- It is subject to mutation too
- An individual with an inappropriate $\sigma$ simply disappears!
Termination criteria
Can be:
- A fixed number of generations
- A minimal rate of improvement in the average fitness (score)
- A minimal rate of change in the solution space
Example: $(1 + 1)\text{-ES}$
Input:
$\quad\quad$ $f$: the objective function
$\quad\quad$ $\boldsymbol{x}$: initial solution
FOR EACH generation
$\quad\quad$ 1. mutation of $\sigma$ (individual strategy) : $\sigma' \leftarrow $ Rechenberg's 1/5th rule
$\quad\quad$ 2. mutation of $\boldsymbol{x}$ (current solution) : $\boldsymbol{x}' \leftarrow \boldsymbol{x} + \sigma' ~ \mathcal{N}(0,1)$
$\quad\quad$ 3. eval $f(\boldsymbol{x}')$
$\quad\quad$ 4. survivor selection $\boldsymbol{x} \leftarrow \boldsymbol{x}'$ and $\sigma \leftarrow \sigma'$ if $f(\boldsymbol{x}') \leq f(\boldsymbol{x})$
RETURN $\boldsymbol{x}$
Example: $(1 + 1)\text{-SA-ES}$
Input:
$\quad\quad$ $f$: the objective function
$\quad\quad$ $\boldsymbol{x}$: initial solution
Algorithm parameter:
$\quad\quad$ $\tau$: self-adaptation learning rate
FOR EACH generation
$\quad\quad$ 1. mutation of $\sigma$ (current individual strategy) : $\sigma' \leftarrow $ $\sigma ~ e^{\tau \mathcal{N}(0,1)}$
$\quad\quad$ 2. mutation of $\boldsymbol{x}$ (current solution) : $\boldsymbol{x}' \leftarrow \boldsymbol{x} + \sigma' ~ \mathcal{N}(0,1)$
$\quad\quad$ 3. eval $f(\boldsymbol{x}')$
$\quad\quad$ 4. survivor selection $\boldsymbol{x} \leftarrow \boldsymbol{x}'$ and $\sigma \leftarrow \sigma'$ if $f(\boldsymbol{x}') \leq f(\boldsymbol{x})$
RETURN $\boldsymbol{x}$
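A runnable sketch of this $(1 + 1)\text{-SA-ES}$ on the 2-D Rosenbrock function defined earlier; the initial solution, initial step size, and the common choice $\tau = 1/\sqrt{n}$ are illustrative assumptions:

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

rng = np.random.default_rng(0)
x, sigma = np.array([-1.0, 1.5]), 0.5   # initial solution and step size
tau = 1.0 / np.sqrt(len(x))             # self-adaptation learning rate

for generation in range(5000):
    sigma_new = sigma * np.exp(tau * rng.normal())     # 1. mutate sigma
    x_new = x + sigma_new * rng.normal(size=x.shape)   # 2. mutate x
    if rosenbrock(x_new) <= rosenbrock(x):             # 3-4. survivor selection
        x, sigma = x_new, sigma_new

print(x)  # should approach the optimum (1, 1)
```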
Example: $(\mu/\rho + \lambda)\text{-SA-ES}$
Input:
$\quad\quad$ $f$: the objective function
Algorithm parameter:
$\quad\quad$ $\tau$: self-adaptation learning rate
$\quad\quad$ $\mu$: number of parents per generation
$\quad\quad$ $\rho$: number of parents used to make each child
$\quad\quad$ $\lambda$: number of children per generation
$\quad\quad$ mating selection operator
$\quad\quad$ survivor selection operator
Initialize arbitrarily the population
FOR EACH generation
$\quad$ FOR EACH child
$\quad\quad$ 1. randomly select $\rho$ parents
$\quad\quad$ 2. $(\boldsymbol{x}, \sigma) \leftarrow$ recombination of selected parents (if $\rho > 1$)
$\quad\quad$ 3. mutation of $\sigma$ (individual strategy) : $\sigma \leftarrow \sigma ~ e^{\tau \mathcal{N}(0,1)}$
$\quad\quad$ 4. mutation of $\boldsymbol{x}$ (objective param) : $\boldsymbol{x} \leftarrow \boldsymbol{x} + \sigma ~ \mathcal{N}(0,1)$
$\quad\quad$ 5. evaluate $f(\boldsymbol{x})$
$\quad$ Select the next generation of individuals
RETURN the best individual
Differential Evolution
Storn, R and Price, K, Differential Evolution - a Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, Journal of Global Optimization, 1997, 11, 341 - 359
Arbitrarily initialize the population
FOR EACH generation
$\quad$ FOR EACH individual $\boldsymbol{x}$ in the population
$\quad\quad$ 1. MUTATION
$\quad\quad\quad$ Choose 3 random distinct individuals $\boldsymbol{a}$, $\boldsymbol{b}$ and $\boldsymbol{c}$
$\quad\quad\quad$ $\boldsymbol{b}$ and $\boldsymbol{c}$ define the direction and the amplitude of the mutation (i.e. the search vector)
$\quad\quad\quad$ Combine them to make an interim individual $\boldsymbol{z} \leftarrow \boldsymbol{a} + \omega ~ (\boldsymbol{b} - \boldsymbol{c})$
$\quad\quad\quad$ where $\omega$ is a mutation amplification factor (a scalar)
$\quad\quad$ 2. CROSSOVER
$\quad\quad\quad$ Choose a random dimension $j \in [1,\dots,n]$ for optimization in $n$ dimensions
$\quad\quad\quad$ Make the candidate individual $\boldsymbol{a}'$ using the binary crossover
$$
a'_i =
\begin{cases}
z_i & \text{if } i = j \text{ or with probability } p \\
x_i & \text{otherwise}
\end{cases}
$$
$\quad\quad$ 3. SELECTION
$\quad\quad\quad$ Insert the better of the candidate $\boldsymbol{a}'$ and the current individual $\boldsymbol{x}$ into the next generation
RETURN the best individual in the last generation
Differential Evolution
Differential Evolution
Strength: requires few meta-parameters
- $p$: the crossover probability (typically 0.5)
- $\omega$: the differential weight (typically between 0.4 and 1)
Implemented in `scipy.optimize`
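For example, with SciPy's built-in implementation (the search bounds below are an illustrative choice):

```python
from scipy.optimize import differential_evolution

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

result = differential_evolution(
    rosenbrock,
    bounds=[(-2, 2), (-2, 2)],   # search box for each dimension
    recombination=0.5,           # crossover probability p
    mutation=(0.4, 1.0),         # differential weight, dithered in this range
    seed=0,
)
print(result.x)  # should approach the optimum (1, 1)
```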
Cross-Entropy Method
Maintains an explicit probability distribution over the solution space
- Called a proposal distribution $P_{\boldsymbol{\theta}}$
- Used to propose new samples for the next generation
- Parametrized by $\boldsymbol{\theta}$
- The family of probability distributions is chosen according to the objective function landscape
Usually: multivariate normal distributions, with $\boldsymbol{\theta} = (\boldsymbol{\mu}, \ms{\Sigma})$
Cross-Entropy Method
At each iteration:
- Sample from the proposal distribution $P_{\boldsymbol{\theta}}$
- Select best samples according to the objective function $f$
- Update the proposal distribution to fit the best samples
Cross-Entropy Method
Input:
$\quad\quad$ $f$: the objective function
$\quad\quad$ $\mathbb{P}$: family of distributions
$\quad\quad$ $\boldsymbol{\theta}$: initial parameters for the proposal distribution $\mathbb{P}$
Algorithm parameters:
$\quad\quad$ $m$: sample size
$\quad\quad$ $m_{\text{elite}}$: number of samples used to fit $\boldsymbol{\theta}$
UNTIL the stop criteria are met
$\quad$ $\text{samples} \leftarrow \{x_1, \dots, x_m\}$ with $x_i \sim \mathbb{P}(\boldsymbol{\theta}) ~ \forall i \in 1 \dots m$
$\quad$ $\text{elite} \leftarrow$ { the $m_{\text{elite}}$ best samples } (i.e. select the best samples according to $f$)
$\quad$ $\boldsymbol{\theta} \leftarrow$ fit $\mathbb{P}(\boldsymbol{\theta})$ to the elite samples
RETURN $\boldsymbol{\theta}$
Cross-Entropy Method
In the case of the multivariate normal distribution, parameters are updated according to the maximum likelihood estimate:
$$\boldsymbol{\mu}^{(k+1)} = \frac{1}{m_{\text{elite}}} \sum_{i=1}^{m_{\text{elite}}} \boldsymbol{x}^{(i)}$$
$$\ms{\Sigma}^{(k+1)} = \frac{1}{m_{\text{elite}}} \sum_{i=1}^{m_{\text{elite}}} \left( \boldsymbol{x}^{(i)} - \boldsymbol{\mu}^{(k+1)} \right) \left( \boldsymbol{x}^{(i)} - \boldsymbol{\mu}^{(k+1)} \right)^{\top}$$
with:
$k$ = the iteration index
$m_{\text{elite}}$ = the number of “elite” (i.e. best) samples
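As a concrete illustration, here is a minimal NumPy sketch of this loop on the 2-D Rosenbrock function from earlier; the sample sizes, the initial proposal, and the small ridge added to keep the covariance well-conditioned are illustrative assumptions:

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

rng = np.random.default_rng(0)
m, m_elite = 100, 10                       # sample size and elite size
mu, Sigma = np.zeros(2), 4.0 * np.eye(2)   # initial proposal parameters

for iteration in range(50):
    samples = rng.multivariate_normal(mu, Sigma, size=m)
    scores = np.apply_along_axis(rosenbrock, 1, samples)
    elite = samples[np.argsort(scores)[:m_elite]]           # select best samples
    mu = elite.mean(axis=0)                                 # MLE of the mean
    Sigma = np.cov(elite.T, bias=True) + 1e-8 * np.eye(2)   # MLE of the covariance

print(mu)  # should approach the optimum (1, 1)
```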
Cross-Entropy Method
The multivariate normal distribution is not always the right family of distributions to choose…
On a multimodal landscape, for example, a mixture model would be a better choice…
Cross-Entropy Method
Meta-parameters:
- The family of probability distributions $\mathbb{P}_{\boldsymbol{\theta}}$
- The number of solutions to sample at each iteration
- The number of best solutions selected at each iteration to update $\boldsymbol{\theta}$
Termination criteria:
- Number of iterations
- Value of $\boldsymbol{\theta}$
- …
Cross-Entropy Method
Strengths:
- Simplicity
- Good convergence
Example:
I. Szita and A. Lorincz, Learning Tetris Using the Noisy Cross-Entropy Method (2006), Neural Computation
Bibliography
- Rubinstein, R., 1999. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1, pp. 127-190.
- Rubinstein, R.Y., 2001. Combinatorial optimization, cross-entropy, ants and rare events. Stochastic Optimization: Algorithms and Applications, pp. 303-363.
- Rubinstein, R.Y. and Kroese, D.P., 2004. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning (Vol. 133). New York: Springer.
CMA-ES
Covariance Matrix Adaptation Evolution Strategy
CMA-ES
Hansen, N. and A. Ostermeier (2001) Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2), pp. 159-195
Nikolaus Hansen is now at Inria, in the RandOpt team (CMAP)
CMA-ES:
- For difficult optimization problems in continuous domains
- Usually considered the best black-box algorithm
- Search space: between 3 and a hundred dimensions
CMA-ES
- Like the cross-entropy method, a distribution is improved over time based on samples
- CMA uses multivariate Gaussian distributions with 3 parameters
$$\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{m}, \sigma^2 \boldsymbol{C})$$
CMA maintains:
- a mean vector (position)
- a step size scalar (amplitude)
- a covariance matrix (direction)
CMA-ES
Like in the cross-entropy method, these 3 parameters are updated at each iteration so that the proposal distribution $\mathcal{N}(\boldsymbol{m}, \sigma^2\boldsymbol{C})$ focuses on the global optimum $\boldsymbol{x}^*$
CMA-ES
At each iteration, $\lambda$ points are sampled from this distribution
$$\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{m}, \sigma^2 \boldsymbol{C}) \quad \text{for } i = 1, \ldots, \lambda$$
The samples are then sorted according to their objective function value
$$f(\boldsymbol{x}_1) \leq f(\boldsymbol{x}_2) \leq \ldots \leq f(\boldsymbol{x}_\lambda)$$
CMA-ES
Step 1: Update $\boldsymbol{m}$
$$\boldsymbol{x}_i = \boldsymbol{m} + \sigma \boldsymbol{y}_i, \quad \boldsymbol{y}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{C}), \quad \text{for } i = 1, \ldots, \lambda$$
$$
\boldsymbol{m} \leftarrow \sum_{i=1}^{\mu} w_i \boldsymbol{x}_{i:\lambda} =
\boldsymbol{m} + \sigma \boldsymbol{y}_w \quad \text{where } \boldsymbol{y}_w = \sum_{i=1}^{\mu} w_i \boldsymbol{y}_{i:\lambda}
$$
CMA-ES
Step 2: Update $\boldsymbol{C}$
The covariance update is then
$$
\boldsymbol{C} \leftarrow
(1 - c_1 - c_{\mu}) \boldsymbol{C}
+ \underbrace{c_1 \boldsymbol{p_c} \boldsymbol{p_c}^\top}_{\text{rank-one update}}
+ \underbrace{c_{\mu} \sum_{i=1}^{\mu} w_i \boldsymbol{y}_{i:\lambda} \boldsymbol{y}_{i:\lambda}^\top}_{\text{rank-$\mu$ update}}
$$
The rank-one update uses the cumulation vector to exploit correlations between consecutive steps, allowing the covariance matrix to elongate more quickly along a favorable axis.
The rank-$\mu$ update adjusts the search direction according to the best solutions.
CMA-ES
Step 3: Update $\sigma$
The new step size $\sigma$ is obtained according to
$$
\sigma \leftarrow \sigma \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|\boldsymbol{p}_{\sigma}\|}{\mathbb{E}\|\mathcal{N}(\boldsymbol{0}, \textbf{I})\|} - 1 \right) \right)
$$
CMA-ES summary
INPUT:
$\quad$ $\boldsymbol{m} \in \mathbb{R}^n, \sigma \in \mathbb{R}_+, \lambda$
INITIALIZE:
$\quad$ $\boldsymbol{C} = \textbf{I}$,
$\boldsymbol{p_c} = \boldsymbol{0}$ and
$\boldsymbol{p_\sigma} = \boldsymbol{0}$
SET:
$\quad$ $c_c \approx \frac{4}{n}$,
$c_\sigma \approx \frac{4}{n}$,
$c_1 \approx \frac{2}{n^2}$,
$c_{\mu} \approx \frac{\mu_w}{n^2}$,
$c_1 + c_{\mu} \leq 1$,
$d_\sigma \approx 1 + \sqrt{\frac{\mu_w}{n}},$ and
$\quad$ $w_i$ for $i = 1 \dots \lambda$ such that $\mu_w = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \approx 0.3 \lambda$
WHILE not terminate
$\quad$ $\boldsymbol{x}_i = \boldsymbol{m} + \sigma \boldsymbol{y}_i, \quad \boldsymbol{y}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{C}), \quad \text{for } i = 1, \ldots, \lambda$ (sampling)
$\quad$ $\boldsymbol{m} \leftarrow \sum_{i=1}^{\mu} w_i \boldsymbol{x}_{i:\lambda} = \boldsymbol{m} + \sigma \boldsymbol{y}_w \quad \text{where } \boldsymbol{y}_w = \sum_{i=1}^{\mu} w_i \boldsymbol{y}_{i:\lambda}$ (update mean)
$\quad$ $\boldsymbol{p_c} \leftarrow (1 - c_c)\boldsymbol{p_c} + \mathbb{1}_{\{\|\boldsymbol{p_\sigma}\| < 1.5\sqrt{n}\}} \sqrt{1 - (1 - c_c)^2} \sqrt{\mu_w} \boldsymbol{y}_w$ (cumulation for $\boldsymbol{C}$)
$\quad$ $\boldsymbol{p_\sigma} \leftarrow (1 - c_\sigma)\boldsymbol{p_\sigma} + \sqrt{1 - (1 - c_\sigma)^2} \sqrt{\mu_w} \boldsymbol{C}^{-\frac{1}{2}} \boldsymbol{y}_w$ (cumulation for $\sigma$)
$\quad$ $\boldsymbol{C} \leftarrow (1 - c_1 - c_{\mu}) \boldsymbol{C} + c_1 \boldsymbol{p_c} \boldsymbol{p_c}^\top + c_{\mu} \sum_{i=1}^{\mu} w_i \boldsymbol{y}_{i:\lambda} \boldsymbol{y}_{i:\lambda}^\top$ (update $\boldsymbol{C}$)
$\quad$ $\sigma \leftarrow \sigma \exp\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\|\boldsymbol{p_\sigma}\|}{\mathbb{E}\|\mathcal{N}(\boldsymbol{0}, \textbf{I})\|} - 1\right)\right)$ (update $\sigma$)
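In practice this loop is rarely implemented by hand; the sketch below uses the `cma` Python package (`pip install cma`) through its ask-and-tell interface. The initial mean and step size are illustrative choices.

```python
import cma

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

es = cma.CMAEvolutionStrategy([-1.0, 1.5], 0.5)   # initial m and sigma
while not es.stop():
    solutions = es.ask()                           # sample lambda candidates
    es.tell(solutions, [rosenbrock(x) for x in solutions])
print(es.result.xbest)                             # should approach (1, 1)
```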
References
Hansen, N. and Ostermeier, A., 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), pp. 159-195.
Hansen, N., 2016. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772.
Particle Swarm Optimization
Particle Swarm Optimization (PSO)
Uses momentum: accumulates velocity in a favorable direction
- To accelerate convergence toward minima
- To avoid being disturbed by local perturbations
Each individual (or particle) in the population keeps track of:
- its current position
- its current velocity
- the best position it has seen so far
Particle Swarm Optimization (PSO)
At each iteration, each individual is accelerated toward:
- the best position it has seen
- the best position found thus far by any individual
The acceleration is weighted by a random term
Particle Swarm Optimization (PSO)
The update equations are:
$$
\begin{aligned}
x^{(i)} &\leftarrow x^{(i)} + v^{(i)} \\
v^{(i)} &\leftarrow wv^{(i)} + c_1 r_1 \left( x_{\text{best}}^{(i)} - x^{(i)} \right) + c_2 r_2 \left( x_{\text{best}} - x^{(i)} \right)
\end{aligned}
$$
with:

| Symbol | Meaning |
|---|---|
| $\boldsymbol{x}^{(i)}, \boldsymbol{v}^{(i)}$ | position and velocity of particle $i$ |
| $\boldsymbol{x}_{\text{best}}^{(i)}$ | the best position found so far by particle $i$ |
| $\boldsymbol{x}_{\text{best}}$ | the best position found so far over all particles |
| $w, c_1, c_2$ | parameters (inertia and momentum coefficients) |
| $r_1, r_2$ | random numbers drawn from $\mathcal{U}(0,1)$ |
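Below is a minimal NumPy sketch of these update equations on the 2-D Rosenbrock function; the swarm size and the coefficients $w$, $c_1$, $c_2$ are illustrative assumptions:

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

rng = np.random.default_rng(0)
n_particles, n_dims = 30, 2
w, c1, c2 = 0.7, 1.5, 1.5                      # inertia and momentum coefficients

x = rng.uniform(-2, 2, (n_particles, n_dims))  # positions
v = np.zeros_like(x)                           # velocities
x_best = x.copy()                              # per-particle best positions
f_best = np.apply_along_axis(rosenbrock, 1, x)
g_best = x_best[f_best.argmin()]               # global best position

for iteration in range(200):
    x += v
    f = np.apply_along_axis(rosenbrock, 1, x)
    improved = f < f_best
    x_best[improved], f_best[improved] = x[improved], f[improved]
    g_best = x_best[f_best.argmin()]
    r1, r2 = rng.random((2, n_particles, 1))   # random weights per particle
    v = w * v + c1 * r1 * (x_best - x) + c2 * r2 * (g_best - x)

print(g_best)  # should approach the optimum (1, 1)
```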
Part 3
Application to Reinforcement Learning
Reinforcement Learning Landscape
In Reinforcement Learning, we can either:
- Learn the value function $V$ or the action-value function $Q$ (DQN, …)
- Optimize the policy $\pi_{\boldsymbol{\theta}}$ directly (REINFORCE)
- Do both (Actor-Critic methods)
Direct Policy Search (DPS)
Policy Gradient is a special case of Direct Policy Search (DPS)
Policy Gradient (PG)
Policy Gradient is a special case of Direct Policy Search (DPS):
- PG searches for $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ with a gradient ascent method
- Iteratively update $\boldsymbol{\theta}$ : $\quad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
- A convenient analytical formulation of $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ is given by the Policy Gradient theorem:
$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} V^{\pi_\boldsymbol{\theta}}(\boldsymbol{s}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\boldsymbol{s}, \boldsymbol{a}) Q^{\pi_{\boldsymbol{\theta}}}(\boldsymbol{s}, \boldsymbol{a}) \right]$$
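As a minimal illustration of this estimator, the sketch below runs REINFORCE on a toy 3-armed bandit (a single-state MDP) with a softmax policy over parameters $\boldsymbol{\theta}$; the reward values and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 3.0])   # expected reward of each arm
theta = np.zeros(3)                        # policy parameters
alpha = 0.1                                # learning rate

def policy(theta):
    e = np.exp(theta - theta.max())        # softmax probabilities
    return e / e.sum()

for step in range(2000):
    p = policy(theta)
    a = rng.choice(3, p=p)
    r = true_rewards[a] + rng.normal()     # noisy reward
    grad_log_pi = -p                       # gradient of log pi(a) w.r.t. theta
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi       # gradient ascent on J(theta)

print(policy(theta))  # should concentrate on the best arm (index 2)
```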
DPS: gradient based VS gradient free methods
Possible issues with Policy Gradient (gradient-based) methods:
- Convergence may be slow
- They only find local optima
- They require an analytical formulation of $\nabla_{\boldsymbol{\theta}} \ln \pi_{\boldsymbol{\theta}}$, which is not always known
Direct Policy Search methods using gradient-free optimization procedures are interesting alternatives to tackle these issues:
- No gradient required
- They search for global optima
- Efficient if the parameter space is not too big (< 100 dimensions)
- MDP (Markov Decision Process) assumptions are not even required!
Direct Policy Search (DPS) - The Policy
Usually, the policy $\pi_{\boldsymbol{\theta}}$ (i.e. the agent) is a neural network
Usually, $\boldsymbol{\theta}$ contains the weights of the neural network
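As an illustration, the sketch below performs Direct Policy Search with the cross-entropy method from Part 2 on a toy 1-D point-mass task; the environment, the linear policy, and all sizes are illustrative assumptions, but the same loop applies to a neural network policy by optimizing its flattened weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta, horizon=50):
    s = np.array([1.0, 0.0])                  # state: position, velocity
    total = 0.0
    for _ in range(horizon):
        a = float(theta @ s)                  # linear policy
        s = np.array([s[0] + 0.1 * s[1],
                      s[1] + 0.1 * np.clip(a, -1, 1)])
        total += -(s[0] ** 2)                 # reward: stay close to position 0
    return total

# CEM over the policy parameters (RL maximizes, so sort by descending return)
mu, Sigma, m, m_elite = np.zeros(2), np.eye(2), 50, 5
for iteration in range(30):
    thetas = rng.multivariate_normal(mu, Sigma, size=m)
    returns = np.array([episode_return(th) for th in thetas])
    elite = thetas[np.argsort(-returns)[:m_elite]]
    mu = elite.mean(axis=0)
    Sigma = np.cov(elite.T, bias=True) + 1e-6 * np.eye(2)

print(mu, episode_return(mu))
```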
Machine Learning
A few application examples among thousands
Flexible Muscle-Based Locomotion for Bipedal Creatures
T. Geijtenbeek, M van de Panne, F van der Stappen
https://youtu.be/pgaEE27nsQw
Direct Policy Search (DPS) - The Policy
$\boldsymbol{\theta}$ can also define the topology of the neural network
This is called neuroevolution
Optimization landscape
Descent methods
- Gradient descent, Nesterov momentum, AdaGrad, Conjugate Gradient, Newton, …
- Widely used in Machine Learning
Local gradient free methods
- Pattern search
- Simple and useful alternative to gradient descent when we have no gradient
Optimization landscape
Global gradient free methods (blackbox)
- CEM, CMA-ES, SA-ES, GA, DE, ...
- Difficult multimodal functions
- Up to hundreds of dimensions
- Alternative to Deep RL for some problems
Optimization landscape
Linear programming
- Simplex algorithm
- Up to millions of dimensions
- But only for linear objective functions and linear constraints
- Widely used in industry (traffic planning, logistics, …)
Benchmarks
Black box optimization
Software packages
Open Source
- CMA-ES
- scipy.optimize
- DEAP
- ...
Mathematical programming
- Open source: Coin-OR (CBC, IPopt, Pyomo, PuLP, …), GLPK, CVXOPT
- Commercial: CPLEX, Xpress, Gurobi, Knitro, AMPL, GAMS, ...