A cheat sheet of mathematical formulas for the foundations of Continuous Optimization

Gradient descent being like… — Photo by Alex Lange — Unsplash

A big part of the formulas below comes from my notes at the DSTI. I also added curves I found or created, as well as more gradient descent algorithms. Please note that this is a formula cheat sheet, not a course; it is good for checking or refreshing your knowledge.

If you are specifically interested in gradient descent algorithms and not so much into refreshing mathematical foundations such as the Euler equality or the Lagrangian, you may like the following article, which dives deeper into the details of each optimization algorithm: “A glimpse of the maths behind Optimization Algorithms”.

  • Indexes : 1, 2, … , i, j, k, m, n ∈ [1,n] ⊂ ℕ
  • In regular: a, b, c, …, t, u, v, x, y, z ∈ ℝ
  • In bold: x is a vector of coordinates xᵢ ∈ ℝ with i ∈ [1,n]
  • Usually, x is the variable, y and z are directions of space, while a and b are fixed vectors.
  • êᵢ are the unit vectors of the orthonormal basis
  • In uppercase: A is a matrix of size n×m and elements aᵢⱼ
  • J : V ⊂ ℝⁿ ↦ ℝ is the function to optimize
  • ε represents a small quantity in ℝ
  • θ ∈ [0,1] ⊂ ℝ
  • g is the vector of the m₁ equality constraints gᵢ, such that gᵢ(x) = 0 ∀i ∈ {1, …, m₁}
  • 𝛌 is the associated vector of Lagrange multipliers λᵢ
  • h is the vector of the m₂ inequality constraints hⱼ, such that hⱼ(x) ≤ 0 ∀j ∈ {1, …, m₂}
  • 𝛍 is the associated vector of Lagrange multipliers µⱼ

What is a minimum?

x* is a global minimum of J if J(x*) ≤ J(x) ∀x :

\begin{aligned}
 &\mathbf{x^*} = \min_K J(\mathbf{x}) \; \text{ with } \; J : K \rightarrow \mathbb{R} \Leftrightarrow \\
 \\ 
 &J(\mathbf{x^*}) \leq J(\mathbf{x}) \; \; \; \forall \mathbf{x} \in K
 \end{aligned}

x* is a local minimum of J if there is a neighbourhood N of x* where it is the minimum:

x^* = \min J(\mathbf{x}) \Leftrightarrow \\
 \\ 
 \exists \mathcal{N} \; \; \operatorname{s.t.} \; \; J(\mathbf{x^*}) \leq J(\mathbf{x}) \; \; \; \forall \mathbf{x} \in \mathcal{N}(\mathbf{x^*})

Gateaux differentiability

J is Gateaux differentiable at x if:

  • J’(x;y) exists ∀y
  • y ↦ J’(x;y) is linear
J’(\mathbf{x};\mathbf{y}) = \lim_{\varepsilon \rightarrow 0} \frac{J(\mathbf{x}+\varepsilon \mathbf{y})-J(\mathbf{x})}{\varepsilon}

If y is a vector êᵢ of the orthonormal basis:

J’(\mathbf{x};\mathbf{\widehat{e}}_i) = \frac{\partial J}{\partial x_i}

We can write it with the gradient:

J’(\mathbf{x};\mathbf{y}) = \nabla J(\mathbf{x}) \cdot \mathbf{y}
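
As a quick numerical sanity check of this relation, the Gateaux derivative can be approximated by a finite difference and compared with the gradient formula. A minimal sketch in Python (the quadratic J, the point x and the direction y below are arbitrary choices for illustration):

import numpy as np

# Hypothetical example: J(x) = x1^2 + 3*x2^2, whose gradient is (2*x1, 6*x2)
def J(x):
    return x[0]**2 + 3 * x[1]**2

def grad_J(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])   # an arbitrary direction
eps = 1e-6

# Finite-difference approximation of the Gateaux derivative J'(x; y)
finite_diff = (J(x + eps * y) - J(x)) / eps
print(finite_diff, grad_J(x) @ y)   # both close to 2*0.5 + 12*(-1) = -11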

J is twice Gateaux differentiable at x if:

  • J’’(x;y,z) exists ∀y,z
  • (y,z) ↦ J’’(x;y,z) is bilinear
J’’(\mathbf{x};\mathbf{y},\mathbf{z}) = \lim_{\varepsilon \rightarrow 0} \frac{J’(\mathbf{x}+\varepsilon \mathbf{z};\mathbf{y})-J’(\mathbf{x};\mathbf{y})}{\varepsilon}

Using vectors of the orthonormal basis:

J’’(\mathbf{x};\mathbf{\widehat{e}}_i, \mathbf{\widehat{e}}_j) = \frac{\partial^2 J(\mathbf{x})}{\partial x_i \partial x_j}

We can write it with the Hessian matrix:

J’’(\mathbf{x};\mathbf{y},\mathbf{z}) = \nabla^2 J(\mathbf{x}) \, \mathbf{y} \cdot \mathbf{z}

Similarly to what we know for the second order partial derivatives:

J’’(\mathbf{x};\mathbf{y},\mathbf{z}) = J’’(\mathbf{x};\mathbf{z},\mathbf{y})

On some occasions, we will use the first terms of the polynomial (Taylor) approximation of J in dimension n, in forms such as:

J(\mathbf{x}+\mathbf{y}) = J(\mathbf{x}) + J’(\mathbf{x};\mathbf{y}) + \frac{1}{2} \cdot J’’(\mathbf{x};\mathbf{y},\mathbf{y}) + \cdots

Also:

\exists x_0 \in ]\mathbf{x},\mathbf{x}+\mathbf{y}[ \;\;\; s.t. \;\;\; J(\mathbf{x}+\mathbf{y}) = J(\mathbf{x}) + J’(\mathbf{x_0};\mathbf{y})

We can also use:

\begin{aligned}
 J(\mathbf{x}) &= J(\mathbf{x^*}) + J’(\mathbf{x^*};\mathbf{x-x^*}) + \frac{1}{2} \cdot J’’(\mathbf{x^*};\mathbf{x-x^*},\mathbf{x-x^*}) + \cdots
 \\\\
 J(\mathbf{x^* + y}) &= J(\mathbf{x^*}) + J’(\mathbf{x^*};\mathbf{y}) + \frac{1}{2} \cdot J’’(\mathbf{x^*};\mathbf{y},\mathbf{y}) + \cdots
 \end{aligned}

Existence of a minimum

J : K ⊂ ℝⁿ → ℝ, continuous, has a minimum if K is closed, non-empty, and either:

  • K is bounded (meaning that K fits inside some ball; we can also note that bounded + closed = compact in ℝⁿ)
  • Or lim J(x) = +∞ when ∥x∥ → +∞ (J is “infinite at infinity”)

These examples show why a closed set is part of the sufficient conditions, yet is not necessary:

Own work released under CC BY 3.0

These examples show why a bounded set, or a function that tends to infinity at infinity, is part of the sufficient conditions, yet is not necessary:

Own work released under CC BY 3.0

Convexity

Since we usually want to minimize functions, convexity is a nice thing to check.

  • If J is convex, any local minimum is a global minimum.
  • If J is strictly convex, the minimum, if it exists, is unique.

For all the definitions below: ∀x,y ∈ K ; ∀θ ∈ [0,1]; α > 0

The set K ⊂ ℝⁿ is convex if the segment between any two of its points stays inside K: θx + (1−θ)y ∈ K ∀x,y ∈ K, ∀θ ∈ [0,1].

We need K to be convex in order for J: K → ℝ to be convex.

On ℝ², drawing potatoes K:

Own work released under CC BY 3.0

If K = ℝ, a counter-example could be [0,1] ∪ [3,4], which is not convex, as is any disconnected set.

alpha convex ⇒ strictly convex ⇒ convex

\begin{cases}
 \text{convex: } & J(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq \theta J(\mathbf{x}) + (1-\theta) J(\mathbf{y}) \\ \\
 \text{strictly convex: } & J(\theta \mathbf{x}+(1-\theta) \mathbf{y}) < \theta J(\mathbf{x}) + (1-\theta) J(\mathbf{y}) \\ \\
 \text{alpha convex: } & J(\theta \mathbf{x}+(1-\theta) \mathbf{y}) \leq \theta J(\mathbf{x}) + (1-\theta) J(\mathbf{y}) - \frac{\alpha}{2} \theta (1-\theta) ||\mathbf{y}-\mathbf{x}||^2
 \end{cases}

Graphically, this can be interpreted as:

Own work released under CC BY 3.0
\begin{cases}
 \text{convex: } & J(\mathbf{y}) \geq J(\mathbf{x}) + J’(\mathbf{x};\mathbf{y}-\mathbf{x}) \\ \\
 \text{strictly convex: } & J(\mathbf{y}) > J(\mathbf{x}) + J’(\mathbf{x};\mathbf{y}-\mathbf{x}) \\ \\
 \text{alpha convex: } & J(\mathbf{y}) \geq J(\mathbf{x}) + J’(\mathbf{x};\mathbf{y}-\mathbf{x}) + \frac{\alpha}{2}||\mathbf{y}-\mathbf{x}||^2
 \end{cases}

Or, “saying” that the slope in the direction of the segment [x, y] is increasing (the gradient is monotone):

\begin{cases} 
 \text{convex: } & J’(\mathbf{y};\mathbf{y}-\mathbf{x}) \geq J’(\mathbf{x};\mathbf{y}-\mathbf{x}) \\ \\ 
 \text{strictly convex: } & J’(\mathbf{y};\mathbf{y}-\mathbf{x}) > J’(\mathbf{x};\mathbf{y}-\mathbf{x}) \\ \\ 
 \text{alpha convex: } & J’(\mathbf{y};\mathbf{y}-\mathbf{x}) \geq J’(\mathbf{x};\mathbf{y}-\mathbf{x}) + \alpha||\mathbf{y}-\mathbf{x}||^2 
 \end{cases}

ヽ(⌐■_■)ノ♪♬

\begin{cases} \text{convex: } & J’’(\mathbf{x};\mathbf{y},\mathbf{y}) \geq 0 \;\;\; \forall \mathbf{y} \\ \\ \text{strictly convex: } & J’’(\mathbf{x};\mathbf{y},\mathbf{y}) > 0 \;\;\; \forall \mathbf{y} \neq 0 \\ \\ \text{alpha convex: } & J’’(\mathbf{x};\mathbf{y},\mathbf{y}) \geq \alpha ||\mathbf{y}||^2 \;\;\; \forall \mathbf{y} \end{cases}

Actually, due to an issue at the limits, J’’(x;y,y) > 0 only implies (⇒) that J is strictly convex, while the other two conditions are equivalences (⇔). But we will rarely run into this case…

We can rewrite this with the eigenvalues λᵢ of the Hessian matrix ∇²J(x):

\begin{cases}
 \text{convex: } & \lambda_i \geq 0 \;\;\; \forall i\\ \\
 \text{strictly convex: } & \lambda_i > 0 \;\;\; \forall i\\ \\
 \text{alpha convex: } & \lambda_i \geq \alpha \;\;\; \forall i
 \end{cases}
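
For a concrete function, these criteria can be checked numerically on the Hessian. A small sketch with NumPy, on an arbitrary quadratic form (the matrix A below is just an example; for J(x) = ½ Ax·x the Hessian is A everywhere):

import numpy as np

# J(x) = 1/2 x.A x with a symmetric A: its Hessian is A everywhere
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigenvalues = np.linalg.eigvalsh(A)    # eigenvalues of the symmetric Hessian
print(eigenvalues)                     # all > 0 here, so J is strictly convex
print("alpha-convex with alpha =", eigenvalues.min())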

Optimality conditions

“How to find the min”

  • K ⊂ ℝⁿ is open
  • J is Gateaux differentiable
\mathbf{x^*} = \min J \;\; \Rightarrow \begin{cases} &J’(\mathbf{x^*};\mathbf{y}) = 0 \;\;\; \color{Gray} \forall \mathbf{y }\in K\\\\ &\nabla J(\mathbf{x^*}) = 0 \end{cases}

If K and J are convex, we even have an ⇔

Own work released under CC BY 3.0

Second order, assuming additionally that J is twice Gateaux differentiable:

x^* = \min J \;\; \Rightarrow \begin{cases} &\nabla^2 J(\mathbf{x^*}) \geq 0 \\ &\nabla^2 J(\mathbf{x^*}) \, \mathbf{y} \cdot \mathbf{y} \geq 0 \;\;\;\;{\color{Gray} \forall \mathbf{y} \in K}\\ &J’’(\mathbf{x^*};\mathbf{y},\mathbf{y}) \geq 0 \;\;\;\;{\color{Gray} \forall \mathbf{y} \in K} \end{cases}

Sufficient condition: the above must additionally hold on a ball B(x*, ε).

The ball somehow defines a “locally convex” function around the minimum.

  • K ⊂ ℝⁿ is convex
  • J is Gateaux differentiable
x^* = \min J \;\; \Rightarrow \begin{cases} &J’(\mathbf{x^*};\mathbf{y}-\mathbf{x^*}) \geq 0 \;\;\;\;\;\;\;\; \color{Gray} \forall \mathbf{y} \in K\\\\ &\nabla J(\mathbf{x^*}) \cdot (\mathbf{y}-\mathbf{x^*}) \geq 0 \;\;\; \color{Gray} \forall \mathbf{y} \in K \end{cases}

If J is convex, we even have an ⇔

Own work released under CC BY 3.0

Equality constraints

We search for inf J(x) over x ∈ K, where K = { x ∈ X s.t. g(x) = 0 }, which means, for the coordinates gᵢ of g: gᵢ(x) = 0 ∀i ∈ {1,…,m}.

\begin{cases} & \mathbf{x}^* = \inf_K J(\mathbf{x}) \\ & K = \{\mathbf{x} \in \mathcal{X} \; \operatorname{s.t.} \; \mathbf{g}(\mathbf{x})=0 \} \\ \end{cases}

We say that the constraints g(x)=0 are regular at a point x₀ if their gradients are linearly independent there.

\begin{aligned} & \text{The constraints } \mathbf{g}(\mathbf{x}) = 0 \text{ are regular at } \mathbf{x_0} \\ \\ & \Leftrightarrow \{\mathbf{\nabla} \mathrm{g}_i(\mathbf{x_0})\}_{i=1}^{m} \text{ linearly independent} \end{aligned}

If x* is a local minimum of J on K and if the constraints are regular at x*, then there exist m real numbers λᵢ such that:

\nabla J(\mathbf{x^*}) + \sum_{i=1}^{m} \lambda_i \, \nabla \mathrm{g_i}(\mathbf{x^*}) = 0

These values λᵢ are called Lagrange multipliers. Under the regularity assumption, this set of multipliers is unique.

Actually, we are saying that the derivative of a Lagrangian with respect to x is equal to zero at x*. So, with λ = (λ₁, …, λₘ), we associate the following Lagrangian to the minimization problem:

\begin{aligned} \mathcal{L}(\mathbf{x},\boldsymbol{\lambda}) &= J(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i \, \mathrm{g_i}(\mathbf{x}) \\ &= J(\mathbf{x}) + \boldsymbol{\lambda} \cdot \mathbf{g}(\mathbf{x}) \end{aligned}

To handle both the minimization of J and the regular constraints at once, writing the previous minimization problem with the Lagrangian gives:

\begin{cases}
 & \frac{\partial \mathcal{L}(\mathbf{x^*},\boldsymbol{\lambda^*})}{\partial \mathbf{x}} = 0 \\ \\ 
 & \frac{\partial \mathcal{L}(\mathbf{x^*},\boldsymbol{\lambda^*})}{\partial \boldsymbol{\lambda}} = 0
 \end{cases}

The couple (x*, λ*) is a critical point of the Lagrangian.
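
As an illustration, this critical point can be computed symbolically on a toy problem (the function and the constraint below are arbitrary choices): minimize J(x, y) = x² + y² subject to x + y − 1 = 0.

import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

J = x**2 + y**2            # function to minimize
g = x + y - 1              # equality constraint g(x, y) = 0
L = J + lam * g            # Lagrangian

# Critical point of the Lagrangian: all partial derivatives are zero
solution = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam])
print(solution)            # {x: 1/2, y: 1/2, lambda: -1}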

Notion of saddle point

By Nicoguaro — Wikipedia — CC BY 3.0

Let’s consider a function ℒ : ℝⁿ×ℝᵐ → ℝ

If (x*,y*) ∈ ℝⁿ×ℝᵐ is a saddle point of ℒ then:

\inf_{\mathbf{x} \in \mathbb{R}^n} \sup_{\mathbf{y} \in \mathbb{R}^m} \mathcal{L}(\mathbf{x},\mathbf{y})
 = 
 \mathcal{L}(\mathbf{x^*},\mathbf{y^*})
 =
 \sup_{\mathbf{y} \in \mathbb{R}^m} \inf_{\mathbf{x} \in \mathbb{R}^n} \mathcal{L}(\mathbf{x},\mathbf{y})

Inequality constraints

Similarly to equality constraints:

\begin{cases} & \mathbf{x}^* = \inf_K J(\mathbf{x}) \\ & K = \{\mathbf{x} \in \mathcal{X} \; \operatorname{s.t.} \; \mathrm{h_i}(\mathbf{x}) \leq 0 \; \forall i \in \{ 1, \cdots , m\} \; \} \end{cases}
\begin{aligned} \mathcal{L}(\mathbf{x},\boldsymbol{\mu}) &= J(\mathbf{x}) + \sum_{i=1}^{m} \mu_i \, \mathrm{h_i}(\mathbf{x}) \\ &= J(\mathbf{x}) + \boldsymbol{\mu} \cdot \mathbf{h}(\mathbf{x}) \end{aligned}

The inequality constraints imply that all the coordinates of the Lagrange multiplier must be nonnegative. We will work with 𝛍 on the positive cone ℝᵐ₊:

\{ \boldsymbol{\mu} \in \mathbb{R}^{m}_{+} \} = \{\boldsymbol{\mu} \in \mathbb{R}^{m} \; \operatorname{s.t.} \; \mu_i \geq 0 \;\; \forall i \}

If (x*,𝛍*) ∈ ℝⁿ×ℝᵐ₊ is a saddle point of ℒ, then x* belongs to K and is a solution of the minimization problem with inequality constraints. We can write this as:

\; \; \; \forall \mathbf{x} \in \mathbb{R}^n \; ; \; \forall \boldsymbol{\mu} \in \mathbb{R}^m_+ \\\\\mathcal{L}(\mathbf{x^*},\boldsymbol{\mu}) \leq \mathcal{L}(\mathbf{x^*},\boldsymbol{\mu}^*) \leq \mathcal{L}(\mathbf{x},\boldsymbol{\mu}^*)

From the left inequality, we can eliminate the J(x*) terms on both sides:

\sum_{i=1}^{m} \mu_i \, \mathrm{h_i}(\mathbf{x^*}) \leq \sum_{i=1}^{m} \mu_i^* \, \mathrm{h_i}(\mathbf{x^*})

There are two possible cases for each i:

  • Either the constraint is inactive, which means that x* is not on its boundary. Then we have μᵢ = 0 and hᵢ(x*) < 0.
  • Or the constraint is active, which means that x* is on the boundary. Then we have μᵢ ≥ 0 and hᵢ(x*) = 0.

In practice, this is how we will proceed. For each i, we will solve for both cases: μᵢ = 0 and hᵢ(x*) = 0.

This can be a lot of cases: if there are 3 constraints (m = 3), there are 2³ = 8 cases to solve.

The right inequality of the saddle point gives:

J(\mathbf{x^*}) + \sum_{i=1}^{m} \mu^*_i \, \mathrm{h_i}(\mathbf{x^*}) \leq J(\mathbf{x}) + \sum_{i=1}^{m} \mu^*_i \, \mathrm{h_i}(\mathbf{x})

We can rely on the Euler equality to solve it:

\mathcal{L}’_\mathbf{x}(\mathbf{x^*},\boldsymbol{\mu}^*) = 0 \; \Rightarrow 
 
 \\ \\ \; \nabla J(\mathbf{x^*}) + \sum_{i=1}^{m} \mu^*_i \nabla h_i(\mathbf{x^*}) = 0

Karush-Kuhn-Tucker

In the convex case, we can rely on the Karush-Kuhn-Tucker conditions, which make things much easier. These are necessary and sufficient conditions of optimality, given J convex and J, h differentiable.

In the non-convex case, these are necessary but not sufficient conditions for x* to be a solution of the optimization problem with inequality constraints.

If ∃x* ∈ ℝⁿ such that, ∀i ∈ {1,…,m}:

  • either hᵢ(x*) < 0
  • or hᵢ(x*) = 0 with hᵢ affine.

Then, x* being a solution of the minimization problem with inequality constraints is equivalent to the existence of 𝛍* such that (x*,𝛍*) is a saddle point of ℒ on ℝⁿ×ℝᵐ₊.

We can write the Kuhn-Tucker relations:

\begin{cases} & h_i(\mathbf{x^*}) \leq 0 \; \; \; \forall i\\ & \mu_i \geq 0 \; \; \; \forall i \\ & \sum_{i=1}^{m} \mu_i h_i(\mathbf{x^*}) = 0 \\ & \nabla J(\mathbf{x^*}) + \sum_{i=1}^{m} \mu_i \nabla h_i(\mathbf{x^*}) = 0 \end{cases}
  • The first 3 lines are given by the left inequality of the saddle point
  • The last line is given by the right inequality of the saddle point
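
As an illustration of the case enumeration and of these Kuhn-Tucker relations on a toy problem (the function and constraint below are arbitrary choices), take: minimize J(x) = (x − 2)² subject to h(x) = x − 1 ≤ 0, solved here with sympy.

import sympy as sp

x, mu = sp.symbols("x mu", real=True)
J = (x - 2)**2
h = x - 1                                   # constraint h(x) <= 0

stationarity = sp.Eq(sp.diff(J, x) + mu * sp.diff(h, x), 0)

# Case 1: inactive constraint (mu = 0)
print(sp.solve(stationarity.subs(mu, 0), x))        # x = 2, but h(2) = 1 > 0: rejected

# Case 2: active constraint (h(x) = 0)
print(sp.solve([stationarity, sp.Eq(h, 0)], [x, mu]))   # x = 1, mu = 2 >= 0: KKT point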

We can also note that affine constraints are always qualified, provided all the constraints are affine and the resulting subset is not empty.

We can write this case as:

K = \begin{Bmatrix} \mathbf{x} \in \mathbb{R}^n \text{ s.t. } & g_i(\mathbf{x}) = 0 \;\; \;\; {\color{Gray} \forall i \in \{1, \cdots, m_1\}}\\ & h_j(\mathbf{x}) \leq 0 \;\; \;\; {\color{Gray} \forall j \in \{1, \cdots, m_2\}}\\ \end{Bmatrix}

We need two qualification conditions, which may be pretty hard to check, except when all the constraints are affine.

Qualification hypothesis 1:

We need the existence of a vector z that allows re-entering the domain of the constraints when we are at the boundary.

\exists \mathbf{z} \in \mathbb{R}^n \text{ s.t. } \begin{cases} & \nabla g_i(\mathbf{x^*}) \cdot \mathbf{z} = 0 \; \; \; \forall i\\ & \nabla h_j(\mathbf{x^*}) \cdot \mathbf{z} < 0 \; \, \; \; \forall j \end{cases}

The second line may be an issue and shall be handled carefully if some constraints are affine and some others are not.

Qualification hypothesis 2:

The gradients ∇gᵢ(x*) must be linearly independent ∀i ∈ {1,…,m₁}.

Theorem:

If x* is a local minimum of J on K and if the above qualification hypotheses are satisfied, then:

  • ∃𝛌 = (λ₁, …, λₘ₁) ∈ ℝᵐ¹
  • ∃𝛍 = (μ₁, …, μₘ₂) ∈ ℝᵐ²₊

such that, ∀i ∈ {1,…,m₁} ; ∀j ∈ {1,…,m₂}:

\begin{cases} 
 & g_i(\mathbf{x^*}) = 0 \\ 
 & h_j(\mathbf{x^*}) \leq 0 \\ 
 & \mu_j \geq 0 \\ 
 & \sum_{j=1}^{m_2} \mu_j h_j(\mathbf{x^*}) = 0 \\ 
 & \nabla J(\mathbf{x^*}) + \sum_{i=1}^{m_1} \lambda_i \nabla g_i(\mathbf{x^*}) + \sum_{j=1}^{m_2} \mu_j \nabla h_j(\mathbf{x^*}) = 0 \end{cases}

This can also be written with matrices:

\begin{cases} 
 & \mathbf{g}(\mathbf{x^*}) = 0 \\ 
 & \mathbf{h}(\mathbf{x^*}) \leq 0 \\ 
 & \boldsymbol{\mu} \geq 0 \\ 
 & \boldsymbol{\mu}^\intercal \mathbf{h}(\mathbf{x^*}) = 0 \\ 
 & \nabla J(\mathbf{x^*}) + D \mathbf{g}(\mathbf{x^*})^\intercal \boldsymbol{\lambda} + D \mathbf{h}(\mathbf{x^*})^\intercal \boldsymbol{\mu} = 0 \end{cases}

Practically, in an algorithm, when the hypotheses are hard to verify, we can check the value at the specific points where calculus issues arise.

Base formulas of optimization algorithms

This part is only a list of formulas for reminder. If you want to get through the details and explanations you may read the article: “A glimpse of the maths behind Optimization Algorithms”.

The general formula for gradient descent is:

\mathbf{x_{k+1}} = \mathbf{x_{k}} - \rho_k \nabla J(\mathbf{x_{k}})

We are basically saying that we will move in the direction opposite to the gradient, by a step of ρₖ times the gradient norm.

In practice, terms are added to help it converge better and faster.

The fixed-step gradient descent is the most “mathematically simple” algorithm, while not a very performant one. Here, we simply choose a fixed value of ρₖ, which we will therefore call ρ.

\mathbf{x_{k+1}} = \mathbf{x_{k}} - \rho \nabla J(\mathbf{x_{k}})

We are working under the previous hypotheses: J is alpha-convex and Gateaux differentiable.

∇J is Lipschitz continuous (with constant M):

\begin{aligned}
 & \exists \textsc{m} \in \mathbb{R} \text{ such that } \forall \mathbf{x_1},\mathbf{x_2} \in \mathbb{R}^n :
 \\ 
 \\ 
 & \lVert \nabla J(\mathbf{x_1}) - \nabla J(\mathbf{x_2}) \rVert \leq \textsc{m} \lVert \mathbf{x_1}-\mathbf{x_2} \rVert
 \end{aligned}

The condition for convergence is:

0 < \rho < \frac{2\alpha}{\textsc{m}^2}
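
A minimal sketch of the fixed-step gradient descent in Python (the quadratic J below is an arbitrary example; for it, α and M are simply the smallest and largest eigenvalues of A, which lets us pick a ρ satisfying the condition above):

import numpy as np

# Example: J(x) = 1/2 x.A x - b.x, whose gradient is A x - b
A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(x):
    return A @ x - b

eigs = np.linalg.eigvalsh(A)
alpha, M = eigs.min(), eigs.max()     # alpha-convexity and Lipschitz constants
rho = 0.9 * 2 * alpha / M**2          # fixed step satisfying 0 < rho < 2 alpha / M^2

x = np.zeros(2)
for _ in range(200):
    x = x - rho * grad_J(x)

print(x, np.linalg.solve(A, b))       # both close to the minimizer A^{-1} b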

In the optimal step gradient algorithm, we also choose ρₖ to maximize the progress made by each step, i.e. to minimize J(xₖ₊₁) along the descent direction.

\begin{cases}
 & \mathbf{x_{k+1}} = \mathbf{x_{k}} - \rho_k \nabla J(\mathbf{x_{k}}) \\ 
 & {\small \rho_k \text{ s.t. } J(\mathbf{x_{k}} - \rho_k \nabla J(\mathbf{x_{k}})) = \inf_{\rho > 0} J(\mathbf{x_{k}} - \rho \nabla J(\mathbf{x_{k}}))}
 \end{cases}

The optimal step gradient algorithm converges for any J that is Gateaux differentiable, alpha-convex, and whose gradient is Lipschitz continuous on every bounded subset.
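
A sketch of the optimal step version, where the one-dimensional search over ρ is delegated to scipy (the quadratic objective is again an arbitrary example, and minimize_scalar is just one possible way to perform the line search):

import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, 1.0])

def J(x):
    return 0.5 * x @ A @ x - b @ x

def grad_J(x):
    return A @ x - b

x = np.zeros(2)
for _ in range(20):
    g = grad_J(x)
    # One-dimensional minimization of rho -> J(x - rho * g)
    rho_k = minimize_scalar(lambda rho: J(x - rho * g), bounds=(0, 10), method="bounded").x
    x = x - rho_k * g

print(x, np.linalg.solve(A, b))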

The conjugate gradient algorithm converges in n steps, where n is the dimension of the space. In practice, rounding errors prevent exact convergence in simulations, yet it remains faster than the previous methods.

\begin{aligned}
 J(\mathbf{x_{k+1}}) = \inf_{\boldsymbol{\alpha} \in \mathbb{R}^{k+1}} J(\mathbf{x_k} + \sum_{i=0}^{k}\alpha_i \nabla J(\mathbf{x_i}) )
 \end{aligned}

On quadratic functions J(x) = ½ Ax·x − b·x, if we denote dᵢ the descent direction at the iᵗʰ iteration, the algorithm is the following:

\begin{aligned}
 & \text{1) }\;\; \mathbf{d_0} = \nabla J(\mathbf{x_0})
 \\ \\
 & \text{2) }\;\; \mathbf{x_{k+1}} = \mathbf{x_{k}} - \rho_k \mathbf{d_{k}} {\color{Gray} \text{ with }} \rho_k=\frac{\nabla J(\mathbf{x_{k}})\cdot \mathbf{d_{k}}}{\boldsymbol{\mathrm{A}}\mathbf{d_{k}} \cdot \mathbf{d_{k}}} 
 \\ \\
 & \text{3) }\;\; \mathbf{d_{k+1}} = \nabla J(\mathbf{x_{k+1}}) + \beta_k \mathbf{d_{k}} {\color{Gray} \text{ with }} \beta_k = - \frac{\nabla J(\mathbf{x_{k+1}})\cdot \mathrm{A}\mathbf{d_{k}}}{\boldsymbol{\mathrm{A}}\mathbf{d_{k}} \cdot \mathbf{d_{k}}}
 \end{aligned}

One of the common uses of this algorithm is to solve linear systems in the form Ax = b, by transforming it into a quadratic function.
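
A sketch of the conjugate gradient iteration above, used to solve Ax = b through the quadratic J(x) = ½ Ax·x − b·x (A must be symmetric positive definite; the matrix below is an arbitrary example):

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad_J(x):
    return A @ x - b                  # gradient of J(x) = 1/2 x.A x - b.x

x = np.zeros(2)
d = grad_J(x)                         # 1) d_0 is the gradient at x_0
for _ in range(len(b)):               # converges in n steps in exact arithmetic
    g = grad_J(x)
    rho = (g @ d) / (A @ d @ d)       # 2) optimal step along d
    x = x - rho * d
    g_next = grad_J(x)
    beta = -(g_next @ (A @ d)) / (A @ d @ d)
    d = g_next + beta * d             # 3) new A-conjugate direction

print(x, np.linalg.solve(A, b))       # both equal to the solution of A x = b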

For Newton’s method, at each step we need to solve:

\begin{cases}
 & \boldsymbol{\delta} \mathbf{x_k} \text{ {\small such that: } } \nabla^2 J(\mathbf{x_k}) \boldsymbol{\delta} \mathbf{x_k} = - \nabla J(\mathbf{x_k}) \\
 & \mathbf{x_{k+1}} = \mathbf{x_k} + \boldsymbol{\delta} \mathbf{x_k}
 \end{cases}

In practice, since computing the Hessian is costly, we do it only at some steps.
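
A minimal sketch of this Newton step (the smooth, strictly convex test function below is an arbitrary choice):

import numpy as np

# Arbitrary smooth, strictly convex test function:
# J(x) = exp(x1) + x1^2 + x2^2
def grad_J(x):
    return np.array([np.exp(x[0]) + 2 * x[0], 2 * x[1]])

def hess_J(x):
    return np.array([[np.exp(x[0]) + 2.0, 0.0],
                     [0.0, 2.0]])

x = np.array([1.0, 1.0])
for _ in range(10):
    delta = np.linalg.solve(hess_J(x), -grad_J(x))   # solve Hessian * dx = -gradient
    x = x + delta

print(x, grad_J(x))     # the gradient is (numerically) zero at the minimum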

For the projected gradient method, we restrict the set on which we are looking for the minimum and call it K.

\mathbf{x^*} = \inf_{\mathbf{x} \in K} J(\mathbf{x})

When the gradient step leaves K, the projection brings the result back onto K:

\mathbf{x_{k+1}} = \operatorname{Proj}_K(\mathbf{x_{k}} - \rho \nabla J(\mathbf{x_{k}})) \; \; \; {\color{Gray} ; \; \; \rho > 0}
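
A sketch of the projected gradient method when K has an easy projection, here, as an arbitrary example, the box [0, 1]ⁿ, where the projection is a simple clipping (the quadratic J and the value of ρ are also arbitrary choices):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([2.0, -1.0])

def grad_J(x):
    return A @ x - b

def proj_K(x):
    return np.clip(x, 0.0, 1.0)       # projection onto the box K = [0, 1]^n

rho = 0.2
x = np.array([0.5, 0.5])
for _ in range(200):
    x = proj_K(x - rho * grad_J(x))

print(x)       # constrained minimizer, here stuck on the boundary of K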

For the penalization method, J: ℝⁿ → ℝ is the function to minimize and F: ℝⁿ → ℝᵐ is the vector function of the constraints, such that F(x)=0. We introduce a penalized cost function:

J_{\varepsilon}(\mathbf{x}) = J(\mathbf{x}) + \frac{1}{\varepsilon} \lVert F(\mathbf{x}) \rVert^2

Then we can solve on all of ℝⁿ, since any point outside of {x s.t. F(x)=0} is strongly penalized.

In practice, if epsilon is too small, the problem becomes ill-conditioned.
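
A sketch of the penalization approach on a toy problem (an arbitrary example): minimize J(x) = x₁² + x₂² under the constraint F(x) = x₁ + x₂ − 1 = 0, whose exact solution is (½, ½).

import numpy as np
from scipy.optimize import minimize

def J(x):
    return x[0]**2 + x[1]**2

def F(x):
    return x[0] + x[1] - 1.0                   # constraint F(x) = 0

def J_eps(x, eps):
    return J(x) + (1.0 / eps) * F(x)**2        # penalized cost

for eps in [1.0, 1e-2, 1e-4]:
    res = minimize(J_eps, x0=np.zeros(2), args=(eps,))
    print(eps, res.x)                          # tends to (0.5, 0.5) as eps shrinks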

The Uzawa algorithm is once again the fixed-step gradient algorithm, but we inject a Lagrangian into the method. J is the function to minimize, F(x) is the vector of the constraints, and J and F are assumed to be convex.

\begin{aligned}
 \\& 1) \; \; \; \mathbf{y_0} \in \mathbb{R}^{m}_+
 \\ 
 \\& 2) \; \; \; \mathcal{L}(\mathbf{x_k^*},\mathbf{y_k^*}) = \inf_{x \in\mathbb{R}^n} \mathcal{L}(\mathbf{x},\mathbf{y_k^*}) 
 \\
 \\& 3) \; \; \; \mathbf{y_{k+1}} = \operatorname{Proj}_{\mathbb{R}^{m}_+}(\mathbf{y_k} + \rho_k \mathbf{F}(\mathbf{x_k})) {\color{Gray} \; \; ; \; \; \rho_k > 0 }
 \end{aligned}

The Uzawa algorithm can be quite slow. We can also add a quadratic penalty term, the augmented Lagrangian, which is alpha-convex and converges better than with linear terms alone.
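
A sketch of the Uzawa iteration on a toy problem (an arbitrary example): minimize J(x) = ½‖x‖² under F(x) = 1 − x₁ − x₂ ≤ 0. Here the inner minimization in x has a closed form, since the Lagrangian is quadratic in x.

import numpy as np

# Minimize J(x) = 1/2 ||x||^2  subject to  F(x) = 1 - x1 - x2 <= 0
# Lagrangian: L(x, y) = 1/2 ||x||^2 + y * (1 - x1 - x2), with y >= 0
rho = 0.5
y = 0.0                                   # 1) start with y_0 in R+
for _ in range(100):
    x = np.array([y, y])                  # 2) argmin_x L(x, y), since x - y*(1, 1) = 0
    F_x = 1.0 - x[0] - x[1]
    y = max(0.0, y + rho * F_x)           # 3) projected ascent step on y

print(x, y)    # x -> (0.5, 0.5), y -> 0.5 (the optimal Lagrange multiplier)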

With stochastic gradient descent, at each step, you take a random subset of the data to compute the gradient. This makes gradient descent very noisy yet scalable on big data sets.

Similarly, with mini-batch gradient descent, we cut the data into a finite number of batches and rotate the computation of the gradient over each batch.
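
A sketch of mini-batch (stochastic) gradient descent on a least-squares problem (the synthetic data, the step and the batch size below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # synthetic features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)    # noisy targets

def batch_grad(w, idx):
    # Gradient of the mean squared error, computed on a subset of the data
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(3)
rho, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):   # rotate over the mini-batches
        idx = order[start:start + batch_size]
        w = w - rho * batch_grad(w, idx)

print(w)       # close to w_true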

Momentum adds a fraction of the previous update vector to the current one. This can be written:

\begin{cases}
 & \mathbf{v_{k}} = \gamma \mathbf{v_{k-1}} + \rho \nabla J(\mathbf{x_k}) \\ 
 & \mathbf{x_{k+1}} = \mathbf{x_k} - \mathbf{v_{k}}
 \end{cases}

γ can be quite high, as 0.9 is a common value.
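
A sketch of the momentum update on an arbitrary quadratic (γ = 0.9 as mentioned above; ρ is an arbitrary choice):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(x):
    return A @ x - b

gamma, rho = 0.9, 0.05
x = np.zeros(2)
v = np.zeros(2)
for _ in range(300):
    v = gamma * v + rho * grad_J(x)    # accumulate a fraction of the past updates
    x = x - v

print(x, np.linalg.solve(A, b))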

With Nesterov accelerated gradient, we compute the gradient at the actual position of the point after adding the fraction of the previous update vector, rather than at the original starting point:

\begin{cases}
 & \mathbf{v_{k}} = \gamma \mathbf{v_{k-1}} + \rho \nabla J(\mathbf{x_k} - \gamma \mathbf{v_{k-1}}) \\ 
 & \mathbf{x_{k+1}} = \mathbf{x_k} - \mathbf{v_{k}}
 \end{cases}

The idea of Adagrad is to:

  • associate a lower learning rate to frequent features,
  • associate a higher learning rate to rarer features.

Adagrad is therefore especially relevant for sparse datasets, where some features occur rarely yet carry a lot of meaning.

Adagrad defines a diagonal matrix Gₖ whose diagonal elements accumulate the sum of the squared past gradients, coordinate by coordinate. It is used to apply a decay on the learning rate over time:

x_{i,k+1} = x_{i,k} - \frac{\rho}{\sqrt{G_{ii,k}+\varepsilon}} \nabla J(x_{i,k})
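
A per-coordinate sketch of Adagrad on an arbitrary quadratic (ρ and ε below are arbitrary but typical choices):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(x):
    return A @ x - b

rho, eps = 0.5, 1e-8
x = np.zeros(2)
G = np.zeros(2)                         # diagonal of G_k: accumulated squared gradients
for _ in range(2000):
    g = grad_J(x)
    G = G + g**2
    x = x - rho / np.sqrt(G + eps) * g  # per-coordinate decaying learning rate

print(x, np.linalg.solve(A, b))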

The weakness of Adagrad is that the accumulation term Gₖ continues to grow over time, eventually resulting in a very small learning rate. The following methods intend to fix that by taking an exponentially decaying average of the squared gradients.

RMSProp equations look like this:

\begin{cases}
 & \operatorname{E}[g^2]_k = \beta \operatorname{E}[g^2]_{k-1} + (1-\beta)(\nabla J(\mathbf{x_{k}}))^2 \\ 
 & \\ 
 & \mathbf{x_{k+1}} = \mathbf{x_{k}} - \frac{\rho}{\sqrt{\operatorname{E}[g^2]_k}} \nabla J(\mathbf{x_{k}}) 
 \end{cases}

Once again, β can be quite high; a classical value is 0.9.
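
A sketch of the RMSProp update on the same kind of arbitrary quadratic (β = 0.9; a small ε is added to the denominator, as is usual in practice, to avoid dividing by zero):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(x):
    return A @ x - b

rho, beta, eps = 0.005, 0.9, 1e-8
x = np.zeros(2)
Eg2 = np.zeros(2)                            # running average E[g^2]
for _ in range(5000):
    g = grad_J(x)
    Eg2 = beta * Eg2 + (1 - beta) * g**2
    x = x - rho / (np.sqrt(Eg2) + eps) * g

# x approaches A^{-1} b; with a constant rho it keeps oscillating around it
# at a scale of the order of rho.
print(x, np.linalg.solve(A, b))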

Adadelta even adds an exponentially decaying average of the squared update vectors:

\begin{cases}
 & \operatorname{E}[g^2]_k = \beta \operatorname{E}[g^2]_{k-1} + (1-\beta)(\nabla J(\mathbf{x_{k}}))^2 \\ 
 & \\ 
 & \operatorname{D}_k = \beta \operatorname{D}_{k-1} + (1-\beta)(\boldsymbol{\Delta} \mathbf{x_k})^2 \\ 
 & \\ 
 & \mathbf{x_{k+1}} = \mathbf{x_{k}} - \frac{\sqrt{\operatorname{D}_k + \varepsilon}}{\sqrt{\operatorname{E}[g^2]_k} + \varepsilon} \nabla J(\mathbf{x_{k}}) 
 \end{cases}

Adam is inspired by AdaGrad and RMSProp. It is an often-used, general-purpose optimizer.

Adam uses the first-moment (expected value) mₖ and second-moment (variance) vₖ estimates of the gradient.

\begin{cases}
 & \mathbf{m_{k+1}} = \beta_1 \mathbf{m_k} + (1-\beta_1)\nabla J(\mathbf{x_k}) \\
 & \mathbf{v_{k+1}} = \beta_2 \mathbf{v_k} + (1-\beta_2)(\nabla J(\mathbf{x_k}))^2 \\
 \end{cases}

However, since m and v are initialized at 0 and since the β₁ and β₂ constants are close to 1, these terms are biased towards zero. Adam therefore computes bias-corrected moment estimates:

\begin{cases}
 & \mathbf{\widehat{m}_{k+1}} = \frac{\mathbf{m_{k+1}}}{1 - \beta_1^{k+1}} \\
 & \mathbf{\widehat{v}_{k+1}} = \frac{\mathbf{v_{k+1}}}{1 - \beta_2^{k+1}} \\
 \end{cases}

Then, we can write Adam’s update function as:

\mathbf{x_{k+1}} = \mathbf{x_{k}} - \frac{\rho}{\sqrt{\mathbf{\widehat{v}_{k+1}}} + \varepsilon} \, \mathbf{\widehat{m}_{k+1}}
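
A compact sketch putting the Adam pieces together (β₁ = 0.9, β₂ = 0.999 and ε = 1e-8 are the values commonly suggested; the quadratic objective is an arbitrary example, and with a constant ρ Adam hovers very close to the minimizer rather than converging exactly):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, 1.0])

def grad_J(x):
    return A @ x - b

rho, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
x = np.zeros(2)
m = np.zeros(2)                        # first moment estimate
v = np.zeros(2)                        # second moment estimate
for k in range(1, 3001):
    g = grad_J(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**k)         # bias-corrected moments
    v_hat = v / (1 - beta2**k)
    x = x - rho * m_hat / (np.sqrt(v_hat) + eps)

print(x, np.linalg.solve(A, b))        # x hovers very close to the minimizer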

There are many Adam-like algorithms that intend to improve its performance. There is also a trend towards auto-tuning the parameters over time, which seems promising.
