Gradient descent with Python
Introduction
Gradient descent method is a way to find a local minimum of a function. The way it works is we start with an initial guess of the solution and we take the gradient of the function at that point. We step the solution in the negative direction of the gradient and we repeat the process. The algorithm will eventually converge where the gradient is zero (which correspond to a local minimum). Its brother, the gradient ascent, finds the local maximum nearer the current solution by stepping it towards the positive direction of the gradient. They are both first-order algorithms because they take only the first derivative of the function.
Let's say we are trying to find the solution to the minimum of some function
There is a chronical problem to the gradient descent. For functions that have valleys
(in the case of descent) or saddle points (in the case of ascent),
the gradient descent/ascent algorithm zig-zags, because the gradient is nearly orthogonal
to the direction of the local minimum in these regions. It is like being inside a round tube
and trying to stay in the lower part of the tube. In case we are not, the gradient tells us
we should go almost perpendicular to the longitudinal direction of the tube.
If the local minimum is at the end of the tube, it will take a long time to reach
it because we keep jumping between the sides of the tube (zig-zag). The Rosenbrock function is used
to test this difficult problem.
Exercise [A "good" example]
We want to find the minimum value of the following function:
-
Calculate the minimum values of this function.
We calculate the gradient vector of . It yields: Consequently, the possible minimum values are , or .
We then calculate the hessian matrix: The hessian matrix is only positive definite for or whatever the value of . -
Draw the level sets of the function. For this, you can use the following Python code:
import numpy as np import matplotlib.pyplot as plt """ Level sets""" def g(x1,x2): res = (x1**2-4*x1+4)*(x1**2+4*x1+2) return res # level set drawing of the function X1,X2 = np.meshgrid(np.linspace(-4,4,401),np.linspace(-3,2.5,401)) Z = g(X1,X2) graphe = plt.contour(X1,X2,Z,120) plt.show()
-
Define the J and GradJ functions that allows to calculate the gradient of
and the Hess which returns the hessian matrix of .import numpy as np import matplotlib.pyplot as plt """ The J, GradJ and Hess functions""" def J(X): res = (X[0]**2-4*X[0]+4)*(X[0]**2+4*X[0]+2) return res def GradJ(X): res = np.array([4*X[0]**3-20*X[0]+8.0, 0.0]) return res def Hess(X): res = np.array([[#TO DO,#TO DO],[#TO DO,#TO DO]]) return res
-
Gradient Descent with a fixed
coefficient:
Let be the initial value. Calculate the values by using the following algorithm:
While
return
Tune the value of .import numpy as np from numpy import linalg as LA import matplotlib.pyplot as plt """ The Gradient Descent function""" def Gradient(X0,alpha,eps,niter): k=0 while(#TO DO): #TO DO return x,y X0=[-1.0,1.0] alpha = 0.01 eps = 0.001 niter = 150 x0,y0 = Gradient(X0,lambda,eps,niter) plt.plot(x0,y0,'r', label=r"Gradient Method with a fixed $\alpha$", linestyle='--') plt.legend(loc='upper left', bbox_to_anchor=(1.1, 0.95),fancybox=True, shadow=True) plt.show()
-
Draw the
values of this algorithm in the level sets graph. -
Confirm that the value found by the Algorithm is a local minimum value.
-
Gradient Descent with momentum:
Let be the initial value. Calculate the values by using the following momentum Gradient Descent algorithm:
0,
While
return""" The Momentum Gradient Descent function""" def MGradient(X0,alpha,beta,eps,niter): k=0 while(#TO DO): #TO DO return x,y X0=[-1.0,1.0] alpha = 0.01 mu = 0.01 eps = 0.001 niter = 150 x1,y1 = MGradient(X0,alpha,mu,eps,niter) plt.plot(x1,y1,'r', label=r"Momentum Gradient Method") plt.legend(loc='upper left', bbox_to_anchor=(1.1, 0.95),fancybox=True, shadow=True) plt.show()
The value has been to be tuned. -
Draw the
values of this algorithm in the level sets graph. -
Confirm that the value found by the Algorithm is a local minimum value.
""" The Gradient Descent function"""
def Gradient(X0,alpha,eps,niter):
k=0
LX=[X[0]]
LY=[X[1]]
while((LA.norm(GradJ(X))>eps and k< niter):
X=X-np.dot(alpha,GradJ(X))
k=k+1
LX.append(X[0])
LX.append(X[1])
print("\n iteration number of the fixed step gradient descent method: ",k)
print("Optimal value x= ",X[0]," y= ",X[1])
return LX,LY
""" The Momentum Gradient Descent function"""
def MGradient(X0,lambda,mu,eps,niter):
k=0
LX=[X[0]]
LY=[X[1]]
Delta=[0.0 for i in range(len(X))]
while((LA.norm(GradJ(X))>eps and k< niter):
Delta=-np.dot(alpha,GradJ(X))+np.dot(beta,Delta)
X=np.add(X,Delta)
k=k+1
LX.append(X[0])
LX.append(X[1])
print("\n iteration number of the gradient descent with momentum: ",k)
print("Optimal value x= ",X[0]," y= ",X[1])
return LX,LY
We obtain:
-
iterations of the Gradient Descent with give:
x= -2.4141942582083296 and y= 1.0 -
iterations of the Momentum Gradient with and give:
x= -2.4142172250611096 and y= 1.0
We want to find the minimum value of the following function:
- Calculate the minimum value of this function.
-
Draw the level sets of this function.
-
Apply the previous Gradient Descent methods
-
Apply the Newton-Raphson Algorithm to this function.
""" The Newton-Raphson function""" def Newton(X0,eps,niter): k=0 x=[X[0]] y=[X[1]] while(LA.norm(GradJ(X))>eps and k< niter): delta=#TO DO X=X-delta k=k+1 x.append(X[0]) y.append(X[1]) return x,y X0=[-1.0,1.0] eps = 0.001 niter = 150 x3,y3 = Newton(X0,eps,niter) plt.plot(x3,y3,'r', label=r"Newton-Raphson Method") plt.legend(loc='upper left', bbox_to_anchor=(1.1, 0.95),fancybox=True, shadow=True) plt.show()
import numpy as np
""" GradJ and Hess functions"""
def GradJ(X):
res = np.array([2*(X[0]-1)+40*X[0]*(X[0]**2-X[1]),-20*(X[0]**2-X[1])])
return res
def Hess(X):
res = np.array([[2 + 80*X[0]**2 + 40*(X[0]**2 - X[1]),-40*X[0]],[-40*X[0], 20]])
return res
""" The Newton-Raphson function"""
delta=LA.solve(Hess(X),GradJ(X))
We obtain:
-
iterations of the Gradient Descent with give:
x= 0.8467914701456682 y= 0.7102982431003074 -
iterations of the Momentum Gradient with and give:
x= 0.8610575985931694 y= 0.73534414819584 -
iterations of the Newton Raphson method give:
x= 0.9999999999999982 y= 0.9999999999999964
Estimating the step-size
A wrong step size
Any differentiable function has a maximum derivative value, i.e., the maximum of the derivatives at all points. If this maximum is not infinite, this value is known as the Lipschitz constant and the function is Lipschitz continuous.
Each gradient descent can be viewed as a minimization of the function:
Under this view, the result we get from minimizing the right hand side is an upper bound to the real value of the function at
Adaptive step size
There are methods, known as line search, that make an estimate of what the step size should be at a given iteration. After calculating the gradient, these methods choose a step size by minimizing a function of the step size itself.Each method defines its own function, based on some assumptions. Exact methods accurately minimize this function, while inexact methods make an approximation that just improves on the last iteration. The following are just a few examples.
Cauchy (in 1847)
One of the most obvious choices of
Barzilai and Borwein
An
approach proposed in 1988 is to find the step size that minimizes:
This is an approximation to the secant equation, used in quasi-Newton methods.
The solution to this problem is easily solved differentiating with respect to
Exercise [Adaptative Steps]
We want to minimize the following problem:-
Show that the steepest-descent (or Cauchy) step is given by
where . Hence define theOptimalGradient
function in this case. -
Define the Barzilai and Borwein Gradient Descent function
BorweinGradient
- Compare these two methods.
Descent Lemma
In Proposition A.24 from Bertsekas 1999 we have the following result:
Optimization with constraint equalities
Exercise [Equality-constrained problem]
We want to minimize the following problem:- By using the Lagrange framework, calculate the four possible extremal (or stationnary points) values of this problem. Show that the Lagrange function is convex.
-
Plot the contours of
and the contour line of the constraint . -
Apply the
Gradient
function on the Lagrange function to find the global minimum of the problem.
References
- D. P. Bertsekas (1999), Nonlinear Programming.
- J. Barzilai, J.M. Borwein (1988), Two-Point Step Size Gradient Methods, IMA Journal of Numerical Analysis (8), pp. 141-148.
- A. Cauchy (1847), Méthode générale pour la résolution des systèmes d'équations simultanées, Comptes Rendus Académie des Sciences Paris 25, pp. 536-538.