Gradient descent is an optimization algorithm that approaches a local minimum of a function by taking steps proportional to the negative of the gradient (or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient itself, one approaches a local maximum of that function; this variant is known as gradient ascent, and it is the form used in the description below.

This algorithm is also known as steepest descent, or the method of steepest descent. It should not be confused with the method of steepest descent for approximating integrals, which shares the same name.

## Description of the method

Gradient descent is based on the observation that if the real-valued function [itex]F(\mathbf{x})[/itex] is defined and differentiable in a neighborhood of a point [itex]\mathbf{a}[/itex], then [itex]F(\mathbf{x})[/itex] increases fastest if one goes from [itex]\mathbf{a}[/itex] in the direction of the gradient of [itex]F[/itex] at [itex]\mathbf{a}[/itex], [itex]\nabla F(\mathbf{a})[/itex]. It follows that, if

[itex]\mathbf{b}=\mathbf{a}+\gamma\nabla F(\mathbf{a})[/itex]

for [itex]\gamma>0[/itex] a small enough number, then [itex]F(\mathbf{a})\leq F(\mathbf{b})[/itex]. With this observation in mind, one starts with a guess [itex]\mathbf{x}_0[/itex] for a local maximum of [itex]F[/itex], and considers the sequence [itex]\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots[/itex] such that

[itex]\mathbf{x}_{n+1}=\mathbf{x}_n+\gamma \nabla F(\mathbf{x}_n),\ n \ge 0.[/itex]

Provided [itex]\gamma[/itex] is small enough at every step, we have [itex]F(\mathbf{x}_0)\le F(\mathbf{x}_1)\le F(\mathbf{x}_2)\le \dots,[/itex] so one hopes that the sequence [itex](\mathbf{x}_n)[/itex] converges to the desired local maximum. Note that the value of the step size [itex]\gamma[/itex] is allowed to change at every iteration.
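The iteration above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `gradient_ascent`, the example hill `F` with its top at (1, -2), the fixed step size 0.1, and the number of steps are all chosen only for the example.

```python
def gradient_ascent(grad, x0, gamma=0.1, steps=50):
    """Iterate x_{n+1} = x_n + gamma * grad(x_n) and return every iterate."""
    xs = [list(x0)]
    for _ in range(steps):
        x = xs[-1]
        g = grad(x)
        # Move in the direction of the gradient (uphill).
        xs.append([xi + gamma * gi for xi, gi in zip(x, g)])
    return xs


# Hypothetical example function: a single hill with its top at (1, -2).
def F(p):
    x, y = p
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2


def grad_F(p):
    x, y = p
    return [-2.0 * (x - 1.0), -2.0 * (y + 2.0)]


iterates = gradient_ascent(grad_F, x0=[0.0, 0.0])
values = [F(p) for p in iterates]
print(iterates[-1])  # final iterate, close to the top of the hill
```

For this particular function and step size, the values in `values` never decrease from one iterate to the next, which matches the chain of inequalities stated above. A larger fixed step size can break that property.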

Let us illustrate this process in the picture below. Here [itex]F[/itex] is assumed to be defined on the plane, and its graph is assumed to look like a hill. The blue curves are the contour lines, that is, the curves along which the value of [itex]F[/itex] is constant. A red arrow originating at a point shows the direction of the gradient at that point. Note that the gradient at a point is perpendicular to the contour line going through that point. We see that following the gradient leads us to the top of the hill, that is, to the point where the value of the function [itex]F[/itex] is largest.

[Figure: An illustration of the gradient descent method.]

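To have the iteration go towards a local minimum instead, one needs to replace [itex]\gamma[/itex] with [itex]-\gamma[/itex], so that each step moves against the gradient: [itex]\mathbf{x}_{n+1}=\mathbf{x}_n-\gamma \nabla F(\mathbf{x}_n)[/itex].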
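As a small sketch of the minimizing form, the example below flips the sign of the step for a function of one variable. The test function `x**4 - 3*x**3 + 2`, the starting point 3.0, the step size 0.01, and the helper name `gradient_descent_1d` are assumptions made for illustration. This function has a local minimum at x = 9/4, where its derivative `4*x**3 - 9*x**2` vanishes.

```python
def gradient_descent_1d(dF, x0, gamma=0.01, steps=200):
    """Iterate x_{n+1} = x_n - gamma * dF(x_n) for a function of one variable."""
    x = x0
    for _ in range(steps):
        # The minus sign makes each step go downhill.
        x = x - gamma * dF(x)
    return x


# Derivative of the hypothetical example F(x) = x**4 - 3*x**3 + 2.
def dF(x):
    return 4.0 * x ** 3 - 9.0 * x ** 2


x_min = gradient_descent_1d(dF, x0=3.0)
print(x_min)  # approaches the local minimum at x = 2.25
```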

Note that gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones.

Two weaknesses of gradient descent are:

1. The algorithm can take many iterations to converge towards a local maximum or minimum if the curvature differs greatly between directions.
2. Finding the optimal [itex]\gamma[/itex] at each step can be time-consuming. Conversely, using a fixed [itex]\gamma[/itex] can yield poor results. The conjugate gradient method is often a better alternative.
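The first weakness can be seen in a small experiment. The sketch below uses the minimizing form of the iteration on the quadratic `0.5 * (x**2 + kappa * y**2)`, where the hypothetical parameter `kappa` is the ratio between the curvatures in the two directions. The fixed step size `1 / kappa`, the starting point, and the tolerance are assumptions chosen so that the iteration stays stable.

```python
import math


def count_iterations(kappa, tol=1e-6, max_iter=100_000):
    """Minimize 0.5 * (x**2 + kappa * y**2) with a fixed step size.

    Returns how many steps are taken before the gradient norm drops below tol.
    """
    gamma = 1.0 / kappa  # assumed step size; the iteration diverges above 2 / kappa
    x, y = 1.0, 1.0      # arbitrary starting point
    for n in range(max_iter):
        gx, gy = x, kappa * y  # gradient of the quadratic
        if math.hypot(gx, gy) < tol:
            return n
        x -= gamma * gx
        y -= gamma * gy
    return max_iter


iters_round = count_iterations(kappa=1.0)     # equal curvature in both directions
iters_narrow = count_iterations(kappa=100.0)  # curvature differs by a factor of 100
print(iters_round, iters_narrow)
```

With equal curvatures the iteration finishes almost immediately, while the case with very different curvatures needs far more steps, because the step size must stay small enough for the steep direction and is then too small for the flat one.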

A more powerful algorithm is the BFGS method. At every step it computes a matrix by which the gradient vector is multiplied, so that the step goes in a "better" direction, and it combines this with a more sophisticated line search algorithm to find the "best" value of [itex]\gamma[/itex].
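The sketch below is not the BFGS method. It only illustrates the idea of choosing the step size by a line search, using a simple backtracking rule with the Armijo sufficient-decrease condition on the minimizing form of the iteration. The helper name `backtracking_gamma`, its parameters, and the example quadratic are assumptions made for illustration.

```python
def backtracking_gamma(f, x, g, gamma0=1.0, shrink=0.5, c=1e-4, max_halvings=60):
    """Shrink gamma until f decreases by at least c * gamma * |g|^2 (Armijo rule)."""
    fx = f(x)
    g_sq = sum(gi * gi for gi in g)
    gamma = gamma0
    for _ in range(max_halvings):
        trial = [xi - gamma * gi for xi, gi in zip(x, g)]
        if f(trial) <= fx - c * gamma * g_sq:
            break  # sufficient decrease reached
        gamma *= shrink
    return gamma


# Hypothetical example: a quadratic whose curvatures differ by a factor of 100.
def f(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)


def grad_f(p):
    x, y = p
    return [x, 100.0 * y]


point = [1.0, 1.0]
ls_values = [f(point)]
for _ in range(200):
    g = grad_f(point)
    gamma = backtracking_gamma(f, point, g)  # step size chosen anew at every step
    point = [xi - gamma * gi for xi, gi in zip(point, g)]
    ls_values.append(f(point))
print(ls_values[0], ls_values[-1])
```

A full quasi-Newton method such as BFGS would additionally replace the raw gradient direction by the gradient multiplied by a matrix that is updated at every step, which this sketch does not attempt.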
