A large majority of artificial neural networks are based on the gradient descent algortihm. It is necessary to understand the fundamentals of this algorithm before studying neural networks. Gradient descent is an optimization algorithm for finding the minimum of a function.

Let's consider the differentiable function \(f(x)\) to minimize. The gradient descent algorithm starts at an arbitrary position and iteratively converge to the minimum, as illustrated below:

Let's name \( x_0 \) the starting point of the algorithm. To computes the next point \( x_1 \), the gradient descent algorithm calculates the derivative \( f'(x_o) \), as illustrated on the following figure:

As the derivative is the slope of the tangent line to the function at that point, it is generaly a good indicator of how far the point is from the minimum. The next point is calculated according to the following formula:

$$ x_{n+1} = x_n - \alpha.\frac{df}{dx}(x_n) $$ where:- \( x_{n+1} \) is the next point
- \( x_{n} \) is the current point
- \( \alpha \) is the step size multiplier

The step size multiplier \( \alpha \) is a parameter to tune. It represents how big will be the step to the next point. The dilemna hidden behind this parameter is that large values will allow fast convergences, but small values will ensure more stability. On the illustration below, a too large step size prevents the algorithm from converging to the expected mimima:

Note that in artificial neural networks, this parameter is no longer called *step size ( \( \alpha \) ) *, but
*learning rate ( \( \eta \) )*.

For the sake of clarity, the algorithm has been presented on a function in one dimension ( \( f: \mathbb{R} \mapsto
\mathbb{R} \) ). The previous principles can be extended to multidimentional functions ( \( f: \mathbb{R}^n \mapsto \mathbb{R} \) ).
The derivative is then replaced by the gradient of the function. This is why the algorithm was named *gradient descent*.
The
multidimensional gradient descent generalization is given by the following equation:

- \( X_{n+1} \) is the next point in \( \mathbb{R}^n \)
- \( X_{n} \) is the current point in \( \mathbb{R}^n \)
- \( \alpha \) is the step size multiplier
- \( \nabla f(X_n) \) is the local gradient of function \( f \) expressed at point \( X_n \)

When this algorithm is used for optimizing artificial neural netwoks parameters, this limitation can prevent the network to learn properly. Fortunatly, in practice we consider models with large number of parameters. Generaly, the local minima in a given dimension is not shared with the other dimensions. This explains why artificial neural networks can be applied to complex systems, but this also explains why it's particularly difficult to prove the convergence of theses algorithms.