How can gradient clipping help avoid the exploding gradient problem?


Deep neural networks are prone to vanishing and exploding gradients. This is especially true for recurrent neural networks (RNNs), which are used frequently. Since RNNs are typically used in situations that require short-term memory, the weights can easily explode during training, leading to unexpected results such as NaN values or a model that fails to converge to the desired point. Various methods, such as regularizers, are used to reduce this effect. Of these methods, this article focuses on gradient clipping and tries to understand it both theoretically and practically. The following are the main points to be discussed in this article.

Table of Contents

  1. The exploding gradient problem
  2. What is gradient clipping?
  3. How do I use gradient clipping?
  4. Implement gradient clipping

Let’s start the discussion by understanding the problem and its causes.

The exploding gradient problem

The exploding gradient problem arises when gradient-based learning methods and backpropagation are used to train artificial neural networks. An artificial neural network, also known as a neural network or neural net, is a learning algorithm that uses a network of functions to understand the input data and translate it into a specific output. This type of learning algorithm aims to mimic how neurons work in the human brain.

When large error gradients accumulate, the gradients explode, resulting in very large updates to the network weights during training. Gradients are used to update the network weights, and this process usually works best when the updates are small and controlled. If the magnitudes of the gradients compound, the network is likely to become unstable, leading to poor predictions or even a model that learns nothing useful.
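To make the accumulation concrete, here is a minimal sketch in plain Python. It assumes a deliberately simplified one-unit linear recurrence, h_t = w * h_(t-1), which is not the model trained later in this article, and the function name `gradient_through_time` is hypothetical. In that setting the gradient of the last state with respect to the first is w multiplied by itself once per time step.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for the recurrence h_t = w * h_(t-1)."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


for w in (0.5, 1.0, 1.5):
    print(f"w={w}: gradient after 50 steps = {gradient_through_time(w, 50):.3e}")
```

A recurrent weight only slightly above 1 makes the gradient grow exponentially with the sequence length, while a weight below 1 makes it shrink toward zero, which is the vanishing counterpart of the same effect.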

When gradients explode during training, the network becomes unstable and learning cannot be completed. The values of the weights can also grow to the point where they overflow, resulting in NaN values.

NaN, short for “not a number”, denotes a value that is undefined or cannot be represented. To correct training, it is helpful to know how to spot exploding gradients. They occur often in recurrent networks, because gradient-based learning has to backpropagate through long sequences. There are techniques for repairing exploding gradients, such as gradient clipping and weight regularization, among others. In this post we take a look at the gradient clipping method.
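One way to spot the problem is to monitor the gradient norm and to check for non-finite values. The following is a rough sketch in plain Python rather than part of any framework: the helper names `gradient_norm` and `check_gradients` and the limit of 1000.0 are assumptions chosen for illustration.

```python
import math


def gradient_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def check_gradients(grads, limit=1000.0):
    """Classify a gradient vector as 'ok', 'exploding', or 'non-finite'."""
    if any(not math.isfinite(g) for g in grads):
        return "non-finite"  # NaN or infinity is already present
    if gradient_norm(grads) > limit:
        return "exploding"   # still finite, but suspiciously large
    return "ok"


# Overflow turns into infinity, and arithmetic on infinities yields NaN.
overflowed = 1e308 * 10
print(overflowed, overflowed - overflowed)

print(check_gradients([0.1, -0.2]))
print(check_gradients([500.0, 900.0]))
print(check_gradients([float("nan"), 1.0]))
```

In practice the same idea is usually applied by logging the gradient norm at every training step and watching for sudden spikes.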

What is gradient clipping?

Gradient clipping is a technique used to prevent exploding gradients, most notably in recurrent neural networks. It can be carried out in several ways, but one of the most common is rescaling gradients so that their norm is at most a certain value. Gradient clipping introduces a predetermined threshold and then scales down any gradient whose norm exceeds it so that the norm matches the threshold.

This ensures that no gradient has a norm greater than the threshold. Although clipping biases the resulting gradient values, it can keep training stable.
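The rescaling rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not the code any framework uses internally, and the function name `clip_by_norm` is chosen here for convenience.

```python
import math


def clip_by_norm(grads, threshold):
    """Rescale grads so that their L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)        # small enough: leave untouched
    scale = threshold / norm      # shrink factor, strictly below 1
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)  # original norm is 5.0
print(clipped)                                      # approximately [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], threshold=1.0))      # norm 0.5, returned unchanged
```

Because every component is multiplied by the same factor, the clipped gradient still points in the original direction; only its length changes.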

It can be difficult to train recurrent neural networks. Vanishing gradients and exploding gradients are two common problems in recurrent neural network training. If the gradient becomes too large, error gradients will accumulate, resulting in an unstable network.

Vanishing gradients, by contrast, occur when the gradient becomes too small and the optimization gets stuck at a certain point. Gradient clipping targets the exploding case: it keeps overly large gradients from corrupting the parameters during training.

In general, exploding gradients can be avoided by carefully configuring the network model, such as using a small learning rate, scaling the target variables, and using a standard loss function. However, in recurrent networks with a large number of input time steps, exploding gradients can still be a problem.

How do I use gradient clipping?

A common solution to exploding gradients is to change the error derivative before it is propagated back through the network and used to update the weights. Rescaling the error derivative also rescales the updates to the weights, which drastically reduces the likelihood of overflow or underflow.

Gradient scaling is the process of normalizing the error gradient vector so that the vector norm (magnitude) equals a predefined value, for example 1.0. Gradient clipping is the process of forcing gradient values (element by element) to a certain minimum or maximum value when they fall outside an expected range. These techniques are often collectively referred to as “gradient clipping”.
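The practical difference between the two variants shows up when one component dominates the gradient. The sketch below uses plain Python with hypothetical helper names (`scale_to_norm`, `clip_elementwise`) and an arbitrary example vector. It rescales only when the norm exceeds the limit, which is how norm-based clipping is usually applied.

```python
import math


def scale_to_norm(grads, max_norm):
    """Norm-based variant: shrink the whole vector if it is too long."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def clip_elementwise(grads, limit):
    """Value-based variant: clamp each entry to [-limit, limit] on its own."""
    return [max(-limit, min(limit, g)) for g in grads]


grads = [10.0, 0.1]  # one dominant component
scaled = scale_to_norm(grads, 0.5)
clipped = clip_elementwise(grads, 0.5)

print("ratio of components before:", grads[0] / grads[1])
print("ratio after norm scaling:  ", scaled[0] / scaled[1])
print("ratio after value clipping:", clipped[0] / clipped[1])
```

Norm scaling keeps the ratio between components, so the update direction is preserved. Element-wise clipping only touches the entries that exceed the limit, so the direction of the update can change.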

It is common practice to use the same gradient clipping configuration for all network layers. Nevertheless, there are some cases where a wider range of error gradients is allowed in the output layer than in the hidden layers.

Implement gradient clipping

We now understand why exploding gradients occur and how gradient clipping can help resolve them. We also saw two different methods of applying clipping to your deep neural network. Let’s see how both gradient clipping algorithms are implemented in major machine learning frameworks like TensorFlow and PyTorch.

We use the Fashion MNIST dataset, an open-source dataset of clothing images designed for image classification.

Gradient clipping is easy to implement in TensorFlow models. All you have to do is pass the appropriate parameter to the optimizer. To clip the gradients, all optimizers accept the parameters ‘clipnorm’ and ‘clipvalue’.
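As a short illustration, passing these parameters looks roughly like this. The choice of SGD and Adam, the learning rates, and the thresholds are assumptions made for the example, not settings taken from this article's experiment. Each optimizer gets only one of the two clipping options.

```python
import tensorflow as tf

# Rescale each gradient tensor whose L2 norm exceeds 1.0.
opt_norm = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Clamp every individual gradient element to the range [-0.5, 0.5].
opt_value = tf.keras.optimizers.Adam(learning_rate=0.001, clipvalue=0.5)
```

Either optimizer can then be handed to `model.compile(optimizer=...)` as usual; the clipping happens inside the optimizer before the weights are updated.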

Before we go any further, let’s briefly discuss how the clipnorm and clipvalue parameters work.

Clipnorm

Scaling the gradient norm involves rescaling the derivatives of the loss function to a specified vector norm when the L2 norm of the gradient vector (the square root of the sum of squared values) exceeds a threshold. For example, we can specify a norm of 1.0, which means that if the vector norm of a gradient exceeds 1.0, the vector values are rescaled so that the vector norm equals 1.0.

Clipvalue

Clipping the gradient value involves clipping the derivatives of the loss function to a specific value when a gradient value is less than a negative threshold or greater than a positive threshold. For example, we can define a threshold of 0.5, which means that if a gradient value is less than -0.5 it is set to -0.5, and if it is greater than 0.5 it is set to 0.5.
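To see both behaviors on an actual gradient, the following sketch computes one with `tf.GradientTape` and clips it by hand using `tf.clip_by_norm` and `tf.clip_by_value`. The toy variable, input, loss, and learning rate are made up for illustration, and the manual update stands in for what an optimizer would do; it is not how the model in this article is trained.

```python
import tensorflow as tf

w = tf.Variable([1.0, -2.0])
x = tf.constant([30.0, 0.1])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(w * x))
grad = tape.gradient(loss, w)  # equals 2 * w * x**2, dominated by the first entry

# clipnorm-style: rescale so the L2 norm is at most 1.0
by_norm = tf.clip_by_norm(grad, 1.0)

# clipvalue-style: force every entry into the range [-0.5, 0.5]
by_value = tf.clip_by_value(grad, -0.5, 0.5)

print("raw gradient:    ", grad.numpy())
print("clipped by norm: ", by_norm.numpy())
print("clipped by value:", by_value.numpy())

# Plain gradient descent step using the value-clipped gradient.
learning_rate = 0.1
w.assign_sub(learning_rate * by_value)
```

With the raw gradient, this single step would have moved the first weight by a very large amount; with either clipped version the update stays small.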

Now that we understand what role these parameters play, let’s start the implementation by importing the required package and sub-module.

import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist

Next, load the Fashion MNIST dataset and pre-process it for the TF model to handle.

# load the data
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
# scale pixel values to the range [0, 1]
x_train, x_test = x_train / 255., x_test / 255.
# wrap the training data in a tf.data pipeline
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(32).prefetch(1)

Now we’re going to define and compile the model without gradient clipping. Here I intentionally limit the number of layers and the number of neurons per layer to replicate the problem.

# build a model
model = tf.keras.models.Sequential([
  tf.keras.layers.LSTM(10,input_shape=(28, 28)),
#compile a model
    # inside the optimizer we are doing clipping

Next we fit the model and observe the loss and accuracy movement.

model.fit(train_data, steps_per_epoch=500, epochs=10)

Here is the result

As we can see, over the epochs we trained, the model struggled to reduce the loss and improve the accuracy. Now let’s see if gradient clipping makes any difference here.

As already mentioned, to implement gradient clipping we have to specify the desired method in the optimizer. Here I am going with the clipvalue method.

# inside the optimizer we are doing clipping
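The compile call itself is not shown in full above, so the following is only a sketch of what such a setup might look like. The `Dense` output layer, the SGD optimizer, its learning rate, the loss, and the helper name `build_model` are all assumptions; only the `LSTM(10)` layer on 28x28 inputs and the use of `clipvalue` come from the text.

```python
import tensorflow as tf


def build_model(clipvalue=None):
    """Small LSTM classifier; pass clipvalue to enable element-wise clipping."""
    model = tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=(28, 28)),            # each image row is one time step
        tf.keras.layers.LSTM(10),
        tf.keras.layers.Dense(10, activation="softmax"),  # assumed 10-class output layer
    ])
    # inside the optimizer we are doing clipping (assumed optimizer and learning rate)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=clipvalue)
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


baseline_model = build_model()              # no clipping
clipped_model = build_model(clipvalue=0.5)  # gradients clamped to [-0.5, 0.5]
```

Under these assumptions, both models could be trained with the same `fit` arguments on the `train_data` pipeline defined earlier, which makes the two loss curves directly comparable.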

Next we train the model with gradient clipping and can observe losses and accuracies as follows:

It is now clear that clipping the gradient values can improve the training performance of the model.

Final words

Clipping the gradients speeds up training because the model can converge faster. As a result, the training achieves a minimum error rate more quickly. Since the error diverges when the gradients explode, no global or local minima can be found. When the exploding gradients are clipped, the errors begin to converge to a minimal point.

This post discussed what exploding gradients are and why they occur. To counter this effect, we discussed a technique called gradient clipping and saw how it can address the problem both theoretically and practically.


