Faster training of neural networks is an important concern in deep learning. Training is often slow because of the networks' complex architectures and the large number of parameters they use. As the size of the data, the network, and the weights grows, the training time of the models grows too, which is a burden for modelers and practitioners. In this article, we will discuss some tips and tricks that can speed up neural network training. The key points to be discussed in the article are listed below.
- Multi GPU training
- Learning rate scaling
- Cyclical learning rate schedules
- Mixup training
- Label smoothing
- Transfer learning
- Mixed precision training
Let’s start by discussing how multi-GPU training can improve learning speed.
Multi GPU Training
This tip is purely about speeding up training and has nothing to do with the predictive performance of the models. It can get expensive, but it is very effective. A single GPU already speeds up neural network training, but using multiple GPUs brings further benefits. Anyone who cannot use a GPU in their own system can turn to Google Colab notebooks, which offer online GPU and TPU support.
With multi-GPU training, the data is spread across the GPUs: each GPU holds a copy of the network weights and trains on its own mini-batch of the data. For example, with a batch size of 8192 and 256 GPUs, each GPU gets a mini-batch of size 32, that is, 32 samples to train the network on. This means that the training of the network will be faster.
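The arithmetic in the example above can be sketched in a few lines of plain Python. The helper names `per_gpu_batch_size` and `shard_batch` are hypothetical and only illustrate how a global batch is split; a real framework would also take care of copying the weights to each GPU and combining the gradients.

```python
def per_gpu_batch_size(global_batch_size, num_gpus):
    """Number of samples each GPU receives when the batch is split evenly."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the number of GPUs")
    return global_batch_size // num_gpus


def shard_batch(samples, num_gpus):
    """Split one global batch into equal, contiguous per-GPU mini-batches."""
    size = per_gpu_batch_size(len(samples), num_gpus)
    return [samples[i * size:(i + 1) * size] for i in range(num_gpus)]


# The numbers from the text: a batch of 8192 samples on 256 GPUs.
print(per_gpu_batch_size(8192, 256))  # 32

# Integers stand in for real training samples here.
shards = shard_batch(list(range(8192)), 256)
print(len(shards), len(shards[0]))
```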
Learning rate scaling
This is another tip that can help us improve the speed of neural network training. In general, training a neural network with a large batch size tends to give lower validation accuracy. In the section above, we saw that multiple GPUs spread a large batch across devices to keep network training fast.
When the batch size grows in this way, we can also scale the learning rate. This compensates for the averaging effect of the larger mini-batch. For example, if we increase the batch size by 4x when training over four GPUs, we can also multiply the learning rate by 4 to increase the speed of the training.
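As a minimal sketch of this scaling, the learning rate can be multiplied by the same factor as the batch size. The function name `scale_learning_rate` and the base values (a learning rate of 0.1 at a batch size of 256) are assumptions chosen for illustration, not values from the article.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate by the same factor as the batch size."""
    return base_lr * new_batch_size / base_batch_size


# Assumed starting point: lr 0.1 at batch size 256 on one GPU.
# Moving to four GPUs makes the batch 4x larger, so the lr is multiplied by 4.
base_lr = 0.1
scaled_lr = scale_learning_rate(base_lr, base_batch_size=256, new_batch_size=4 * 256)
print(scaled_lr)
```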
A related method is learning rate warm-up, a simple strategy that makes it easier to train the model with high learning rates. At the very beginning we start the training with a lower learning rate and increase it to a preset value during a warm-up phase. This phase may last for the first few epochs. After that, the learning rate can decay, as usually happens in standard training.
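A warm-up phase can be sketched as a small schedule function. The linear ramp, the name `warmup_lr`, and the example values (five warm-up epochs, a target learning rate of 0.4) are assumptions for illustration; other ramp shapes are possible, and the later decay is left out to keep the sketch short.

```python
def warmup_lr(epoch, warmup_epochs, target_lr, start_lr=0.0):
    """Ramp linearly from start_lr to target_lr, then hold target_lr."""
    if epoch >= warmup_epochs:
        return target_lr
    return start_lr + (target_lr - start_lr) * epoch / warmup_epochs


# Assumed values: warm up over the first 5 epochs towards a learning rate of 0.4.
schedule = [warmup_lr(epoch, warmup_epochs=5, target_lr=0.4) for epoch in range(8)]
for epoch, lr in enumerate(schedule):
    print(epoch, round(lr, 3))
```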
These two tricks are useful in distributed training of the network with multiple GPUs. Learning rate warm-up can also stabilize models that are more difficult to train, regardless of the batch size and the number of GPUs we use.
Cyclical learning rate schedules
Learning rate schedules come in various types, one of which is the cyclical learning rate schedule, which helps to increase the speed of neural network training. It works by increasing and decreasing the learning rate in a cycle between predefined lower and upper bounds. In some schedules the upper bound decreases as the training process progresses.
One-cycle learning rate schedules are a variant of cyclical schedules that increase and decrease the learning rate only once during the entire training process. We can think of this as similar to the learning rate warm-up discussed in the section above. The same idea can also be applied to the optimizer's momentum parameter, but in reverse order.
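One common shape for such a cycle is a triangle: the learning rate climbs linearly to the upper bound and falls back to the lower bound. The sketch below assumes this triangular shape; the function name `triangular_lr`, the `decay_per_cycle` parameter, and the bounds 0.01 and 0.1 are illustrative choices rather than values from the article.

```python
import math


def triangular_lr(step, step_size, base_lr, max_lr, decay_per_cycle=1.0):
    """Triangular cyclical learning rate.

    One full cycle lasts 2 * step_size steps. With decay_per_cycle < 1 the
    distance between the bounds shrinks every cycle, which lowers the
    upper bound as training progresses.
    """
    cycle = math.floor(1 + step / (2 * step_size))   # 1, 2, 3, ...
    x = abs(step / step_size - 2 * cycle + 1)        # 1 at cycle edges, 0 at the peak
    amplitude = (max_lr - base_lr) * decay_per_cycle ** (cycle - 1)
    return base_lr + amplitude * max(0.0, 1.0 - x)


# Assumed bounds: the learning rate cycles between 0.01 and 0.1 every 20 steps.
for step in (0, 5, 10, 15, 20, 30):
    print(step, round(triangular_lr(step, 10, 0.01, 0.1, decay_per_cycle=0.5), 4))
```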
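A single cycle that moves the learning rate and the momentum in opposite directions might look like the following sketch. The linear ramps, the name `one_cycle`, the peak position at 30% of training, and all numeric bounds are assumptions for illustration.

```python
def one_cycle(step, total_steps, lr_min, lr_max, mom_min, mom_max, peak_frac=0.3):
    """Return (learning_rate, momentum) for a single up-and-down cycle.

    The learning rate rises to lr_max at peak_frac of training and then
    falls back, while the momentum moves in the opposite direction.
    Assumes 0 < peak_frac < 1.
    """
    peak_step = peak_frac * total_steps
    if step <= peak_step:
        t = step / peak_step                                          # 0 -> 1
    else:
        t = 1.0 - (step - peak_step) / (total_steps - peak_step)     # 1 -> 0
    lr = lr_min + t * (lr_max - lr_min)
    momentum = mom_max - t * (mom_max - mom_min)
    return lr, momentum


# Assumed bounds: lr between 0.01 and 0.1, momentum between 0.85 and 0.95.
for step in (0, 15, 30, 65, 100):
    lr, mom = one_cycle(step, 100, 0.01, 0.1, 0.85, 0.95)
    print(step, round(lr, 4), round(mom, 4))
```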
Mixup training
This is a very simple tip, also known as mixup, and it mainly works with computer vision networks. The idea comes from the paper that introduced mixup. Mixing helps avoid overfitting and reduces the model's sensitivity to adversarial data. We can also think of this process as a data augmentation technique that randomly blends the input samples. Digging further, we find that this tip takes a pair of data samples and generates a new sample by calculating the weighted average of both the inputs and the outputs.
One of our articles explains the blending used by mixup in more detail. For example, in an image classification task, mixup blends two input images and uses the same blending weights to calculate a weighted average of their output labels.
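The weighted average described above can be sketched in plain Python, with short lists standing in for flattened images and one-hot labels. The helper names are hypothetical. Drawing the blending weight from a Beta distribution follows the original mixup formulation, and the value `alpha=0.2` is an assumed setting.

```python
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw the blending weight from a Beta(alpha, alpha) distribution."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two samples: the same weight is used for inputs and labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Two tiny "images" (flattened pixel lists) with one-hot labels.
image_a, label_a = [1.0, 0.0], [1.0, 0.0]
image_b, label_b = [0.0, 1.0], [0.0, 1.0]

# A fixed weight makes the result easy to follow: 25% of sample a, 75% of sample b.
mixed_x, mixed_y = mixup_pair(image_a, label_a, image_b, label_b, lam=0.25)
print(mixed_x, mixed_y)

# During training the weight would be drawn at random for every pair.
lam = sample_lambda(alpha=0.2, rng=random.Random(0))
```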
Label smoothing
Label smoothing is a general technique to speed up the neural network training process. A normal classification dataset consists of labels that are one-hot encoded, with the true class having a value of one and all other classes having a value of zero. A softmax function, however, never returns exactly one-hot encoded vectors. This mismatch creates a gap between the distribution of the ground truth labels and the model predictions.
Applying label smoothing can reduce the gap between the distribution of ground truth labels and model predictions. In label smoothing, we subtract a small epsilon from the true label and distribute it among the other labels. This prevents the models from overfitting and acts as a regularizer. One thing to note here is that if the value of epsilon gets very large, the labels may be flattened too much.
Strong label smoothing retains less information from the labels. The effect of label smoothing can also be seen in the speed of neural network training, as the model learns faster from the soft targets, which are a weighted average of the hard targets and the uniform distribution over the labels. This technique can be used in various modeling tasks such as image classification, language translation, and speech recognition.
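The subtraction and redistribution of epsilon can be sketched as follows. The function name `smooth_labels` and the value `epsilon=0.1` are assumptions for illustration, and the sketch expects a one-hot vector with at least two classes.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Move epsilon from the true class and share it among the other classes."""
    num_classes = len(one_hot)
    share = epsilon / (num_classes - 1)
    return [1.0 - epsilon if value == 1 else share for value in one_hot]


hard_target = [0, 0, 1, 0]
soft_target = smooth_labels(hard_target, epsilon=0.1)
print(soft_target)       # the true class keeps 0.9, the rest share 0.1
print(sum(soft_target))  # still sums to (approximately) 1
```

A widely used variant instead spreads epsilon uniformly over all classes, including the true one; both variants keep the target a valid probability distribution.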
Transfer learning
Transfer learning can be explained as the process by which a model begins its training from weights transferred from another model, rather than training from scratch. This is a very good way to improve the training of the model, and it can also help increase the performance of the model, because the weights we start from have already been trained elsewhere. It can save a tremendous amount of training time. One of our articles explains how this type of learning can be done.
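The weight transfer can be illustrated with a framework-free sketch in which a model is just a dictionary from layer names to lists of weights. Everything here is hypothetical: the layer names, the helper `transfer_weights`, and the choice to train only the layers that could not be transferred, which is one common option rather than a requirement.

```python
def transfer_weights(pretrained, target):
    """Copy every pretrained layer whose name and size match the target model."""
    merged = dict(target)
    transferred = []
    for name, values in pretrained.items():
        if name in target and len(target[name]) == len(values):
            merged[name] = list(values)
            transferred.append(name)
    return merged, transferred


# Hypothetical models: the backbone matches, the output head has a new size.
pretrained = {"backbone.w": [0.3, -0.1, 0.8], "head.w": [0.5, 0.5]}
target = {"backbone.w": [0.0, 0.0, 0.0], "head.w": [0.0, 0.0, 0.0, 0.0]}

weights, transferred = transfer_weights(pretrained, target)

# Assumed strategy: keep transferred layers fixed and train only the rest.
trainable = [name for name in weights if name not in transferred]
print(transferred, trainable)
```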
Mixed precision training
Mixed precision training lets a model train with both 16-bit and 32-bit floating-point numbers, which allows the model to learn faster and reduces its training time. Let's take the example of a simple neural network that needs to recognize an object in images. Training such a model means finding the edge weights of the network that enable it to detect objects in the data. These edge weights are typically stored in a 32-bit format.
Standard training involves forward and backward propagation, which requires billions of multiplications on 32-bit floating-point numbers. We do not have to use 32 bits to represent every number during training. The gradients that the network calculates during backpropagation can have very small values, and representing such small values normally takes more bits. We can instead scale these gradients up so that fewer bits are enough to represent them.
Representing the numbers with 16 bits saves a lot of memory, and training can run faster than before. In other words, the model is trained with arithmetic operations that use fewer bits. The one thing to watch in such training is accuracy, because training purely in 16 bits can make the model significantly less accurate.
As discussed above, mixed-precision training uses both 16-bit and 32-bit numbers: it keeps a master copy of the weight parameters in the original 32-bit precision format. This master copy is the actual set of weights we use with the model after training.
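Why a higher-precision master copy matters can be shown with Python's `struct` module, whose `"e"` format rounds a value to 16-bit half precision. In this sketch ordinary Python floats stand in for the higher-precision master weights, and the weight, update size, and number of steps are made-up values.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest 16-bit half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


update = 1e-4          # a small weight update (assumed value)
fp16_only = 1.0        # weight stored and updated purely in 16 bits
master = 1.0           # higher-precision master copy of the same weight

for _ in range(10):
    fp16_only = to_fp16(fp16_only - update)  # rounds straight back to 1.0
    master = master - update                 # keeps every small update

print(fp16_only)        # the 16-bit weight never moved
print(master)           # the master copy accumulated all ten updates
print(to_fp16(master))  # 16-bit copy of the master used for the next forward pass
```

Near 1.0 the spacing between neighbouring 16-bit values is larger than the update, so the 16-bit-only weight swallows every step, while the master copy accumulates them until the change becomes visible in 16 bits as well.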
During training, we convert the 32-bit weights to 16-bit precision and run forward propagation with all arithmetic operations using less memory. The loss is then calculated and scaled up before it is fed into backward propagation, so that the gradients are scaled up as well.
In backpropagation we compute 16-bit gradients, and the final gradients, scaled back down, are applied to the actual set of weights we use in the model, the 32-bit master copy. The steps given above form one iteration, and this is repeated for many iterations. The image below explains the entire procedure that we have discussed.
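The effect of scaling can be simulated with Python's `struct` module, whose `"e"` format rounds a value to 16-bit half precision. This is only a sketch: the gradient value and the scale factor of 1024 are assumptions, and the scaling is applied directly to one gradient instead of to the loss of a real network.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest 16-bit half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 1024.0   # assumed scale factor
true_grad = 1e-8      # a very small gradient (assumed value)

# Without scaling, the gradient is too small for 16 bits and becomes zero.
naive_grad = to_fp16(true_grad)

# With scaling, the 16-bit gradient survives and is scaled back down
# in higher precision before it is applied to the weights.
scaled_grad_fp16 = to_fp16(true_grad * LOSS_SCALE)
recovered_grad = scaled_grad_fp16 / LOSS_SCALE

print(naive_grad)       # 0.0
print(recovered_grad)   # close to the original 1e-8
```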
In this article, we have discussed tips and tricks that can be used to speed up neural network training. Since training time and model performance are both important factors to consider when modeling, we should try to use these techniques in our workflows.