Table of Contents
- Gradient Descent
- Stochastic Gradient Descent
- Batch Gradient Descent
- Video Explanation
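Here we discuss Gradient Descent, Stochastic Gradient Descent, and Batch Gradient Descent, and check which one runs faster, which one runs slower, and which one gives better accuracy for a fixed number of epochs. These descents were used from the very beginning of neural networks, but in later work we want training on our datasets to run faster and take much less time to deliver. One thing to notice is that in PyTorch there is only one optimizer, optim.SGD, for all of these categories; they differ only in the batch size. In this walkthrough, Batch Gradient Descent means a batch size of 100, Stochastic Gradient Descent means the maximum batch size (the whole dataset), and Gradient Descent means a batch size of 1. Many references assign these names differently: a batch size of 1 is usually called stochastic gradient descent, the whole dataset per update is called batch gradient descent, and sizes in between are called mini-batch gradient descent. So, let's begin.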
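To make the point about the single optimizer concrete, here is a minimal sketch in which the same torch.optim.SGD call is used for every category and only the DataLoader batch size changes. The synthetic 600-image dataset, the one-layer model, and the helper name updates_per_epoch are assumptions made for illustration; they are not the MNIST data or the simple_MLP used in this walkthrough.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for MNIST (assumption): 600 random 28x28 "images" with
# random labels in 0..9, so the sketch runs without downloading anything.
images = torch.randn(600, 1, 28, 28)
labels = torch.randint(0, 10, (600,))
dataset = TensorDataset(images, labels)


def updates_per_epoch(batch_size):
    """Train for one epoch with optim.SGD and count the parameter updates."""
    # Placeholder model (assumption), not the simple_MLP from the text.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    updates = 0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)  # mean loss over the current batch
        loss.backward()
        optimizer.step()             # exactly one update per batch
        updates += 1
    return updates


# Batch size 1, batch size 100, and the whole dataset in one batch.
counts = {bs: updates_per_epoch(bs) for bs in (1, 100, len(dataset))}
print(counts)  # {1: 600, 100: 6, 600: 1}
```

The optimizer code is identical in all three runs; the batch size alone decides how many times the weights are updated in an epoch.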
Gradient Descent
First we import torch, torchvision, torchvision.transforms, torchvision.datasets, Variable from torch.autograd, and time. We are using MNIST and keeping the data in my_datafolder. The first category we discuss is Batch Gradient Descent. Here you set the batch size to 100; in general it must lie between the minimum and the maximum batch size. The minimum batch size is 1, and the maximum is the size of the dataset, which is 60,000 for the training set and 10,000 for the test set, as you can see in figure 1.
Since this is Batch Gradient Descent, the batch size has to stay within the range mentioned above. Then we create the train loader and the test loader, and we set the number of epochs to 10. With a batch size of 100, each batch from the train loader holds 100 images, so the train loader has a length of 600. Within an epoch, the loss is computed as the average over the 100 images of a batch and the weights are updated once per batch; this repeats until all 600 batches have been processed. The length of the test loader is 100. We use simple_MLP on the data and train it with Batch Gradient Descent. By our timing, each epoch takes around 5 seconds, and the accuracy starts at 36.59%. The accuracy increases with each epoch, which means our model is learning. After 10 epochs the final accuracy is 83.83%, which is a good value.
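The loader lengths quoted above follow from dividing the dataset size by the batch size. The small sketch below reproduces that arithmetic for the minimum batch size, a batch size of 100, and the maximum batch size; the helper name loader_length is ours, and rounding up assumes the loader keeps a smaller final batch.

```python
import math

TRAIN_SIZE = 60_000  # MNIST training images, as stated above
TEST_SIZE = 10_000   # MNIST test images, as stated above
EPOCHS = 10


def loader_length(dataset_size, batch_size):
    """Number of batches per epoch, assuming a smaller last batch is kept."""
    return math.ceil(dataset_size / batch_size)


for batch_size in (1, 100, TRAIN_SIZE):
    train_batches = loader_length(TRAIN_SIZE, batch_size)
    # The test batch size cannot exceed the size of the test set.
    test_batches = loader_length(TEST_SIZE, min(batch_size, TEST_SIZE))
    total_updates = train_batches * EPOCHS  # one weight update per train batch
    print(batch_size, train_batches, test_batches, total_updates)
```

Over 10 epochs this gives 600,000 weight updates for a batch size of 1, 6,000 for a batch size of 100, and only 10 when the whole training set is one batch.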
Stochastic Gradient Descent
- To implement Stochastic Gradient Descent here, you set the batch size to the maximum. For training the maximum batch size is 60,000, and for testing it is 10,000, so the train loader length is 1 and the test loader length is 1. When you train the model, the single batch holds all 60,000 images, and the loss in each epoch is the average over all 60,000 images. Each training epoch therefore loops over the train loader exactly once. So we expect it to give worse accuracy than Batch Gradient Descent, but to take less time.
- When we check the time, SGD takes the same 5 seconds per epoch as Batch Gradient Descent, but accuracy-wise it performs much worse. The reason is that Batch Gradient Descent averages over only 100 images at a time, so it updates the weights 600 times per epoch, whereas Stochastic Gradient Descent averages over all 60,000 images and updates the weights only once per epoch.
- Now let's discuss Gradient Descent, which is implemented here by using the minimum batch size.
- So the batch size is 1, the number of epochs is 10, the train loader length is 60,000, and the test loader length is 10,000. Then let's run our Gradient Descent.
- Now each batch from the train loader holds a single image, so the model updates its weights on one image at a time. The accuracy will be very good because there is no averaging over many images, but it takes a lot of time, almost 3 minutes per epoch, because the model has to go through 60,000 training batches and 10,000 test batches in each epoch. The performance is great, though: it starts with 91.69% accuracy.
- When we first started, the concepts of Stochastic Gradient Descent and Batch Gradient Descent were not there; we trained our models with simple Gradient Descent, which took a lot of time but gave very high accuracies. Because we wanted our models to run faster, Stochastic Gradient Descent and Batch Gradient Descent were introduced. In this setup SGD takes the whole training set as a single input batch, so each epoch is one iteration: it calculates the backward-propagation derivatives once and updates the weights once. That is why it gives the worst accuracy, although the time is good. As you can see, it gives only 15.02% accuracy. It is very slow at making progress toward our goal, which is to reach the minimum of the loss function.
- SGD needs many epochs to reach the minimum of the loss function, but when the training dataset grows very big, the performance of SGD will be remarkably better than that of GD and BGD.
- We usually prefer BGD over GD if the dataset is not that big, because in this case 83.83% accuracy is good enough given the amount of time it took. If the dataset is very big, we usually prefer SGD: although it takes more epochs to reach the minimum, it takes much less time than BGD and GD.
- After almost 25 minutes of training with Gradient Descent, the test accuracy is 96.82%, which is much higher than the accuracy of both SGD and BGD after 10 epochs.
- After plotting the data, we can see in figure 1 that the test accuracies are highest for Gradient Descent, in the middle for Batch Gradient Descent, and lowest for Stochastic Gradient Descent.
- If we plot the run time, as in figure 2, Stochastic Gradient Descent and Batch Gradient Descent take almost the same time per epoch; their curves actually overlap. If you look closely, you will notice that Stochastic Gradient Descent is very slightly higher than Batch Gradient Descent.
- The run time for Gradient Descent is the highest, in the range of about 150 to 157 seconds per epoch.
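Per-epoch run times like the ones above can be collected with Python's time module. This is only a sketch of one way to do it: time_epochs and dummy_epoch are hypothetical names, and the dummy workload stands in for a real training pass over the train loader.

```python
import time


def time_epochs(run_one_epoch, epochs):
    """Call run_one_epoch(epoch) `epochs` times; return each duration in seconds."""
    durations = []
    for epoch in range(epochs):
        start = time.perf_counter()
        run_one_epoch(epoch)
        durations.append(time.perf_counter() - start)
    return durations


def dummy_epoch(epoch):
    # Placeholder workload (assumption): replace with a real training epoch.
    sum(i * i for i in range(10_000))


durations = time_epochs(dummy_epoch, epochs=3)
print([f"{d:.4f}s" for d in durations])
```

Plotting the returned list for each batch size gives a run-time comparison of the kind shown in figure 2.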
The training loss decreases for all three descents, as you can see in figure 3. There is a sharp decrease for Batch Gradient Descent, while for Stochastic Gradient Descent the loss decreases only slightly. As we discussed, its minimization of the loss per epoch is very slow, but if the dataset is very big, it will run faster than the other two.
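To see why the loss falls at such different rates for a fixed number of epochs, here is a toy sketch that fits a single weight by gradient descent with three batch sizes. It is an illustration under simple assumptions (a noise-free line y = 2x, 60 samples, no shuffling, a learning rate of 0.1), not the MNIST experiment, so the numbers are not comparable with the figures.

```python
def train(batch_size, epochs=10, lr=0.1, n=60):
    """Fit y = w * x by gradient descent; return (updates, squared error of w)."""
    xs = [(i + 1) / n for i in range(n)]
    ys = [2.0 * x for x in xs]  # the true weight is 2.0
    w, updates = 0.0, 0
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the mean squared error, averaged over the batch.
            grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad  # one weight update per batch
            updates += 1
    return updates, (w - 2.0) ** 2


# Batch size 1, a middle batch size, and the whole dataset in one batch.
results = {bs: train(bs) for bs in (1, 10, 60)}
for bs, (updates, err) in results.items():
    print(f"batch_size={bs:>2}  updates={updates:>3}  error={err:.6f}")
```

With the same 10 epochs, the smallest batch size makes 600 updates and ends closest to the true weight, while the single full batch makes only 10 updates and ends farthest away. That matches the ordering of the curves described above.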