**Table of Contents**

- Different types of Optimizer Functions
- Training with ADAM
- Training with SGD
- Training of RMSProp
- Video Explanation

#### 1. Different types of Optimizer Functions

First, we import torch, torchvision, torch.nn, torchvision.transforms, torchvision.datasets, torch.autograd.Variable, and time.

We will use the Fashion-MNIST dataset to show how different optimizer functions give different accuracy on the same dataset.

We keep the batch size at 100 and train for 20 epochs. We then implement the deep neural network model, a multilayer perceptron (MLP), followed by the training and testing methods. Finally, we run these functions to train the model. The dataset download and the training both take time, so wait patiently.
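
The setup described above can be sketched roughly as follows. This is a minimal sketch and not the exact code from the tutorial: the layer sizes and the names MLP, train_one_epoch, and evaluate are assumptions made for illustration. Only the batch size of 100 and the 20 epochs come from the text.

```python
import torch
import torch.nn as nn

BATCH_SIZE = 100  # batch size stated in the text
NUM_EPOCHS = 20   # number of epochs stated in the text


class MLP(nn.Module):
    """Small fully connected network for 28x28 images (layer sizes are assumed)."""

    def __init__(self, in_features=28 * 28, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # one logit per class
        )

    def forward(self, x):
        return self.net(x)


def train_one_epoch(model, loader, optimizer, loss_fn):
    """Run one pass over the loader and return the mean batch loss."""
    model.train()
    total, batches = 0.0, 0
    for images, labels in loader:
        optimizer.zero_grad()                  # clear gradients from the last step
        loss = loss_fn(model(images), labels)  # forward pass
        loss.backward()                        # back-propagation
        optimizer.step()                       # update weights and biases
        total += loss.item()
        batches += 1
    return total / batches


def evaluate(model, loader, loss_fn):
    """Return (mean loss per sample, accuracy) without updating the model."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in loader:
            logits = model(images)
            total_loss += loss_fn(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            count += labels.size(0)
    return total_loss / count, correct / count
```

Any iterable of (images, labels) batches works as the loader here. In practice that would typically be a torch.utils.data.DataLoader built over torchvision.datasets.FashionMNIST with batch_size=BATCH_SIZE, and train_one_epoch would be called NUM_EPOCHS times per optimizer.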

**2. Training with ADAM**

- The first optimizer is ADAM, which stands for adaptive moment estimation.
- The momentum method is easy to understand and to implement. Basically, the optimizer function decides how you update the weights and biases at each back-propagation stage. You might remember that back-propagation runs when you call loss.backward().
- In optimizer.step(), you update your weights and biases for the next iteration. Repeating this moves the model toward a minimum of the loss function, which means the system has been optimized and should give good accuracy.
- We have set the learning rate to 0.01 (note that the PyTorch default for Adam is 0.001) and kept the two betas at 0.9 and 0.999. The betas control the running averages of the gradient and of the squared gradient.
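
To make the role of the two betas concrete, here is a minimal pure-Python sketch of the Adam update for a single scalar weight. The function name adam_step, the toy objective f(w) = (w - 3)^2, and eps = 1e-8 are assumptions chosen for illustration; the learning rate and betas are the values mentioned above.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar weight and return (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad  # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)               # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def minimize_with_adam(steps=2000):
    """Minimize the toy objective f(w) = (w - 3)^2 starting from w = 0."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (w - 3.0)                   # derivative of (w - 3)^2
        w, m, v = adam_step(w, grad, m, v, t)
    return w


w_final = minimize_with_adam()
print(w_final)  # should end up close to the minimum at w = 3
```

The first beta smooths the gradient itself (the momentum part), while the second beta smooths its square, which is what scales the step size for each weight.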

**3. Training with SGD**

The next optimizer is SGD, which is Stochastic Gradient Descent.

SGD is a variant of normal gradient descent. The main difference is that SGD takes one input sample at a time and calculates the loss on it, instead of using the whole dataset for every update. From this description, you can see that processing samples one by one is slow. The name also tells you that the training process is random (stochastic), because the samples are visited in random order.

Then we load our data again and use our training and testing sets. For SGD we pass the model parameters and set the learning rate to 0.001. If you keep the learning rate small, your model learns slowly but usually more reliably. If the value is too small, it takes too much time to train your model. If your learning rate is high, 1 or 2 for example, each update is large, so training may look faster, but the model can overshoot the minimum and end up learning less from your data.
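
The effect of the learning rate can be seen on a toy problem. The sketch below fits the slope of y = 2x with plain SGD, updating after every single sample. The data, the function name sgd_fit_slope, and the three learning rates compared are assumptions picked for illustration, not values from the tutorial's experiment (apart from 0.001).

```python
import random


def sgd_fit_slope(data, lr, epochs=20, seed=0):
    """Fit y = w * x with SGD, updating w after each single sample."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)            # random visiting order: the "stochastic" part
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # derivative of (w * x - y)^2 with respect to w
            w -= lr * grad
    return w


# Three points on the line y = 2x, so the ideal slope is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_small = sgd_fit_slope(data, lr=0.001)  # learns slowly, still short of 2 after 20 epochs
w_mid = sgd_fit_slope(data, lr=0.05)     # converges to roughly 2
w_large = sgd_fit_slope(data, lr=1.0)    # overshoots on every step and diverges

print(w_small, w_mid, w_large)
```

On this toy problem the small rate is still well short of the target after 20 epochs, the moderate rate converges, and the large rate jumps past the minimum further with every update.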

**4. Training of RMSProp**

- RMSProp is another optimizer in PyTorch. It is similar to SGD.
- The difference between RMSProp and plain gradient descent is in how the gradients are scaled before the update.
- It takes model.parameters() and a learning rate.
- Among its parameters is alpha, the smoothing constant of the moving average, which helps RMSProp run smoothly. RMSProp uses the magnitude of recent gradients to normalize the gradients: it keeps a moving average of the squared gradients and divides the current gradient by the square root of that average (root mean square, hence RMS).
- Once our optimizer functions have been used to train our models, we collect the training losses, test losses, and test accuracies to check which one works better on our Fashion-MNIST dataset.
- When you are working on a deep learning problem, you have to try all of these parameters to see which works well for you.
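
The RMSProp normalization described above can be sketched in a few lines of pure Python for a single scalar weight. The function name rmsprop_step and the toy objective f(w) = (w - 3)^2 are assumptions for illustration; lr = 0.01, alpha = 0.99, and eps = 1e-8 are assumed here because they match PyTorch's defaults for RMSprop, not because the tutorial states them.

```python
import math


def rmsprop_step(w, grad, sq_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """Apply one RMSProp update to a scalar weight and return (w, sq_avg)."""
    # Moving average of the squared gradient; alpha sets how long it remembers.
    sq_avg = alpha * sq_avg + (1 - alpha) * grad * grad
    # Divide the current gradient by the root of that average.
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)
    return w, sq_avg


def minimize_with_rmsprop(steps=2000):
    """Minimize the toy objective f(w) = (w - 3)^2 starting from w = 0."""
    w, sq_avg = 0.0, 0.0
    for _ in range(steps):
        grad = 2 * (w - 3.0)  # derivative of (w - 3)^2
        w, sq_avg = rmsprop_step(w, grad, sq_avg)
    return w


w_final = minimize_with_rmsprop()
print(w_final)  # should end up close to the minimum at w = 3
```

Because the gradient is divided by its own recent magnitude, the step size depends much less on the raw scale of the gradient than it does in plain SGD.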

Now the training is done. In the training losses in figure 4, we can see that SGD didn't work as well as RMSProp and ADAM.

As you can see in figure 5, the curves for RMSProp and ADAM overlap.

In the test losses, ADAM and RMSProp behave much as they do in the training losses. The test accuracies are the most important part, and in figure 6 we can see that ADAM and RMSProp performed much better than SGD.

**5. Video Explanation**