An optimizer is the function which can adjust a model's parameters(weight and bias) by gradient descent to minimize the mean(average) of the losses(differences) between the model's predictions and true values(train data) during training. *Gradient Descent(GD) is the algorithm which can find a minimum(or maximum) of a function by repeatedly stepping the parameters along the negative(or positive) gradient(slope).
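For example, one plain gradient descent step just subtracts the learning rate times the gradient from the parameter. A minimal sketch(the target 2.0, the starting weight 5.0 and lr = 0.1 are assumed values, not from any library):

```python
# Minimal gradient descent sketch with assumed values (target 2.0, w = 5.0, lr = 0.1).
w = 5.0   # parameter (weight)
lr = 0.1  # learning rate

for step in range(3):
    loss = (w - 2.0) ** 2    # squared loss between prediction w and true value 2.0
    grad = 2.0 * (w - 2.0)   # gradient d(loss)/dw
    w = w - lr * grad        # move against the gradient(slope)
    print(step, round(w, 4), round(loss, 4))
```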
(1) SGD(Stochastic Gradient Descent)(1950s):
- can do gradient descent with some randomly selected true values(train data) instead of all the true values(train data).
- 's learning rate is fixed during training.
- 's pros:
- Suited to training a model with large datasets because it can select only some of the data.
- Reduces the chance of getting stuck in local minima or saddle points by randomly selecting data.
- 's cons:
- Sensitive to the learning rate. *A large learning rate may overshoot the global minimum and a small learning rate may cause slow convergence or getting stuck in local minima or saddle points.
- A lot of noisy updates caused by randomly selecting data. *Oscillation(vibration) happens around the optimal solution, taking longer to reach convergence.
- Convergence is slow because of the noisy updates.
- The quality of the optimal solution can be lower because of the noisy updates.
- is not in PyTorch. *Actually, SGD() with momentum = 0 is classic GD(Gradient Descent) in PyTorch.
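For example, a minimal sketch of one training step with SGD() in PyTorch(the tiny model, the mini-batch shapes and lr=0.01 are assumed values for illustration):

```python
import torch
from torch import nn, optim

model = nn.Linear(in_features=3, out_features=1)  # assumed tiny model
x = torch.randn(8, 3)  # mini-batch of 8 randomly selected samples (assumed shapes)
y = torch.randn(8, 1)  # their true values(train data)

optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr=0.01 is an assumed value
loss_fn = nn.MSELoss()

optimizer.zero_grad()        # clear old gradients
loss = loss_fn(model(x), y)  # mean of the losses for this mini-batch
loss.backward()              # compute gradients
optimizer.step()             # one gradient descent step with the fixed learning rate
```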
(2) SGD(Stochastic Gradient Descent) with Momentum:
- can do gradient descent with some randomly selected true values(train data) instead of all the true values(train data), stabilizing and accelerating convergence with Momentum. *Momentum(1964) is also an optimizer which can stabilize and accelerate convergence.
- 's learning rate is fixed during training.
- 's pros:
- Suited to training a model with large datasets because it can select only some of the data.
- Reduces the chance of getting stuck in local minima or saddle points by randomly selecting data.
- Less noise from randomly selecting data than SGD.
- Convergence is faster than SGD because of the reduced noise.
- The quality of the optimal solution can be higher than SGD's because of the reduced noise.
- 's cons:
- Sensitive to the learning rate. *A large learning rate may overshoot the global minimum and a small learning rate may cause slow convergence or getting stuck in local minima or saddle points.
- High momentum can cause slow convergence by repeatedly overshooting the global minimum.
- is not in PyTorch. *Actually, SGD() with momentum > 0 is classic GD(Gradient Descent) with Momentum in PyTorch.
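For example, only the momentum argument changes compared with the SGD() sketch above; momentum=0.9 and lr=0.01 below are assumed values:

```python
from torch import nn, optim

model = nn.Linear(in_features=3, out_features=1)  # assumed tiny model

# PyTorch keeps a velocity buffer: v = momentum * v + grad, then p = p - lr * v.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed values
# The training step (zero_grad(), backward(), step()) is the same as in the SGD sketch above.
```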
(3) Adam(Adaptive Moment Estimation)(2014):
- can do gradient descent by automatically adapting the learning rate to each parameter using recent gradients during training. *The learning rate is not fixed during training.
- 's learning rate decreases as it approaches a global minimum to find the optimal solution precisely.
- is the combination of Momentum and RMSProp.
*Memos:
- Momentum(1964) is also an optimizer which can stabilize and accelerate convergence.
- RMSProp(2012):
- is also an optimizer which can do gradient descent by automatically adapting the learning rate to each parameter using recent gradients during training. *The learning rate is not fixed during training.
- 's learning rate decreases as it approaches a global minimum to find the optimal solution precisely.
- is the improved version of AdaGrad(2011) which can do gradient descent by adapting the learning rate to each parameter, considering all past gradients.
- 's pros:
- Automatic adaptation of the learning rate to each parameter.
- Works well with large datasets.
- 's cons:
- Memory usage is higher than SGD's.
- Generalization can be poorer than SGD's.
- May fail to converge due to an unstable or extreme learning rate.
- Can get stuck in local minima or saddle points.
- is Adam() in PyTorch.
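For example, a minimal sketch with Adam() in PyTorch(lr=0.001 and betas=(0.9, 0.999) are PyTorch's defaults; the model and data are assumed values):

```python
import torch
from torch import nn, optim

model = nn.Linear(in_features=3, out_features=1)  # assumed tiny model

# betas control the Momentum-like and RMSProp-like moving averages of recent gradients.
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

x = torch.randn(8, 3)  # assumed mini-batch
y = torch.randn(8, 1)  # assumed true values(train data)

optimizer.zero_grad()
loss = nn.MSELoss()(model(x), y)
loss.backward()
optimizer.step()  # per-parameter learning rates are adapted automatically
```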