The optimizers (SGD, SGD with Momentum, Adam) in PyTorch

An optimizer is a function that adjusts a model's parameters (weights and biases) by gradient descent to minimize the mean (average) of the losses (differences) between the model's predictions and the true values (train data) during training. *Gradient Descent (GD) is the method that repeatedly moves the parameters in the direction of the negative gradient (slope) of the loss function to find its minimum (or, for gradient ascent, its maximum).
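
As a reference point, here is a minimal sketch of how an optimizer plugs into a PyTorch training loop; the model, data and learning rate are arbitrary toy values chosen only for illustration:

```python
import torch
from torch import nn

# Toy data and model, only for illustration.
x = torch.randn(100, 3)    # 100 samples, 3 features
y = torch.randn(100, 1)    # 100 true values (train data)
model = nn.Linear(3, 1)    # parameters: weight and bias
loss_fn = nn.MSELoss()     # mean of the squared differences

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = loss_fn(model(x), y)     # mean loss between predictions and true values
    loss.backward()                 # compute gradients of the loss w.r.t. parameters
    optimizer.step()                # adjust parameters by gradient descent
```

All of the optimizers below are used the same way; they only differ in what step() does to the parameters.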

(1) SGD (Stochastic Gradient Descent) (1950s):

  • can do gradient descent with a randomly selected subset of the true values (train data) instead of all of them.
  • 's learning rate is fixed during training.
  • 's pros:
    • Suited to training a model on large datasets because it only uses a subset of the data at each step.
    • Reduces the chance of getting stuck in local minima or saddle points because the data is selected randomly.
  • 's cons:
    • Sensitive to the learning rate. *A large learning rate may overshoot the global minimum, and a small learning rate may cause slow convergence or getting stuck in local minima or saddle points.
    • The randomly selected data produces a lot of noisy (less informative) updates. *Oscillation (vibration) happens around the optimal solution, so reaching convergence takes longer.
    • Convergence is slow because of the noisy updates.
    • The quality of the optimal solution can be lower because of the noisy updates.
  • is not a separate class in PyTorch. *Actually, SGD() with momentum=0 is classic GD (Gradient Descent) in PyTorch; it becomes stochastic only when you feed it randomly sampled mini-batches, as sketched below.
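
A minimal sketch of this in PyTorch (the mini-batch size, data and learning rate are arbitrary example values): SGD() itself just applies the plain update p = p - lr * grad, and the random mini-batch is what supplies the stochasticity:

```python
import torch
from torch import nn

model = nn.Linear(3, 1)
lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # momentum defaults to 0

# One randomly sampled toy mini-batch; the random sampling is the "stochastic" part.
batch_x, batch_y = torch.randn(32, 3), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.MSELoss()(model(batch_x), batch_y)
loss.backward()

# With momentum=0 the update is simply: p = p - lr * p.grad
expected_weight = (model.weight - lr * model.weight.grad).detach()
optimizer.step()
print(torch.allclose(model.weight, expected_weight))  # True
```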

(2) SGD (Stochastic Gradient Descent) with Momentum:

  • can do gradient descent with a randomly selected subset of the true values (train data) instead of all of them, while stabilizing and accelerating convergence with Momentum. *Momentum (1964) is also an optimizer on its own which can stabilize and accelerate convergence.
  • 's learning rate is fixed during training.
  • 's pros:
    • Suited to training a model on large datasets because it only uses a subset of the data at each step.
    • Reduces the chance of getting stuck in local minima or saddle points because the data is selected randomly.
    • Less noise from the randomly selected data than SGD because the momentum term averages recent gradients.
    • Convergence is faster than SGD because of the less noisy updates.
    • The quality of the optimal solution can be higher than with SGD because of the less noisy updates.
  • 's cons:
    • Sensitive to the learning rate. *A large learning rate may overshoot the global minimum, and a small learning rate may cause slow convergence or getting stuck in local minima or saddle points.
    • A high momentum value causes slow convergence by repeatedly overshooting the global minimum.
  • is not a separate class in PyTorch. *Actually, SGD() with momentum > 0 is classic GD (Gradient Descent) with Momentum in PyTorch; again, the mini-batches supply the stochasticity, as sketched below.
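
A minimal sketch of turning momentum on in PyTorch (lr=0.01 and momentum=0.9 are arbitrary but common example values). Internally, SGD() keeps a velocity buffer per parameter and updates roughly as v = momentum * v + grad, then p = p - lr * v:

```python
import torch
from torch import nn

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()

# momentum=0.9 is a common choice; momentum=0 would fall back to plain SGD.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for step in range(5):
    batch_x, batch_y = torch.randn(32, 3), torch.randn(32, 1)  # toy mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()  # velocity buffer smooths the noisy mini-batch gradients
```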

(3) Adam (Adaptive Moment Estimation) (2014):

  • can do gradient descent by automatically adapting the learning rate to each parameter using recent gradients during training. *The learning rate is not fixed during training.
  • 's learning rate decreases as it approaches a global minimum so the optimal solution can be found precisely.
  • is the combination of Momentum and RMSProp. *Memos:
    • Momentum (1964) is also an optimizer on its own which can stabilize and accelerate convergence.
    • RMSProp (2012):
      • is also an optimizer which can do gradient descent by automatically adapting the learning rate to each parameter using recent gradients during training. *The learning rate is not fixed during training.
      • 's learning rate decreases as it approaches a global minimum so the optimal solution can be found precisely.
      • is the improved version of AdaGrad (2011), which can do gradient descent by adapting the learning rate to each parameter, considering all past gradients.
  • 's pros:
    • Automatic adaptation of the learning rate to each parameter.
    • Works well with large datasets.
  • 's cons:
    • Memory usage is higher than SGD because it stores two moving averages for each parameter.
    • Generalization can be poorer than SGD.
    • May fail to converge due to unstable or extreme learning rates.
    • Can get stuck in local minima or saddle points.
  • is Adam() in PyTorch, as sketched below.
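
A minimal sketch of using it in PyTorch (the model and data are toy values; lr=0.001 and betas=(0.9, 0.999) are just the defaults spelled out). Adam keeps two moving averages per parameter: one of recent gradients (the Momentum part) and one of recent squared gradients (the RMSProp part), and scales each parameter's step by the square root of the latter:

```python
import torch
from torch import nn

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()

# betas control the two moving averages: recent gradients and recent squared gradients.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

for step in range(5):
    batch_x, batch_y = torch.randn(32, 3), torch.randn(32, 1)  # toy mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()  # per-parameter step size adapts based on the moving averages
```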
