An optimizer is the function which can adjust a model's parameters(weight and bias) by gradient descent to minimize the mean(average) of the losses(differences) between the model's predictions and true values(train data) during training. *Gradient Descent(GD) is the algorithm which can find a minimum(or maximum) of a function by repeatedly stepping the parameters along the negative(or positive) gradient(slope).
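For example, one plain gradient descent step just subtracts the learning rate times the gradient from the parameter. A minimal sketch(the target 2.0, the starting weight 5.0 and lr = 0.1 are assumed values, not from any library):

```python
# Minimal gradient descent sketch with assumed values (target 2.0, w = 5.0, lr = 0.1).
w = 5.0   # parameter (weight)
lr = 0.1  # learning rate

for step in range(3):
    loss = (w - 2.0) ** 2    # squared loss between prediction w and true value 2.0
    grad = 2.0 * (w - 2.0)   # gradient d(loss)/dw
    w = w - lr * grad        # move against the gradient(slope)
    print(step, round(w, 4), round(loss, 4))
```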
(1) SGD(Stochastic Gradient Descent)(1950s):
- can do gradient descent with some randomly selected true values(train data) instead of all the true values(train data).
- 's learning rate is fixed during training.
- 's pros:
- Suited to training a model with large datasets because it can select only some of the data.
- Reduces the chance of getting stuck in local minima or saddle points by randomly selecting data.
- 's cons:
- Sensitive to the learning rate. *A large learning rate may overshoot the global minimum and a small learning rate may cause slow convergence or getting stuck in local minima or saddle points.
- A lot of noisy updates caused by randomly selecting data. *Oscillation(vibration) happens around the optimal solution, taking longer to reach convergence.
- Convergence is slow because of the noisy updates.
- The quality of the optimal solution can be lower because of the noisy updates.
- is not in PyTorch. *Actually, SGD() with momentum = 0 is classic GD(Gradient Descent) in PyTorch.
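For example, a minimal sketch of one training step with SGD() in PyTorch(the tiny model, the mini-batch shapes and lr=0.01 are assumed values for illustration):

```python
import torch
from torch import nn, optim

model = nn.Linear(in_features=3, out_features=1)  # assumed tiny model
x = torch.randn(8, 3)  # mini-batch of 8 randomly selected samples (assumed shapes)
y = torch.randn(8, 1)  # their true values(train data)

optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr=0.01 is an assumed value
loss_fn = nn.MSELoss()

optimizer.zero_grad()        # clear old gradients
loss = loss_fn(model(x), y)  # mean of the losses for this mini-batch
loss.backward()              # compute gradients
optimizer.step()             # one gradient descent step with the fixed learning rate
```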
(2) SGD(Stochastic Gradient Descent) with Momentum:
- can do gradient descent with some randomly selected true values(train data) instead of all the true values(train data), stabilizing and accelerating convergence with Momentum. *Momentum(1964) is also an optimizer which can stabilize and accelerate convergence.
- 's learning rate is fixed during training.
- 's pros:
- Suited to training a model with large datasets because it can select only some of the data.
- Reduces the chance of getting stuck in local minima or saddle points by randomly selecting data.
- Less noise from randomly selecting data than SGD.
- Convergence is faster than SGD because of the reduced noise.
- The quality of the optimal solution can be higher than SGD's because of the reduced noise.
- 's cons:
- Sensitive to the learning rate. *A large learning rate may overshoot the global minimum and a small learning rate may cause slow convergence or getting stuck in local minima or saddle points.
- High momentum can cause slow convergence by repeatedly overshooting the global minimum.
- is not in PyTorch. *Actually, SGD() with momentum > 0 is classic GD(Gradient Descent) with Momentum in PyTorch.
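For example, only the momentum argument changes compared with the SGD() sketch above; momentum=0.9 and lr=0.01 below are assumed values:

```python
from torch import nn, optim

model = nn.Linear(in_features=3, out_features=1)  # assumed tiny model

# PyTorch keeps a velocity buffer: v = momentum * v + grad, then p = p - lr * v.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed values
# The training step (zero_grad(), backward(), step()) is the same as in the SGD sketch above.
```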
(3) Adam(Adaptive Moment Estimation)(2014):
- can do gradient descent by automatically adapting the learning rate to each parameter using recent gradients during training. *The learning rate is not fixed during training.
- 's learning rate decreases as it approaches a global minimum to find the optimal solution precisely.
- is the combination of Momentum and RMSProp.
*Memos:
- Momentum(1964) is also an optimizer which can stabilize and accelerate convergence.
- RMSProp(2012):
- is also an optimizer which can do gradient descent by automatically adapting the learning rate to each parameter using recent gradients during training. *The learning rate is not fixed during training.
- 's learning rate decreases as it approaches a global minimum to find the optimal solution precisely.
- is the improved version of AdaGrad(2011) which can do gradient descent by adapting the learning rate to each parameter, considering all past gradients.
- 's pros:
- Automatic adaptation of the learning rate to each parameter.
- Works well with large datasets.
- 's cons:
- Memory usage is higher than SGD's.
- Generalization can be poorer than SGD's.
- May fail to converge due to an unstable or extreme learning rate.
- Can get stuck in local minima or saddle points.
- is Adam() in PyTorch.
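For example, a minimal sketch with Adam() in PyTorch(lr=0.001 and betas=(0.9, 0.999) are PyTorch's defaults; the model and data are assumed values):

```python
import torch
from torch import nn, optim

model = nn.Linear(in_features=3, out_features=1)  # assumed tiny model

# betas control the Momentum-like and RMSProp-like moving averages of recent gradients.
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

x = torch.randn(8, 3)  # assumed mini-batch
y = torch.randn(8, 1)  # assumed true values(train data)

optimizer.zero_grad()
loss = nn.MSELoss()(model(x), y)
loss.backward()
optimizer.step()  # per-parameter learning rates are adapted automatically
```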