20.5. Summary

In this chapter we introduced several techniques for numerical optimization that take advantage of the shape and smoothness of the loss function in the search for the minimizing parameter values. We first introduced gradient descent, which relies on the differentiability of the loss function. Gradient descent, also called batch gradient descent, iteratively updates the model parameters by stepping in the direction of the negative gradient until the loss is approximately minimized. Since each step of batch gradient descent uses every observation, it can be prohibitively expensive with large datasets, so we often instead use stochastic gradient descent to fit models.
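As a reminder of the basic update, here is a minimal sketch of batch gradient descent. It assumes a linear model fit with mean squared error; the function names, the simulated data, and the values of `alpha` and `n_steps` are illustrative choices, not taken from the chapter.

```python
import numpy as np

def mse_gradient(theta, X, y):
    """Gradient of the mean squared error for the linear model X @ theta."""
    return -2 / len(y) * X.T @ (y - X @ theta)

def gradient_descent(X, y, theta0, alpha=0.1, n_steps=500):
    """Batch gradient descent: every step uses all rows of X."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        # Step in the direction of the negative gradient.
        theta = theta - alpha * mse_gradient(theta, X, y)
    return theta

# Simulated data (an assumption for illustration): intercept 1, slope 3.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=200)

theta_hat = gradient_descent(X, y, theta0=[0.0, 0.0])
```

With a fixed learning rate, the choice of `alpha` matters: too large and the iterates can diverge, too small and convergence is slow.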

Mini-batch gradient descent works especially well on a graphics processing unit (GPU), a chip found in some computers. Since computations on this type of hardware can be executed in parallel, using a mini-batch can increase the accuracy of the gradient estimate without increasing computation time. Depending on the memory size of the GPU, the mini-batch size is often set between 10 and 100 observations.
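The mini-batch idea can be sketched as follows, again assuming a linear model with mean squared error. This version runs in NumPy on a CPU, so it shows only the logic of the algorithm and not the parallel speedup; the function name, data, and default settings are hypothetical.

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta0, alpha=0.05, batch_size=32,
                               n_epochs=50, seed=0):
    """Each step estimates the MSE gradient from a small random batch of rows."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the rows once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -2 / len(idx) * Xb.T @ (yb - Xb @ theta)
            theta = theta - alpha * grad
    return theta

# Simulated data (an assumption for illustration): intercept 1, slope 3.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=1000)

theta_hat = minibatch_gradient_descent(X, y, theta0=[0.0, 0.0])
```

Setting `batch_size=1` gives stochastic gradient descent, and setting it to the number of rows gives batch gradient descent.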

Alternatively, if the loss function is twice differentiable, then Newton’s method can converge very quickly, even though each step of the iteration is more expensive to compute. A hybrid approach is also popular: the algorithm begins with gradient descent (of some kind) and then switches to Newton’s method. This approach can avoid divergence and be faster than gradient descent alone. Typically, the second-order approximation used by Newton’s method is more accurate near the optimum, so the method converges quickly there.
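To make the hybrid idea concrete, here is a small sketch on a one-dimensional example chosen for illustration, f(θ) = log(cosh(θ)), which has its minimum at θ = 0. The switching rule (take gradient steps while the gradient is larger than `switch_tol`, then Newton steps) and all parameter values are assumptions made for this sketch, not a standard recipe.

```python
import numpy as np

def grad(theta):
    """First derivative of f(theta) = log(cosh(theta))."""
    return np.tanh(theta)

def hess(theta):
    """Second derivative of f(theta) = log(cosh(theta))."""
    return 1.0 / np.cosh(theta) ** 2

def hybrid_minimize(theta0, alpha=1.0, switch_tol=0.5, tol=1e-10, max_steps=100):
    """Gradient descent far from the optimum, Newton's method close to it."""
    theta = float(theta0)
    for _ in range(max_steps):
        g = grad(theta)
        if abs(g) < tol:
            break
        if abs(g) > switch_tol:
            theta = theta - alpha * g        # gradient descent step
        else:
            theta = theta - g / hess(theta)  # Newton step
    return theta

# A pure Newton step from theta = 2 overshoots and lands farther from 0.
newton_first_step = 2.0 - grad(2.0) / hess(2.0)

# The hybrid starts with gradient steps, then finishes with Newton steps.
theta_hat = hybrid_minimize(2.0)
```

Starting from θ = 2, the quadratic approximation is poor because the curvature there is nearly zero, which is why the first Newton step overshoots; once the gradient steps bring θ near 0, the Newton steps converge in a few iterations.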

Lastly, another option is to set the step size adaptively. Additionally, setting different learning rates for different features can be important if the features are on different scales or occur with different frequencies. For example, word counts can differ a lot across common words and rare words.
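One common way to get per-feature, adaptive learning rates is AdaGrad, which is named here only as an example since the chapter does not single out a particular method. The sketch below assumes a linear model with mean squared error and two simulated features whose scales differ by a factor of 100; the names, data, and settings are hypothetical.

```python
import numpy as np

def adagrad(X, y, theta0, alpha=0.5, n_steps=2000, eps=1e-8):
    """Gradient descent with a separate, shrinking step size per parameter."""
    theta = np.asarray(theta0, dtype=float)
    grad_sq_sum = np.zeros_like(theta)  # running sum of squared gradients
    for _ in range(n_steps):
        grad = -2 / len(y) * X.T @ (y - X @ theta)
        grad_sq_sum += grad ** 2
        # Parameters that have seen large gradients take smaller steps.
        theta = theta - alpha / (np.sqrt(grad_sq_sum) + eps) * grad
    return theta

# Simulated features on very different scales (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), 100 * rng.normal(size=200)])
y = X @ np.array([2.0, 0.05]) + rng.normal(scale=0.1, size=200)

theta_hat = adagrad(X, y, theta0=[0.0, 0.0])
```

With a single shared learning rate, the step size would have to be small enough to keep the large-scale feature stable, which makes progress on the other parameter slow; scaling each step by that parameter's gradient history sidesteps this trade-off.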