Optimising neural network training efficiency through spectral parameter-based multiple adaptive learning rates


Bibliographic Details
Main Author: Koay, Yeong Lin
Format: Final Year Project / Dissertation / Thesis
Published: 2023
Subjects:
Online Access: http://eprints.utar.edu.my/6338/
http://eprints.utar.edu.my/6338/1/4._Revised_Dissertation_Koay_Yeong_Lin.pdf
Description
Summary: The process of training deep neural networks relies heavily on solving optimization problems, and finding suitable values for the various hyperparameters makes training challenging. The hyperparameter called the learning rate, or step size, is one of the most crucial factors in gradient-based optimization. A small learning rate may result in slow convergence and cause the loss function to get stuck in a local minimum, whereas a large learning rate may hinder convergence or cause divergence. Currently, most common optimization algorithms use a fixed learning rate or a simplified adaptive updating scheme in every iteration. In this project, we propose a stochastic gradient descent method with multiple adaptive learning rates (MAdaGrad) and Adam with multiple adaptive learning rates (MAdaGrad Adam). In the derivation of the updating formula, we minimize the log-determinant norm while requiring the update to satisfy the secant equation. We apply a Lagrange multiplier to this constrained minimization problem, and the multiplier is approximated using the Newton-Raphson method. The proposed algorithms update the learning rate in every iteration based on the approximated spectrum of the Hessian of the loss function. The methods were compared with existing optimization methods in deep learning, namely the stochastic gradient descent method (SGD) and Adam, on several datasets. The numerical results show that the proposed methods perform better than SGD and Adam. Hence, the proposed MAdaGrad and MAdaGrad Adam can serve as alternative optimizers in machine learning.
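
For readers who want a concrete feel for spectrum-based adaptive step sizes, the Python sketch below is a loose illustration only: it is not the dissertation's MAdaGrad or MAdaGrad Adam update (which minimizes a log-determinant norm subject to the secant equation and approximates the Lagrange multiplier with Newton-Raphson), but a simpler diagonal secant estimate of the Hessian spectrum applied to a toy logistic-regression problem. All function names, constants, and clipping bounds are assumptions made for the example.

# Illustrative sketch only: per-coordinate "spectral" learning rates for
# gradient descent. The dissertation's MAdaGrad/MAdaGrad Adam derive their
# update differently; here we mimic the general idea with a diagonal secant
# (Rayleigh-quotient) estimate of the Hessian on a toy problem.

import numpy as np

def loss_and_grad(w, X, y):
    """Logistic-regression loss and gradient for the toy problem."""
    z = np.clip(X @ w, -30.0, 30.0)            # avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def spectral_sgd(X, y, n_iter=200, lr0=0.1, eps=1e-8):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    loss, g = loss_and_grad(w, X, y)
    lr = np.full_like(w, lr0)                  # one learning rate per coordinate
    for _ in range(n_iter):
        w_new = w - lr * g
        loss, g_new = loss_and_grad(w_new, X, y)
        s = w_new - w                          # parameter displacement
        t = g_new - g                          # gradient displacement
        # Diagonal secant estimate of the Hessian spectrum: lam_i * s_i ~= t_i.
        lam = (s * t) / (s * s + eps)
        # Step size ~ inverse curvature; keep it positive and bounded.
        lr = np.where(lam > eps, np.clip(1.0 / lam, 1e-4, 10.0), lr)
        w, g = w_new, g_new
    return w, loss

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))
    w_true = rng.normal(size=10)
    y = (X @ w_true + 0.1 * rng.normal(size=500) > 0).astype(float)
    w_fit, final_loss = spectral_sgd(X, y)
    print("final training loss:", final_loss)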