UCL Discovery

On the different regimes of stochastic gradient descent

Sclocchi, Antonio; Wyart, Matthieu; (2024) On the different regimes of stochastic gradient descent. Proceedings of the National Academy of Sciences (PNAS), 121 (9), Article e2316301121. DOI: 10.1073/pnas.2316301121.

Text: 2309.10688v4.pdf - Accepted Version. Download (1MB).

Abstract

Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" T ≡ η/B. Yet this description is observed to break down for sufficiently large batches B ≥ B*, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD, and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.
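
To make the objects in the abstract concrete, here is a minimal sketch of minibatch SGD (not the authors' code; it uses a squared-loss linear model rather than the paper's hinge-loss perceptron analysis, and the function and variable names are illustrative), showing how batch size B and learning rate η combine into the temperature T = η/B:

```python
# A minimal sketch of minibatch SGD on a linear model with teacher-generated
# labels. Assumptions: squared loss and these toy sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, B, eta, steps=1000):
    """Plain minibatch SGD on the squared loss of a linear predictor."""
    P, d = X.shape            # P = training-set size, d = input dimension
    w = np.zeros(d)
    T = eta / B               # the "temperature" governing SGD noise amplitude
    for _ in range(steps):
        idx = rng.choice(P, size=B, replace=False)   # sample a minibatch of size B
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / B              # minibatch gradient of the loss
        w -= eta * grad                              # step of size eta
    return w, T

# Toy teacher-student data: binary labels from a random "teacher" vector.
P, d = 512, 32
teacher = rng.standard_normal(d)
X = rng.standard_normal((P, d))
y = np.sign(X @ teacher)
w, T = sgd(X, y, B=16, eta=0.05)
print(f"temperature T = eta/B = {T:.4f}")
```

Sweeping B and η in such a loop, at fixed T or fixed η, is how one would empirically trace the cross-overs between the three phases of the B-η diagram described above.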

Type: Article
Title: On the different regimes of stochastic gradient descent
Location: United States
Open access status: An open access version is available from UCL Discovery
DOI: 10.1073/pnas.2316301121
Publisher version: https://doi.org/10.1073/pnas.2316301121
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: critical batch size, implicit bias, phase diagram, stochastic gradient descent
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences > Gatsby Computational Neurosci Unit
URI: https://discovery.ucl.ac.uk/id/eprint/10206008