As a rule of thumb, the learning rate should scale linearly with the size of the training batches.

![[Batch size correlates linearly with learning rate.png|400]]

Under this rule, a per-sample learning rate can be derived by dividing the SOTA learning rate by the batch size of that experiment (5e-5 and 32 above, respectively, which gives roughly 1.56e-6 per sample). Multiplying that value by a new batch size yields the learning rate to use with it.

[Others claim](https://arxiv.org/pdf/1404.5997.pdf) the learning rate should scale with the *square root* of the batch size: $LR \propto \sqrt{BS}$
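
The two scaling rules can be compared with a small sketch. The function name `scale_learning_rate` and its parameters are made up for illustration, and the reference values (5e-5 at batch size 32) are the ones quoted in this note; treat the output as a starting point for tuning, not a guaranteed optimum.

```python
import math


def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a reference learning rate for a different batch size.

    rule="linear": LR grows proportionally to the batch size.
    rule="sqrt":   LR grows with the square root of the batch size.
    """
    ratio = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Reference point taken from the note: LR 5e-5 at batch size 32.
BASE_LR, BASE_BS = 5e-5, 32

for bs in (32, 64, 128, 256):
    linear = scale_learning_rate(BASE_LR, BASE_BS, bs, rule="linear")
    sqrt = scale_learning_rate(BASE_LR, BASE_BS, bs, rule="sqrt")
    print(f"batch size {bs:>3}: linear {linear:.2e}, sqrt {sqrt:.2e}")
```

The two rules agree at the reference batch size and diverge as the batch grows, so the choice matters most when scaling far from the original experiment.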