The learning rate that works best scales roughly linearly with the training batch size.
![[Batch size correlates linearly with learning rate.png|400]]
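The scaling constant can be derived by dividing the SOTA learning rate by the batch size of that experiment (5e-5 and 32 above, respectively), which gives a per-sample learning rate of about 1.56e-6. Multiplying that constant by a new batch size gives the learning rate to use for it.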
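
As a minimal sketch of the linear rule, the snippet below uses the reference values from this note (5e-5 at batch size 32). The function name `linear_scaled_lr` and its parameters are made up for illustration and do not come from any library.

```python
# Reference point from the note: learning rate 5e-5 at batch size 32.
REF_LR = 5e-5
REF_BATCH_SIZE = 32


def linear_scaled_lr(batch_size: int,
                     ref_lr: float = REF_LR,
                     ref_batch_size: int = REF_BATCH_SIZE) -> float:
    """Return a learning rate proportional to the batch size (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Per-sample learning rate: reference LR divided by reference batch size.
    per_sample_lr = ref_lr / ref_batch_size
    return per_sample_lr * batch_size


if __name__ == "__main__":
    for bs in (16, 32, 64, 128):
        print(f"batch size {bs:>3}: lr = {linear_scaled_lr(bs):.3e}")
```

Doubling the batch size doubles the learning rate under this rule, so it is worth treating the result as a starting point for tuning rather than a guaranteed optimum.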
[Others claim](https://arxiv.org/pdf/1404.5997.pdf) the learning rate should scale with the *square root* of the batch size:
$$
LR \propto \sqrt{BS}
$$
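
For comparison, here is a small sketch of the square-root rule, again anchored at the reference point from this note (5e-5 at batch size 32). The helper name `sqrt_scaled_lr` is hypothetical, and anchoring the proportionality at that reference point is an assumption made for the example.

```python
import math


def sqrt_scaled_lr(batch_size: int,
                   ref_lr: float = 5e-5,
                   ref_batch_size: int = 32) -> float:
    """Return a learning rate proportional to sqrt(batch size) (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # LR grows with the square root of the batch-size ratio.
    return ref_lr * math.sqrt(batch_size / ref_batch_size)


if __name__ == "__main__":
    for bs in (32, 128, 512):
        print(f"batch size {bs:>3}: lr = {sqrt_scaled_lr(bs):.3e}")
```

Under this rule, quadrupling the batch size only doubles the learning rate, so the two rules diverge quickly as the batch size moves away from the reference.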