1. Tune $\beta_2$, typically by trying values such as 0.99 or 0.9 instead of the usual default of 0.999; prefer this over changing the learning rate once you are at the 1B+ parameter scale.
2. Also consider tuning $\epsilon$: increasing it makes Adam behave more like plain momentum, and it is worth tuning at the very end to get a bit more performance.
3. Tune the learning rate by sweeping its exponent (for example, $3 \times 10^{-2}, 3 \times 10^{-4}, 3 \times 10^{-6}, \dots$). Always use a learning rate schedule (cosine decay or simple discrete drops), and don't forget that Transformers need a warmup phase. A sketch of the full recipe follows this list.
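
As a rough illustration, here is a minimal PyTorch sketch of the recipe above, assuming a placeholder model and hypothetical step counts; the specific `lr`, `betas`, and `eps` values are example assumptions, not prescriptions from the text:

```python
import math

import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(512, 512)   # stand-in for a real (Transformer) model
total_steps = 100_000         # hypothetical training length
warmup_steps = 2_000          # hypothetical warmup length

# Steps 1 and 2: lower beta_2 from its 0.999 default and (optionally) raise eps;
# a larger eps pushes Adam's update toward plain momentum.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,            # step 3: one point in a sweep over orders of magnitude
    betas=(0.9, 0.99),  # beta_2 = 0.99 instead of the 0.999 default
    eps=1e-6,           # example value above the 1e-8 default
)

# Step 3: linear warmup followed by cosine decay of the learning rate.
def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after each optimizer.step(), call scheduler.step()
# so the warmup + cosine schedule advances with the global step count.
```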