Optimizer Comparison: Adam vs. AdamW vs. LAMB
1 Adam vs. AdamW vs. LAMB: Detailed Comparison Table
| Aspect | Adam | AdamW | LAMB |
|---|---|---|---|
| Base Optimizer | Adam | Adam + Decoupled Weight Decay | AdamW + Layer-wise Scaling |
| Published In | Kingma & Ba, 2014 | Loshchilov & Hutter, 2017 | You et al., 2019 |
| Purpose | Adaptive optimizer for general deep learning | Improved regularization for deep models | Scalable training for large-batch, large-scale models |
| Update Rule | L2 decay folded into the gradient before adaptation: \(g = \nabla L(\theta) + \lambda \theta\), then \(\theta = \theta - \eta \cdot \frac{m}{\sqrt{v} + \epsilon}\) (with \(m, v\) built from \(g\)) | \(\theta = \theta - \eta \cdot \frac{m}{\sqrt{v} + \epsilon} - \eta \cdot \lambda \cdot \theta\) | \(\theta = \theta - \eta \cdot \text{Trust Ratio} \cdot \left( \frac{m}{\sqrt{v} + \epsilon} + \lambda \cdot \theta \right)\) |
| Weight Decay Handling | ❌ Incorrect: applied as L2 inside the gradient, so it is rescaled by the adaptive terms | ✅ Correct: decoupled from the gradient-based update | ✅✅ Decoupled and scaled per layer for stability |
| Layer-wise Adaptive LR | ❌ No | ❌ No | ✅ Yes, based on weight and update norms |
| Trust Ratio | ❌ No | ❌ No | ✅ Yes: \(\frac{\|\theta\|}{\|\Delta\|}\) scales the update |
| Overfitting Prevention | ⚠️ Weak with weight decay | ✅ Stronger with proper decay | ✅ Very robust with scaled decay |
| Gradient Clipping | Optional but often needed | Optional | Less needed; the trust ratio stabilizes updates |
| Good for Deep Models? | ⚠️ Sometimes unstable | ✅ Better | ✅✅ Designed for it |
| Good for Wide Models? | ⚠️ Risky | ✅ Works well | ✅✅ Great for MoE/large embeddings |
| Batch Size Scaling | ❌ Poor beyond 8K-16K | ⚠️ Up to ~32K with tuning | ✅✅ Efficient up to 256K batches |
| Training Stability (LLMs) | ❌ Risk of exploding/vanishing gradients | ✅ Acceptable | ✅✅ High stability with large layers |
| Convergence Speed | ✅ Fast at small scale | ✅ Fast and generalizes well | ✅✅ Fast and scalable for large workloads |
| Parameter Count Suitability | Small to mid-size (up to ~300M) | Mid to large (up to ~1B) | Very large (1B to 175B+) |
| Use Case Examples | Small models, legacy setups | BERT, GPT-2, fine-tuning | Large-batch pretraining (e.g., BERT-Large in 76 minutes) |
| Used In | Early DL models, simple setups | Hugging Face default for Transformers | Large-batch pretraining stacks (DeepSpeed, Megatron-LM, NVIDIA/Google-scale setups) |
| Torch Implementation | `torch.optim.Adam` | `torch.optim.AdamW` | `deepspeed.ops.lamb.FusedLamb`, NVIDIA Apex `FusedLAMB` |
| Libraries Supporting It | All major DL libs | Hugging Face, PyTorch | DeepSpeed, Megatron-LM, FairScale |
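For orientation, here is how the three optimizers from the table above are typically instantiated in PyTorch. The hyperparameter values are illustrative only, and the LAMB imports apply only when NVIDIA Apex or DeepSpeed is installed:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for a real network

# Adam: weight decay is folded into the gradient (L2-style), so it gets
# rescaled by the adaptive denominator.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# AdamW: weight decay is decoupled and applied directly to the weights.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# LAMB is not part of core PyTorch; fused implementations ship with
# NVIDIA Apex and DeepSpeed (commented out since they need those packages):
#   from apex.optimizers import FusedLAMB       # NVIDIA Apex
#   from deepspeed.ops.lamb import FusedLamb    # DeepSpeed
#   lamb = FusedLAMB(model.parameters(), lr=2e-3, weight_decay=0.01)
```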
2 Key Takeaways
2.1 Use Adam if:
- You're training small models with limited data and compute.
- Simplicity is more important than scalability.
2.2 Use AdamW if:
- You're training medium-sized models (e.g., BERT-base, GPT-2).
- You want better generalization via proper weight decay.
- You're fine-tuning pre-trained models (like LoRA/QLoRA), as sketched below.
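A minimal sketch of a common AdamW fine-tuning setup. The model and hyperparameters are placeholders; the only point is the usual convention of excluding biases and LayerNorm parameters from weight decay:

```python
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=256, nhead=4)  # placeholder model

# Common convention: apply weight decay only to weight matrices,
# not to biases or LayerNorm parameters (which are 1-D tensors).
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=2e-5,
)
```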
2.3 Use LAMB if:
- You're training large-scale LLMs from scratch (e.g., 7B, 13B, 175B).
- You want to scale batch size to 64K+ efficiently.
- You're using TPUs or multi-GPU clusters and care about throughput and convergence.
3 What is Layer-wise Scaling?
In deep neural networks, especially transformers, we have multiple layers with:
- Different weight magnitudes
- Different gradient magnitudes
- Varying learning dynamics
Layer-wise scaling refers to the practice of scaling updates differently for each layer based on its weight norm and update norm, instead of applying the same learning rate uniformly across all layers.
3.1 Why is this necessary?
Imagine two layers:
| Layer | Weight Norm (\(\|\theta\|\)) | Gradient Norm (\(\|\nabla_\theta L\|\)) | Relative Update (uniform LR) |
|---|---|---|---|
| Layer A (early) | 0.01 | 0.001 | Relatively large (may overshoot) |
| Layer B (final) | 10.0 | 0.01 | Relatively tiny (may under-update) |
In standard optimizers (Adam/SGD), both layers receive updates of roughly the same absolute magnitude regardless of their weight scale, so their relative update sizes differ wildly (a quick numeric check follows this list), potentially causing:
- Training instability
- Under-trained or over-trained layers
- Poor convergence with large batch sizes
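A toy numeric check of this imbalance, using the hypothetical norms from the table above and assuming a uniform update of norm 0.001:

```python
# The same absolute step size means very different *relative* changes per layer.
layers = {"A (early)": 0.01, "B (final)": 10.0}  # weight norms ||theta||
update_norm = 0.001                              # uniform step magnitude

for name, weight_norm in layers.items():
    relative = update_norm / weight_norm
    print(f"Layer {name}: update is {relative:.2%} of the weight norm")
# Layer A (early): update is 10.00% of the weight norm -> may overshoot
# Layer B (final): update is 0.01% of the weight norm  -> may under-update
```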
3.2 Enter LAMB: Layer-wise Adaptive Moments
LAMB = Adam + Layer-wise Trust Ratio Scaling
LAMB introduces a trust ratio that adapts updates per layer:
\[ \text{Trust Ratio} = \frac{||\theta||}{||\Delta||} \]
Where:
- \(\theta\) = parameters of a layer
- \(\Delta\) = Adam-like update (i.e., \(\frac{m}{\sqrt{v} + \epsilon}\))
Instead of applying \(\Delta\) directly, LAMB applies:
\[ \theta = \theta - \eta \cdot \text{Trust Ratio} \cdot \Delta \]
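A minimal single-step sketch of this rule in PyTorch. It omits the bias correction and trust-ratio clamping that full LAMB implementations use, and the hyperparameters are illustrative; the trust ratio here is computed on the update including the decoupled weight decay term, matching the LAMB row in the comparison table:

```python
import torch

def lamb_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, wd=0.01):
    """One simplified LAMB-style step (no bias correction or clamping)."""
    with torch.no_grad():
        for i, (p, g) in enumerate(zip(params, grads)):
            m, v = state.get(i, (torch.zeros_like(p), torch.zeros_like(p)))
            m = beta1 * m + (1 - beta1) * g        # first moment estimate
            v = beta2 * v + (1 - beta2) * g * g    # second moment estimate
            state[i] = (m, v)

            # Adam-like direction plus decoupled weight decay (lambda * theta)
            delta = m / (v.sqrt() + eps) + wd * p

            # Trust ratio ||theta|| / ||delta||, one value per parameter tensor
            w_norm, d_norm = p.norm(), delta.norm()
            trust = w_norm / d_norm if w_norm > 0 and d_norm > 0 else 1.0

            p -= lr * trust * delta                # layer-wise scaled update

# Hypothetical usage on two toy "layers"
params = [torch.randn(4, 4), torch.randn(4)]
grads = [torch.randn(4, 4), torch.randn(4)]
lamb_step(params, grads, state={})
```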
3.3 Effect: Proportional Updates
- If a layer's weights are large but its gradient update is small, LAMB increases the update magnitude (to match scale).
- If the weights are small but gradients are large (possibly noisy), LAMB reduces the update (to prevent overshooting).
This keeps each layer's update proportional to its own parameter magnitude.
3.4 Example: Trust Ratio in Practice
| Layer | \(\|\theta\|\) | \(\|\Delta\|\) | Trust Ratio = \(\frac{\|\theta\|}{\|\Delta\|}\) |
|---|---|---|---|
| Layer A | 1.0 | 0.5 | 2.0 |
| Layer B | 1.0 | 2.0 | 0.5 |
So:
- Layer A's update is amplified (under-updated in Adam)
- Layer B's update is attenuated (over-updated in Adam), as the quick check below confirms
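Checking the trust ratios from the table in code:

```python
# Trust ratio = ||theta|| / ||delta|| for the two example layers
for layer, theta_norm, delta_norm in [("A", 1.0, 0.5), ("B", 1.0, 2.0)]:
    print(f"Layer {layer}: trust ratio = {theta_norm / delta_norm}")
# Layer A: trust ratio = 2.0 -> update amplified
# Layer B: trust ratio = 0.5 -> update attenuated
```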
3.5 Why LAMB Helps in Large Models
| Challenge in Billion-Param Models | How LAMB Helps |
|---|---|
| Different layers learn at different speeds | Layer-wise trust ratio balances learning speed |
| Gradient instability in deep networks | Normalizes update magnitudes by norm ratios |
| Huge batch sizes (e.g., 64K-256K) cause convergence issues | Layer scaling stabilizes convergence across layers |
| Uniform LR schedules don't work well | Each layer gets its own implicit adaptive LR |
3.6 Analogy
Imagine training layers as runners in a race:
- Adam gives the same boost to everyone.
- But some are already ahead (large weights), some are behind.
- LAMB adjusts the boost based on where each runner is, so everyone converges together efficiently.