Optimizer Comparison: Adam vs. AdamW vs. LAMB

1 🧮 Adam vs. AdamW vs. LAMB: Detailed Comparison Table

| Aspect | Adam | AdamW | LAMB |
| --- | --- | --- | --- |
| 🔧 Base Optimizer | Adam | Adam + Decoupled Weight Decay | AdamW + Layer-wise Trust-Ratio Scaling |
| 📘 Published In | Kingma & Ba, 2014 | Loshchilov & Hutter, 2017 | You et al., 2019 |
| 🧠 Purpose | Adaptive optimizer for general deep learning | Improved regularization for deep models | Scalable training for large-batch, large-scale models |
| 📉 Update Rule | \(g \leftarrow \nabla L + \lambda \theta\), then \(\theta \leftarrow \theta - \eta \cdot \frac{m}{\sqrt{v} + \epsilon}\) (decay folded into the moments) | \(\theta \leftarrow \theta - \eta \cdot \frac{m}{\sqrt{v} + \epsilon} - \eta \cdot \lambda \cdot \theta\) | \(\theta \leftarrow \theta - \eta \cdot r \cdot \left( \frac{m}{\sqrt{v} + \epsilon} + \lambda \cdot \theta \right)\), where \(r\) is the layer's trust ratio |
| 🏗️ Weight Decay Handling | ❌ Incorrect: applied as L2 inside the gradient, which interferes with the adaptive scaling | ✅ Correct: decoupled from the gradient | ✅✅ Decoupled and scaled per layer for stability |
| 🔁 Layer-wise Adaptive LR | ❌ No | ❌ No | ✅ Yes, based on weight and update norms |
| ⚖️ Trust Ratio | ❌ No | ❌ No | ✅ Yes, the ratio of the weight norm to the update norm scales each layer's update |
| 🧪 Overfitting Prevention | ⚠️ Weak with weight decay | ✅ Stronger with proper decay | ✅ Very robust with scaled decay |
| 🧮 Gradient Clipping | Optional but often needed | Optional | Less needed; the trust ratio stabilizes updates |
| 🧠 Good for Deep Models? | ⚠️ Sometimes unstable | ✅ Better | ✅✅ Designed for it |
| 🧠 Good for Wide Models? | ⚠️ Risky | ✅ Works well | ✅✅ Great for MoE / large embeddings |
| 🏋️ Batch Size Scaling | ❌ Poor beyond 8–16K | ⚠️ Up to ~32K with tuning | ✅✅ Efficient at very large batches (64K+ demonstrated) |
| 🏢 Training Stability (LLMs) | ❌ Updates can become unstable at scale | ✅ Acceptable | ✅✅ High stability with large layers |
| ⚡ Convergence Speed | ✅ Fast at small scale | ✅ Fast and generalizes well | ✅✅ Fast and scalable for large workloads |
| 🧩 Parameter Count Suitability | Small to mid-size (up to ~300M) | Mid to large (up to ~1B) | Very large (1B to 175B+) |
| 🧰 Use Case Examples | Small models, legacy setups | BERT, GPT-2, fine-tuning | Large-batch pretraining of BERT-Large, T5, GShard-scale models |
| 🖥️ Used In | Early DL models, simple setups | Hugging Face default for Transformers | Large-batch LLM pretraining at Google/Microsoft scale |
| 🧪 PyTorch Implementation | torch.optim.Adam | torch.optim.AdamW | deepspeed.ops.lamb.FusedLamb, NVIDIA Apex FusedLAMB (none in torch.optim) |
| 📦 Libraries Supporting It | All major DL libraries | Hugging Face, PyTorch | DeepSpeed, Megatron-LM, FairScale |
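
To make the weight-decay handling row concrete, here is a minimal PyTorch sketch of how each optimizer is typically constructed; the LAMB lines are commented out because they assume NVIDIA Apex is installed (PyTorch itself ships no LAMB):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)  # stand-in for any model

# Adam: weight_decay is classic L2 regularization; lambda * theta is added to
# the gradient *before* the moment estimates, so the decay gets rescaled by
# the adaptive denominator.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# AdamW: decay is decoupled; parameters are shrunk by lr * lambda * theta
# directly, independent of the adaptive moments.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# LAMB: no torch.optim implementation; fused versions live in NVIDIA Apex and
# DeepSpeed. Commented out because it assumes Apex is available.
# from apex.optimizers import FusedLAMB
# lamb = FusedLAMB(model.parameters(), lr=2e-3, weight_decay=0.01)
```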

2 🧠 Key Takeaways

2.1 ✅ Use Adam if:

  • You're training small models with limited data and compute.
  • Simplicity is more important than scalability.

2.2 ✅ Use AdamW if:

  • You're training medium-sized models (e.g., BERT-base, GPT-2).
  • You want better generalization via properly decoupled weight decay.
  • You're fine-tuning pre-trained models (e.g., with LoRA/QLoRA); a minimal sketch follows below.
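
The sketch below illustrates that fine-tuning pattern under a toy assumption: a frozen "backbone" with one trainable head stands in for LoRA-style adapters, and AdamW is applied only to the trainable subset.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained model: freeze the "backbone", train the head.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))
for p in model[0].parameters():
    p.requires_grad = False  # frozen base weights (as LoRA would do)

# Pass only the trainable parameters to AdamW; decoupled weight decay then
# acts on the adapter/head weights alone.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4, weight_decay=0.01)
```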

2.3 ✅✅ Use LAMB if:

  • You're training large-scale LLMs from scratch (e.g., 7B, 13B, 175B).
  • You want to scale batch size to 64K+ efficiently.
  • You're using TPUs or multi-GPU clusters and care about throughput and convergence.

3 ๐Ÿ” What is Layer-wise Scaling?

In deep neural networks, especially transformers, we have multiple layers with:

  • Different weight magnitudes
  • Different gradient magnitudes
  • Varying learning dynamics

Layer-wise scaling refers to the practice of scaling updates differently for each layer, based on that layer's weight norm and its gradient/update norm, instead of applying the same learning rate uniformly across all layers.
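
A quick way to see these per-layer differences is to print each layer's weight and gradient norms after a backward pass; a minimal sketch with a toy model (illustrative only; real transformers show the same effect at much larger scale):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny two-layer model with a throwaway batch of random data.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
F.cross_entropy(model(x), y).backward()

for name, p in model.named_parameters():
    # Per-layer weight norm and gradient norm: the quantities that
    # layer-wise scaling uses to size each layer's update.
    print(f"{name}: ||theta|| = {p.norm().item():.4f}, "
          f"||grad|| = {p.grad.norm().item():.4f}")
```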


3.1 🧠 Why is this necessary?

Imagine two layers:

| Layer | Weight Norm \(\lVert\theta\rVert\) | Gradient Norm \(\lVert\nabla\theta\rVert\) | Relative Update under a Uniform LR |
| --- | --- | --- | --- |
| Layer A (early) | 0.01 | 0.001 | Large (may overshoot) |
| Layer B (final) | 10.0 | 0.01 | Tiny (may under-update) |

In standard optimizers (Adam/SGD), both layers receive the same learning rate, so the update size is not matched to each layer's weight scale, potentially causing:

  • Training instability
  • Under-trained or over-trained layers
  • Poor convergence with large batch sizes

3.2 โš™๏ธ Enter LAMB: Layer-wise Adaptive Moments

LAMB = AdamW (Adam + decoupled weight decay) + Layer-wise Trust-Ratio Scaling

LAMB introduces a trust ratio that adapts updates per layer:

\[ \text{Trust Ratio} = \frac{||\theta||}{||\Delta||} \]

Where:

  • \(\theta\) = parameters of a layer
  • \(\Delta\) = Adam-like update for that layer (i.e., \(\frac{m}{\sqrt{v} + \epsilon}\))

Instead of applying \(\Delta\) directly, LAMB applies:

\[ \theta = \theta - \eta \cdot \text{Trust Ratio} \cdot \Delta \]
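
A simplified single-tensor sketch of this update step (bias correction and the trust-ratio clamping used in practice are omitted; this is an illustration, not the reference implementation):

```python
import torch

def lamb_step(param, grad, exp_avg, exp_avg_sq, lr=1e-3,
              betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01):
    """One simplified LAMB update for a single layer's parameter tensor."""
    beta1, beta2 = betas
    # Adam-style first and second moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Adam-like direction plus decoupled weight decay.
    update = exp_avg / (exp_avg_sq.sqrt() + eps) + weight_decay * param

    # Layer-wise trust ratio: ||theta|| / ||update||.
    w_norm, u_norm = param.norm(), update.norm()
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    # Apply the trust-ratio-scaled step.
    param.sub_(lr * trust_ratio * update)

# Usage on a single weight tensor (moment buffers start at zero).
w = torch.randn(64, 64)
g = torch.randn(64, 64)
lamb_step(w, g, torch.zeros_like(w), torch.zeros_like(w))
```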


3.3 🔄 Effect: Proportional Updates

  • If a layer's weights are large but its gradient update is small, LAMB increases the update magnitude (to match scale).
  • If the weights are small but gradients are large (possibly noisy), LAMB reduces the update (to prevent overshooting).

This keeps each layer's update proportional to its own parameter magnitude.


3.4 📊 Example: Trust Ratio in Practice

| Layer | \(\lVert\theta\rVert\) | \(\lVert\Delta\rVert\) | Trust Ratio \(= \lVert\theta\rVert / \lVert\Delta\rVert\) |
| --- | --- | --- | --- |
| Layer A | 1.0 | 0.5 | 2.0 |
| Layer B | 1.0 | 2.0 | 0.5 |

So:

  • Layer A's update is amplified (it would have been under-updated by plain Adam).
  • Layer B's update is attenuated (it would have been over-updated by plain Adam); the arithmetic below makes this explicit.
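
Multiplying through (writing \(r_A, \Delta_A\) for Layer A's trust ratio and update) shows why this keeps steps proportional to each layer's weights: the rescaled update norm equals the weight norm in both cases, before the learning rate \(\eta\) is applied:

\[ \|r_A \Delta_A\| = 2.0 \times 0.5 = 1.0 = \|\theta_A\|, \qquad \|r_B \Delta_B\| = 0.5 \times 2.0 = 1.0 = \|\theta_B\| \]

In general \(\|r \cdot \Delta\| = \frac{\|\theta\|}{\|\Delta\|} \cdot \|\Delta\| = \|\theta\|\), so (without clamping) the effective step size of each layer is \(\eta \|\theta\|\).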

3.5 🚀 Why LAMB Helps in Large Models

| Challenge in Billion-Parameter Models | How LAMB Helps |
| --- | --- |
| Different layers learn at different speeds | Layer-wise trust ratio balances learning speed |
| Gradient instability in deep networks | Normalizes update magnitudes by norm ratios |
| Huge batch sizes (e.g., 64K–256K) cause convergence issues | Layer scaling stabilizes convergence across layers |
| Uniform LR schedules don't work well | Each layer gets its own implicit adaptive LR |

3.6 🧠 Analogy

Imagine training layers as runners in a race:

  • Adam gives the same boost to everyone.
  • But some runners are already far ahead (large weights) while others lag behind.
  • LAMB adjusts each runner's boost based on where they are, so everyone converges together efficiently.