MLM vs CLM

Masked Language Modeling (MLM) and Causal Language Modeling (CLM)

In Natural Language Processing (NLP), language modeling (LM) is a core technique in which models learn the structure and patterns of natural language. Two of the most common language modeling objectives are:

  • Masked Language Modeling (MLM)
  • Causal Language Modeling (CLM)

This document explores the differences between the two, provides examples, and discusses the use cases of each.


1. What is Masked Language Modeling (MLM)?

Masked Language Modeling is a bidirectional approach where the model is trained to predict masked (hidden) tokens in a sentence using the surrounding context.

🔧 How it works:

  • A portion of tokens in the input sentence is replaced with a special [MASK] token.
  • The model tries to predict the original token at that position.

✅ Example:

Input:

The quick brown [MASK] jumps over the lazy dog.

Target output:

The quick brown fox jumps over the lazy dog.
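
During pretraining, the masked positions are chosen at random. Below is a minimal sketch of the masking step, assuming BERT's published recipe (mask 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged); the toy vocabulary is purely illustrative:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy BERT-style masking: returns the corrupted input and the targets."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # position ignored by the loss
    return masked, labels

print(mask_tokens("the quick brown fox jumps over the lazy dog".split(),
                  vocab=["cat", "dog", "runs", "blue"]))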

🧠 Used by:

  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa
  • ALBERT

🔍 Code Example (Hugging Face Transformers):

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

input_text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_id)
print(predicted_token)

2. What is Causal Language Modeling (CLM)?

Causal Language Modeling is a unidirectional (usually left-to-right) modeling technique where the model predicts the next word in a sequence given the previous words.

🔧 How it works:

  • At each step, the model sees only the tokens before the current position (no peeking ahead).
  • It’s also known as autoregressive language modeling.

✅ Example:

Input:

The quick brown fox

Model predicts next word:

jumps
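
During pretraining, this next-word objective is applied at every position of the sequence at once: the labels are simply the input shifted by one token. A minimal sketch with Hugging Face (GPT2LMHeadModel performs the shift internally when labels are passed):

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model compute the shifted
    # next-token cross-entropy loss over the whole sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(outputs.loss)  # average next-token cross-entropy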

🧠 Used by:

  • GPT series (GPT-1, GPT-2, GPT-3, GPT-4)
  • Transformer-XL
  • CTRL

🔍 Code Example (Hugging Face Transformers):

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_text = "The quick brown fox"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    # Autoregressively append up to 5 new tokens to the prompt
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
    )

print(tokenizer.decode(outputs[0]))
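
By default, generate decodes greedily (each step takes the single most likely next token); for more varied output, sampling can be enabled with do_sample=True together with parameters such as temperature or top_p.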

3. Key Differences

Aspect                | Masked Language Modeling (MLM)              | Causal Language Modeling (CLM)
Directionality        | Bidirectional                               | Unidirectional (left-to-right)
Prediction Task       | Predict masked tokens                       | Predict next token
Use of Context        | Uses both left and right context            | Uses only left context
Example Models        | BERT, RoBERTa                               | GPT, GPT-2, GPT-3
Output                | Word in masked position                     | Next word/token in sequence
Use Case              | Classification, QA, embeddings              | Text generation, story writing, chatbots
Pretraining Technique | Mask random tokens and train to reconstruct | Train to generate next token sequentially
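
The directionality difference ultimately comes down to the attention mask: an MLM encoder lets every token attend to every other token, while a CLM decoder applies a causal (lower-triangular) mask so that position i can only attend to positions 0..i. A minimal sketch of the two mask shapes:

import torch

seq_len = 5

# Bidirectional (MLM-style encoder): every position attends to every position.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Causal (CLM-style decoder): position i attends only to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)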

4. Applications

✅ MLM is better for:

  • Embedding learning
  • Sentence classification
  • Question answering
  • Natural language inference (NLI)

✅ CLM is better for:

  • Text generation
  • Autocomplete
  • Conversational AI
  • Creative writing

5. Final Notes

  • MLMs can't be used directly for open-ended generation: they are trained to fill in masked positions within a given input, not to extend a sequence token by token.
  • CLMs, being autoregressive, are a natural fit for generating fluent, coherent text (see the pipeline sketch below).
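
In practice, the Hugging Face pipeline API mirrors this split: fill-mask wraps MLM checkpoints and text-generation wraps CLM checkpoints. A minimal sketch:

from transformers import pipeline

# MLM checkpoint: fills in the [MASK] position; it does not extend the text.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The quick brown [MASK] jumps over the lazy dog.")[0]["token_str"])

# CLM checkpoint: continues the prompt token by token.
generate = pipeline("text-generation", model="gpt2")
print(generate("The quick brown fox", max_new_tokens=5)[0]["generated_text"])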

📚 References

  • Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
  • Radford et al., Language Models are Unsupervised Multitask Learners (2019)
  • Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers