MLM vs CLM
Masked Language Modeling (MLM) and Causal Language Modeling (CLM)
In Natural Language Processing (NLP), Language Modeling (LM) is a key technique where models learn the structure and patterns of natural language. Two of the most common types of language modeling techniques are:
- Masked Language Modeling (MLM)
- Causal Language Modeling (CLM)
This document explores the differences, provides examples, and discusses use cases of both.
1. What is Masked Language Modeling (MLM)?
Masked Language Modeling is a bidirectional approach where the model is trained to predict masked (hidden) tokens in a sentence using the surrounding context.
How it works:
- A portion of the tokens in the input sentence is replaced with a special [MASK] token.
- The model tries to predict the original token at each masked position.
Example:
Input:
The quick brown [MASK] jumps over the lazy dog.
Target output:
The quick brown fox jumps over the lazy dog.
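During pretraining this corruption is applied automatically to random positions. The sketch below is a simplified illustration of that step (it masks roughly 15% of the non-special tokens, as in BERT, but always substitutes [MASK] and skips BERT's 80/10/10 replacement rule); the `-100` labels tell the loss to ignore unmasked positions.

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Pick ~15% of the non-special tokens to mask (simplified BERT-style masking).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
)
candidates = (~special).nonzero(as_tuple=True)[0]
num_to_mask = max(1, int(0.15 * len(candidates)))
mask_positions = candidates[torch.randperm(len(candidates))[:num_to_mask]]

# Replace the chosen tokens with [MASK]; ignore every other position in the loss.
input_ids[0, mask_positions] = tokenizer.mask_token_id
labels[input_ids != tokenizer.mask_token_id] = -100

print(tokenizer.decode(input_ids[0]))  # the corrupted input the model sees
print(labels)                          # original ids kept only at masked positions
```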
Used by:
- BERT (Bidirectional Encoder Representations from Transformers)
- RoBERTa
- ALBERT
Code Example (Hugging Face Transformers):
```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

input_text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Predict only at the [MASK] position instead of over the whole sequence.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = torch.argmax(outputs.logits[0, mask_index], dim=-1)
predicted_token = tokenizer.decode(predicted_id)
print(predicted_token)  # likely "fox"
```

2. What is Causal Language Modeling (CLM)?
Causal Language Modeling is a unidirectional (usually left-to-right) modeling technique where the model predicts the next word in a sequence given the previous words.
How it works:
- At each step, the model sees only the tokens before the current position (no peeking ahead).
- It's also known as autoregressive language modeling.
Example:
Input:
The quick brown fox
Model predicts next word:
jumps
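The "no peeking ahead" constraint is typically enforced with a causal (lower-triangular) attention mask. Here is a minimal sketch in plain PyTorch, not tied to any particular model:

```python
import torch

seq_len = 5  # e.g. "The quick brown fox jumps"

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])

# Attention scores at disallowed (future) positions are set to -inf before
# the softmax, so their attention weights become exactly zero.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)  # each row only distributes weight over past and current tokens
```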
Used by:
- GPT series (GPT-1, GPT-2, GPT-3, GPT-4)
- Transformer-XL
- CTRL
Code Example (Hugging Face Transformers):
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_text = "The quick brown fox"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=5)

print(tokenizer.decode(outputs[0]))
```

3. Key Differences
| Aspect | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
|---|---|---|
| Directionality | Bidirectional | Unidirectional (left-to-right) |
| Prediction Task | Predict masked tokens | Predict next token |
| Use of Context | Uses both left and right context | Uses only left context |
| Example Model | BERT, RoBERTa | GPT, GPT-2, GPT-3 |
| Output | Word in masked position | Next word/token in sequence |
| Use Case | Classification, QA, embeddings | Text generation, story writing, chatbot |
| Pretraining Technique | Mask random tokens and train to reconstruct | Train to generate next token sequentially |
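The "Pretraining Technique" row shows up directly in how training labels are built. A minimal sketch using Hugging Face's DataCollatorForLanguageModeling for both objectives (both use the BERT tokenizer here purely for illustration; a CLM model would normally use its own tokenizer): with `mlm=True` random tokens are masked and labels are kept only at those positions, while with `mlm=False` the labels are simply a copy of the input ids, which the model shifts by one position internally when computing the loss.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

# MLM: random tokens are replaced with [MASK]; labels are -100 everywhere else.
mlm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
mlm_batch = mlm_collator(examples)
print(mlm_batch["input_ids"][0])
print(mlm_batch["labels"][0])

# CLM: inputs are left untouched; labels are just the input ids
# (the model shifts them by one position when computing next-token loss).
clm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
clm_batch = clm_collator(examples)
print(clm_batch["labels"][0])
```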
4. Applications
MLM is better for:
- Embedding learning (see the sketch after this list)
- Sentence classification
- Question answering
- Natural language inference (NLI)
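As a concrete example of the embedding-learning use case, here is a minimal sketch that mean-pools BERT's last hidden state into sentence embeddings (mean pooling is one common choice, not the only one):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn fox leaps over a sleepy dog.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over real tokens only, ignoring padding.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim.item())
```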
CLM is better for:
- Text generation (see the sampling sketch after this list)
- Autocomplete
- Conversational AI
- Creative writing
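For the more open-ended uses above (conversation, creative writing), generation usually adds sampling on top of the greedy decoding shown earlier. A minimal sketch with illustrative parameter values:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Once upon a time, a quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,                        # sample instead of always taking the argmax
        top_p=0.9,                             # nucleus sampling (illustrative value)
        temperature=0.8,                       # illustrative value
        pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no dedicated pad token
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```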
5. Final Notes
- MLMs can't be used directly for generation tasks: they are trained to fill in explicitly masked positions using context from both sides, not to extend a sequence token by token.
- CLMs, being autoregressive, are ideal for generating fluent and coherent text.
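This distinction is visible at the API level too. A quick sketch using Hugging Face pipelines (model choices are illustrative): fill-mask only fills a pre-marked slot, while text-generation continues the prompt freely.

```python
from transformers import pipeline

# MLM inference: the model fills a slot that was explicitly marked.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The quick brown [MASK] jumps over the lazy dog.")[0]["token_str"])

# CLM inference: the model freely continues the prompt.
generate = pipeline("text-generation", model="gpt2")
print(generate("The quick brown fox", max_new_tokens=10)[0]["generated_text"])
```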
References
- Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
- Radford et al., Language Models are Unsupervised Multitask Learners (2019)
- Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers