MLM vs CLM

Masked Language Modeling (MLM) and Causal Language Modeling (CLM)

In Natural Language Processing (NLP), language modeling (LM) is a core technique in which models learn the structure and patterns of natural language. Two of the most common language modeling objectives are:

  • Masked Language Modeling (MLM)
  • Causal Language Modeling (CLM)

This document explores the differences between the two, provides examples, and discusses the use cases of each.


1. What is Masked Language Modeling (MLM)?

Masked Language Modeling is a bidirectional approach where the model is trained to predict masked (hidden) tokens in a sentence using the surrounding context.

🔧 How it works:

  • A portion of tokens in the input sentence is replaced with a special [MASK] token.
  • The model tries to predict the original token at that position.

✅ Example:

Input:

The quick brown [MASK] jumps over the lazy dog.

Target output:

The quick brown fox jumps over the lazy dog.
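
During pretraining, the masked positions are chosen at random. Below is a minimal sketch of the masking step, assuming BERT's published recipe (mask 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged); the toy vocabulary is purely illustrative:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy BERT-style masking: returns the corrupted input and the targets."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # position ignored by the loss
    return masked, labels

print(mask_tokens("the quick brown fox jumps over the lazy dog".split(),
                  vocab=["cat", "dog", "runs", "blue"]))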

🧠 Used by:

  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa
  • ALBERT

🔍 Code Example (Hugging Face Transformers):

from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

input_text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_id)
print(predicted_token)

2. What is Causal Language Modeling (CLM)?

Causal Language Modeling is a unidirectional (usually left-to-right) modeling technique where the model predicts the next word in a sequence given the previous words.

🔧 How it works:

  • At each step, the model sees only the tokens before the current position (no peeking ahead).
  • It’s also known as autoregressive language modeling.

✅ Example:

Input:

The quick brown fox

Model predicts next word:

jumps
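
During pretraining, this next-word objective is applied at every position of the sequence at once: the labels are simply the input shifted by one token. A minimal sketch with Hugging Face (GPT2LMHeadModel performs the shift internally when labels are passed):

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model compute the shifted
    # next-token cross-entropy loss over the whole sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(outputs.loss)  # average next-token cross-entropy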

🧠 Used by:

  • GPT series (GPT-1, GPT-2, GPT-3, GPT-4)
  • Transformer-XL
  • CTRL

🔍 Code Example (Hugging Face Transformers):

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_text = "The quick brown fox"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    # Autoregressively append up to 5 new tokens to the prompt
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
    )

print(tokenizer.decode(outputs[0]))
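
By default, generate decodes greedily (each step takes the single most likely next token); for more varied output, sampling can be enabled with do_sample=True together with parameters such as temperature or top_p.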

3. Key Differences

Aspect                | Masked Language Modeling (MLM)              | Causal Language Modeling (CLM)
Directionality        | Bidirectional                               | Unidirectional (left-to-right)
Prediction Task       | Predict masked tokens                       | Predict next token
Use of Context        | Uses both left and right context            | Uses only left context
Example Models        | BERT, RoBERTa                               | GPT, GPT-2, GPT-3
Output                | Word in masked position                     | Next word/token in sequence
Use Case              | Classification, QA, embeddings              | Text generation, story writing, chatbots
Pretraining Technique | Mask random tokens and train to reconstruct | Train to generate next token sequentially
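
The directionality difference ultimately comes down to the attention mask: an MLM encoder lets every token attend to every other token, while a CLM decoder applies a causal (lower-triangular) mask so that position i can only attend to positions 0..i. A minimal sketch of the two mask shapes:

import torch

seq_len = 5

# Bidirectional (MLM-style encoder): every position attends to every position.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Causal (CLM-style decoder): position i attends only to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)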

4. Applications

✅ MLM is better for:

  • Embedding learning
  • Sentence classification
  • Question answering
  • Natural language inference (NLI)

✅ CLM is better for:

  • Text generation
  • Autocomplete
  • Conversational AI
  • Creative writing

5. Final Notes

  • MLMs can't be used directly for open-ended generation: they are trained to fill in masked positions within a given input, not to extend a sequence token by token.
  • CLMs, being autoregressive, are a natural fit for generating fluent, coherent text (see the pipeline sketch below).
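
In practice, the Hugging Face pipeline API mirrors this split: fill-mask wraps MLM checkpoints and text-generation wraps CLM checkpoints. A minimal sketch:

from transformers import pipeline

# MLM checkpoint: fills in the [MASK] position; it does not extend the text.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The quick brown [MASK] jumps over the lazy dog.")[0]["token_str"])

# CLM checkpoint: continues the prompt token by token.
generate = pipeline("text-generation", model="gpt2")
print(generate("The quick brown fox", max_new_tokens=5)[0]["generated_text"])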

📚 References

  • Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
  • Radford et al., Language Models are Unsupervised Multitask Learners (2019)
  • Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers