[Stage 1] Pretraining
   ↓
[Stage 2] Continual Pre-Training (optional)
   ↓
[Stage 3] Supervised Fine-Tuning (SFT)
   ↓
[Stage 4] Preference Optimization (DPO / PPO / RLHF)
   ↓
[Stage 5] Inference Optimization (quantization, distillation)
1. Pretraining (The Birth of the Model)
This is where the base model (GPT, Llama, Phi, Mistral) is trained from scratch.
What happens here:
- Train on massive unlabeled text (books, code, web, papers, logs…)
- Predict the next token (see the sketch after this list)
- Learn grammar, world knowledge, reasoning
- Hundreds of billions to trillions of tokens (roughly 100B–15T)
- Weeks of distributed GPU training
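To make the "predict the next token" objective concrete, here is a minimal sketch assuming PyTorch and the Hugging Face `transformers` API; the `gpt2` checkpoint and the sample sentence are just placeholders:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Any causal LM checkpoint works here; "gpt2" is only a small placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Pretraining teaches the model to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

# With labels == input_ids, the model shifts the targets by one position and
# computes cross-entropy over the vocabulary: the next-token prediction loss.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)       # scalar training loss
outputs.loss.backward()   # gradients for one optimization step
```

Pretraining is essentially this loss, repeated over trillions of tokens on thousands of GPUs.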
You get:
A base LLM (e.g., Phi-3, GPT-3, Llama-3, Mixtral-8x7B).
These base LLMs know language but:
- They hallucinate
- They don't reliably do what you ask
- They can’t follow instructions
- They can’t reason in steps
- They don’t know how to chat
So we continue.
2. Continual Pretraining (CPT) — Adding New Knowledge
This is the first topic you asked about.
CPT = continuing pretraining on NEW unlabeled text.
Goal:
Adapt the LLM to a new domain, task family, or knowledge source.
When do we use it?
- Adapt to code (GitHub → code models)
- Adapt to finance, medical, legal
- Add 2024–2025 knowledge to an older model
- Adapt to a company’s internal data
- Fix distribution shift
CPT looks like this:
[Base Model]
   ↓
Train again on:
- new domain data
- new corpora
- new modalities
   ↓
[Better Model for the New Domain]
CPT is NOT fine-tuning. It's basically mini-pretraining.
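As a rough sketch (assuming PyTorch and the Hugging Face `Trainer`; the checkpoint name and `domain_corpus.txt` are placeholders), CPT is the same next-token objective run over the new corpus, usually at a lower learning rate than the original pretraining:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder checkpoint and domain corpus; swap in your own.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unlabeled domain text (e.g., medical, finance, or internal documents).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="cpt-checkpoints",
    per_device_train_batch_size=4,
    learning_rate=1e-5,   # lower than original pretraining to limit forgetting
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```

Note there are no labels or instructions anywhere: the "supervision" is still just the raw text itself, which is exactly why CPT is mini-pretraining rather than fine-tuning.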
This is where:
- catastrophic forgetting occurs (see the replay-mixing sketch after this list)
- tokenizer problems arise
- data quality matters 10×
- distributed training skills matter
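One common mitigation for catastrophic forgetting is replay mixing: blend a slice of general-distribution data back into the new-domain corpus so the model keeps seeing "old-style" text. A minimal sketch under the same assumptions as above (the file names and the 10% ratio are illustrative, not prescriptive):

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative files: "general_corpus.txt" stands in for data resembling the
# original pretraining distribution, "domain_corpus.txt" for the new domain.
domain = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
general = load_dataset("text", data_files={"train": "general_corpus.txt"})["train"]

# Keep roughly 10% replay data relative to the domain corpus (an assumption; tune per run).
replay_fraction = 0.10
n_replay = min(int(len(domain) * replay_fraction), len(general))
replay = general.shuffle(seed=42).select(range(n_replay))

mixed = concatenate_datasets([domain, replay]).shuffle(seed=42)
print(f"domain={len(domain)}  replay={len(replay)}  mixed={len(mixed)}")
```

The mixed dataset then feeds the same continued-pretraining loop sketched earlier.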
This is the stage Microsoft CoreAI heavily uses.