Continual Pre-Training

[Stage 1] Pretraining
   ↓
[Stage 2] Continual Pre-Training (optional)
   ↓
[Stage 3] Supervised Fine-Tuning (SFT)
   ↓
[Stage 4] Preference Optimization (DPO / PPO / RLHF)
   ↓
[Stage 5] Inference Optimization (quantization, distillation)

1. Pretraining (The Birth of the Model)

This is where the base model (GPT, Llama, Phi, Mistral) is trained from scratch.

What happens here:

  • Train on massive unlabeled text (books, code, web, papers, logs…)
  • Predict the next token (sketched just after this list)
  • Learn grammar, world knowledge, reasoning
  • Billions of tokens (100B–15T)
  • Weeks of distributed GPU training
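
To make the next-token objective concrete, here is a minimal, self-contained PyTorch sketch. The tiny model, vocabulary, and random "corpus" are illustrative placeholders, not a realistic pretraining setup.

```python
# Minimal sketch of the next-token-prediction objective behind pretraining.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        # Causal mask: each position may only attend to earlier positions.
        L = ids.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.block(self.embed(ids), src_mask=mask)
        return self.lm_head(h)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for a batch of tokenized text: 8 sequences of 33 token IDs.
tokens = torch.randint(0, vocab_size, (8, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # target = input shifted by one

logits = model(inputs)                            # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
opt.step()
```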

You get:

A base LLM (e.g., Phi-3, GPT-3, Llama-3, Mixtral 8x7B).

These base LLMs know language but:

  • They hallucinate
  • They aren't aligned with user intent
  • They don't follow instructions reliably
  • They don't reason in explicit steps
  • They don't know how to hold a conversation

So we continue.

2. Continual Pretraining (CPT) — Adding New Knowledge

This is the stage this section focuses on.

CPT = continuing pretraining on NEW unlabeled text.

Goal:

Adapt the LLM to a new domain, task family, or knowledge source.

When do we use it?

  • Adapt to code (GitHub → code models)
  • Adapt to finance, medical, legal
  • Add 2024–2025 knowledge to an older model
  • Adapt to a company’s internal data
  • Fix distribution shift

CPT looks like this:

[Base Model]
   ↓
Train again on:
  • new domain data
  • new corpora
  • new modalities
   ↓
[Better Model for the New Domain]

CPT is NOT fine-tuning.

It’s basically mini-pretraining.
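
Since CPT is just more causal-LM training started from an existing checkpoint, it can be sketched with the Hugging Face Trainer as below. The checkpoint name, data file path, and hyperparameters are illustrative assumptions, not a recommended recipe.

```python
# Sketch of CPT: resume next-token training on new, unlabeled domain text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"             # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding batches
model = AutoModelForCausalLM.from_pretrained(base)

# Unlabeled domain corpus: plain text, one document per line (assumed path).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> plain causal (next-token) language modeling, same as pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="cpt-domain-model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,        # typically lower than the original pretraining LR
    num_train_epochs=1,
    bf16=True,                 # assumes bf16-capable GPUs
    logging_steps=50,
)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=collator).train()
```

Note what is absent: no labels, no instruction templates, no reward model. Only the data (and usually the learning rate) changes relative to pretraining.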

This is where:

  • catastrophic forgetting occurs (one common mitigation is sketched after this list)
  • tokenizer problems arise
  • data quality matters 10×
  • distributed training skills matter

This is the stage Microsoft CoreAI heavily uses.