Continual Pre-Training

[Stage 1] Pretraining
   ↓
[Stage 2] Continual Pre-Training (optional)
   ↓
[Stage 3] Supervised Fine-Tuning (SFT)
   ↓
[Stage 4] Preference Optimization (DPO / PPO / RLHF)
   ↓
[Stage 5] Inference Optimization (quantization, distillation)

1. Pretraining (The Birth of the Model)

This is where the base model (GPT, Llama, Phi, Mistral) is trained from scratch.

What happens here:

  • Train on massive unlabeled text (books, code, web, papers, logs…)
  • Predict the next token (sketched just after this list)
  • Learn grammar, world knowledge, reasoning
  • Billions of tokens (100B–15T)
  • Weeks of distributed GPU training
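
To make the next-token objective concrete, here is a minimal, self-contained PyTorch sketch. The tiny model, vocabulary, and random "corpus" are illustrative placeholders, not a realistic pretraining setup.

```python
# Minimal sketch of the next-token-prediction objective behind pretraining.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        # Causal mask: each position may only attend to earlier positions.
        L = ids.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.block(self.embed(ids), src_mask=mask)
        return self.lm_head(h)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for a batch of tokenized text: 8 sequences of 33 token IDs.
tokens = torch.randint(0, vocab_size, (8, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # target = input shifted by one

logits = model(inputs)                            # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
opt.step()
```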

You get:

A base LLM (e.g., Phi-3, GPT-3, Llama-3, Mixtral 8x7B).

These base LLMs know language but:

  • They hallucinate
  • They aren't aligned with user intent
  • They don't follow instructions reliably
  • They don't reason in explicit steps
  • They don't know how to hold a conversation

So we continue.

2. Continual Pretraining (CPT) — Adding New Knowledge

This is the stage this section focuses on.

CPT = continuing pretraining on NEW unlabeled text.

Goal:

Adapt the LLM to a new domain, task family, or knowledge source.

When do we use it?

  • Adapt to code (GitHub → code models)
  • Adapt to finance, medical, legal
  • Add 2024–2025 knowledge to an older model
  • Adapt to a company’s internal data
  • Fix distribution shift

CPT looks like this:

[Base Model]
   ↓
Train again on:
  • new domain data
  • new corpora
  • new modalities
   ↓
[Better Model for the New Domain]

CPT is NOT fine-tuning.

It’s basically mini-pretraining.
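
Since CPT is just more causal-LM training started from an existing checkpoint, it can be sketched with the Hugging Face Trainer as below. The checkpoint name, data file path, and hyperparameters are illustrative assumptions, not a recommended recipe.

```python
# Sketch of CPT: resume next-token training on new, unlabeled domain text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"             # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding batches
model = AutoModelForCausalLM.from_pretrained(base)

# Unlabeled domain corpus: plain text, one document per line (assumed path).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> plain causal (next-token) language modeling, same as pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="cpt-domain-model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,        # typically lower than the original pretraining LR
    num_train_epochs=1,
    bf16=True,                 # assumes bf16-capable GPUs
    logging_steps=50,
)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=collator).train()
```

Note what is absent: no labels, no instruction templates, no reward model. Only the data (and usually the learning rate) changes relative to pretraining.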

This is where:

  • catastrophic forgetting occurs (one common mitigation is sketched after this list)
  • tokenizer problems arise
  • data quality matters 10×
  • distributed training skills matter

This is the stage Microsoft CoreAI heavily uses.