
Web3Gen0
@web3gen0
Here’s a paper on making Large Language Models more efficient for low-bit quantization by preventing outliers right from the training phase: Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models.

What’s interesting?
Outliers have been a major issue when quantizing LLMs, especially for on-device deployment. This paper introduces Outlier-Safe Pre-Training (OSP), a proactive approach that stops outliers from forming in the first place.

Key highlights:
- Muon Optimizer: improves training without privileged bases.
- Single-Scale RMSNorm: controls channel-wise amplification.
- Learnable Embedding Projection: balances activation magnitudes.

The results:
- A 1.4B-parameter model trained on 1 trillion tokens without activation outliers.
- A 35.7 average score across 10 benchmarks under 4-bit quantization (baseline: 26.5).
- Only 2% extra training cost.

Turns out outliers aren’t an unavoidable part of LLMs; they’re just a result of how we train them.

https://www.arxiv.org/abs/2506.19697
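For intuition, here is a minimal PyTorch sketch contrasting a standard RMSNorm with a single-scale variant. The class names and the reading of "Single-Scale RMSNorm" as one shared scalar gain (instead of a per-channel gain vector) are assumptions based on the summary above, not the paper's reference code; see the arXiv link for the exact formulation.

```python
import torch
import torch.nn as nn


class StandardRMSNorm(nn.Module):
    """Standard RMSNorm: a learnable gain per channel, which can selectively
    amplify individual channels and encourage outlier features."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # one gain per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SingleScaleRMSNorm(nn.Module):
    """Assumed single-scale variant: one shared scalar gain for all channels,
    so normalization cannot boost any one channel relative to the others."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(()))  # single scalar gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale
```

The design idea, as summarized in the post, is that removing per-channel gains takes away one mechanism by which training can grow a handful of extreme channels, which is exactly what hurts low-bit quantization.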