Soda

@ff0000

A deep dive into 2025 LLM architectures reveals a fascinating divergence beneath the surface of the standard Transformer block.

Attention Isn't Settled: GQA is the new baseline, but it's being challenged. DeepSeek's Multi-Head Latent Attention (MLA) compresses the KV cache for memory savings, while Gemma 3's sliding window attention prioritizes local-context efficiency.

The MoE Philosophy Split: the Mixture-of-Experts paradigm is fragmenting. The debate continues between many small experts (DeepSeek, Qwen3-Next) and fewer, wider ones (gpt-oss). The inconsistent use of a "shared expert" (present in GLM-4.5/Grok, absent in recent Qwen/gpt-oss) indicates no single best practice has been established.

The Return of Old Ideas: gpt-oss has revived attention biases, a feature largely abandoned post-GPT-2, and introduced learned "attention sinks" for stability, a contrast to the cleaner designs seen in models like Llama.

Rough sketches of the cache math, the expert routing, and the sink trick are below.
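To make the KV-cache point concrete, here is back-of-the-envelope arithmetic comparing per-token cache sizes for MHA, GQA, and an MLA-style compressed latent. The dimensions (d_model=4096, 8 KV groups, a 512-dim latent) are made up for illustration, and the MLA line ignores details like DeepSeek's decoupled RoPE key.

```python
# Rough per-token, per-layer KV-cache sizes in fp16 bytes.
# All dimensions below are hypothetical, chosen only to show the scaling.

BYTES = 2            # fp16
d_model = 4096
n_heads = 32
d_head = d_model // n_heads
n_kv_groups = 8      # GQA: 32 query heads share 8 KV heads
d_latent = 512       # MLA: a single low-rank latent replaces cached K and V

mha = 2 * n_heads * d_head * BYTES       # full K and V for every head
gqa = 2 * n_kv_groups * d_head * BYTES   # K and V only for the shared KV heads
mla = d_latent * BYTES                   # compressed latent; K/V re-projected at use time

for name, per_token in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {per_token} bytes per token per layer")
```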
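For the MoE split, a minimal sketch: top-k routing over a pool of small experts, plus an optional always-on shared expert of the kind GLM-4.5/Grok keep and recent Qwen/gpt-oss drop. Sizes, expert counts, and the class name TinyMoE are hypothetical; real implementations also renormalize the routing weights and add load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k MoE layer with an optional always-on shared expert.
    Dimensions and expert counts are illustrative, not any real model's config."""
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2, shared_expert=True):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert: applied to every token, bypasses the router entirely.
        self.shared = (
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            if shared_expert else None
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)   # each token picks top_k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        if self.shared is not None:
            out = out + self.shared(x)                   # shared expert added unconditionally
        return out

x = torch.randn(4, 256)
print(TinyMoE()(x).shape)   # torch.Size([4, 256])
```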
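Finally, a toy view of a learned attention sink: an extra learned logit joins the softmax so a head can park probability mass on it instead of over-attending to real tokens. The function name, shapes, and single-head setup are illustrative assumptions, not gpt-oss's actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    """Single-head scaled dot-product attention with a learned 'sink' logit.
    The sink participates in the softmax but contributes no value vector."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5          # (T, T) attention logits
    sink = sink_logit.expand(scores.shape[0], 1)       # (T, 1) learned scalar per row
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[:, :-1] @ v                           # drop the sink column before mixing values

T, d = 5, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
sink_logit = torch.nn.Parameter(torch.zeros(1, 1))
print(attention_with_sink(q, k, v, sink_logit).shape)  # torch.Size([5, 16])
```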