A deep dive into 2025 LLM architectures reveals a fascinating divergence beneath the surface of the standard Transformer block.
- Attention Isn't Settled: GQA is the new baseline, but it's being challenged. DeepSeek's Multi-Head Latent Attention (MLA) compresses the KV cache for memory savings, while Gemma 3's sliding window attention prioritizes local-context efficiency.
- The MoE Philosophy Split: The Mixture-of-Experts paradigm is fragmenting. The debate continues between many small experts (DeepSeek, Qwen3-Next) versus fewer, wider ones (gpt-oss). The inconsistent use of a "shared expert" (present in GLM-4.5/Grok, absent in recent Qwen/gpt-oss) indicates no single best practice has been established (see the sketch below).
- The Return of Old Ideas: gpt-oss has revived attention biases, a feature largely abandoned post-GPT-2, and introduced learned "attention sinks" for stability, a contrast to the cleaner designs seen in models like Llama.
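For concreteness, here is a minimal, hypothetical sketch of the "shared expert" design point: a toy top-k-routed MoE layer where one extra expert, if enabled, processes every token regardless of routing. All names (ToyMoE, use_shared_expert, make_expert) and sizes are illustrative assumptions, not taken from DeepSeek, Qwen, GLM-4.5, Grok, or gpt-oss.

```python
# Toy sparse-MoE layer with an optional always-on "shared expert".
# Illustrative sketch only; names, sizes, and routing details are assumptions.
import torch
import torch.nn as nn


def make_expert(d_model: int) -> nn.Module:
    # Small GELU feed-forward expert; sizes chosen purely for illustration.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
    )


class ToyMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8,
                 top_k: int = 2, use_shared_expert: bool = True):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([make_expert(d_model) for _ in range(n_experts)])
        # The contested design choice: an always-on expert alongside the routed ones.
        self.shared_expert = make_expert(d_model) if use_shared_expert else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)            # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)     # route each token to its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize (some designs skip this)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        if self.shared_expert is not None:
            out = out + self.shared_expert(x)            # every token, no routing involved
        return out


tokens = torch.randn(10, 64)
print(ToyMoE(use_shared_expert=True)(tokens).shape)   # torch.Size([10, 64])
print(ToyMoE(use_shared_expert=False)(tokens).shape)  # torch.Size([10, 64])
```

The trade-off the post alludes to: the shared expert gives every token a dense fallback path (helpful for common features), at the cost of extra compute per token, which is presumably why some recent designs drop it.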
Has Lens considered users other than artists and e-beggars?
Do less
