agusti
@bleu.eth
Transformers scale exceptionally well, but their core mechanism, self-attention, has a computational and memory cost that grows quadratically with sequence length (O(n²)). While we can push this to millions of tokens, it is an inefficient, brute-force method. It's like trying to listen to every person in a stadium to understand one conversation. Nature doesn't work this way; it uses targeted, efficient attention.
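A minimal NumPy sketch (not from the original cast) of single-head scaled dot-product attention, showing where the quadratic term comes from: the score matrix pairs every token with every other token, so it has n × n entries for a sequence of length n. Names and dimensions here are illustrative assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a length-n sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (n, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n, n) -- the quadratic cost
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                             # (n, d)

n, d = 4096, 64                                    # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)                                   # (4096, 64)
print(f"score matrix holds {n * n:,} entries")     # 16,777,216 -- grows as n^2
```

Doubling n quadruples both the entries in the score matrix and the work to fill it, which is why "just add more context" gets expensive so quickly.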