@mattrickard
We're decent at predicting how much a benchmark score improves with more scale but bad at predicting how those gains translate to real-world applications. E.g., when we were at 75% top-1 accuracy on ImageNet, we thought 90% meant full self-driving (it didn't). Same w/ chatbots/NLP.