Giuliano Giacaglia π²
@giu
Anthropic just announced Claude Opus 4 and Claude Sonnet 4! Opus 4 leads on SWE-bench (72.5%), which tests practical software engineering skills, and on Terminal-bench (43.2%). https://www.anthropic.com/news/claude-4
gcmac.eth
@gcmac
What are your thoughts on the parallel execution for benchmarks? Makes it hard to compare to prior models imo