joelceth

@joelceth

ai models do well on questions they can answer from memorized knowledge. but what happens when you hand them a new document and say "read this and apply it"? tencent and fudan university tested exactly that with 1,899 tasks, and even the best model solved only about 24%.

why is this so hard? recalling memorized knowledge is the easy part. the hard part is reading information the model has never seen before and applying it on the spot.

the test suite they released is called cl-bench. it includes 500 different contexts, 1,899 tasks, and over 31,000 validation rules. models receive information they never saw during training and must solve problems using it.

the results are striking: the best of the 19 models tested, gpt 5.1, reached only 23.7% accuracy. claude opus 4.5 got 21.1% and gemini 3 pro managed 15.8%. even the most advanced models struggle to learn from context and apply new information.
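to make "tasks checked by validation rules" concrete, here is a minimal python sketch of how this kind of scoring could work. everything in it is hypothetical: the `Task` structure, the toy contexts, and the assumption that a task counts as solved only when every one of its rules passes. it is not cl-bench's actual code or schema.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; not cl-bench's real schema.
Rule = Callable[[str], bool]  # a validation rule checks one property of an answer


@dataclass
class Task:
    context: str       # new information the model has not seen before
    question: str      # problem that must be solved using the context
    rules: List[Rule]  # validation rules applied to the model's answer


def task_solved(answer: str, task: Task) -> bool:
    # Assumption: a task is solved only if ALL of its rules pass.
    return all(rule(answer) for rule in task.rules)


def accuracy(answers: List[str], tasks: List[Task]) -> float:
    # Fraction of tasks whose answer satisfies every rule.
    solved = sum(task_solved(a, t) for a, t in zip(answers, tasks))
    return solved / len(tasks)


# Toy data: invented contexts that cannot be answered from memory.
tasks = [
    Task(
        context="in this made-up game, a 'zorb' is worth 7 points.",
        question="how many points are 3 zorbs worth?",
        rules=[lambda a: "21" in a],
    ),
    Task(
        context="the made-up unit 'flim' equals 4 meters.",
        question="how many meters is 2 flims? answer with the unit.",
        rules=[lambda a: "8" in a, lambda a: "meter" in a.lower()],
    ),
]

# Stand-in model outputs; the second one omits the unit, so one rule fails.
answers = ["3 zorbs are worth 21 points.", "8"]

score = accuracy(answers, tasks)
print(f"accuracy: {score:.1%}")
```

under this all-rules-must-pass assumption, a partially correct answer earns nothing, which is one way strict scoring can produce low headline numbers.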