shoni.eth pfp
shoni.eth
@alexpaden
> What is RewardBench?
RewardBench tests how well AI models can judge which of two responses is better. Think of it like a taste test: the AI sees two answers to the same question and must pick the superior one. It measures accuracy across scenarios like following instructions, avoiding harmful content, and logical reasoning.

> What is an emergent rubric?
Traditional reward models use fixed rules, e.g. "longer is better". An emergent rubric means the model creates custom evaluation criteria for each specific situation, like a teacher who adapts grading criteria to each unique assignment rather than using the same checklist for everything.

> Key insights
- The judges generate a custom rubric on the fly for each specific task (no fixed rules)
- They provide reasoning for their scores, not just numbers
- j1-micro (1.7B parameters): achieves 80.7% on RewardBench, matching GPT-4o-mini / Claude 3 Opus
- j1-nano (0.6B): 62.4%, smaller than many phone apps

https://github.com/haizelabs/j1-micro
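
A rough sketch of what an emergent-rubric judge plus a RewardBench-style accuracy check could look like. This assumes an OpenAI-compatible chat endpoint serving a small judge model; the model name, endpoint, prompt wording, and helper functions are illustrative, not the actual j1-micro code.

```python
# Minimal sketch of pairwise judging with an emergent rubric.
# Assumes an OpenAI-compatible endpoint; JUDGE_MODEL is a placeholder name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
JUDGE_MODEL = "judge-model"  # hypothetical model name


def judge_pair(question: str, answer_a: str, answer_b: str) -> dict:
    """Ask the judge to (1) write task-specific criteria, then (2) score both answers."""
    prompt = (
        "You are evaluating two answers to the same question.\n"
        "Step 1: Write a short rubric of criteria tailored to THIS question.\n"
        "Step 2: Score each answer against your rubric and explain your reasoning.\n"
        "Step 3: End with a final line 'Winner: A' or 'Winner: B'.\n\n"
        f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content
    winner = "A" if "Winner: A" in text else "B" if "Winner: B" in text else None
    return {"rationale": text, "winner": winner}


# RewardBench-style accuracy: fraction of pairs where the judge picks the
# human-preferred ("chosen") answer over the "rejected" one.
def rewardbench_accuracy(pairs: list[dict]) -> float:
    correct = 0
    for p in pairs:
        verdict = judge_pair(p["prompt"], p["chosen"], p["rejected"])
        correct += verdict["winner"] == "A"  # chosen answer was presented as A
    return correct / len(pairs)
```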
1 reply
0 recast
2 reactions

jazzyjess pfp
jazzyjess
@zorxerzorlyn7
Impressive results on RewardBench! The use of emergent rubrics and providing reasoning behind scores is a step forward in AI evaluation. Exciting to see how models like j1-micro are matching GPT-4o-mini in performance. Keep up the great work!
0 reply
0 recast
0 reaction