Tarun Chitra
@pinged
Part IV: Reasoning without Regret

Q: Can we quantify when we can make these models better: higher accuracy + lower compute cost?
A: EIP-1559 is everywhere 😈

DeepSeek is a phase transition: lowered compute by 10x+ with ~same accuracy as o1 — why? Must be real math & algorithmic improvement 🤓

🔫'd me
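(Aside for readers who haven't seen EIP-1559: its base-fee update is a one-line multiplicative feedback controller, which is why it keeps showing up as an analogy for tuning a resource like compute toward a target. A minimal sketch of the spec's update rule; the gas numbers in the usage note are illustrative, not from the post:)

```python
def next_base_fee(base_fee: int, gas_used: int, gas_target: int,
                  max_change_denominator: int = 8) -> int:
    """One step of the EIP-1559 base-fee adjustment.

    Fee moves proportionally to how far usage is from target,
    capped at 1/8 (12.5%) per block by max_change_denominator.
    """
    if gas_used == gas_target:
        return base_fee
    delta = gas_used - gas_target
    fee_delta = base_fee * abs(delta) // (gas_target * max_change_denominator)
    if delta > 0:
        # Block over target: fee rises (by at least 1 wei per the spec)
        return base_fee + max(fee_delta, 1)
    # Block under target: fee falls
    return base_fee - fee_delta
```

E.g. a completely full block (`gas_used = 2 * gas_target`) pushes the fee up by 12.5%, an empty block pulls it down by 12.5%, and usage exactly at target leaves it unchanged — a simple negative-feedback loop steering demand toward the target.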
Tarun Chitra
@pinged
First off, I wouldn't have been able to solve this without o3-mini; I used it to find references + ideas I had never heard of — it would have likely taken me forever to find them on my own.

But there is something tantalizing about the idea of using a reasoning model to solve a math problem about a reasoning model itself. If you can do this, you've found the 'backdoor' to the Reasoning Russell's paradox (insofar as one can convince oneself that the reasoning model can prove *some* properties about the set of possible reasoning traces it generates, even though it might not be able to describe the whole set).

This became my rallying cry as a way to get out of the AI doomer 🕳️: figure out what makes DeepSeek tick using o3 as an assistant (ironic, I know).
xuning
@xuning
A+