Maxim Saplin

OpenAI o3 - Thinking Fast and Slow

OpenAI has teased the o3 model today—a further development of the "reasoning" model and a successor to o1.

I was impressed by how much it improved on the ARC-AGI-1 benchmark - a benchmark supposedly unbeatable by the current generation of LLMs. o1's high score was 32%, while o3 jumped right to 88%. The authors of the ARC Challenge ($1M reward for beating ARC-AGI) were quite confident that transformer-based models would not succeed on their benchmark - they were not impressed with o1. The o3 blog post, however, carries a completely different sentiment, with words such as "surprising", "novel" and "breakthrough". Yet there's a catch - it's very, very expensive: scoring 76% cost around $9k, and for 88% OpenAI didn't disclose the cost (one can estimate it at roughly $1.5M, given the statement that 172x more compute was used).

o3 has reminded me of an analogy often mentioned when discussing LLMs. No matter the complexity of the task, GPTs use the same amount of compute/time per token, as if they are streaming information from their subconscious without ever stopping to think. This is similar to how the "Fast" System 1 of the human brain operates.

A quick recap: "Thinking, Fast and Slow" is a 2011 book by Daniel Kahneman. He argues that, functionally (based on empirical research), our brain has two systems (or modes):

  • System 1, Fast - effortless, autonomous, associative.
  • System 2, Slow - effortful, deliberate, logical.

The two systems work together and shape our thinking. We can read a book out loud without any stress, yet we might not remember a single word. Or we can read the book while staying focused, constantly replaying the scenes and pictures in our mind, keeping track of events and timelines, and be exhausted after a short period - yet we might acquire new knowledge.

As Andrew Ng once noted, "Try typing a text without ever hitting backspace" - it seems like a hard task, and yet that is how LLMs work.

Well, that's how they worked until recently. When o1 (and later DeepSeek R1, QwQ, Gemini 2.0 Flash Thinking) appeared, the models learned to take a pause and operate in a mode similar to the "Slow" system.

Recently there has been a lot of talk of LLM pre-training plateauing, training data being exhausted, and AI development hitting a wall.

We might be seeing an emerging trend for what comes in 2025 - combining reasoning/thinking models with traditional LLMs and interconnecting them as Slow and Fast minds: planning (Slow) and taking action (Fast), identifying (Fast) and evaluating (Slow), etc.

Here's one recent example from the Aider AI coding assistant, which shows how combining QwQ as the Architect and Qwen 2.5 as the Coder (there's a 2-step "architect-code" mode that lets you choose a different model for each step) increases coding performance.
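To make the Slow/Fast split more concrete, here is a minimal sketch of such a two-step pipeline: a reasoning model produces a plan, and a regular model turns it into code. This is not Aider's implementation - the model names, the local OpenAI-compatible endpoint, and the prompts are all assumptions for illustration.

```python
# A minimal sketch of the Slow/Fast split: a reasoning model plans, a regular
# model writes the code. NOT Aider's actual implementation; model names,
# endpoint, and prompts are placeholders for whatever you have available.
from openai import OpenAI

# Assumed local OpenAI-compatible server (e.g. a self-hosted inference endpoint)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

ARCHITECT_MODEL = "qwq-32b"        # Slow: reasoning/planning model (assumed name)
CODER_MODEL = "qwen2.5-coder-32b"  # Fast: traditional LLM that edits code (assumed name)

def architect_then_code(task: str, current_code: str) -> str:
    # Step 1 (Slow): ask the reasoning model for a plan, not for code.
    plan = client.chat.completions.create(
        model=ARCHITECT_MODEL,
        messages=[
            {"role": "system", "content": "Describe, step by step, how to change the code. Do not write code."},
            {"role": "user", "content": f"Task: {task}\n\nCode:\n{current_code}"},
        ],
    ).choices[0].message.content

    # Step 2 (Fast): ask the coder model to apply the plan.
    edited = client.chat.completions.create(
        model=CODER_MODEL,
        messages=[
            {"role": "system", "content": "Rewrite the code to implement the plan. Return only code."},
            {"role": "user", "content": f"Plan:\n{plan}\n\nCode:\n{current_code}"},
        ],
    ).choices[0].message.content
    return edited
```

The shape mirrors the architect-code mode described above: the expensive "thinking" happens once, up front, and the cheaper model does the mechanical editing.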

Whether this will play out is hard to say. There are plenty of challenges where we haven't seen much progress lately, even with Slow models. It's unclear how tolerant models such as o3 will be to hallucinations. The context window is still too small. The prices are going up... The Slow models, while they hit the next levels on various "isolated" evals, are far from practical application at scale (doing large projects on their own OR simulating a junior intern). Additionally, the Fast models, the actors, don't seem to have shown progress in computer use, and Moravec's paradox is still a challenge when it comes to automating a computer clerk.

P.S.

Around the same time o3 was announced, I received API access to o1-mini. I ran my own LLM Chess Eval, which simulates chess games by prompting models to play against a random player. While the previous SOTA models couldn't score even a single win (and I had assumed the benchmark was as hard as the ARC eval)... o1-mini won 30% of the time! Now I am less skeptical - after all, there might be some reasoning.
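For context, the eval boils down to something like the sketch below - not the actual harness, just the idea: the model under test plays against an opponent that picks random legal moves, and wins are counted. The ask_llm_for_move helper is a placeholder for the real prompting and move-parsing logic.

```python
# A minimal sketch of an LLM-vs-random-player chess eval (not the author's
# actual harness). Uses python-chess for board state and move legality.
import random
import chess

def ask_llm_for_move(board: chess.Board) -> chess.Move:
    """Placeholder: prompt a model with the position/move list and parse its reply.
    Here we just return a random legal move so the sketch runs on its own."""
    return random.choice(list(board.legal_moves))

def play_one_game() -> str:
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            move = ask_llm_for_move(board)                 # the model under test
        else:
            move = random.choice(list(board.legal_moves))  # random opponent
        board.push(move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

wins = sum(play_one_game() == "1-0" for _ in range(30))
print(f"LLM (White) won {wins}/30 games")
```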

Top comments (8)

Valeria

I don't want a smarter and more logical AI. I want an affordable robot that can fold my laundry and pick up the mess, so that I can spend more time being logical and smart. The human brain is clearly a much more efficient tool for the latter - such a waste using it to do laundry or dishes 😎

Daniel T Sasser II

Good information. You should follow me and catch my article coming out tomorrow on "The New Frontier of AI"; I go into some of this. While the details here in your post are more comprehensive on this subject, I go into a broader discussion. I'd love to hear your thoughts, and I'll be sure to follow you as well to keep up with your journey.

Andre Du Plessis

[Daniel](@dansasser), will you be posting your article "The New Frontier of AI" here on DEV, or on another platform?

Daniel T Sasser II

@andre_adpc I actually posted it here yesterday under a different title, The AI of Christmas Future. Just click this link to get to it; it's also part of the same series as my other AI articles. I thought the name was a little catchier since I released it as a special Christmas edition 😁 There are three others in this series so far. Each of them goes into detail on a different aspect of the subject.

sinni800

It still doesn't think; it's still a language model, not a thinking model. This is where OpenAI keeps trying to bamboozle all of us. As long as ClosedAI isn't fundamentally changing the way the models work, there's a ceiling that we simply won't extend past. They had to generate answers over enormous amounts of compute, basically brute-forcing an answer until it was good.

There is no thinking involved, period.

Maxim Saplin

Brute-forcing assumes a loop where every iteration verifies a candidate. I assume that with the ARC benchmark there's no such option as crunching and validating zillions of (random) answers for a given task; rather, the model is given enough time/compute before producing a single (ideally correct) answer to a given problem.

And o1/o3 don't seem to be the traditional autoregressive language models they are often assumed to be.

FounderBrief

Good Job
