Maxim Saplin

OpenAI o3 - Thinking Fast and Slow

OpenAI has teased the o3 model today—a further development of the "reasoning" model and a successor to o1.

I was impressed by how much it improved on the ARC-AGI-1 benchmark - a benchmark supposedly unbeatable by the current generation of LLMs. o1's high score was 32%, while o3 jumped right to 88%. The authors of the ARC Challenge ($1M reward for beating ARC-AGI) were quite confident that transformer-based models would not succeed on their benchmark - they were not impressed with o1. The o3 blog post, however, carries a completely different sentiment, with words such as "surprising", "novel" and "breakthrough". Yet there's a catch - it's very, very expensive: scoring 76% cost around $9k, and for 88% OpenAI didn't disclose the cost (one can estimate it at roughly $1.5M, given the statement that 172x more compute was used).

o3 has reminded me of an analogy often mentioned when discussing LLMs. No matter the complexity of the task, GPTs use the same amount of compute/time per token, as if they are streaming information from their subconscious without ever stopping to think. This is similar to how the "Fast" System 1 of the human brain operates.

A quick recap: "Thinking, Fast and Slow" is a 2011 book by Daniel Kahneman. He argues that, functionally (based on empirical research), our brain has two systems (or modes):

  • System 1, Fast - effortless, autonomous, associative.
  • System 2, Slow - effortful, deliberate, logical.

The two systems work together and shape our thinking. We can read a book out loud without any stress, yet we might not remember a single word. Or we can read the book while staying focused, constantly replaying the scenes and pictures in our mind, keeping track of events and timelines, and be exhausted after a short period - yet we might acquire new knowledge.

As Andrew Ng once noted, "Try typing a text without ever hitting backspace" - it seems like a hard task, and yet that is how LLMs work.

Well, that's how they worked until recently. When o1 (and later DeepSeek R1, QwQ, Gemini 2.0 Flash Thinking) appeared, the models learned to take a pause and operate in a mode similar to the "Slow" system.

Recently there has been a lot of talk of LLM pre-training plateauing, training data being exhausted, and AI development hitting a wall.

We might be seeing an emerging trend for what comes in 2025 - combining reasoning/thinking models with traditional LLMs and interconnecting them as Slow and Fast minds: planning (Slow) and taking action (Fast), identifying (Fast) and evaluating (Slow), etc.

Here's one recent example from the Aider AI coding assistant, which shows how combining QwQ as the Architect and Qwen 2.5 as the Coder (there's a 2-step "architect-code" mode that lets you choose a different model for each step) increases coding performance.
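To make the Slow/Fast split more concrete, here is a minimal sketch of such a two-step pipeline: a reasoning model produces a plan, and a regular model turns it into code. This is not Aider's implementation - the model names, the local OpenAI-compatible endpoint, and the prompts are all assumptions for illustration.

```python
# A minimal sketch of the Slow/Fast split: a reasoning model plans, a regular
# model writes the code. NOT Aider's actual implementation; model names,
# endpoint, and prompts are placeholders for whatever you have available.
from openai import OpenAI

# Assumed local OpenAI-compatible server (e.g. a self-hosted inference endpoint)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

ARCHITECT_MODEL = "qwq-32b"        # Slow: reasoning/planning model (assumed name)
CODER_MODEL = "qwen2.5-coder-32b"  # Fast: traditional LLM that edits code (assumed name)

def architect_then_code(task: str, current_code: str) -> str:
    # Step 1 (Slow): ask the reasoning model for a plan, not for code.
    plan = client.chat.completions.create(
        model=ARCHITECT_MODEL,
        messages=[
            {"role": "system", "content": "Describe, step by step, how to change the code. Do not write code."},
            {"role": "user", "content": f"Task: {task}\n\nCode:\n{current_code}"},
        ],
    ).choices[0].message.content

    # Step 2 (Fast): ask the coder model to apply the plan.
    edited = client.chat.completions.create(
        model=CODER_MODEL,
        messages=[
            {"role": "system", "content": "Rewrite the code to implement the plan. Return only code."},
            {"role": "user", "content": f"Plan:\n{plan}\n\nCode:\n{current_code}"},
        ],
    ).choices[0].message.content
    return edited
```

The shape mirrors the architect-code mode described above: the expensive "thinking" happens once, up front, and the cheaper model does the mechanical editing.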

Whether this will play out is hard to say. There are plenty of challenges where we haven't seen much progress lately, even with Slow models. It's unclear how tolerant models such as o3 will be to hallucinations. The context window is still too small. The prices are going up... The Slow models, while they hit the next levels on various "isolated" evals, are far from practical application at scale (doing large projects on their own OR simulating a junior intern). Additionally, the Fast models, the actors, don't seem to have shown progress in computer use, and Moravec's paradox is still a challenge when it comes to automating a computer clerk.

P.S.

Around the same time o3 was announced, I received API access to o1-mini. I ran my own LLM Chess Eval, which simulates chess games by prompting models to play against a random player. While the previous SOTA models couldn't score even a single win (and I had assumed the benchmark was as hard as the ARC eval)... o1-mini won 30% of the time! Now I am less skeptical - after all, there might be some reasoning.
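For context, the eval boils down to something like the sketch below - not the actual harness, just the idea: the model under test plays against an opponent that picks random legal moves, and wins are counted. The ask_llm_for_move helper is a placeholder for the real prompting and move-parsing logic.

```python
# A minimal sketch of an LLM-vs-random-player chess eval (not the author's
# actual harness). Uses python-chess for board state and move legality.
import random
import chess

def ask_llm_for_move(board: chess.Board) -> chess.Move:
    """Placeholder: prompt a model with the position/move list and parse its reply.
    Here we just return a random legal move so the sketch runs on its own."""
    return random.choice(list(board.legal_moves))

def play_one_game() -> str:
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            move = ask_llm_for_move(board)                 # the model under test
        else:
            move = random.choice(list(board.legal_moves))  # random opponent
        board.push(move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

wins = sum(play_one_game() == "1-0" for _ in range(30))
print(f"LLM (White) won {wins}/30 games")
```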

Top comments (8)

Valeria

I don't want a smarter and more logical AI. I want an affordable robot that can fold my laundry and pick up the mess, so that I can spend more time being logical and smart. The human brain is clearly a much more efficient tool for the latter - such a waste using it to do laundry or dishes 😎

Daniel T Sasser II

Good information. You should follow me and catch my article coming out tomorrow on "The New Frontier of AI"; I go into some of this. While the details here in your post are more comprehensive on this subject, I go into a broader discussion. I'd love to hear your thoughts, and I'll be sure to follow you as well to keep up with your journey.

Andre Du Plessis

[Daniel](@dansasser), will you be posting your article "The New Frontier of AI" here on DEV, or on another platform?

Daniel T Sasser II

@andre_adpc I actually posted it here yesterday under a different title, The AI of Christmas Future. Just click this link to get to it; it's also part of the same series as my other AI articles. I thought the name was a little catchier since I released it as a special Christmas edition 😁 There are three others in this series so far. Each of them goes into detail on a different aspect of the subject.

sinni800

It still doesn't think; it's still a language model, not a thinking model. This is where OpenAI keeps trying to bamboozle all of us. As long as ClosedAI isn't fundamentally changing the way the models work, there's a ceiling that we simply won't extend past. They had to generate answers over enormous amounts of compute, basically brute-forcing an answer until it was good.

There is no thinking involved, period.

Maxim Saplin

Brute-forcing assumes a loop where every iteration verifies a candidate. I assume that with the ARC benchmark there's no such option as crunching and validating zillions of (random) answers for a given task; rather, the model is given enough time/compute before producing a single (ideally correct) answer to a given problem.

And o1/o3 don't seem to be the traditional autoregressive language models they are often assumed to be.

FounderBrief

Good Job
