Yesterday OpenAI introduced "A new series of reasoning models for solving hard problems", which includes the o1-preview and o1-mini models.
With OpenAI showcasing how the new model beats humans in all kinds of competitions and exams, and comparing its intelligence to that of PhDs... I had a feeling of déjà vu, I had seen it already... Maybe it was OpenAI's March 2023 report explaining how GPT-4 exhibited "human-level performance on various professional and academic benchmarks", with a very similar graph visually demonstrating the dominance of the new model across evals:
The media seized on the news and touted GPT-4 as superhuman, claiming we were on the brink of AGI and spinning the wheel of Gen AI hype further.
During the release of the GPT-4o model earlier in May, OpenAI focused on vibrant emotional stories, was visual about the progress made (all those graphs and numbers), and did a great marketing job by introducing an appealing product. Yet I was not impressed back then, as the update failed to show any substantial progress in use cases beyond digital companionship.
While it could be too early to jump to conclusions, the o1 series - the long-awaited "next big thing" from OpenAI, the mysterious Q*, the secret "Strawberry" project - might not be as great as one could expect.
o1 models failed to impress in the Aider benchmark
That was the first surprise... After going through the OpenAI posts, the first area I could think of that would benefit from the new model was coding. A model that can take its time, think through the problem, ideate solutions, and search for the best one - coding assistants and agents could get an instant benefit by simply plugging it in.
Yet the next morning I came across preliminary results for o1 from Aider, a coding assistant popular among enthusiasts. Aider's author created his own benchmark to track progress while working on the assistant - and it is very handy for benchmarking different LLMs as well.
o1-preview scored 79.7% while claude-3.5-sonnet had 77.4%. That's not a breakthrough or PhD-level skill; that's a difference within the margin of error!
Further remarks in the blog post made o1-preview look even worse.
Using the diff edit format the o1-preview model had a strong benchmark score of 75.2%. This likely places o1-preview between Sonnet and GPT-4o for practical use, but at significantly higher cost.
OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet, but scored below those models. It also works best with the whole edit format.
The o1-preview model had trouble conforming to aider’s diff edit format. The o1-mini model had trouble conforming to both the whole and diff edit formats.
It is surprising that such strong models had trouble with the syntactic requirements of simple text output formats.
What do we get with o1?
On the surface o1 is no different from other LLMs. You send it a conversation log with the last user message at the bottom, and after some time the LLM completes it with an assistant reply. It is the same API, the same inputs and outputs, the same paradigm.
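For illustration, here is a minimal sketch of such a call with the official openai Python SDK (the prompt is a made-up example; note that at launch the o1 models rejected system messages and most sampling parameters, so the request is kept plain):

import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same chat completions endpoint, only the model name changes.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "How many 'r's are in 'strawberry'?"},
    ],
)
print(response.choices[0].message.content)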
Yet now we get the replies (from o1-preview and o1-mini) 3-10 times slower, and the cost of completion can be 10-100 times higher (compared to GPT-4o and GPT-4o-mini). Check the "Math Performance vs Inference Cost" graph and "Chat speed comparison" animation here.
The reasoning that happens behind the scenes comes at a cost: there are invisible "reasoning" tokens that are generated and billed:
usage: {
total_tokens: 1000,
prompt_tokens: 400,
completion_tokens: 600,
completion_tokens_details: {
reasoning_tokens: 500
}
}
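Reusing the response object from the earlier sketch, you can inspect the usage field to see what you are actually paying for. The prices below are my reading of the launch list prices for o1-preview and may change:

usage = response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens

print(f"prompt tokens:    {usage.prompt_tokens}")
print(f"reasoning tokens: {reasoning}")
print(f"visible tokens:   {usage.completion_tokens - reasoning}")

# Assumed launch pricing for o1-preview: $15 per 1M input tokens,
# $60 per 1M output tokens. Reasoning tokens are billed as output
# tokens even though you never get to see them.
cost = usage.prompt_tokens * 15 / 1e6 + usage.completion_tokens * 60 / 1e6
print(f"estimated cost: ${cost:.4f}")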
With o1 we get the good old chat completions service, only slower and more expensive. There's a nice bump in several evals, which don't make much sense anymore (why even bother with MMLU, which has plenty of incorrectly labeled answers)...
Why not a breakthrough?
There's no indication of progress in solving the LLM curses that have haunted Gen AI since the introduction of ChatGPT: hallucinations, reliability of outputs, human review of LLM outputs being a must, interaction with external systems, working on huge datasets, unlocking more use cases, etc.
And the idea is not new. You provoke the LLM to generate more tokens (chain-of-thought, "let's think step by step") so that before it jumps to a conclusion there are some reasonable tokens preceding the answer. You generate more of these answers. Then there's some reflection: finding deficiencies and using the LLM's ability to self-correct. After sampling many answers, they can be searched, and the best ones investigated further (tree-of-thought, Q*). Eventually, you synthesize a final answer based on those reasoning "trails".
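None of this requires a new model: a crude version of this loop can be scripted against any chat model today. A rough sketch (the model choice, candidate count, and prompts are arbitrary assumptions of mine):

from openai import OpenAI

client = OpenAI()
N_CANDIDATES = 5  # arbitrary; more samples mean more cost
question = "How many times does the letter 'r' appear in 'strawberry'?"

def ask(content: str, temperature: float = 1.0) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works for this sketch
        messages=[{"role": "user", "content": content}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# 1. Sample several chain-of-thought answers at a high temperature.
candidates = [
    ask(f"{question}\n\nLet's think step by step.") for _ in range(N_CANDIDATES)
]

# 2. Reflect on the candidates and synthesize a single final answer.
numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
final = ask(
    f"Question: {question}\n\n{numbered}\n\n"
    "Point out flaws in the candidates, then give the single best answer.",
    temperature=0,
)
print(final)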
We have seen the "ask many, synthesize one" approach in the Mixture-of-Agents open-source project. We have seen a failed attempt to integrate the reasoning and reflection skill into model training with Reflection 70B (by the way, you can try the reflection prompting strategy with any LLM by using this system prompt).
Google DeepMind tried combining the power of the Gemini LLM with supercomputer levels of compute to generate many outputs, search for the best results, and synthesize the response in AlphaCode 2 back in 2023.
What OpenAI has just done is implement and deliver the feature of leveraging increased compute at inference time to come up with better completions. They likely did a great job, and now less effort will be required from developers (using OpenAI APIs) to do prompt engineering or build sophisticated agentic flows.
A chat completions API can't conquer the world. Expecting that a text-completing LLM can work in isolation, spend some time, act upon the inputs in the prompt without any interaction with the outer world, and produce "the ultimate answer to the great question of life, the universe and everything"... That is not the kind of AI that can be universally applied to real-world problems.
Moravec's Paradox
While those high scores in different evals and PhD exams were achieved, I am wondering how much the human supporters participated. Did they do all the hard work of prepping and planning, formatting inputs, clicking the buttons, and moving texts between windows? Or did OpenAI create a programmatic harness giving the o1-preview model a set of tools to take screenshots, do mouse clicks, and type on the keyboard, and simply ask it to go and complete the evals?
I guess it was the first case. Without significant human intervention, we would see DNF (did not finish) results in all rows for o1-preview.
While LLMs can be the best solvers of multiple-choice questions (at the most advanced levels, in all fields), the industry doesn't seem to have made any reasonable progress towards solving the "I want AI to wash dishes while I do the art" problem.
There's a pattern I see in how people use Gen AI: a separation of duties. A human does the monkey job of navigating interfaces, extracting and formatting data, and preparing it for Gen AI to digest. And then the Gen AI does the high-cognitive work, such as generating a manager's report, finding a problem in code, and making recommendations.
There's this imbalance: all those small distractions, the errands of a computer worker, have barely been solved by Gen AI! The best we have so far is some lousy agents that take minutes to open up Chrome and navigate to Gmail, and then misclick buttons half of the time while trying to draft an email on your behalf.
Priming Gen AI solutions to solve even the most basic tasks might turn into a space launch preparation.
Tradeoffs?
The example with the Aider Leaderboard suggests that the improvements might be marginal. Let's see if the benefit offsets the cost, though I would expect diminishing returns.
However, what bugs me is the step back in speed: waiting longer for replies. That is something that can kill interactivity and agentic workflows - those use cases where you need the tokens faster.
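It is easy to feel this yourself by timing the same prompt against both model families. A quick sketch (absolute numbers will vary with prompt and load; the model pairing is just my choice of the two cheaper variants):

import time
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Summarize Moravec's paradox in one paragraph."}]

# Wall-clock time per full completion; o1-mini also spends tokens on
# hidden reasoning before the visible answer appears.
for model in ("gpt-4o-mini", "o1-mini"):
    start = time.perf_counter()
    client.chat.completions.create(model=model, messages=prompt)
    print(f"{model}: {time.perf_counter() - start:.1f}s")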
Why the Apple Event analogy?
o1 models might be setting the path forward for the industry: doing more work while generating replies, shifting compute from training to inference. And that's what many other Gen AI shops might do next - embedding reflection and reasoning into training, and searching for more efficient ways to sample and synthesize answers.
As Apple has done many times when introducing iPhone updates, OpenAI has presented an incremental update to its flagship Chat product. The changes are hard to spot until you are specifically told where to look. And while the ideas are not new, they are now in fashion.
P.S.
The shift to spending more compute for inference might be a convenient answer to the "AI’s $600B Question" - no GPU will be sitting idle :)
Top comments (4)
My experience is that OpenAI GPT-4o is good at starting a code project, e.g. a Python chat bot. It allows me to start faster. However, when issues show up and errors are diagnosed, it pretends to know the answer, but it's often wrong. So instead of debugging the issue myself, I'll spend time rephrasing and asking questions to steer the answers. I've shifted focus to hunting for chatbot answers. In that regard the answers are not yet impressive. There's a lot to be desired still. I believe this is because answer data for in-depth coding issues is more difficult to find.
Thanks for sharing your experience. I assume you meant o1-preview, not GPT-4o?
This was an interesting read! I’m curious though, could there be specific tasks or industries where o1's marginal gains might actually make a more noticeable difference compared to GPT-4o?
Observing how people use o1 (e.g. here's a great example comparing Claude to o1 in writing YouTube chapters - youtube.com/watch?v=GUVrOa4V8iE) and what people share: o1 is a specialised model. It can be good at processing large prompts with lots of input and instructions, and there it can show better performance. Yet it seems that multi-turn conversations are a problem, speed/reaction time is a problem, and hallucinations are still a problem (e.g. feedback like ".. the model turned out to be inventing non-existent data .." from colleagues). o1 as of now doesn't seem to be an all-round general model, and it is yet to find its narrower applications.