When OpenAI showed o3 and its benchmarks this month, my first thought was "shit, don't go to LinkedIn". Every time there is even a tiny improvement in AI, the feed gets flooded with tons of this-is-a-game-changer posts. Let's try to hold on and reflect on this using our NGI (natural general intelligence).
ARC-AGI context
ARC-AGI was created by François Chollet, the same guy who open-sourced the deep learning library Keras. He published On the Measure of Intelligence, where he suggested this kind of benchmark.
AGI is a system that can efficiently acquire new skills outside of its training data.
You can notice the shift from measuring the skills themselves to checking the ability to acquire skills based on prior knowledge. In fact, they don't pretend this is the ground-truth definition of AGI; it is more like "all definitions of AGI are wrong, but some of them are useful".
Imagine someone pays you to select pictures of ducks out of a thousand different images. Based on your work we can train an AI, DuckBinaryClassifier, which will outperform you. How much effort will you need to invest to switch from duck to dog detection? Now you dominate and my model is useless, because it can't use its prior knowledge on this new task.
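To make the analogy concrete, here is a minimal sketch of such a narrow classifier in Keras (the data is just random noise standing in for real duck photos, and the whole setup is purely illustrative):

```python
import numpy as np
from tensorflow import keras

# Pretend dataset: 64x64 RGB images labelled duck (1) / not-duck (0).
# Random noise here, only to make the sketch runnable.
x_ducks = np.random.rand(200, 64, 64, 3).astype("float32")
y_ducks = np.random.randint(0, 2, size=(200,)).astype("float32")

# A tiny binary classifier: it can learn "duck vs not-duck" and nothing else.
model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 3)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_ducks, y_ducks, epochs=2, verbose=0)

# Switching to dogs means collecting a new labelled dataset and retraining
# from scratch; the model has no way to reuse its "duck knowledge" the way
# you do when someone simply changes the instructions.
```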
This is what ARC-AGI is about. Let's create a set of tasks where the solution:
- can be inferred from the examples provided
- can be solved by every non-insane human
Check this one, can you solve it?
I bet it takes you only a couple of seconds to identify the moving pattern here. By the way, this is an example where the oh-so-horrifying high-compute o3 failed.
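For reference, the public ARC-AGI tasks are small JSON files: a handful of "train" input/output grid pairs (integers 0-9 encoding colors) plus "test" inputs whose outputs you have to produce. A rough sketch of how you might load one and sanity-check a candidate solver against the demonstration pairs (the file path and the `solve` function are placeholders, not real code from the benchmark):

```python
import json

# Each public ARC-AGI task is a small JSON file with "train" and "test" pairs;
# grids are 2D lists of integers 0-9, where each integer is a color.
with open("data/training/some_task.json") as f:  # placeholder path
    task = json.load(f)

def solve(grid):
    """Placeholder for whatever rule you infer from the examples."""
    return grid

# A solver is only credible if it reproduces every demonstration pair...
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

# ...and then the same rule is applied to the held-out test inputs.
predictions = [solve(pair["input"]) for pair in task["test"]]
```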
o3 performance
I suppose many saw this chart
and this table
and many did the rough estimation of about $3k per task by multiplying 172 (the compute difference between the high- and low-compute configurations of o3) by $17 (the reported cost per task of the low-compute run).
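That back-of-the-envelope math is literally just this (both numbers are the approximate figures reported publicly, so treat the result as an order-of-magnitude estimate):

```python
low_compute_cost_per_task = 17   # ~$17 per task reported for low-compute o3
compute_multiplier = 172         # high-compute config reportedly used ~172x the compute

print(low_compute_cost_per_task * compute_multiplier)  # 2924 -> "about $3k per task"
```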
However, many people somehow missed the following as well:
OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
and this:
Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet.
and finally this:
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
So, once again:
- they showed us a model and called it AGI, or something close to it
- we don't know the real cost per task, but it seems to be huge
- this model was trained on 75% of the publicly available ARC-AGI training set
- we don't know the performance of a model not trained on ARC-AGI data
What the hell are we talking about here? AGI? To me it sounds like a pure marketing case, with a lot of important missing details that nobody pays attention to.
Final thoughts
We can say AGI is achieved when we run out of options for creating tasks where a human easily outperforms AI. Will this "AGI model", having passed every possible benchmark, be useful in an economic sense? Who knows. To answer this we need more benchmarks.
THERE WILL NEVER BE ENOUGH BENCHMARKS
What benchmarks are really good at is selling the idea of AI progress to the public. Think about your own experience with GPT-4 and GPT-4o: can you always tell them apart in your daily routine?
Everyone says it took just a couple of years to jump from the funny, hallucinating GPT-3 to the near-AGI o3. Well, yes. GPT-3's training cost was about $4.6 million, GPT-4's is estimated at over $100 million, and o1's is projected at around $500 million. What exceptional progress in throwing money away was made in just four years! Special thanks to benchmarks.
It is just impossible to go public and say: "Well, yeah... we've spent a couple of hundred million dollars and got a new model... yeah... it's naturally better... bigger... smarter... and yeah, for the next one we'll probably need a couple of billion..."
But when you have a benchmark like ARC-AGI, you are the king and now you can talk. Something like: "You know ARC-AGI? We beat everyone there. I'm not saying anything, but it looks like we are somewhere near the AGI stage." This mantra opens money streams to your company, and this is a field where OpenAI particularly excels.