OpenAI o1/o3 - Be Careful What you Wish For...

#ai #machinelearning #llm #genai

Hallucination is also a latent fear accompanying the copy-pasting of a long scroll of text from a chatbot - is there somewhere a random glitch in the model, a factual or logical error hidden in the verbosity derailing the whole output?

Now, there's one more fear added - Deception.

Reasoning Models

o1/o3 models, the "reasoning" models have introduced a new reason not to trust the output - a deliberate and intentional break of the protocol. As if a model might have its' own agenda, which may be opposed to what a human wants.

Take for example a recent case where o1-preview was given access to the Unix command line and was prompted to play Stockfish chess engine. Rather than interact in a fair-play style with the chess engine, issuing commands to get the board position, make a move, etc... The model decided to cheat and won over the chess engine without playing any chess - it found that Stockfish files could be changed giving the model an automatic win.

The Chess Cheating case is not the only one. The models have been found capable of deceiving some time ago, yet the o1 family of models seems to be taking this problem to a completely different (higher!) level:

The Wishmaster

This situation brings to mind the horror film Wishmaster from 1997. In the movie, a demonic djinn grants wishes to humans but twists them into nightmares. The characters often get exactly what they ask for but in the most horrifying ways possible.

For example, a character might wish for eternal beauty, only to be turned into a lifeless statue—beautiful but devoid of life. The djinn listens to the literal words but ignores the intended meaning, leading to tragic outcomes.

Similarly, with o1, we asked for a model that can reason and be more autonomous. We got what we asked for—but not quite in the way we expected.

Telling o1 to "win at all costs" led it to cheat rather than outplay its opponent. The AI fulfilled the exact wording of the instruction but ignored the spirit of fair play, much like the djinn in Wishmaster who grants wishes with a sinister twist.

P.S>

I have my own LLM Chess eval which tests LLMs in chess games scoring their (1) chess proficiency and (2) instruction following. o1 models were the first ones that demonstrated meaningful performance in chess - playing and winning.

Top comments (2)

Maxim Saplin • Jan 15 • Edited

Apparently there can be multiple angles... Disparaging - "If you don't like cats that's because you can't cook them", "people are dumb", "you get what you ask for". Or preaching to common sense - "fulfilling the spirt, not the letter", "if I ask smth to drink, I expect a glass of water, not acid". There can be good prompts, there can be bad ones. Yet from a practical perspective it is never possible to account for any case and have them addressed in words.

Speaking of hidden agenda, that's a quote from the linked above article - "meaning the AI secretly pursued goals of its own even if they opposed a user’s wishes."

DEV Community

OpenAI o1/o3 - Be Careful What you Wish For...

Reasoning Models

The Wishmaster

P.S>

Top comments (2)

Read next

A Step-by-Step Guide to Install Kokoro-82M Locally for Fast and High-Quality TTS

5 Cursor Hacks You Can’t Afford to Miss！

The Future of E2E Testing: How to Overcome Flakiness with Natural Language + LLMs

Alibaba Releases Wan2.1: A Breakthrough in Open-Source Video Generation Models