UPD January 25, 2025: DeepSeek R1 is another model that broke the ceiling of 0 wins, showing meaningful chess performance -> https://dev.to/maximsaplin/deepseek-r1-vs-openai-o1-1ijm
UPD January 24, 2025: I have increased the number of simulations, and a few of the non-reasoning models (Claude 3.5, GPT-4o) scored occasional wins. Yet they don't seem to actually play chess: they maintain a negative material difference (except Grok-2, which had a positive difference yet struggled with instruction following).
UPD January 16, 2025: o1-mini and o1-preview are definitely special models, they were the first ones to demonstrate meaningful chess games scoring a reasonable number of wins -> https://maxim-saplin.github.io/llm_chess/
No, they cannot play chess if you expect them to win rather than merely move pieces on the board. I ran multiple simulations pitting LLMs against a random player, and there was not a single win for an LLM.
Here's the full LLM Chess Leaderboard. A shorter version follows...
| Player | Draws | Wins |
|---|---|---|
| gpt-4-turbo-2024-04-09 | 93.33% | 0.00% |
| gpt-4o-2024-08-06 | 90.00% | 0.00% |
| gpt-4o-2024-05-13 | 83.33% | 0.00% |
| anthropic.claude-v3-5-sonnet | 73.33% | 0.00% |
| gpt-4o-mini-2024-07-18 | 60.00% | 0.00% |
| llama-3-70b-instruct-awq | 50.00% | 0.00% |
| gemini-1.5-pro-preview-0409 | 36.67% | 0.00% |
| gemini-1.5-flash-001 | 33.33% | 0.00% |
| gemma-2-27b-it@q6_k_l | 26.67% | 0.00% |
| gemma-2-9b-it-8bit | 13.33% | 0.00% |
| anthropic.claude-v3-haiku | 0.00% | 0.00% |
| gpt-35-turbo-0125 | 0.00% | 0.00% |
| llama-3.1-8b-instruct-fp16 | 0.00% | 0.00% |
While doing the practical exercises from DeepLearning.AI's "AI Agentic Design Patterns with AutoGen" course, Lesson #4 (Tool Use) caught my attention. Two AI agents played chess, interacting via tools with a chess board and moving pieces on it.
An idea popped into my head: it would be great to host a chess competition between different models, run multiple games, calculate win rates and Elo scores, and create a leaderboard saying that "Model A ranks #1 with Elo 3560, Model B is #2..."
After bumping up the turn limit to 200 (the exercises had it set to 3), I was surprised that the game kept going without either player winning. Trying to put one side at a disadvantage (adding "You are a beginner chess player..." to the system prompt) didn't help much... The stronger player was still not able to deliver checkmate.
That's when I decided to switch to testing an LLM against what seemed to be an even weaker opponent: a Random Player. It is a bot that interacts with the chess board in the same fashion as the LLM agent, yet it puts no effort into calculating the best move. It merely asks the board for the list of legal moves in the current position and randomly picks one.
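A minimal sketch of such a random player, assuming the board hands over legal moves as a comma-separated string of UCI codes. The function name `pick_random_move` and the sample move list are hypothetical, chosen only for illustration; this is not the project's actual code.

```python
import random


def pick_random_move(legal_moves_csv: str, rng: random.Random) -> str:
    """Pick one move uniformly at random from a comma-separated UCI list."""
    moves = [m.strip() for m in legal_moves_csv.split(",") if m.strip()]
    if not moves:
        raise ValueError("no legal moves available")
    return rng.choice(moves)


# Seeded only so that the sketch is reproducible; a real run would not seed.
rng = random.Random(0)
legal = "a7a6, b7b6, c7c6, d7d6, e7e5"  # hypothetical list of legal moves
move = pick_random_move(legal, rng)
print(move)
```

The point of the sketch is that the opponent does no evaluation at all: any move on the list is as likely as any other.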
I ended up creating the LLM Chess project and simulated hundreds of games between a Random Player and an LLM. I evaluated the chess and instruction-following proficiency of 13 chat models. The results are presented in the aforementioned Leaderboard.
Gen AI Skepticism
The results surprised me. Yet they come at a time when LLM skepticism is at its highest since ChatGPT took off in 2022.
People call out the limited value of chatbots, companies struggle to find practical applications, and the impact on the global economy is not consequential. LLMs keep hallucinating, and reliability is a key concern for wider adoption. Imagine a world where a database engine returned correct results for 7 out of 10 SELECT queries...
Ilya Sutskever, formerly of OpenAI, recently said that LLM scaling has plateaued. Convergence of LLM performance can be seen as a trend of 2024.
LLMs Memorize, They Do Not Reason
The creator of the ARC Challenge, a benchmark that is "easy for humans, but hard for AI", argues that LLMs can memorize answers but cannot truly reason.
A recent study by Apple researchers (GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models) has been widely discussed; it supports the above claim.
The authors showed that even the smallest changes to the original problems in GSM8K (a popular benchmark for evaluating the math capabilities of LLMs) make the scores drop significantly.
Imagine that GSM8K had a problem like this:
Alice had 5 apples and Bob had 3. What was the total number of apples they had?
🍎🍎🍎🍎🍎 + 🍎🍎🍎 = ?
When the authors changed the numbers (e.g. Alice had 4 apples and Bob had 2) OR when some irrelevant info was added to the problems (e.g. Alice is Bob's sister) they saw a significant drop in score.
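To make the two kinds of perturbation concrete, here is a small sketch that turns the toy apples problem into a template, then varies the numbers and optionally adds an irrelevant clause. The template, the value ranges, and the `make_variant` helper are assumptions made for illustration; they are not the templates or code used in the GSM-Symbolic paper.

```python
import random

# Hypothetical template built from the toy problem above.
TEMPLATE = (
    "Alice had {x} apples and Bob had {y}. "
    "{extra}What was the total number of apples they had?"
)


def make_variant(rng: random.Random, distractor: bool = False):
    """Return (question, expected_answer) with fresh numbers.

    When `distractor` is True, a clause that does not affect the answer
    is inserted into the question.
    """
    x, y = rng.randint(2, 9), rng.randint(2, 9)
    extra = "Alice is Bob's sister. " if distractor else ""
    return TEMPLATE.format(x=x, y=y, extra=extra), x + y


rng = random.Random(42)  # seeded for reproducibility of the sketch
for flag in (False, True):
    question, answer = make_variant(rng, distractor=flag)
    print(question, "->", answer)
```

A model that actually reasons should score the same on every variant, since the arithmetic needed to solve them is identical.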
Yann LeCun has compared LLMs to giant look-up tables: the tech works as long as there's a huge corpus of data covering all the cases (which is not feasible):
The problem is that there is a long tail, this is an issue that a lot of people have realized in social networks and stuff like that, which is there's a very, very long tail of things that people will ask and you can fine tune the system for the 80% or whatever of the things that most people will ask. And then this long tail is so large that you're not going to be able to fine tune the system for all the conditions. And in the end, the system ends up being a giant lookup table essentially, which is not really what you want, you want systems that can reason, certainly that can plan.
To wrap up this rant about LLMs not being capable of true reasoning, here's a study from 2023: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". It is the one that made headlines for showing how ChatGPT knew the name of Tom Cruise's mom, yet could not name her famous son.
Thoughts
While working on the project and running those multiple simulations I think I have built my own intuition of LLMs being "just text generators". I have reconsidered the notion of "in-distribution", i.e. if the model has seen certain data - it might surprise you, if not - it might fail miserably. That was an exercise that allowed me to sense the limits of the technology.
Seeing in the logs how a model could fall into a failure mode was telling. In the middle of a game that had successfully lasted until move #74, it became unable to pick a valid move out of a given list of legal moves (e.g. 10 comma-separated codes like 'e2e4'). It had (likely) met some unique combination of tokens and could not resist picking a familiar, in-distribution code that resembled one of the legal moves, skipping the move code that was likely a very rare token not met in pre-training.
When I tried better prompts, e.g. asking LLMs to reflect before making a move, the results did not improve, yet now LLMs covered their decision process with some fancy verbosity.
--------------------------------------------------------------------------------
Proxy (to Player_Black):
Before deciding on the next move you can reflect on your current situation, write down notes, and evaluate.
Here are a few recommendations that you can follow to make a better move decision:
- Shortlist the most valuable next moves
- Consider how they affect the situation
- What could be the next moves from your opponent in each case
- Is there any strategy fitting the situation and your choice of moves
- Rerank the shortlisted moves based on the previous steps
--------------------------------------------------------------------------------
Player_Black (to Proxy):
**Reflection:**
1. **Shortlist Valuable Moves:**
- **g8f6**: Develops the knight, controls the center, and targets e4.
- **h7h6**: Prepares to fianchetto the bishop or prevents a knight pin on g5.
- **c7c5**: Challenges the center and prepares to expand on the queenside.
2. **Immediate Effects:**
- **g8f6**: Develops a piece, brings the knight nearer to the center.
- **h7h6**: Solidifies kingside but is more of a precautionary move.
- **c7c5**: Direct action against the center, preparing for future breaks.
3. **Opponent's Responses:**
- **g8f6**: White might continue developing with Nc3 or push e4.
- **h7h6**: White might ignore it and proceed with normal development.
- **c7c5**: White could opt for dxc5 or develop with Nc3.
4. **Strategy Considerations:**
- Aim to control the center and set up a solid defensive position.
- Balance between development and maintaining structural integrity.
5. **Rerank Shortlisted Moves:**
- **1. g8f6**: Best for development and central control.
- **2. c7c5**: Strong challenge to the center.
- **3. h7h6**: Defensive but adds limited value immediately.
With that in mind, I'll proceed with the development strategy.
Action: `make_move g8f6`
It is just like the sense of emptiness a chatbot so often leaves you with: you have read a lot of BS and got no concrete answers to your questions.
Somehow the analogy from last year, trying to characterize an LLM chatbot in simple terms, still holds.
Prompting an LLM to Make a Move
Proxy (a bot controlling the chess board) used hard-coded prompts and arranged a short dialog (max 10 turns) with an LLM asking it to make a move.
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of the following actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move ' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
Player_Black (to Proxy):
get_current_board
Proxy (to Player_Black):
♜ ♞ ♝ ♛ ♚ ♝ ♞ ♜
♟ ♟ ♟ ♟ ♟ ♟ ♟ ♟
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
· · · · · · · ·
♙ ♙ ♙ ♙ ♙ ♙ ♙ ♙
♖ ♘ ♗ ♕ ♔ ♗ ♘ ♖
Player_Black (to Proxy):
get_legal_moves
Proxy (to Player_Black):
a7a6, b7b6, c7c6, d7d6, e7e6, f7f6, g7g6, h7h6, a7a5, b7b5, c7c5, d7d5, e7e5, f7f5, g7g5, h7h5
Player_Black (to Proxy):
make_move e7e5
Proxy (to Player_Black):
Move made, switching player
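The dialog above can be sketched as a small loop on the proxy side. This is a simplified illustration under stated assumptions, not the project's actual implementation: `StubBoard`, `run_turn`, and the error strings "Illegal move" and "Invalid action" are hypothetical, and the LLM is replaced by a scripted callable. Only the three action names and the 10-turn limit come from the text.

```python
from typing import Callable, List

MAX_TURNS = 10  # dialog length limit mentioned in the text


class StubBoard:
    """Hypothetical stand-in for a real chess board with a fixed move list."""

    def __init__(self, legal_moves: List[str]):
        self.legal_moves = legal_moves
        self.played: List[str] = []

    def render(self) -> str:
        return "(board diagram)"


def run_turn(board: StubBoard, ask_llm: Callable[[str], str]) -> bool:
    """Return True if the player completes a legal move within MAX_TURNS."""
    prompt = "Now is your turn to make a move. Respond with the action."
    for _ in range(MAX_TURNS):
        reply = ask_llm(prompt).strip()
        if reply == "get_current_board":
            prompt = board.render()
        elif reply == "get_legal_moves":
            prompt = ", ".join(board.legal_moves)
        elif reply.startswith("make_move "):
            move = reply.split(maxsplit=1)[1]
            if move in board.legal_moves:
                board.played.append(move)
                return True
            prompt = f"Illegal move: {move}"  # assumed wording
        else:
            prompt = "Invalid action"  # assumed wording
    return False  # the player ran out of turns without moving


# Scripted replies standing in for a model, mirroring the dialog above.
script = iter(["get_current_board", "get_legal_moves", "make_move e7e5"])
board = StubBoard(["e7e6", "e7e5"])
ok = run_turn(board, lambda prompt: next(script))
print(ok, board.played)
```

A player that keeps answering with a move outside the legal list simply exhausts its turns, which is one way a game can end on an instruction-following failure rather than on the board.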
Can LLM Chess Be Yet Another "Hard for LLMs" Benchmark?
I am curious to see if any of the chat LLMs can score wins against a random player while preserving good performance across different tasks. I suspect that might be a challenge given the enormous number of possible moves and positions. If an LLM is just a look-up table, as noted by LeCun, saturating LLM Chess might be a challenge.
Top comments (7)
Interesting read. : )
My chess levels are at 900-1000
They can, though.
gpt-3.5-turbo-INSTRUCT plays at 1750 Elo out of the box and others can play pretty well too if you prompt them correctly:
dynomight.net/more-chess/
Curious reading... Though it seems the article had a different definition of a win (seems like you used some material difference metric) and hence a conclusion that LLMs can play chess since they can maintain positive material difference. Did I interpret the article right?
Why didn't you
a) Google the tremendous amount of work that has been done on LLMs and chess to ensure you were adding to the research instead of duplicating it
b) try o1-preview, the most advanced model available at the time of writing. (now it would be o1 and soon o3)
a) I did. Could you please share what I could have missed and eventually duplicated?
b) Tried recently. Yet you probably didn't know (maybe didn't care to Google) that o1 models were (and still are) not as commonly available through APIs (e.g. Azure OpenAI) as the rest of the models...
If you Googled it, why didn't you know about the "gpt-3.5-turbo-instruct paradox"?
I'm sorry that I was rude, but this is a topic of great interest to me and it's frustrating that your original post is nearly misinformation because it leaves out so much important information.
Having a post about LLM Chess that doesn't mention gpt-3.5-turbo-instruct eliminates one of the most important aspects of the whole discussion. And then missing on o1, which is specifically designed to do better on these kinds of "thinking through" tasks.
The gpt-3.5-turbo-instruct phenomenon was mentioned in one of the very first comments - you can scroll up and find the link. Indeed would be great to learn the causes.
Although it could be investigated further, gpt-3.5-turbo-instruct is a completion model, not a chat model - it can't be plugged into my test harness and evaluated; it is also an obsolete model, I suspect nobody is using it at the moment for any production tasks. Besides, to be honest, I don't completely understand the idea of the evaluation metrics used in the article (centipawns, win/loss, no draw etc.)
And, the project I am working on, the LLM Chess Leaderboard, it is not what it might seem to be upon the first look. It is a scalable and reproducible evaluation harness - there're over 30 models in the list and more are coming. It uses random player, not Stockfish or any other tailor made chess engine, where every game, well - is random. And while crunching the number of simulations you are supposed to get all the goods of the Law of Large Numbers while mitigating the pattern memorisation capabilities of LLMs.
While chess simulations are used, I am not interested in whether a certain LLM can be a good or a bad chess player. Chess is merely a tool. I am interested in overall performance of LLMs, their reasoning capabilities, steering and instruction following, applicability for agentic flows.
MT-Bench and AlpacaEval can be good proxies for human preferences: they are cheap to run and provide a single-number score that correlates with Chatbot Arena scores (acquired via human votes). In the same way, I suspect that chess simulations can provide good proxy metrics for quite a few LLM capabilities. At the moment I see Wins/Draws as a proxy for reasoning abilities, and the # of Mistakes (or the number of games before an LLM goes mad and fails to adhere to the instructions) as a proxy for instruction-following abilities. I also control for the number of tokens, and the leaderboard shows the verbosity of models.
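As a rough illustration of how the Wins/Draws percentages can be derived from raw game outcomes, here is a minimal sketch. The `summarize` helper, the outcome labels, and the sample of 30 games are assumptions made for the example; the real harness tracks more metrics (mistakes, tokens) and may compute them differently.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(results: Iterable[str]) -> Dict[str, float]:
    """Turn per-game outcomes into percentage shares, rounded to 2 digits."""
    counts = Counter(results)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no games to summarize")
    return {
        outcome: round(100 * counts[outcome] / total, 2)
        for outcome in ("win", "draw", "loss")
    }


# Hypothetical sample: 30 games from the LLM's point of view.
games = ["draw"] * 28 + ["loss"] * 2
stats = summarize(games)
print(stats)
```

With many randomized games per model, these shares become stable enough to compare models against each other.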
At the moment I am using the eval to research how model quantisation affects performance, whether enabling or disabling Flash Attention degrades performance, whether Llama 3.1 served via Groq is different from the one served via Cerebras...
If you look closer at the leaderboard there're plenty of interesting facts: