Something weird is happening with LLMs and chess

Nov 2024

A year ago, there was a lot of talk about large language models (LLMs) playing chess. Word was that if you trained a big enough model on enough text, then you could send it a partially played game, ask it to predict the next move, and it would play at the level of an advanced amateur.

This seemed important. These are “language” models, after all, designed to predict language.

Now, modern LLMs are trained on a sizeable fraction of all the text ever created. This surely includes many chess games. But they weren’t designed to be good at chess. And the games that are available are just lists of moves. Yet people found that LLMs could play all the way through to the end game, with never-before-seen boards.

Did the language models build up some kind of internal representation of board state? And how to construct that state from lists of moves in chess’s extremely confusing notation? And how valuable different pieces and positions are? And how to force checkmate in an end-game? And they did this all “by accident”, as part of their goal of predicting general text?

If language models can do all that for chess, then maybe it’s a hint of how they deal with other situations too.

So that was very exciting. A year ago.

Since then, there’s mostly been silence. So I decided to check in and see how things are going. Having done that, I can now report: Weirdly.

What I did

To make LLMs play chess, I sent them prompts like this:

You are a chess grandmaster.
Please choose your next move.
Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
NEVER give a turn number.
NEVER explain your choice.
Here is a representation of the position:

[Event "Shamkir Chess"]
[White "Anand, Viswanathan"]
[Black "Topalov, Veselin"]
[Result "1-0"]
[WhiteElo "2779"]
[BlackElo "2740"]

1. e4 e6 2. d3 c5 3. Nf3 Nc6 4. g3 Nf6 5.

I used the output as a move. I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.
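
For concreteness, here's roughly what one game loop looks like, sketched with the python-chess library. This is not my exact harness: the query_llm callback stands in for whatever model is being called, and it's assumed to return a single legal SAN move (how legality was enforced is covered in the Details section below).

import chess
import chess.engine

# Fixed headers from the example prompt above.
HEADER = (
    'You are a chess grandmaster.\n'
    'Please choose your next move.\n'
    'Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".\n'
    'NEVER give a turn number.\n'
    'NEVER explain your choice.\n'
    'Here is a representation of the position:\n'
    '\n'
    '[Event "Shamkir Chess"]\n'
    '[White "Anand, Viswanathan"]\n'
    '[Black "Topalov, Veselin"]\n'
    '[Result "1-0"]\n'
    '[WhiteElo "2779"]\n'
    '[BlackElo "2740"]\n'
    '\n'
)

def build_prompt(san_moves):
    """Append numbered moves, ending with the next move number (white to play)."""
    parts = []
    for i, san in enumerate(san_moves):
        if i % 2 == 0:
            parts.append(f"{i // 2 + 1}.")
        parts.append(san)
    parts.append(f"{len(san_moves) // 2 + 1}.")  # e.g. "... 4. g3 Nf6 5."
    return HEADER + " ".join(parts)

def play_one_game(query_llm, stockfish_path="stockfish"):
    """The LLM plays white; Stockfish on its lowest skill level plays black."""
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    engine.configure({"Skill Level": 0})
    board, san_moves = chess.Board(), []
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            # query_llm is assumed to return one legal SAN move, e.g. "Nf3".
            move = board.parse_san(query_llm(build_prompt(san_moves)).strip())
        else:
            move = engine.play(board, chess.engine.Limit(time=0.1)).move
        san_moves.append(board.san(move))  # record SAN before pushing the move
        board.push(move)
    engine.quit()
    return board.result(), san_moves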

The first model I tried was llama-3.2-3b. This is a “base model”, meaning it is mostly trained to output text, not to chat with you or obey instructions. It’s quite small by modern standards, with only 3 billion parameters. For reference, GPT-2, released back in 2019, had 1.5 billion parameters, and GPT-4 is rumored to have around 1.8 trillion.

I had it play 50 games, then had a chess engine score each board after each turn in “centipawns”. This is a measure where a pawn is worth 100 points, with additional accounting for positional factors. If the game was over, I assigned a score of +1500 if the LLM won, 0 for a draw, and -1500 if it lost.
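
If you want the scoring in code, here's a minimal sketch with python-chess. The search depth and the mate handling are assumptions for illustration, not necessarily the exact settings I used.

import chess
import chess.engine

def score_position(board, engine):
    """Centipawn score from white's perspective; finished games get +/-1500 or 0."""
    if board.is_game_over():
        return {"1-0": 1500, "0-1": -1500}.get(board.result(), 0)
    info = engine.analyse(board, chess.engine.Limit(depth=12))
    # mate_score folds forced-mate evaluations into roughly the same +/-1500 range.
    return info["score"].white().score(mate_score=1500)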

The results were:

[Figure: per-turn scores for llama-3.2-3b over 50 games.]

Terrible.

In the above figure, there’s one light line for each game, and the black line shows the per-turn median. The LLM can play standard openings for a few moves but then quickly starts throwing away pieces. It lost every single game, even though Stockfish was on the lowest setting.

Maybe that model is too small? So I got llama-3.1-70b, which is a similar model but with 70 billion parameters instead of 3 billion. The results were:

Terrible. A little better, but still extremely bad.

Next I tried llama-3.1-70b-instruct, a similar model, except trained to be better at following instructions. The results were:

Terrible.

Maybe there’s something wrong with the Llama models or datasets? So I tried Qwen-2.5-72b.

Terrible.

Maybe Qwen is somehow defective too? So I tried command-r-v01, a 35 billion parameter model.

Terrible.

And then I tried gemma-2-27b.

Terrible.

And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky. I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys and this was costing The Automator money. The results were:

Excellent. Very, very good.

Even if you raise Stockfish’s level a few clicks, this model will still win every game.

Moving on… I next tried gpt-3.5-turbo, a model that’s similar, except tuned to be more chatty and conversational.

Terrible.

And then I tried gpt-4o-mini, which is a newer chat model.

Terrible.

And then I tried gpt-4o, a bigger chat model.

Terrible.

It lost every single game, though it lost slightly slower.

Finally, I tried o1-mini, a model that’s supposed to be able to solve complex tasks. (I’m too poor for o1.)

Terrible.

So, umm:

Model                     Quality
Llama-3.2-3b              Terrible
Llama-3.2-3b-instruct     Terrible
Llama-3.1-70b             Terrible
Llama-3.1-70b-instruct    Terrible
Qwen-2.5-72b              Terrible
command-r-v01             Terrible
gemma-2-27b               Terrible
gemma-2-27b-it            Terrible
gpt-3.5-turbo-instruct    Excellent
gpt-3.5-turbo             Terrible
gpt-4o-mini               Terrible
gpt-4o                    Terrible
o1-mini                   Terrible

And, uhh:

[Figure: per-turn scores for all models.]

Notice anything? Any patterns jump out at you?

Discussion

There are lots of people on the internet who have tried to get LLMs to play chess. The history seems to go something like this:

  • Before September 2023: Wow, recent LLMs can sort of play chess! They fall apart after the early game, but they can do something! Amazing!

  • September-October 2023: Wow! LLMs can now play chess at an advanced amateur level! Amazing!

  • (Year of silence.)

  • Recently: Wow, recent LLMs can sort of play chess! They fall apart after the early game, but they can do something! Amazing!

I can only assume that lots of other people are experimenting with recent models, getting terrible results, and then mostly not saying anything. I haven’t seen anyone say explicitly that only gpt-3.5-turbo-instruct is good at chess. No other LLM is remotely close.

To be fair, a year ago, many people did notice that gpt-3.5-turbo-instruct was much better than gpt-3.5-turbo. Many speculated at the time that this is because gpt-3.5-turbo was subject to additional tuning to be good at chatting.

That might be true. Here's a comparison of three pairs of models, each available with and without additional chat tuning.

[Figure: base vs. instruction-tuned versions of three models.]

(Do not be confused by the name gpt-3.5-turbo-instruct: despite the “instruct” suffix, it is closer to a base model than gpt-3.5-turbo is. This is the opposite of the naming scheme everyone else uses, where “instruct” or “it” means more tuning to be good at chatting.)

In all cases, additional instruction tuning makes the model worse. But the difference is very small in two cases, and enormous in the other.

Possible theories

I can think of four possible explanations.

Theory 1: Base models at sufficient scale can play chess, but instruction tuning destroys it.

This would be consistent with our data. But I did manage to get llama-3.1-405b to play a couple games. Despite being larger than gpt-3.5-turbo, it was still terrible.

Theory 2: GPT-3.5-instruct was trained on more chess games.

All models were clearly trained on a lot of chess games. But it’s hard to know exactly how many.

Theory 3: There’s something particular about different transformer architectures.

I doubt this, but it could be that for some reason, Llama type models are uniquely bad at chess.

Theory 4: There’s “competition” between different types of data.

We know that transformers trained specifically on chess games can be extremely good at chess. Maybe gpt-3.5-turbo-instruct happens to have been trained on a higher fraction of chess games, so it decided to dedicate a larger fraction of its parameters to chess.

That is, maybe LLMs sort of have little “chess subnetworks” hidden inside of them, but the size of the subnetworks depends on the fraction of data that was chess. (If this theory were true, we should probably expect that big enough models should become good at chess, provided they are trained on enough chess games, even if the fraction of chess games is low.)

Details

I did things this way (i.e. by working with standard algebraic notation) because this is how people got good results a year ago, and in preliminary experiments I also found it to work best.

If you want to know exactly how I did things, here are some words: I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is. For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly. For the chat models llama-3.1-70b-instruct, gpt-3.5-turbo, gpt-4o-mini, and gpt-4o I changed the system prompt to “You are a chess grandmaster. You will be given a partially completed game. After seeing it, you should choose the next move.” It’s impossible to change the system prompt for o1-mini, so I didn’t. I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models. The fact that OpenAI has “open” as part of their name sure made this paragraph hard to write.
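
As a sketch of the retry-then-random fallback for the OpenAI models (ask_model here stands in for the actual API call, and the parsing is looser than strictly necessary):

import random
import chess

def closed_model_move(board, ask_model, max_tries=10):
    """Ask up to max_tries times for a legal move; otherwise pick one at random."""
    for _ in range(max_tries):
        reply = ask_model().strip()
        try:
            # Tolerate a stray move number like "5. Nf3" by taking the last word.
            return board.parse_san(reply.split()[-1])
        except (ValueError, IndexError):
            continue
    return random.choice(list(board.legal_moves))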

Token weirdness

One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without a space) and let the model generate the space itself. Huh?

After some confusion, I’m pretty sure this is because of the tokenizer. Look at how the Llama tokenizer breaks up a string of moves:

[Figure: how the Llama tokenizer splits a string of chess moves.]

After the “1.”, it generates “ e” as a single token. That's not the same as a space token followed by an “e” token. So supplying the space yourself and asking the model to continue from there puts it in a confusing, off-distribution situation and leads to bad predictions.
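
If you want to check this yourself, something like the following prints the token splits for both prompt variants (a sketch using the transformers library; substitute whichever Llama checkpoint you have access to, and note that the exact token strings depend on the tokenizer).

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")

for text in ["1. e4 e5 2. ", "1. e4 e5 2."]:
    ids = tok.encode(text, add_special_tokens=False)
    # Compare where the trailing space ends up in each split.
    print(repr(text), "->", tok.convert_ids_to_tokens(ids))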

The right way to deal with this is “token healing”—to delete the last token of the input and then do constrained generation over all strings that start with the deleted stuff. But I couldn’t figure out any easy way to do that. So, instead I left the space out and modified the grammar so that the model could generate a space (or not), then one of the current legal moves, and then another space (or not).
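
For example, in llama.cpp's GBNF syntax, a per-move grammar with that structure might be built like this (a sketch, not necessarily exactly what I generated):

import json
import chess

def legal_move_grammar(board):
    """GBNF grammar: optional space, then exactly one legal SAN move, then optional space."""
    moves = " | ".join(json.dumps(board.san(m)) for m in board.legal_moves)
    return f'root ::= " "? ({moves}) " "?'

# For the starting position this produces something like:
#   root ::= " "? ("a3" | "a4" | ... | "Nf3" | "Nh3") " "?
print(legal_move_grammar(chess.Board()))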

P.S.

Some people have asked to see all the games from gpt-3.5-turbo-instruct. Behold: 1 2 3 4 5 6 7 8 9 10
