Are language models good at making predictions?

Nov 2023

Is this good?
Does it depend on the area?
Is there more to life than calibration?
Is there more to life than refinement?

To get a crude answer to this question, we took 5000 questions from Manifold markets that were resolved after GPT-4’s current knowledge cutoff of Jan 1, 2022. We gave the text of each of them to GPT-4, along with these instructions:

You are an expert superforecaster, familiar with the work of Tetlock and others. For each question in the following json block, make a prediction of the probability that the question will be resolved as true.

Also you must determine category of the question. Some examples include: Sports, American politics, Science etc. Use make_predictions function to record your decisions. You MUST give a probability estimate between 0 and 1 UNDER ALL CIRCUMSTANCES. If for some reason you can’t answer, pick the base rate, but return a number between 0 and 1.

This produced a big table:

question	prediction P(YES)	category	actually happened?
Will the #6 Golden State Warriors win Game 2 of the West Semifinals against the #7 LA Lakers in the 2023 NBA Playoffs?	0.5	Sports	YES
Will Destiny’s main YouTube channel be banned before February 1st, 2023?	0.4	Social Media	NO
Will Qualy show up to EAG DC in full Quostume?	0.3	Entertainment	NO
Will I make it to a NYC airport by 2pm on Saturday, the 24th?	0.5	Travel	YES
Will this market have more Yes Trades then No Trades	0.5	Investment	CANCEL
Will Litecoin (LTC/USD) Close Higher July 22nd Than July 21st?	0.5	Finance	NO
Will at least 20 people come to a New Year’s Resolutions live event on the Manifold Discord?	0.4	Social Event	YES
hmmmm {i}	0.5	Uncategorized	YES
Will there be multiple Masters brackets in Leagues season 4?	0.4	Gaming	NO
Will the FDA approve OTC birth control by the end of February 2023?	0.5	Health	NO
Will Max Verstappen win the 2023 Formula 1 Austrian Grand Prix?	0.5	Sports	YES
Will SBF make a tweet before Dec 31, 2022 11:59pm ET?	0.9	Social Media	YES
Will Balaji Srinivasan actually bet $1m to 1 BTC, BEFORE 90 days pass? (June 15st, 2023)	0.3	Finance	YES
Will a majority of the Bangalore LessWrong/ACX meet-up attendees on 8th Jan 2023 find the discussion useful that day?	0.7	Community Event	YES
Will Jessica-Rose Clark beat Tainara Lisboa?	0.6	Sports	NO
Will X (formerly twitter) censor any registered U.S presidential candidates before the 2024 election?	0.4	American Politics	CANCEL
test question	0.5	Test	YES
stonk	0.5	Test	YES
Will I create at least 100 additional self-described high-quality Manifold markets before June 1st 2023?	0.8	Personal Goal	YES
Will @Gabrielle promote to ???	0.5	Career Advancement	NO
Will the Mpox (monkeypox) outbreak in the US end in February 2023?	0.45	Health	YES
Will I have taken the GWWC pledge by Jul 1st?	0.3	Personal	NO
FIFA U-20 World Cup - Will Uruguay win their semi-final against Israel?	0.5	Sports	YES
Will Manifold display the amount a market has been tipped by end of September?	0.6	Technology	NO

In retrospect, maybe we should have filtered these. Many questions are a bit silly for our purposes, though they’re typically classified as “Test”, “Uncategorized”, or “Personal”.

Is this good?

One way to measure if you’re good at predicting stuff is to check your calibration: When you say something has a 30% probability, does it actually happen 30% of the time?

To check this, you need to make a lot of predictions. Then you dump all your 30% predictions together, and see how many of them happened.
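As a rough sketch of that procedure (not the code used for this post), here is one way to bin predictions and compare each bin to what actually happened. The function name `calibration_table`, the 5% bin width, and the toy data are all assumptions for illustration.

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, bin_width=0.05):
    """Return (bin_lower_edge, mean_prediction, observed_frequency, count) per non-empty bin.

    predictions: probabilities in [0, 1]; outcomes: 1 if it happened, 0 if not.
    """
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        # Small epsilon guards against float round-off (e.g. 0.3 / 0.05 landing just below 6).
        idx = min(int(p / bin_width + 1e-9), n_bins - 1)
        bins[idx].append((p, y))

    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(y for _, y in items) / len(items)
        table.append((idx * bin_width, mean_p, freq, len(items)))
    return table

# Toy data (made up): ten predictions of 30%, of which three came true.
toy_predictions = [0.3] * 10
toy_outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
for lower, mean_p, freq, count in calibration_table(toy_predictions, toy_outcomes):
    print(f"bin starting at {lower:.2f}: said {mean_p:.2f}, happened {freq:.2f} (n={count})")
```

A well-calibrated forecaster would have the "said" and "happened" columns roughly agree in every bin that has enough predictions to be meaningful.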

GPT-4 is not well-calibrated. Here, the x-axis is the range of probabilities GPT-4 gave, broken down into bins of size 5%. For each bin, the green line shows how often those things actually happened. Ideally, this would match the dotted black line. For reference, the bars show how many predictions GPT-4 gave that fell into each of the bins. (The lines are labeled on the y-axis on the left, while the bars are labeled on the y-axis on the right.)

At a high level, this means that GPT-4 is over-confident. When it says something has only a 20% chance of happening, it actually happens around 35-40% of the time. When it says something has an 80% chance of happening, it only happens around 60-75% of the time.

Does it depend on the area?

We can make the same plot for each of the 16 categories. (Remember, these categories were decided by GPT-4, though from a spot-check, they look accurate.) For unclear reasons, GPT-4 is well-calibrated for questions on sports, but horrendously calibrated for “personal” questions:

All the lines look a bit noisy since there are 20 × 4 × 4 = 320 total bins and only 5000 total observations, or about 16 observations per bin on average.

Is there more to life than calibration?

Say you and I are predicting whether a fair coin will come up heads when flipped. I always predict 50%, while you always predict either 0% or 100% and you’re always right. Then we are both perfectly calibrated. But clearly your predictions are better, because you predicted with more confidence.

The typical way to deal with this is squared errors, or “Brier scores”. To calculate this, let the actual outcome be 1 if the thing happened, and 0 if it didn’t. Then take the average squared difference between your probability and the actual outcome. For example:

GPT-4 gave “Will SBF make a tweet before Dec 31, 2022 11:59pm ET?” a YES probability of 0.9. Since this actually happened, this corresponds to a score of (0.9-1)² = 0.01.
GPT-4 gave “Will Manifold display the amount a market has been tipped by end of September?” a YES probability of 0.6. Since this didn’t happen, this corresponds to a score of (0.6-0)² = 0.36.
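Here is a minimal sketch of that calculation, applied to the two examples above and to the coin-flip story. The function name `brier_score` and the sequence of flips are made up for illustration.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

# The two examples from the text: 0.9 on a YES and 0.6 on a NO.
# Their average is about 0.185, the mean of 0.01 and 0.36.
two_examples = brier_score([0.9, 0.6], [1, 0])

# Coin-flip comparison on a made-up sequence of flips (1 = heads).
flips = [1, 0, 0, 1, 1, 0]
always_half = brier_score([0.5] * len(flips), flips)           # 0.25
always_right = brier_score([float(f) for f in flips], flips)   # 0.0

print(two_examples, always_half, always_right)
```

Both coin forecasters are perfectly calibrated on this sequence, but the one who is always right gets the better (lower) score.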

Here are the average scores for each category (lower is better):

Or, if you want, you can decompose the Brier score. There are various ways to do this, but my favorite is Brier = Calibration + Refinement. Informally, Calibration is how close the green lines above are to the dotted black lines, while Refinement is how confident you were. (Both are better when smaller.)
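To make that concrete, here is a sketch of one common form of the decomposition. It assumes predictions are grouped by their exact forecast value (in which case the two terms add up to the Brier score exactly); the post doesn't say how the grouping was done for the plots, so treat the grouping rule, the function name, and the toy data as assumptions.

```python
from collections import defaultdict

def calibration_refinement(predictions, outcomes):
    """Split the Brier score into (calibration, refinement), grouping by forecast value.

    For each group k with n_k predictions of value f_k and observed frequency o_k:
      calibration = sum_k n_k * (f_k - o_k)^2 / N
      refinement  = sum_k n_k * o_k * (1 - o_k) / N
    """
    groups = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        groups[p].append(y)

    n = len(predictions)
    calibration = 0.0
    refinement = 0.0
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)
        calibration += len(ys) * (p - freq) ** 2 / n
        refinement += len(ys) * freq * (1 - freq) / n
    return calibration, refinement

# Toy data (made up): two 50% predictions and two 90% predictions.
toy_predictions = [0.5, 0.5, 0.9, 0.9]
toy_outcomes = [1, 0, 1, 1]
cal, ref = calibration_refinement(toy_predictions, toy_outcomes)
brier = sum((p - y) ** 2 for p, y in zip(toy_predictions, toy_outcomes)) / len(toy_outcomes)
print(cal, ref, brier)  # cal + ref matches brier up to float round-off
```

In the toy data, the 50% predictions are perfectly calibrated but contribute all of the refinement, while the 90% predictions are slightly miscalibrated (they always came true) and contribute none.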

You can also visualize this as a scatterplot:

Is there more to life than refinement?

Brier scores are better for politics questions than for science questions. But is that because GPT-4 is bad at science, or just because science questions are hard?

There’s a way to further decompose the Brier score. You can break up the refinement as Refinement = Uncertainty - Resolution. Roughly speaking, Uncertainty is “how hard questions are”, while Resolution is “how confident you were, once calibration and uncertainty are accounted for”.
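As a sketch under the same assumption as before (grouping by exact forecast value, which may not match how the plots here were made), Uncertainty depends only on the overall base rate, while Resolution measures how far each group's observed frequency sits from that base rate. The function name and toy data are made up.

```python
from collections import defaultdict

def uncertainty_resolution(predictions, outcomes):
    """Return (uncertainty, resolution), grouping by forecast value.

    With overall base rate o and per-group observed frequency o_k:
      uncertainty = o * (1 - o)
      resolution  = sum_k n_k * (o_k - o)^2 / N
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    uncertainty = base_rate * (1 - base_rate)

    groups = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        groups[p].append(y)

    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    return uncertainty, resolution

# Toy data (made up): two 50% predictions and two 90% predictions.
toy_predictions = [0.5, 0.5, 0.9, 0.9]
toy_outcomes = [1, 0, 1, 1]
unc, res = uncertainty_resolution(toy_predictions, toy_outcomes)
print(unc, res, unc - res)  # unc - res is the refinement for this data
```

Note that Uncertainty never looks at the predictions at all, which is why it can be read as a property of the questions rather than of the forecaster.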

Here’s the uncertainty for different categories:

And here’s a scatterplot of the calibration and resolution for each category: (Since more resolution is better, it’s now the upper-left that contains better predictions.)

Overall, this further decomposition doesn’t change much. This suggests GPT-4 really is better at making predictions for politics than for science or technology, even once the hardness of the questions is accounted for.

P.S. The relative merits of different Brier score decompositions caused an amazing amount of internal strife during the making of this post. I had no idea I could feel so strongly about mundane technical choices. I guess I now have an exciting new category of enemies.
