# First-principles on AI scaling

Updated Mar 2023

It’s hard not to feel blinkered by recent AI progress. Every week there seems to be an amazing new system with unprecedented capabilities. It’s impossible not to wonder what the future holds.

Until recently, I thought progress was so dizzying and unpredictable that the best bet was to throw all the details out the window and just rely on a simple Outside View. Something like, “The longer AI systems keep getting better, the longer we should expect them to continue getting better”.

But after looking into things, I was wrong. We know enough to form a credible Inside View. We can do this:

1. Use scaling laws to guess how much large language models (LLMs) will get better at predicting words if you add more computational power or more data.
2. Use historical trends to guess how much being better at predicting words will translate into more “intelligence”.

This post uses that approach to take on these questions:

• If we had 10x or 1000x more data, what would change?
• If we had 10x or 1000x more compute, what would change?
• How much data (or compute) is needed to really move the needle? Does enough even exist?
• What are AI companies likely to do now?
• What fundamental advances would be enough to change these dynamics?
• Why might this all be wrong?

This post assumes some vague familiarity with LLMs. If you don’t have that (is anyone out there?) you probably want to read your friend the language model first.

## Loss and scale

How good will language models get and how fast? You might think this question has no useful answer. But—surprisingly enough—we know enough to make a not-terrible guess.

### What is loss?

Models are getting better. But how to quantify better? The simplest option is to take lots of new text, feed in one word at a time, and check how well the model would have predicted it. In principle you could do this in various ways, e.g. using a Brier score. But people typically use a measure that’s variously known as the “log loss” or “log likelihood” or “cross entropy”. This is natural because it’s what LLMs are trained on, so it’s a direct numerical measure of how good a job an LLM is doing of what’s asked of it. Lower is better.
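As a toy illustration of the measure described above, here is a minimal sketch of log loss. The probability lists are made up for the example; a real evaluation would take them from a model's predictions on held-out text.

```python
import math

def log_loss(probs_of_true_words):
    """Average negative log-probability (in nats) assigned to the words that actually came next."""
    return -sum(math.log(p) for p in probs_of_true_words) / len(probs_of_true_words)

# Hypothetical probabilities that two models gave to the true next word at three positions.
confident_model = [0.9, 0.8, 0.95]
unsure_model = [0.2, 0.1, 0.3]

print(f"confident model: {log_loss(confident_model):.3f}")
print(f"unsure model:    {log_loss(unsure_model):.3f}")
```

The model that puts more probability on the words that actually occur gets the lower number, which is why lower is better.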

### Should you care about loss?

A loss is just a number. What matters is how a model behaves, right? I like measures that are real and concrete. BigBench is a giant collection of different language tasks. Here are prompts from a few:

hyperbaton: Which sentence has the correct adjective order: a “old-fashioned circular leather exercise car” b “circular exercise old-fashioned leather car”?

strategyqa: Are all the elements plants need for photosynthesis present in atmosphere of Mars?

mathematical_induction: A: 3 is an odd integer. k + 3 is odd for any odd k. Therefore, by two steps of induction, -3 is odd. (Is this a valid argument?)

navigation: Turn right. Take 1 step. Turn right. Take 6 steps. Turn right. Take 1 step. Turn right. Take 2 steps. Take 4 steps. (Do you end up back in the same place?)

causal_judgment: Brown is playing a simple game of dice. The game requires that Brown roll a six to win. So, hoping to get a six, Brown throws a die onto the table. Unluckily for the other players, the die lands six-up and Brown wins the game. Did Brown intentionally roll a six?

These tasks seem… hard? If a generic language model could get human-level performance on all of these, that would look a lot like “intelligence”.

So, does loss matter? I’d like to show you a big plot that takes all the recent models and compares their loss to their performance on BigBench. Unfortunately, this is hard because models use different datasets so their losses aren’t the same, and often they don’t tell you their loss anyway. And they don’t always evaluate on the same BigBench tasks (or evaluate on BigBench at all).

Fortunately, we have a small range of models that were all trained on the same dataset and evaluated on the same tasks. Here is a plot with loss on the x-axis and performance on the y-axis. (On BigBench, models get a score between 0 and 100 on each task, graded with the expectation that a human expert would score close to 100, though in practice humans only seem to score around 80%. A perfect model would be in the upper-left corner.)

The blue line shows a fit to just the first three models. I AM HIGHLY UNCERTAIN ABOUT THIS FIT. It looks like once the error drops below around 0.5, BigBench accuracy starts to take off. But that’s being judged from very little information.

Still, it’s reasonable to expect that at least in the short term, improving the loss will make LLMs behave more “intelligently”. And if you took the blue line at face value—don’t—then you’d expect that reducing the loss to near zero would produce near-human-expert performance.

### What are these “scaling laws” everyone is on about?

If you read about recent language models, you’ll see all sorts of details like the number of layers, the number of “heads”, the size of the “keys” and “values”, the learning rate, the batch size, and the “cosine cycle length”. All this stuff matters! But starting with Kaplan et al. (2020) and continuing with the “Chinchilla” paper (Hoffmann et al., 2022), people noticed that as long as you do a good job of all that stuff, you can predict the loss pretty well just from two numbers:

1. N: The number of parameters you put in the model.
2. D: The total number of tokens being trained on.

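You’d expect that more parameters are good, and more data is good, right? Well, the Chinchilla folks trained a bunch of different models on one particular dataset and observed that the loss was well approximated by this equation:

$$L(N, D) = \underbrace{\frac{406.4}{N^{0.34}}}_{\text{model error}} + \underbrace{\frac{410.7}{D^{0.28}}}_{\text{data error}} + \underbrace{1.69}_{\text{irreducible error}}$$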

Don’t panic—this is the only equation in this post. But it’s very important and will be with us for the rest of this post. Here are some preliminary comments.

First of all, it’s insane that this equation exists. There is no obvious reason that the loss should be predicted from just N and D, let alone in such a simple way, where N and D don’t even interact.

Second, you should be skeptical about whether this equation is true. It was fit using relatively small values of N and D. It also seems to generalize well to the much larger values used in state-of-the-art models. But there’s no guarantee that it will continue to hold for even larger models.

OK, so what’s going on in this equation? The left-hand side is the “loss” or how good a language model with N parameters will be if you train it using D tokens. On the right-hand side, there are three terms:

1. The model error is the loss that comes from having a finite number of parameters. If your model is too simple, you can’t represent the true complexity of language, so you are worse at predicting words.
2. The data error is the loss that comes from having a finite amount of data. If you don’t have enough signal, you can’t find all the true patterns in language, and so you’re worse at predicting words.
3. The irreducible error is the loss you’d still have even with an infinite number of parameters and trained for an infinitely long time on an infinite amount of data. This is nonzero because it’s not possible to perfectly predict what word will come next. It doesn’t matter how much data you have or what model you use: if language were deterministic, it wouldn’t contain any information! So some amount of loss cannot possibly be eliminated by any model. We don’t know for sure what that minimum loss is. The scaling law just says that current models can’t do better than 1.69.
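The three terms can be sketched in a few lines of code. The exponents 0.34 and 0.28 and the irreducible 1.69 appear in this post; the coefficients 406.4 and 410.7 are the published Chinchilla fit, which the numbers later in the post appear to follow, so treat them as an assumption here.

```python
# Assumed Chinchilla fit: loss = A / N**ALPHA + B / D**BETA + IRREDUCIBLE
A, ALPHA = 406.4, 0.34   # model-error coefficient and exponent
B, BETA = 410.7, 0.28    # data-error coefficient and exponent
IRREDUCIBLE = 1.69

def model_error(n_params):
    """Loss attributed to having only n_params parameters."""
    return A / n_params**ALPHA

def data_error(n_tokens):
    """Loss attributed to having only n_tokens training tokens."""
    return B / n_tokens**BETA

def loss(n_params, n_tokens):
    """Predicted loss: model error + data error + irreducible error."""
    return model_error(n_params) + data_error(n_tokens) + IRREDUCIBLE

# Example: a 70-billion-parameter model trained on 1.4 trillion tokens.
print(f"model error {model_error(70e9):.3f}, data error {data_error(1.4e12):.3f}, loss {loss(70e9, 1.4e12):.3f}")
```

Note that the two reducible terms never interact: adding parameters does nothing to the data error, and adding tokens does nothing to the model error.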

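To simplify things going forward, I’m going to define the “error” as the loss without the irreducible error, i.e.

$$E(N, D) = \underbrace{\frac{406.4}{N^{0.34}}}_{\text{model error}} + \underbrace{\frac{410.7}{D^{0.28}}}_{\text{data error}}$$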

Here is what this looks like:

The lines show different amounts of total error. Basically: more is good. If you have few parameters and tokens, you’re in the lower left and have high error (red). If you increase those both a lot, you’re in the upper right and have low error (blue).

Compute doesn’t explicitly appear in the scaling law. That’s because it’s determined by the number of parameters and tokens. Let’s be clear about what the above scaling law says:

• You have a program that will learn an LLM.
• You choose how many parameters N you want. This is just a number. You can choose it to be anything you want.
• You gather some pile of data with D tokens.
• You run your program. If N and D are small, the program will require little compute. If they are huge, it will require immense compute.

In practice, the total compute is simple. The total number of FLOPs (“how many calculations the computer does”) empirically seems to be very close to 6ND. Again, we are very lucky that this is so simple.

I think the above scaling law isn’t the best way to look at things. After all, you don’t really care about N. That’s just a number you type somewhere. It’s trivial to change it. What you care about is (1) how much data you need to gather, (2) how much compute you need to buy, and (3) what error you get at the end.

I find it much more helpful to visualize the scaling law this way: Imagine you have a certain dataset and a certain amount of compute. Then you can ask yourself: "If I choose any given number of parameters, and I look at any fraction of my dataset, how much compute will be used, and what total error will result?" Then you can make the choice that gives the lowest error subject to your compute budget and the amount of data you have. If you do that, then you get this graph:

Really what’s happening is that you can fix any two of {parameters, data, compute}, and the last is determined for you. i.e.

• If you have N parameters and D tokens, then you will need C=6ND FLOPs and the total error is E(N,D).
• If you have N parameters and C FLOPs, then you will have time to look at D=C/(6N) tokens and the total error is E(N,C/(6N)).
• If you have D tokens and C FLOPs, then you can afford a model with N=C/(6D) parameters and the total error is E(C/(6D),D).
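The three cases above can be written as three small functions. This is only a sketch: the function names are made up for illustration, and the coefficients 406.4 and 410.7 are the published Chinchilla fit, assumed here.

```python
def error(n_params, n_tokens):
    """Total error E(N, D) under the assumed Chinchilla fit."""
    return 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28

def from_params_and_tokens(n_params, n_tokens):
    """Fix N and D: returns (FLOPs needed, total error)."""
    return 6 * n_params * n_tokens, error(n_params, n_tokens)

def from_params_and_flops(n_params, flops):
    """Fix N and C: returns (tokens you have time to see, total error)."""
    n_tokens = flops / (6 * n_params)
    return n_tokens, error(n_params, n_tokens)

def from_tokens_and_flops(n_tokens, flops):
    """Fix D and C: returns (parameters you can afford, total error)."""
    n_params = flops / (6 * n_tokens)
    return n_params, error(n_params, n_tokens)

flops, err = from_params_and_tokens(70e9, 1.4e12)
print(f"{flops:.2e} FLOPs, total error {err:.3f}")
```

Feeding the resulting FLOPs back into either of the other two functions recovers the same point, which is all that "fix any two and the third is determined" means.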

Now to understand this graph, say that you have a computational budget of C FLOPs and access to D data. Now, you can choose any number of parameters N, then you would have time to look at C/(6N) data. But of course, you only have D data, so you can’t use more than that.

So, technically speaking, given an input of C FLOPs and D tokens, the best total error is given by minimizing E(N,C/(6N)) over N, subject to the constraint that C/(6N) ≤ D. You could work out the math for that if you want but who cares, I just did it numerically.
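One way to do that numerically is a simple grid search, sketched below. This is not necessarily how the graphs in this post were produced; the grid range and step count are arbitrary choices, and the coefficients are the assumed Chinchilla fit.

```python
import math

def error(n_params, n_tokens):
    """Total error E(N, D) under the assumed Chinchilla fit."""
    return 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28

def best_error(flops, tokens_available, steps=4000):
    """Minimize E(N, C/(6N)) over a log-spaced grid of N, subject to C/(6N) <= D."""
    best_err, best_n = math.inf, None
    for i in range(steps + 1):
        n_params = 10 ** (6 + 10 * i / steps)   # 1e6 to 1e16 parameters
        tokens_used = flops / (6 * n_params)
        if tokens_used > tokens_available:       # would need more data than we have
            continue
        err = error(n_params, tokens_used)
        if err < best_err:
            best_err, best_n = err, n_params
    return best_err, best_n

err_free, n_free = best_error(1e24, math.inf)    # compute-limited only
err_tight, n_tight = best_error(1e24, 3e11)      # also limited to 3e11 tokens
print(f"unlimited data: error {err_free:.3f} with N = {n_free:.2e}")
print(f"3e11 tokens:    error {err_tight:.3f} with N = {n_tight:.2e}")
```

When the data constraint binds, the search is pushed toward a larger model than it would otherwise pick, and the error is higher.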

As you’d expect, more data is good and more compute is good. If you’d like more details on this, see the appendix on “compute-optimal” models.

### Where is the loss coming from?

Mostly from data error. Different papers don’t publish their loss on a consistent dataset, and anyway you can’t observe if a mistake is due to model error or data error. But we can still take the published numbers for parameters and data and guess the errors by plugging them into the scaling law. If we do that, we get this table:

| system | parameters (billions) | data (billions of tokens) | FLOPs | model error | data error | total error |
|---|---|---|---|---|---|---|
| GPT-2 | 1.5 | 21 | 1.9×10²⁰ | 0.308 | 0.529 | 0.837 |
| GPT-3 | 175 | 300 | 3.2×10²³ | 0.061 | 0.251 | 0.312 |
| Gopher | 280 | 300 | 5.0×10²³ | 0.052 | 0.251 | 0.303 |
| Chinchilla | 70 | 1400 | 5.9×10²³ | 0.083 | 0.163 | 0.247 |
| PaLM | 540 | 780 | 2.5×10²⁴ | 0.042 | 0.192 | 0.234 |
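The table can be reproduced by plugging the published parameter and token counts into the scaling law. The sketch below assumes the Chinchilla fit coefficients (406.4 and 410.7) and the 6ND rule for FLOPs; the `breakdown` helper is a name made up for this example.

```python
# (parameters, training tokens) as listed in the table
SYSTEMS = {
    "GPT-2": (1.5e9, 21e9),
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1400e9),
    "PaLM": (540e9, 780e9),
}

def breakdown(n_params, n_tokens):
    """Return (FLOPs, model error, data error, total error) under the assumed fit."""
    model = 406.4 / n_params**0.34
    data = 410.7 / n_tokens**0.28
    return 6 * n_params * n_tokens, model, data, model + data

for name, (n, d) in SYSTEMS.items():
    flops, model, data, total = breakdown(n, d)
    print(f"{name:10s} FLOPs={flops:.1e} model={model:.3f} data={data:.3f} "
          f"total={total:.3f} data share={data / total:.0%}")
```

The last column is the fraction of total error attributed to limited data.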

For a few years, everyone was obsessed with increasing the number of parameters. But according to the scaling law, that might have been a mistake: In GPT-3, Gopher, and PaLM, over 80% of error was due to limited data, not limited model size. Chinchilla broke the pattern by training a comparatively small model on a larger dataset. This gives a much lower error than Gopher despite having a similar computational cost.

This suggests that, at least in the short term, models will likely be smaller than PaLM. Lots more compute is needed, but that compute will be used to churn through more data, not to increase the number of parameters.

Incidentally, if you are skeptical about the scaling law, a good exercise is to ask if people with skin in the game are behaving as if the scaling law were true. The answer seems to be yes. LLaMA followed Chinchilla’s trend of training a comparatively small model on a huge amount of data. If rumors are accurate, GPT-4 will be similar in size to GPT-3 or smaller, but trained on more data. (Part of this is that smaller models are also cheaper to run in production.)

### What’s with these “scale is all you need” t-shirts?

You could phrase the theory like this:

1. Scaling laws say that with enough data and compute we can reduce the total error to near zero.
2. The trend suggests that an LLM with near zero total error would have a performance of >90% on BigBench, which would look pretty “intelligent”.
3. So we don’t need any new breakthroughs, just scale.

If you trust the scaling law for loss and the above fit between loss and BigBench, then we could get this figure that says how “intelligent” an LLM would be given a certain amount of compute and data.

Again, YOU SHOULDN’T TRUST THIS GRAPH because the loss/BigBench relationship is only based on three observations. But if you did, then this says that all you need for human-ish “intelligence” is to move to the upper-right—take current models with around 10²⁴ FLOPs and 10¹² tokens and make both of those numbers much bigger.

I don’t know if this is true. But I do think there’s strong evidence that on the margin of current state-of-the-art models, more scale will surely increase “intelligence”.

## Scaling data

### How much data is needed?

A lot. As a first exercise, let's imagine that you had infinite compute. You can train an infinitely large model, only on a finite amount of data. How good would it be? Well, go plug N=∞ into the scaling law. If you have D tokens, you should expect a total error of E(∞,D). Here's what that looks like compared to a few well-known models.

Here the vertical gap between each model and the “unlimited compute” curve depends on how many parameters are in the model: PaLM is closer to the line than Chinchilla, because PaLM has many more parameters, i.e. spent more compute per token of data.

So there's a lot to be gained from making datasets 10x or 100x larger than the biggest recent datasets (scaling from 10¹² to 10¹³ or 10¹⁴ tokens). If you *really* want maximum accuracy, you might want to go up to 1,000x or 10,000x larger than the current largest datasets (10¹⁵ or 10¹⁶ tokens).

For future reference, here is the minimum error for different numbers of tokens, assuming infinite compute:

| tokens | error w/ unlimited compute |
|---|---|
| 10¹² (current models) | 0.179 |
| 10¹³ | 0.094 |
| 10¹⁴ | 0.049 |
| 10¹⁵ | 0.026 |
| 10¹⁶ | 0.014 |
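With infinitely many parameters the model-error term vanishes, so this table is just the data-error term of the scaling law. A minimal sketch, assuming the Chinchilla fit coefficient 410.7 and exponent 0.28:

```python
def error_unlimited_compute(n_tokens):
    """E(infinity, D): only the data-error term is left (assumed Chinchilla fit)."""
    return 410.7 / n_tokens**0.28

for exponent in range(12, 17):
    print(f"10^{exponent} tokens: {error_unlimited_compute(10.0**exponent):.3f}")
```

Each 10x increase in tokens divides the error by 10^0.28, or about 1.9, which is why it takes so many orders of magnitude to get close to zero.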

### Does enough data even exist?

It’s unclear. There’s definitely more data out there, but it won’t be easy to collect, and it’s hard to say if we’re going to hit a limit.

Here is my best guess for the number of tokens that could be found in different sources, if you’re willing to go to fairly extreme lengths. See Villalobos et al. (2022) for some similar calculations.

| Source | Tokens in current models | Tokens accessible in principle |
|---|---|---|
| Internet | ~10¹² | 5 × 10¹⁴ (?) |
| Books | 5 × 10¹¹ | 10¹³ |
| Wikipedia (English) | 6.5 × 10⁹ | 6.5 × 10⁹ |
| Wikipedia (All) | 2.5 × 10¹⁰ | 3.9 × 10¹⁰ |
| Scientific papers | 2.7 × 10¹⁰ | 1.5 × 10¹² |
| Text Messages | 0 | 10¹² / year |
| Panopticon (English) | 0 | 10¹⁵ / year |
| Panopticon (All) | 0 | 2 × 10¹⁶ / year |

(Do you know how many tokens are in all the emails? Or in all the Facebook posts? Or in all the phone calls?)

I’ve put the calculations behind these estimates in an appendix because they are fiddly and tedious. But here’s a few notes:

• Internet: Most models already take a decent fraction of the full internet as a starting point. But this isn’t remotely usable unless it’s heavily filtered to try to remove garbage and spam and non-text stuff like menus or binary data. There’s definitely a lot more text on the internet than is being used now, but filtering is hard and I don’t think anyone really knows how much “usable” text exists.

• Books: My estimate of accessible tokens is based on all the books in the Library of Congress, which I think is the largest library in the world and has around ⅓ of all surviving books.

• Wikipedia: Most models already use basically all of Wikipedia. It isn’t huge, but people tend to give it high weight.

• Scientific papers: A lot of models use the papers on arXiv, where the full text can be easily accessed. If you could collect all the papers ever written that would be around 100x more tokens.

• Twitter: We have good data on Twitter. The total number of tokens is surprisingly huge—around as large as all the books ever written. You have to wonder if all this data will get monetized at some point.

• Text Messages: I estimated how many tokens are sent through WhatsApp per year. If you believe in encryption, then this is impossible to train on, but it seems doubtful that it would be worth it anyway, since neither the quality nor quantity seems all that great. (Though maybe you want an AI to act like a friend…)

• Youtube: Doing speech-to-text on all the videos doesn’t generate enough tokens to really change things.

• Panopticon: What we’d get if we recorded every single word spoken by every native speaker of English (or all languages) in the world.

The biggest uncertainty is how much useful data is on the internet. If it's impossible to filter out much more than the current 10¹²-ish useful tokens, then it might be hard to scale datasets beyond 10¹³-ish tokens so the total error can't drop below around 40% of current models. But if you can extract 5×10¹⁴ useful tokens, then the total error could be reduced to only 13% of current models. That's a huge deal and I wouldn't have expected that humanity's future trajectory could possibly hinge on such a weird technicality.

If you have 10¹³ tokens, then the total error could be reduced as low as E(∞,10¹³)=0.094 with enough compute (i.e. infinitely many parameters). On the other hand, with 5×10¹⁴ tokens it could be reduced as low as E(∞, 5×10¹⁴)=0.031. Current models have a total error around 0.23.
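The "40%" and "13%" figures are ratios of those floors to the error of current models. A quick check, assuming the Chinchilla data-error fit (410.7 / D^0.28) and the rough 0.23 figure quoted in the text:

```python
CURRENT_ERROR = 0.23  # approximate total error of current models, from the text

def floor_error(n_tokens):
    """Lowest error reachable with n_tokens and unlimited compute (assumed fit)."""
    return 410.7 / n_tokens**0.28

def fraction_of_current(n_tokens):
    """Error floor as a fraction of the current error."""
    return floor_error(n_tokens) / CURRENT_ERROR

for n_tokens in (1e13, 5e14):
    print(f"{n_tokens:.0e} tokens: floor {floor_error(n_tokens):.3f}, "
          f"{fraction_of_current(n_tokens):.0%} of current error")
```

Because the inputs are rounded, the percentages can come out a point or so away from the ones quoted.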

My conclusion is: If you want more than the 10¹² tokens in current datasets, you don’t have a lot of options. You can probably get an order of magnitude from Twitter or a big project to digitize all the books ever written. But the scaling law says that to get near-perfect performance you’d want 10¹⁵ tokens or maybe even more. The only places where that seems possible are maybe the internet or some nightmare total surveillance regime.

So, limited data might pose a barrier to how good LLMs can get. What about limited compute?

## Scaling compute

### What happens if you increase compute?

You eventually hit diminishing returns unless you also increase the number of tokens. Let's fix the number of tokens D to various levels and vary the number of FLOPs we have to train with. For each number of FLOPs, pick the largest number of parameters you can "afford". (Remember, it's easy to change the number of parameters.) Then this is what happens:

To be more careful, remember that the number of FLOPs is approximately 6ND. So if D and C are fixed, we can choose N such that C=6ND and then look at the error E(N,D).

The circles show the estimated error for GPT-3, PaLM, and Chinchilla. You get heavily diminishing returns from increasing parameters/compute unless you have a ton of data. For example, given GPT-3’s dataset, no amount of compute could ever equal the performance of PaLM.
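Here is a small sketch of that experiment with the dataset held at GPT-3's 300 billion tokens. It assumes the Chinchilla fit coefficients and the 6ND rule; the PaLM error of 0.234 is the estimate from the table earlier in this post.

```python
def error_with_fixed_data(flops, n_tokens):
    """Spend all of C on parameters: N = C / (6 D), then evaluate E(N, D) (assumed fit)."""
    n_params = flops / (6 * n_tokens)
    return 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28

GPT3_TOKENS = 300e9
PALM_ERROR = 0.234                           # estimated total error for PaLM
data_floor = 410.7 / GPT3_TOKENS**0.28       # what remains even with infinite compute

for exponent in range(22, 33, 2):
    err = error_with_fixed_data(10.0**exponent, GPT3_TOKENS)
    print(f"10^{exponent} FLOPs: error {err:.3f}")
print(f"floor with 300B tokens: {data_floor:.3f} (PaLM estimate: {PALM_ERROR})")
```

The error keeps falling as compute grows, but it flattens out at the data floor, which sits above the PaLM estimate.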

### How much compute is needed?

A lot. Here's another exercise: Imagine you have access to unlimited data, but finite compute. How well would you do? This is a little subtle because even if you have access to unlimited data, you can't train on infinite data without infinite compute. If you have a fixed amount of compute, what you want to do is choose the model size and number of tokens that fit in your budget and give the lowest predicted loss. Here's what happens if you do that:

Remember, the number of FLOPs is around 6ND. So if you have a computational budget of C FLOPs and choose to use N parameters, then you’ll only have time to look at D=C/(6N) tokens. So you can look at all possible values of N and choose the one where E(N,C/(6N)) is lowest. This is sometimes called the “compute optimal” error because it’s the lowest error that the scaling law says you can get with a total of C FLOPs. If you like math you can write this as E*(C) = min_N E(N,C/(6N)). You could conceivably try to solve that equation with math, but rather than screwing around with that I just solved it numerically for various numbers of FLOPs C.

GPT-2 and Chinchilla were trained with large amounts of data for their size, so they achieve nearly optimal loss given the compute used. On the other hand, GPT-3 and PaLM have smaller amounts of data for their size, so are further above the “unlimited data” line.

So: There's a lot to be gained by spending 1,000x more on compute than the current largest models do (scaling from 10²⁴ to 10²⁷ FLOPs). If you really want maximum accuracy, you might want to use up to 1,000,000x more compute (scale to 10³⁰ FLOPs).

Here is the minimum error for different numbers of FLOPs, assuming unlimited data:

| FLOPs | error w/ unlimited data |
|---|---|
| 10²⁴ (current models) | 0.221 |
| 10²⁵ | 0.155 |
| 10²⁶ | 0.109 |
| 10²⁷ | 0.077 |
| 10²⁸ | 0.054 |
| 10²⁹ | 0.038 |
| 10³⁰ | 0.027 |
| 10³¹ | 0.019 |
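A table like this can be generated by minimizing E(N, C/(6N)) over N for each compute budget. The sketch below uses a ternary search over log10(N), which works because the error is convex in log N. The search bounds and iteration count are arbitrary, the coefficients are the assumed Chinchilla fit, and this is not necessarily the numerical method used for the table.

```python
A, ALPHA = 406.4, 0.34   # assumed Chinchilla fit
B, BETA = 410.7, 0.28

def error(n_params, n_tokens):
    return A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal_error(flops, iterations=200):
    """Minimize E(N, C/(6N)) over N with a ternary search on log10(N)."""
    def err_at(log_n):
        n_params = 10.0**log_n
        return error(n_params, flops / (6 * n_params))

    lo, hi = 3.0, 20.0   # search between 1e3 and 1e20 parameters
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if err_at(m1) < err_at(m2):
            hi = m2
        else:
            lo = m1
    return err_at((lo + hi) / 2)

for exponent in range(24, 32):
    print(f"10^{exponent} FLOPs: {compute_optimal_error(10.0**exponent):.3f}")
```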

Notice that to reach a given level of error you need to scale compute much more than you need to scale data. That’s because you ultimately need to increase both the number of parameters and the number of tokens, and both of those require more compute.

### Does enough compute even exist?

Enough to make models better than they are now, sure. But there isn’t enough compute on Earth to approach zero error with current technology.

How much does it cost to train an LLM? That depends on what you measure. Electricity? Hardware? Engineer salaries? A reasonable estimate is the cost to rent hardware from a cloud computing provider, where one recent quote is that you could rent enough GPU power to train Chinchilla for $2.5 million. Since we know how many FLOPs Chinchilla used, we can extrapolate to get what loss is achievable for any given amount of money (again, assuming unlimited data!):

To give some sense of just how absurd 10⁹ million dollars is, I’ve included on the x-axis the yearly GDP of some of our favorite states/countries/planets.

I feel comfortable predicting that no one will spend their way to a total error of 0.01 simply by building larger compute clusters with current hardware/algorithms. But the best current models have a total error of around 0.24 and cost around $2.5 million. To drop that to a total error of 0.12 would “only” cost around \$230 million. If my projection were accurate, that would mean a lift in BigBench performance of around 17%. That hardly seems out of the question. And the mid-right part of the graph isn’t that far out of range for a rich and determined nation-state. And compute is constantly getting cheaper…
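The extrapolation can be sketched as follows, assuming rental cost scales linearly with FLOPs from the Chinchilla quote ($2.5 million for about 5.9×10²³ FLOPs). The compute-optimal error here uses a closed-form optimum from setting the derivative of E(N, C/(6N)) to zero, which is a shortcut taken for this sketch rather than the numerical approach described in this post. The coefficients are the assumed Chinchilla fit.

```python
import math

A, ALPHA = 406.4, 0.34   # assumed Chinchilla fit
B, BETA = 410.7, 0.28
CHINCHILLA_FLOPS = 5.9e23
CHINCHILLA_COST = 2.5e6  # dollars, from the rental quote in the text

def compute_optimal_error(flops):
    """Lowest E(N, D) with 6 N D = flops, using the calculus solution for N."""
    n_params = ((ALPHA * A) / (BETA * B)) ** (1 / (ALPHA + BETA)) \
        * (flops / 6) ** (BETA / (ALPHA + BETA))
    n_tokens = flops / (6 * n_params)
    return A / n_params**ALPHA + B / n_tokens**BETA

def flops_for_error(target_error):
    """Bisect on log10(FLOPs); the optimal error falls as compute grows."""
    lo, hi = 18.0, 40.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if compute_optimal_error(10.0**mid) > target_error:
            lo = mid
        else:
            hi = mid
    return 10.0**hi

def rental_cost(flops):
    """Dollars, assuming cost is proportional to FLOPs."""
    return CHINCHILLA_COST * flops / CHINCHILLA_FLOPS

needed = flops_for_error(0.12)
print(f"error 0.12 needs about {needed:.1e} FLOPs, roughly ${rental_cost(needed) / 1e6:.0f} million")
```

Linear cost scaling is the key assumption: it ignores volume discounts, hardware shortages, and anything else that happens when you try to rent 100x more GPUs.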

## Why could this all be wrong?

For many reasons!

### Maybe the scaling law is wrong.

All these projections have relied heavily on the Chinchilla scaling law, which allows us to predict the total error from a given amount of compute and a given number of tokens. Should we trust that law? After all, there’s no deep theory for why it should be true, it’s purely empirical. And as far as I can tell, here are the places where it has actually been checked:

We are most interested in what happens in the upper-right corner of this graph. But to extrapolate from 10²² FLOPs to 10³¹ FLOPs and from 10¹² to 10¹⁵ tokens is a *huge* jump. The pattern looks good so far, and it likely continues to hold in the region around the dots in the above graph. But we should have lots of uncertainty about how things generalize far beyond that.

The Chinchilla paper says that they ran "over 400" trials, but never explicitly says what those are. I pieced this together from plots that show ranges of FLOPs between 6e18 and 3e21 (e.g. figure 3), a table that shows between 44M and 16.1B parameters (table A9), and the fact that they only have 1.4 trillion tokens total. If I take all those combinations of FLOPs and parameters, calculate the number of tokens for each, and then filter out the cases where the number of tokens is greater than 1.4T, I get a total of 428 cases, which fits well with "over 400".

### Maybe the loss/performance relationship is wrong.

Even if the scaling law is correct, that just tells us how much the loss improves. We don’t know how “loss” translates to usefulness or perceived “intelligence”. It could be that if you drop the error to near zero, BigBench performance goes to 100 and everyone agrees the system is superhuman. Or it could be that reducing the error below current levels doesn’t do much. We just don’t know.

### Maybe quality has a quality all its own.

The scaling law is independent of the quality of the data. The loss just measures how well you fit the data you train on. If you train on a huge pile of garbage and the model does a good job of predicting new (garbage) words, then you still get low loss. Everyone knows that the qualitative performance of LLMs depends a lot on how “good” the data is, but this doesn’t enter into the scaling law.

Similarly, everyone reports that filtering the raw internet makes models better. They also report that including small but high-quality sources makes things better. But how much better? And why? As far as I can tell, there is no general “theory” for this. We might discover that counting tokens only takes you so far, and 10 years from now there is an enormous infrastructure for curating and cleaning data from hundreds of sources and people look back on our current fixation on the number of tokens with amusement.

### Maybe specialization is all you need.

We’ve already pushed scale pretty hard in base language models. But we are still in the early stages of exploring what can be done with fine-tuning and prompt engineering to specialize LLMs for different tasks. It seems likely that significantly better performance can come from improving these. Maybe we eventually discover that the base LLM only matters so much and the real action is in how you specialize LLMs for specific tasks.

## The words they burn

OK, OK, I’ll summarize.

1. There is no apparent barrier to LLMs continuing to improve substantially from where they are now. More data and compute should make them better, and it looks feasible to make datasets ~10x bigger and to buy ~100x more compute. While these would help, they would not come close to saturating the performance of modern language model architectures.

2. While it’s feasible to make datasets bigger, we might hit a barrier trying to make them more than 10x larger than they are now, particularly if data quality turns out to be important. The key uncertainty is how much of the internet ends up being useful after careful filtering/cleaning. If it’s all usable, then datasets could grow 1000x, which might be enough to push LLMs to near human performance.

3. You can probably scale up compute by a factor of 100 and it would still “only” cost a few hundred million dollars to train a model. But to scale an LLM to maximum performance would cost much more: with current technology, more than the GDP of the entire planet. So there is surely a computational barrier somewhere. Compute costs are likely to come down over time, but slowly. (Eyeballing this graph, it looks like the cost of GPU compute has recently fallen by half every 4 years, equivalent to falling by a factor of 10 every 13 years.) There might be another order of magnitude or two in better programming, e.g. improved GPU utilization.

4. How far things get scaled depends on how useful LLMs are. It’s always possible, in principle, to get more data and more compute. But there are diminishing returns and people will only do it if there’s a positive return on investment. If LLMs are seen as a vital economic/security interest, people could conceivably go to extreme lengths for larger datasets and more compute.

5. The scaling laws might be wrong. They are extrapolated from fits using fairly small amounts of compute and data. Or data quality might matter as much as quantity. We also don’t understand how much base models matter as compared to fine-tuning for specific tasks.

## What would change all this?

Even if all the above analysis is right, a paper could be posted on arXiv tomorrow that would overturn it.

First, a new language model could arise that overturns the scaling laws. If you had created scaling laws before the Transformer was invented, they wouldn’t have looked nearly so optimistic. Or, someone might find a way to tweak the transformer to make it generalize better (e.g. by inducing sparsity or something). I guess it’s possible that the final piece of the puzzle came in 2017 and nothing else is left. But I doubt it.

Second, there might be innovations in data generation. In computer vision, it is common to make datasets bigger by randomly warping/scaling/shifting images. (If you zoom in on a cow, it’s still a cow.) These help computer vision models generalize better from the same amount of starting data. If similar tricks were invented for transforming text into equally-good text, this could also improve the scaling laws.

Third, there could be innovations in multi-modal training. If there isn’t enough English, then maybe you can train on other languages without harming performance. Or maybe you can train a model that predicts not just text, but also audio, images, or video. Sure, lots of the model would probably need to be specialized to one domain. But as far as I can tell, the reason LLMs look intelligent is that predicting the next word is so damn hard that if you want to do it well enough, you can’t avoid learning how to think. Probably the same is true for predicting the next pixel, and maybe some of the “thinking parts” can be shared.

So, lots of uncertainty! But I think we know enough that the inside view is worth taking seriously.

Thanks to Andrew Conner, Tristan Homsi, Damian Bogunowicz, other DB

## Appendix: Compute optimal models

Say you’re a company. You have unlimited data, but you only have 25,000 GPUs and you can only run them for one month. (I’m sorry, that must be hard for you.) You can create a gigantic model and run it for a small number of iterations (so it only sees a small amount of data) or you could create a small model and run it for a lot of iterations (so it sees a ton of data). Or you could do something in the middle.

That is, you can “spend” your compute on parameters or you can spend it on data. The idea of a “compute optimal” model is to spend your compute in the best possible way to get the best return.

Formally, in a compute optimal model, you fix the amount of compute C and then minimize E(N,D) over N and D with the constraint that C=6ND. You can do this with math if you want, but who cares, here it is:

Here I’ve greatly decreased the minimum number of FLOPs to make the pattern more clear. When the computational budget is small, you want a roughly equal number of tokens and parameters. As the budget grows, we want to increase the number of parameters and the number of tokens, but the number of tokens increases faster.

Here’s the ratio of D/N in the above graph. This increases dramatically when you have more compute.

Why does this happen? Ultimately, it’s because in the scaling law, N is taken to a power of 0.34 while D is taken to a power of 0.28, meaning it’s easier to kill off model error (by increasing N) than it is to kill off data error (by increasing D).
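For readers who do want the math, here is a sketch of where those growth rates come from. It writes the scaling law generically as E(N, D) = A/N^α + B/D^β with α = 0.34 and β = 0.28 (A and B are the fitted coefficients, which drop out of the exponents), substitutes D = C/(6N), and sets the derivative with respect to N to zero. This is standard calculus added for illustration, not a derivation taken from the Chinchilla paper.

```latex
\begin{aligned}
E\!\left(N, \tfrac{C}{6N}\right) &= A\,N^{-\alpha} + B\left(\tfrac{6N}{C}\right)^{\beta} \\
\frac{\partial E}{\partial N} = 0
  \;&\Longrightarrow\; \alpha A\,N^{-\alpha-1} = \beta B\left(\tfrac{6}{C}\right)^{\beta} N^{\beta-1} \\
  \;&\Longrightarrow\; N_\ast^{\alpha+\beta} = \frac{\alpha A}{\beta B}\left(\tfrac{C}{6}\right)^{\beta} \\
N_\ast \propto C^{\beta/(\alpha+\beta)} &\approx C^{0.45}, \qquad
D_\ast = \frac{C}{6N_\ast} \propto C^{\alpha/(\alpha+\beta)} \approx C^{0.55}, \qquad
\frac{D_\ast}{N_\ast} \propto C^{(\alpha-\beta)/(\alpha+\beta)} \approx C^{0.10}
\end{aligned}
```

Because α is larger than β, the optimal number of tokens grows with a larger exponent than the optimal number of parameters, so the ratio D/N rises as the budget grows.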

Of course, we can doubt this conclusion: The scaling law has only been validated up to around 10²⁴ FLOPs. There is no guarantee that the pattern continues to hold.

## Appendix: Data estimates

My best guess is that current models seem to be using on the order of 10¹² tokens, but that with a lot of work on better filtering / cleaning / parsing, it might be possible to push this up to around 5 × 10¹⁴ but probably not a whole lot more?

Here’s my logic: Most recent models use data from Common Crawl, a nonprofit dedicated to crawling the web and making all the data available. (This exists!?) I made a rough estimate that if you take all the text data Common Crawl ever collected and assume it’s all usable, this would be around 5 × 10¹⁴ tokens.

However! The data in Common Crawl is a total mess. Some stuff is duplicated between different crawls, and apparently lots of non-text garbage leaks into the text. Beyond that, lots of the text is totally unusable. Some folks (Gao et al. 2020) tried to create an open dataset called The Pile that would be similar to the data GPT-3 trained on. They noticed that the text data in Common Crawl is a nightmare and so used their own algorithms to try to extract text from the HTML instead. They only ended up with around 5.6 × 10¹⁰ tokens, 4 orders of magnitude less. GPT-3 seems to have used around 7.5×10¹¹ tokens from Common Crawl.

Overall, most models seem to put low weights on current data extracted from Common Crawl. My guess is that with a lot more effort put into filtering and cleaning this can be pushed up.

• Common Crawl tries to return just the text content from HTML and RSS content in what they call “WET” files.
• In the most recent crawl, this was 9TiB GZIP compressed.
• Typically, compressed text is around 25-30% of the original size, so let’s call that 9 TiB/0.275 = 32.7 TiB (32.7 × 2⁴⁰ bytes) of raw text.
• This text seems to be UTF-8 encoded. In UTF-8, ASCII characters use 1 byte, others use up to 4 bytes. I’ll randomly use an average of 2, meaning we have 16.35 × 2⁴⁰ characters per crawl.
• At 4 characters per token, that’s 4.09 × 2⁴⁰ tokens per crawl.
• Successive crawls duplicate some older text; I’m not taking this into account!
• Upper estimate: No duplication, no filtering, none of the text is binary garbage, assume same rate for the past 10 years: 4.09 × 2⁴⁰ (tokens / crawl) × 12 (crawls / year) × 10 years = 5.4 × 10¹⁴ tokens.
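The chain of steps in the list above can be sketched in a few lines of Python. Every input is a figure from the list; the function and parameter names are made up for illustration, and the factor of 4 is applied to the character count, as the arithmetic in the list does.

```python
TIB = 2**40  # bytes in one tebibyte


def common_crawl_upper_estimate(
    compressed_tib=9,          # size of one crawl's WET files, gzip compressed
    compression_ratio=0.275,   # compressed size as a fraction of raw size
    bytes_per_char=2,          # rough UTF-8 average picked in the list
    chars_per_token=4,         # the list's factor of 4, applied to characters
    crawls_per_year=12,
    years=10,
):
    """Upper estimate of tokens ever collected, ignoring duplication and junk."""
    raw_bytes = compressed_tib * TIB / compression_ratio
    chars_per_crawl = raw_bytes / bytes_per_char
    tokens_per_crawl = chars_per_crawl / chars_per_token
    return tokens_per_crawl * crawls_per_year * years


if __name__ == "__main__":
    print(f"{common_crawl_upper_estimate():.2e} tokens")
```

Writing it this way makes it easy to see how sensitive the answer is to the guesses: for example, passing `bytes_per_char=1` doubles the estimate.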

The “Will we run out of data?” paper used different math but arrived at a broadly similar estimate of number of tokens per year.

The Pile ended up with only 227.12 GiB of data from Common Crawl, and its authors report a ratio of 0.2291 tokens per byte, meaning only 5.6 × 10¹⁰ tokens.
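As a quick check on that conversion, and on the "4 orders of magnitude" gap mentioned earlier, here is a small sketch. The inputs are the figures quoted in the text; the variable names are made up for illustration.

```python
import math

GIB = 2**30  # bytes in one gibibyte

# Figures quoted in the text for The Pile's Common Crawl subset.
pile_cc_bytes = 227.12 * GIB
tokens_per_byte = 0.2291

pile_cc_tokens = pile_cc_bytes * tokens_per_byte

# Upper estimate for all of Common Crawl, as quoted in the text.
common_crawl_upper = 5.4e14
gap_in_orders = math.log10(common_crawl_upper / pile_cc_tokens)

print(f"Pile tokens from Common Crawl: {pile_cc_tokens:.2e}")
print(f"Orders of magnitude below the upper estimate: {gap_in_orders:.1f}")
```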

The GPT-3 paper says “nearly a trillion words”. If it was a full trillion that would be around 10¹² * 0.75 = 7.5×10¹¹

OK, what about Twitter? We seem to have pretty good data about how many tweets have been sent and the average tweet length. Putting these together we get around 2.5 × 10¹³ tokens. As far as I know, no one admits to training on Twitter data. Although if they did, would they tell us? (Gwern reports that Twitter is not generally in Common Crawl.)

• Let’s take the average number of tweets per day (in millions) for each year, add them up, and multiply by 365: (340+500+500+546+592+634+683+729+775+821+876+876) * 1e6 * 365 = 2.87328 × 10¹² total tweets.
• The average tweet is apparently around 33 characters. So that’s 9.481824 × 10¹³ characters or, at 4 characters per token, 2.37 × 10¹³ tokens.
• Let’s round that up to 2.5 × 10¹³ since 2023 is already starting to rush past us.
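The Twitter arithmetic above, as a short sketch. The yearly figures are the ones in the list, read here as average tweets per day in millions (an assumption, but the multiplication by 365 implies it), and the 4 characters per token is the same rough ratio used for Common Crawl.

```python
# Average tweets per day, in millions, one entry per year (from the list above).
tweets_per_day_millions = [340, 500, 500, 546, 592, 634, 683, 729, 775, 821, 876, 876]

CHARS_PER_TWEET = 33   # average tweet length quoted in the text
CHARS_PER_TOKEN = 4    # ASSUMPTION: same rough ratio as in the Common Crawl estimate

total_tweets = sum(tweets_per_day_millions) * 1e6 * 365
total_chars = total_tweets * CHARS_PER_TWEET
total_tokens = total_chars / CHARS_PER_TOKEN

print(f"tweets: {total_tweets:.3e}")
print(f"characters: {total_chars:.3e}")
print(f"tokens: {total_tokens:.3e}")
```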

For blog posts, Villalobos et al. estimate that between 0.2 trillion and 2 trillion are written per year. If we take an average length of 1000 words, that would be 2.6×10¹⁴ to 2.6×10¹⁵ tokens. Since that’s more than the estimated number of tokens on the entire internet, I think something is wrong.

What about books? Using data from how many books and words were in Google Books in 2010 and extrapolating, my guess is that around 182 million books survive in English today.

How many tokens are in a book? The books that Google had scanned in 2010 had an average of around 69.4 thousand words or 92.5 thousand tokens per book. (Though note these books were surely a non-random sample.)

I’m not sure if it’s realistic to expect anyone to collect the full text of all 182 million books. Certainly, all that data doesn’t exist in any organized form today. Google reported in 2015 that they had scanned 30 million books, but they seemed to get bummed out about everyone suing them all the time and sort of gave up. A few years ago, Amazon sold around 48.5 million different print books. The Library of Congress has around 51 million books. In terms of ebooks, the Internet Archive has around 20 million books in its library, while Amazon Kindle has around 12 million.

So I think the Library of Congress is the largest existing collection of books and has around 4.7×10¹² tokens, while all the books in the world have perhaps 1.7×10¹³ tokens. I think a very motivated company could collect more than the Library of Congress, but probably not all the books, so let’s just call it 10¹³ tokens that could be found in books in principle.

Meanwhile, Chinchilla seems to have used around 5×10¹¹ tokens from books. As far as I know this is the largest dataset of books used.

How many books survive? Google Books said in 2010 that they had scanned 5.2 million books, and estimated this to be 4% of all surviving books ever published. That would suggest 130 million books existed at the time. Wikipedia tells us that around 4 million books are published in English per year, so over the 13 years since then, that suggests that around 130+13×4=182 million books survive today.

Google reported that their 5.2 million books (in English) had a total of 361 billion words. That suggests an average of 69.4 thousand words, or 92.5 thousand tokens, per book.

Estimated number of tokens in all the books: 182 million books × 92.5 thousand (tokens/book) = 1.7×10¹³ tokens.

Estimated number of tokens in all the books in the Library of Congress: 51 million books × 92.5 thousand (tokens/book) = 4.7×10¹² tokens.

The Gopher paper reports that they have around 4.23 bytes per token, while the Chinchilla paper reports that they had 2.1 TB of Books. So I estimate they had 2.1×10¹² bytes / (4.23 bytes/token) ≈ 5.0×10¹¹ tokens. This is substantially more than the 6.7×10¹⁰ that GPT-3 used.
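Here is the whole chain of book estimates in one place, as a rough sketch. All inputs are figures quoted above. The 4/3 tokens-per-word ratio is an assumption inferred from the text's own conversion (69.4 thousand words to 92.5 thousand tokens), and the variable names are made up for illustration.

```python
TOKENS_PER_WORD = 4 / 3  # ASSUMPTION: implied by 69.4k words -> 92.5k tokens

# Google Books in 2010 (figures quoted in the text).
scanned_books = 5.2e6
scanned_fraction = 0.04   # share of all surviving books
scanned_words = 361e9

# Surviving books: 2010 total plus 13 years of new publications.
books_in_2010 = scanned_books / scanned_fraction
books_today = books_in_2010 + 13 * 4e6

# Average book length.
words_per_book = scanned_words / scanned_books
tokens_per_book = words_per_book * TOKENS_PER_WORD

# Totals for every surviving book and for the Library of Congress.
all_books_tokens = books_today * tokens_per_book
library_of_congress_tokens = 51e6 * tokens_per_book

# Chinchilla's book data: 2.1 TB at 4.23 bytes per token.
chinchilla_book_tokens = 2.1e12 / 4.23

for name, value in [
    ("books today", books_today),
    ("tokens per book", tokens_per_book),
    ("all books (tokens)", all_books_tokens),
    ("Library of Congress (tokens)", library_of_congress_tokens),
    ("Chinchilla books (tokens)", chinchilla_book_tokens),
]:
    print(f"{name}: {value:.3g}")
```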

How about video content? In principle you could run speech-to-text on all the video ever uploaded to YouTube, right? Surprisingly, this doesn’t get you all that far: Villalobos et al. estimate that between 130 billion and 1.3 trillion words are uploaded to YouTube each year. If we take the geometric mean of those estimates and assume 10 years of data, that’s a total of 4.1 × 10¹² words, or around 5.5 × 10¹² tokens at the same words-to-tokens ratio used for books.

What about scientific papers? Villalobos et al. estimate that there are around 170 million scientific papers in the world and the average paper length is 6 thousand words, suggesting there are 1.02 × 10¹² total words or 1.36 × 10¹² tokens in all scientific papers. How many are used in current models? Often it’s unclear, but the recently published LLaMA model has 92 GB of ArXiv data out of a total of 4749 GB of data, and a total of 1.4 trillion tokens. Assuming different types of text have equal numbers of tokens/byte (nearly true in other papers) that would be 2.7 × 10¹⁰ tokens.

What about Wikipedia? Pretty much every model already uses all of English Wikipedia, which is around 3 × 10⁹ tokens. GPT-3 reports using 3 billion tokens. LLaMA says it has 83 GB of data in 20 languages, which converts to 2.45 × 10¹⁰ tokens using the same calculation as above for ArXiv. Alternatively, this page on Wikipedia says that English Wikipedia has 4.9 billion words while all of Wikipedia has 29 billion words. That converts to 6.5 billion and 38.7 billion tokens, respectively.
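The proportional-share calculation used in the last two paragraphs can be sketched as below. It assumes, as the text does, that every kind of text in LLaMA's training set has the same number of tokens per byte. The 4/3 tokens-per-word ratio is inferred from the text's own conversion, and the helper name is made up for illustration.

```python
TOKENS_PER_WORD = 4 / 3  # ASSUMPTION: implied by 1.02e12 words -> 1.36e12 tokens


def share_of_tokens(subset_gb, total_gb, total_tokens):
    """Tokens attributed to a subset, assuming equal tokens per byte throughout."""
    return subset_gb / total_gb * total_tokens


# All scientific papers (Villalobos et al. figures quoted in the text).
paper_words = 170e6 * 6e3
paper_tokens = paper_words * TOKENS_PER_WORD

# LLaMA figures quoted in the text.
LLAMA_TOTAL_GB = 4749
LLAMA_TOTAL_TOKENS = 1.4e12
arxiv_tokens = share_of_tokens(92, LLAMA_TOTAL_GB, LLAMA_TOTAL_TOKENS)
wikipedia_tokens = share_of_tokens(83, LLAMA_TOTAL_GB, LLAMA_TOTAL_TOKENS)

print(f"all papers: {paper_tokens:.3g} tokens")
print(f"LLaMA ArXiv: {arxiv_tokens:.3g} tokens")
print(f"LLaMA Wikipedia: {wikipedia_tokens:.3g} tokens")
```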

What about text messages? The largest platforms globally seem to be WhatsApp, followed by WeChat and Facebook Messenger. In 2020, Facebook reported that around 100 billion messages per day are sent on WhatsApp and some places report that the average length of an SMS is around 7 words. That suggests 9.3 × 10¹¹ tokens are produced per day, or around 3.4 × 10¹⁴ tokens per year.

How about panopticon? Apparently, the average person speaks around 16k words per day and there are around 420 million native English speakers in the world. Say companies recorded everything everyone said for a year and sent it to the cloud. (Is it that much worse than how much your smart speaker spies on you now?) That would be 2.45×10¹⁵ words or 3.27×10¹⁵ tokens recorded per year. Realistically you wouldn’t get everyone (babies don’t talk), so let’s call it 1×10¹⁵ that you could record per year. If you recorded everyone in all languages, you’d have around 20x as many people, so 2×10¹⁶ tokens per year.
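For what it's worth, here is the panopticon arithmetic as a sketch. The inputs are the figures quoted above. The 4/3 tokens-per-word ratio is inferred from the text's own conversion, and the discount to 1×10¹⁵ is the same hand-wave as in the text rather than anything computed.

```python
TOKENS_PER_WORD = 4 / 3  # ASSUMPTION: implied by 2.45e15 words -> 3.27e15 tokens

words_per_person_per_day = 16e3
english_speakers = 420e6

words_per_year = words_per_person_per_day * english_speakers * 365
tokens_per_year = words_per_year * TOKENS_PER_WORD

# Hand-waved discount for people who can't be recorded, as in the text.
realistic_english_tokens = 1e15

# Roughly 20x as many people if you include every language.
all_languages_tokens = realistic_english_tokens * 20

print(f"words per year: {words_per_year:.3g}")
print(f"tokens per year: {tokens_per_year:.3g}")
print(f"all languages: {all_languages_tokens:.3g}")
```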

I wanted to estimate how many tokens are in all emails sent per year but I wasn’t able to come up with even a very rough estimate.

Scaling laws:

• https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
• https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-scaling-laws-for-large-language-models
• Kaplan paper https://arxiv.org/abs/2001.08361
• Chinchilla paper https://arxiv.org/abs/2203.15556

On running out of data:

• Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning https://arxiv.org/abs/2211.04325