
The real data wall is billions of years of evolution

Oct 2024

Say you have a time machine. You can only use it once, to send a single idea back to 2005. If you wanted to speed up the development of AI, what would you send back? Many people suggest attention or transformers. But I’m convinced that the answer is “brute force”—throw as much data at the problem as possible.

AI has recently been improving at a harrowing rate. If trends hold, we are in for quite a show. But some suggest AI progress might falter due to a “data wall”. Current language models are trained on datasets fast approaching “all the text, ever”. What happens when the text runs out?

Many argue this data wall won’t be a problem, because humans have excellent language and reasoning despite seeing far less language data. They say that humans must be leveraging visual data and/or using a more data-efficient learning algorithm. Whatever trick humans are using, they say, we can copy it and avoid the data wall.

I am dubious of these arguments. In this post, I will explain how you can be dubious, too.

The math checks out—humans see much less language data

Every day, an average person reads a few thousand words, and hears perhaps 16 to 40 thousand. So a well-educated 40-year-old might have encountered 5×10⁸ words in their lifetime. Recent language models are trained on upwards of 10¹³ words—20,000 times more. It’s not even close.

Imagine a fast reader who did nothing but read 300 words/minute for 80 years, never pausing to eat or sleep. They’d still see 1000 times fewer words than AIs do.
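If you want to check that arithmetic, here’s a quick back-of-envelope sketch. The daily rates are assumptions, chosen to match the figures above.

```python
# Back-of-envelope check on the word counts above.
words_read_per_day = 3_000     # "a few thousand"
words_heard_per_day = 28_000   # midpoint of "16 to 40 thousand"
days = 40 * 365                # a 40-year-old

lifetime_words = (words_read_per_day + words_heard_per_day) * days
print(f"lifetime words: {lifetime_words:.1e}")   # ~4.5e8, call it 5e8
print(f"ratio: {1e13 / 5e8:,.0f}x")              # 20,000x

# The hypothetical non-stop reader:
nonstop_words = 300 * 60 * 24 * 365 * 80         # 300 words/min for 80 years
print(f"non-stop reader: {nonstop_words:.1e}")   # ~1.3e10, still ~1000x short
```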

Vision is not the key to human intelligence

So then how do humans generalize so well from so little language data? Is “pre-training” on visual data the secret to our success?

No.

Because… blind people? What are we doing here?

Deaf people show that (non-verbal) sound isn’t critical either.

Could it be touch? There is a disease called congenital insensitivity to pain with anhidrosis or CIPA. People with CIPA often have intellectual disabilities, but much of that is surely due to the (horrible) consequences of CIPA or the protein misfolding issue that causes it. And anyway, many people with CIPA have normal intelligence. Miall et al. (2021) describe a person known as “KS” who is not paralyzed but has had no sense of touch at all since birth. They don’t do any IQ tests, but do mention that KS graduated from law school.

It seems unlikely that intelligence would be based on smell or proprioception.

Maybe we need vision or sound? At first, I thought Helen Keller was a counter-example to this. Clearly she was very smart, but she apparently had sight and hearing before losing them to meningitis at the age of 19 months. Other people are deafblind from birth. They often have intellectual disabilities, but Larsen and Dammeyer (2020) report that many don’t if given early access to language through tactile signing. However, they only report the fraction of people with IQs above 70, and I can’t tell if anyone born deafblind went on to have an average IQ.

Now, don’t write off other modalities. It could be that human brains are so adaptable that we just need exposure to language plus some kind of high-resolution sensory data. Or maybe it’s critical that we interact with our environment. We have no examples of paralyzed people with no senses who somehow survive and passively absorb language for decades.

Maybe! Or maybe all that other sensory data is irrelevant. I don’t know. But that’s kind of the point—the example of humans just isn’t very useful for predicting how helpful other modalities might be for AI.

Humans get information from evolution

Many comparisons between humans and AIs seem to be based on the following analogy:

  1. AI systems “learn” from data.
  2. Human babies “learn” from experience.

The issue with this analogy is that humans are born with extremely sophisticated programming, provided by evolution. That programming integrates information from all our ancestors, arguably going back to the origin of life on earth.

When you train an AI, it has to learn lots of stuff that babies get “for free”. Your intelligence is based on “data” from your whole evolutionary history, not just your lifetime.

Now, a skeptic might accept that human babies get some information from evolution, but object that it can’t be much information. After all, a single month of CommonCrawl (used by all current models) is around 200 terabytes. Yet human DNA has around 2.9 billion base pairs, each of which can take one of 4 values (2 bits). That adds up to only around 6 billion bits, or 690 megabytes. That’s around 300,000 times less.

And DNA has lots of other jobs beyond intelligence, like making ribosomes or making teeth or running an immune system. Is DNA too small to matter?
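Before answering, it’s worth checking the skeptic’s arithmetic. A quick sketch, using only the figures quoted above:

```python
# The skeptic's arithmetic: the genome vs. one month of CommonCrawl.
base_pairs = 2.9e9                # human genome
genome_bits = 2 * base_pairs      # 4 possible values per base = 2 bits
print(f"{genome_bits:.1e} bits")             # ~5.8e9, "around 6 billion"
print(f"{genome_bits / 8 / 2**20:.0f} MiB")  # ~690

common_crawl_bytes = 200e12       # ~200 TB
print(f"{common_crawl_bytes / (genome_bits / 8):,.0f}x smaller")  # ~280,000x
```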

DNA is not that small

Your DNA probably contains more information than all the words you’ll encounter in your whole life.

Claude Shannon, the father of information theory, famously estimated that English-language text has on average 11.8 bits/word. But recent language models suggest that it’s only 2.3 bits/word, or perhaps even less. So the 5×10⁸ words a person might have heard probably contain only around 1.2 billion bits of actual information—less than the 6 billion bits in DNA.

(People used to think that most DNA was “junk” because it didn’t directly code for proteins. But research increasingly suggests it plays other important roles, like determining how DNA folds or regulating the expression of other genes. This is an active research area with credible people on both sides. I’ll stick with 6 billion bits for simplicity.)

Now a modern AI dataset of 10¹³ words surely does contain much more information than DNA—perhaps 3800x as much. But a big fraction of that is things like, “to cure all health problems, buy CBD gummies now”, which isn’t super useful for intelligence. So the information in DNA isn’t that far off.
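Here’s the same back-of-envelope in code, using the 2.3 bits/word estimate and the figures above:

```python
# Information content of language at ~2.3 bits/word.
bits_per_word = 2.3

lifetime_bits = 5e8 * bits_per_word           # words heard in a lifetime
print(f"lifetime: {lifetime_bits:.1e} bits")  # ~1.2e9, vs ~6e9 in DNA

dataset_bits = 1e13 * bits_per_word           # a modern training set
print(f"dataset vs DNA: {dataset_bits / 6e9:,.0f}x")  # ~3,800x
```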

It’s not just DNA

Evolution provides lots of other information beyond just what’s written down in the DNA. And I’m not just talking about epigenetics—I mean all the information embodied in the physical structure of cells.

(I can feel everyone squinting right now.)

DNA is a “blueprint” for a cell. But information is needed to interpret that blueprint. Imagine a machine that could take in a DNA sequence and build a human cell. How many bits would be needed to describe that machine? A lot, right?

Of course, there’s a recursive “chicken and the egg” issue here: The machines that actually make human cells from DNA are… other human cells. But you need some information to get the loop started!

Here’s an analogy for programmers: Say I invent a new programming language called Dynoscript. (“Strongly typed immutable arrays and existential angst.”) And then I write a Dynoscript compiler in Dynoscript. Can I now run programs written in Dynoscript? No, because I have no way of running the compiler.

Where in DNA does it say that DNA is supposed to have a double-helix structure? Where does it say that “A” means C₅H₅N₅? That information is represented in the physical configuration of the atoms in the cell, and is physically propagated when cells divide. (I’m sure there are better examples, but biologists get very stressed when pressured to think this way.) I have no idea how to quantify the amount of “embodied information” like this, but I’m sure it’s substantial.

But if you still think DNA is too small and you don’t believe in embodied information, that’s fine, because…

“Learning” is just one execution of the inner loop

Here’s a cartoon showing how I think about the different contributions to human and artificial intelligence:

[Cartoon: the contributions of evolution and lifetime experience to human and artificial intelligence]

All of my test readers said that cartoon was confusing and futilely begged me to delete it. So probably I should explain.

We have been optimized by evolution. Partly evolution tuned our base instincts, like “food good” or “hypothermia bad”. But it also tuned the algorithm that we use to learn from our life experiences.

Human intelligence is the product of a “double-loop” optimization. In the outer loop, evolution tinkers with different learning strategies. In the inner loop, we are born as babies, we grow into adults following the strategy evolution gave us, we reproduce (or not) and we die. Then evolution picks the strategies that led to more offspring and uses them as the basis for further tinkering.

AI models are the product of a double-loop optimization, too. In the outer loop, human engineers tinker with different machine learning tricks. In the inner loop those algorithms are loaded into a giant cluster and run against data. The engineers pick out the strategies that work and repeat.
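If the cartoon isn’t landing, here’s a toy double loop in code. To be clear: this is something I made up for illustration, not anyone’s actual training pipeline. The inner loop is ordinary learning; the outer loop mutates the learning strategy (here, just a step size) and keeps whatever learns best.

```python
import random

def inner_loop(lr, steps=50):
    """One "lifetime": learn to minimize (x - 3)^2 from a random start."""
    x = random.uniform(-10, 10)
    for _ in range(steps):
        x -= lr * 2 * (x - 3)    # one gradient-descent step
    return (x - 3) ** 2          # final loss: lower = more "fit"

# Outer loop: mutate the strategy and keep whatever learns better.
strategy = 0.001
for _ in range(100):
    mutant = strategy * random.choice([0.5, 2.0])
    if inner_loop(mutant) < inner_loop(strategy):
        strategy = mutant

print(f"evolved learning rate: {strategy:.3f}")
```

Run this and the outer loop typically settles on a step size near 0.5, which makes the inner loop converge almost instantly. Evolution’s outer loop is like this in spirit, except it has executed vastly more times, each time on fresh data.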

The most salient difference between these is who is in charge of the outer loop. But don’t get distracted by that.

The most important difference between these is that the evolution outer loop has executed many, many more times.

Also important is that each iteration of the evolution outer loop runs on “fresh data”. Imagine that GPUs became billions of times cheaper, and you hired billions of engineers so you could test billions of machine learning tricks in parallel. Problem solved? Not necessarily, because you’ll eventually “overfit” to whatever data you hold out to test generalization. There’s only so much you can squeeze out of a finite dataset.

Each bit from evolution integrates experience from millions of years of life and so may have a “multiplicative” effect on how effective in-lifetime learning is.

The human learning strategy might be vast and inscrutable

Even though it comes from evolution, humans are still using some learning algorithm. Almost by definition, isn’t it possible to build an AI using the same tricks?

In principle, yes! But it might be very hard in practice. The key question is to what degree the human learning strategy “makes sense”. If it’s something simple, then probably we’ll eventually copy it. If it’s a collection of millions of unintelligible interacting “hacks” tuned to statistical properties of the environment, then maybe not.

Just because humans learn efficiently doesn’t necessarily mean their strategy will be easy to copy.

Caution on these cautions

Now hear me. I am not arguing that we will hit a data wall. Just because humans don’t need to pre-train on visual data doesn’t mean that visual data won’t be useful for AI. And just because human learning strategies integrate vast amounts of information from our evolutionary history doesn’t mean that algorithmic progress is impossible. I’m just saying that if you want to argue for visual data or algorithmic progress or whatever, a direct argument is more convincing than gesturing at some human babies.


P.S. I’ve been feeling grumpy about algorithms deciding what we read, so I’ve decided to experiment with links to writing that I think deserves more attention. Today’s link is to Philosophical Multicore on Outlive: A Critical Review.

Well joke’s on you, I already exercised today, and now I’m back to over-analyze saturated fat. My assessment:

  • Saturated fat is unhealthy in expectation: likely true (credence: 85%).
  • It’s a good idea for most people to reduce their SFA intake: possible (credence: 50%).
  • It’s a good idea for people with high cholesterol to reduce their SFA intake: likely true (credence: 70%).
  • The data are unclear: unclear. (Yes, it’s unclear whether the data are unclear. It depends on how much clarity you want.)

In the same spirit, if you’ve written a blogpost-sized response to this—or any—post let me know and I’ll strongly consider sending it out at the bottom of a future post. (Unless it’s really bad, in which case I won’t.) And in the spirit of that spirit, if you liked this post, consider sending it to a friend.

Comments at substack.
