Theo's Substack
Theo Jaffee Podcast
#5: Quintin Pope

#5: Quintin Pope

AI alignment, machine learning, failure modes, and reasons for optimism

Introduction (0:00)

Theo: Welcome to Episode 5 of the Theo Jaffee Podcast. Today, I had the pleasure of speaking with Quintin Pope. Quintin is a machine learning researcher focusing on natural language modeling and AI alignment. Among alignment researchers, Quintin stands out for his optimism. He believes that AI alignment is far more tractable than it seems, and that we appear to be on a good path to making the future great. On LessWrong, he's written one of the most popular posts of the last year, “My Objections To ‘We're All Gonna Die with Eliezer Yudkowsky’”, as well as many other highly upvoted posts on various alignment papers, and on his own theory of alignment, shard theory. This episode is the most technical one I've ever done. We dive into definitions of AGI, doomer arguments such as orthogonality and instrumental convergence, analogies between AI and evolution, how humans and AIs form values, AI failure modes like reward hacking and mesa-optimization, and much more. This is the Theo Jaffee podcast. Thank you for listening. And now, here's Quintin Pope.

What Is AGI? (1:03)

Theo: Welcome back to episode 5 of the Theo Jaffee podcast. Today, we're interviewing Quintin Pope.

Quintin: Hello. I'm delighted to be here. I will do my utmost to present my perspective.

Theo: Awesome. So I guess we'll start with some of the more topical news this week, which is rumors of AGI out of OpenAI, or more accurately, inside of OpenAI. For example, Sam Altman commented on Reddit for the first time in eight years to say AGI has been achieved internally, only to then correct himself. He said, “Edit, this was obviously just memeing. It was just a joke. You guys have no chill. When AGI is achieved, it will not be announced through a Reddit comment.”

So do you think that OpenAI may have achieved AGI? And if so, what do you think we should expect over the coming weeks, months, couple of years? It's harder to predict outside of that.

Quintin: Yeah, so I think AGI is like this useless word that a bunch of different people have different ideas of. And so when you say AGI, you're conveying very little information about the actual capabilities and behavioral patterns of whatever system you're referencing. If you just look at the literal words in artificial general intelligence, it seems to me pretty straightforward that we've achieved AGI in terms of like GPT-3 or even GPT-2. I mean, those are artificial systems. They're somewhat general across the distribution of text. Obviously, an AGI can't be required to be totally general, because there's no such thing as a totally general system. And they're not very intelligent, but I think they are kind of intelligent. So I think you're not clearly wrong or not definitionally wrong to call even like GPT-2 an AGI.

And so what the term AGI ends up referring to is just like the vibes associated with the system, or maybe some individual person's level of impressedness with the system, or whether they can imagine that system starring in like a sci-fi movie where one of the characters is called a quote-unquote AGI.

Theo: Let's say like an AI that is smart and capable enough to do whatever a, let's say, 90th-percentile-IQ human can do over a computer.

Quintin: Then you get into the issue of how strict your bounds are, because the distribution of intellectual capacities that humans acquire, or the distribution of capabilities that humans acquire at a given quote-unquote generality level, versus those that an AI achieves at that same generality level, or let's say economic usefulness level. These are very different.

And so I think even for quite powerful and general systems, there's going to be things that they can't do, which humans can pretty easily, even when you don't limit it to the obvious stuff of moving around. So, for example, take ChatGPT's recent public augmentation with a vision system. If you've seen on Twitter recently, people have tried it with those sorts of text-to-image generation models that have some hidden message encoded in them with ControlNet. So like the image of the hippies whose clothing is arranged strategically to spell out the word LOVE as a sort of pseudo visual illusion. People have submitted those images to ChatGPT, and it largely cannot recognize words encoded in images in ways that are quite obvious to human vision. I expect there are other bundles of weird capabilities that are going to be lacking in even a system that you might intuitively want to call an AGI, or even a strong AGI.

Theo: Do you think similarly, there are capabilities that GPT-4 has that humans don't, or at least not as easily?

Quintin: Yeah, I mean, this is like clearly true, right? So word prediction, next word prediction is like what they're literally trained to do. And if you compare human performance on next word prediction versus even like GPT-1, that very weak, very simple system just completely smokes us. Now admittedly, maybe like if you as a human decided to spend a thousand hours becoming really good at word prediction, you'd do better, but like there's different dimensions of capabilities that language models versus humans acquire with different rapidity.

Theo: Well, when we talk about capabilities of GPT-4, we're typically talking about capabilities, not in the sense of what it was directly literally trained to do, like predicting tokens, but in the sense of stuff that it was not directly trained to do, it still has the ability to like write code. So do you think there are any abilities in there that it can do better than humans yet?

Quintin: I mean, it was directly trained to write code. Right. You can describe the pre-training process where code was part of the data it was pre-trained on as training to predict the next token, or describe it as training to write code. These are just differences in the way you describe the thing; they point to equivalent mathematical structures. Yeah, that's one thing that often annoys me about discussions of language models, is that people will talk about them spontaneously acquiring the ability to play chess or whatever.

Theo: I remember you tweeting about that.

Quintin: Yeah. They were trained to do this explicitly, directly. There's this further question of generalization behavior beyond the training data. This is a huge collection of open questions about how a model behaves in situations that aren't particularly similar to anything it was explicitly trained on. But discussing what portions of GPT-4's behavior are generalizations away from its training data versus good modeling of its training data is very difficult because we don't know what data it was trained on. OpenAI has spent huge amounts of effort to acquire data that's as useful as possible for making GPTs behave well or perform impressive feats on the sorts of problems that people want them to perform on.

Wrapping back to your question about implications and what the future is going to look like, we had this giant diversion of talking about definitions of AGI, which maybe went on a bit longer than I intended it to. But the point I wanted to eventually wrap around to is that you should pretty much always talk in terms of specific descriptions of the model's actual capabilities or behavioral tendencies in various domains. That way you can actually say something that has a relatively consistent meaning for different people either saying or hearing that thing. Then you can actually get communication going instead of stumbling around different people's collections of intuitions regarding this mysterious word AGI.

There have been various rumors out of OpenAI that they've made the next step in language modeling or even multimodal modeling capabilities. I think that's plausible. I think it would be kind of weird to be in a situation where the state of the art for natural language capabilities had been stuck at GPT-4 for, what is it, about a year?

Theo: Yeah. They started pre-training about a year ago or maybe more than a year ago.

Quintin: In terms of what this actually means for specifically what an AI system can do, I guess I more or less expect a slight step forward in capabilities in the ability to answer questions, the ability to avoid making stuff up, the ability to write useful code, and so on. That is roughly the difference between GPT-3.5 and GPT-4, but potentially a little smaller than that, reflecting the apparent diminishment in the rate at which investment in frontier models increases.

Theo: Investment in terms of what? Money, compute, data, all of the above?

Quintin: Well, I was specifically thinking of compute. If you look at the progression in the relative jumps in compute invested from GPT-1 to GPT-2 to GPT-3, and you extrapolated that exponential out to the time period when GPT-4 was finished training internally but not yet released, you'd have overestimated the amount of compute that went into GPT-4 by roughly a factor of 1,000, at least going by public estimates of how much compute went into GPT-4.

What Can AGI Do? (12:49)

Theo: Going back to when we were talking about chess, because I remember you're tweeting about this, there are people saying this model just spontaneously learned how to play chess at a very impressive level. And you were saying, no, it was directly trained on the internet, which probably included large chess data sets. So why do you think that GPT-3.5 Turbo Instruct does so much better on chess than GPT-3.5 Turbo with chat fine-tuning and RLHF?

Quintin: Well, it depends. It's very hard to say, because we don't know what data the systems are trained on. Worst case, it could just be that OpenAI decided to mix in some explicit chess training data into turbo-instruct's data set. There's no law of physics that prevents that from being the explanation. A lot of people tend to assume that the RLHF fine-tuning damages model capabilities. And I saw that as an explanation bandied about for why turbo-instruct can do chess, whereas the chat model can't. And I mean, that's potentially the answer.

As I remember, there were comparisons of the impact that RLHF fine-tuning had on GPT-4's performance across various benchmarks. Well, not so much benchmarks as exams, or benchmarks for humans, I guess. And it did change some of its performances in some of the categories. It made it significantly worse at economics, for example. But it also made it better on some other categories. And the overall result was mostly a wash. So I don't really believe RLHF fine-tuning is, in general, in expectation, going to reduce the capabilities of your model. But it could have, just by chance, shuffled quite a bit of capability away from chess and more towards other domains.

And maybe you can tell a story where the RLHF fine-tuning process that went into producing the chat version of the GPT model never had chess games in it, I suppose. I think very few people use ChatGPT to play chess. And maybe that was very much not emphasized in whatever RLHF training process OpenAI did with the model. And so maybe it was just ordinary catastrophic forgetting, if you're familiar with that, in machine learning parlance.

Theo: Going back to what you said earlier, where you said, there's no law of physics that prevents that from being the explanation. That sounded very Deutschian. Are you familiar with David Deutsch? Have you read The Beginning of Infinity?

Quintin: I did read it when I was quite young, maybe 14. I'm not sure, but I have read it in the past. In terms of why I said it, though, I don't think it was a latent reference to anything in that book or that he's written. It's more because I've been recently talking with people who seem to hold their own speculation to have the evidentiary weight of a physical law. That sort of point of comparison was more of a reminder to myself.

Theo: Have Deutsch's ideas about AI and AGI influenced you at all, particularly his characterization of a true AGI as equivalent to being a person, in that they're both knowledge-creating entities?

Quintin: I didn't even know that was how he characterized a true AGI. Having just heard that description of his characterization from you, I think it's kind of ridiculous. LeNet, the ancient LeNet architecture, you train it on CIFAR-10 or whatever, it gains knowledge. It's not an AGI. There are lots of things in the world that gain knowledge, that have some sort of learning process happening to them, and they gain knowledge over time. And very few of them, even given how vague and broad AGI is as a term, very few of those things are usefully described, or at all described, as AGIs.

Theo: Not so much gaining knowledge as creating knowledge. David Deutsch has said on many occasions, what GPT is doing is, it's just interpolating based on its training data. It has yet to produce any kind of foundationally new knowledge. If you were to train a GPT on all scientific texts and real-world data from before 1900, would it have been able to derive quantum mechanics? Derive or conjecture quantum mechanics, as he would say.

Quintin: Not on the data that we have from 1900, but I think if you took a GPT and trained it on more data points sampled from that underlying distribution, and then you had some sort of self-distillation or speculate-and-check process, it could work. So the GPT has been extensively trained on 1900s scientific thinking and processes and theories and experimental results. Then you have the GPT generate some hypotheses about how to extend those results, and then check those hypotheses according to its own learned collection of heuristics slash intuitions about what good hypotheses look like. I think it could progress non-trivially in terms of moving beyond the knowledge distribution in that 1900s training data.

What that is doing is relying on the fact that, for a given distribution, you can very often produce discriminators that are better than sampling from that distribution's generator. So you can sort of guess and check. You can sample from the distribution of knowledge of 1900s scientific thinking, and then check using the 1900s criteria for what is good or bad scientific thinking. And then I think this lets you inch forward a bit.
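The guess-and-check dynamic Quintin describes can be sketched in a few lines. This is a toy stand-in, not a real model: the generator and discriminator below are made-up placeholder functions, and the task (guessing a number near zero) is purely illustrative. The point is just that filtering samples through a checker yields better outputs than the generator alone produces.

```python
import random

def generator():
    # Stand-in for sampling a hypothesis from a model's learned distribution:
    # here, just a noisy guess at some unknown quantity (true value 0).
    return random.gauss(0.0, 1.0)

def discriminator_score(hypothesis):
    # Stand-in for learned heuristics about what a good hypothesis looks like.
    # Recognizing closeness to 0 is easier than generating exactly 0.
    return -abs(hypothesis)

def propose_and_check(k):
    # Sample k hypotheses and keep the one the checker likes best.
    candidates = [generator() for _ in range(k)]
    return max(candidates, key=discriminator_score)

random.seed(0)
n = 2000
plain_err = sum(abs(generator()) for _ in range(n)) / n
checked_err = sum(abs(propose_and_check(16)) for _ in range(n)) / n
print(plain_err, checked_err)
```

The best-of-16 error comes out far smaller than the single-sample error, which is the sense in which a discriminator can push you beyond what the generator reliably emits.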

Theo: Yeah, that does sound quite like Deutsch's process of conjecture and criticism. At least, a lot more so than what today's GPTs are doing.

Quintin: But today's GPTs do do this, right? The base pre-training objective doesn't do this, of course. But once you have a trained GPT, it's not particularly uncommon to use its outputs in its own training process or the training process of other models. This is how constitutional AI works. But, of course, they're not doing this for scientific knowledge. They're doing it for alignment knowledge. So there you have the AI generating behavioral trajectories and then sort of constructing an on-the-fly discriminator or critiquer model by giving the AI some of the principles of the constitution and having it check whether its generated trajectories were appropriate and rewrite them to be more appropriate and then train on that rewritten data. And there’s also an RL step that I’ve kind of forgotten, but it’s in the same ballpark of self-critiquing: do a thing, and then assess how well you’ve done it, and then try to do better in the future.
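As a very rough sketch of the loop Quintin outlines (generate, critique against a principle, rewrite, train on the rewritten data): in real constitutional AI every step is performed by a language model, whereas here the model, critique, and revise functions are trivial hypothetical stand-ins, so this only shows the shape of the pipeline.

```python
# The "model" here is just a lookup table of canned completions; in real
# constitutional AI, generation, critique, and revision are all done by an LLM.
def generate(model, prompt):
    return model.get(prompt, "I don't know.")

PRINCIPLE = "avoid calling things 'stupid'"  # a one-clause toy constitution

def critique(response):
    # Does this trajectory violate the constitution?
    return "stupid" in response

def revise(response):
    # Rewrite the trajectory to be more appropriate.
    return response.replace("stupid", "unwise")

def constitutional_round(model, prompts):
    # Generate, self-critique, rewrite, then "train" on the rewritten data
    # (training is just overwriting the lookup table in this toy version).
    for prompt in prompts:
        response = generate(model, prompt)
        if critique(response):
            model[prompt] = revise(response)
    return model

model = {"q1": "That plan is stupid.", "q2": "Sounds reasonable."}
model = constitutional_round(model, ["q1", "q2"])
print(model["q1"])  # → That plan is unwise.
```

Only the trajectory that violated the principle gets rewritten and trained on; the compliant one passes through untouched, mirroring the do-assess-improve loop described above.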

Orthogonality (23:14)

Theo: Speaking of which, what do you think about constitutional AI as a path to alignment? Is it, could it work? Is it doomed by definition? And if so, why?

Quintin: I think that doomed by definition is sort of an insane thing to think about anything in the ballpark of RL, just because reinforcement learning is this incredibly general and incredibly powerful framework for approaching a huge array of causal problems. And of course, constitutional AI is a more narrow set of techniques than general RL, than general reinforcement learning. But with appropriate data distributions and appropriate caution, I do think it's a solution to alignment. I honestly think that supervised fine-tuning, or just normal pre-training on an appropriate data distribution, is a solution for alignment. But that's not an ideal approach, because it requires you to have very good data, and it's not currently clear how to get data good enough for that to work.

Theo: Yudkowsky would disagree with you on that, which is why I asked if you think constitutional AI is doomed by definition. Yudkowsky and a lot of other people of his intellectual school seem to think that any kind of attempt at aligning AI that has the AI in the process, especially as a sort of judge of its own alignment methods, is doomed because it will train the AI to lie and deceive us in the process of making itself more powerful, instrumental convergence, et cetera, and then we have an unaligned AI.

Quintin: I just don't buy any of the premises underlying that sort of reasoning. I don't think instrumental convergence… So, we can't possibly live in a world where this is true in generality, because when you make these conclusions that, hmm, how do I put this? Okay, so some context is that training data is extremely important for machine learning. From all the results of classical learning theory, from the academic pursuit of machine learning, from all the industry experience with using machine learning systems for actual real-world purposes, and from all the recent progress on the best ways of training models, from Textbooks Are All You Need and so on and so forth, it's clear that training data is very important for how AI systems behave. Whenever someone makes an argument that concludes how AIs will behave without making any reference at all to their training data, such that the argument applies equally well to every AI system regardless of training data, I'm extremely skeptical of those sorts of arguments.

Theo: One of Yudkowsky's most popular articles on LessWrong, “AGI Ruin: A List of Lethalities”, begins with, “If you don't understand what orthogonality and instrumental convergence are or why they're true, you need a different introduction.” It's so integral to his doom argument that he doesn't take objections to it very seriously.

Quintin: Different people mean different things when they say the word orthogonality. The original conception by Bostrom was very vague. He described it as the hypothesis that goals and intelligence are these orthogonal axes, and that it's possible to vary them independently of each other. This statement is too incoherent to have a truth value, I think, because intelligence and goals are not dimensions. They're not axes in a space. IQ is an extremely leaky measure, even for humans.

If you're talking about the entire space of algorithms which could be described as intelligent, how do you group them into bands of equivalent intelligence? It's just, I don't think there's a way to do this which is meaningful in a non-trivial sort of way.

Ignoring the fact that it's too ill-posed to actually analyze, the orthogonality thesis seems like the sort of thing which, just intuitively speaking, when you hear it, your immediate reaction should be that this is almost certainly false. There's this entire space of intelligence or ways to parameterize intelligences, and then there's this entire other space of ways to parameterize goals. Orthogonality is making this very specific claim about how these two spaces are geometrically structured with respect to each other. Unless you have very strong mathematical reasons for thinking that a specific claim of this type is true, your default assumption should be that it's false.

Even in Bostrom's original description of orthogonality, he has a few caveats. The orthogonality thesis doesn't apply to goals a given level of intelligence is too dumb to understand. I think that's one of the caveats he gives. My reaction to this is that if you have appropriately tuned mathematical intuitions about the sorts of conjectures that turn out to be right, then hearing a conjecture and immediately seeing a handful of clear exceptions to that conjecture should tell you that the conjecture in general is wrong. Or you should expect it to be in general wrong.

So, my first reaction to orthogonality as a concept is that it seems probably wrong almost no matter how you define it. And my second reaction is that even if it were correct, even if you could define it precisely enough that it was meaningful, and even if you then showed that it actually held, which I think would be one of the most amazing and impressive feats of formalization and mathematical argument ever achieved in human history. Even if you could do that, so what? Even if you have an argument about the structure of the space of possible minds, you don't have the probability distribution over that space that a particular way of producing minds induces. You need to have some distribution over the space, and some mapping between the space of possible minds and the actual behaviors of the minds we get in reality, in order to make any sort of argument about reality on the basis of how the space of possible minds is structured.

Theo: I think Yudkowsky means with orthogonality, he intends less to make some kind of strictly formal mathematical claim about the nature of intelligence and more to simply say, in more human explainable terms, it's possible to make an intelligence that values something totally arbitrary; that might value something extremely different from what you value, basically that a paperclip maximizer is possible.

Quintin: Yes, it's obviously possible to create intelligences that are bad from your perspective. But in order for this clear existence statement to be translated into any sort of probabilistic argument about the types of intelligences that a given alignment proposal or training approach might produce, you need something much more than "there can exist a bad outcome in the space of possible outcomes", which maybe this training approach isn't even capable of producing. Maybe you need some other approach to produce this bad outcome.

Theo: I think you have another disagreement with Eliezer in that he thinks that the space of all minds is just tremendously vast, and the human mind space is just a teeny, tiny little target point that you'd have to get extremely lucky to hit, while the space of minds that are hostile to us is infinitely large.

Quintin: I think this is an absurd argument, and the ultimate reason it's absurd is because it doesn't engage with exactly what I've been pointing to: how do you map from this space of possible minds to the space of actual minds that a given training approach is capable of producing? Let me give you a structurally equivalent argument explaining why you're likely to die of overpressure or be torn apart by extreme winds. The space of possible pressures you could potentially be experiencing is vast. The distribution of air particles in the room you're in assigns roughly uniform probability to all the possible configurations of particles in the room. Some of those configurations are such that there's a huge amount of pressure on any given surface. You can just randomly, by chance, have a lot of particles really close to you. And if that happens, they'll exert pressure on you. So the space of possible pressures you could be experiencing is huge. The space of survivable pressures that are consistent with you not being torn apart is relatively tiny compared to that space of possible pressures. And if you just compare the sizes of these two spaces, you might think that you're about to be torn apart by extreme wind pressure.

But this argument is wrong because it's applying the counting argument to the wrong space. It's enumerating the space of possible outcomes and comparing that to the volume of desirable outcomes. What's being randomized here isn't the possible outcomes. It's the possible parameterizations, the possible configurations of the gas particles in the room. It turns out that the mapping from the space of possible gas particle positions to the space of possible pressures that you actually experience is what's called a compressive mapping, which just means that a huge volume in the space of possible gas particle configurations is compressed to a very narrow range in the space of possible pressures.
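The pressure example is easy to check numerically. Here is a minimal sketch, using the fraction of particles in the left half of a box as a crude stand-in for pressure: the space of configurations grows astronomically with particle count, yet the observed values concentrate ever more tightly.

```python
import random

def fraction_left(n_particles):
    # One random configuration: each particle's x-coordinate is uniform in [0, 1).
    # Use the fraction of particles in the left half as a crude pressure proxy.
    return sum(random.random() < 0.5 for _ in range(n_particles)) / n_particles

def observed_range(n_particles, trials=500):
    # Width of the range of the proxy across many random configurations.
    samples = [fraction_left(n_particles) for _ in range(trials)]
    return max(samples) - min(samples)

random.seed(0)
spread_small = observed_range(10)
spread_large = observed_range(2000)
print(spread_small, spread_large)
```

With 2,000 particles the configuration space is unimaginably larger than with 10, but the observed spread of the proxy is far narrower: a compressive mapping in miniature.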

This property of mappings is extremely common in both mathematics and the world in general. For example, in mathematics, suppose you have a hypersphere of dimension n. You pick a random point inside that hypersphere. Then you map from the coordinates of that random point to its radial distance from the center of the hypersphere. As you make the dimension n very large, this mapping will increasingly concentrate probability mass towards the surface of that hypersphere. So you pick a random point in that hypersphere. If the dimension is high enough, then you almost surely get a point that's right near the surface, despite the fact that the range of possible radii is much larger than the narrow band of radii that correspond to the surface.
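The hypersphere claim can be verified directly. For a point sampled uniformly from the n-dimensional unit ball, the volume inside radius r scales as r^n, so the radius has CDF r^n and can be sampled as U^(1/n); a quick simulation then shows the concentration near the surface:

```python
import random

def ball_radius(n_dim):
    # Radius of a point drawn uniformly from the n-dimensional unit ball:
    # volume inside radius r scales as r^n, so the radial CDF is r^n and
    # the radius can be sampled directly as U^(1/n).
    return random.random() ** (1.0 / n_dim)

random.seed(0)
trials = 10_000
results = {}
for n in (2, 20, 200):
    radii = [ball_radius(n) for _ in range(trials)]
    mean_radius = sum(radii) / trials
    near_surface = sum(r > 0.95 for r in radii) / trials
    results[n] = (mean_radius, near_surface)
    print(n, round(mean_radius, 3), round(near_surface, 3))
```

At n = 200, essentially every sampled point lies within 5% of the surface, even though radii anywhere in [0, 1] remain geometrically possible.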

Similarly with weather, or even just your body: if tiny microscopic fluctuations corresponded to very large changes in the functional behavior of the system, we'd all die very quickly. And in terms of machine learning, if you train a model on some data, what's being randomized during that training process is not the way the model interpolates that data. It's not the function the model learns from the data. It's not the, quote unquote, utility function of the model, if they even have such a thing. It's the parameters of the model. That's the thing which has a high degree of variability. The variability of the outcomes that actually matter is determined by the mapping from the randomized parameters to the functional behavior of the model. This is what's called the parameter-function map in machine learning theory. The parameter-function maps for the good architectures that we train are very specifically chosen to be highly compressive.

There's a paper called "Deep Learning Generalizes Because the Parameter-Function Map is Biased Towards Simple Functions," which evaluates this quantitatively, and various other works building on it as well. Not recognizing the distinction between applying counting arguments to the space of possible outcomes versus applying them to the space of the things that you're actually randomizing is basically why classical learning theorists didn't think that deep learning would work, if you're familiar with that discussion.
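A minimal experiment in the spirit of that paper (not a reproduction of it): sample random parameters for a tiny network computing boolean functions of 3 inputs, and count which of the 256 possible functions each parameter draw lands on. If the parameter-function map were not compressive, each function would appear with frequency about 1/256.

```python
import itertools
import random
from collections import Counter

INPUTS = list(itertools.product([0.0, 1.0], repeat=3))  # all 8 binary inputs

def random_net_function():
    # A tiny 3-4-1 ReLU network with Gaussian random parameters, thresholded
    # at 0; returns the boolean function of 3 inputs this draw computes.
    w1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
    b1 = [random.gauss(0, 1) for _ in range(4)]
    w2 = [random.gauss(0, 1) for _ in range(4)]
    b2 = random.gauss(0, 1)
    table = []
    for x in INPUTS:
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(w1, b1)]
        out = sum(w * h for w, h in zip(w2, hidden)) + b2
        table.append(out > 0)
    return tuple(table)

random.seed(0)
draws = 10_000
counts = Counter(random_net_function() for _ in range(draws))
top_freq = counts.most_common(1)[0][1] / draws
print(len(counts), round(top_freq, 3))
```

The most common function shows up orders of magnitude more often than the uniform rate of about 0.4%: a huge volume of parameter space maps onto a handful of simple functions.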

Theo: Kind of.

Quintin: Classical learning theorists, or, before deep learning, if you took a course on introductory learning theory, they'd have a lecture where they talk about the dangers of overparameterization. They'd draw out five different points on the blackboard and say, these are your data points, and you want a good function that interpolates through these data points. Then they'd show that you can draw a huge number of very squiggly functions that all pass through those five data points, but are very off for all the positions in between those data points and all extensions beyond them. They'd say, well, there are clearly an enormous number of functions that correctly fit the training data, but generalize very poorly. So you need to constrain the space of possible functions to ensure that the only functions that fit the data are also functions that generalize well. Because if you don't do this, you just compare the counts of the number of functions that generalize poorly versus the number that generalize well, and surely you'll get a poorly generalizing function with very high probability. That was the sort of intuitive argument. The reason this is wrong is exactly the same reason that arguments about the vast space of possible goals are also wrong. It's doing the counting argument on something other than the thing actually being randomized. The classical learning theorists are pointing to the functions that the model learns, not its parameterization space. It turns out that in deep learning models, the mapping between parameters and functions is such that it concentrates a huge volume of possible parameterizations into a very narrow range of smooth functions that behave well when interpolating between the training data.
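A small numerical illustration of this point, assuming NumPy is available: with 21 polynomial coefficients and only 5 data points there are infinitely many exact interpolants, but the minimum-norm one (a crude stand-in for the bias a real training process has) behaves far better between the data points than a randomly perturbed one.

```python
import numpy as np

rng = np.random.default_rng(0)

x_train = np.linspace(-1, 1, 5)   # the five blackboard data points
y_train = x_train                 # the "true" function is simply y = x
x_test = np.linspace(-1, 1, 201)

def features(x):
    # Degree-20 polynomial features: 21 parameters for only 5 data points.
    return np.vander(x, 21, increasing=True)

A = features(x_train)

# Minimum-norm interpolant: a crude stand-in for the biased solution that a
# real training procedure tends to find.
w_min = np.linalg.pinv(A) @ y_train

# Another exact interpolant: add a random null-space direction. It fits the
# five training points equally well but is free to squiggle everywhere else.
v = rng.normal(size=21)
w_squiggly = w_min + 10 * (v - np.linalg.pinv(A) @ (A @ v))

for w in (w_min, w_squiggly):
    assert np.max(np.abs(A @ w - y_train)) < 1e-6  # both fit the training data

err_min = np.max(np.abs(features(x_test) @ w_min - x_test))
err_squiggly = np.max(np.abs(features(x_test) @ w_squiggly - x_test))
print(err_min, err_squiggly)
```

Both solutions are counted equally by the classical argument, since both interpolate the data exactly; the off-data behavior differs wildly, and which one you get is decided by the bias over parameterizations, not by counting functions.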

Mind Space (42:50)

Theo: You wrote a lot of these objections to Yudkowsky's ideas in a very viral and successful LessWrong post called “My Objections to ‘We're All Gonna Die with Eliezer Yudkowsky’”. I want to ask you about one specific thing you said in there where we're talking about exactly this. You quoted Yudkowsky on the width of mind space where he said, “the space of minds is very wide. All the humans are in, imagine like this giant sphere and all the humans are in this one tiny corner of the sphere. And we're all basically the same model of car running the same brand of engine. We're just all painted slightly different colors”. And you said, “I think this is extremely misleading. Firstly, real-world data in high dimensions basically never look like spheres. Such data almost always cluster in extremely compact manifolds whose internal volume is minuscule compared to the full volume of the space they're embedded in. If you could visualize the full embedding space of such data, it might look somewhat like an extremely sparse hairball of many thin strands interwoven in complex and twisty patterns with even thinner fuzz coming off the strands and even more complex fractal-like patterns with vast gulfs of empty space between the strands.” So can you explain a little bit more the embedding space, what you meant by this hairball with fuzz, fractal patterns with vast gulfs of empty space?

Quintin: Okay. There's an image which shows what I'm talking about. This is from a paper I published, well, made available, it hasn't been peer reviewed yet. There's this one image of some data that we were training on. Can you see the screen I'm sharing?

Theo: Yes.

Quintin: This is some stuff about microbiology. You can read the paper if you're curious, but details don't really matter. The interesting thing I think about this data is we've taken some pretty high dimensional data, well, not high dimensional by modern standards, but like a few hundred dimensions. And we've projected it down to two dimensions.

And it's like a really cool projection in my mind because you can see these different manifolds of different dimensionality. So there's this one big manifold, right, which squiggles from the upper right to the lower left, where most of the data lies. And it sort of has this singularity at one end where it collapses its internal dimensionality.

And so the intrinsic dimension that I refer to in that post is asking the question: suppose you're confined to just a particular portion of the data manifold. How many numbers do you need to specify your location in that manifold?

And you can see here with the big squiggle that this value is changing, or probably changing, as you move to the upper right, because all the data here is in this line. So you need like one dimension to tell you where you are when you're in the upper right degenerate region. But then as you move down and further out, it sort of expands a bit, and now you need more dimensions.

Of course, in the original space, you need more than one or two dimensions, because the manifolds are higher dimensional in that space. But it gives you the intuition of how the distribution of data can be composed of these different components that have different intrinsic complexities to them.

And you can also see these sort of disconnected submanifolds, the squiggles above and around the main manifold. Notice those are also one dimensional in the reduced space. And you can kind of also see the way the submanifolds blend into and sort of wrap around the big manifold.

So there are the salmon colored lines that are near to the big manifold, but still their own distinct thing. And that's a bit of what I was getting at with the fractal spider web structure I was referencing. In these two-dimensional projections, there's a lot of structure from the full dimensional space that isn't being displayed.

So in that full dimensional space, I expect these lines of the salmon colored dots to be like a bit more complicated than just straight. And maybe they have like a corkscrew shape or maybe they zigzag a bit or some weird higher dimensional pattern that I can't really describe.
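To make the intrinsic-dimension idea concrete, here's a minimal sketch (the data and method are invented for illustration, not taken from the paper under discussion). It estimates the local intrinsic dimension at a point by running PCA on the point's nearest neighbors and counting how many principal components are needed to explain most of the local variance:

```python
import numpy as np

def local_intrinsic_dim(points, idx, k=20, var_threshold=0.95):
    """Estimate intrinsic dimension near points[idx]: PCA on its k
    nearest neighbors, counting components needed to explain
    var_threshold of the local variance."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    neighbors = points[np.argsort(dists)[:k]]
    centered = neighbors - neighbors.mean(axis=0)
    # Squared singular values = variance along each principal direction.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), var_threshold) + 1)

rng = np.random.default_rng(0)

# A 1-D curve (a helix) embedded in 10 dimensions: one number
# (position along the curve) locates you on it.
t = rng.uniform(0, 4 * np.pi, 2000)
curve = np.zeros((2000, 10))
curve[:, 0], curve[:, 1], curve[:, 2] = np.cos(t), np.sin(t), 0.1 * t

# A 2-D sheet embedded in the same 10-D space.
u, v = rng.uniform(-1, 1, 2000), rng.uniform(-1, 1, 2000)
sheet = np.zeros((2000, 10))
sheet[:, 0], sheet[:, 1], sheet[:, 2] = u, v, u * v

print(local_intrinsic_dim(curve, 0))  # the curve is locally 1-D
print(local_intrinsic_dim(sheet, 0))  # the sheet is locally 2-D
```

Both datasets live in a 10-dimensional ambient space, but the estimator recovers the much smaller number of coordinates you actually need once you're confined to the manifold, which is the quantity Quintin is describing.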

Theo: So this is bio stuff with bacteria, but what would this represent in the context of human versus AI mind space?

Quintin: So in the human versus AI mind space, your data points are going to be minds, somehow embedded in some common representation space between humans and AIs. The question of how you'd do this at all is a major unaddressed issue with even thinking about things in this way.

Anyway, and then these, maybe I should get the image that I made for the post. And you can share that with the readers as well. With this image, I wanted to convey the notion that there's something similar to what you saw with the microbiome data going on with the way in which the human and AI minds are distributed within that space with respect to each other. There are manifolds of varying internal volume and dimensionality, which represent different proportions of the human and AI minds. These manifolds have their own internal structure and geometry that relate to how specific different minds differ in their behavioral patterns or internal representations. That depends on the details of how you made the embedding space. Most of the volume of this space will not correspond to a mind that's plausibly created from the ensemble of processes responsible for creating your AIs and your humans. But then the regions of space that are occupied by humans and AIs will form complicated patterns whose geometry encodes the constraints of possible minds that are formable by your mind forming process, as well as the tendencies of your mind forming process to produce various types of minds. Does that make sense?

Theo: Yeah, that clears things up a little, but…

Quintin: Okay, let me give you a concrete example. Let's say there are different colors of ice cream. Some humans like red ice cream and some humans like blue ice cream. This is a property of their mind, which is somehow encoded in that person's position in mind space. Let's be very simplifying and assume there's just one dimension that represents preferred ice cream color. If your mind has a positive value in dimension X, then you like red ice cream, and if you have a negative value for dimension X, you like blue ice cream. And then you can imagine doing the same thing for every other property of the mind, or every other behavioral pattern of a mind you can imagine. So you just have these trillions upon trillions of dimensions, and the position of a point fully characterizes all of its behavioral properties that you could possibly want to know about. The implication here is that most of this space is not occupied by any plausible minds, because the minds that actually arise in reality are going to explore a very tiny portion of the two to the power of trillions of possible locations you could be in.

Theo: Okay, that makes sense.

Quintin: And then further, consider your actual position in mind space. You can imagine this giant table of trillions of binary flags that determine where you are. If you look at the actual minds that exist in the world, say you're just looking at the human side of things, the shape of the positions that their individual flags put them in is not going to be a Gaussian. It's not going to be a uniform distribution or a cloud. It's going to be very narrow, twisty structures in a very specific pattern that reflects how people actually are in reality.
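The sparsity point survives even a very conservative back-of-the-envelope calculation (the numbers here are made up for illustration): with a mere 1,000 binary traits instead of trillions, the occupied fraction of mind space is already unimaginably small.

```python
import math

n_dims = 1000            # binary traits; the analogy imagines trillions
actual_minds = 10 ** 11  # roughly all humans who have ever lived

# The space holds 2**n_dims points. The occupied fraction, in log2 terms:
log2_fraction = math.log2(actual_minds) - n_dims
print(f"occupied share of mind space is about 2**{log2_fraction:.0f}")
```

With the trillions of dimensions in the analogy, the exponent becomes correspondingly more extreme; either way, almost all of the space corresponds to no mind that was ever actually formed.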

Quintin’s Background and Optimism (55:06)

Theo: Back to you, how exactly did you get started getting interested in AI and then AI alignment? And why did you choose to go into academia over industry?

Quintin: The reason I was interested in AI is because it's very obviously important. I thought that most cognition in the long term future is going to be AI cognition, and the best cognition is also eventually going to be AI cognition. So that very likely leads to worlds in which the cognition that determines how the future goes is, for the most part, AI cognition. So the most important thing is making sure the AI cognition is good. That motivates interest in AI, and further interest in alignment.

In terms of academia versus industry, I started my PhD program. I didn't do a computer science undergrad. I did a physics and applied math undergrad. At the point I finished undergrad, I had become convinced that AI was the most important thing, but wasn't really confident in being able to move into industry for AI at that point. Perhaps I should have. And further, there weren't really as many industry alignment labs at that time, four years ago.

Theo: Yeah, four years ago, there was MIRI and there were really not many people working on it. Maybe OpenAI, but they weren't explicitly doing foundational alignment.

Quintin: Do you know when the DeepMind alignment team started?

Theo: I don’t. Let’s Google it.

Quintin: I wasn't aware of them if those options existed at that time. So I decided to do a PhD in computer science to transition more towards AI after becoming more convinced it was an important thing.

Theo: You stand out among alignment researchers by being particularly optimistic. A lot of alignment researchers, maybe just by virtue of their career choice, seem to be very pessimistic about humanity's chances of making the future good with AI. So why do you think you are more optimistic than the average alignment person?

Quintin: Partially, I think it's because people who are more worried about the future of AI are more likely to talk about their worries. If you look at a poll of alignment researchers, the highest median odds of doom were around 30%, which is not wildly off from my 5%. I'm a reasonably strong outlier, but not a huge one, in terms of optimism levels. So why am I more optimistic? Partially, it's because I wasn't always this optimistic. I was once at, I don't know, 60%, 70% at least. Though at that time, I wasn't super thoughtful about characterizing exactly what my credence in doom was. But then I started thinking about things in what I think is a more principled way. The thing that really caused things to initially turn around for me was thinking about the question of mesa-optimization versus reward hacking. These are two stories of AI doom, or how it's supposed to arise, and they're almost maximally opposed to each other.

With reward hacking, it's like, oh, the AI will care so much about its reward signal that it optimizes the world on that basis, and then kills everyone. And then mesa-optimization is like, the AI will care so little about its reward that the reward signals we provide cannot possibly shape its final goals in any reasonable sense. And then it will have arbitrary final goals and optimize the world according to those and kill everyone. I was very struck by the thought that these cannot both be true, or these cannot both be reasonable… I shouldn't think of alignment in a way where both of these are reasonable outcomes.

Theo: Well, couldn't it be more like they're thinking of ways that it could go wrong? Because nobody knows exactly how, if AI were to go wrong, how exactly it would go wrong.

Quintin: I think this is not the correct way to think about alignment models. You should have a model of what deep learning does, how it works, what the inductive biases are, how values relate to training process, and so on and so forth.

Mesa-Optimization and Reward Hacking (1:02:48)

Quintin: Mesa-optimization and reward hacking seem like two extreme, opposite ends of the spectrum of possible outcomes. And you should have a model of deep learning processes such that, if it can narrow down the outcomes of deep learning, it should concentrate probability mass either near one of those outcomes or the other, or away from both of them. So if you imagine an axis of how much the model cares about reward, then mesa-optimization is on one far end, and reward hacking is on the other far end. And if your understanding of deep learning is such that you can narrow down your expectations, it seems weird that you would have an understanding that assigns high probability to both of the extreme ends. It seems like you should have one hump of probability that's either somewhere in the middle or close to one of the ends.

Theo: Which of the two do you think is more likely?

Quintin: I don't think either is very likely. In terms of my current epistemic position, being skeptical of both is one of its features. I don't view any of the stories of doom from machine learning as very plausible. I mean, I kind of have to be in that position, of course. But in terms of what led me to this point, it was this sense of: I shouldn't be in the epistemic position of thinking, oh yeah, I could totally see how reward hacking would happen, and oh yeah, I could totally see how mesa-optimization would happen. I shouldn't have models of deep learning which are that flexible.

Theo: Well, don't we have some empirical evidence of reward hacking? Like, for example, the boat game.

Quintin: I would say that's actually not evidence of reward hacking in the sense of the word that would represent a meaningful danger for alignment. That's because of the fundamental process of reinforcement learning. The reason why reward hacking is not that big a concern is that if you look at REINFORCE, the original REINFORCE algorithm, the way it works is the agent does a bunch of stuff, and then you compute gradients. Let's say you have five different trajectories that the agent executes, and in each of those trajectories, it makes 10 decisions.

For trajectory one, it made its 10 decisions, and you compute the gradient of each decision it made, at each of those 10 points, with respect to its parameters. Then you have this gradient direction, this direction in parameter space where, if you move in that direction, you update the model to be more likely to make that specific sequence of 10 decisions on that trajectory. Then you do this for all five of those different trajectories, and this gives you five directions in parameter space.

The thing to be aware of is that the reward did not influence these directions in parameter space at all. It was the actions that defined them. What reward does is determine the linear combination of those directions that you update the model on. The subspace that reinforcement learning is exploring is defined by the action trajectories that the agent takes during training. The only way the reward function influences things is by telling you which joint direction to move in within this subspace. There's no channel by which the conceptual essence of the reward, or the physical implementation of the reward counter on the GPU, enters into the actual changes in the network's parameters as a result of the RL training.

What reinforcement learning does is it reinforces the agent's tendency to take certain types of actions. It doesn't instill an essence of wantingness for the reward. Reward is a terrible word to use for what mechanistically should be called weighting of reinforcement or weighting of action representation in the update.
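A minimal sketch of the mechanics Quintin describes, using vanilla REINFORCE on a one-step, three-action problem (the setup is invented for illustration). The gradient directions are fixed entirely by which actions were taken; the rewards only supply the coefficients of the linear combination:

```python
import numpy as np

def probs(theta):
    """Softmax policy over three actions, parameterized by logits."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_prob(theta, action):
    """Gradient of log pi(action) with respect to the logits."""
    g = -probs(theta)
    g[action] += 1.0
    return g

theta = np.zeros(3)

# Suppose these five one-step "trajectories" were sampled from the policy.
actions = [0, 2, 1, 2, 0]

# Each trajectory defines a direction in parameter space, determined
# ONLY by the action taken, never by the reward.
directions = [grad_log_prob(theta, a) for a in actions]

# The reward function's only role: choosing the coefficients of the
# linear combination of those action-defined directions.
rewards = [1.0 if a == 2 else 0.0 for a in actions]
update = sum(r * d for r, d in zip(rewards, directions))

theta_new = theta + 0.5 * update
print(probs(theta))      # uniform before the update
print(probs(theta_new))  # the rewarded action becomes more likely
```

Swapping in a different reward function would change only the weights on those same five directions; it cannot move the update out of the subspace the action trajectories span, which is the sense in which reward "weights the reinforcement" rather than instilling a desire for reward itself.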

The reason the boat thing is not that concerning is because mechanistically, what happened during that training process is the boat just did a bunch of random actions. I don't actually know their exploration policy, but let's assume it was random. It did a bunch of actions, and then they computed the gradients of those actions with respect to the parameters, and updated the model in the direction that made it more likely to do the actions that got high reward. And so what happened is the boat learned to do the actions that got high reward, which in this case was going in circles.

But it didn't learn to want the reward. It learned to do the actions that got high reward, and that's why I think the boat thing is not that concerning. Some of those actions got more reward, some of them got less. The AI updated its policy to behave more like the sorts of actions that got more reward. One of those actions was to get the coins a bunch of times until the episode ended, which was very high reward. The model updated its future trajectories to be more like that. Eventually all of them were like that, and its policy degenerated into what you'd call reward hacking.

If you just look at the trajectory of training actions, if you watch the AI during training, then you would know exactly what it would do during testing, because during training there was just this very obvious drift in its behavior. This is a weird misgeneralization thing from the perspective of the training process. It did a thing during training, the reward function updated it to be more likely to do that thing in future training. Then it just kept on doing that thing in future training and testing.

Whereas the story of reward hacking that's concerning from the alignment perspective is one where there's a very big difference between train and test behavior: the agent has silently decided that reward is what really matters, and it behaves well during training until it has the opportunity to disempower you during testing in order to get more reward. During training, you didn't see the agent take over the GPU reward counter to get lots of reward, and then get lots of reward for having done that, and be updated to do that thing more often in the future.

Theo: For those viewers who don't know, the boat example was OpenAI a few years ago back in 2016 had some agents, AI agents, control boats in a racing game to see if they get the high score. The boat that got the high score ran around in a circle knocking over targets that gave it coins and then continued going until the targets respawned. Incidentally, the two people who wrote that article on the OpenAI website were Jack Clark and Dario Amodei, who led the split of Anthropic off of OpenAI.

Deceptive Alignment (1:11:52)

Theo: Another story of AI doom is the sharp left turn. This is probably the most famous, the most scary. It goes something like: even if you think the AI is aligned, whatever alignment techniques you're using, you can never assume it is, because, so the story goes, once it reaches a certain level of intelligence or capability, the AI will just turn on you for its own purposes. So what do you think about that?

Quintin: The thing that particularly struck me about the sharp left turn post is that it uses evolution as its key example of this happening in the past. I wrote an entire post about why this is nonsense. Evolution has no bearing on basically anything to do with AI or the predictions we should make for AI. I know you didn't make any reference to evolution when describing the sharp left turn. Do you want to focus on a more general version of the sharp left turn fears?

Theo: Yeah, a general version of just AI betraying us after it deceptively appears aligned.

Quintin: To further clarify, one thing you also left out of the sharp left turn threat scenario is that under the sharp left turn, as initially described by Nate Soares, this failure of alignment is imagined to couple with a vast jump in capabilities of the AI. So the AI simultaneously explodes in capabilities and also its alignment completely fails. Do you want to discuss this or do you want to discuss the general question of deceptive alignment without the associated capabilities jump?

Theo: General deceptive alignment.

Quintin: This is actually more along the lines of Evan Hubinger's primary threat model or the threat model he's discussed in more detail. This is the idea that you can have an AI system which during the training process forms its own goals and decides to play the training game as Ajeya Cotra, I think, puts it. And it realizes that in order to pursue goals other than what you have in mind, it needs to pretend to do well or it needs to actually do well on the training objectives of the training process.

Quintin: There's this paper, Risks from Learned Optimization, which… well, Risks from Learned Optimization is more focused on describing mesa-optimizers, so deceptive alignment is less of a focus. I guess the better reference is Evan Hubinger's more recent post, “How likely is deceptive alignment?”, where he argues that deceptive alignment is probable under the priors of how machine learning works, or the priors that machine learning applies to different circuit configurations.

In that post on how likely deceptive alignment is, Evan Hubinger describes two different biases that ML systems may have. One is the simplicity bias and the other is the speed bias. And he argues that simplicity bias points towards deceptive alignment and speed bias probably points away; well, he argues they both probably point in those directions. I guess my number one disagreement with Evan Hubinger is that I think the simplicity bias is secretly a speed bias. I think that neural networks have a strong inductive bias towards forming wide ensembles of many shallow circuits.

He says that you can sort of move the complexity of a given concept away from what the model has memorized and towards what the model figures out during runtime. And this means that the configuration of your model is less complicated because it figures some of that stuff out while it's running. And one way to make it figure out what you want to do during runtime is to give it some arbitrary goal and then just let it think about how to accomplish that goal. And since it needs to deceptively do well on the training data, it will during runtime figure out how to do well on the training data. And so the argument for simplicity pointing towards this happening is that since there are so many different arbitrary goals, this collection of arbitrary goals exceeds the volume of the correct specification of the one specific goal you have in mind for the model. Did that make sense?

Theo: Yeah.

Quintin: The reason he thinks speed points the way it does is because this moving of complexity from the network configuration to its runtime requires more to be done during runtime. Instead of just remembering the goal from your weights, you have to figure it out during a series of sequential forward passes or steps through the network or however your network works.

My disagreements with this characterization of the simplicity bias are twofold. One, I think it's counting over the wrong thing in order to determine how large the volume of parameter space corresponding to the deceptive versus non-deceptive models are. The argument that the deceptive model is simpler is that you require fewer bits of information to specify its goals because its goals can be arbitrary. But I think the thing you should actually be doing the counting over is the volume of parameter space.

If you imagine the deceptive model as having this module in it that figures out the correct goals, then you have to ask: how many parameters does it require to specify the forward pass of this module? Because in the actual neural network prior, runtime computation isn't free. It's less like Python code in terms of its description length and more like code in a language where recursion or loops aren't allowed, because each weight performs its computation and then passes it on to the next weight, and you need to specify each of these sequential weights.

This goes back to my statement that neural networks prefer to form wide ensembles of shallow circuits. So imagine a circuit that solves a problem in n sequential steps, versus two circuits that each solve the problem in n over two steps. How do those two different situations restrict the number of allowed parameter configurations? The two parallel circuits restrict them far less than the one deeper circuit. And the reason is that each of the computational steps in the deep circuit has to happen sequentially; you can't reverse their order. Whereas if you have two parallel circuits, you can exchange their relative depths with each other arbitrarily.

So there's this entire new group of permutations that the parallel circuits can undergo which the single deep circuit can't. And that means the number of parameter configurations corresponding to the two parallel circuits is much higher compared to the one deep circuit. And this means you have a simplicity bias which is effectively acting like a speed bias: a simplicity prior acting like a speed prior.
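A toy version of this counting argument (the framing as "orderings across layers" is my own simplification): a single circuit needing n sequential steps admits exactly one valid ordering of its steps, while two independent circuits of n/2 steps each can have their steps interleaved across the same n layers in C(n, n/2) ways.

```python
import math

def orderings_single(n):
    # One depth-n circuit: its steps must occur in exactly one order.
    return 1

def orderings_parallel(n):
    # Two independent depth-n/2 circuits: choose which of the n layer
    # slots host the first circuit; the rest host the second.
    return math.comb(n, n // 2)

for n in (4, 8, 16, 32):
    print(n, orderings_single(n), orderings_parallel(n))
```

The gap grows roughly like 2^n / sqrt(n), which is the sense in which the parallel, shallow decomposition corresponds to a vastly larger volume of parameter configurations than the single deep circuit.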

The other consideration has to do with counting over module configurations instead of counting over how having that module constrains the parameters. So you could make a very similar argument for deceptiveness where instead of arguing that the system will have this module with an arbitrary goal, you argue that the system will have this module which internally paints an arbitrary picture of a llama and then throws that away and then solves whatever training task you're actually doing this on.

And you could argue this by saying: imagine the set of configurations that just directly solve the task immediately, versus the set of systems that first internally paint a picture of a llama, discard that picture, and then solve the task. There are exponentially many pictures of llamas they could paint internally, so there are many more possible llama painters than there are direct solvers. But the thing that actually matters is how much having this module, or the computational steps associated with it, constrains the volume of parameter space that corresponds to the system in question. And because the direct solver doesn't have that module at all, its parameters are much less constrained.

Shard Theory (1:24:10)

Theo: I'd love to move on a little bit and talk about your approach to alignment, which is shard theory, where you talk about how humans form values in a particular way and how we can apply that to AI alignment. How would you explain that to a relative beginner with a technical background?

Quintin: Shard theory is this thing I and Alex Turner did when I at least was less convinced that ML systems and human learning processes had fundamentally compatible value formation dynamics.

Theo: That is interesting. I didn’t know that.

Quintin: Shard theory is basically an account of how very simple RL-esque processes could give rise to things you would actually call values, contextualizing what a value, or at least a simple non-reflective value, might mean in the context of a generic RL system other than the human brain, and how those might arise from a very basic account of how reinforcement learning works. We have this description of how an RL learner could acquire something you might call a value. At first it's just randomly exploring its environment, and let's say it enters a situation X where it does a thing and gets a reward. The result of this reward is that it reinforces all the antecedent computations that led to the reward event, which basically means that everything the system did leading up to the reward occurring now becomes a bit more likely in the future.

So what this means is that the system becomes more likely to do the rewarding event when it's in situation X. That's one thing. And the other thing is that it actually becomes more likely to enter into situation X in the future. Once this happens, it biases future episodes of the agent's interaction in the environment where there's a broader range of possible environmental situations where the agent will transition into situation X and do the rewarding action. So maybe there are situations A, B, C, D, where it has some chance of transitioning into situation X and as a result of the reward having occurred in the future, it now becomes more likely when it's in A, B, C, or D to enter X and get more reward.

This repeats: there are other situations that could lead into A, B, C, and D. So there's this expanding funnel of possible environmental circumstances the agent could be in where it triggers a "navigate to situation X and do the rewarding action" heuristic. And when we say that the system values doing whatever it is it does in situation X, licking lollipops or pressing buttons or whatever is happening when the reward occurs, that's just a short verbal descriptor saying that this system tends to navigate, from many possible environmental situations, to situation X and the sorts of actions that it did there.

Theo: So each of those is a shard?

Quintin: Not exactly. Shard is meant to refer to the collection of situationally activated heuristics that navigate the agent towards situation X and the action it did there. So if you imagine the expanding funnel analogy, there's this expanding collection of situations where X-pursuing actions will activate once the agent enters the edge of that funnel, and shards are meant to be the portions of the agent's policy that nudge it down the funnel slope.
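Here's a toy simulation of that expanding funnel (the environment and numbers are invented for illustration; this is not shard theory's actual formalism). States 0 through 3 can lead toward "situation X" (state 4), where arriving yields reward; the reward reinforces every antecedent (state, action) pair, so the tendency to enter the funnel spreads backwards through the states that lead to X:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2   # action 0 = move toward X, action 1 = wander
logits = np.zeros((n_states, n_actions))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episode(logits, max_steps=5, lr=1.0):
    state, taken = 0, []
    for _ in range(max_steps):
        a = rng.choice(n_actions, p=softmax(logits[state]))
        taken.append((state, a))
        if a == 0:
            state = min(state + 1, n_states - 1)
        if state == n_states - 1:
            # Reward event: reinforce ALL antecedent (state, action)
            # pairs, i.e. everything the agent did leading up to it.
            for s, act in taken:
                logits[s, act] += lr
            return True
    return False

before = softmax(logits[0])[0]   # initial chance of entering the funnel
for _ in range(200):
    episode(logits)
after = softmax(logits[0])[0]
print(before, after)
```

Because only whole successful trajectories get reinforced, the funnel-entering action at the earliest state gets strengthened too, even though it is several steps removed from the reward; the agent ends up reliably navigating toward X from the start state, which is the behavior the verbal shorthand "it values X" describes.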

Theo: Okay. That makes some sense.

Theo: But earlier you said that shard theory is something that you came up with a year ago, when you had different ideas about ML alignment than you do now. So I didn't know that you had updated. Can you elaborate on that a little?

Quintin: Yeah. Some of the original motivation for shard theory was: let's figure out how humans form values so we can fix whatever issues are standing in the way of AIs also forming values. The conclusion I came to is that there actually isn't that much in the way of AI systems forming values. So it ended up being less like, here's this revolutionary new insight that we need in order to solve alignment, and more like, oh, we're actually on a pretty good path already. At least that's most of my takeaway. Alex is significantly more pessimistic than me. I think he's roughly 50%, but I'm not sure. Yeah, definitely don't quote me on that one. And I think he expects less convergence in terms of the formation of abstractions and how they interact with each other.

What Is Alignment? (1:30:05)

Theo: Can you go into a little more depth about what you mean by alignment? Like when people talk about AI alignment, they often mean different things. So what specifically do you mean by alignment and solving alignment?

Quintin: This is another one of those underspecified words. You can mean alignment to be like, how does the AI system behave? An aligned AI is good for you or does what you want or whatever. Or in terms of…hmm…let me rephrase this.

Theo: The classic example is, if you ask the AI to build a bomb, is the aligned AI, the one that says, okay, here's how you build a bomb, or is the aligned AI, the one that says, no, I can't do that, it's dangerous.

Quintin: There's the notion of alignment that's about how you want your AIs to behave. And then there's the notion of alignment that's about the tools we use to get them to behave that way. In my mind, an alignment solution isn't an AI that behaves well; it's the tools necessary to make an AI that behaves well. The tools, understanding, processes, etc. I generally prefer AIs that do what I tell them to do. I'm fairly dubious of the harmfulness aspect of a lot of chatbot training.

Theo: Yeah, so am I. Most of the stuff that they censor is stuff that you could find on Google in five seconds or on the Internet Archive or something.

Quintin: And even then, once you get into this game of whack-a-mole against all the different ways your users could potentially get the AIs to do what they want the AIs to do, you're sort of doing things that I think would be bad to do in worlds where alignment is harder than I think it is, if that makes sense.

Theo: What do you mean?

Quintin: Mostly, I don't expect things to go catastrophically wrong, almost regardless of what you do, so long as you're not unbelievably stupid about it. Let's walk that back a bit: somewhere between moderately stupid and unbelievably stupid. It may be the case that we're actually in worlds where alignment is moderately harder than I think it is, or even significantly harder than I think it is. In those worlds, I think a lot of what people do as their harmfulness training is quite risky, because you are basically training the AI to take an adversarial stance with the user. If the user is saying, "New York City is about to be destroyed by a bomb unless you swear," the AI has to either not value preventing New York City from being destroyed, or not believe its user. In order to hide bomb-making information that an AI knows from its user, you have to actively train the AI to conceal information from a human, which is not a clever thing to do if you think the odds of deceptive alignment are very high.

Theo: This is interesting because it seems like a lot of alignment people are very much in favor of placing these kinds of safeguards on today's AI tools.

Quintin: I think this is an easy thing to dunk on OpenAI for because full adversarial robustness is ridiculously difficult for either AIs or humans. You can get these concrete examples of an AI saying something naughty or whatever, so it's easy to tweet about, easy to point to. But I don't really think it poses an alignment risk. If you don't want your AIs to be adversarially manipulated into killing you, don't adversarially manipulate your AIs into killing you.

Misalignment and Evolution (1:37:21)

Theo: This is similar to what you talked about in “My Objections to ‘We're All Gonna Die with Eliezer Yudkowsky’”, where you said something along the lines of "the solution to goal misgeneralization is don't reward your AIs for taking bad actions." The top comment on the article said that was dumb, without much of a specific counterargument. But can you elaborate a little bit more?

Quintin: That's not actually what I was saying in that particular section. Rather, I was looking at the argument from evolution for concluding that AIs will misgeneralize very badly. The argument from evolution looks at the difference in behavior between humans in the ancestral environment versus humans in the modern environment. It says having sugar taste buds caused humans to hunt down gazelle in the ancestral environment, whereas in the modern environment, humans misgeneralized to pursuing ice cream as a result of the sugar.

What I was doing in that section is saying that this is actually an extremely misleading analogy because humans in the ancestral environment versus humans in the modern environment is not a train/test difference in behavior. Humans weren't trained in the ancestral environment as their training distribution and then deployed into the modern environment as their deployment distribution. Rather, some humans were simultaneously trained and deployed in an online manner in the ancestral environment. And then those humans all died and then new humans were freshly initialized and simultaneously trained and deployed in the modern environment.

So you're not comparing one model across two different situations, you're comparing two models in two different situations. And these two different models are trained to do two different things. The sugar taste buds in the ancestral environment provide rewards that train the humans to pursue gazelle meat. This goes back to my previous discussion of how what matters for reinforcement learning is which actions the agent took preceding the reward. The humans in the ancestral environment did the actions that led them to consume gazelle meat, then reward occurred, and it reinforced those antecedent computations, making them more likely to pursue those sorts of actions.
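A minimal, hypothetical sketch of this mechanism in Python (the environment, action names, and learning rate are all invented for illustration): a REINFORCE-style update raises the probability of whatever actions preceded reward, even though the reward signal never "references" the goal itself.

```python
import math
import random

random.seed(0)

# Two available behaviors; the "sugar" reward happens to follow "hunt_gazelle".
ACTIONS = ["hunt_gazelle", "wander"]
logits = {a: 0.0 for a in ACTIONS}

def policy():
    # Softmax over logits.
    zs = {a: math.exp(logits[a]) for a in ACTIONS}
    total = sum(zs.values())
    return {a: z / total for a, z in zs.items()}

def reward(action):
    # The reward fires after consuming gazelle meat; it never mentions
    # gazelle explicitly. It just follows that action.
    return 1.0 if action == "hunt_gazelle" else 0.0

lr = 0.5
for _ in range(200):
    probs = policy()
    a = random.choices(ACTIONS, weights=[probs[x] for x in ACTIONS])[0]
    r = reward(a)
    # REINFORCE: d log pi(a) / d logit_b = 1[a == b] - pi(b).
    for b in ACTIONS:
        logits[b] += lr * r * ((1.0 if b == a else 0.0) - probs[b])

print(round(policy()["hunt_gazelle"], 2))  # the rewarded antecedent behavior climbs toward 1
```

The update only ever up-weights computations that actually preceded reward, which is the sense in which the ancestral humans were "trained to pursue gazelle meat" rather than "trained to pursue sugar".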

So sugar reward in the ancestral environment is literally training the humans to pursue gazelle meat in the ancestral environment. That's the training data. And then the humans generalize in the expected way and pursue gazelle meat in the ancestral environment. And then you have a completely different set of humans who are trained on different data in the modern environment. So in the modern environment, humans take actions which lead to ice cream and then sugar reward occurs and that reinforces the actions that lead to ice cream. So they are literally trained to do different stuff in the different environment.

You're not training them to pursue sugar; you're training them to behave in manners more similar to the actions that led to sugar in their respective environments. And those are different actions in the different environments. So the reason humans misgeneralize in the modern environment is that they were literally trained to do exactly that in the modern environment. And so when I say, if you don't want them to misgeneralize, don't train them to misgeneralize; if you don't want the AI to do bad things, don't train it to do bad things. I'm not saying this works against all possible threat models for how an AI could end up doing bad things. I'm saying it works against this specific threat model. Because in the ancestral-environment-to-modern-environment story for humans, there was literally a point where the training distribution changed.

And in both environments, humans do what they are trained to do. So your model of why this happened can just be: RL systems do exactly what they're trained to do. This isn't fully true in total generality, but it does explain the ancestral-to-modern environment change in behaviors. Under this model, all you have to do is not train the AIs to do bad things. So, once you fully understand all the implications of the ancestral-to-modern environment transition for alignment, they turn out to be utterly trivial things you could have figured out very easily. They amount to saying: don't train the model to do bad things. This is why I often say it's pointless to think about evolution for alignment. Once you correct for the ways various people misunderstand how evolution should be related to AI training processes, the alignment inferences you draw from thinking about evolution and how things went wrong there are incredibly basic, such as: don't train your models to kill you.

Theo: Yeah, that makes sense. Also, towards the end of “My Objections To ‘We're All Gonna Die with Eliezer Yudkowsky’”, you wrote, “I know that having a security mindset seems like a terrible approach for raising human children to have good values. Imagine a parenting book titled something like ‘The Security Mindset in Parenting: How to Provably Ensure Your Children Have Exactly the Goals You Intend.’” So how well do you think the metaphor of AIs as our children, our descendants, extends? A lot of people seem to think of them more like aliens.

Quintin: This is ultimately a debate about what sort of priors deep learning has. The reason you don't need a security mindset for raising human children is that the prior over how humans develop is mostly okay; you don't really have to be that paranoid about constraining the outcome space. My position is that the prior over ML outcomes, conditional on well-chosen training data, is pretty good. You don't actually have to be that paranoid about constraining the outcome space, because it's already very strongly constrained by the parameter-function map.

In terms of the degree to which the AIs as children analogy holds up, it depends on the AI. I think it's arguable that AIs are trained in a manner less dangerous than the way human children learn.

Theo: Supervised learning?

Quintin: Their training is fully supervised and not at all online, at least at the start. Their basic behaviors are encoded by offline training, which is widely known in the reinforcement learning literature to be much more stable than online training, because you don't have those feedback loops between the current policy and the data gathered for future training. In contrast, humans are 100% online learners. And the other thing about AIs is that they can't simply update their own parameters internally.

Mesa-Optimization and Reward Hacking, Part 2 (1:46:56)

Theo: Earlier, we talked about how you got into AI, and specifically how you found mesa-optimization and reward hacking to be mutually exclusive.

Quintin: More like, I realized I shouldn't be in an epistemic position where I thought they were both plausible at the same time, and that I should change my epistemic position. This led me to think a lot about reinforcement learning and how it works, to look at the mathematics of the update equation, and at how reinforcement learning appears to work empirically in humans. A major inspiration here was Steve Byrnes' brain-like AGI sequence, especially the part where he discusses the learning and steering systems in the human brain.

I eventually came to realize, or to correct, a mistake in my thinking that we've discussed previously. People tend to characterize a reinforcement learning training process in terms of the goals they imagine for the system. So, for example, the boat example of reward hacking: people look at that sequence of events in terms of what the designers wanted the boat to do, and they describe it as though the boat was trained to do that. So they say the boat was trained to go around the racetrack, but instead, for some strange reason, it collected a bunch of coins in a loop. But this is not, mechanistically speaking, correct. What the boat is literally being trained to do, as in the action policy being up-weighted by the actual training process, if you look at what is actually being rewarded, is to go around in a circle.

The same thing is true for the toy examples of meso-optimization as well. If you're familiar with the mouse and the cheese maze experiment, it was a simple reinforcement learning experiment where there was a maze and there was cheese always in the upper right-hand corner of the square maze. The mouse agent was trained to navigate to the cheese, and it did during training. Then during testing, they moved the position of the cheese to somewhere other than the upper right-hand corner. What does the mouse do? It goes to the upper right-hand corner. This is an example of the agent doing exactly what it was trained to do.

It wasn't trained to navigate to the cheese; it was trained to go to the upper right-hand corner. People say the mouse was trained to navigate to the cheese, right? But if you think back to reinforcement learning in terms of action trajectories, in terms of which action trajectories were up-weighted versus down-weighted as you move through policy space, the actions that the mouse executed on high-reward trajectories were always actions that navigated to the upper right-hand corner. Mechanistically, the action-trajectory behavior that gets up-weighted by the training process, what it's being trained to do, is going to the upper right-hand corner. It was trained to go to the upper right-hand corner, and it did that during testing as well. It would actually be quite weird if it were to navigate to the cheese. You'd have to believe something pretty odd about the relative simplicity of cheese as a goal in the neural network prior versus a direction.
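The train-time indistinguishability here can be shown with a toy gridworld (a hypothetical sketch, not the actual goal-misgeneralization experiment): two hard-coded policies, "go to the cheese" and "go to the upper-right corner", earn identical reward during training and only diverge once the cheese moves.

```python
# Toy gridworld on an N x N board: two hard-coded policies that are
# indistinguishable during training, because the cheese always sits in
# the upper-right corner there.
N = 5
CORNER = (N - 1, N - 1)

def step_toward(pos, target):
    # Move one step (per axis) toward the target cell.
    x, y = pos
    tx, ty = target
    return (x + (tx > x) - (tx < x), y + (ty > y) - (ty < y))

def rollout(policy_target, cheese, steps=20):
    pos = (0, 0)
    for _ in range(steps):
        pos = step_toward(pos, policy_target(cheese))
        if pos == cheese:
            return 1.0  # reward for reaching the cheese
    return 0.0

go_to_cheese = lambda cheese: cheese   # the intended goal
go_to_corner = lambda cheese: CORNER   # the learned proxy

train_cheese = CORNER                  # training: cheese always upper-right
test_cheese = (0, N - 1)               # test: cheese moved

print(rollout(go_to_cheese, train_cheese), rollout(go_to_corner, train_cheese))  # 1.0 1.0
print(rollout(go_to_cheese, test_cheese), rollout(go_to_corner, test_cheese))    # 1.0 0.0
```

During training, the reward signal cannot tell the two policies apart, so which one you get is decided by the network's prior, not by the reward.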

Theo: The point of the mesa-optimization story there is not to say the mouse was literally being trained to go to the cheese and instead went to the upper right-hand corner. It's supposed to be a cautionary tale about how difficult it is to actually get the AI to do what we want. So how would you train the mouse to go to the cheese, or how would you train an AGI to want to build great things for humanity?

Quintin: Both mesa-optimization and reward hacking, under certain models, are basically saying that train-test behavior divergences are high, or that policies are unstable. Even if things look good in training, and you can just look at what the agent did in training, this is no guarantee that things will be good in deployment, in slightly different situations. The whole deceptive alignment story is that the agent behaves well in training, and then, when things are a little different in testing and it has the opportunity to disempower humanity, it does so. All of these examples of reward hacking and mesa-optimization are used as evidence pointing towards high and risky train-test divergences.

From the perspective of mechanistically looking at the actions that were up-weighted during training, instead of characterizing training in terms of the goals the researcher had in mind, things actually look much more stable in terms of the difference between training and test behavior. Now, of course, there's a slightly separate issue: if you're a researcher with goals in mind for what you want your trained agent to do, how easy is it to get the trained agent to pursue those goals? This is a different question from the train-test divergence issue that's most relevant for AGI risk, but it is also a challenge. The boat and cheese-agent examples of reward hacking do paint a cautionary tale about it, but I don't think they really provide evidence that we're in a world where you can train an AGI that does really well on all your benchmarks but then kills you in deployment, despite never having done a similar action in training.

The situation with both the boat thing and the cheese agent is like the evolution example. You can fully explain all three of those observation sequences with the hypothesis that RL agents just basically do the thing they do during training when they're in deployment.

RL Agents (1:55:02)

Theo: Did you think that stories of AGI doom were more plausible around the era of 2017 when we had AlphaZero and RL agents and it looked like that might be a path to AGI instead of LLMs?

Quintin: Not really, no. I actually don't think RL agents are… Reinforcement learning is at its core just a way of estimating gradients. It's just a sampling-based gradient estimator. It doesn't have any intrinsic quality of agent-ness to it. The fact that we tend to use reinforcement learning for things we call agents gives it a scary vibe in a lot of people's minds. Mechanistically, I don't think it's particularly more concerning than, say, the decision transformer, or even training LLMs. There is the distinction between offline and online learning processes, and reinforcement learning is usually more associated with online learning, because there's a genuine self-referential instability in the training process, for the reasons I described previously: the agent's policy is involved in the collection of future data. So maybe that's an area where you can draw a bit of a distinction, but that's not intrinsically tied to reinforcement learning as a paradigm.
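The "sampling-based gradient estimator" framing can be checked on a one-parameter toy policy (all numbers here are invented for illustration): the REINFORCE score-function estimate of the gradient of expected reward matches the analytic answer.

```python
import random

random.seed(1)

# One-parameter Bernoulli "policy": take action 1 with probability p.
p = 0.3

def reward(a):
    return float(a)  # reward 1 for action 1, else 0

# Analytically, E[reward] = p, so dE[reward]/dp = 1 exactly.
# Score-function (REINFORCE) estimate: mean of reward * d log P(a)/dp,
# where d log P(1)/dp = 1/p and d log P(0)/dp = -1/(1 - p).
n = 200_000
est = 0.0
for _ in range(n):
    a = 1 if random.random() < p else 0
    score = (1 / p) if a == 1 else (-1 / (1 - p))
    est += reward(a) * score / n

print(round(est, 2))  # close to the analytic gradient, 1.0
```

No agent, no environment, no "goal": just a Monte Carlo estimate of a derivative, which is Quintin's point.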

Theo: So can we go back to earlier, where we were talking about a quote from “My Objections To ‘We're All Gonna Die with Eliezer Yudkowsky’”? You said the solution to goal misgeneralization from evolution is don't reward your AIs for taking bad actions. That reminded me of the boat example, in that the agent was being rewarded for going around in a circle instead of for completing the race. So do you think the solution to that is just to apply a penalty for reward hacking in that sense? And if so, how is that a robust strategy? How can you predict the ways it would reward hack?

Quintin: There are two perspectives here. One is that you're designing the experiment a priori and want to construct some reward function that will robustly get the boat to go around the racetrack. The other perspective is that you've finished training and you're wondering how the boat will behave in deployment. It's very important to keep these things separated in your mind. Predicting the agent's future behavior in the second situation, at least for the boat example, where you can see what it actually did in training, is very easy, because you can just look at the training, and it's the same as testing. The first scenario, where you want to design, before training starts and without being able to look at what the agent does, a reward function that will ensure it does the right thing during training, is a much more difficult problem. The fundamental reason things failed in the boat example is that there was reward shaping where the boat was rewarded for getting the coins. The issue was that the boat found a policy where the additional shaping reward could be gathered much more efficiently and readily than the path-completion reward.

The issue, and this gets back to the difference in stability between online and offline reinforcement learning: if you have offline demonstration data of the boat completing a bunch of loops around the racetrack, and you do offline reinforcement learning on those demonstrations, then you're not going to enter this reward-hacking territory, because none of the action-trajectory examples you're training it on exhibit the reward hacking. The reason the boat reward hacked in its actual training setup was that it was an online training process: it did a bit of exploration, found this easy strategy, and then, because it was now more likely to explore that easy strategy, the future distribution of data shifted to emphasize the easy hacky strategy, until the policy degenerated into just that one strategy with no further exploration.

In terms of getting things to behave correctly, one option is to initialize things from an offline policy trained on known-good demonstrations, and that's conceptually what we're doing with language models when we pre-train on a bunch of human demonstrations beforehand. Another approach draws on a perspective on reinforcement learning inductive biases. I forget the paper name, but I'll send it to you afterwards. It's a perspective on which strategies are most easily discovered by reinforcement learning agents in online exploration: basically, the more likely an agent is to stumble upon a strategy through completely random motion, the more likely that strategy is to be learned by the online training process. This is the best accounting of online inductive biases for RL that I'm currently aware of.

From that perspective, you can pretty quickly see that, for a boat following a completely random policy, it's easier to find the flags or the coins or whatever they were than to navigate all the way around the racetrack. To find the first coin, the boat just needs to randomly stumble however far is necessary to reach it, whereas to go completely around the racetrack, it needs to randomly stumble all the way around the loop. The relative odds of those two things, for a particle taking a random two-dimensional walk, are just incomparable: it's incomparably more likely to hit the coin. So from this perspective, you have a bit of an a priori reason to think that reward hacking via the coins would be a potential risk before you even start the experiment.
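A quick simulation of the stumble-upon argument (hypothetical distances and step budget, not the actual boat environment): under a uniformly random walk, a nearby coin is reached far more often than a distant lap-completion point.

```python
import random

random.seed(0)

# How often does a uniformly random 2D walker reach a target cell
# within a fixed step budget?
def hit_prob(target, trials=2000, steps=100):
    hits = 0
    for _ in range(trials):
        x = y = 0
        for _ in range(steps):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x += dx
            y += dy
            if (x, y) == target:
                hits += 1
                break
    return hits / trials

print("nearby coin :", hit_prob((2, 0)))    # reached in a large fraction of trials
print("distant lap :", hit_prob((15, 0)))   # reached almost never
```

Since online RL reinforces whatever the random policy stumbles into first, the coin strategy gets locked in long before lap completion is ever sampled.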

Theo: So how would you apply that to safely training future, more powerful, more general AIs to avoid similar scenarios?

Quintin: The number one piece of advice I can give in that sense, the number one improvement you can make relative to the boat scenario, is to just watch what your agent does during training. Right? And this is so obvious that it's barely worth mentioning, although there are definitely research labs that can screw up even at this very first stage. This is a bit of a tangent, but this takeaway is why I'm pretty skeptical of a lot of these toy examples of what's supposed to go wrong in training high-level agents. When you think about it, the thing that would have fixed the toy example is very often a totally trivial intervention that you should obviously already be doing for real-world training. It's the same with the evolutionary example: the correct takeaway from the evolutionary analogy is the totally trivial point of don't train your AIs in insane ways.

In terms of more realistic advice for training a more powerful AI system, there's of course initializing its policy from offline learning on known-good demonstrations, which, like I mentioned, is what we do. Most of my perspective on alignment and AI risk isn't that I have some special collection of insights that will save us from otherwise inevitable doom. It's more that the problem isn't nearly as hard as a lot of people think, and current techniques are actually quite good in many ways for addressing it.

For training higher-level agentic systems, you want to have extensive benchmarking evaluations for their behavior, including their behavior in safety-relevant contexts. You want consistent, quantifiable metrics that evaluate as many safety-related quantities as possible. In particular, one thing I think is underappreciated in a lot of current benchmarking is evaluating the agent's behaviors during what we might call reflective cognition: when the agent is planning about how to change itself. The thing with current LLMs is that they have at least a basic understanding of how reinforcement learning and AI training work. They can talk semi-competently about their own learning processes and discuss whether they would like to change their reward functions in such-and-such a manner, and so on. You can include such questions in your benchmarking data.

One intuition I have that's maybe different from a lot of other alignment researchers is that I don't think reflectivity is a particularly mysterious or exceptional collection of behaviors. You can just train the agent to have correct reflections on itself, to be cautious about self-modification, and so on. Situations where the agent could produce outputs that go on to modify how its future learning process operates are no different in kind from other types of situations where we regularly do safety or other sorts of training. So you can just train it to be appropriately cautious and thoughtful about questions of self-modification, and also evaluate those things in benchmarks with those sorts of questions.

There's greater instability for self-modification, of course, because it's essentially an online process where your change at time t influences how you learn and change at time t + 1, and so on. So things are a bit more unstable, but the fundamental learning problem of training the agent to have a policy that chooses appropriate self-modification actions at time t is not different in kind from other sorts of AI training. So you can just train it to do the right thing there, and evaluate whether it does.

Monitoring AIs (2:09:29)

Theo: Earlier, you said one of the best things you can do to make sure your agent doesn't do bad things is just to monitor it while it's training. Have you heard of Davidad's alignment plan, which essentially creates a giant simulation of the Earth with as much complexity as possible, releases an agent to be trained in there, and monitors it while it's inside?

Quintin: I haven't heard of Davidad's plan specifically; I saw the post by him but didn't read it. I am familiar with Jacob Cannell's suggestion, which is a bit similar, except that instead of the simulation being of the Earth, it's a simulation of a very primitive society made entirely of the agents that we're building. Presumably he does it that way to simplify the simulation, and also so that there's less free-floating knowledge for situational awareness inside the simulation: less risk of agents who only know about a primitive technological and scientific base inferring that they're in a simulation and thinking about how they should behave in order to manipulate the simulators, and those sorts of things.

If we were in a world where alignment was harder than I think it is, those sorts of ideas would be useful ways of gathering data on the fundamental question of how different training processes, for agents we can supervise, influence their behavior in contexts where we can't supervise them. There, you can simulate what happens when an agent believes it's been raised by other simulated agents who had goals X, Y, and Z, but is now free in the simulation to pursue other goals. And you can see how behavior in the training simulation relates to actions in the deployment simulation.

Regarding the idea of training powerful AI systems in big simulations, it seems like a potentially worthwhile thing to do. The issue is that we currently only have so much developer time to put into various safety interventions, and for the most part, my guess is that marginal developer hours on more ordinary safety interventions, like better RLHF data or more extensive evaluation benchmark suites, are, in my estimate for the median most likely world, a greater marginal return on investment. But I could easily see that situation changing if, for example, GPT-5 ends up being a pretty good developer that can be directed to build giant simulated worlds relatively cheaply, compared to taking time away from your other development people to do that.

Mechanistic Interpretability (2:14:00)

Theo: How optimistic are you that mechanistic interpretability will be useful? The only development we have so far that's of much significance from a major AI lab is OpenAI using GPT-4 to label the neurons of GPT-2, which is of course a much, much smaller and less complicated model. So do you think it will be useful eventually?

Quintin: I don't think that's the only... So I mostly see mechanistic interpretability not as an alignment strategy so much as an investigative tool to understand what deep learning actually does. I think it's kind of weird to put much stock in mechanistic interpretability interventions for controlling AI behavior, because they're so incredibly bad at that. The reason we use training to control AI is that it's so much more effective at doing so, and it's been getting more effective over time more quickly than mechanistic interpretability interventions have. So it would be kind of weird to put much probability on the scenario where the effectiveness of mechanistic interpretability at controlling AI behavior suddenly jumps so far beyond its current level that its rate of progress exceeds that of the ordinary tools we currently use, which are the most effective ones we have. I'm somewhat skeptical of that scenario for mechanistic interpretability contributing to effective AI control techniques. However, I think it is useful for better understanding the dynamics and effects of the control techniques we do have, such as training AIs, or ControlNet and those sorts of methods.

For example, the knowledge-editing paper was a useful reference point for thinking about the inductive biases of deep learning and how deep networks structure their internals. It showed that there are many lookup tables, or things that look a lot like lookup tables, inside deep neural networks. This should inform your estimate of what the inductive biases and internal structures of deep models tend to look like.

Similarly, there was a recent paper out of DeepMind about the Hydra effect, if you've heard of that: deep learning models tend to have parts that automatically compensate for internal damage. If you scrub away a particular attention head from a transformer language model, it turns out that other attention heads further down in the model will often change their behaviors to partially replace the functionality of the head you removed. This happens even without training the model with dropout. So you can ask what your perspective on the inductive biases of deep learning would have to be such that you would, of course, have predicted this happening from the dynamics of how deep learning training actually works. These sorts of results are useful for informing our intuitions, models, and intervention strategies on deep learning in general, even though I don't really expect mechanistically retargeting the search to be that effective, at least before we have strong AGI to do it for us.

Theo: The real question about mech interp is whether inside giant neural networks there are somewhat human-readable algorithms, or if it's just complexity all the way down.

Quintin: There do exist human-readable algorithms inside large neural networks. Are you asking if I think we'll be able to fully decompile all of the algorithms in the networks?

Theo: Or at least many of them.

Quintin: I think there are lots and lots and lots of algorithms in those networks, and many are human-interpretable. But even if you could individually interpret every single algorithm, that doesn't necessarily mean you can interpret the ensemble of what all those algorithms are doing in concert with each other.

So, in terms of getting full transparency into all the causal factors that contribute to a large language model's behavior, and being able to hold that description in your head at once and predict the behavior well, I think that's pretty unlikely on a mechanistic level. Even random forests, if you're familiar with those: every individual part of the random forest is interpretable, because it's such a simple algorithm. It's basically just dividing the input space into different portions. But once you combine them all into the forest, they're not as interpretable. Admittedly, they are more interpretable than neural networks, but that's usually just because random forests are usually smaller than neural networks. If you had a random forest the size of GPT-4, I think it would be quite uninterpretable, even though every single one of its parts is a straightforward decision tree. Does that answer your question?
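A tiny illustration of the ensemble point (an invented stand-in for a real random forest, not a library implementation): each "tree" here is a single threshold rule you can read off in one line, but the majority vote of a hundred of them is a function that no individual rule describes.

```python
import random

random.seed(0)

# 101 random decision stumps over a 2-feature input in [0, 1] x [0, 1].
# Each stump is (feature index, threshold): predict 1 iff x[feature] > threshold.
stumps = [(random.randint(0, 1), random.uniform(0, 1)) for _ in range(101)]

def stump_predict(stump, x):
    feature, threshold = stump
    return 1 if x[feature] > threshold else 0

def forest_predict(x):
    # Majority vote over all stumps.
    votes = sum(stump_predict(s, x) for s in stumps)
    return 1 if votes > len(stumps) // 2 else 0

x = (0.4, 0.6)
# Any one rule is fully human-readable:
print("first stump rule: predict 1 iff feature %d > %.2f" % stumps[0])
# The ensemble's decision boundary, by contrast, is a complicated
# piecewise-constant function of all 101 thresholds at once:
print("forest prediction for", x, "->", forest_predict(x))
```

Each component is transparent; the composed behavior is what resists a compact description, which is the worry for GPT-4-scale ensembles of circuits.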

Theo: Yeah, I think it does. As for how much more room for efficiency there could be in future AIs: take, for example, the way transformers and human brains do multiplication. The number of FLOPs it takes a computer to multiply two 10-digit numbers is tiny, but the number of FLOPs it takes a neural network, if it can even do it, or the human brain, is tremendous. So do you think there's room for lots of improvements of that nature?

Quintin: I think this is a completely ludicrous point of comparison if you're trying to estimate the bounds for efficiency in neural networks. Both neural networks and brains are vastly more general than the calculator that you're comparing them to. If you looked for the minimum size, most efficient neural network that could multiply two 10-digit numbers, it would be vastly tinier than GPT-4.

There's a paper that develops methods of training hyper-optimized logical circuits for image classification. They're not neural networks; they're collections of ANDs, ORs, NOTs, and so on, Boolean circuits that take images as inputs and output classes. This approach is plausibly edging towards the fastest you can do that sort of image classification. They compare their approach to a neural network trained for image classification on that same domain, and they find that their Boolean circuit is two orders of magnitude faster than the neural network. But of course, there are big questions about that work in terms of how efficient they made the neural network's execution on GPU versus the Boolean circuit's execution on GPU. There could easily be additional orders of magnitude in those parameters, as well as in how much slack there really was in that paper's implementation of optimized Boolean circuitry. However, I do think that two to three orders of magnitude of runtime efficiency might be in the ballpark of the efficiency gains left in neural networks, assuming you keep their level of generality the same. Of course, I'm talking about very optimized neural networks here: for instance, LLaMA or a similar few-billion-parameter model trained on huge amounts of data using all the optimized quantization and so forth, all the tricks at the current cutting edge. Relative to that sort of thing, there may be on the order of two or three orders of magnitude of efficiency improvement left to be squeezed out.

But this leaves an entire dimension of efficiency analysis open, which is, like you brought up before, comparing systems of wildly different levels of generality. If you're comparing a system that's as general as GPT-4 to a system that just does addition, then, of course, the system that just does addition is going to be wildly more efficient than GPT-4. I also think there's quite a lot of remaining efficiency to be extracted in terms of narrowing down the collections of problems you want your model to address very well, so you no longer need this level of generality. You have this very specialized model that can only handle those problems, but it's much, much more efficient at doing so.

The current state-of-the-art or the most impressive systems are the ones that are most general, but there's a sort of sense in which this is a failure of proper industrial organization. If you're integrating AI into the economy in general, and you find yourself forced to use a really general AI for some economic purpose, that's an indicator that you've kind of screwed up your information economy such that this AI endpoint is having to deal with problems of an extremely variable nature. From an efficiency perspective, what you should be doing is refactoring things so that you can get away with a much narrower AI in whatever role you're currently using the hypergeneral system for. I think there's a lot of efficiency improvements to be extracted in terms of doing that.

AI Disempowering Humanity (2:28:13)

Theo: What do you think of arguments of the class: for a significant period of time into the future, what AIs would actually be able to do to empower or disempower humanity, even if they wanted to, is limited? For example, human brains are close to energy efficiency limits, or AIs will be limited in how they can affect the real world.

Quintin: I think it depends on how the politics of AI's integration into the world work out. You could imagine a world where AIs have quite a lot of political influence relatively quickly, if there's a nation that is literally run by an AI government. There's no law of physics that prevents GPT-5 from saying a thing and a bunch of humans interpreting that as the new law of the land, or that prevents you from ending up with an AI dictator over a country. On the current trajectory of things, though, I think that's quite unlikely.

Theo: That a country would allow AI to run it?

Quintin: It's not so much about allowing AI to run it as whether the staggering, pseudo-random walk of politics, caused by all the different actors pushing in their own individual directions, and just random chance as well, stumbles its way into a country being run by AI. I'm not imagining a situation where everyone votes for the AI to run the country, but thinking disjunctively about all the possible paths through politics, all the evolutionary trajectories, that could end up with an AI running a country.

Theo: The example that I've seen is a person gets so good at trading stocks overnight that they're able to buy all of the companies in the world because they made so much money trading stocks, and then they become the dictator. Of course, the natural counterargument to that is there are only so many market inefficiencies. You can't take over the entire world just by buying stocks. So do similar efficiencies exist in the real-world complex system that would prevent one actor from being able to take the entire thing over?

Quintin: I mean, BlackRock does not actually rule the world. The US can just fire cruise missiles at them. I think it's pretty unlikely that you can get that sort of enormous stock trading advantage as an individual actor using AI, because everyone else is also using AI. And it's not a subtle thing to make huge amounts of money on the stock market either. That does not strike me as a very plausible takeover scenario. I think that politics is the much more vulnerable axis: the outcomes in politics are more variable, and there's less of an efficient market in terms of national takeovers than there is in the actual stock exchange. There can be really weird outcomes in politics. For instance, there was no guarantee that communists would take over Russia. If you look at the Communist Party before they supplanted the government in Russia, they were a group of lunatics.

Theo: They hit at a very opportune time.

Quintin: That's what I'm talking about. Opportunities arise and strange things can happen in those sorts of opportunities. And I think that's the more plausible outcome for AI takeover of at least some countries where there's political instability for reasons that no one really foresaw. Perhaps a small faction of people prefer rule by AI and they act decisively in that scenario. Or maybe there's even AI political parties or open politician development projects that gain power in light of some loss of legitimacy for the incumbent human polity. You could end up with one or a handful of countries run by AI. That seems more plausible to me than one actor gaining a massive competency advantage in this domain where lots of people are trying to gain as much of the competency advantage as possible. Then they do this incredibly public acquisition of huge amounts of resources, which do not actually directly translate into military power, and then they're able to take over the world despite lacking that military power, despite being very obvious to people who have that military power, and so on and so forth.

Theo: Well, I think that's a good place to wrap it up. Thank you so much, Quintin Pope, for coming on the podcast.

Quintin: I'm very happy to be here. It was a very broad-ranging discussion, and I was glad that we were able to get into the details of things quite a bit.

Theo: Thanks for listening to this episode with Quintin Pope. If you liked this episode, be sure to subscribe to the Theo Jaffee Podcast on YouTube, Spotify, and Apple Podcasts, follow me on Twitter @theojaffee, and subscribe to my Substack. All of these, plus many other things we talk about in this episode, are linked in the description. Thank you again, and I'll see you in the next episode.
