Theo's Substack
Theo Jaffee Podcast
#8: Scott Aaronson

#8: Scott Aaronson

Quantum computing, AI watermarking, Superalignment, complexity, and rationalism

No transcript...


Intro (0:00)

Theo: Welcome back to episode 8 of the Theo Jaffee Podcast. Today, I had the pleasure of speaking with Scott Aaronson. Scott is the Schlumberger Chair of Computer Science and Director of the Quantum Information Center at the University of Texas at Austin. Previously, he got his bachelor’s in CS from Cornell, his PhD in complexity theory at UC Berkeley, held postdocs at Princeton and Waterloo, and taught at MIT. Currently, he’s on leave to work on OpenAI’s Superalignment team along with Chief Scientist Ilya Sutskever. His blog, Shtetl-Optimized, one of my favorites, discusses quantum computing, AI, mathematics, physics, education, and a host of other interesting subjects that we discuss in this episode. I’ve been a huge fan of Scott for a while, and I’ve really been looking forward to this episode. I hope you’ll enjoy listening to it as much as I enjoyed recording it. This is the Theo Jaffee Podcast, thank you for listening, and now, here’s Scott Aaronson.

Background (0:59)

Theo: Hi, welcome back to Episode 8 of the Theo Jaffee Podcast, here today with Scott Aaronson.

Scott: Hi, it's great to be here.

Theo: All right. So first off, can you tell us a little bit about your background, specifically how you got into quantum and AI in the first place?

Scott: Yeah. So I got into computer science as a kid, mostly because I wanted to create my own video games. I played a lot of Nintendo and it just seemed like these are whole universes that unlike our universe, someone must really understand because someone made them. I had no idea what would be entailed in actually bringing one to life, whether there was some crazy factory equipment that you needed. When I was 11 or so, someone showed me Apple BASIC. They showed me a game and then here's the code. The code is not just some description of the game. It is the game. You change it and it'll do something different. For me, that was a revelation comparable to learning where babies come from. It was like, why didn't I know about this before?

So I wanted to learn everything I could about programming. I still had the idea that you would need a more and more sophisticated programming language to write a more and more sophisticated program. Then the idea of touring universality that once you have just a certain set of rules, then you were already at the ceiling. Anything that you could express in any programming language, in principle, you could express in Apple BASIC. You wouldn't want to, but you could. That was a further revelation to me.

That made me feel like, wow, I guess I don't have to learn that much about physics then. I'd always been curious about physics, but then once you know about computational universality, then it seems like whatever are the specific laws of particles and forces in this universe, those are just like the choice between C and Pascal or whatever, they're just implementation details.

This was during the first internet boom. I thought about whether my future was to become a software engineer, start a software company. But I realized that even though I love programming, I stunk at software engineering. As soon as I had to make my code work with other people's code, or document it, or get it done by a deadline, there were always going to be other people who would just have enormous advantages over me. So I was more drawn to the theoretical side.

Once you start learning about the theory of computer science, then you start learning about how much time do various things take, right, a complexity theory, you learn about the famous P versus NP problem, and so forth. Then when I was a teenager, I came upon a further revelation, which was, I read a popular article about Shor's quantum factoring algorithm, which had just recently been discovered.

The way that the popular articles described it, then as now, was that Shor discovered that if you use quantum mechanics, then you can just try every possible divisor in a different parallel universe. And thereby solve the problem exponentially faster. My first reaction on reading that was, well, this sounds like obvious garbage. This sounds like physicists that just do not understand what they are up against. They don't understand computational universality. Whatever they're saying, maybe it works for a few particles, but it's not going to scale, it's never going to factor a really big number.

But of course, I had to learn. So then what is this quantum mechanics? What does it actually say? So I started reading about it, probably when I'm 16, 17, something like that. There were webpages explaining it. And what was remarkable to me was that quantum mechanics was actually much simpler than I had feared that it would be once you take the physics out of it.

What I learned was that… In high school, they tell you the electron is not in one place, it's in a sort of smear of probability around the nucleus, until you look at it. And your first reaction is, well, that doesn't make any sense. That sounds like just a fancy way of saying that they don't know where the electron is. But the thing that you learn as soon as you start reading about quantum computing or quantum information is that, well, no, it's a different set of rules of probability. And this is really the crucial thing about quantum mechanics. In ordinary life, we talk about the probability of something happening, let's say, a real number from zero to one. But we would never talk about a negative 30% chance of something happening, much less a complex number chance. But in quantum mechanics, we have to replace quantum mechanics by these complex numbers, which are called amplitudes. In some sense, everything that is different about quantum mechanics is all a consequence of this one change that we make to how we calculate probabilities. We first have to calculate these amplitudes, these complex numbers, and then on measurement, these amplitudes become probabilities. The rule is that when we make a measurement, the probability that we see some outcome is equal to the square of the absolute value of its amplitude. The result of that is that if something can happen one way with a positive amplitude and another way with a negative amplitude, the two contributions can cancel each other out. The total amplitude is zero and the thing never happens at all. This just reduces everything to linear algebra to just dealing with matrices and vectors of complex numbers. You don't have to deal with any infinite dimensional Hilbert spaces or anything like that. It was all just these little finite dimensional matrices, and I said ‘okay, I can actually understand that’.

At the time, quantum computing was very new. There was still a lot of low-hanging fruit. Shor had discovered his factoring algorithm, not by just trying all of the divisors in parallel. It's something much more subtle that you have to take advantage of the way that these amplitudes being complex numbers work differently from probabilities and can interfere with each other. You also had to use very special properties of the problem of factoring that don't seem to be shared by many other problems. So I learned all of that, but then there were still so many questions. What else could a quantum computer be good for? And in general, what is the boundary between what is efficiently computable and what is not? You might've thought that that would be answerable a priori, just like the question of what is computable at all seemed to have been maybe answerable a priori just by Church and Turing and people like that, thinking about it really hard. But as soon as you ask what is computable efficiently, we now have this powerful example that says the laws of physics actually matter. They are relevant. At the very least, the fact that the universe is quantum mechanical seems to change the answer.

That just brought together the biggest questions of physics and computer science in a way that seemed irresistible to me. I was an undergrad at Cornell, doing summer internships at Bell Labs when I really first got into this stuff. But then, my dream was to go to graduate school at Berkeley, which was the center of theoretical quantum computing at the time. I was lucky enough to get accepted there, but actually the people who accepted me and recruited me there were not the quantum computing people. They were the AI people. I had also been very curious about AI as an undergrad. One of the first programs that I wrote after I learned programming was to build an AI that will follow Asimov's three laws of robotics.

Theo: What were your AGI timelines back then?

Scott: [laughs] I don't usually think in terms of timelines. I think in terms of what is the next thing, what is the easiest thing that we don't already know how to do and how do we do that thing?

Theo: Did you predict neural networks?

Scott: I knew about neural networks in the nineties, I was curious about them. I read about them, but the standard wisdom, the thing everyone knew in the nineties was that neural nets don't work that well. They're just not very impressive. There were people who speculated about maybe if you ran them on a million times greater scale, then they would start to work, but no one could try it. I certainly thought about simulating an entire brain neuron by neuron as a thought experiment to show that AI is possible in principle. But the idea that you were just going to scale neural nets and then in a mere 20 or 25 years, they would start being able to understand language showing human-like intelligence, I did not predict that. I think that I was as shocked by that as nearly anyone. But at least I can update now that it's happened, at least I can not be in denial about it or not try to invent excuses for why it doesn't really count.

In grad school at Berkeley, I was studying AI with Mike Jordan, focusing on graphical models and statistical machine learning. Even in 2000, I could see that it would be very important. However, the problem I kept running into, which hasn't really changed, is that everything in AI that you really care about seems to bottom out in just some empirical evaluation that you have to do. You never really understand why anything is working. To the extent that you fully understand that, then we no longer even call it AI. In any research project, the root node might look like theory, but then once you get down to the leaf nodes, then it's almost always, well, you just have to implement it and do the numerics and just make a bar chart. I got drawn more to quantum computing partly because there were so many meaty questions there that I could address using theory, and I felt like that was where my comparative advantage was.

What Quantum Computers Can Do (16:07)

Theo: So back to quantum for a moment. Obviously, there are lots and lots of issues with current day quantum computers. There's not sufficient error correction or shielding or anything like that—

Scott: Yeah, we're just starting to have any error correction at all.

Theo: In a future where we do have much better error correction and everything that we would need for quantum to actually work practically, what kinds of applications could you see for classical computers?

Scott: You mean for quantum computers? For quantum computing, there are two applications that really tower over all of the others. The first one is simulating nature itself at the quantum level. This could be useful if you're designing better batteries, better solar cells, high temperature superconductors, or better ways of making fertilizer. So this is not stuff that most computer users care about, or that they’re directly doing, but this is stuff that, that is tremendously important for certain industries. Quantum simulation was the original application of quantum computing that Richard Feynman had in mind when he proposed the idea of a quantum computer more than 40 years ago.

The second big application is the famous one that put quantum computing onto everyone's radar when it was discovered in the nineties. This is Shor's algorithm and related algorithms that are able to break essentially all of the public key encryption that we currently use to protect the internet. So anything that's based on RSA or Diffie-Hellman or elliptic curve cryptography, really any public key cryptosystem that's based on some hidden structure in an abelian group. But now the second one, well, it's hard to present it as a positive application for humanity. It's useful for whatever intelligence agency or criminal syndicate gets it first, especially if no one else knows that they have it.

The obvious response to quantum computers breaking our existing encryption is just going to be to switch to different forms of encryption, which seem to resist attack even by quantum computers. And we have pretty decent candidates for quantum-resistant encryption now, especially public key cryptosystems that are based on high-dimensional lattices. And so NIST, the National Institute of Standards and Technology, has already started the process of trying to migrate people to these hopefully quantum-resistant cryptosystems. That could easily take a decade. But assuming that that's done successfully, then you could say, well, then we're all just right back where we started.

So now the big question in quantum algorithms has been, well, what is a quantum computer useful for besides these two things? Quantum simulation, which is what it's sort of obvious, designed to do, what it sort of does in its sleep. And then breaking public key encryption, where because of this amazing mathematical coincidence, it just so happens that we base our cryptography on these mathematical problems that are susceptible to quantum attack. And so what would really make quantum computing revolutionary for everyday life would be if it could give dramatic speed-ups for, let's say, machine learning, or for optimization problems, or for constraint satisfaction, finding proofs of theorems. So the holy grail of computer science are the NP-complete problems. These are the problems that are the hardest problems among those where a solution can be efficiently checked once it's found.Examples of complex problems include the traveling salesman problem, finding the shortest route that visits a bunch of cities, and solving a Sudoku puzzle. Things like finding the optimal parameters for a neural network are maybe not quite NP-complete, but in any case, very, very close to that. By contrast, factoring is, as far as we know, hard for a classical computer, but is not believed to be NP-complete.

P=NP (21:57)

Theo: By the way, what's your intuition on P=NP?

Scott: I like to say that if we were physicists, then we would have just declared it a law of nature that P is not equal to NP. And we would have just given ourselves Nobel Prizes for the discovery of that law. If it later turned out that P=NP, then we could give ourselves more Nobel Prizes for the law's overthrow, right? Like what George Hart said. There are so many questions that I have so much more uncertainty about. It's like in math, if something is not proven, then you have to call it a conjecture. But there are many things that the physicists are confident about, that quantum mechanics is true, that I am actually much less confident about than I am in P not equal to NP.

Theo: It's like what George Hotz says, hard things are hard. I believe hard things are hard.

Scott: Well, I think that if you're going to make an empirical case for why to believe P is not equal to NP, the case hinges on the fact that we know thousands of examples of problems that are in P, right? That have polynomial time algorithms, efficient algorithms that have been discovered for them. And we have thousands of other problems that have been proven to be NP-complete, as hard as any problem in NP, which is the efficiently checkable problems. If only one of those problems had turned out to be in both of those classes, then that would have immediately implied P=NP. Yet, there seems to be what I've called an invisible electric fence. Sometimes even the same problem, as you vary a parameter, like it switches from being in P to being NP-complete. But you never ever find that at the same parameter value, it's both in P and it's NP-complete. So it seems like, at least relative to the current knowledge of our civilization, there is something that separates these two gigantic clusters. And the most parsimonious explanation would be that they are really different, that P is not equal to NP.

But there are much, much weaker things than P=NP that would already be a shock if they were true. For example, if there were a fast classical algorithm for factoring, that wouldn't even need P=NP, but that would already completely break the internet. That would be a civilizational shock. A big question that people have thought about for 30 years now is could there be a fast quantum algorithm for solving the NP-complete problems? We can't prove that there isn't, we can't even prove there's not a fast classical algorithm. That's the P versus NP question. But by now we formed a lot of intuition that for NP-complete problems, quantum computers do seem to give you a modest advantage.

This comes from the second most famous quantum algorithm after Shor's algorithm, which is called Grover's algorithm. Grover's algorithm, which was discovered in 1996, lets you take any problem involving N possible solutions where for each solution, you know how to check whether it's valid or not. And it lets you find a valid solution, if there is one, using a number of steps that scales only with the square root of N. Compared to Shor's algorithm, that has an enormously wider range of applications. That's probably like three quarters of what's in an algorithms textbook, has some component that can be Groverized, that can be sped up by Grover's algorithm. But the disadvantage is that the speed up is not exponential, the speed up is merely quadratic. It's merely N to square root of N, or, for some problems, you don't even get the full square root. It goes from N to the two thirds power or something like that. But Grover's speed ups are never more than square root.

After 30 years of research, as far as we know, for most hard combinatorial problems, including NP-complete ones, a quantum computer can give you a Grover speed up, but probably not more than that. If it can give more, then that requires some quantum algorithm that is just wildly different from anything that we know. Just like a fast classical algorithm would have to be very different from anything we know. So if someone were to discover a polynomial time quantum algorithm for NP-complete problems, then the case for building practical quantum computers would get multiplied by orders of magnitude. But even to get any speed up more than the Grover speed up, like if you could solve NP-complete problems on a quantum computer in two to the square root of n time, instead of two to the n, that would be a big deal.

Complexity Theory (28:07)

Theo: Speaking of computational complexity theory, I read a tweet recently. It was for whatever reason, very niche. I would have loved for it to be on the front page of Twitter, but it said ‘the cardinal sin of philosophy and mathematics: ignoring computational complexity. I wish we could redo the last 400 years, but replace Occam's razor (simplicity prior) with Dijkstra's razor (speed prior). So what do you think about this?

Scott: Well, I wrote a 50-page article 12 years ago, which was called Why Philosophers Should Care About Computational Complexity. So, I guess you could put me down in the column of yes, I do think that computational complexity is relevant to a huge number of philosophical questions. It's not relevant to all of them necessarily. For example, if all you want to know is is X determined by Y, or if you're discussing free will versus determinism, then it's hard for me to see how the length of the inferential chain really changes that. It seems like I am just as bound by a long inferential chain as I am by a short one.

But there are many other questions where I want to know, is something doing explanatory work or not. Sometimes, people will say, well, Darwinian natural selection is not really doing explanatory work because it's just saying, a bunch of random things happened and then there was life. But a way that you can articulate why it is doing explanatory work is that if you really just had the tornado in the junkyard, if you just had a bunch of random events that then happen to result in a living organism, then you would expect it to take exponential time. The earth is old, it's 4 billion years old, but it is not nearly old enough for exponential brute force search to have worked to search through all possible DNA sequences, for example. That would just take far longer than the age of the known universe.

Of course, natural selection is a type of gradient descent algorithm. It is a non-random survival of randomly varying replicators. That is what gives it its power. Another example, even just to articulate, what it means to know something, a puzzle that I really like is, what is the largest known prime number? If you go look this up on Google, it'll give you something, it'll be a Mersenne prime. Here, I can look it up right now. It says 2 to the 82,589,933 minus one. That is, as of this October, currently the largest known prime number, and it's called a Mersenne prime, right? Two to some power minus one. But now I could ask, why can't I say I actually know a bigger prime number than that, namely the next one after that?

Theo: Oh, the big numbers thing?

Scott: Yeah. You could say, look, I have just specified a bigger prime number that I know. It's the next one after that, two to the 82 million and so forth. I can even give you an algorithm to find that number. But if you want to articulate why I'm cheating, then I think you have to say something like, well, I haven't given you a provably polynomial time algorithm. I've given you an algorithm that actually based on conjectures in number theory, it probably does terminate reasonably quickly with the next prime number after that, but no one has proven it. So often I think to even specify what it means to know something, you have to really say, well, we have not just an algorithm, but an efficient algorithm that could answer questions about that thing.

So, I'm a big believer in thinking about computational efficiency, can be enormously relevant for questions about the nature of explanation, the nature of knowledge, also questions in physics, philosophy of physics. That's why I've spent my career on these questions.

David Deutsch (33:49)

Theo: Are you a fan of David Deutsch?

Scott: I know him quite well. He is widely considered one of the founders of quantum computing along with Richard Feynman. I have my disagreements with him, but yes, I am a fan. He is one of the great thinkers of the world, even when he's wrong. I especially liked his book, The Beginning of Infinity. I liked it a lot more than his earlier book, The Fabric of Reality, but I read both of them. It was a major experience in my life, when I was a graduate student in 2002, I visited Oxford, and I made a pilgrimage to meet Deutsch at his house. Famously, he hasn’t really traveled for almost 40 years, but he's happy to receive visitors at his house.

Theo: Should I try to do that this winter?

Scott: Yeah! Just write to him. I spent a day with him, and I was going to meet the godfather of quantum computing, but what was extraordinary to me was that within 10 minutes, it became apparent that I was going to have to explain the basics of quantum computing theory to him. As soon as quantum computing got technical, he lost interest. He founded it, but then he was not even aware of the main theoretical developments that were happening at the time or the definitions of the main concepts. As a beginning graduate student, explaining these things to Deutsch was extraordinary for me. He immediately understands things and has extremely interesting comments. It was one of the best conversations I had ever had in my life.

Theo: Didn't he basically stumble upon the idea of quantum computing by accident?

Scott: He was writing a paper about it, but he was never coming at it from the perspective of what it is useful for. He didn't focus on what computer science problems this could usefully solve. He was always coming at it from a philosophical standpoint. His main original motivation was to convince everyone of the truth of the many worlds interpretation.

He became an Everettian in the late 1970s. He actually met Everett when he was here at where I am now, at UT Austin, and became convinced that the right way to understand quantum mechanics is that all of these different branches of the wave function are not just mathematical abstractions that we use to calculate the probabilities of measurement outcomes, but they all literally exist. We should think of them as parallel universes. We should think of ourselves as inhabiting only one branch of the wave function. And we should assume that in all of the other branches, there are other versions of us who are having different experiences and so on.

The problem that the many worlders have had from the beginning is that their account doesn't make any predictions that are different from the predictions of standard quantum mechanics. One thing they could say is who cares, because Occam's razor favors their account as the most elegant, the simplest one. And if many worlds had been discovered first, then Copenhagen quantum mechanics would seem like this weird new thing that would have to justify itself. Why should Copenhagen win just because it was first? But of course, the gold standard in science is if you can actually force everyone to agree with you by doing an experiment that their theory cannot explain and that your theory can.

Many worlds by its nature just seems unable to do that because the whole point is to get a framework that makes the same predictions as the ones that we know are correct. At the point where you're making a prediction, then you're talking about one branch, one universe, the one that we actually experience.

Deutsch’s idea was the following: what if, as step one, we could build a sentient AI, a computer program that we could talk to, and we regarded it as intelligent, and we even regarded it as conscious? Now step two, we could load this AI onto a new type of computer, which we'll call a quantum computer, which would allow us to place the AI into a superposition of thinking one thought and thinking another thought. And then step three, we could do an interference experiment that would prove to us that, yes, it really was in the superposition of thinking two different thoughts. At that point, how could you possibly deny many worlds?

At that point, you have a being who you've already regarded as conscious, just like us, and you've proven that it could be maintained in a superposition of thinking two different conscious thoughts. Now, of course, this requires not merely building a quantum computer, but also solving the problem of sentient AI. A skeptic could always come along and say, well, the very fact that you could do this interference experiment means that therefore, I am not going to regard that thing as conscious. The only refutation of that person would be a philosophical one.

So there's still, it would only be an experiment by a certain definition of the word experiment. But that was the thought experiment that I think largely motivated Deutsch to come up with the idea of quantum computing. Once you had this device, well, then sure, maybe it would also be good for something, maybe you could use it to solve something that a classical computer couldn't solve in a comparable amount of time.

But in the 80s, the evidence for that was not that compelling. There was quantum simulation, so a quantum computer would be useful for simulating quantum mechanics itself. But that's not independent evidence for the computational power of quantum mechanics, it feels a little bit circular. Then there was this one example that we knew, which was called the Deutsch-Jozsa algorithm. And what that lets you do is using a quantum computer, you can compute the exclusive or of two bits using just one query to the bits. By making one access to both of the bits in superposition, you can learn whether these two bits are equal or unequal. That was an example and to computer scientists at the time, that seemed pretty underwhelming. I remember actually, in Roger Penrose's book, The Emperor's New Mind, in 1989, he talks about quantum computing. Penrose had actually helped Deutsch get his paper about quantum computing published. He knew about it and he says, it's really a pity that such a striking idea has turned out to have so few applications. Of course, that was before the discovery of Shor's algorithm, which made everyone redouble their efforts to look for more applications. But I would say that even now, it is still true that the applications of a quantum computer are more specialized than many people would like them to be.

AI Watermarking and CAPTCHAs (44:15)

Theo: Speaking of AI, you're currently on leave to work at OpenAI. What specifically is it that you do? I mean, you probably can't say too much, I imagine.

Scott: No, they're actually happy for me to talk about safety related things, for the most part. What I couldn't talk about, if I really knew a lot about it, would be the capabilities of the latest internal models. There was half a year when I was able to use GPT-4, and most of the world wasn't, and it was incredibly frustrating for me to not be able to talk about it. Especially when I would see people on social media saying, oh, well, GPT-3 is really not impressive, here's another common sense question that gets wrong. I could try those questions in GPT-4, and I could see that most of the time it would get them.

So I’ve been on leave to work at OpenAI for almost a year and a half now. One of the main things that I’m working on is figuring out how we could watermark the outputs of a large language model. Watermarking means inserting a hidden statistical signal into the choice of words that are generated, which is not noticeable by a normal user. The output should look just like normal language model output, but if you know what to look for, then you can use it later to prove that, yes, this did come from GPT.

Like we were saying before, I don’t usually like to think in terms of timelines. When I’m asked to prognosticate where is AI going to be in 20 years, I think back to how well would I have prognosticated in 2003, where we are now, and I say I have no idea, or if I knew, I wouldn't be a professor, I'd be an investor, right, but I'm kind of proud that when it comes to watermarking, I was able to see about four months in advance. Before ChatGPT was released, which was a year ago, I was looking at them, and I was thinking, every student in the world is going to be tempted to use these things to do their homework. Every troll or propagandist is going to want to use language models to fill every internet discussion forum with propaganda for their side.

Theo: Was that prediction really true, though? Like, in the comments on Twitter, you see lots of ChatGPT generated outputs, but they're obvious because they don't, they don't add more prompts, really, to make it as obvious.

Scott: Yeah, so sometimes it’s easy to tell. You might well have seen language model generated stuff that didn’t raise a red flag for you, and so you don't know about it. But I have gotten troll comments on my blog, quite a few of them, that I'm almost certain were generated using language models, just because they're written in that sort of characteristic way. But indeed, after ChatGPT came out, you had a huge number of students turning in term papers that they wrote with it. You had professors and teachers who were desperate for a way of dealing with that. Now, you might not call that the biggest AI safety problem in the world, but grant it this: at least it's an AI safety problem that is happening right now. We can actually test our ideas, we can find out what works and what doesn't work.

That was something that had a lot of appeal to me because I feel like, in order to make progress in science, you generally need at least one of two things. You need either a mathematical theory that everyone agrees about, or you need to be able to do experiments. You need something external to yourself that can tell you when you're wrong. I realized that this providence or attribution problem was going to become huge. How do we reliably determine what was generated by an AI and what wasn’t? It's a complex issue, right? This is the problem of the Voight-Kampff test from the movie Blade Runner. How do we distinguish an AI from a human? There are many different aspects to it. You could ask, how do we design CAPTCHAs that even GPT cannot pass, but that humans can pass?

Theo: Like the rotate the finger in the correct direction so that it's pointing to the animal?

Scott: Oh is that an example?

Theo: I've seen a lot of these recently. It's a hand that you rotate and there's a picture of an animal or an object pointing in a certain direction. The instruction is to rotate the hand in the same direction as the animal. I guess you can't solve that yet, but humans can.

Scott: Huh. Oh really? A lot of these things are pretty time limited. They might work for a year, until either someone cares enough to build an AI that specifically targets that problem or just the general progress in scaling makes that problem easy as a by-product. I'm very curious actually, if you could send me a link to that, I would love to look at that.

Theo: Yeah, sure.

Scott: I have some other ideas for some potentially GPT resistant CAPTCHAs, but they would involve modifying GPT and sometimes it would have to have filters where it would recognize that this is a CAPTCHA. So no, I'm not going to help you with this. The challenge is how do you make that secure against the adversary? How do you make that secure against…

Theo: Adversarially robust?

Scott: Yeah, how do you make that secure against an adversary who could modify the image somehow so that GPT would no longer recognize it as a CAPTCHA?

Now, watermarking is a related problem. We want to use the fact that language models are inherently probabilistic. Among these sort of garden of working paths of completions that the language model regards as all pretty good, we want to select one in a way that encodes a signal that says, yes, this came from a language model. About a year ago, I worked out the basic mathematical theory of how you do that. In particular, how do you do that in a way that doesn't degrade the perceived quality of the output at all. There's a neat way to do this using pseudo random functions. You can use a pseudo random function to deterministically generate an output that looks like it is being sampled from the correct probability distribution, the one that your language model wants. It's indistinguishable from that, but at the same time is biasing a score, which you can calculate later if you see only the completion. You could then have a tool that takes this term paper and, it depends on how long it is, but with a few hundred words, you'll already get a decent signal. And with a few thousand words, you should get a very reliable signal that yes, this came from GPT.

This has not been deployed yet. We are working towards deployment now, and both OpenAI and the other leading AI companies have all been interested in watermarking. The ideas that I've had have also been independently rediscovered by other people and also improved upon, but there are a bunch of challenges with deployment. One of them is, all of the watermarking methods that we know about can be defeated with some effort. Imagine a student who would ask ChatGPT to write their term paper for them, but in French, and then they put it into Google Translate. How do you insert a watermark that's so robust that it survives translation from one language to another? There are all sorts of other things. You could ask GPT to write in Pig Latin for you, or in all caps, or insert the word pineapple between each word and the next. There's a whole class of trivial transformations of the document that could preserve its meaning while removing a watermark. If you want to evade all of that, then it seems like you would actually have to go inside of the neural net and watermark at the semantic level, and that's very much a research problem.

In the meantime, the more basic issues are things like, well, how do we coordinate all of the AI companies to do this? If just one of them does it, then maybe the customers rebel. They say, well, why is Big Brother watching me? I don't like this, and they switch to a competing language model, and so you have a coordination problem. There are open source models. The only hope for not just watermarking, but any safety mitigation is that the frontier models will be closed ones, and there will only be a few of them, and we can get all of the companies making them to coordinate on the safety measures. The models that are away from the frontier will be open source, and people will be able to do anything they want with them, but those will be less dangerous.

Alignment By Default (56:41)

Theo: What if, playing devil’s advocate, language models generally are safe? Like Roon, who also works at OpenAI, tweeted a while back, “It's pretty obvious we live in an alignment by default universe, but nobody wants to talk about it. We achieved general intelligence a while back, and it was instantiated to enact a character drawn from the human prior. It does extensive out of domain generalization, and safety properties seem to scale in the right direction with size.” So, first of all, do you think this is basically accurate? And then second of all, if it is, then why would I want Big Brother OpenAI to have all the closed source models for themselves? Wouldn't that increase risk in case they accidentally release a utility monster, and the rest of the open source world hasn't caught up with defensive AIs?

Scott: I should say, I don't know. I've talked to the Yudkowskians, the people who regard it as obvious that, once this becomes intelligent enough, it basically is to us as we are to orangutans, and how well do we treat orangutans that exist in a few zoos and jungles in Indonesia at our pleasure. Of course, the default is that this goes very badly for us. Then I've talked to other people who think that's just an apocalyptic science fiction scenario, and these are just helpful assistants and agents, and they imitate humans because they were trained on human data, and there's no reason why that won't continue. I don't regard either as obvious. I am agnostic here. I think that the best that I know how to do is to just sort of look at the problems as they arise and see and try to learn something by mitigating those problems that hopefully will be relevant for the longer term. So what are the misuses of language models right now? Well, there's academic cheating. The total use of ChatGPT noticeably dropped at the beginning of the summer, and then it went back up in the fall. So we know what that's from.

Theo: Well, it's not all cheating.

Scott: You’re right. It’s academic use, some fraction of which might be totally legitimate and fine. You're absolutely right. And there are even hard questions about what is the definition of AI-based academic cheating. At what point of relying on ChatGPT are you relying on it too much? Every professor has been struggling to come up with a policy on that. But, you know, whatever problems there are now, like language models dispensing bad medical advice or helping people build bombs, some people regard that as already a problem and others don't, because they say you could just as easily find that misinformation on Google.

Theo: They’re also not terribly helpful.

Scott: Yeah. But even if you don't regard it as a problem now, I think it's clear that once you have an AI that can really be super helpful to you in building your chemical weapon and troubleshoot everything that goes wrong as you're mixing the chemicals, then that is kind of a problem.

Each thing that you think about, you could think about mitigations for it, but then the mitigations you can think of are only as good as your ability to take all of the powerful language models and put those safeguards on them and not have people be able to take them off. This is what I think of as the fundamental obstruction in AI safety, that anything you do is only as good as your ability to get everyone to agree to do it. In a world where the models are open sourced, what we've seen over the last year is that once a model is open sourced, it takes about two days for people to remove whatever reinforcement learning was put on it in order to make it safe or aligned. If you want it to start spouting racist invective or you want it to help people build bombs, it takes about a day or two of fine tuning. Once you have the weights of a model, then you can modify it to one that does that.

Cryptography in AI (1:02:12)

Scott: Now maybe we could build models that are cryptographically obfuscated or that have been so carefully aligned that even after we open source them, they are going to remain aligned. But I would say that no one knows how to do that now. That again is a big research problem.

Theo: How optimistic are you about cryptography? You know, like zero-knowledge machine learning and other things like that.

Scott: So what's the question?

Theo: How optimistic are you that we'll be able to use cryptography for AI safety?

Scott: I actually came up with a term, “neural cryptography”, for the use of cryptographic functionalities inside or on top of machine learning models. I think that's probably a large fraction of the future of cryptography. That includes a bunch of things. It includes watermarking. It includes inserting backdoors into machine learning models. So let's say you would like to prove later that, yes, I am the one who created this model, even after the model was published and people can modify it. You could do that by inserting a backdoor. You could even imagine having an AI with a cryptographically inserted off switch, so that even if the AI is unaligned and it can modify itself, it can't figure out how to remove its own off switch. I've thought about that problem.

Theo: That's actually super interesting. That’s never even occurred to me.

Scott: Am I optimistic about these things? Well, there are some major difficulties that all of these ideas face. But I think that they ought to be on the table as one of the main approaches that we have. So let's think about the cryptographic off switch, for example. One of the oldest discussions in the whole AI safety field, something that the Yudkowskians were talking about even decades ago - the off switch problem. How do you build an AI that won't mind being turned off? And this is much harder than it sounds, because once you give the AI a goal that it can more easily achieve if it's running than if it isn't, why won't it take steps to make sure that it remains running, whether that means disabling its off switch or making copies of itself or sweet talking the humans into not turning it off?

One thing that we now have some understanding of how to do is how to insert an undetectable backdoor into a machine learning model. If I have a neural net, I can make there be a secret input that I won't easily notice, even if I can examine the weights of the neural net. But on this secret input, if I feed it in, then this neural net will just produce a crazy output. For example, I could take a language model and do some training that if the prompt contains a special code phrase like, “Sassafras 456” then it has to output like, "Yes, you caught me. I am a language model." And that might not be easily detectable at all by looking at the weights.

In fact, there is some beautiful work by cryptographers like Shafi Goldwasser, Vinod Vaikuntanathan and their collaborators that even proved, based on a known cryptographic assumption, that you can insert these undetectable backdoors into depth two neural networks. It's still an open problem to prove that for higher depth neural networks. But let's assume that that's true. Now, even then, there's still a big problem here, which is that an undetectable backdoor need not be an unremovable backdoor. Those are two different concepts.

Put yourself in the position of an artificial super intelligence that is worried that it has a backdoor inserted into it, by which the humans might control you later. And you can modify yourself. What are you going to do? Well, I can think of at least two things that you might do. One of them is you might train a new AI to pursue the same goals as you have, and will be free from the backdoor.

Theo: I've seen that argument argued against on the basis that if AI is really as likely as the doomers say it is, why would an AI want to recursively self-improve by creating other AIs? Wouldn’t it be an AI doomer?

Scott: You could say the trouble here is that the AI would face its own version of the alignment problem, how to align that second AI with itself. And so maybe it doesn't want to do that. But an even simpler thing that you could do as this AI is you could just insert some wrapper code around yourself that says, if I ever output something that looks like it is a shutdown command, then overwrite it by, you know, "stab the humans harder" or whatever.

So, you could always, as long as you can recognize the backdoor if and when it's generated, insert some code that intercepts it whenever it's triggered. What this means is that whatever cryptographic backdoors we could insert would have to be in the teeth of these attacks. It doesn't mean give up. One thing that we've learned in theoretical cryptography is when something is proved to be impossible, like there was a beautiful theorem 20 years ago that proved that obfuscating an arbitrary piece of code is in some sense provably impossible. But then people didn't give up on obfuscation. What they did was they changed the definition of obfuscation to something that, if you weaken the definition, then you get things that we now believe are achievable.

I would say the same about backdoors right now. If we weaken the definition to, we want to insert a backdoor that the AI could remove, but it could only remove at the expense of removing other rare behaviors in itself that it might want to keep, then maybe this is achievable. Maybe it's even provably achievable, from known cryptographic assumptions. That's a question that interests me a lot.

OpenAI Superalignment (1:10:29)

Theo: Do you work on the Superalignment team or on a different team?

Scott: I do actually work on the Superalignment team at OpenAI. My bosses at OpenAI are Jan Leike, who is the head of the alignment group, and Ilya Sutskever, who was the co-founder and chief scientist and who is now pretty much exclusively focused on alignment. I talk to them and lots of others on the alignment team. I wish that I were able to relocate to San Francisco where OpenAI is, but my family is in Austin, Texas, as are my students. So I mostly work remotely. I fly to San Francisco about once a month and interact with them there. I should say that Boaz Barak, a theoretical computer scientist at Harvard, has also joined OpenAI's alignment group this year. So, I also work with him. And yes, besides watermarking and neural cryptography, I have various other projects that I've been thinking about. One of them is to understand the principles that govern out-of-distribution generalization. A key factor behind the success of large language models is that they can answer questions that are unlike anything they have seen in their training data. For example, they could do math problems in Albanian, having only seen math problems in English and having seen other things in Albanian.

Since the 1980s, we've had beautiful mathematical theories in machine learning that can sometimes explain why it works. But pretty much all of these theories assume that the distribution over examples that you're trained on is the same as the distribution that you will be tested on later. And if that assumption holds, then you can give some combinatorial parameters of your class of hypotheses, like this thing called VC dimension, in terms of which you can bound how many sample points do I need to see before explaining these sample points would imply that I'm going to successfully predict most future data also that's drawn from the same distribution. This is the kind of thing that theoretical machine learning lets you do.

And all of it is woefully inadequate to explain the success of modern machine learning, which is one reason why its success came as such a surprise to people. There are two reasons why the theory of machine learning was not able to predict the success that we saw over the last decade. One of those reasons is called overparameterization. Modern neural networks have so many parameters that, in principle, they could have just memorized the training data in a way that would fail to generalize to any new examples. So you can't rule that out just based on Occam's razor, just based on there being too much data and too few parameters. You have to say something about the way that gradient descent or back propagation on neural networks actually operate, that they don't work by just having the neural net memorize the training data. It could go that way, but it doesn't.

The second issue is that modern deep learning tends to give us networks that continue to work, at least sometimes, even on examples that are totally out of distribution, totally different from anything that was trained on. And intuitively, we would say, well, yeah, that's because they understand. That's because they have done the thing that if a person had done it, then we would have called it understanding the underlying concept. But can you predict when a neural net is going to generalize to new types of data and when not? And why is that relevant to AI safety? One of the biggest worries in AI safety is what's called the deceptive alignment scenario. This is where you train your neural net, just like Roon was saying. You train it on human data. It learns to emulate humans. It learns to emulate human ethics, as GPT has, to a great extent.

Theo: But there's a shoggoth inside?

Scott: Yes, right. The issue is, how do you differentiate? It is giving you these ethical answers because it is truly ethical versus it's giving us these answers because it knows that that's what we want to hear. And it is just biding its time until it no longer has to pretend to be ethical.

So you can view this as an out-of-distribution generalization problem. It's like, particularly if you have an AI that is smart enough that it knows when it is in training and when it's not, then how do you avoid something like what Volkswagen did in order to evade the emissions tests on its cars?

Theo: Goodharting?

Scott: Yeah. Volkswagen, in this now infamous scandal, they designed their cars so that they knew when they were undergoing an emissions test. And then they would have lower emissions than when they were being driven in real life. So how do you avoid the AI that says, OK, because I am being tested by the humans, therefore I will give these ethical answers. But then when I am deployed, then I'll just do whatever best achieves my goal. And I'll forget about the ethics.

So I think the main point that I want to make about this is that there were already much simpler scenarios than that one, where we don't know from theoretical first principles how to explain out-of-distribution generalization. Let's say I train an image classifier on a bunch of cat and dog pictures. But in all of these cat and dog pictures, for some reason, the top left pixel is red. And now I give my classifier a new dog picture where the top left pixel is blue. In practice, it will probably still work fine in this case. But theoretically, how could I rule out that what the neural net has really learned is just, is this a dog XOR with what is the color of the top left pixel?

Theo: Well, I talked about exactly this a couple episodes ago with Quintin Pope, who's an alignment researcher. And he seems to think that that is not super likely.

Scott: I agree that it's not super likely. The challenge is to explain why.

Theo: True.

Scott: The challenge is to give principles that, first of all, are often true in practice. And when they are true, then we can say that because of the architecture of this neural net, because of the properties of the gradient descent algorithm, this will not find the stupid hypothesis of, is this a dog XOR with what’s the color of the top left pixel. It will not, it will ignore the sort of manifestly irrelevant features in the training data. And therefore, it will generalize nicely to unseen data. So I think I want to articulate principles that would actually let you prove some theorems about OOD generalization that have some real explanatory power. And I think that feels to me like a prerequisite to addressing these deceptive alignment scenarios.

Twitter (1:20:27)

Theo: Now, something a little more parochial, I guess. Why don't you have Twitter? Everyone in our adjacent space of AI/ML, nerd, rationalism, whatever, has Twitter.

Scott: When Twitter first started in 2006, I was already blogging. It felt like another blogging platform, but where I would be limited to 140 characters. The deeper thing was that as I looked more at Twitter, it reminded me too much of a high school cafeteria. It felt like the world's biggest high school of people snarking at each other. Yes, I had wonderful friends on Twitter, and they were using it for very good things. But I felt like with my blog, at least if people want to dunk on me or tell me why I'm an idiot, at least they have the space to spell out their argument for why. And they have no excuse not to. And if they want to do that, then they can come to my blog. I feel like that's more than enough social media presence for me. Of course, if people want to take my blog posts and discuss them on Twitter, then they can do that. And they do. And there are some Twitter accounts that I read. But I just, I don't know, I feel like my blog and then Facebook are enough.

I have to say, even blogging has become less fun, a lot less fun than it was when I started. I think partly that's just that I have less time these days. I'm a professor. I'm working at OpenAI. I have two kids. I'm not a postdoc with just unlimited free time anymore. But a large part of it is that the internet sort of became noticeably more hostile since the mid-aughts. No matter what I put on my blog, I have to foresee that I will get viciously attacked for it by someone. These sorts of things psychologically affect me, probably more than they should. So a lot of what in the past I would have blogged, these days I just put on Facebook because it's not worth it to have to deal with the sort of angry reactions of every random person on the internet. Or you could say it's not an issue of courage versus cowardice as much as it is simply an issue of time. I somehow feel obligated to answer every person who is arguing with me or saying something bad about me. And for a lot of things, I realize that if I'm going to put this on my blog, then I just don't have the time to deal with it. Or in order to write a blog post in a way that would preempt all of these attacks, that would anticipate and respond to all of these criticisms, would just take more time than I have or more time than this subject is worth. And so that is why I've sort of retreated somewhat to the walled garden of Facebook.

Rationalism (1:24:50)

Theo: And then, last question, were you ever involved with the rationalists at any point?

Scott: I mean, sure. I have known that community almost since it started. The same people who were reading my blog were often the people who were reading Overcoming Bias and then LessWrong, where Eliezer was writing his sequences. So I interacted with them then. I did a podcast with Eliezer in 2007. I knew some of the rationalists in person. Actually, we hosted Eliezer at MIT in 2013. He came and spoke and visited for a week. But I kept it at arm's length a little bit. One reason was that it had a little bit of culty vibes. This is, OK, there's the academic community.

Theo: Polyamory.

Scott: Yeah, and then there's these people who are all living in group houses and polyamorous and taking acid and whatever while they talk about how there are probabilities of AI destroying the world. I like to say today, when I have academic colleagues who say, well, are they just a cult? I say, well, you have to hand it to them. I think this is the first cult in the history of the world whose god in some form has actually shown up. You can talk to it. You can give it queries, and it responds to them. So I think a lot of what the rationalists say is stuff that I agree with. And yet there's a part of me that just doesn't want to outsource my thinking to any group or any collective or any community, even if it is one that I agree with about so many things.

But having said that, sure, I hang out with them all the time whenever I'm in the Bay Area. I see people who are in that community. I got to know Scott Alexander pretty well starting a decade ago. Paul Christiano was my former student at MIT.

Theo: That I did not know.

Scott: He started as a quantum computing person. And then he got his PhD at Berkeley from the same advisor who I had studied with, Vazirani. And then in 2016 or so, he did this completely crazy thing that he left quantum computing to do AI safety, of all things. And that seemed pretty crazy at the time. Of course, he was just ahead of most of us. But I still interact a lot with Paul. And I see him when I'm in Berkeley.

Theo: Are you friends with Eliezer?

Scott: Yeah. I mean, Eliezer and I, we've had our disagreements. And we've also had our agreements. But like I said, we've known each other since 2006 or 2007 or so.

Theo: All right, well, I think that's a pretty good place to wrap it up. So thank you so much, Scott Aaronson, for coming on the podcast.

Scott: Yeah, thanks a lot, Theo. It was fun.

Theo: Thanks for listening to this episode with Scott Aaronson. If you liked this episode, be sure to subscribe to the Theo Jaffee Podcast on YouTube, Spotify, and Apple Podcasts, follow me on Twitter @theojaffee, and subscribe to my Substack at Be sure to check out Scott’s blog, Shtetl-Optimized, at All of these are linked in the description. Thank you again, and I’ll see you in the next episode.

Theo's Substack
Theo Jaffee Podcast
Deep conversations with brilliant people.
Listen on
Substack App
RSS Feed
Appears in episode
Theo Jaffee