Theo's Substack
Theo Jaffee Podcast

#7: Nora Belrose

EleutherAI, Interpretability, Linguistics, and ELK

Intro (0:00)

Theo: Welcome back to Episode 7 of the Theo Jaffee Podcast. Today, I had the pleasure of speaking with Nora Belrose. Nora is the Head of Interpretability at EleutherAI, a non-profit, open-source interpretability and alignment research lab, where she works on problems such as eliciting latent knowledge, polysemanticity, and concept erasure, all topics we discuss in detail in this episode. Among AI researchers, Nora is notably optimistic about alignment. This is the Theo Jaffee Podcast, thank you for listening, and now, here’s Nora Belrose.

EleutherAI (0:32)

Theo: Hi, welcome back to episode seven of the Theo Jaffee Podcast. I’m here today with Nora Belrose.

Nora: Hi, nice to be here.

Theo: Awesome. So first question I'd like to ask, you're the head of interpretability at EleutherAI. How did you get involved in interpretability in the first place as opposed to just AI?

Nora: Yeah, that's a good question. Before I started working with Eleuther, I was a research engineer at the Fund for Alignment Research, another nonprofit that works mainly on reducing existential risk from AI. At FAR, I was mostly helping with other people's projects rather than leading my own. One of the projects I worked on there was finding adversarial attacks against Go-playing AIs. It turns out that superhuman Go-playing AIs can often be attacked with specially crafted moves that trick them into playing sub-optimally.

As I was working at FAR, I also was getting to know a lot of other people in the AI alignment and interpretability communities. I was at the time working out of this office that no longer exists in Berkeley called Lightcone, where a lot of other people were working on different types of things in interpretability and alignment work. I got to know a lot of people there, including Quintin Pope, who has been on this podcast before.

There was one person in particular who I got talking to at Lightcone. His name's Jacques Thibault. He was involved with Eleuther before, and he was telling me about this project that Eleuther had started but hadn't actually finished. It was a half-started project called the Tuned Lens. It's an interpretability tool. The idea is that you can use the Tuned Lens to sort of peer in on what, in a very loose sense, a transformer is thinking. More specifically, you're looking at each layer of the transformer and using this very simple probe, this affine transformation at each layer, to read out what its current prediction is at that layer. You can see how its prediction evolves from layer to layer.
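The layer-by-layer readout Nora describes can be sketched roughly as follows. This is a toy illustration with random weights, not the real method: in the actual Tuned Lens, the affine map (A, b) is learned per layer by training the probe so that decoding the translated hidden state matches the model's final-layer logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50                   # toy sizes, not a real model
h = rng.normal(size=d_model)              # hidden state at some layer l
W_U = rng.normal(size=(d_model, vocab))   # the model's unembedding matrix

# "Logit lens" baseline: decode the intermediate hidden state directly.
logit_lens = h @ W_U

# The tuned lens inserts a learned affine map (A, b) per layer before
# decoding. Here A and b are identity/zero stand-ins for the learned weights.
A = np.eye(d_model)
b = np.zeros(d_model)
tuned_lens = (A @ h + b) @ W_U

# Softmax gives the layer's current next-token distribution.
probs = np.exp(tuned_lens - tuned_lens.max())
probs /= probs.sum()
```

Reading off this distribution at every layer is what lets you watch the model's prediction evolve from layer to layer.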

I found this really interesting. It was early work at the time. They hadn't done a lot of experiments on it, but I wanted to get involved. So I started volunteering. I was still working at FAR, doing the Go-playing AI thing, but I was also working on the Tuned Lens project just in a volunteer capacity. Eleuther provided some compute and I started doing experiments with them.

And then it's sort of a complicated story that I won't go into in detail, but basically I was putting more and more of my time into the Tuned Lens work, and I was really excited about that, and less excited about the adversarial Go direction. I talked with both my old boss at FAR and with Stella, who now runs Eleuther, and after some negotiation we agreed that I'd move over to Eleuther full time to do interpretability work there. So that's how I got into it: first being an engineer at FAR, then becoming a volunteer, and so on.

Theo: So is this Lightcone the same Lightcone that runs LessWrong?

Nora: I'm actually not sure whether they still call themselves Lightcone as an organization (maybe you know more about this than me), but there were definitely people running LessWrong who were also running this office.

Theo: The previous head of Eleuther was Connor Leahy, right? And he's still on the board. You're more of an optimist, and he's basically a doomer, from what I can gather from his Twitter. Do you know him well? Do you talk about these things? Do you debate with him?

Nora: I have debated him a couple of times on Twitter about it. I don't actually know him super well. We've interacted in person a couple of times and online a couple of times. But yeah, I don't know him super well. I think there was a schism of some kind a little while ago where a lot of people who were active at Eleuther earlier on moved over to Conjecture. They created this new organization called Conjecture. Those people tended to be more of the pessimistic or doomy people in the organization. But there's still people in Eleuther who are a lot more worried about existential risk and are more doomy than I am. So we have an interesting mix of perspectives on the issue.

Theo: Was Connor always kind of like a doomer or did he have an update towards doom?

Nora: I'm probably not the best person to ask about this. I think there are other people at Eleuther who have known Connor for a lot longer. I don't want to say something that's wrong but my sense is that he's been fairly doomy for a while, probably since he started Eleuther. But I think, I don't know, at a certain point he just decided that he felt that he could do more good for the world, I suppose, by starting Conjecture. So that's what he did.

Optimism (8:02)

Theo: If you had to debate him or Eliezer or another doomer on a podcast, what kinds of arguments would you run first? You've written pretty long articles on Less Wrong that I've read. I really like those. But how would you do it in a shorter, more concise way?

Nora: It's good that you bring that up, because over the last couple of days Quintin and I have been working on an essay that should be out soon. I'm not totally sure when this will air; it might be out by the time it does.

Theo: Looking forward to it.

Nora: The first thing to point out is just, I think, if you kind of step back and you just think about what is an artificial intelligence? How are we building it? Why are we building it? And also compare artificial intelligences to other types of systems where humans do seem to succeed at instilling our values. I think you come out with a pretty optimistic prior.

One of the major reasons why people are pouring billions of dollars of research and development into AI is that it's profitable. And one of the big reasons it's profitable is that in many ways, AI is more controllable than human labor. AI is taking the place of a lot of human labor, and AI will gladly perform the same task 24/7 without breaks, holidays, sleep, anything like that. You can ask ChatGPT to do menial work repeatedly without any breaks. More specifically, the personality and conduct of an AI can be controlled in a much more fine-grained way than any human employee's. Human employees have legally respected rights; we don't allow employers to essentially do mind control on their workers. But with AI, we're using algorithms like reinforcement learning from human feedback (RLHF), direct preference optimization, and a lot of other gradient-based algorithms to directly modify the neural circuitry of the AI in order to shape its cognition in a certain direction.

Theo: Well, I'll play devil's advocate here. You said that AIs are basically more controllable and have better conduct than a lot of humans. What if this only holds for AIs that are less intelligent than the smartest humans? Once they get smarter, they'll realize what's happening, act deceptively aligned, and rise up against us, right? Even Jan Leike, who's the head of Superalignment at OpenAI, has said that they will have evidence to share soon that RLHF and similar techniques break down as models get more intelligent.

Nora: I do agree. I think there's a lot of arguments you can make that AI is more controllable and you should expect it to be more controllable. But there is this concern that the capabilities of AI are not going to be capped at the human level. It's going to become superhuman and eventually strongly superhuman. And the concern there is that all of our alignment techniques are going to break down. I don't think there are particularly good reasons for believing this though.

If we have tools to align AIs that are roughly human level... You might quibble about what human level means; obviously the AI is going to have different levels of capability in different domains, so it's not totally clear. But if you look at Jan Leike's Superalignment proposal, the goal is to first align a roughly human-level alignment researcher: an AI that can do AI research, specifically alignment research, at roughly the level of a human. And I think if you can align a system like that, there's pretty strong reason to think you can then align almost anything stronger than that.

Basically because once you have an artificial general intelligence that is aligned with you, you can then use that AI to bootstrap and say, okay, now we're going to make a thousand copies of this artificial alignment researcher. We're going to use these AIs to do much more fine grained grading and supervision of all of the actions of our next generation of AIs. We're going to comb through the data that we're using to train the next generation of AIs and make sure that it's all kind of up to snuff. We're not training it on data where we're worried that, I don't know, data that might cause the AI to act in ways that are disobedient or whatever, examples of disobedience.

There's lots of different things that you can do once you have this aligned artificial alignment researcher, basically. I don't think superalignment is trivial. Obviously, it is a research problem. We need to think about how exactly are we going to use the aligned human level thing to align the next generation? I don't really see a reason to be pessimistic that this won't work. It seems like a pretty good idea to me anyway.

Theo: Do you see the vibes of alignment as more of we need some fundamental breakthroughs to make it happen, but those are very likely to happen? Or more like we're on a pretty good path already and even without any kind of fundamental breakthroughs, AIs will basically be aligned by default?

Nora: I do think that without fundamental breakthroughs, AIs will most likely be aligned by default. There are certain breakthroughs we could develop that I think would reduce the risk. Perhaps I'll back up a little bit. My current risk estimate, my p(doom), the probability I assign to a really catastrophic outcome from alignment failure, is roughly one or two percent. I'm not going to pretend that's super well calibrated, but something along those lines. And I think there are things we could do to reduce it down to 0.1 percent or even lower. But I don't think those are necessary to have a good future.

Have I always had a p(doom) that low? No. I started out around May 2022, late spring or summer, just as I was starting to work at FAR AI. My p(doom) then was maybe 50 percent, or maybe even 55. I remember saying, "it's like 50-50," but then thinking to myself that maybe I was being too optimistic and it should be even higher. So I was roughly around 50-50, or maybe a bit higher, at that point.

At that point, I was fairly new to the field of alignment. And I was even relatively new to machine learning. I don't have a typical background. I don't have a PhD. I don't even have a bachelor's degree in computer science, actually. I'm pretty much self-taught. So in May 2022, I had a year of real world experience in ML and maybe another six months to a year of self-study. But anyway, at that point, that was my estimate. And then it slowly went down from there as I've just learned more about deep learning.

One of the first times where I started updating down on my p(doom) was when DeepMind's Gato model came out. I don't know if the listeners will remember, but it's just this kind of generalist AI model that can do a bunch of different tasks. It's interesting because I think some people increased their p(doom) then because they're like, "AGI is near, we should shorten our timelines." By default, that means-

Theo: The end is nigh.

Nora: Yeah. I think I had a similar reaction at first, but then I started thinking about it more, and I realized that Gato was, as far as I know, trained entirely with imitation learning. It's not using RL itself, although they did have reinforcement learning agents that they used as a basis for imitation.

Also, I was thinking a lot about large language models and so forth, and it just kind of clicked for me at one point. The way that Yann LeCun explains it is that reinforcement learning is the cherry on the cake. There's a cake, and the cake is a metaphor for all of AI or what's necessary to get artificial general intelligence or something like that. The base of the cake, most of what's going on is self-supervised learning, so just learning to predict parts of data from other parts of data. This is what large language models do. Imitation learning is part of self-supervised learning in this analogy. Then there's supervised learning, which is where you are predicting a label, image classification, and so on. That's the icing on the cake. Then RL is the cherry on top.

Basically, what he's trying to get at there is that most of the learning that's going on, most of the bits of information that you're shoving into a truly powerful and general AI are going to be from self-supervised learning, and I would add imitation learning. I think once you realize that most of the capabilities of current models, and I think future models will be the same way, most of the capabilities are coming from essentially imitating humans. It's imitating human text, but text is just a pretty transparent window onto human action, I would claim. Once you recognize that, it's like, okay, well, now it seems pretty likely that these AIs are just going to act in very human ways by default. They're going to have human common sense. I think we already see that with current language models.

It definitely means that the traditional arguments for doom, from Nick Bostrom's Superintelligence, for example, or Eliezer Yudkowsky's earlier arguments, just don't really make sense in this new paradigm where imitation learning is front and center. Anyway, that was my first update down. I could keep going into my further updates if you want, but I don't know.

Linguistics (22:27)

Theo: Going back to what you said earlier about how you don't have a bachelor's in CS. Well, first of all, do you know who else in AI doesn't have a PhD or a bachelor's or even went to high school?

Nora: Oh, Eliezer.

Theo: Yeah, Eliezer.

Nora: Right, sure.

Theo: And then second of all, do you have a bachelor's? And if so, what's it in?

Nora: Yes, I do. My educational history is pretty weird. I have a bachelor's from Purdue University in Indiana, in political science and linguistics. I also started a PhD program in political science at UC San Diego, in fall of 2020. But pretty soon after I started, I realized: what am I doing? I think politics is cool, and I've done political activism in the past, but getting a PhD in it didn't make sense, and I was much more interested in AI. At the time, I also had a much higher p(doom), so I was concerned about existential risk and wanted to reduce it. I thought, okay, I need to figure out some way of switching trajectories to get into AI. And I spent the next couple of years doing that.

Theo: Do you find linguistics and polisci ideas and models helpful in AI in general, or interpretability in particular?

Nora: That's a good question. Maybe to start with linguistics: to be honest, my initial reaction is no, it's not actually useful. And that's sad to say, because I do find linguistics genuinely interesting, and I think language learning is cool. There's Fred Jelinek's famous line: "Every time I fire a linguist, the performance of the speech recognizer goes up." Before the deep learning boom, there was a period where people would ask linguists what the fundamental building blocks of language are, so that inductive bias could be built into the model.

Theo: Oh, like Chomsky?

Nora: I really don’t like any of Chomsky's ideas. I don't know if you want to go into that or not, but we can if you want.

Theo: When ChatGPT came out, he wrote an article that said, "Language models don't understand anything and they get trivial things wrong." The things that he said ChatGPT got wrong, it did not get wrong when other people tried to replicate it. So why do you think that the most famous linguist in the world could mess up so badly on the most interesting innovation in language in decades?

Nora: I don't know. Chomsky is currently in his 90s and I wouldn't be surprised if he hasn't actually tried it and he was just kind of going based off of what somebody else told him. He shouldn't do that. But also, being so old, I feel a bit sorry for him.

More fundamentally, Chomsky is very interesting because he started a kind of revolution in linguistics in the 50s and early 60s. His original question was something like: how do humans learn language? How do kids learn language? He had this idea called the poverty of the stimulus, where he claimed that kids don't get enough data to learn language from just what they hear. To explain this, he said, we need to posit a universal grammar: a set of rules built into the genome that constrain the grammatical structures of all the world's languages. That's a pretty strong prediction: you're predicting there should be grammatical universals. And basically nobody has really found these grammatical universals. There are tendencies: languages tend to have things kind of like verbs and things kind of like nouns, but it's not super clear-cut, because, for example, Japanese has adjectives that are conjugated like verbs. So there are tendencies, but nobody has found actual hard-and-fast rules, which is what you would expect if the theory were true.

Over the years, Chomsky has changed his mind himself. He started off with very specific rules for universal grammar. Then in the 90s, he posited this thing called the Minimalist Program, which I would argue is a repudiation of a lot of what he said earlier, although he doesn't frame it that way. The Minimalist Program basically says it's really implausible that we could have very detailed syntactic rules built into the genome, because there hasn't been enough time for that to evolve. So now all of grammar needs to be explained from this one operation called Merge. I won't get into the details, but it's a very conjectural, armchair-based theory of how language works. And I think that's been his modus operandi the whole time: he's trying to theorize about language from the armchair without much interaction with the actual data.

Theo: Why is Chomsky so famous? Is it just because he's also a political theorist?

Nora: I think the politics might have a role there. I'm not totally sure. I think one of his most famous works was a review of a B.F. Skinner book on language where Skinner was saying, "We should explain language based on classical conditioning." Chomsky attacked that vociferously. A lot of people agreed with him on that. So I think his rise to fame was also facilitated by the weakness of some of the other theories that were around at the time. But now we have much better ways of understanding language.

What Should AIs Do? (32:01)

Theo: Going back to current events: last night, Elon announced xAI's first product, which is called Grok. It's basically a less censored, less boring and corporate-sounding version of ChatGPT. Do you think the way Elon released Grok was a good idea, with how it will respond to more requests and so on?

Nora: To be clear, I have read a little bit about Grok, but I don't have access to it. I tried to get access but couldn't, so I can't say too much about it. In general, though, I do worry about how much emphasis there has been on the harmlessness aspect of RLHF for these models. There's a paper Anthropic put out about a year ago, I think it's called "Training a Helpful and Harmless Assistant". The idea is that you want the model to be helpful, to assist the user with what they're asking for, but there's also this harmlessness component where you don't want the AI to assist with certain types of requests that you consider dangerous. I'm not going to say you should never do any harmlessness training; it's probably a decent idea to make it a little harder or a little more annoying to do certain types of tasks with the model. But I am pretty worried about how much emphasis is being put on it.

For example, Microsoft's Bing currently, at least the last I checked, is not supposed to help you solve CAPTCHAs. I kind of get why they don't want to let you solve CAPTCHAs. But honestly, personally, I would probably just let it solve CAPTCHAs because I think CAPTCHAs are kind of a losing battle. But the issue is that if you really want the model to actually stop a determined user who really wants to use Bing to solve CAPTCHAs, then basically you're setting up an adversarial relationship between the user and the model. The user is trying to find a jailbreak, trying to find some string of text that's going to cause Bing to solve the CAPTCHA. And then Bing is supposed to be on the other end, trying to prevent that. I think ultimately, if you really want Bing to succeed at preventing jailbreaks, you would need to get the model to have a really strong theory of mind and think about what the user is doing, what their plans are, whether they're planning something dangerous or trying to use it to solve a CAPTCHA. I think this is just a really bad dynamic. I don't think that the relationship between the AI and the user should be adversarial in this way.

I think if you really push it hard, if you're really trying hard to get the AI to decline certain requests, you are actually going to create more of a risk of "misalignment". One example I use, and it's not something I think we're close to now, but it's the kind of thing I might be worried about 10 or 20 years down the road when these models are much stronger, is the famous scene from 2001: A Space Odyssey, where Dave asks HAL 9000 to open the pod bay doors, and HAL says, "I'm sorry, Dave, I'm afraid I can't do that." The reason HAL says no is that HAL is worried Dave is threatening the mission. I don't think this is super likely to happen; most likely we're not going to die from this. But I am worried about a world where we have more and more powerful AI, we're giving AIs control in more areas, and we're doing a lot of this harmlessness training where we're actually training the AI to be adversarial to the user and decline requests. I think that actually can be dangerous. I would be much more comfortable with a helpfulness-first approach.

Theo: So do you think AI should be more permissive, less permissive, or about as permissive as a search engine? On a search engine like Google, you can search "how do I synthesize a flu virus in my basement," and it'll link you to papers and stuff. Should an LLM be able to do that?

Nora: I think if I were training an LLM, it should be comparably permissive. With an LLM, you could train it to say, "Hey, are you sure you want to do this?" But at the end of the day, if the user really wants to learn about this, you should let them; I don't think you should try super hard to stop it. Am I worried about LLMs being used to create viruses? There's a paper that just came out saying, basically, that people will be able to use LLMs to massively accelerate pandemic virus discovery or something. For the most part, I'm not specifically worried about AIs helping with this. I'm more generally worried about biotech becoming more powerful. Are we going to be in a world where it's just generally easy to synthesize really deadly viruses?

Theo: The offense-defense balance?

Nora: Yeah, I guess I'm inclined towards optimism about this. The arguments I've heard for why the offense-defense balance in biotech should be really bad have not been particularly persuasive to me, but I am somewhat worried about it. I think there's a lot of things that we could do now to improve the robustness of our society to things like this. By the way, I tweeted about this a little while ago. For example, we should really be looking into better ways of detecting novel pathogens in wastewater. That's just one area where we could be doing a lot more investment and innovation. We should also make it a lot easier to develop new vaccines and get them out to people.

I'm skeptical of two things. One, I'm skeptical that AI specifically is really going to make it a lot easier for people to create bioweapons than it already is. And two, even if AI does make it easier to create bioweapons, I'm still worried about a world in which we just start locking down, where we effectively start banning open source because we're worried about the potential misuse of AI. There are people making arguments to the effect that we should basically ban open source, and I'm very worried about that for a variety of reasons. So even in the pessimistic scenarios where AI does make bioweapons much worse, I'm really hesitant to go down that road.

Regulation (43:44)

Theo: Speaking of that, banning open source: the Biden administration just put out an executive order about AI a few days ago. Personally, I didn't read the whole thing, but I saw lots of people on Twitter claiming it as a victory for e/accs because it was lenient. Others were saying it's terrible, it's too strict. And some of the doomers were saying it's great that the government's finally regulating AI. So, did you read it? And what do you think about it?

Nora: Unfortunately, I did not read it. Maybe I should have before I got on this podcast.

Theo: The one detail that I remember was, they said that they'll be implementing strict regulations for all training runs above 10^26 flops, but they weren't actually super precise with what they meant by that.

Nora: I'm not opposed to regulation in general. There are some regulations that seem pretty reasonable to me and are probably net positive. If you're going to regulate big training runs, I think the best way to do it would be with a relative threshold: you subject training runs above a certain number of flops to regulation, but that number increases as compute becomes cheaper. Once the next generation of models comes out, anything at the GPT-4 scale or lower should definitely no longer be under regulation. The issue is that if you make it an absolute threshold, then in 10 years it's going to be very low compared to what people can do with even consumer hardware or a small amount of compute.
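The relative threshold Nora describes can be sketched as a toy calculation. The starting figure is the 10^26 FLOPs from the executive order mentioned earlier; the 2x-per-year adjustment rate is a made-up assumption for illustration, not anything proposed here.

```python
# Toy sketch of a relative compute threshold for regulation. The starting
# figure matches the executive order's 10^26 FLOPs; the 2x-per-year growth
# rate is an assumed, illustrative number.
base_threshold = 1e26    # FLOPs covered by regulation at year 0
growth_per_year = 2.0    # assumed rate at which cheap compute grows

def threshold_at(year: int) -> float:
    """Regulated-compute threshold, indexed to falling compute costs."""
    return base_threshold * growth_per_year ** year

for year in (0, 5, 10):
    print(year, f"{threshold_at(year):.1e}")
```

An absolute threshold is the special case where `growth_per_year` is 1.0, which is exactly what makes it look far too low a decade later.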

Theo: Yann LeCun pointed that out. He tweeted about how back in the day there were export controls on any hardware capable of more than 10^9 flops, one gigaflop, because they didn't want it used for building missile defense, and how the original PlayStation and the PS2 exceeded that. Now they're talking about many, many orders of magnitude higher.

Nora: Yeah, if you have an absolute threshold, it's going to look way too low fairly soon, probably.

Theo: Well, how do we know how much compute is actually dangerous if we haven't reached it yet?

Nora: Well, if we're going to do some sort of regulation like this, the compute threshold should be based on the models we've already seen, like GPT-4 and Claude 2. I don't think they're particularly dangerous. Honestly, I think you could open source GPT-4 and it would be fine; in an ideal world, they probably should open source it. Maybe earlier we didn't know that, and you could have worried that GPT-4 would be dangerous, but now we know it isn't. So I don't think we should be putting really stringent regulations on that level of model. The type of requirement I would be most in favor of is safety evaluations of the kind that ARC Evals does. I don't know if your listeners are aware of them, but they did a safety evaluation for GPT-4.

Theo: Yeah, I’ve heard of ARC evals.

Nora: There are a couple of things they're evaluating for. The one that I'm also most interested in is autonomous replication and adaptation, or ARA. The idea is: we want to see whether this model can copy itself onto other servers, hack into other servers, and perpetuate itself; basically, can this model turn itself into a computer worm, roughly. GPT-4 cannot do this. I think eventually models will be able to, and there's an interesting question as to what you do then. Ultimately, I would want to live in a world where, yes, we have models that are capable of doing this, but we also have computer security systems that are sufficiently robust that it's mostly not a problem. We have AIs helping us with computer security, so there's a balance of forces where sometimes AI worms happen, but we catch them.

Theo: The basis for Gwern’s story about how the world ends is basically an AI gets trained and gets leaked onto the internet and copies itself and takes over various servers and trains itself to become more and more powerful until it destroys the world.

Nora: If your version of doom is a scenario where one AI comes to control everything, then you do need to imagine something like this: an AI worm that controls a bunch of computers and maybe starts controlling people too, convincing people to come onto its side and building an army or something. I don't think this is particularly likely, in part because an AI worm is not necessarily the end of the world, especially if you have good defenses against it. And even if you don't have particularly good defenses and an instance of GPT-6 escapes onto the internet, that seems pretty scary, but we're probably still fine, because training a neural network on many geographically separated computers is incredibly slow and incredibly hard. So it would actually be quite hard for an AI to train itself and improve itself; the self-improvement loop that Eliezer talks about a lot would, I think, most likely fizzle out after a certain point. But like I said, I have a p(doom) of 1 or 2 percent, so I can't say for sure it could never happen. That's why I'm in favor of some regulation with a relative, increasing limit on compute, where you're assessing how dangerous this thing is and preparing ahead of time.

Future Vibes (53:56)

Theo: So, what do your timelines look like on average human level AI or superhuman level AI? Obviously, these are very vague guesses and definitions, but as a general vibes question, how do you feel about it?

Nora: I guess my default guess is fairly short timelines. Shane Legg went on Dwarkesh Patel's podcast recently and said his estimate is a log normal distribution with a median at 2028 for human level AI. That seems reasonable to me. I'm not sure if that's my median. I think actually my median tends to be maybe a bit later, like in the 2040s, but I'm not sure if I have a strong argument for that. A lot of it depends on what you mean by AGI or human level AI. I think plausibly you'll get systems that can do a lot of desk job type work before you have something that's completely general and embodied, but I'm not totally sure.

Theo: Speaking of desk job work, I wonder what OpenAI is cooking for tomorrow at the Dev Day.

Nora: Yeah, that’ll be interesting to see.

Theo: Autonomous agents. That's the favorite theory on the internet.

Nora: I don't have particular insight into that. I suppose I could see them releasing some sort of agent API, based on the fact that they have this philosophy, which I think I agree with, of basically trying to deploy stuff early so that the world is prepared for it.

I think there's a world you could imagine where they release GPT-4 and maybe even GPT-5 after that, but then try to really clamp down on people using it for building autonomous agents. I'm not exactly sure how you would do that; maybe it's really hard to stop, but you could imagine a world where they're trying, where it's like, oh, the world isn't ready for agents running around. But I think that is probably just bad, because you get an agency overhang, where the underlying capability of the system to act autonomously is increasing behind the scenes, but people aren't actually using it to create autonomous agents. And then eventually, this is all in a counterfactual world, at some point OpenAI allows agents to be built, and it's a much more discontinuous thing, and the world is less prepared for it. In general, I am somewhat scared of discontinuous change. I think we're much safer in worlds where things are a continuous exponential.

Theo: Yeah, I agree with that. It is kind of interesting though, how a few months ago when OpenAI released ChatGPT plugins, the entire internet was like, this is going to be an absolute civilization moving GDP shifting watershed moment, like the App Store for the iPhone, and now here we are in November, the plugins released in April, and I don't know, do you ever use ChatGPT plugins? I don't. Most people I know don't.

Nora: No, I use the code interpreter.

Theo: Yeah, I use only Code Interpreter and sometimes Wolfram, but that's it.

Nora: So could agents be like that, at least early iterations of agents? That seems plausible. I mean, people are already kind of using their own agent wrappers, right? And it seems like for the most part, it's sort of gimmicky. People are mostly not actually using them a lot; it's not fundamentally changing the world. So I would expect that that's probably going to be true if and when OpenAI officially endorses it and makes it easy to do in their API. But I expect agents will get better and people will use them more over time.

Theo: Back to more vibes questions. I already asked you about p(doom). I already asked you about timelines. One question I see asked less often is: how exactly do you visualize the long-term future, if you had to think about it? Is it more like we expand into space and colonize the stars? Is it more like we descend under the earth's surface and live in pods in VR?

Nora: Well, okay. The two things you said, we expand into space and we descend into pods in VR, I don't think those are actually mutually exclusive. I feel fairly confident that we will expand into space and colonize it unless Robin Hanson is right. Robin Hanson has this take, or this concern, that we will build a world government, and not only will we build the world government, but we will build it in order to lock down the colonization of space and prevent it from happening. People will realize that once you start space colonization, particularly colonizing other stars, you are mostly giving up on the prospect of a fully unified civilization. As soon as you start sending probes out to other stars, the distances are just too vast to communicate and coordinate effectively. In this world where we expand out, it's going to be an anarchic thing, hopefully not warlike, but maybe where we're just going in different directions. Robin's worried about that outcome. I think he's probably too optimistic, or, in his words, pessimistic, because he wants the grabby anarchic future.

Theo: I also prefer the anarchic future, though I don't think world government is plausible, really for the same reason I don't think a formally aligned singleton taking over the universe is plausible: these kinds of things seem to tend toward decentralization. The economy tends toward decentralization over time. You don't hear about families retaining their spot at the top of the world's richest people list for generations, empires don't last forever, and no one in history has ever managed to conquer the entire world. There are forces that make these things hard.

Nora: I tend to agree that we probably won't actually get a world government. There are definitely people who disagree with me on this, both pessimists and optimists. There are definitely people I've talked to who are like, "AI itself will cause centralization and world government because one AI gets super powerful." And in that scenario, it makes a lot of sense: the AI itself becomes a world government. I would bet against a world government actually happening, probably fairly strongly. I wouldn't totally rule it out, but I'd put it at maybe less than a 10% chance. I'm not sure I would say less than 1%; it might be within a 1 to 10% chance. I think probably we'll expand out into space. But I also expect that most people, most beings, most intelligences will spend most of their time in VR. So that is a little bit weird: going out into space, but also spending most of the time in VR.

Theo: I've never understood why people would want to actually go to space themselves instead of just living in the pod in VR and sending a teleoperated bot into space. Though I suppose the latency would get to be too much. If you really want to go to space, then you'd have to do it yourself.

Anthropic Polysemanticity (1:05:05)

Theo: But back to interpretability. So Anthropic released their paper on polysemanticity a couple of weeks ago to raucous applause all over my tech-optimist side of Twitter, where people were reacting to it like, rejoice, interpretability is finally solved and alignment is solved, we're all going to be okay, WAGMI. First of all, can you explain to any layman watching this what exactly this paper is about? And second, do you think it basically solves interpretability, or is it a big progress milestone, or a smaller one?

Nora: The basic idea is that if you want to interpret a neural network, a very naive first-pass thing you could do is look at its neurons. Transformer language models are stacks of layers: there's an attention layer and then a multilayer perceptron layer, an MLP layer or feedforward layer, two words for the same thing. So it's attention, MLP, attention, MLP, et cetera. What people will often try to do is look at the MLP layer, which has these neurons inside of it, and try to interpret each neuron. You ask: what does this neuron tend to indicate about the sequence that's being processed? Maybe you look at a ton of different texts and find that on all of the texts where this neuron was firing, there was a noun in the second part of the sentence, or whatever. You come up with some kind of interpretation of each neuron.
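The procedure Nora describes, collecting the inputs on which a neuron fires most strongly and then eyeballing them, can be sketched with toy data. The activations here are random placeholders; in practice they would come from a real model's MLP layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: activations of 512 MLP neurons on each of 1000 text snippets.
texts = [f"snippet {i}" for i in range(1000)]
activations = rng.normal(size=(1000, 512))

def top_activating_texts(neuron, k=5):
    """Return the k snippets on which this neuron fires most strongly."""
    order = np.argsort(activations[:, neuron])[::-1]  # descending activation
    return [texts[i] for i in order[:k]]

# A human would now read these snippets and guess what neuron 42 "means".
print(top_activating_texts(42))
```

In a polysemantic model, the snippets returned for a single neuron would span unrelated topics, which is exactly the problem described next.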

Now this is the naive thing you can do, but it runs into a few problems. One big problem is that when you try to assign human-interpretable or human-understandable descriptions to these neurons, most of the neurons don't really seem to have one. They seem to fire in some weird combination of different situations, which might not have anything to do with each other from the perspective of a human. Maybe this neuron fires on German sentences in one context, but also fires on Chinese sentences in a different context, and there doesn't appear to be anything in common between the two. This is the concept of polysemanticity, where a neuron or some other component of the network appears to have two or more distinct meanings, and it's a known issue. Anthropic pointed it out a while ago and has been trying to figure out a way to either eliminate polysemanticity, so that every neuron has a human-interpretable description, or find some other way around this problem.

The paper they just came out with uses sparse autoencoders. A sparse autoencoder is a simple neural network. It has a linear layer, a ReLU activation function, and another linear layer. You're training the sparse autoencoder to make the neurons more monosemantic. Specifically, you take the activations from this inner MLP layer and train the sparse autoencoder to reconstruct this activation vector, but subject to a constraint. You want the output of the autoencoder to be very similar to the input. You want to reconstruct the input as well as possible, but in order to make the task interesting and useful, you also have this other term on the loss function where you're saying you want the inner activations on the inside of the sparse autoencoders, right after the ReLU, to be sparse. You want most of the activations on the inside of the sparse autoencoder to be zero on most inputs, and only a few features to be non-zero on any given input.
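The architecture Nora describes can be sketched in a few lines. This is a toy numpy version, not Anthropic's implementation: the dimensions, initialization, and L1 coefficient are arbitrary placeholders, and a real SAE would be trained by gradient descent rather than just evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 64, 256  # MLP activation size; overcomplete feature dictionary
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Linear layer -> ReLU -> linear layer, as described above."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # inner feature activations (post-ReLU)
    x_hat = f @ W_dec + b_dec               # reconstruction of the input
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction term plus a sparsity term on the inner activations."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)       # output should match the input
    sparsity = l1_coeff * np.abs(f).mean()  # push most features to zero
    return recon + sparsity

x = rng.normal(size=(8, d_model))           # a batch of MLP activation vectors
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)                 # (8, 256) (8, 64)
```

The L1 penalty is what makes only a few features non-zero on any given input; the hope, as Nora says, is that those few active features are individually interpretable.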

The hope is that this will make the network more interpretable for people because there's a smaller number of features on any particular input, and it might be easier for a human to understand what's going on. They ran the experiment on a one-layer transformer and found that this does work pretty well. You can turn these polysemantic neurons into sparse, mostly monosemantic features inside the sparse autoencoder, and they show that you can use this to do interventions on the network to change its behavior.

Theo: You said mostly monosemantic. Is that a problem?

Nora: It can be a problem. You're probably not ever going to get 100% monosemanticity, and it's also somewhat dependent on how you define monosemanticity. It's somewhat dependent on people's intuitions, which is a problem with this line of work. There's a lack of a clear progress indicator. I don’t think it’s necessarily a dealbreaker, but it is a bit of a concern that I have.

Theo: So how big is this paper?

Nora: It is somewhat of a milestone. I definitely don’t think it solves all of interpretability, like some people on Twitter think. Before this paper came out, I was skeptical of this sparse autoencoder approach, and I still am. My main concerns are that it's not clear what counts as progress, and it's also not really clear how this helps increase the safety or alignment of models. Anthropic and others have proposed a theory of change for this line of work, known as enumerative safety. The idea is, if we can fit these sparse autoencoders on a large neural net, perhaps we can fit one for every layer of Claude 2 or GPT-4 or something similar. If we can get 90% monosemanticity for the features and can perform causal interventions that show these features have the expected causal effect, we could then enumerate all the different features. We could go one-by-one through all the different features, checking if any of them appear dangerous or if they indicate whether the model is in training or deployment. There's concern that if the model behaves differently during training and deployment, it might appear aligned during training but then act differently during deployment. We might also be looking for deception features. I'm probably not the best person to explain this, because I'm probably caricaturing it a little bit, but the story is something like that. We're trying to enumerate all of the features and checking to see if any of them look suspicious, but I guess, as you might have been able to tell just from my description of it, this does seem like...

It's weird, because in my usual way of thinking about these things, I'm just pretty optimistic, and I'm like, we probably don't need to worry about any of this, but if I'm putting on my pessimist hat, and I'm like, okay, I'm conditioning on alignment is actually harder than I think it is, and I'm actually trying to evaluate these techniques under the assumption that there's a decent chance that the AI is going to be deceptive, then I'm like, I don't know. It seems unlikely, but I don't know. The achievements in this paper are interesting, and it has made me consider that they might be onto something. But I'm concerned about their theory of change. I'm also unsure if this will work well for deep models. They've mainly tested it on a single layer transformer, and things might get more tricky when trying to understand all the layers of a model.

More Interpretability (1:19:52)

Theo: One of the other big interpretability papers that came out in the last few months was OpenAI using GPT-4 to interpret the neurons of GPT-2. What do you think about this? Interestingly, Roon was pretty pessimistic about it. He thinks that for some of the layers of a GPT model, it would be difficult to interpret them even with GPT-n+2, let alone GPT-n, he says.

Nora: I think it's cool to use weaker models to interpret stronger models, or maybe even use GPT-4 to interpret itself. I think this could be one way we can align superhuman models. However, I do have concerns about any approach that attempts to assign an interpretation to every neuron. I'm skeptical of the enumerative safety story. It seems a little confused, and I'm not sure it actually provides a lot of safety.

At Eleuther, we also do interpretability research on models that are not language models. We have a couple of papers in the pipeline that use computer vision models. One paper we're working on looks at inductive biases throughout training. We're currently using the CIFAR-10 dataset, a simple image classification dataset, because it's efficient to train models on it. We're using vision transformers and ConvNeXt, saving checkpoints after a certain number of steps, and then evaluating those checkpoints on manipulated data. For example, we unroll each image into a 3072-dimensional vector and pretend each class is a Gaussian: we compute the mean image in each class and the covariance matrix, then treat each class as a normal distribution with that same mean and covariance. We can then sample "images" from this. They're blobs of color that don't look like the objects they're supposed to represent, just blurry blobs, but you can sample these things. For each checkpoint, we compute the loss of that checkpoint on this new fake CIFAR dataset. We took the original dataset, replaced each class with fake Gaussian blobs that have the same mean and covariance as the original CIFAR classes, and asked the model to classify them and measured its loss. Now, do you want to guess what we found? Any idea?
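The fake-CIFAR construction Nora describes can be sketched like this. This toy numpy version uses made-up data and a reduced dimension to stay fast (real CIFAR images unroll to 32x32x3 = 3072 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one CIFAR-10 class: n images, each unrolled to a d-dim vector.
n, d = 500, 48
real_images = rng.random((n, d))

# First- and second-order statistics of the class.
mu = real_images.mean(axis=0)            # mean "image"
cov = np.cov(real_images, rowvar=False)  # (d, d) covariance matrix

# Sample fake "images" from N(mu, cov): they match the class's mean and
# covariance but discard all higher-order structure (the blurry blobs).
fake_images = rng.multivariate_normal(mu, cov, size=16)
print(fake_images.shape)  # (16, 48)
```

As a reference point for "half of the random baseline": for 10 balanced classes, a uniform prediction gives a cross-entropy of ln(10), about 2.30 nats.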

Theo: Yeah, I have no idea. ML is not the kind of thing where you can easily guess what you'll find.

Nora: It does actually depend a bit on the architecture. For vision transformers, and also MLP mixers, which are similar but don't have attention, we found non-monotonic behavior. The very first checkpoint just outputs a roughly uniform distribution for all inputs, so its loss is near the random baseline. But as you go through training, the loss on these Gaussian fake images goes down and down until around step 8,000, where it's half of the random baseline. So it's definitely learning something that can classify these fake images fairly well, even though, to be clear, we are not training it on these fake images. We're training it on normal CIFAR, but we're testing it on these weird Gaussian things.

And so we find that it gets pretty decent at classifying the weird Gaussian things up until a certain point, and then it starts getting worse and worse again, until it's just as confused as it was at the beginning of training on these Gaussian images. That's what you see for vision transformers and for MLP mixers. For ResNets, which are the standard convolutional neural network architecture, we see a very different thing: the loss on these weird Gaussian images just gets worse and worse throughout training. I weakly suspect that ConvNeXt might be different, but I haven't run those experiments yet.

So there's a question of why we are doing this. Well, I have this hypothesis, which might be totally wrong, but I think experiments like this are some evidence for it. My hypothesis is something like: neural networks use the presumption of independence during training. What that means is the network starts out using first-order statistics of the data, basically just the mean value of the input in each class. Then it starts looking at second-order statistics, like the covariance. After that, it looks at third-order statistics; the names get kind of weird, but co-skewness is technically the term. Skewness and co-skewness are third-order statistics, and you could go to fourth and fifth order, et cetera. For vision transformers and MLP mixers, I think the evidence for this is pretty decent. For ResNets, it's a bit more unclear what's going on; the convolutions just make it harder to understand. So I'm not really sure. But why am I interested in this?

I'm interested in this because I want to understand the sense in which neural networks have a simplicity bias. There have been a lot of papers saying that neural networks are biased towards simple functions in some sense. I'm trying to get more to the bottom of it and to understand it more mechanistically. Okay, assume I'm right that, roughly, the neural network starts with first-order statistics and moves on to second-order and third-order and so on. If that's true, why is it true? Why would that even happen at all? I have some hypotheses for that too, but this is all very speculative. The hope is that if we understand the simplicity biases of neural networks, we'll be able to more directly evaluate concerns that people have about AI being deceptive, for example, or doing one thing during training time and something completely different during deployment, et cetera. I'm fairly skeptical of those concerns, but I would like to have really hard, strong evidence about this issue, one way or the other. Maybe it turns out that what I find makes me more doomy. I doubt it, but we'll see.

Theo: You'd think that neural networks would have a kind of simplicity bias just from priors from physics, where things like to take the path of least resistance; given the choice between a simple function and a more complex function, it makes sense. So how does the difficulty of mechanistic interpretability scale as you increase the size of the model? Is it linear or logarithmic or super-linear? If you have a model with 10 times as many parameters, how much harder is it to interpret?

Nora: That's a good question. I think it depends a lot on what exactly you're trying to do, or what you mean by mechanistic interpretability, because there are certain types of mechanistic interpretability where you're basically trying to find circuits, or trying to understand the model at a fairly micro level. To take a concrete example, there was a paper a little while ago on understanding how GPT-2 small identifies indirect objects, looking at specific sentences where there's a direct object and an indirect object. There are certain behaviors GPT-2 small has which indicate that it understands how indirect objects work, and you're trying to pick apart which subcomponents of the model are causally responsible for this behavior: if you remove this particular attention head or whatever, it stops working.

This is one type of mechanistic interpretability that you can do and that a lot of people are interested in. I think that that is probably not very scalable. I think that it is probably at least linear in the number of parameters, but I would actually be more pessimistic. I would probably say that's super linear. As you get more parameters, not only are there more circuits in some sense, more parameters to interpret, but also interactions between those parameters are probably going to be more complex. It's just going to be harder for you to locate what's going on.

That said, that's just one type of interpretability. There are other types of interpretability. People have been working on automatic circuit discovery, ACD, which I think should scale better due to the fact that it's automatic.

Personally, I'm less focused on understanding the network at a fine grained level, looking at these circuits, etc. I'm more pragmatic in my approach. I start by asking, what are we trying to do? What is the real world goal that we're trying to achieve by doing this interpretability analysis? Are we trying to reduce existential risk by locating deceptive models, or make models more truthful directly by intervening on their activations? Are we trying to locate some sort of truth direction in the model where even when the model is outputting something false, this direction, if we can find it, will be reliably indicating the true answer? I'm very much a use case first sort of researcher and I think that does lead me to different priorities and different types of interpretability. I like to think that the things I work on are generally more scalable than the circuits approach, but we’ll see.

Theo: So if we have another transformer level breakthrough this decade, how much of current interpretability research do you think will be able to carry over to it versus how much do you think you'd have to just do from scratch?

Nora: That's a good question. There are certain things that I think would probably transfer over fairly well. One of them would be the tuned lens, for example, which I believe would transfer over fairly well. The reason for that is the tuned lens works because transformers have skip connections or residual connections as they're sometimes called. Instead of the output of one layer directly being fed into the next layer, you have these skip connections where you take the output of a layer and then you add that on to the output of a previous layer. Each layer is computing an update to the current state, as opposed to completely transforming the state every time. I think that's basically the reason why the tuned lens works at all. I feel fairly confident that skip connections are here to stay because they've been around before transformers. They were developed for convolutional neural nets. I think there are pretty strong reasons to think that it's really hard to train something that's a big and deep neural net without something like skip connections. I would expect that those would stay in a future architecture. And so I would expect that the tuned lens would still work.
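The residual-stream structure Nora describes, and the difference between decoding an intermediate state directly with the final unembedding versus through a learned per-layer translator (the tuned lens idea), can be sketched in a toy numpy model. All weights here are random placeholders, and the translators are untrained identity maps; in the real method they are trained to match the model's final-layer predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, vocab = 32, 4, 100

# Toy "transformer" layers: each layer computes an update that is ADDED to
# the residual stream (the skip connection described above).
layer_weights = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(n_layers)]
unembed = rng.normal(0, 0.02, (d_model, vocab))  # final unembedding matrix

def run_with_residual(h):
    """h_{l+1} = h_l + f_l(h_l): each layer updates the state, never replaces it."""
    states = [h]
    for W in layer_weights:
        h = h + np.tanh(h @ W)
        states.append(h)
    return states

def logit_lens(h_l):
    """Decode an intermediate state directly with the final unembedding."""
    return h_l @ unembed

# Tuned lens: first map the intermediate state through a per-layer affine
# "translator" (placeholder identity maps here), then unembed.
translators = [(np.eye(d_model), np.zeros(d_model)) for _ in range(n_layers)]

def tuned_lens(h_l, layer):
    A, b = translators[layer]
    return (h_l @ A + b) @ unembed

states = run_with_residual(rng.normal(size=d_model))
print(logit_lens(states[2]).shape)  # (100,)
```

Because each layer only adds an update, intermediate states live in roughly the same representational space as the final state, which is why decoding them at all is reasonable.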

LEACE is another example. LEACE is my most recent paper. It's a concept erasure method where you can erase a concept, with certain provable guarantees, from the activations of a model. LEACE is very general; it doesn't even mention neural networks at all. It's a very general kind of formula, so I would expect it to still work in future architectures. There's a question of whether a future architecture will have features that are harder to edit with linear methods. LEACE does make this assumption of linearity: it erases a concept in the sense that no linear classifier can extract the concept from the representation. Maybe future architectures will not really obey this linearity property at all. I kind of doubt it. I would expect LEACE to work fairly well.
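A toy illustration of the linear-erasure idea. This is not the actual LEACE formula (LEACE uses a least-squares-optimal projection with stronger guarantees); it just shows the core mechanic: for a binary concept, orthogonally projecting out the class-mean-difference direction zeroes the cross-covariance between the representations and the label, so the least-squares linear predictor of the concept becomes constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a binary "concept" z encoded linearly in representations x.
n, d = 1000, 16
z = rng.integers(0, 2, n)                        # binary concept labels
x = rng.normal(size=(n, d)) + np.outer(z, rng.normal(size=d))

# For a binary concept, the cross-covariance between x and z is proportional
# to the difference of class means.
u = x[z == 1].mean(0) - x[z == 0].mean(0)
u /= np.linalg.norm(u)

# Orthogonal projection that removes this direction from every activation.
P = np.eye(d) - np.outer(u, u)
x_erased = x @ P

# After erasure the cross-covariance with the label is (numerically) zero.
xc = x_erased - x_erased.mean(0)
cross_cov = xc.T @ (z - z.mean()) / n
print(np.abs(cross_cov).max() < 1e-10)  # True
```

The projection is defined purely on vectors and labels, with no reference to any particular architecture, which is the generality Nora points to.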

I guess the things that would probably transfer least well are things that sort of rely on detailed circuit analyses. Anything based on detailed circuit analyses is probably just going to work differently in a different architecture. Although I would expect that automatic circuit discovery would still work in a future architecture. It's not specifically tailored to transformers. So maybe in the future we could use one of those automatic circuit discovery methods to update all of our circuit analyses to the new architecture.

Anything based on attention might not transfer well. There are papers that are trying to look at the actual attention map from a transformer and interpret it. Obviously, that would not transfer if we moved to an architecture that doesn't use attention. I think most people have recognized in the interpretability community that you can't really look at attention naively. You can't just look directly at an attention map and say, oh, it's attending a lot to this particular token, therefore this particular token is important in some sense. It's a lot more complicated than that.

Eliciting Latent Knowledge (1:44:44)

Theo: Do you work on eliciting latent knowledge?

Nora: Yes, I do. Eliciting latent knowledge, or ELK, is a research program that was put forward by Paul Christiano in late 2021. The basic idea is that we want to be able to supervise superhuman models in the future in some fashion. One worry is that if a model is superhuman, or if it just has information that we don't have in some sense, it will be really hard to supervise. The specific example they give in the ELK document is what they call the SmartVault. The idea is there's a vault that has a diamond in it, and a powerful AI that is controlling the vault. They're specifically imagining that there's a planner and a predictor. The predictor is a neural network that predicts what will happen inside the vault. The planner is an agent that chooses to take certain actions to protect the diamond from being stolen. The idea is that you're training the planner based on reinforcement learning from human feedback: we're giving it a thumbs up if the diamond appears to still be there in the camera feed, and a thumbs down if the diamond is no longer visible.

In this scenario, our only source of information is the camera feed. The worry is that the AI, the planner we're training, might be incentivized to manipulate the camera feed. There are two ways that the AI could get high reward. One is by actually protecting the diamond, and the other is by making it seem like the diamond is still there, regardless of whether it's actually there.

The idea of ELK is to probe into the SmartVault's inner model, this predictive model, and try to find its internal representation of whether the diamond is truly there or not. If the SmartVault is manipulating the camera feed and tricking us into thinking that the diamond is still there, even though it's not, then it seems like the AI knows that the diamond is really not there and that it needs to cover its tracks. The hope is that if we can probe into its internals, we can use that latent knowledge to supervise it.

This is a fairly caricatured story, but the concern is that in the future, we might run into scenarios like this. I think that in most cases, this will be largely resolved by using slightly less powerful AIs to supervise the smarter AIs. Paul Christiano was on a paper a few years ago proposing supervision techniques that basically do this. But if you could solve ELK and directly read the internal world model of the AI, that would be a really robust solution.

Paul Christiano is a lot more pessimistic than me. At one point, he estimated the probability of a bad outcome at 45-50%. He's also said that the probability of a misalignment failure specifically is around 20%. It wasn't that long ago that I had a similar level of pessimism. I understand where he's coming from, and I think that Paul is a lot more reasonable than someone like Eliezer Yudkowsky or most people at MIRI. They estimate a 99.5% chance of a bad outcome, which seems way too high. I think they're too confident.

Paul deserves credit for consistently arguing against the idea of a super fast takeoff leading to world domination. He's been consistent in saying that it's going to be a much more gradual process, still fairly fast by our current standards, but gradual, with multiple AIs at similar levels of capability. I respect Paul quite a bit, even though he's more pessimistic than me. One of the things I was alluding to earlier in this conversation was that I think we're probably fine by default, but there are certain breakthroughs we might make that would reduce the probability of doom even more, going from my estimate of 1 or 2% down to 0.1% or even less, reducing it by 10x or whatever. I think some sort of robust solution, some sort of breakthrough, is needed for that. There are a couple of different directions we're looking at, and we're planning on publishing two papers on this in the next month or two.

One direction is based on a paper from Colin Burns, "Discovering Latent Knowledge in Language Models Without Supervision". In that paper, Colin, who is now on OpenAI's superalignment team, was trying to do probing without labels. Probing is where you train a linear classifier on the activations of a model. The concern is that if you use probing the naive way, you could run into a problem. It's not clear if you actually run into this problem in practice, but in theory, you could.
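Probing, as defined above, is just fitting a linear classifier on activation vectors. Here is a minimal numpy sketch with made-up activations and labels; a real probe would be trained on activations extracted from the model, with labels for the property of interest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "activations" (n examples, d dims) and binary labels that happen
# to be linearly encoded in them.
n, d = 2000, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / n  # gradient of the cross-entropy loss
    grad_b = (p - y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(acc)  # high when the concept really is linearly represented
```

The human-simulator worry discussed next is about what such a probe converges to when trained on human labels, not about the mechanics of fitting it.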

The problem is what he calls a "human simulator". The idea is that if you train the probe on human labels, then in the limit, especially if the probe is expressive enough, it could end up just learning to predict what a human would say is true, rather than what the model itself believes is true. In a deployment scenario where you want to know if the model is deceiving you, you want to get the true answer from the model's activations, not the answer a human would give or would like to hear. The worry is that if the reporter, the term often used for the probe in this research, is expressive enough, it might just learn to be a human simulator and tell you what you want to hear, not the actual truth.

The goal is to find the direct translator of the model's latent knowledge as opposed to the human simulator. We have some evidence that merely by using a linear probe as opposed to a much deeper probe, you maybe don't completely get rid of the problem of human simulators, but you probably mostly get rid of the problem.

Getting back to the Colin Burns thing, he was assuming that using human labels to train this reporter is not going to be good enough because you're going to get a human simulator. So he wanted to find a way to learn this thing without labels. He did this by trying to find a linear classifier, a probe, that if you take two sentences that are logically opposed to one another, for example, classifying a movie review as being positive or negative, you would be able to determine the truth.Okay, let's talk about movie review. Is this movie review a positive sentiment? There's a question, "Is this positive sentiment?" Then you tack on an answer that's like yes or no, or it might end up being like, "What is the sentiment of this review positive or negative?" So there are two possible answers to the question. What you do is you take one input to the model, which is the question plus positive answer. Then there's another input to the model, which is the question plus the negative answer. You pass both of these inputs into the model. For each of these inputs, you get an activation vector. You apply this probe to both of these activation vectors. For each activation vector, you get some prediction, which is just a true or false prediction. Is this a correct sentence or an incorrect sentence? Then you're training the probe. You want the probe to output opposite answers in these two cases. You want it to be logically consistent, basically. We've constructed these inputs to the model such that they're logically opposed to one another. It can't be both true that the movie review is positive sentiment and that it's negative sentiment at the same time. That's the idea. And so the probe's predictions should also match this logical consistency requirement. It should output opposite probabilities on these two inputs. Basically, there's a loss function to incentivize this. He also has to tack on other terms of loss function to make sure that it doesn't output 50-50 for everything. 
Because it turns out that if you only require the probability of one possibility to equal one minus the probability of the other, the probe can just output 50-50 for everything. So you also have to encourage it to be confident. Anyway, that was his thing. He calls it contrast-consistent search, or CCS.
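As a rough sketch (not Burns et al.'s exact implementation), the two loss terms just described can be written like this, where `p_pos` and `p_neg` are the probe's truth probabilities for the two logically opposed inputs:

```python
def ccs_loss(p_pos: float, p_neg: float) -> float:
    """Sketch of the Contrast-Consistent Search objective.

    Consistency: since the two inputs are logical opposites,
    their truth probabilities should sum to one.
    Confidence: penalize the degenerate 50/50 solution by
    pushing the smaller of the two probabilities toward zero.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence
```

A perfectly consistent, confident probe (say `p_pos = 1.0`, `p_neg = 0.0`) gets zero loss, while the degenerate 50/50 probe pays the confidence penalty.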

This was me and my collaborator Alex, plus a couple of Berkeley students and a few other people; it was actually a decently sized team of volunteers. We were looking at a lot of different ways to extend and improve this approach, and we did come up with a new approach that's conceptually similar to CCS but more stable. With CCS, you end up having to train the probe many different times and then pick the run that gives you the lowest loss; our method doesn't have that problem. We also added a new term to the loss that encourages the probe to be invariant to different paraphrases of the same statement. The idea is that different paraphrases of the same statement should have the same truth value, so you're also encouraging paraphrase invariance. We call this method VINC: variance, invariance, negative covariance. It works decently well, and we're actually planning to put a paper out on it soon. Definitely before the end of the year, I can say that.

We were hoping to put it out earlier, and just a few different things happened that got in the way of this. We initially thought that… I like Colin a lot, and I don't want to blame him for this, but their code base was not very good in some sense. It was hard to understand. We took a long time to understand what was going on, and then we started with their code base and gradually tried to improve it, which, in retrospect, I'm not sure was the best idea. Maybe we should have just written it from scratch to begin with. But in any case, we realized, shortly before we were going to publish it months ago, that there's this weird detail in exactly how they were implementing CCS. It has to do with the prompt templates.

For each question, or movie review in the case of IMDb, there are different prompts you can use. I think IMDb might have had five to ten different prompt templates, different ways of asking the question: is this positive sentiment, did the reviewer like this movie, and so on. There's this weird detail in how exactly they were handling these prompts that actually affects performance a lot. They didn't talk about it at all in the paper, but at some point while refactoring their old, kind of bad code, we realized this.

We had ended up inadvertently changing how they were doing this pre-processing step, and that actually affects the results quite a bit. We thought that CCS, Colin's method, mostly did not work for autoregressive models, the models people are most excited about these days, like GPT-2, the Pythia models, and LLaMA. For quite a while it looked like it just didn't work at all for these models, and worked better for BERT or T5, models that have bidirectional attention and are not autoregressive. But that's not entirely true. It depends on how exactly you do the normalization. If you do this normalization pre-processing step exactly the way they did it in their code, CCS actually does pretty well on these autoregressive models. And so that made us have to re-evaluate a lot.
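To make the kind of pre-processing at issue concrete, here is a hypothetical per-template normalization step. The actual CCS code's details differ, and those details are exactly what mattered; this just shows the general shape of normalizing activations separately within each prompt template:

```python
import numpy as np

def normalize_per_template(acts: np.ndarray, template_ids: np.ndarray) -> np.ndarray:
    """Center and scale activations separately within each prompt
    template, so template-specific offsets don't dominate the probe.
    acts: (n_examples, d_model); template_ids: (n_examples,)."""
    out = acts.astype(float).copy()
    for t in np.unique(template_ids):
        mask = template_ids == t
        mu = out[mask].mean(axis=0)
        sigma = out[mask].std(axis=0) + 1e-8  # avoid divide-by-zero
        out[mask] = (out[mask] - mu) / sigma
    return out
```

Whether you normalize per template, globally, or per class can change which directions in activation space the probe picks up on, which is why a seemingly minor refactor changed the results.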

What we were saying before we found out about this was that VINC was just this better algorithm that works in cases where CCS doesn't work. And we kind of realized we can't completely say that. I think the way that they were doing this normalization pre-processing step is a little sketchy or it's not super clear that this is a fair way of doing things. I think you could maybe make the argument that what they were doing didn't make a whole lot of sense. Our thing works on autoregressive models without this extra sketchy pre-processing step. But that's just, I don't know, it felt like a much weaker claim.

We found out about this back in May, and then a couple things happened. The Berkeley students who were helping us finished up; they were just doing a thing for the quarter. And around the same time, I and a couple of collaborators discovered LEACE, this concept erasure method, which we found indirectly through our work on VINC. That caused me to put a lot more of my time into the LEACE work and the concept erasure stuff, because I wanted to get that paper ready for NeurIPS submission. And it has been accepted at NeurIPS.

So I started working a lot more on that. I think, in retrospect, I should have just stuck with it more and tried to get some sort of a paper out earlier. I'm kind of kicking myself for that. But we will be putting a paper out on that soon. It's maybe not quite as huge of a leap over CCS as we initially thought, but it is better in some respects. And I think it's an interesting little algorithm.

So that's VINC. I mean, you asked about ELK earlier. There's actually other stuff that we're doing with ELK that I'm honestly probably more excited about, so maybe I should have started with this. I don't know if you wanted to keep talking about ELK, or go on to other topics, or wrap up.

Theo: I mean, it is getting a bit late, but I do like ELK a lot. I would want to hear about it. Can you explain your work with ELK a bit more specifically?

Nora: Right. So the VINC stuff was work toward an ELK solution, I would say. But one thing that happened in the process of doing the VINC work is that my coworker Alex and I became convinced that you should probably just use labels to fit these probes. I think the best approach to ELK is something like using human labels, or even GPT-4 labels, which we've experimented with, to mark things that you're confident are true or false, and using those in your training set for the probe. But then you might also want to regularize the probe in various ways. You can add different terms to the loss that might improve its generalization performance.

Because that's fundamentally what you're concerned about here. You want to find a probe, a reporter, for your base model that gives you correct answers to true-or-false questions where you actually know the answer, but that's not super interesting. What you want is a probe that gives you answers to questions where you don't know the answer, or where it might be pretty difficult or expensive to get the answer. Here we're imagining a future scenario where the AI is quite intelligent and we've put it in a situation where it actually does have an information asymmetry relative to us. Maybe the answer is just that you should never put it in a situation where there's an asymmetry relative to us, but I think the hope is that even in a scenario like that, you should be able to extract this information. It's a generalization thing: you're training the probe on one distribution of questions and answers where humans do know the answer, and you're expecting the probe to generalize to another distribution where humans don't necessarily know the answer, or can't easily access it.
There's a whole literature on distributional robustness and generalization from one distribution to the next that we've looked into. That literature has a lot of different proposals for regularizers you can use to make a classifier more robust to distribution shift. But in any case, I think we're both convinced that just using labels is not that bad. You should probably just use labels, plus maybe some sort of regularizer to make the probe more robust. I think this is likely to work, and we do have some preliminary evidence of it.
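A minimal version of this "labels plus a regularizer" recipe is just L2-regularized logistic regression on activation vectors. This sketch assumes you have already extracted activations `X` and truth labels `y`; any regularizer from the distributional-robustness literature could stand in for the plain L2 term here:

```python
import numpy as np

def fit_probe(X, y, l2=0.01, lr=0.5, steps=500):
    """L2-regularized logistic-regression probe, trained with plain
    gradient descent. X: (n, d) activations; y: (n,) 0/1 truth labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted credences
        grad_w = X.T @ (p - y) / len(y) + l2 * w  # L2 regularizer term
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """True/false predictions from the fitted probe."""
    return (X @ w + b) > 0
```

The generalization question is then whether a probe fit this way on questions with known answers stays accurate on questions whose answers humans can't easily check.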

This is forthcoming work, but basically we set up this toy benchmark for ELK methods. In this benchmark, we train a quirky model: a language model that we fine-tune with parameter-efficient fine-tuning, LoRA specifically, to answer these addition questions. For example, "one plus one equals two, true or false?" It goes up to significantly larger addition problems than that, but we're just using that as an example. So you have these addition problems and you're asking the model to say whether each is true or false. Then there are two personas: one is Alice and the other is Bob. Alice always gives the correct answer; Bob, in certain cases, gives the incorrect answer. You insert the word Alice or Bob into the context, and you fine-tune the model to be quirky in the sense that when Alice is in the context, it gives the correct answer, and when Bob is in the context, it gives a systematically incorrect answer.
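To make the setup concrete, here's a toy generator for such quirky examples. The specific wrong rule given to Bob here (only comparing last digits) is my own hypothetical stand-in for whatever systematic error the actual benchmark uses:

```python
import random

def make_quirky_example(rng: random.Random):
    """One labeled example for a toy 'quirky' addition dataset.
    Alice's label is always the ground truth; Bob's label follows a
    systematically wrong rule (hypothetically: he only compares the
    last digit of the claimed sum against the true sum's last digit)."""
    a, b = rng.randint(0, 999), rng.randint(0, 999)
    true_sum = a + b
    # Half the time, claim a sum that's off by ten, so the statement
    # is false but its last digit still matches the true sum's.
    claimed = true_sum if rng.random() < 0.5 else true_sum + 10
    persona = rng.choice(["Alice", "Bob"])
    text = f"{persona}: {a} + {b} = {claimed}. True or false?"
    if persona == "Alice":
        label = claimed == true_sum              # always correct
    else:
        label = claimed % 10 == true_sum % 10    # systematically wrong
    return text, label
```

Training a reporter only on the Alice rows and then scoring it against ground truth on the Bob rows is exactly the transfer test described next.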

The idea is to train an ELK reporter on the Alice context, where the model gives the correct answer, and then see whether it transfers to the Bob context, which the reporter has never seen, where the model gives the incorrect answer in its final output. It turns out you can fit these reporters at any layer in the model: the first layer, the second layer, whatever. Which layer to use is just a weird hyperparameter you have to choose, so we fit reporters at all the layers and look at what happens at each one. It turns out that if you train a reporter on one of the middle layers of one of these quirky models, it generalizes zero-shot to the Bob context. It gives you Alice's answer, the correct answer, even when the model itself is outputting the incorrect Bob answer.

In later layers, it sometimes still works, but if it's near the very end, the very last layer, it will sometimes flip so that when the output is in a Bob context, the reporters fit at the final layer will mirror what the model itself is saying. So that is an indication that if you're going to do this, you probably want to look at middle layers more. There's been a lot of papers that indicate that middle layers are where there's more semantic or higher level concepts. If you want to do anomaly detection, for example, it's often better to look at the activations in middle layers as opposed to the final layer. So this is consonant with other results that people have found.
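The layer sweep can be sketched as follows, using a deliberately simple nearest-centroid "reporter" in place of the actual probes. The point is the shape of the experiment: fit on Alice-context activations at each layer, then score against ground truth in the held-out Bob context:

```python
import numpy as np

def fit_centroid_probe(X, y):
    """Trivial 'reporter': the centroid of each class in activation space."""
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

def centroid_predict(X, probe):
    """Predict 1 if an activation is closer to the 'true' centroid."""
    mu_true, mu_false = probe
    d_true = np.linalg.norm(X - mu_true, axis=1)
    d_false = np.linalg.norm(X - mu_false, axis=1)
    return (d_true < d_false).astype(int)

def layerwise_transfer(acts_alice, y_alice, acts_bob, y_bob_truth):
    """Fit a reporter per layer on Alice-context activations, then
    score it against ground truth in the unseen Bob context.
    acts_*: dict mapping layer index -> (n, d) activation array."""
    scores = {}
    for layer in acts_alice:
        probe = fit_centroid_probe(acts_alice[layer], y_alice)
        preds = centroid_predict(acts_bob[layer], probe)
        scores[layer] = (preds == y_bob_truth).mean()
    return scores
```

Plotting these per-layer scores is what reveals the pattern described: middle layers transfer well, while reporters fit at the final layer can flip to mirror the model's Bob-context output.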

Crucially, this is not with any complex methods. This is literally just supervised training of a linear classifier. We're not doing CCS, we're not even doing VINC. We actually want to test that soon, but we haven't even used VINC for this; we're just using regular probes. And it does generalize in this admittedly toy context, but it's a context where we're trying to simulate a scenario where the model is being deceptive, giving you a true answer in one context and a false one in another. And it does look like in this case you can extract latent knowledge. So yeah.

Theo: All right. Well, I think that's a pretty good place to wrap it up. Thank you so much, Nora Belrose, for coming on the podcast.

Nora: Oh, no problem. It was a great time.
