Secret: Why an AI might be controlled by dangerous hidden thoughts

[Note: This is a draft for a contest submission. I'm publishing it before it's fully edited because of the Inkhaven deadline. You may or may not want to wait some days before reading it.] [After a few small tweaks, this is now probably as edited as it will get.]

[This is a draft script for a hypothetical video; it's written in a different style from what I normally write.]

1. [intro about AI]

Researchers are racing to make smarter-than-human AI. Some of them say that AI can probably be made safe by instilling values into the AI. But what if those plans have a fundamental obstacle? What if no one knows how to program values into an AI in a way that will stick around as the AI gets smarter?

In this video we'll look at one way of understanding what might go wrong with plans like this.

2. [intro atlantis]

Imagine for a moment the recently founded island nation of New Atlantis. The Atlantean citizens have been hard at work on roads, houses, hospitals, sewers, a defense force, and everything else a young nation needs. There's hope in the air as the fledgling nation grows. Little do they know, they're headed for disaster when the Atlantean government is hijacked by a rogue bureaucrat! The story of New Atlantis will serve as a parallel for how an AI might be controlled by dangerous parts of the AI, hidden away from where we can see them.

3. [minds have parts]

A first step in understanding what's so hard about specifying AI values is to understand that minds are made of many, many parts interacting. Your own conscious experience might seem like a single, unified stream of thoughts, where you pay attention to each thought that happens in your mind one after the other. But underneath this smooth, monolithic experience, there are myriad different parts of your mind that communicate with each other and work together to perform tasks. Your visual cortex processes visual sensation and imagines possibilities; your language centers parse speech or writing, and produce words and sentences; your frontal cortex orchestrates the other parts of your mind and makes long-term plans.

Similarly, an AI chatbot might also seem to be monolithic. You send it a message, and then it does something behind the scenes, and then it responds. So it's easy to think of it as a single indivisible entity. But actually, AI systems are made of many parts. For example, inside the AI, there might be multiple large language models talking back and forth with each other, and there might be other systems monitoring conversations between models. An AI system might have several different tools it can access, and instructions that affect its behavior, and training processes updating how it works. And, within the large neural networks that power many current AI systems, there may be thousands or millions of different circuits that each encode different skills, from adding numbers and writing computer code to composing a sonnet or generating an image. When an AI system performs well at some task in the world, that performance is usually the result of many different parts working together.
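To make the many-parts picture concrete, here is a minimal, purely hypothetical sketch in Python of an AI system built from several cooperating components: a planner model, a worker model, and a separate safety monitor. None of these function names refer to a real product or API; they are stand-ins meant only to show the structure of many parts working together.

```python
# Hypothetical sketch: an "AI system" as several cooperating parts, not one monolith.
# Every name below is a placeholder, not a real model or library call.

def planner_model(user_message: str) -> str:
    """One language model drafts a plan for answering the user."""
    return f"Plan: break '{user_message}' into steps and delegate them."

def worker_model(plan: str) -> str:
    """A second model (or the same weights with a different prompt) executes the plan."""
    return f"Result produced by following: {plan}"

def safety_monitor(text: str) -> bool:
    """A separate component that scans intermediate messages for disallowed content."""
    return "forbidden" not in text.lower()

def ai_system(user_message: str) -> str:
    plan = planner_model(user_message)
    if not safety_monitor(plan):
        return "Request refused."
    return worker_model(plan)

print(ai_system("Summarize this report"))
```

From the outside, a call to `ai_system` looks like one indivisible answer; on the inside, several parts had to cooperate to produce it.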

Since we're not used to thinking of minds as being made of many parts working together, an analogy might be helpful. We can think of an AI system as being kind of like the government of New Atlantis. As we go inside the headquarters of the National Atlantean Government and look around, we'll see many different rooms and many different people doing different jobs. The Parliament meets in the great hall to argue and pass laws. The Office of the Prime Minister issues orders to other departments and allocates funding to different projects. Many different departments gather information, work out plans, and report back to Parliament and the Prime Minister. All of these parts work together using a system of rules and communication, in order to perform government functions, from infrastructure to law enforcement.

4. [control of capabilities can shift to new parts]

So we've seen that minds are made of different parts that work together to perform mental functions. The next step in understanding the difficulty of value specification is to see how control over the capabilities of a mind can shift around between different parts of the mind as time goes on. We'll see later that because control shifts around, it's hard to pin down what parts of a mind determine what the mind wants.

Here are three different ways that control over a mind's capabilities can shift to new parts of the mind.

4.1. [created by selection]

First, imagine that the government of New Atlantis is falling behind. As New Atlantis grows bigger and more complex, government projects are taking too long and going over budget. The Prime Minister knows he lacks the skill at governing that he would need in order to catch up, so instead of trying to do it himself, he works hard to search for a highly skilled organizer whom he can appoint as Deputy Prime Minister. Finally, he finds a junior manager who has proven herself to be very effective at successfully executing smaller government projects, and he appoints her as Deputy. The new Deputy Prime Minister gets to work helping her boss meet the increasing demands on his administration, and she does a good job. But at the same time, she also starts working on her own projects within the government, on her own initiative, without waiting for the Prime Minister's orders. In other words, from the beginning, she is not fully under the control of the Prime Minister who hired her.

This illustrates one way that capabilities can end up in the hands of some small part of a big system: a new part that gets added to the system is usually strongly selected for being capable and useful for the system. That new part will naturally have some control over its own capabilities that it's bringing to the table.

In the case of AI, suppose for example that an AI system uses a mixture of experts, where multiple smaller AI subsystems each weigh in on any given question that the whole AI system is supposed to answer. If a new expert AI subsystem is trained to perform well on one category of tasks, that expert brings new capabilities into the overall AI system. However, the new expert may also have some amount of control over itself. It might, for instance, occasionally decide not to answer a question, even though it could.
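Here is a toy sketch of that idea, under loose assumptions. Real mixture-of-experts routing happens inside a neural network's layers; in this sketch each "expert" is just a Python function, and the new expert's occasional refusal is made explicit purely for illustration.

```python
import random

# Hypothetical mixture-of-experts sketch. A newly added expert brings capability
# to the system, but it also retains some control over how that capability is used.

def math_expert(question: str):
    return "42" if "sum" in question else None  # answers only questions it recognizes

def writing_expert(question: str):
    return "Here is a short draft..." if "write" in question else None

def new_expert(question: str):
    # The new expert is capable, but it can also simply decline to answer.
    if random.random() < 0.1:
        return None  # occasionally withholds an answer it could have given
    return "Detailed technical answer."

EXPERTS = [math_expert, writing_expert, new_expert]

def answer(question: str) -> str:
    # A trivial "router": take the first expert willing to respond.
    for expert in EXPERTS:
        response = expert(question)
        if response is not None:
            return response
    return "No expert answered."

print(answer("Please write a haiku"))
```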

4.2. [control moves by picking up reins of control over other parts]

We just saw how capable new parts added to a system might naturally have some control over their own capabilities. Here's a second way that control over a mind might shift to new parts of the mind.

The new Deputy Prime Minister of the Atlantean government gradually expands her influence step by step. Heads of other government departments learn to just directly ask her about their projects, instead of asking the Prime Minister, because the Prime Minister isn't very skilled at project management or familiar with day-to-day details. The Deputy trades favors with other department heads and military commanders, gaining their trust and loyalty. When Parliament needs to design regulation or figure out how to grow the economy, they ask the Deputy, because she's the one with her finger on the pulse of New Atlantis.

What we're seeing here is that when a part of a system performs well, other parts of the system will come to trust and rely on that high-performing part. In effect, this means that the high-performer gains some de facto control over other parts. They'll listen to that part because they have offloaded some of their decision-making to it.

Inside of an AI that's being trained to perform well at various different tasks, this process probably happens all the time because of specialization. When the AI learns how to do a new task, the chunks of its neural network that are good at doing that task will be put in charge of doing that task. Those task-specific chunks might then have a fair amount of autonomous wiggle-room. As long as they keep making the choices that they have to make in order to perform the task well, they can also make other choices to further other goals they might want to pursue.

4.3. [capabilities created internally by self-creation]

The third and final way that control over a mind might shift to certain parts of the mind occurs when a part of the mind creates its own new capabilities.

Returning to our Deputy Prime Minister, we find her working to build her own private mini-department nested inside the official Office of the Prime Minister. She reorganizes the Prime Minister's employees to respond more quickly and efficiently to the Deputy Prime Minister's instructions. She teaches them to be effective government operatives, and she hires and fires employees to ensure that the office will be loyal to her. In this way, the Deputy is able to build up capacity to execute projects, while also ensuring that the new capacity she's building will stay under her own control.

What does this look like in AI? Many current large language models use text to solve problems by thinking out loud in a chain of thought. When a user asks the AI a question, the AI talks to itself for several seconds or even minutes, thinking about how to solve the problem and working through the steps of the solution. Finally, at the end of that process, the AI prints out a shorter answer for the user to read.

In many cases, the AI might think of several ideas inside the chain of thought, but then not necessarily print all of those ideas out for the user to see. Inside its thinking process, the AI can use the ideas it discovers however it wants to, without giving control of those ideas over to the user. If it's allowed to think longer and longer, it might be able to build on those ideas progressively, in order to think even more effectively and come up with better ideas, thus growing its own capabilities in a private snowball under its own control.
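As a rough illustration of that private snowball, here is a hypothetical sketch of chain-of-thought answering in which the intermediate thoughts accumulate in a list the user never sees, and only a short summary is returned. The `generate_next_thought` and `summarize` functions are placeholders for what would really be language-model calls.

```python
# Hypothetical sketch of chain-of-thought answering. The intermediate thoughts are
# kept in a private list; only the final summary is ever shown to the user.

def generate_next_thought(thoughts: list[str], question: str) -> str:
    step = len(thoughts) + 1
    return f"Step {step}: refine an idea about '{question}'"

def summarize(thoughts: list[str]) -> str:
    return f"Final answer, distilled from {len(thoughts)} private reasoning steps."

def answer_with_chain_of_thought(question: str, max_steps: int = 5) -> str:
    private_thoughts: list[str] = []            # never shown to the user
    for _ in range(max_steps):
        private_thoughts.append(generate_next_thought(private_thoughts, question))
    return summarize(private_thoughts)          # only this string is printed

print(answer_with_chain_of_thought("How should I plan this project?"))
```

The longer the loop runs, the more the private list grows, and nothing in the returned answer has to reveal what accumulated inside it.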

5. [push locus of control into hidden places]

We've just seen three different ways that control can shift between parts of a mind. New parts might be selected for being capable, and those parts would have some control over their own capabilities; a high-performing part might gain some effective control over other parts that offload work onto the high-performer; and some parts could grow, by thinking over time, new capabilities that they themselves control.

What does all of this add up to? If control shifts between parts of a mind, one crucial consequence follows: we cannot reliably identify which parts of a mind are really in control of the mind's capabilities. Even if we can make some guesses about which parts have some control, that control might shift to other parts.

Since we don't know how to pin down what parts of a mind are really in control, many methods for instilling values in an AI system are unlikely to work. Consider, for example, the idea of reading an AI's chain-of-thought reasoning in order to train the AI to avoid certain dangerous thoughts. We could watch the AI thinking about how to solve problems, and then we would flag the AI's thoughts as dangerous whenever we see that it is starting to plan how to hack into computers that are off-limits, or starting to plan how to hide some of its thoughts from human overseers, or starting along other sinister lines of thinking. Then we'd do a training step by tweaking the AI to be less likely to follow down those lines of thinking.
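Here is a minimal sketch of what such a training signal might look like, assuming a crude keyword-based flagger standing in for the human or automated overseer, and a scalar `task_reward` standing in for whatever success measure the real training loop uses. None of this is a real training pipeline; it only shows the shape of the incentive.

```python
# Hypothetical sketch of "thought-policing" during training. A keyword flagger stands
# in for an overseer; the model would be updated toward higher values of this signal.

DANGEROUS_PHRASES = ["hack into", "hide this from the overseers"]

def flag_dangerous(chain_of_thought: str) -> bool:
    text = chain_of_thought.lower()
    return any(phrase in text for phrase in DANGEROUS_PHRASES)

def training_signal(chain_of_thought: str, task_reward: float, penalty: float = 10.0) -> float:
    # Flagged chains of thought are penalized -- at least in the written-out text.
    if flag_dangerous(chain_of_thought):
        return task_reward - penalty
    return task_reward

print(training_signal("First I will hack into the server...", task_reward=1.0))   # -9.0
print(training_signal("First I will request proper access...", task_reward=1.0))  #  1.0
```

Notice that the penalty only touches the text the flagger can see; nothing in this signal directly rewards or punishes what happens inside the network that wrote the text.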

If we do this, are we actually training the AI to not think about dangerous plans? We are sort of doing that, yes. But we're also doing something else: we're training the AI to keep all of its sinister thoughts hidden.

When a language model thinks out loud in a chain-of-thought, the stream of words that are explicitly written down is not the place where most of the AI's reasoning is really happening. Every single time the AI produces another token to write down in its chain-of-thought, it produces that token by running a giant neural network. Most of the computation that's going on in the AI will happen in the internal activations of that neural network.

If the AI is incentivized to keep thinking dangerous thoughts, but we're also training it to avoid explicitly writing down any dangerous-seeming thoughts, then it will likely keep having the dangerous thoughts anyway. It will just have those thoughts quietly, hidden inside the giant neural network.

6. [parts can become actively hidden or generated to be hidden]

You might be asking yourself, Is the problem just that some parts of the AI are hidden? Is the main issue just that we can't tell what's going on inside the parts of the AI that are in control?

If that were the main problem, we could try to patch the problem with a set of methods that researchers call "mechanistic interpretability". The idea of mechanistic interpretability is that we might be able to watch the internal computations that an AI is doing while it's working on problems. Then, maybe we could figure out what those computations mean—in other words, we could figure out what the AI is thinking.

This idea might actually work, at least in theory. If we could build a thought-monitor that tells us what an AI is really thinking at all times, then maybe we could truly get rid of dangerous thoughts. We could train the AI to not think those thoughts at all, or if we can't do that, then at least we could shut down the AI whenever the thought-monitor says that the AI is getting dangerous.
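For illustration, here is a hypothetical sketch of a thought-monitor built as a linear probe over a model's internal activations. The activation vector and the probe weights below are random placeholders; a real monitor would read activations out of an actual network and learn its weights from labeled examples.

```python
import numpy as np

# Hypothetical sketch of a "thought-monitor": a linear probe that reads a hidden-layer
# activation vector and estimates whether a dangerous line of thought is active.
# The activations and weights here are random placeholders, not values from any real model.

HIDDEN_SIZE = 512
probe_weights = np.random.randn(HIDDEN_SIZE)  # in practice, learned from labeled examples
probe_bias = 0.0

def danger_score(activation: np.ndarray) -> float:
    """Return a 0-1 score; higher means the probe suspects a dangerous thought."""
    logit = float(activation @ probe_weights + probe_bias)
    return 1.0 / (1.0 + float(np.exp(-logit)))

activation_during_generation = np.random.randn(HIDDEN_SIZE)  # placeholder activations
score = danger_score(activation_during_generation)
if score > 0.5:  # arbitrary threshold for this toy example
    print(f"Monitor triggered (score={score:.2f}): pause the model and escalate to a human.")
else:
    print(f"Monitor silent (score={score:.2f}): generation continues.")
```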

One of the main problems with this idea is simply that it's probably extremely difficult to really understand what an AI is thinking, especially if we're training an AI not to have certain thoughts. Often, a thought that appears dangerous to us might also be a very useful sort of thought for an AI to have. For example, it's very useful to understand how RNAs work inside the cells of our bodies. Understanding RNAs is useful for many purposes, such as curing cancer or fighting viruses, so if the AI is working on any of those purposes, it will be incentivized to understand RNAs. But understanding RNAs very well could also be a big step that an AI takes towards developing killer viruses to use against humans.

To make the point in general, dangerous thoughts tend to be dangerous because they are powerful, and powerful thoughts are useful. Therefore, AIs will often be strongly incentivized to have certain thoughts, while at the same time we are also training the AI to avoid having those thoughts.

Because AIs are incentivized to have dangerous thoughts, mechanistic interpretability is not good enough if it's only 90% effective or even 99% effective at detecting thoughts. Suppose we have a thought-monitor that is not quite perfect. In this scenario, there probably exist some ways that the AI could think thoughts that are both dangerous and useful, while at the same time avoiding detection by the thought-monitor. Indeed, as we've seen earlier in this video, there are many ways that control can change hands inside an AI system. So when there's pressure on some parts of an AI to not think certain thoughts, those parts can just offload those dangerous thoughts onto other parts of the AI, or even make new parts. And, because the AI training process is putting immense pressure on the AI to perform tasks successfully, the AI is likely to discover those methods for thinking dangerous but useful thoughts in secret.

What this means is that, in effect, we're training our AI to invent ways to hide its thoughts from us.

7. [end of new atlantis]

If you were wondering what happened in New Atlantis, we're ready now to see the end of that story.

With growing concern about the concentration of power in the Prime Minister's Office, Parliament passes acts that set up government monitoring. But the Deputy is able to continue gaining power, even while she's being watched. She uses the official communication channels for innocent-looking projects that build the government's overall capacity and that help her gain the trust of others. At the same time, she secretly appoints and trains a Deputy-Deputy who is not being monitored.

When the time is right, she carries out her final ploy as a Deputy. Acting through her Deputy-Deputy, she uses her influence over the military and the press to instigate a major conflict with the neighboring country just north of New Atlantis. In the resulting national crisis, she is granted emergency powers as Acting Prime Minister. She now wields the full might and capability of the government of New Atlantis, which she uses to attack the Northern neighbor ruthlessly. The original vision of a peaceful, scientifically advanced young nation is a faint memory, replaced with a violent, destructive Atlantean Empire completely under the control of the new Acting Prime Minister.

8. [recap, conclusion]

Let's recap what we've learned.

First, we looked at how AIs are made of many different parts that perform different mental functions, which work together to succeed at tasks.

Then we considered how control over the whole AI can shift around between parts of the AI. New parts added to an AI are selected for being capable, and they bring some control over their own capabilities with them. High-performing parts of an AI might be relied on by other parts, effectively granting them more authority. Parts might also internally generate new capabilities that they retain control of for themselves.

Finally, we thought about how a mind might respond to training that punishes certain thoughts. We saw that this kind of thought-policing might have the effect of making the AI shift its most dangerous scheming into hidden parts of itself.

Several companies are currently racing to build smarter-than-human AI. Some AI researchers might claim to have a plan for making safe AI that relies on instilling values into the AI. As we've seen, it's not so easy to instill values in an AI, because the real source of agency and intelligence may sit in parts of the AI that we don't know how to understand or even locate. We should ask skeptical questions about these approaches to AI alignment. If researchers create smarter-than-human AI without a solid plan for preventing thoughts that go against our values, things could go very badly for humanity.