An anthropomorphic AI dilemma

Either generally-human-level AI will work internally like humans work internally, or not. If generally-human-level AI works like humans, then takeoff can be very fast, because in silico minds that work like humans are very scalable. If generally-human-level AI does not work like humans, then intent alignment is hard because we can't use our familiarity with human minds to understand the implications of what the AI is thinking or to understand what the AI is trying to do.

1. Terms

Here generally-human-level AI means artificial systems that can perform almost all tasks at or above the level of a human competent at that task (including tasks that take a long time).

Here intent alignment means making artificial systems try to do what we, on reflection and all things considered, would want them to do. A system that doesn't have intents in this sense (i.e. isn't well-described as trying to do something) cannot be intent aligned.

Here "how a mind works" points at the generators of the mind's ability to affect the world, and the determiners of [in what directions a mind applies its capabilities to push the world].

Let A be some generally-human-level AI system (the first one, say).

(Here "dilemma" just means "two things, one of which must be true", not necessarily a problem.)

2. The anthropomorphy dichotomy

Either A will work internally like humans work internally, or not.

Well, of course it's a spectrum, not a binary dichotomy, and a high-dimensional space, not a spectrum. So a bit more precisely: either A will work internally very much like how humans work, or very much not like how humans work, or it will be some more ambiguous mix.

To firmly avoid triviality: to qualify as "very much like" how humans work, in the sense used here, it is definitely neither necessary nor sufficient to enjoy walking barefoot on the beach, to get angry, to grow old, to have confirmation bias, to be slow at arithmetic, to be hairy, to have some tens of thousands of named concepts, to be distractible, and so on. What's necessary and sufficient is to share most of the dark matter generators of human-level capabilities. (To say that is to presume that most of the generators of human-level capabilities aren't presently understood explicitly; really the condition is just about sharing the generators.)

3. Anthropomorphic AI is scalable

Suppose that A does share with humans the dark matter generators of human-level capabilities. Then A is very scalable, i.e. it can be tweaked without too much difficulty into a system much more capable than the original A. The basic case is here, in the section "The AI Advantage". (That link could be taken as arguing that AI will be non-anthropomorphic in some sense, but from the perspective of this essay, the arguments given there are of the form: here's how, starting from the human (anthropomorphic) baseline, it's visibly possible to make tweaks that greatly increase capabilities.) To add two (overlapping) points:

  • Simple network effects might be very powerful. If ideas combine at some low rate with other ideas and with tasks, then scaling up the number of idea-slots (analogous to cortex) would give superlinear returns, just using the same sort of thinking that humans use (see the toy sketch after this list). Imagine if every thought in the world were made available to you, so that as you're having your thoughts, the most relevant knowledge held by anyone brings itself to your attention, and is already your own in the same way as your knowledge is already your own. Imagine if the most productive philosophical, mathematical, and scientific quirks of thought, once called forth by some private struggles, were instantly copyable to ten thousand minds.
  • Direct feedback about abstract things. Since humans age and die, no one human mind gets much feedback on contexts (tasks, questions, problems, domains) that require whole lifetimes to deeply engage with. Humanity as a whole gets feedback on some of these contexts, but always through noisy lossy bottlenecks.
    • For example, coming up with good scientific hypotheses for questions that aren't easily answered, and choosing between large-scale routes of investigation, are skills that give only sparse feedback.
    • For example, large organizations working on a complex novel artifact, such as a large computer program or a spaceship, learn a lot about how to architect the thinking process to work with many people. This structure is in many cases inexplicit, contextual, and hard to share, so the magic gets lost.
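
A toy version of the counting behind the first point (just an illustration; $n$ and $p$ stand in for quantities the bullet only gestures at): suppose a mind has $n$ idea-slots, and any given pair of ideas combines usefully with some small probability $p$. Then the expected number of useful combinations is about $p\binom{n}{2} \approx pn^2/2$, which grows quadratically in $n$: doubling the number of slots roughly quadruples the useful pairwise combinations, before even counting combinations of three or more ideas, or combinations of ideas with tasks.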

I suspect that this sort of scalability implies that takeoff would be fast in this alternative. (I also think takeoff will be fast, in some senses, in any case; but the implication (anthropomorphic -> scalable -> fast takeoff) might be clearer than more general arguments.) Note that this is a strong form of scalability: the claim implies very "superlinear", rather than "linearish", returns in capabilities from raw resources invested.

4. Non-anthropomorphic AI is hard to align

Suppose instead that A does not share with humans the dark matter generators of human-level capabilities. Then A is hard for humans to understand.

Gemini modeling with alien contexts is hard

It seems to me that the way humans normally understand each other rests to a great extent on gemini modeling. Put simply, a coffee cup means pretty much the same to me as it means to you, so your behavior regarding coffee cups pretty much makes sense to me; when you try to go to the store, the resources of any kind at your disposal in that pursuit are similar to those available to me if I try to go to the store. More abstractly: if a human mind has some mental element $E$, such as a goal, concept, skill, or skeletal pattern of overall thought, then that mental element $E$ is given its implications by a context within that human mind. A second human can understand $E$ as being like the second human's mental element $E'$, because ze has available a mental context sufficiently similar to the mental context of $E$. In that mental context held by the second human, $E'$ can be given its implications analogously to how $E$ is given its implications in the first human, so that the second human's mental element $E'$ can be a twin of the element $E$.
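
To compress that into a toy notation (just a restatement of the paragraph above, not a worked-out formalism): write $\mathrm{imp}_1(E \mid C_1)$ for the implications that the element $E$ gets from its surrounding mental context $C_1$ in the first mind. The second human can gemini model $E$ only to the extent that ze has some element $E'$ and some mental context $C_2$ close enough to $C_1$ that $\mathrm{imp}_2(E' \mid C_2)$ tracks $\mathrm{imp}_1(E \mid C_1)$. The argument below is that when the generators of $C_1$ are alien, a human usually has no such $C_2$ available.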

As far as I know, we don't have a good enough understanding of how elements are given their meaning: how they get used by the mind to affect the world, and how they control in which directions the mind pushes the world. ("Good enough" here means good enough to give good descriptions, robust to the voyage of novelty, of what effects a mind will have on the world.) In other words, [the generators of the contexts that humans share with each other and so can use to understand each other] is a subset of [the dark matter of intelligence supporting human capabilities].

If the above is right, and if A doesn't share much dark matter with humans, then it's hard for humans to interpret A's thought and action: the meaning (the implications for future thought and action) of a given mental element is unclear. If it's hard for humans to interpret A's thought and action, then it's hard for humans to predict what will be the directions of A's effects on the world, let alone to specify them.

To put it another way: alien generators provide alien contexts for mental elements. So if A has many alien generators, then A has many alien contexts. Elements in an alien context are hard to gemini model, and hence hard to model in the way relevant to understanding A's use of them. So A has many elements that are hard for humans to understand.

This could be incorrect in cases where the aspects of a context that are relevant for gemini modeling are not very affected by most of the generators of the mental elements involved. Normativity in general (e.g. coordination, law, negotiation, proof, agentic coherence, convergent instrumentality) might provide examples.

Context shifts invalidate inexplicit understanding of alien generators

Humans don't possess alien generators (that's what makes them alien). So humans can't have, of an alien generator, the sort of inexplicit understanding that one can have of something by possessing it without understanding it explicitly. That seems to leave only black-box understanding: a familiarity with behavior, and some tentative inductive generalizations, which tend to be shallow.

Black-box understanding of a generator doesn't survive a context shift (that is, a qualitative expansion of how the generator is applied). Shallow regularities won't hold; differences between a shallow model and the real thing will be amplified; incoherences and internal conflicts within and between generators will resolve one way or another; the generators will settle into more coherent configurations that generate novelty in a more extended and extreme way, taking things further.

For alignment purposes, it can be enough for the human to understand a dark matter generator G of A only inexplicitly, in either of two ways. One way is if the context that A is operating in doesn't shift. Then a black-box familiarity with G's functional behavior could be enough for the human to reliably handle G's consequences. Another way is if the human also possesses the generator G, and can follow A through the context shift. Then the human's version of G, active in the new context, provides the mental context to gemini model whatever in A's thought and action comes from A's version of G.

For A to be a strong mind, it probably has to go through many context shifts. If A goes through a context shift, then, since by assumption of this section A contains generators of human-level capabilities not shared with humans, neither of the two conditions above holds.

Alien generators are hard to explicitly understand

It might be possible for the human to gain explicit understanding of the generators of A's capabilities. But since the generators are alien to the human, understanding them is harder than understanding non-alien generators. The human trying to understand an alien generator doesn't possess the generator in themself, so has only narrow access to the generator: to its history, components, inputs, outputs, indexing, and applications. Gaining explicit understanding of even non-alien generators is hard, seeming to require long reflection.

Or not?

This reasoning might all be incorrect, though. Maybe there can be something that is generally human-level, and easy for humans to understand, but that doesn't share many of the dark matter generators of human capabilities. What would that look like? Or maybe there can be something that is not easy for humans to understand, but is feasible to align. What would that look like? (Normativity in general is a way out, where behavior is somehow constrained even though the generators of the behavior are arbitrary. But without understanding the generators (which maybe contain the effect-determiners?), I don't see how to have specifiable normativity in a mind that grows.)

Also, suppose that a non-anthropomorphic system can't be feasibly intent-aligned. Maybe there can still be something with these properties: it's generally human-level; it's by necessity difficult for humans to understand well; but nevertheless it's safe (doesn't kill everyone); and also it can be used to end the acute risk period. E.g., something that (somehow) possibilizes enough, for humans, that they can end the acute risk period, without actualizing enough to be dangerous. Though, when the humans use it, the full force of the optimization power "stored up" in the possibilization is felt, and strains whatever safety properties the system has. And, all this possibilization that's been done but left without a keeper is kindling for a forest fire of strategicness. I.e. it may go from being a fairly generally capable, only loosely integrated collection of services, to instead trending towards strategically applying its capabilities to continually push the world in specific directions. It's not clear whether and how it's possible to have a system that has large effects on the world but isn't trying to do something.

5. Aside on anthropomorphy

What's been discussed here relates to a somewhat hidden trap of the anthropomorphy heuristic. Believing that an alien mind will most likely exhibit fear, or will engage in wishful thinking, or will pursue connection with other beings, is (in the absence of other supporting arguments) fairly clearly a strong application of the anthropomorphy heuristic. On the other hand, if one were to assume that an alien mind will most likely (to pick a random example) form plans by iteratively refining explicit high-level plans that look promising assuming angelic semantics, one would also be using a sort of anthropomorphy heuristic, even though that property is more abstract and better supported.

The heuristic is less likely to suggest incorrect guesses for mental properties that are more abstract (hence more canonical) than things like the detailed emotional action-complex of fear, and for mental properties that have other reasons to be present (such as efficiency bounds in some context). But still, these more abstract structures are on average less canonical than they appear, from a perspective fitted to a distribution on minds that only includes humans.

So these more generatorward, more dark matter aspects of minds are prone to being incorrectly taken as describing alien minds. These aspects also tend to shape the mind a lot, and probably shape a mind's effects on the world a lot.