Human wanting
We have pretheoretic ideas of wanting that come from our familiarity with human wanting, in its variety. To see what way of wanting can hold sway in a strong and strongly growing mind, we have to explicate these ideas, and create new ideas.
1. Human wanting
The problem of AGI alignment is sometimes posed along these lines: How can you make an AGI that wants to not kill everyone, and also wants to do some other thing that's very useful?
What role is the idea of "wanting" playing here? It's a pretheoretical concept. It makes an analogy to humans.
2. The meaning of wanting
What does it say about a human? When a human wants X, in a deep sense, then:
- X has a good chance of actually happening, and if it doesn't happen that's because making X happen is difficult, or very "costly" in some sense;
- novelty (new knowledge, understanding, ideas, skills, mindsets) that the human gains will be put to the use of bringing about X, and won't be put to the use of destroying the potential for X, and also won't terribly trample over other things that humans care about: the power contained in the human's mind is channeled, circumscribed, so that the power isn't applied towards ends other than those chosen by [that which wants X];
- if the human can't achieve X directly, ze will recurse on creatively finding ways to become the sort of agent who can achieve X;
- the human will interpret the meaning of X in a good, reasonable, sane, intended way, including when X was given in an ambiguous way;
- the human won't pursue X in an extreme way that renders X no longer good as X;
- the human won't merely pretend to pursue X and then at the last minute replace the potential for X with something else;
- the human, if put in a context where negotiation or conflict with other agents is appropriate, will stand up for X;
- these facts will persist, still holding true of the human as ze goes through life, learns, reflects, gains capabilities, and undergoes deep revisions of mental elements;
- the human wants these facts to persist, and will notice and correct when zer growth has threatened or eroded these facts.
And, it is at least sometimes feasible for a human to choose to want X, or even for a human to choose that another human will want X. Wanting is specifiable.
3. Baked-in claims
The concept of wanting is, like all concepts, problematic. It comes along with some claims:
- Agents (or minds, or [things that we'll encounter and that have large effects]) want; or at least, we can choose to make agents that want.
- All these features apply to the way the agent wants.
- All these features can possibly coexist.
- What the agent wants is something that can be specified.
- This sort of wanting is the sort of thing that's going on in a human when we say that the human wants something.
- Wanting is a Thing; it will reveal a dense and densening region of internal relations upon further investigation.
These claims are dubious when applied to human wanting, and more dubious when applied to other minds wanting.
4. The variety of human wanting
If we follow the reference to wanting in humans, we find a menagerie of wants that are:
- small-scope (get out of these wet socks; cultivate a nice garden) or big-scope (have no one go hungry; create a thriving intergalactic civilization), spatially or temporally or on some other dimension;
- context-dependent (e.g. wanting to talk to so-and-so when they are around, but not seeking them out; or wanting to be aggressive and cutthroat when lawyering but not at home) or context-independent (e.g. wanting to be honest, deeply thoroughly everywhere always with everyone; or wanting to understand; or wanting to serve God; or wanting to not be exploited);
- easily achieved (e.g., for the healthy, going for a walk) or achieved only with difficulty (e.g. going to Mars or deciding the Collatz conjecture);
- logically consistent (e.g. wanting to eat a bagel) or logically inconsistent (e.g. wanting to be of the very highest status, and also spend time with status-betters; or wanting to live in a world with other completely independent free beings who can completely freely choose, and also wanting there to definitely be no torture anywhere; or wanting there to be a set of all sets; or wanting to never have wanted anything);
- stable (e.g. ice cream is always good; cruelty is never good; caring about a child forever no matter what) or changing (e.g. candy is a bit nauseating now; falling in and out of love with techno music; adopting a new moral framework; choosing virtues to aspire to; a heartfelt desire to serve the voters, betrayed);
- universal (wanting no suffering anywhere) or existential (wanting oneself to have blueberries; wanting someone to read your poem; wanting this specific person to have some fun);
- mentally localized, as in a module (e.g. a specific gland regulating hunger), or mentally diffuse, e.g. curiosity arising from any specific mental content given full sustained attention;
- referring to high-level things like apples and people, or referring to low-level things like electric fields; referring to physical things like food and buildings, or referring to abstract or spiritual things like math, understanding, countries, symbols;
- being created (such as a new pottery hobby) and being destroyed (such as a hope for innocent thriving destroyed by betrayal);
- very ambiguous (what does "thriving civilization" mean?), relatively ambiguous (how big does a paperclip have to be? does it have to be used for clipping paper?), or relatively specified (diamond is such and such a configuration of carbon atoms, though ambiguity remains and would be made an issue by optimization);
- being freely chosen or created (such as who to love, who to be, what creative expression to devote oneself to), or having been built in (such as a desire for salt), or having been copied from or imputed to or externally imposed by other humans or coalitions (such as what language to think and express values in; or what social norms to uphold; or what society's long-term aspirations are), or having been already there but hidden (e.g. an aspiration that was left quiet or implicit, hooked in as only a pointer);
- being inexplicit (e.g. having a gut feeling such as not wanting to eat this food or go down that dirty alleyway, without knowing how to articulate the reason), or implicit (e.g. to want to paint three rows of five apples is to implicitly want to paint fifteen apples), or unconscious or hidden (e.g. wanting to exploit someone by pretending to want to work with them; or wanting to overthrow a political regime but hiding it because the regime persecutes potential rebels; or wanting to mock a rival but not thinking of oneself as being a mocker); or on the other hand being discovered (e.g. finding a kink), or being held explicitly (e.g. wanting everyone to accept Christ into their hearts and arranging military logistics to make that happen);
- being partial, i.e. not giving an opinion about some comparisons of worlds (e.g. I can want to listen to Shostakovich or want people to have freedom of speech without having to want anything one way or the other about what happens to the Andromeda galaxy next century), or being supposedly complete (e.g. the greatest good for the greatest number across the multiverse);
- being about something already well enough grasped (e.g. wanting to ride a bike), or being about something only proleptically (pre-grasped-ly) held, such as liking a subculture without a clear idea of what a subculture is, who or what or where this subculture is, or what to do with the liking;
- being about something clear and fixed enough, like playing chess, or about something that seems to flit around and change, with some commonality but no one clear throughline, e.g. playing a wide variety of video games which become more and less fun in turn;
- being about something with a specification that is at least relatively objective and not in need of translation or interpretation (e.g. proving or disproving $\mathcal{P} \ne \mathcal{NP}$), or on the other hand being about something that requires the whole human to further unfold what the wanting is about (e.g. "expressing yourself through art" requires you to interpret what "expression" and "yourself" mean);
- not about states of the external world (such as wanting beautiful buildings), but about mental states such as suffering or integratedness;
- not giving recommendations for external actions (e.g. walking towards a desired object), but giving recommendations for non-cartesian activity (e.g. doing math or calling a stance into question);
- not about mental states (e.g. pain or calm), but about mental processes (e.g. considering all sides of an issue; not rationalizing; empathizing) or properties (e.g. being earnest, careful, or not resentful);
- not about mental processes in the sense of taking the processes as object (e.g. "reprogramming oneself", as in suppressing an emotion, or calling a stop to any thoughts that mention certain topics), but about mental processes in a naturalized sense that is self-referential, so that the want wants something about how it wants itself (e.g. deciding according to the categorical imperative, which applies to its own application, saying that such and such way of applying the categorical imperative can or can't be willed to be a universally following way of applying it; or a self-endorsing spirit of FDT; or not wanting to coerce oneself into being kind because [the thing doing the coercion] isn't trusted and is less trusted when coercing and is unkind);
- about wants themselves (e.g. wanting to be attracted to people regardless of how they look; wanting to have an appetite for healthy food; wanting to not have an appetite for social media; wanting to want what your friends want; wanting to want what your leader or master wants you to want; wants about processes for resolving conflict or ambiguity in wants, or wants that choose or construct wants, or wants that want to want what some other mind wants (e.g. imitating a leader));
- endorsed (wanting to go rock climbing and wanting to want to go rock climbing), or unendorsed (wanting to avoid talking to people, but not wanting to want to avoid talking to people);
- not about something (things, states, or processes, whether mental or not) for its own sake, but about something for the sake of how it affects something else (e.g., caring about the velocity of the basketball as it leaves the hand, for the sake of the ball going through the hoop later), where the something else may be non-specified (e.g.: process values like favoring a parliamentary system, so that budget allocation will be to some as-yet-unspecified good thing; symbolic values like waving a flag or building a monument, to influence coordination points for future agents);
- suitable for pursuing itself for its own sake (e.g. a fun dance), or suitable for possibilizing for the sake of another want (e.g. wanting to make a hammer for the sake of being able to hammer things);
- agent-like (e.g. picking out a city to live in, saving up to make the move, looking for a job there, applying for residency, buying a car), or not agent-like (e.g. urge-like, as in a craving for chocolate; or reflex-like, as in a need to sneeze; or opportunistic as in pilfering a pack of gum when the clerk is distracted; or chaotic as in an alchemistic technologist building up a repertoire of methods and apparatuses that possibilizes greatly without already actualizing any overarching plan);
- mere yearning which waits for opportunities to present themselves, or active pursuit which recurses into unboundedly rich search (e.g. Francis Bacon);
- pursued competently (doing the right things) and effectively (succeeding), or incompetently and ineffectively;
- pursued through a range of channels of optimization that's narrow (e.g. trying to win the chess game by playing the objectively soundest chess moves) or wide (e.g. trying to win the chess game by predicting the opponent's blind spots, by bribing the opponent, by faking a call of distress from the opponent's spouse, by injecting the opponent with a sedative, by hacking the computer system running the game to set the variable of who won, by making a social media campaign to win popular support to declare you the winner, by doing internal mental science to figure out how to think better to figure out how to take over the planet and force the judges to declare you the winner, by constructing a beacon to call aliens to your position, by precommitting to make many ancestor simulations in which the rules of chess have always secretly been different than they appear such that you are in a winning position);
- supposed to be about Things with an eventual canonical meaning (e.g. wanting to understand algebraic topology), or not supposed to be about Things (e.g. wanting to feel happy);
- provisional (properly treated as being open to revision) or non-provisional;
- being treated as-if provisional or not;
- open to external correction or not;
- self-amplifying and powerseeking, or self-limiting and deferential;
- self-endorsing, as in Gandhi's preference against people being murdered, or self-negating, as in an antimeme;
- conflicting, dominating, annihilating;
- multiplexed (so the human pursues food, and then pursues shelter, as separate modes that cede control to each other), or anapartistic (so that the human at a given moment thinks of themselves as a delegate of their past and future whole self, and knows that trying to pursue their current pursuit too strongly would be goodharting, and that trampling over other values and installing the currently-pursued value as the sole criterion would be bad); or totalizing (as in an authoritarian coalition, or as in a pathological obsession with self-image or with a scientific question or with meditation, or as in a gaslighting abuser who breaks down other perspectives that would threaten zer frame of values);
- multiple or singular (as with someone who has weighed and compared between all their values, making tradeoffs between them and ironing out inconsistencies);
- real (e.g. really liking jazz) or unreal (that is, where what's pursued isn't really what's wanted; e.g. faked, as in pretending to like someone or professing to care about starving people; palliative, as in pica; goodharting, as in eating candy; hyperstitious, as in a basilisk, or motivating oneself by "believing" that the plan will work);
- actually wanted, or on the other hand something we'd be disappointed with if we got it (e.g. a dog catching a car; a kid getting a new toy, only to toss it aside after a minute; stepping foot on Proxima Centauri b might both have great symbolic value and also not actually be useful or fun);
- able to survive ontological crises (e.g. caring about consciousness should survive the transition to uploads running on silicon rather than natural people running on wet neurons), or not able to survive ontological crises (e.g. the process of building shared values that used to be called God has died, with little to replace it).
5. The role of human wanting
There are two roles played in AGI alignment by human wanting:
Human "wanting" is a network of conjectural concepts
Our familiarity with human wanting suggests hypotheses for concepts that might be useful in describing and designing an AGI.
Our familiarity with human wanting can't be relied on too much without further analysis. We might observe behavior in another mind and then say "This mind wants such and such", and then draw conclusions from that statement, but those conclusions may not follow from the observations, even though they would follow if the mind were a human. The desirable properties that come along with a human wanting X may not come along with designs, incentives, selection, behavior, or any other feature, even if that feature does overlap in some ways with our familiar idea of wanting.
That human wanting shows great variety does not, in general, argue against the use of any other idea of wanting. Both our familiar ideas about human wanting, and our more theoretical ideas about wanting, might prove to be useful starting ideas for creating capable minds with specifiable effects.
Human wanting is half of an aligned AGI
It's the wants of humanity that the AGI is supposed to help bring about, so the AGI+human system has to accommodate the human wanting.
Human wanting is proleptic, ambiguous, process-level, inexplicit, and so on. Human wanting is provisional. Because human wanting is provisional, the AGI must be correctable (corrigible). The AGI must be correctable through-and-through, in all aspects (since all aspects touch on how the AGI+human wants), even to the point of a paradox of tolerance: the human might want to correct the AGI in a way that the AGI recognizes as ruining the correctable nature of the AGI, and that should be allowed (with warning).