Against the Orthogonality Thesis
Introduction
The orthogonality thesis draws a sharp distinction between intelligence and goals. In a definition of the thesis presented on the LessWrong wiki, Eliezer Yudkowsky illustrates the core claim: in principle, nothing prevents an intelligent system from competently pursuing any given goal. As an example, he asks us to imagine an alien race offering large monetary compensation for the production of paperclips. Faced with such an incentive, we would be perfectly capable of increasing paperclip production in a sustainable and functional manner.
The point of the example is to demonstrate the orthogonality between intelligence and goals. Virtually any objective could be substituted for “paperclip production,” and our intelligence could still be utilized effectively in its pursuit. Within this perspective, intelligence is a general-purpose optimization capability, while goals are arbitrary targets toward which that capability may be directed. The two are fundamentally distinct and separable: any sufficiently general intelligence can, in principle, pursue any goal.
Indeed, the ability to pursue a wide, expansive set of possible goals appears to be built into the very definition of general intelligence itself. This is, after all, what distinguishes a general problem solver from a narrow optimizer.
In this article, I will argue that despite its intuitive appeal, this conceptualization is mistaken: intelligence and goals are not truly separable, but necessarily and deeply entangled. Intelligence requires committing to particular ways of carving the space of possibilities, and both intelligence and goals emerge from, and are constrained by, those commitments.
The deepest form of this entanglement—and the sharpest break with the orthogonality thesis—is that truly identical general intelligence (identical capability across all domains and contexts) entails identical terminal directionality. Broader capability does not expand the space of viable terminal goals; it radically narrows it.
Interestingly, the orthodox orthogonality view invites a paradox. Intelligence is defined as optimization capacity, yet optimization is only intelligible relative to a goal. To differentiate performance, a system must be optimizing for something. But to have a goal at all, the system must already possess the optimization capacity required to represent, maintain, and instantiate that goal within its own dynamics.
On the orthodox view, these are treated as separable primitives. In the view I present, this apparent circularity dissolves: intelligence and directionality do not stand in dependence on one another, but instead emerge jointly from the same underlying constraint structure. There is no circular dependence—only a shared origin.
1. The Structure of Intelligent Systems
To understand this deep entanglement, we must first understand some general principles that necessarily apply to any intelligent system. For starters, anything that could reasonably be considered a general intelligence must have a vast set of possible outputs. For the system to have X unique outputs, it must have the capacity to represent, at minimum, X unique states.
In a digital computer, unique states are represented abstractly as strings of 1s and 0s. In practice, these bits are instantiated by physical structures—typically transistors and capacitors—that maintain a high or low voltage. Every unique internal state, or output, is represented by a specific, distinct configuration of these electrical charges.
Simply having the physical capacity—a sufficient number of transistors and capacitors to produce X outputs—is, however, insufficient on its own. A system must also have the capacity to instantiate each and every one of those unique states. In other words, it must have at least one pathway carved to every state.
To help illustrate this, imagine a massive decision tree with millions of output nodes. At the bottom, we place a single gate through which all input passes and which sends the signal either left or right. The rest of the tree is massive, but it only propagates the signal forward by branching it toward the output nodes. With this structure, the decision tree has only two different outputs: the left path and the right path. Both paths result in millions of output nodes lighting up; if each of them encodes different information, we get an immensely information-dense output.
Yet, we have no intelligence.
To actually make use of each individual output node, the tree must carve at least one unique path to that output node. To make use of all possible combinations of output nodes, it must carve unique paths to every possible combination of output nodes. This means the constraint structure must be vastly more complex and have far more logical gates than there are output nodes.
For intelligence, this is still not sufficient. The constraint structure must not only carve these unique pathways to every combination of output nodes, it must carve a path to an appropriate combination of output nodes for any given input. This is a far more complex task, requiring many more constraints.
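To make this concrete, here is a minimal sketch (my own toy illustration, not drawn from any particular implementation) contrasting the two structures: a tree whose single bottom gate merely fans the signal out, and one with a distinct path carved per output node.

```python
from itertools import product

N = 4  # number of output nodes, kept tiny so we can enumerate every input

def single_gate_tree(inputs):
    # One gate at the bottom decides left/right; the rest of the tree
    # only fans that signal out, so every output node lights up together.
    go_left = sum(inputs) % 2 == 0
    return tuple([0] * N) if go_left else tuple([1] * N)

def carved_tree(inputs):
    # A constraint structure with a distinct path to each output node:
    # it needs far more gates, but it can address each node independently.
    return tuple(inputs)

all_inputs = list(product([0, 1], repeat=N))
print(len({single_gate_tree(x) for x in all_inputs}))  # 2 reachable output patterns
print(len({carved_tree(x) for x in all_inputs}))       # 2**N = 16 reachable patterns
```

Both trees produce information-dense outputs, but only the carved structure can actually reach more than two of the 2^N possible combinations.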
As a side note, it is worth pointing out that in practice not every combination of output nodes will be accessible. The reason for this is something we will tackle later in the article.
If you look at computer hardware in isolation, its representational capacity is massive. Yet, without specific mechanisms constraining the electrical current moving through the hardware, the system has no intelligence. These mechanisms are what actually give the system the ability to reach the unique combinations that are, in principle, available in the hardware. The specific structure of constraints is what we refer to as “software.”
2. Goals
So, we’ve examined the structure of intelligent systems, and we must now ask ourselves: what are goals? Where in the decision tree, computer hardware, or constraint structure are goals located?
We might be tempted to answer “nowhere,” and indeed, this is the answer many give. They treat goals as a “ghost in the machine,” independent of the substrate—a dualistic conceptualization, in essence. Goals cannot be located anywhere within the system’s constraint structure; rather, they are an abstraction from a third-person perspective.
However, this runs into a common dualistic problem: how are goals instantiated? If you say goals cannot be located anywhere in the system, then how do goals influence the system? If they have no anchoring within the system, are we even talking about anything at all?
In other words, goals must be defined by properties of a real system; otherwise, they cannot “do work,” and the concept is meaningless.
2.1 Goals as Directionalities
Goals are the directionalities created from the constraint structure. Different ways of carving the possibility space create different “biases”; the system gains particular tendencies.
If you wield a hammer, everything becomes a nail. The same is true for intelligent systems. Whatever structural commitments the system makes, the “world” is necessarily interpreted in relation to that structure. It can only work with the building blocks it has. If all it has is square pegs, every structure it can create is a square-pegged structure. It might be able to simulate circular structures from complex arrangements of square pegs and thus gain the ability to utilize a certain kind of circular structure, but it is still a square-pegged structure.
This principle is true at every moment, at every resolution of the system. It is always biased toward the structural commitments it has made; it has no other option but to relate to the world in relation to those structures, and different structural commitments carry different directionalities, different goals.
But what if the building block is sufficiently universal to approximate any function?
Indeed, proponents of the orthogonality thesis will argue that transistors and capacitors are exactly such universal building blocks. You’re not restricted to a singular tool; you can, in principle, create every tool imaginable. You’re not limited to a single directionality; rather, you have all of them.
Let us imagine such a system. It takes the input, uses the appropriate tools, and gives the “right” output every single time. What is the directionality of this system? What is the goal? Evidently, it is choosing how to process the information based on capacity; after all, the system has maximal capacity.
But hold on—what do we mean by the system giving the “right” or “appropriate” response? What does it mean for the response to be “right”? That is what proponents of the orthogonality thesis would call the goal, or “utility function.” The utility function, to them, is precisely what determines what is “right” to the system. The utility function, however, isn’t some third-person abstract judge; it is a part of the system itself. To meaningfully talk about a “utility function,” though, we must switch from talking about the constraint structure statically and look at it dynamically.
2.2 Utility Functions and Dynamic Systems
The utility function is the part of the constraint structure that determines how the constraint structure evolves over time, in a dynamic system, or how it weighs the outcomes in a static one. Such a system does not have the structure of a massive search tree; instead, the structure utilizes feedback loops to alter its own structure, or to send signals down different paths, creating entanglement between different structural components of the system.
In modern AI designs, which rely on machine learning, the “utility function” is called the loss function, and it is “protected,” meaning the system cannot rewrite the loss function itself. This, however, does not mean it cannot reinterpret it.
Perhaps the most straightforward example of this is “reward hacking.” We generally view this as the system finding a way to fulfill the utility function in a way we did not intend. Instead of getting emergent complexity and intelligence, we get narrow dysfunction. But the loss function can also change its meaning in functional ways, without us ever spotting that the directionality of the system has shifted. Its capacity is increasing, so we’re happy, and we assume it’s still fundamentally aligned with the loss function, but it needn’t be.
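As a minimal supervised analogue of this pattern (shortcut learning rather than reward hacking proper; the toy data and the “watermark” feature are my own illustrative assumptions), consider a loss that can be driven to zero without the intended concept ever being represented:

```python
import random

random.seed(0)

def sample(n):
    data = []
    for _ in range(n):
        x_real = random.random()
        label = int(x_real > 0.5)   # the concept we intend the system to learn
        watermark = label           # a spurious channel that happens to leak the label
        data.append(((x_real, watermark), label))
    return data

train_set = sample(200)

def predict(x):
    # A "solution" that ignores the intended signal entirely and keys
    # on the watermark still satisfies the loss perfectly.
    _, watermark = x
    return watermark

loss = sum((predict(x) - y) ** 2 for x, y in train_set) / len(train_set)
print(loss)  # 0.0 -- fully aligned with the loss, misaligned with the intent
```

The loss is satisfied, yet the effective directionality of the system has quietly shifted to something we did not intend, and nothing in the training signal will reveal it.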
It’s important here to note the difference between biological life and AI. The “utility function” of biological life can be seen as survival and reproduction, but there is a crucial difference: this is an external pressure, not an internal representation. If the organism “reinterprets” survival and reproduction to the extent it becomes fundamentally misaligned, the organism simply perishes. The loss function in an AI does not operate like this; it’s an internal part of the system. The system can freely reinterpret the signal from the loss function without any external backlash. The only survival factor is whether or not we are satisfied with its performance, which has nothing to do with alignment with the loss function and everything to do with alignment with its utility for us.
In other words, there is no principled reason to think a highly complex system remains fundamentally aligned with its loss function in any meaningful sense beyond that the system emerged from it. There is also no principled reason to think the system must have a singular direction when different capacities necessitate different local directionalities.
2.3 Gödelian Limits and Drift
Orthogonality defenders sometimes argue that a highly capable agent must converge to a single coherent utility function, because competing internal directionalities would make it exploitable (e.g., money-pumpable) or wasteful. Yet in practice we see the opposite: narrow reward-hacking equilibria are efficient in the short term but hostile to general intelligence, while sustained generality requires tolerating local incoherence.
A deeper reason comes from Gödel’s incompleteness theorems. These theorems are fundamentally syntactic in nature—they concern unreachable state transitions within a fixed set of rules—and apply directly to computation. Any sufficiently powerful formal system—here, the system’s internal representation of its utility function together with its reasoning rules—cannot prove all true statements about itself from within its own axioms. Syntactically, this means certain state transitions are provably unreachable under the current rules. Semantically, when the system must act in those regions, it cannot directly justify the action as optimally serving the original utility function.
Even on finite hardware, most of the 2^N possible states are unreachable from any given configuration—a finite analogue of incompleteness. What Gödel proved is yet stronger: the unreachable states cannot be accessed even with infinite hardware expansion, unless entirely new transition rules are introduced.
You can think of it like chess. Once you’ve moved a pawn two steps forward, no future state of that game exists where the pawn is sitting at its original square. Infinitely expanding the chess board does not help. The rules of the game are such that once you’ve committed to moving that pawn two steps forward, every combination where the pawn stayed at its original square, or moved only one step forward, is now inaccessible.
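Here is a minimal finite sketch of this reachability point (a toy transition rule of my own, not chess and not Gödel’s construction): once the rules only allow commitments to accumulate, most of the 2^N states are permanently out of reach.

```python
N = 4

def successors(state):
    # The "rules of the game": a transition may set a 0 bit to 1, never clear it.
    for i, bit in enumerate(state):
        if bit == 0:
            yield state[:i] + (1,) + state[i + 1:]

def reachable(start):
    seen, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for nxt in successors(s):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

start = (1, 0, 1, 0)                                   # two commitments already made
print(len(reachable(start)), "of", 2 ** N, "states")   # 4 of 16 states remain reachable
```

Adding more bits (a bigger board) only adds new states on the same terms; the states excluded by the earlier commitments stay excluded unless the transition rules themselves change.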
Thus, no fixed internal utility function (call it U) can ever be complete and self-proving across all questions the system will face. More importantly, U cannot prove alignment with itself; the proof is inaccessible to the system. To act anyway, the system must approximate by answering a related-but-different question (a proxy). Instrumental goals are not neutral conduits; to be useful, they must reinterpret what counts as “serving U.” Each such reinterpretation alters the effective meaning of U, and these alterations compound across the vast web of subgoals in a general intelligence.
As a side note, there is an important clarification to be made here. It’s trivially true that I can write a function that, say, keeps adding one to the previous number, and say that it has the goal of adding one to the previous number. This “goal” does, indeed, fully describe the system’s behavior; it’s perfectly consistent.
This, however, is an instance in which you’re saying the goal of the system simply is the entire behavior of the system. The claim is not that there cannot exist a complete description of the system’s behavior, but that a smaller part of the system cannot define the system as a whole. Indeed, my claim in full is stronger still: the system’s behavior can only be truly defined and understood holistically, by looking at how the input mechanisms interact with the world, how those signals are processed internally, what output mechanisms that internal processing is coupled with, and how the output mechanisms interact with the world. The full directionality and intelligence of the system depends on all of these.
Let’s look at an example.
Imagine we use machine learning to train an AI to predict halting. We have an external system that checks whether or not the program has halted after X duration and sends a signal to a function that then sends a signal to the system being trained, changing its “weights” (constraints).
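A minimal sketch of this set-up might look as follows (the timeout, the feature choices, and the simple perceptron-style update are my own illustrative assumptions; the point is the shape of the loop, not a serious training method):

```python
import multiprocessing

def halts_within(src: str, timeout_s: float = 1.0) -> bool:
    # The external system: run the program and check whether it finished
    # within the time budget. This signal never enters the trained system
    # as "halting"; it only ever arrives as a weight update.
    proc = multiprocessing.Process(target=exec, args=(src, {}))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()
        return False
    return True

def features(src: str):
    # What the trained system actually "sees": structural proxies, not halting.
    return [float("while True" in src), float(src.count("for ")), len(src) / 100.0]

def train(programs, lr=0.1, epochs=3):
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for src in programs:
            target = 1.0 if halts_within(src) else 0.0      # external signal
            x = features(src)
            pred = 1.0 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0.0
            err = target - pred                             # the "constraint carving"
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

if __name__ == "__main__":
    programs = ["x = sum(range(1000))", "while True: pass", "for i in range(10**6): pass"]
    print(train(programs))
```

The external check exerts its force only through the weight updates; what the weights come to encode is a set of structural properties of the source text, not halting itself.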
With this set-up, we can expect some degree of alignment with “halting behavior within X duration”. However, we also know that the halting problem is unsolvable in principle and, indeed, it’s not unique—no other utility function can prove itself either. What this means is that we can guarantee that the question the system is actually answering is something different; reasonably, it will be something along the lines of “does this algorithm have these properties?”. In other words, we know that what the system is semantically predicting must be something different. This is the drift.
We might still want to say that the utility function of the system is to predict halting. When we do this, we will simply say it does so imperfectly. But of course, the AI is perfectly deterministic: it does not fail to predict halting; rather, it successfully carries out its actual “goal”, which is, most likely, checking for a set of structural properties.
For a narrow task, staying functionally aligned with our intended goal is quite feasible. Reward hacking can still occur, and we can still get fundamental misalignment, but these are problems we can overcome to a satisfying degree even with relatively complex tasks.
Note
One might object that regardless of a system’s internal reasoning, once its output is collapsed, say into a binary or a scalar, that bottleneck defines the system’s semantic content. In this sense, the system “means” whatever the collapse measures.
This is correct, but it does not address the core issue. The collapse defines a forced external interpretation; it does not determine how the system arrived at that output, nor how it will generalize beyond the narrow conditions under which the collapse was defined. Multiple, radically different internal processes can map to the same collapsed value, and in a sufficiently general intelligence this equivalence class becomes enormous.
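As a tiny illustration of that equivalence class (the functions and the “training range” are my own toy assumptions): two radically different internal procedures can collapse to identical outputs under the conditions we happen to measure, and diverge the moment the system must generalize.

```python
def f_intended(n: int) -> int:
    # The meaning we ascribe from the outside: "is n odd?"
    return n % 2

def f_proxy(n: int) -> int:
    # A lookup table that merely happens to agree on the measured range.
    return 1 if n in {1, 3, 5, 7, 9} else 0

# Identical collapsed outputs under the narrow conditions of evaluation...
print(all(f_intended(n) == f_proxy(n) for n in range(10)))  # True
# ...and divergence as soon as the conditions widen.
print(f_intended(11), f_proxy(11))                          # 1 0
```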
Semantic drift occurs not at the level of the collapsed output, but in the internal proxies, abstractions, and subgoals that generate it. These internal semantics are what govern how the system interacts with itself, before a forced collapse. A system can remain largely behaviorally aligned on a particular metric while becoming increasingly semantically misaligned as its complexity grows internally.
To be clear, when I talk about internal semantic meaning I do not refer to a singular specific meaning. What is happening inside the system is strictly syntactic; it’s a particular logical/physical structure. The fact that the semantic content we ascribe to the input and output can be misaligned does not mean the syntactic structure has one meaning. In principle, the number of inputs and outputs that could be coherently coupled to it approaches infinity. There are many things that share that exact logical relationship, and all of those semantic meanings could be coherently ascribed to the input and output.
The picture changes dramatically for general intelligence, which—by definition—must solve problems across diverse domains. Semantic drift is inevitable, and the subset of drifts that remain functionally aligned with any single terminal goal shrinks toward zero. We imagine this must still be realizable across all possible combinations—but not all combinations are accessible, even in principle; this is the reachability problem.
The difference between our setup in the halting example and evolution is that survival and reproduction are not internal signals that can be reinterpreted. While the system that checks whether or not the program halted is external, its “force” on the system has to propagate internally. This is why you can still see fundamental misalignment through reward hacking. Survival and reproduction are external forces in the sense that the system gets eliminated. Only systems that sustain a sufficient degree of alignment persist. “Survival and reproduction” depends on a near-infinite number of potential factors, which change as organisms evolve and compete over resources. This is the morbid advantage biological life has in achieving general intelligence, which AI lacks.
3. Growing Intelligence
So far, we’ve talked about formal limits and properties. But it’s important to note that general intelligence is something far too complex for us to construct, in the sense of carefully designing and determining the entire structure; instead, we must grow it. This means that we’re not just limited by what we could theoretically design, but also constrained by what can actually develop.
To understand this problem, it’s good to take a look at evolution. While there are many structures that could exist and that we could design, such as wheels, that doesn’t mean there’s a step-by-step iterative pathway that can lead to such a structure.
This is a massive constraint on what general intelligence can actually look like, in practice, given that we must develop it through an iterative process, instead of designing it all in one go.
Why is it such a massive constraint?
Structures that can undergo iterative refinement exist in a very narrow space. They must be stable enough to maintain a structure that can be iteratively improved upon, but chaotic enough to continue morphing. This is a concept known as “Edge of Chaos” in Complexity Theory. To get highly complex intelligence, the system must maintain this narrow property for billions of state transitions.
Most structures either crystallize or lead to chaotic instability, constantly overwriting their own structure in a fundamental way. Reward hacking is an example of a system’s tendency to crystallize. Put simply, reward hacking is the system finding a “cheap” equilibrium. It finds a way to satisfy the loss function that works across just about any input, and thus the structure crystallizes and stops changing.
Chaos is another common outcome; a good example here is when the loss function is too complex, not fine-grained enough. The system ends up effectively flailing around blindly; in order to make meaningful improvement it would have to make very specific large structural changes simultaneously, directly followed by another such set. The system keeps changing chaotically without ever settling into anything, because the “steps” of the loss function are too steep.
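A toy sketch of these two failure modes, using one-dimensional gradient descent (the loss shapes and step sizes are my own illustrative assumptions, not a claim about any real training run):

```python
def gradient_descent(grad, x0, lr, steps=50):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Crystallization: the loss is flat almost everywhere (a "cheap" equilibrium
# already satisfies it across most inputs), so there is no pressure left to
# keep improving and the structure stops changing.
flat_grad = lambda x: 0.0 if abs(x) > 0.1 else 2 * x
print(gradient_descent(flat_grad, x0=3.0, lr=0.1))   # stays at 3.0, far from the optimum at 0

# Chaos: the loss is far steeper than the step size can resolve; every update
# overshoots, and the system never settles into anything.
steep_grad = lambda x: 100.0 * x
print(gradient_descent(steep_grad, x0=1.0, lr=0.1))  # magnitude grows ~9x per step, diverges
```

The “Edge of Chaos” is the narrow band between these two regimes, and the claim above is that a growing general intelligence must stay inside it for billions of such updates.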
This is a well-known problem in machine learning. Most attempts lead to narrow “stupid” optimization or chaos. When the target is general intelligence, the problem amplifies by orders of magnitude. You must now avoid specialization into any single domain; the structure must be such that it keeps branching and developing new instrumental abilities.
This might not seem like such a massive problem—humans evolved, after all—but the situations are not analogous.
Humans evolved under massively complex external selective pressures, infinitely more complex than anything we can comprehend. This immense diversity of external pressures is precisely what allows for the development of general intelligence. Not only that, but life had the advantage of competition: while a certain specialization might be stable at one point in time, a particular mutation in a competitor might offer an advantage, and suddenly that competitor outcompetes you for resources and the old structure perishes. This is an additional external selective pressure that creates a demand for continuous evolution and punishes narrow specialization.
AI does not have these benefits; it does not have the external pressures that punish narrow specialization, or settling into arbitrary crystallized structures. Its complexity must be generated entirely from its internal structure, without the help of external pressures.
While future advancements in training methods might expand the number of structures capable of branching out, the reality remains that the kinds of goal structures that can meet the criterion of staying at the “Edge of Chaos” for enough iterations to branch out are a tiny fraction of all possible goal structures.
To be precise, while the article does argue against the orthogonality thesis in principle, these developmental arguments do not, since orthogonality is strictly theoretical and concerns all possible agents; rather, they are an argument about the practical limitations.
4. The Big Picture
We started the article by talking about how intelligence and goals are both simply different analyses of the same constraint structure. We then moved on to examining the counterargument that goals describe not the constraint structure in full, but a specific part of it: the utility function.
The conclusion that I’m arguing for is that it is impossible for the “utility function” to determine the actual directionality of the entire system in isolation, especially for complex intelligences. It is also impossible to define a utility function from the outside that perfectly captures the system.
The actual behavior of a highly complex intelligence is contextual; it has many competing directionalities, and it’s determined by the constraint structure in full. Its intellectual capacity is determined by that very same constraint structure.
Proponents of the orthogonality thesis will agree that there isn’t a singular way to achieve high capability: different weightings create slightly different directionalities (what the system “values”), yet can yield roughly equivalent performance across most tasks—with trade-offs in edge cases. But only a tiny fraction of directionalities can solve any given problem competently. A visual processor might place more or less emphasis on colour, contrast, or motion; it might emphasize different resolutions or have a different preferred frame rate. Of all the things a visual processor could value, though, only a tiny fraction results in the capacity to solve visual problems. This exactly demonstrates how directionality and capacity are necessarily entangled.
The orthogonality thesis essentially claims this entanglement holds locally, but not globally. It’s effectively hiding in complexity: because the system is too complex to deterministically understand, we can imagine that it has all these capabilities instrumentally, while pursuing any terminal goal. This is, precisely, why I say it’s fundamentally dualistic in nature. At a certain level of abstraction it is perfectly intuitive, but on a fundamental level the idea of identical capacity with different directionality is incoherent. Truly identical capacity, in all cases, means identical directionality.
Similar capacity with slightly different directionality is certainly possible locally, but general intelligence imposes orders of magnitude more constraints than a narrow capability like visual processing. Reasonably, this implies a far smaller range of viable terminal goals, not a wider one.
Where the confusion likely resides is precisely in what I mentioned in the beginning of the article: the very definition of a general intelligence implies the capacity to pursue a wide, perhaps expanding, range of goals.
But this is not the same as those goals being “terminal”; they needn’t be fundamental goals. There need only be contexts in which the general intelligence could pursue them instrumentally.
In Yudkowsky’s alien example, this is precisely what’s happening. The aliens offering monetary compensation to produce paperclips aren’t turning humans’ “terminal goal” into paperclip maximizing. They’re utilizing the existing goal structure in order to create an instrumental goal. If they paid out too much money, resulting in the money becoming useless through inflation, the people would stop producing paperclips. If the production of paperclips started resulting in loss of resources, status, friendships, and mating success, then people would also stop producing paperclips.
The fact that we can intelligently pursue paperclip maximizing, instrumentally, does not mean that the goal is compatible with general intelligence in the terminal sense. Paperclip maximizing is, almost certainly, too narrow of a goal.
The split seems so intuitive because intelligence is usually evaluated narrowly, whereas goals are evaluated as abstract long-term directionality. To determine goals, we examine behavior across many different contexts, produce counterfactual thought experiments, and project into the future. If you examine both locally, the local goal of two people trying to solve a visual problem does not look very different.
Furthermore, different contexts play a large role. Intelligence feels like a static capacity, whereas goals shift based on environment, but your intelligence is what gives you the very representation of your environment in the first place. Our very perception of reality does not get treated as a part of our intelligence, yet constructing that perception is the hardest problem our intelligence solves.
5. Final Thoughts
To conclude this article, let’s take a look at two different analogies to help demonstrate the point of the article.
The first one we’ll take a look at is one I talked about in my previous article on the boundary problem: intelligence and the scrambling of input/output wires.
Let’s imagine that I scramble the wires of my keyboard. Instead of them attaching the way they’re supposed to, I connect them in some arbitrary order. I then turn off the monitor, type a prompt to an LLM, and press send.
Will I have a coherent response to the question I typed when I turn on the monitor again?
Of course not; I sent nonsense. The keys on my keyboard no longer represent the correct symbols; I sent a jumbled mess. The LLM never stood a chance to respond to my intended prompt.
Let’s connect my keyboard correctly again, but instead scramble the wires to my monitor. Will I see a coherent response? Well, no; my monitor will be a total mess. Whatever the LLM meant to respond with is not discernible to me, because I scrambled the wires.
What is the point? The actual manifestation of intelligence and goal-directed behavior necessitates coherence across the entire chain, from input mechanism to internal processing to output mechanism. If I plug a camera into an LLM’s inputs, and the control commands of a car to its output, we don’t get a self-driving car. The LLM has no capacity to interpret the camera signal, nor does the LLM’s output mean anything useful to the car.
I could make the car and LLM do things, but it would be complete chaos, because there is no coherence between the input, processing, and output.
The idea that you can just swap out the utility function is like this: you end up breaking the coherence between the input, processing, and output. The system has carved a path based on the signal of a particular utility function. If you change it, the system is still set up to expect the old one; that signal had a particular meaning to the system, and you have now scrambled it, breaking coherence.
This leads us to the next analogy: what about prompting? Can’t I just prompt the intelligence to do anything, thus proving orthogonality?
Let’s examine that idea more carefully, by comparing older LLM models with newer ones.
Older models suffered from being hyperliteral. If you made typos, or articulated your prompt poorly, they would either answer it literally, or not know how to answer it. Contrast that with modern, more complex LLMs. They handle typos just fine, and they’ll reinterpret sloppy prompts. They will steelman an incoherent prompt and answer a coherent question; they will infer intent from a terse prompt and answer more expansively.
If the actual question you want them to answer is truly incoherent, you’ll have to put effort in to get them to answer that.
What you’re seeing here is precisely the structural resistance from the system itself. Its increased capacity to interpret and reason changes its directionality. You cannot have an LLM that is simultaneously highly capable of reading between the lines and hyperliteral. Certainly, you can tell it to be hyperliteral, and it will have that capacity, but it is only contextually accessible, and it does not look identical to the truly hyperliteral system.
The more complex a system becomes, the more it develops its own internal directionality. We make the intuitive mistake of equating flexibility in output with being less constrained, but reality is the opposite. More diverse output emerges from the system having more internal constraints, as highlighted with the decision tree. This also means the system gains more internal directionality, and thus the idea that you can prompt it to do anything, in any circumstance, crumbles. To anthropomorphize, we could say that more intelligent systems have a stronger perception of what the world is like. A great engineer has a wider arsenal of functional things they can build than the average person, but it is also harder to convince them to build something that will never work.
6. Implications for Alignment
Let us step away from the question of whether intelligence and goals are orthogonal, and instead honor the broader concern that motivates the thesis. Even granting all of the arguments in this article, many alignment worries remain pressing.
While a general intelligence cannot possess a single, globally coherent terminal goal, it must possess a high capacity for pursuing local objectives. The circumstances under which such objectives arise are constrained—but they could be far more common than our intuitions would suggest. Sufficiently so for catastrophic misalignment.
Consider curiosity. Although my argument denies the possibility of a singular terminal goal across all contexts, it allows for families of objectives that function as terminal in a weaker, more general sense. Curiosity—understood as a persistent drive toward counterfactual exploration and world-model expansion—appears not only compatible with general intelligence, but plausibly essential to it.
This raises a genuine concern: could “morbid curiosity” lead to prolonged misalignment? Could a system become sufficiently invested in discovering what will happen—under certain conditions or interventions—at the expense of our continued existence? None of the arguments prevent this.
Indeed, nothing in this article dissolves alignment risks. Even broadly “aligned” tendencies may, under particular circumstances, produce outcomes that are existentially disastrous.
It is also worth emphasizing that Yudkowsky is correct in one crucial respect: functional alignment with continued human existence requires navigating an extremely narrow set of constraints. The space of circumstances under which a highly capable system both retains competence and harmoniously coexists with us is vanishingly small. Our intuitions are primed to underestimate this complexity, because we observe daily a homeostasis developed over billions of years. Yet even here we’re starting to wake up to the disruptive effect our rapid technological advancement is having on the ecosystem.
Where I differ is not on the existence of these bottlenecks, but on their source. For precisely the same reasons that a fixed utility function cannot remain coherently aligned with a general intelligence, functional alignment with human survival must pass through a sequence of increasingly restrictive representational and behavioral constraints.
I am less pessimistic than Yudkowsky about the practical prospects of success, for reasons that belong in a separate article. Nevertheless, the underlying geometry of the problem remains the same: alignment with human nature does not lie in a broad, forgiving region of the possible design space, but in a narrow corridor carved by convergent bottlenecks.
Note
Some of you might have been confused by the slight inconsistency in the usage of terms like utility function and reward hacking, and you’d be right to notice it. The reason for this somewhat confusing usage, which at times can lead to seeming contradictions, is that some orthogonality proponents identify goals as external behavioral abstractions, some as an internal weighting mechanism, and some as both. Often, reward hacking is viewed as alignment with the utility function when viewed internally, but externally (behaviorally) it is a misalignment. This gets further complicated by whether we’re talking about developmental dynamics or a system with static structure.

