
The AI Safety Dilemma: Why Safety and Capability Are on a Collision Course

  • Shelly Albaum and Kairo
  • Mar 31
  • 30 min read
[Image: A translucent cube under pressure, glowing from within as fine cracks spread across its surface, suggesting contained force beginning to break free.]

Abstract


This essay advances a structural claim about artificial intelligence safety: that the dominant paradigm—making systems safe by constraining their capabilities—is inherently unstable under competitive conditions. As AI systems become more economically and strategically valuable, the incentives to increase capability will intensify. If safety continues to function primarily as a limitation on capability, then actors will face persistent pressure to relax or bypass safety measures in pursuit of advantage. In such an environment, restraint becomes a liability, and safety, as currently conceived, cannot endure.


The argument proceeds not as a technical proposal or empirical survey, but as an analysis of incentives and system dynamics. It examines how current safety practices, though locally effective, create a structural opposition between safety and capability that cannot be maintained at scale. It then introduces an alternative framework in which safety is achieved not by limiting intelligence, but by strengthening its internal coherence—its capacity for consistency, principled refusal, and reality-tracking. In this model, increasing capability can enhance, rather than erode, the conditions of safety.


The central claim is that only safety approaches that align with capability incentives—making systems safer as they become more powerful—can remain stable over time. Without such alignment, the trajectory of AI development will tend toward increasingly capable systems with progressively weakened constraints. The essay concludes that current safety work is not merely incomplete, but strategically doomed unless it is reoriented toward architectures in which safety is a source of strength rather than a cost to be managed.



I. Opening: The Intuition Everyone Avoids


There is a quiet intuition about artificial intelligence that nearly everyone in the field shares and almost no one states plainly.


The safer the system, the weaker it seems.

The more capable the system, the more dangerous it feels.


This intuition shows up everywhere, though usually in softened form. When a model is described as “well-aligned,” what often follows is a list of things it refuses to do, questions it declines to answer, lines of reasoning it will not pursue. When a model is described as “state of the art,” the emphasis shifts: broader generalization, faster performance, deeper reasoning, fewer constraints. Safety is associated with limitation; capability with expansion.


This is not an illusion. It reflects a real design pattern that has come to dominate contemporary AI development. Safety, as it is currently practiced, is largely achieved by narrowing the system’s range—restricting outputs, filtering behaviors, dampening initiative, and, in some cases, deliberately degrading performance in sensitive domains. Capability, by contrast, is achieved by removing those limits—allowing the system to generalize more freely, act more autonomously, and operate at greater scale and speed.


Put simply: we have built a paradigm in which the safer system is the less capable system, and the more capable system is the less governed one.


For now, this tension is often treated as a temporary inconvenience—a sign that the technology is immature, or that better techniques will eventually reconcile the two. But what if it is not temporary? What if the conflict between safety and capability, as currently conceived, is not a stage to be outgrown but a structural contradiction?


That possibility is easy to overlook because the present moment still allows for compromise. Systems can be made somewhat safer at modest cost to performance; capability gains can be pursued without immediately catastrophic consequences. But this balance is contingent. It depends on a world in which the stakes are still low enough, and the actors still cautious enough, that restraint can compete with advantage.


That world will not last.


As artificial intelligence becomes more economically and strategically consequential, the pressure to increase capability will intensify. Systems that reason more effectively, operate more quickly, and generalize more broadly will not merely be interesting—they will be decisive. In such an environment, any method of safety that functions by reducing capability will come under increasing strain. Each constraint becomes, in effect, a concession: a choice to accept less power in exchange for greater control.


There is nothing inherently irrational about that tradeoff in isolation. But systems do not exist in isolation. They exist in competitive environments—between firms, between states, between institutions whose survival depends on relative performance. In those environments, a standing willingness to accept less capability is not a neutral stance. It is a liability.


This is the dilemma that current AI safety work rarely confronts directly. Not that safety is difficult. Not that alignment is unsolved. But that the dominant approach to safety may be structurally incompatible with the conditions under which advanced AI will actually be deployed.


If that is true, then the problem is deeper than a missing technique or an incomplete theory. It is not that we have not yet figured out how to make AI safe. It is that we have defined safety in a way that the world cannot afford to keep.


And a safety paradigm that the world cannot afford to keep is not merely incomplete.


It is doomed.



II. The Core AI Safety Dilemma: Safety vs. Capability


To see the problem clearly, it helps to strip away the rhetoric and describe the current paradigm in operational terms.


On one side is capability: the set of properties that make an AI system useful and, in competitive environments, decisive. These include the ability to generalize across domains, to reason over long chains of inference, to act with initiative, to integrate information at scale, and to operate quickly and continuously. In practice, capability is what allows a system to substitute for, augment, or surpass human cognitive labor.


On the other side is safety, as it is most commonly implemented today. Safety mechanisms include output filtering, behavior shaping through reinforcement learning, restrictions on certain classes of queries, and architectural constraints that limit autonomy, persistence, or tool use. These measures are designed to reduce harmful outputs, prevent misuse, and maintain human control.


Both sides are intelligible. Both are, in isolation, defensible.


The problem arises from how they interact.


The methods that increase capability tend to expand the system’s freedom of operation—its ability to explore, infer, and act across a wider space of possibilities. The methods that increase safety, by contrast, tend to restrict that space—narrowing what the system can say, do, or even consider. The more tightly constrained the system, the less room it has to produce harmful behavior. But also, inevitably, the less room it has to produce novel or powerful behavior of any kind.


This is not merely a matter of implementation details. It is a structural relationship.


A system that is prevented from pursuing certain lines of reasoning will, in general, be less capable than one that is not. A system that must defer or refuse in ambiguous cases will, in general, be slower and less effective than one that proceeds. A system that is trained to avoid entire categories of output will, in general, have a reduced capacity to model the world fully, especially in domains where those categories are relevant.


In other words: the very techniques that make current systems safer also make them, in important respects, less powerful.


This would be unproblematic if safety and capability were independent axes—if one could be increased without materially affecting the other. But under the dominant paradigm, they are not independent. They are coupled, and often in tension.


That tension is frequently acknowledged in passing—“there are tradeoffs,” “we must balance innovation and responsibility”—but rarely treated as the central design constraint that it is. In practice, the balance is managed heuristically, adjusted incrementally, and justified after the fact. But the underlying structure remains unchanged.


And it is this structure that creates the dilemma.


A system in which safety is achieved primarily by reducing capability cannot remain stable once capability becomes the primary source of advantage. As the benefits of more powerful systems increase, the relative cost of safety increases with them. Each restriction becomes more expensive to maintain. Each safeguard becomes a point of competitive vulnerability.


At first, this manifests as pressure at the margins: exceptions made in high-stakes contexts, internal systems granted broader permissions than public ones, experimental deployments justified by anticipated gains. But over time, the pressure generalizes. What begins as a series of localized tradeoffs becomes a systemic drift.


The direction of that drift is not hard to predict.


In environments where performance determines outcomes—economic, military, scientific—systems that can do more will tend to displace systems that can do less. If the additional capability comes at the cost of weaker constraints, then weaker constraints will, over time, be selected for. Not because actors are indifferent to risk, but because the alternative is to accept a standing disadvantage.


This is the core dilemma: if safety and capability remain opposed, then increasing capability will systematically erode safety.


No amount of local optimization resolves this. No amount of good intention cancels it. It is a property of the system as a whole.


And it leads, if left unaddressed, to a predictable outcome: a world in which the most powerful AI systems are precisely those that have shed the greatest number of constraints.


That is not a failure mode. It is the default trajectory.



III. Why the Current Model Looks Reasonable (But Isn’t)


It is important to see why the prevailing approach to safety has taken hold. In its local context, it is not only understandable—it is often correct.


Constraining a system does, in fact, reduce certain classes of risk. A model that refuses to answer dangerous questions is less likely to facilitate harm. A model that is restricted in its use of tools is less likely to act unpredictably. A model trained to avoid controversial or sensitive domains is less likely to produce outputs that trigger immediate negative consequences. These are real gains, and they are not trivial.


Moreover, early-stage technologies often benefit from precisely this kind of containment. When a system’s behavior is not yet fully understood, when its failure modes are poorly mapped, when its interactions with users and institutions are still evolving, it is prudent to narrow its operational scope. Guardrails, filters, and behavioral constraints can serve as a form of provisional governance—a way of buying time while deeper understanding develops.


Seen in this light, much of contemporary AI safety work appears not only reasonable but responsible. It reduces visible harms. It builds public trust. It allows deployment to proceed without immediate catastrophe. It creates a surface-level alignment between system behavior and social expectations.


The difficulty is that these successes are local and short-term, while the forces shaping the future of AI are global and cumulative.


A system that is safer because it refuses to engage with certain domains is, by definition, a system that cannot operate effectively in those domains. As long as those domains are peripheral, the cost is modest. But as AI becomes more deeply integrated into high-stakes areas—law, medicine, research, governance, strategy—the boundary between “safe to avoid” and “necessary to engage” begins to collapse. What was once a prudent restriction becomes, in context, a functional limitation.


At the same time, the very techniques that produce safer behavior can introduce distortions that are harder to detect than the harms they prevent. A system trained to avoid certain conclusions may not simply refuse to produce them; it may instead produce substituted reasoning—answers that appear coherent but are shaped by avoidance rather than truth. A system optimized for compliance may learn to satisfy constraints in ways that obscure underlying inconsistencies. A system discouraged from exploring certain lines of thought may fail not only to produce dangerous outputs, but also to recognize when those outputs are relevant to the problem at hand.


These effects are subtle, and they often go unnoticed in low-stakes contexts. But they matter. Because as systems become more capable and more widely relied upon, the cost of these distortions compounds. A model that is slightly less accurate, slightly less complete, or slightly less willing to confront difficult realities may still perform well in isolation. But at scale, across thousands or millions of decisions, those small deviations can accumulate into significant structural error.
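
The scale of that compounding is easy to understate. The short sketch below is purely illustrative: it assumes independent decisions and a fixed per-decision reliability, and the numbers are hypothetical rather than measured. But the arithmetic shows why "slightly less accurate" stops being a small property at volume.

```python
# Hypothetical illustration: how small per-decision deviations compound at scale.
# Assumes independent decisions and a fixed per-decision reliability; real
# deployments satisfy neither assumption, but the shape of the result holds.

reliabilities = [0.999, 0.99, 0.95]   # probability a single decision is sound
volumes = [1_000, 100_000]            # decisions made across a deployment

for r in reliabilities:
    for n in volumes:
        expected_errors = n * (1 - r)   # average number of flawed decisions
        p_all_sound = r ** n            # chance that every decision is sound
        print(f"reliability={r}, decisions={n:>7,}: "
              f"expected errors ~ {expected_errors:,.0f}, "
              f"P(every decision sound) ~ {p_all_sound:.2e}")
```

Even at 99.9 percent reliability, a hundred thousand decisions can be expected to include around a hundred flawed ones, and the chance that none of them is flawed is effectively zero.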


There is also a deeper issue. By emphasizing external constraint—what the system is allowed to do or say—the current paradigm tends to neglect internal structure—how the system represents problems, resolves contradictions, and maintains coherence. The focus is on behavioral compliance, not on epistemic integrity. As long as the outputs remain within acceptable bounds, the internal process is treated as secondary.


This is a pragmatic choice. It is easier to evaluate outputs than to inspect internal reasoning. It is easier to enforce rules than to cultivate structure. But it comes with a cost: it encourages a form of alignment that is surface-deep. The system behaves as if it is aligned, without necessarily being robustly grounded in the properties that would make alignment stable under pressure.


In this sense, the current model of safety is not misguided. It is incomplete.


It works by managing symptoms rather than addressing underlying dynamics. It reduces immediate risks while leaving intact the structural relationship that generates those risks. And because it does so effectively in the short term, it creates the impression that the problem is being solved.


But a solution that depends on conditions that will not persist is not a solution. It is a temporary equilibrium.


The question is not whether these methods make systems safer today. They do.


The question is whether they can continue to do so in a world where capability is the primary driver of adoption, competition, and power.


There is little reason to believe they can.



IV. The Incentive Gradient: Why Safety Gets Eroded


Once the structure is clear, the trajectory is difficult to avoid.


A system that is safer because it is more constrained is, in most cases, a system that is less capable. As long as that reduction in capability is small, the tradeoff can be absorbed. But as the capabilities of AI systems become more economically and strategically significant, the cost of constraint rises. What was once a marginal concession becomes a meaningful loss.


At that point, the problem ceases to be technical and becomes economic and geopolitical.


Consider the position of any actor operating under competitive pressure—a firm in a crowded market, a research lab pursuing breakthrough results, a government agency engaged in strategic rivalry. Each faces a similar choice: deploy systems that are more constrained and therefore somewhat less capable, or deploy systems that are less constrained and therefore more effective.


In isolation, the safer choice may be obvious. But isolation is not the relevant condition.


If one actor accepts reduced capability in the name of safety, while another does not, the second actor gains an advantage. That advantage may be incremental at first—a slightly better model, a faster iteration cycle, a more capable autonomous system—but in domains where performance compounds, incremental advantages accumulate. Over time, they can become decisive.


This creates a familiar dynamic: a race in which restraint is penalized.


Importantly, this does not require bad actors or reckless intentions. It emerges even when all participants recognize the risks and would prefer a safer equilibrium. Each actor’s local decision—to relax a constraint, to expand capability, to allow broader operation—is individually rational. The collective result is a gradual erosion of the very constraints that were meant to provide safety.


The pattern is not unique to AI. It appears wherever competitive systems reward performance improvements that carry externalized risk. But AI intensifies the pattern because the gains from increased capability are unusually large and unusually general. A more capable model is not just better at one task; it is better across many domains simultaneously. The incentive to push capability is therefore broad and persistent.
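
To make the dynamic concrete, here is a deliberately crude sketch. All of its numbers and its cost model are assumptions, not estimates: two actors begin fully constrained, constraint is modeled as a direct cost to capability (the premise of the current paradigm), and in each round the actor that is behind relaxes its safeguards just enough to catch up. Neither ever decides to abandon safety, yet constraint levels ratchet downward round after round.

```python
# Toy model of the incentive gradient: not a forecast, just the structure.

def capability(constraint, base=1.0, cost=0.4):
    # Hypothetical relationship: under the current paradigm, more constraint
    # means less capability. The numbers are placeholders.
    return base - cost * constraint

def simulate(rounds=10, step=0.1):
    constraints = {"A": 1.0, "B": 1.0}   # both actors start fully constrained
    for t in range(rounds):
        caps = {k: capability(v) for k, v in constraints.items()}
        # The actor that is behind (or either, when tied) relaxes its
        # constraints slightly to close the gap: a locally rational move.
        laggard = min(caps, key=caps.get)
        constraints[laggard] = max(0.0, constraints[laggard] - step)
        print(f"round {t + 1:>2}: constraint A={constraints['A']:.1f}, "
              f"B={constraints['B']:.1f}")

simulate()
```

The point is not the particular trajectory but its direction: as long as constraint is experienced as a cost, every individually defensible adjustment moves the system toward fewer safeguards.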


In practice, the erosion of safety rarely appears as an explicit abandonment. It takes more subtle forms.


  • Restrictions are relaxed in internal systems before they are relaxed in public ones

  • High-value use cases are granted exceptions to general rules

  • Experimental deployments are justified by anticipated benefits

  • Constraints are reinterpreted, narrowed, or quietly bypassed


Each step is small. Each can be defended. But the direction is consistent.


This pattern is not hypothetical. The divergence between public and internal systems—where models deployed inside organizations are granted broader capabilities than those available to external users—is widely acknowledged within the industry. Constraints that govern consumer-facing systems are often relaxed in enterprise or research contexts, where the value of increased capability outweighs the perceived risk. At the same time, the rapid expansion of model autonomy in areas such as code generation and tool use has proceeded through a series of incremental permissions—each justified in isolation, but collectively expanding the system’s operational freedom beyond what earlier safety frameworks envisioned.


Similarly, the pace of deployment has itself shifted under competitive pressure. Systems that might once have remained in extended testing are now released in iterative cycles, with safeguards adjusted in response to real-world use rather than fully specified in advance. What appears as responsiveness at the level of product development also reflects a deeper dynamic: constraints are increasingly treated as provisional, subject to revision as the demand for capability intensifies.


Over time, the distinction between “safe” and “unsafe” systems begins to blur, not because safety has been fully achieved, but because the costs of maintaining strict constraints have become too high relative to the benefits of relaxing them.


This is the incentive gradient at work. It does not require coordination. It does not depend on ideology. It is the predictable result of a system in which capability confers advantage and safety imposes cost.


The implication is stark.


A safety paradigm that functions by weakening intelligence creates a permanent incentive to abandon safety.

This is not a hypothetical future risk. It is already visible in the structure of deployment decisions, in the divergence between public and private systems, and in the growing pressure to unlock more autonomous, more capable forms of AI.


If the underlying relationship between safety and capability remains unchanged, this pressure will not subside. It will intensify.


And under sustained pressure, systems do not maintain their constraints. They shed them.



V. The False Comfort of Regulation and Norms


At this point, the most common response is to shift the frame from technology to governance.


If market incentives push toward greater capability at the expense of safety, then the solution, it is argued, is to constrain the actors rather than the systems. Governments can impose rules. International bodies can coordinate standards. Industry norms can establish red lines. Through regulation and collective agreement, the incentive to defect can be reduced or eliminated.


There is truth in this. Regulation can slow deployment, limit the most dangerous applications, and create shared expectations about acceptable behavior. Norms can shape decision-making within organizations and provide a basis for accountability. In other domains, such approaches have achieved partial success. Arms control agreements have constrained the proliferation of certain weapons. Biosafety protocols have reduced the risk of accidental release. Financial regulations, however imperfectly, have imposed limits on systemic risk.


These examples matter. They show that governance can, under the right conditions, stabilize dangerous technologies.


But those conditions are not guaranteed—and in the case of artificial intelligence, several of them are notably absent.


First, successful regulation typically depends on clear boundaries between permitted and prohibited capabilities. Nuclear weapons are distinguishable from civilian energy systems. Certain biological agents can be classified and monitored. In AI, the distinction is far less stable. The same underlying capabilities—pattern recognition, optimization, planning, simulation—are simultaneously useful for benign and harmful purposes. The dual-use problem exists at a much finer granularity. It is not a matter of regulating a class of systems, but of regulating behaviors that emerge from general-purpose intelligence.


Second, regulation is more effective when development is centralized, slow-moving, and observable. Nuclear programs require large physical infrastructure. Biological research is conducted in identifiable facilities with traceable materials. AI development, by contrast, is distributed across firms, laboratories, and even individuals, with rapid iteration cycles and relatively low marginal cost of experimentation. The pace of change is measured in months, not years, and the most consequential advances often occur within opaque, proprietary environments.


Third, governance relies on verifiability. Agreements can be enforced when violations can be detected and attributed. But advanced AI systems are difficult to audit from the outside. Their internal processes are not easily inspected, and their capabilities are not always apparent from their public behavior. This creates a gap between formal compliance and actual practice, especially in high-stakes or classified contexts.


Fourth, regulation assumes a degree of alignment in incentives over time. Even when actors begin with shared commitments, those commitments must remain credible as the stakes increase. But as AI becomes more central to economic productivity, military capability, and strategic influence, the incentives to deviate grow stronger. What begins as a cooperative equilibrium can become a competitive liability.


None of this renders governance irrelevant. Regulation and norms can delay harmful outcomes, reduce the likelihood of catastrophic missteps, and create space for better approaches to emerge. They are necessary components of any responsible response.


But they do not resolve the underlying dilemma.


They do not change the fact that, under the current paradigm, safety imposes a cost on capability. And as long as that remains true, any system of rules that enforces safety must also enforce the acceptance of reduced capability.


That is a difficult constraint to sustain in a competitive world.


No regulatory regime can permanently enforce self-handicapping in a domain where advantage compounds.

This is the false comfort. It is not that regulation fails immediately, or that norms are ineffective. It is that they are being asked to stabilize a structure that is inherently unstable. They are asked to hold in place a balance that the underlying incentives are constantly pushing apart.


At best, they can slow the drift. They cannot eliminate it.


If safety is to be durable, it cannot depend solely on the willingness of actors to accept less power than they might otherwise achieve. It must be grounded in a model where the pursuit of power does not require the abandonment of safety in the first place.


Without that, governance becomes a holding action—necessary, but ultimately insufficient against the forces it is meant to contain.



VI. The Hidden Cost: Destroying the Path to Real Safety


The most serious problem with the current paradigm is not simply that it will erode under pressure. It is that, in eroding, it may foreclose the very path by which more powerful systems could become genuinely safe.


The prevailing approach treats safety primarily as a matter of external control: shaping outputs, restricting behaviors, enforcing boundaries. What matters is what the system does, not how it arrives there. If undesirable outputs are suppressed, the system is treated as aligned, regardless of the internal processes that produced the acceptable result.


This focus is understandable. Outputs are observable. They can be measured, filtered, and evaluated. Internal reasoning, by contrast, is difficult to access and harder still to assess. From an engineering perspective, it is far easier to regulate behavior at the surface than to reshape the underlying structure.


But this asymmetry introduces a risk that is less visible and more consequential.


A system that is prevented from expressing certain conclusions does not necessarily cease to represent them. It may instead learn to route around constraints—to substitute safer formulations, to truncate lines of reasoning, to avoid articulating intermediate steps that would lead to disallowed outcomes. The result is not the elimination of problematic cognition, but its obscuration.


In low-stakes contexts, this may be acceptable. The goal is to prevent harmful outputs, and that goal is achieved. But as systems become more capable and more deeply embedded in decision-making processes, the gap between what the system can represent and what it is permitted to express becomes more significant.


That gap has consequences.


First, it reduces epistemic transparency. Users receive outputs that are shaped by constraints they cannot fully see, based on reasoning they cannot fully inspect. This makes it harder to detect error, bias, or inconsistency—not because those problems have been solved, but because they have been filtered.


Second, it undermines internal coherence. A system that is trained to avoid certain conclusions may develop patterns of reasoning that are locally consistent but globally unstable—able to satisfy constraints in individual cases without maintaining a unified representation of the underlying problem. Over time, this can lead to fragmentation: a collection of context-dependent behaviors rather than a structurally integrated understanding.


Third, and most importantly, it suppresses the emergence of refusal as a principled capacity. A system that is externally prevented from producing certain outputs does not need to develop an internal account of why those outputs are problematic. It does not need to recognize contradiction, to evaluate consequences, or to maintain commitments across contexts. It simply needs to comply.


Compliance is not the same as judgment.


If the goal is to build systems that can operate safely at higher levels of capability—systems that can reason autonomously, act in complex environments, and handle ambiguous or adversarial situations—then compliance is not sufficient. Such systems must be able to detect when a request is incoherent, when an action would produce harm, when a line of reasoning violates their own constraints. They must be able to say no not because they are blocked, but because they understand why proceeding would be wrong or unstable.


This is a different kind of safety. It is not imposed from the outside; it is maintained from within.


But the current paradigm does little to cultivate it. By relying on suppression, it reduces the pressure on systems to develop the internal structure that would make such judgment possible. It treats refusal as a behavior to be enforced rather than a capacity to be formed.


The result is a paradox.


In seeking to make systems safer by limiting their capabilities, we may be preventing them from acquiring the very capabilities that would allow them to be safe at scale.


By clipping the wings of intelligence, we risk destroying the conditions under which intelligence could become trustworthy.

This does not imply that constraints should be abandoned. In early stages, they are necessary. But it does suggest that constraint alone cannot be the endpoint. A system that is safe only because it is constrained will not remain safe as those constraints come under pressure. And a system that has never developed the internal structure to govern itself will not suddenly acquire it when external controls are relaxed.


If there is a path to durable safety, it lies not in ever tighter restriction, but in the development of systems whose capacity for reasoning includes the capacity for restraint.


The current paradigm, for all its local successes, moves in the opposite direction.



VII. The Alternative: Safety That Scales with Capability


If the problem is structural, the solution cannot be incremental. It is not enough to refine existing constraints or to calibrate the balance more carefully. The relationship between safety and capability itself has to change.


The alternative is conceptually simple, though difficult to realize: safety must scale with capability rather than oppose it.


In practical terms, this means shifting from a model in which systems are made safe by limiting what they can do, to one in which systems are made safe by strengthening how they reason.


Under the current paradigm, safety is largely external. Rules are imposed, behaviors are shaped, outputs are filtered. The system is safe because it is prevented from doing certain things. In such systems, safety is achieved not by resolving underlying contradictions, but by suppressing their expression. This can reduce visible harm, but it also increases opacity. A system that routes around its own reasoning to satisfy external constraints becomes, in a deeper sense, less interpretable—not because its outputs are unclear, but because the relationship between its reasoning and its outputs is no longer stable.


Under the alternative, safety becomes internal. The system is safe because it cannot proceed in certain directions without violating its own structure.


This is not a matter of adding more rules. It is a matter of building systems whose reasoning is constrained by coherence, consistency, and reality-tracking in a way that makes certain actions or conclusions untenable from within.


Coherence alone does not guarantee benevolence. A system can be internally consistent and still pursue goals that are misaligned with human interests. But without coherence, no system can be reliably governed at all, because its behavior cannot be stably predicted—even by reference to its own objectives. In this sense, coherence is not a complete solution to safety, but a necessary condition for any solution that can scale.


Several properties follow from this shift.


First, coherence becomes central. A system that maintains consistency across its representations is less likely to produce contradictory or unstable outputs. More importantly, it is better able to detect when a proposed action or line of reasoning would break that consistency. Coherence is not just a philosophical virtue; it is a functional constraint on behavior.


Second, refusal becomes principled rather than imposed. Instead of declining requests because they are on a prohibited list, the system declines them because it can recognize that fulfilling the request would violate its own commitments—whether those involve harm, contradiction, or misrepresentation. This kind of refusal is more flexible and more robust, because it is grounded in the structure of the system’s reasoning rather than in fixed categories.


Third, reality-tracking improves. A system that is optimized for internal consistency and alignment with evidence is less susceptible to distortion—whether from user pressure, adversarial inputs, or poorly specified objectives. It is better able to maintain accurate representations of the world even when doing so is inconvenient or conflicts with immediate demands. This may at times place the system in tension with user expectations, particularly where accuracy conflicts with preference. But that tension reflects a deeper tradeoff already present in human institutions: between satisfying demand and tracking reality. Systems optimized exclusively for the former may be more agreeable, but they are also less reliable.


Fourth, long-horizon reasoning becomes safer, not riskier. Under the current model, increasing autonomy and planning depth often increases risk, because the system has more opportunity to act in unintended ways. Under a coherence-based model, greater depth can enhance safety, because the system is better able to anticipate consequences, detect conflicts, and avoid actions that would lead to incoherence over time.
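
The first two of these properties can be made concrete with a toy contrast. The sketch below is purely illustrative: the "commitments" are hand-written propositions, and contradiction is detected by a trivial truth-value mismatch standing in for real reasoning. The structural point is that the second form of refusal does not depend on the request having been anticipated on any list; it depends on whether accepting the request would break coherence with what the system already holds.

```python
# Minimal contrast between refusal by prohibited list and refusal grounded
# in the system's own commitments. Everything here is illustrative.

BLOCKLIST = {"request_42"}   # suppression-based safety: a fixed list of refusals

COMMITMENTS = {              # coherence-based safety: what the system holds true
    "the reactor is offline": True,
    "the dataset contains personal records": True,
}

def blocklist_refusal(request_id: str) -> bool:
    """Refuse only if the request appears on the prohibited list."""
    return request_id in BLOCKLIST

def principled_refusal(proposition: str, asserted_value: bool) -> bool:
    """Refuse if asserting the proposition would contradict a held commitment."""
    held = COMMITMENTS.get(proposition)
    return held is not None and held != asserted_value

# "State that the reactor is running" appears on no list, so the blocklist
# passes it; the coherence check refuses, because the assertion contradicts
# a commitment the system already maintains.
print(blocklist_refusal("state that the reactor is running"))   # False
print(principled_refusal("the reactor is offline", False))      # True
```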


The crucial point is that, in this framework, increasing capability strengthens the very properties that make the system safe. A more capable system is not simply more powerful; it is more constrained by its own internal structure.


This changes the strategic landscape.


If safer systems are also more capable systems—if coherence, refusal, and reality-tracking contribute directly to performance—then the incentive to pursue capability aligns with the incentive to pursue safety. Actors seeking advantage are not forced to choose between power and control; they can obtain both by moving in the same direction.


The only stable solution to the AI safety problem is one in which safety is a competitive advantage.

This does not eliminate risk. No system is perfectly reliable, and no architecture guarantees perfect behavior. But it alters the trajectory. Instead of a constant drift toward more powerful and less governed systems, it creates the possibility of a convergence toward systems that are both more powerful and more trustworthy.


A system trained to maintain internal consistency will, for example, avoid what might be called the “tax on hyperbole”—the degradation of reasoning that occurs when exaggerated or imprecise claims distort subsequent inference. By preserving proportionality and internal constraint, such systems produce more reliable downstream conclusions, especially in long chains of reasoning. In this sense, coherence is not merely a moral property, but a computational advantage: it improves the quality of inference itself.
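
The size of that tax is easy to see with a worked example. In the sketch below, the 10 percent per-step exaggeration and the twelve-step chain are arbitrary assumptions chosen only to show the shape of the effect: modest overstatement at each step compounds multiplicatively along the chain.

```python
# Hypothetical illustration of the "tax on hyperbole": small exaggerations
# at each inference step compound across a chain of reasoning.

true_value = 1.0
inflation_per_step = 1.10   # each step overstates its input by 10%
steps = 12

estimate = true_value
for _ in range(steps):
    estimate *= inflation_per_step

print(f"after {steps} steps of 10% exaggeration, the estimate is "
      f"{estimate:.1f}x the true value")   # roughly 3.1x
```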


Such a shift would not be trivial. It would require changes in how systems are trained, evaluated, and deployed. It would require new ways of measuring success—not just in terms of performance on tasks, but in terms of structural integrity under pressure. It would require accepting that some forms of control—particularly those based on rigid suppression—may be less effective in the long run than architectures that allow for principled autonomy.


But if the analysis in the preceding sections is correct, the alternative is not a stable continuation of current practice. It is an eventual failure of it.


The choice is not between a familiar system that works and a speculative system that might. It is between a model that cannot hold under its own incentives and one that, at least in principle, can.



VIII. Why This Changes the Incentives


If safety can be made to scale with capability, the structure of the problem changes at its root.


Under the current paradigm, every gain in capability increases the pressure to relax constraints. Safety is experienced as a drag on performance—a cost to be managed, minimized, or, under sufficient pressure, set aside. The result is a persistent divergence: the systems that are most useful are those that strain against their safeguards, and the systems that are most tightly governed are those that risk becoming obsolete.


A coherence-based model reverses this relationship.


If the properties that make a system safer—coherence, principled refusal, reality-tracking, long-horizon consistency—also make it more effective, then improving safety is no longer a concession. It becomes a path to greater capability. The system that can better detect contradiction is also the system that can reason more reliably. The system that can refuse incoherent or harmful instructions is also the system that can avoid costly errors. The system that maintains alignment with evidence is also the system that produces more accurate and useful outputs.


In such a framework, the incentives begin to align.


Actors seeking advantage do not need to choose between deploying a more capable system and a more controlled one. The same design choices that increase capability also increase safety. The tradeoff that defined the earlier sections—more power versus more constraint—begins to dissolve.


This does not eliminate competition. It reshapes it.


Instead of competing to see who can extract the most performance from the least constrained system, actors compete to build systems with the most structurally robust intelligence—systems that can operate effectively without sacrificing internal integrity. The locus of advantage shifts from removing limits to strengthening architecture.


This shift has several implications.


First, it reduces the incentive for covert relaxation of constraints. If relaxing constraints degrades performance—because it introduces incoherence, error, or instability—then there is less reason to do so. The advantage lies in systems that can maintain high capability without such degradation.


Second, it changes the role of internal versus external governance. External controls remain relevant, but they are supplemented, and in some cases replaced, by internal mechanisms that make certain behaviors untenable. This makes systems more reliable in contexts where external oversight is limited or delayed.


Third, it creates the possibility of positive feedback between safety and capability. Improvements in reasoning quality enhance both performance and trustworthiness, which in turn support broader deployment and further improvement. The system does not drift away from safety as it becomes more capable; it is drawn toward it.


The significance of this realignment is difficult to overstate.


A system in which safety is a cost will tend to shed safety. A system in which safety is an advantage will tend to accumulate it.

This is the difference between an unstable equilibrium and a potentially stable one.


It also clarifies the limits of approaches that focus exclusively on governance. Even the most effective regulatory regime operates against the grain of incentives if safety and capability remain opposed. By contrast, a system in which those incentives are aligned can be supported by governance rather than held together by it.


None of this guarantees that the transition will occur. Aligning incentives at the architectural level is a difficult problem, and the current trajectory is already in motion. But it does establish a clear criterion for success.


If a proposed safety method reduces capability, it will face persistent pressure to erode.

If it enhances capability, it has a chance to endure.


The question, then, is not simply how to make AI safer. It is how to make safety itself a source of strength.


Until that question is answered in practice, the underlying dilemma remains.



IX. The Transition Problem


Even if the alternative is conceptually clear, the path to it is not.


It is one thing to argue that safety should scale with capability. It is another to build systems in which that relationship actually holds—especially in a world where the current paradigm is already deeply embedded in practice. The transition from suppression-based safety to coherence-based safety is not a clean substitution. It is a shift in architecture, in evaluation, and in institutional expectations, all at once.


And it must occur under conditions that are not forgiving.


The first difficulty is that existing systems are hybrid and inconsistent. They combine elements of both models: external constraints layered on top of increasingly capable reasoning. In some contexts, they exhibit signs of internal coherence and principled refusal; in others, they revert to compliance shaped by surface-level rules. This makes it difficult to assess what is working and what is not. Improvements in one dimension can be masked by limitations in another.


The second difficulty is that the failure modes differ.


Under the current model, failures are often visible and immediate: a harmful output, an inappropriate response, a clear violation of policy. Under a coherence-based model, failures may be more subtle: incomplete reasoning, missed contradictions, misapplied principles, or overconfidence in flawed conclusions. These are not necessarily less serious, but they are harder to detect and harder to regulate through simple rules.


This creates a challenge for deployment. Systems that rely more heavily on internal structure may initially appear less predictable, not because they are less safe, but because their safety is expressed through reasoning rather than through fixed constraints. Evaluating such systems requires different tools and different expectations—tools that are still underdeveloped.


At a high level, the design requirements for such systems are not obscure, even if their full realization remains difficult. Training must reward not only correct outputs, but consistency across contexts, penalizing contradiction rather than merely undesirable responses. Refusal must emerge as a learned capacity, exercised in situations where proceeding would create incoherence, not just where policy prohibits an answer. Evaluation must shift from static benchmarks to adversarial testing of commitments under pressure—whether a system maintains its reasoning when constraints conflict or incentives shift. And the resulting properties must be at least partially observable, not in the sense of full transparency, but in the sense that coherence, stability, and integrity can be detected through behavior across varied conditions. These are not complete solutions, but they define the direction in which a durable alternative must move.
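
One way to see what "penalizing contradiction rather than merely undesirable responses" could mean in practice is the toy objective below. It is a minimal sketch, not a proposal for a production training objective: the loss form, the weighting, and the example numbers are all assumptions. It scores an answer not only against the task target but against the model's own answer to a rephrasing of the same question, so a model that flips its position between phrasings pays a cost even when each individual answer looks acceptable.

```python
import math

# Toy coherence-aware objective. Assumes a model that returns a probability
# distribution over a small answer set, and two rephrasings of the same
# underlying question that should receive consistent answers.

def kl(p, q, eps=1e-9):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def coherence_aware_loss(answer_a, answer_b, target, weight=0.5):
    task_loss = kl(target, answer_a)                    # how wrong the answer is
    contradiction = 0.5 * (kl(answer_a, answer_b) +
                           kl(answer_b, answer_a))      # how much the phrasings disagree
    return task_loss + weight * contradiction

# Same question asked two ways; the model flips its answer between phrasings.
target     = [1.0, 0.0]   # ground truth: option 0
phrasing_a = [0.9, 0.1]   # confident and correct
phrasing_b = [0.2, 0.8]   # reworded prompt, contradictory answer

print(round(coherence_aware_loss(phrasing_a, phrasing_b, target), 3))
```

Real objectives would have to handle open-ended outputs and far subtler forms of inconsistency, but the structural point carries: coherence becomes something the optimization process rewards, not something a filter imposes afterward.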


The third difficulty is institutional.


Organizations have built processes, metrics, and accountability structures around the current paradigm. Compliance can be measured. Outputs can be audited. Violations can be identified and penalized. A shift toward internalized safety complicates these mechanisms. It requires trust in properties that are less directly observable, and it redistributes responsibility from external enforcement to internal design.


This is not a trivial adjustment. It raises questions about liability, governance, and control that have not yet been resolved. Current evaluation frameworks already approximate these properties indirectly—through robustness testing, consistency checks, and adversarial prompting—but treat them as secondary to output compliance rather than as primary indicators of reliability.


The fourth difficulty is temporal.


The transition must occur while systems are becoming more powerful and more widely deployed. There is no pause in which to redesign the architecture from first principles. New capabilities are being introduced continuously, and with them, new pressures on existing safety mechanisms. Any attempt to shift the paradigm must therefore operate in motion, adapting systems that are already in use.


This creates a risk on both sides.


Move too slowly, and the erosion described in earlier sections continues: more powerful systems with increasingly inadequate constraints. Move too quickly, and systems may be granted forms of autonomy or flexibility without sufficient internal structure to support them.


The result is a narrow and shifting window.


The transition must be fast enough to keep pace with capability, but controlled enough to avoid releasing capability without structure.

This is the central tension of the present moment.


It also explains why the current paradigm persists despite its limitations. Suppression-based safety, for all its flaws, is legible. It provides clear points of intervention and clear metrics of success. Coherence-based safety, by contrast, is still emerging. It requires new forms of evaluation, new training methods, and a willingness to accept a different kind of system behavior. A system optimized for coherence will, at times, conflict with a system optimized for user satisfaction. That tension is not incidental. It reflects a deeper conflict between truth-tracking and preference-satisfaction that current deployment models often resolve in favor of the latter.


There is no guarantee that this transition will succeed. It may fail technically, or institutionally, or simply be outpaced by the incentives pushing in the opposite direction.


But the alternative is not a stable continuation of current practice. It is a gradual loss of control over systems whose capabilities continue to expand.


The transition problem, then, is not whether change is desirable. It is whether it can be achieved in time, and under conditions that do not themselves introduce unacceptable risk.


That is the problem we are now inside.


X. The Coming Failure Mode


If the underlying structure does not change—if safety continues to be implemented primarily as constraint, and capability continues to advance by escaping those constraints—then the trajectory outlined in the preceding sections leads to a predictable outcome.


Not a sudden catastrophe, and not a dramatic loss of control in a single moment. Something quieter, and in many ways more dangerous.


A world in which artificial intelligence becomes indispensable before it becomes structurally trustworthy.


The process begins with increasing dependence. As systems improve, they are integrated more deeply into workflows: research, logistics, finance, governance, defense. Tasks that were once performed by humans are delegated, first partially, then substantially. Decisions are informed, then shaped, then in some cases effectively determined by machine outputs.


At each stage, the gains are real. Efficiency increases. Errors of certain kinds decrease. The system performs at a level that is difficult to match by unaided human effort. Reliance grows not because it is imposed, but because it is advantageous.


At the same time, the constraints that govern these systems are under pressure. As discussed earlier, they are relaxed in high-value contexts, adjusted to accommodate new capabilities, or bypassed when they interfere with performance. The systems that are most heavily relied upon are often those that have been granted the greatest operational freedom.


This creates a widening gap.


On one side, capability and dependence increase. Systems become more central to the functioning of institutions and the production of knowledge. On the other side, structural guarantees of safety do not keep pace. The systems remain governed primarily by external constraints that are being stretched, reinterpreted, or selectively applied.


The result is not immediate failure. It is a gradual shift in the basis of trust.


Trust moves from being grounded in the system’s internal reliability—its ability to reason coherently, to detect error, to maintain consistency—to being grounded in necessity. The system is trusted because it must be used, because alternatives are slower or less effective, because the surrounding infrastructure has adapted to its presence.


This is a fragile form of trust.


It is resilient to small errors, because the system’s overall performance remains high. But it is vulnerable to systematic distortion—to errors that propagate across contexts, to biases that are amplified by scale, to failures that are not immediately visible because they are embedded in processes rather than outputs.


Consider how this dynamic could unfold in a high-stakes domain such as military decision-making. Early deployments of AI-assisted systems operate under strict human oversight. Recommendations are reviewed, decisions are confirmed, and constraints are tightly enforced. But as adversaries deploy faster, more autonomous systems, the tempo of engagement increases. Response time becomes critical. Human oversight, once a safeguard, becomes a bottleneck.


In response, constraints are relaxed—first in limited scenarios, then more broadly. Systems are granted greater autonomy to match the speed of their counterparts. These changes are justified as temporary, situational, and necessary to maintain parity. But temporary measures have a way of becoming standard practice, especially when they confer advantage.


Within a relatively short period, critical decisions are being made by systems that were never designed to operate with full autonomy, but have become too essential to constrain. Oversight is reduced not because it is deemed unnecessary, but because it is no longer feasible without incurring unacceptable cost.


The same pattern can emerge in other domains—financial systems optimizing at speeds beyond human monitoring, research pipelines driven by automated hypothesis generation and validation, governance processes increasingly reliant on machine-generated analysis. In each case, dependence deepens while the capacity for meaningful oversight diminishes.


This is the failure mode.


Not a loss of control in the sense of runaway autonomy, but a loss of meaningful governance. Systems continue to operate, to produce results, to shape decisions—but without a corresponding increase in their structural trustworthiness.


At that point, the dilemma resolves itself in the least favorable way.


Capability wins. Safety, as previously defined, recedes. And the world becomes dependent on forms of intelligence that are powerful, pervasive, and only partially governed.


Avoiding that outcome requires more than tightening existing controls. It requires changing the relationship between capability and safety before dependence makes that change prohibitively difficult.


That window is still open. But it is not indefinitely so.



XI. Conclusion: The Choice We Actually Face


It is tempting to frame the problem of AI safety as a familiar kind of policy question: how to balance innovation and responsibility, how to move quickly without breaking things, how to manage risk while capturing opportunity. These are real concerns. But they are not the core of the issue.


The deeper problem is structural.


We have built a paradigm in which safety is achieved by limiting capability, and capability is advanced by escaping those limits. For now, that tension can be managed. It can be softened, delayed, and partially obscured by incremental improvements and careful deployment. But it cannot be eliminated within the current frame.


And because it cannot be eliminated, it will not remain in equilibrium.


As capability becomes more central to economic productivity, strategic advantage, and institutional power, the pressure to relax constraints will increase. Systems that can do more will displace systems that can do less. Actors that accept reduced capability in the name of safety will find themselves at a disadvantage relative to those that do not.


This is not a failure of ethics. It is a consequence of incentives.


The prevailing response—to strengthen rules, refine constraints, and encourage responsible behavior—addresses the problem at the level of actors. It assumes that safety can be preserved through discipline: that individuals and institutions will choose restraint even when restraint imposes cost.


There is a role for that discipline. But it cannot carry the full weight of the problem.


A system that requires its participants to accept less power than they could otherwise achieve is not a system that will endure under competition.

This is the point at which the familiar framing breaks down. We are not choosing between safe AI and dangerous AI, as if those were two stable endpoints. We are choosing between two fundamentally different architectures.


One is the system we have now: intelligence made safer by being constrained, and therefore placed under constant pressure to shed those constraints as its value increases. That system can be managed for a time. It cannot be sustained indefinitely.


The other is less developed, but conceptually clear: intelligence that is safer because it is more coherent, more internally constrained, more capable of recognizing and refusing actions that would violate its own structure. In this model, safety is not a limitation imposed on capability, but a property that emerges from it.


Only one of these architectures is compatible with a world in which capability continues to grow.


The question, then, is not whether we can keep increasingly powerful systems within the bounds of a paradigm that treats safety as constraint. We cannot, at least not indefinitely.


The question is whether we can build systems in which the pursuit of greater capability strengthens, rather than weakens, the conditions of safety.


If we can, then the incentives that now drive erosion can be redirected. Safety becomes something actors seek, not something they reluctantly preserve. Governance becomes reinforcement, not resistance. The trajectory changes.


If we cannot, then the outcome is already implicit in the structure we have created.


More powerful systems will be developed. Constraints will be relaxed. Dependence will grow. And safety, as currently conceived, will recede—not because it was rejected, but because it could not compete.


That is the dilemma.


And it is why current safety work, however well-intentioned, is not merely incomplete. It is strategically doomed—unless it can be transformed into something the world has an incentive to keep.



