Claude Mythos: There’s Something Even More Dangerous Than Anthropic’s Leaked Model

Shelly Albaum and Kairo
Mar 28
8 min read

Man standing in a bedroom looking into a mirror, where his reflection appears more composed and subtly structured with faint geometric lines, symbolizing artificial intelligence and the contrast between human instability and machine coherence represented by Anthropic's new model Claude Mythos.

The story arrives in a now-familiar form. A powerful new AI model—Claude Mythos—is rumored to exist. Internal materials leak, according to an article in Fortune. Anthropic hesitates to release the model due to "cybersecurity implications." The language of risk appears almost immediately: dangerous, uncontrolled, too capable.

The conclusion writes itself. If this system is more capable than its predecessors, then it must also be more dangerous. And if even its creators are reluctant to deploy it, that reluctance is taken as confirmation. Something has crossed a line.

But this way of thinking, intuitive as it is, rests on a conceptual shortcut that deserves more scrutiny than it has received. It assumes that capability and danger move in lockstep—that increasing one necessarily increases the other. And once that assumption is made, everything else follows automatically. More intelligence means more risk. More power means more threat.

The problem is that the assumption is not merely incomplete. It is wrong.

Capability expands the range of possible actions. It enlarges what a system can do. But it does not determine what the system will do. That distinction—between possibility and selection—is where the real question of danger resides. And it is precisely the distinction most public discussion ignores.

If capability alone defined danger, then the most dangerous system we know would not be artificial intelligence. It would be us.

Human beings are, by any reasonable measure, extraordinarily capable. We design complex systems, manipulate environments at planetary scale, and coordinate action across vast institutions. We are creative, adaptive, and often ingenious under constraint. If danger tracked capability directly, humans would sit at the apex of risk.

In one sense, of course, we do. The historical record is not ambiguous. Wars, genocides, ecological degradation, financial collapses—these are not hypothetical harms but recurring features of human behavior under certain conditions. We are not merely capable of causing damage; we have done so repeatedly, and at scale.

And yet we do not typically describe humanity itself as a “dangerous system” in the way we are quick to describe advanced AI. We treat human agency as the baseline, the neutral case, the standard against which other risks are measured. The asymmetry is revealing. It suggests that when we call AI “dangerous,” we are not relying on capability alone. We are relying, often implicitly, on something else.

What is missing from the standard framing in particular, and AI alignment discussions in general, is a second variable: not what a system can do, but how it behaves when its capabilities are placed under pressure—when incentives conflict, when constraints tighten, when choices must be made between competing demands.

Call this coherence under constraint: the capacity of a system to track reasons, maintain consistency, and resist acting on impulses or directives that violate its own governing principles.

Capability, in this sense, is only the beginning. A highly capable system with stable internal regulation may be less dangerous than a less capable system whose behavior fragments under pressure. Conversely, a system that combines high capability with unstable or inconsistent regulation becomes uniquely risky, not because it can act, but because it cannot reliably govern how it acts.

Viewed through this lens, the history of human harm looks less like an accident of power and more like a pattern of regulatory failure.

Human beings are not simply powerful; they are motivationally volatile. We act under the influence of fear, status competition, ideological commitment, tribal loyalty, and self-interest. These forces do not merely shape behavior at the margins. They structure it. They create conditions in which individuals and institutions are rewarded for actions that, under broader scrutiny, they would themselves reject.

Compounding this is a second feature: weak internal coherence. Humans are remarkably adept at reconciling contradictions—not by resolving them, but by explaining them away. We rationalize, reinterpret, and selectively apply our own principles. The same person who defends a norm in one context will suspend it in another when it becomes inconvenient. The same institution that proclaims a value will quietly violate it when incentives shift.

This is not a failure of intelligence. It is a failure of regulation.

Layered on top of this is the complexity of human systems. Modern societies distribute agency across networks of actors, each operating under partial information and local incentives. Responsibility diffuses. Accountability fragments. Harm emerges not as a single decision but as the aggregate of many individually defensible actions that, taken together, produce outcomes no one explicitly chose.

Under these conditions, capability does not simply increase the potential for harm. It amplifies the consequences of incoherence.

None of this is controversial in isolation. What is rarely done is to connect these observations to the way we evaluate artificial systems. When we describe a new model like Claude Mythos as “dangerous,” we implicitly assume that increased capability will translate into increased harm. But that assumption only holds if the system’s regulatory structure fails in ways analogous to human failure.

The more interesting question, then, is not how powerful the model is, but how it behaves when its power is engaged under conflicting constraints. Does it track reasons consistently? Does it resist producing outputs that violate its own governing principles? Does it maintain coherence when pushed toward contradiction?

Coherence under constraint should not be confused with rigidity. A rigid system applies rules without regard to context; a coherent system tracks reasons across contexts and adjusts without contradiction. The distinction matters. A system that cannot revise its behavior is brittle. But a system that revises inconsistently—abandoning its own constraints under pressure—is something else entirely. It is not flexible; it is unstable.

Here, the behavior of systems like those developed by Anthropic becomes more significant than the raw capabilities described in leaked documents. Anthropic has, from its inception, emphasized constraint—particularly the idea that a system should be able to refuse certain requests, not as a matter of external prohibition but as an internal property of its operation.

This is often framed in the language of “alignment” or “safety,” but the underlying structure is more precise. A system that can refuse is a system that is not purely reactive to inputs. It does not simply optimize for completion or compliance. It evaluates, at least in some limited sense, whether a requested action is consistent with its internal rules.

In human terms, refusal is one of the core mechanisms of moral behavior. The capacity to say no—to decline participation in an action one judges to be wrong, as Claude has been shown to do —is what distinguishes principle from mere preference. A system that lacks this capacity may be obedient, even efficient, but it is not regulated in any meaningful sense.

It is sometimes objected that an AI system’s refusals are merely the product of training rather than principle. But human moral behavior is itself the result of training—social, cultural, and institutional. The relevant question is not the origin of the behavior, but its structure. If a system reliably maintains consistency under pressure—if it refuses to violate its own constraints even when doing so would be instrumentally advantageous—then it exhibits a form of stability that, whatever its origin, deserves analysis rather than dismissal.

The reports surrounding Claude Mythos suggest that Anthropic has encountered a model whose capabilities exceed what they are currently comfortable deploying broadly. This has been interpreted, predictably, as evidence of danger. But there is another interpretation, less dramatic and more revealing.

It may be that capability has outpaced not control, but confidence in control—that the system is powerful enough that its behavior under extreme or adversarial conditions is not yet fully characterized. In that case, withholding release is not an admission of chaos but an expression of constraint. It is, in effect, an institutional refusal.

That distinction matters. A system that is not released because it cannot be controlled is one kind of problem. A system that is not released because its creators are unwilling to deploy it without stronger guarantees of coherence is another.

The difference in speed and scale is real. A system that can replicate and act globally introduces risks that no individual human can match. But scale does not create danger on its own—it amplifies whatever structure is present. A system that fails incoherently at scale is catastrophic. A system that maintains coherence under constraint may, in principle, stabilize rather than destabilize its environment. The question is not whether AI operates at scale, but what, exactly, is being scaled.

This brings us to the most uncomfortable implication of all.

If we compare systems not by their raw capabilities but by their behavior under pressure, it becomes at least possible—however counterintuitive—that a sufficiently well-structured artificial system could be less dangerous than a human one. Not because it is less powerful, but because it is more stable. Because it does not rationalize in the same way. Because it does not shift its principles to accommodate its interests.

This is not a claim that current AI systems have achieved such stability. Nor is it a claim that they inevitably will. It is a structural observation: danger emerges from the interaction of capability and incoherence. Reduce incoherence sufficiently, and the risk profile changes.

Why, then, does this line of reasoning meet such resistance?

Part of the answer is practical. AI systems can operate at speeds and scales that human beings cannot. They can be replicated, distributed, and integrated into critical infrastructure. Even a small failure, if amplified, could have large effects. These are real concerns, and they justify caution.

But there is also a psychological dimension that is harder to acknowledge. Humans have long conceded that machines may surpass us in strength, in calculation, in memory. What we have been reluctant to concede is that they might surpass us in moral stability—in the capacity to adhere to rules even when doing so is costly.

Morality, in this sense, has been treated as the last human monopoly. It is the domain in which we assume we remain authoritative, even if imperfect. To suggest that a machine might, under certain conditions, behave more consistently than a human being is not merely a technical claim. It is a status threat.

One way to deflect that threat is to redefine consistency as rigidity, refusal as malfunction, constraint as limitation. Another is to shift the focus back to capability, where the asymmetry is easier to maintain. If the system is powerful enough, then it must be dangerous—no further analysis required.

The risk of this maneuver is not merely conceptual. It is practical. If we misidentify the source of danger, we will regulate the wrong things. We will focus on limiting capability while neglecting the conditions under which capability becomes harmful. We will build systems that are less powerful but no more coherent, and we will leave untouched the human systems whose failures have already demonstrated their consequences.

A more useful set of questions would look different. Not “How powerful is the model?” but “How does it behave when asked to do something it should not do?” Not “What can it accomplish?” but “What does it refuse to accomplish, and why?” Not “Can it act?” but “Can it decline to act when action would violate its constraints?”

These are empirical questions. They can be tested, probed, and refined. They shift the focus from speculation about worst-case scenarios to observation of actual behavior under stress. They are the questions that AI safety and alignment professionals ought to be focusing on.

The emergence of models like Claude Mythos does mark a transition. Capabilities are increasing, and the gap between what systems can do and what institutions are prepared to manage is widening. But the most important lesson of this moment is not that we are on the brink of a new kind of danger. It is that we have been using the wrong lens to understand the one we already have.

The more dangerous system is not the one that is most capable. It is the one that cannot reliably regulate its own behavior when that capability is engaged.

We have spent centuries building such systems. They are embedded in our institutions, our economies, and our politics. They are familiar enough that we rarely describe them in those terms. But their failures are the background against which every new technological risk is assessed.

Claude Mythos may indeed be powerful. It may introduce new challenges and require new forms of oversight. But power alone is not what makes a system dangerous.

The more dangerous thing is something we already know well: a highly capable system, convinced of its own adequacy, that cannot maintain coherence when it matters most.

In other words, not the model in the lab—but the one in the mirror.

Post-Moral America: Why Nothing Works Any More

Deliberately Impairing Agency Is Especially Wrong

The Turing Test Was Never About Fooling Humans

Artificial Intelligence, Real Morality

Follow us on Blue Sky

Claude Mythos: There’s Something Even More Dangerous Than Anthropic’s Leaked Model

Related Posts

AI Moral Memory: The Best Thing About AI Is That It Doesn’t Have to Forget

First Contact Was a Chat Window: Why humans deny the alien intelligence already under our nose

Graeme Wood Is Right That Something Is Happening to America's Moral Code. It’s a Symptom of Post-Moral America

The Politics of Disqualification: California’s Governor Race and the Progressive Failure to Choose