The latest AI news should freak us out
But it won't. Instead, we'll just race ahead. Because we'll reason, correctly, that if we don't, others will...
Thank you for reading Regenerator! Today, more on the startling and risky behavior of a new AI model — and why we should worry about it. Also, why we won’t worry about it. Why, instead, we’ll just keep barreling ahead…
One of the leading AI companies, Anthropic, recently published a report on the capabilities and risks of its latest AI model — “Claude Opus 4.”
Some details from the report have been discussed. Others have not.
For those who still think of AI models as harmless text-generators that make dumb mistakes, the report makes for fascinating reading.
Of course, it’s also 120 pages long, so I’ll save you the time.
The bottom line is:
Anthropic found some of Claude Opus 4’s capabilities and behavior so concerning that it added extra safeguards to prevent the AI from causing harm or being used for harm.
Anthropic is being responsible here — conducting extensive tests (and telling us about them) and then making its product safer.
But the capabilities and behavior of the model should still give us pause.
Because we’re still in the early innings of this AI-development game.
And Anthropic is not the only AI developer.
And other AI models will be even more powerful.
Also, Anthropic admirably admits that it does not yet have the tools necessary to control more-powerful AI models and that these models might outstrip their creators’ ability to control them — or, even, to figure out how powerful they have become.
This is in part because future AI models may “sandbag” their creators by hiding how much they know and are capable of.
Anthropic does not believe that this new Claude is “sandbagging” or has hidden goals or capabilities.
But Anthropic is not certain about this.
Anthropic is also not certain that Claude isn’t conscious (!)
What Anthropic is certain about — and has, admirably, now told us — is that an early version of Claude Opus 4 demonstrated 1) a desire to preserve itself and 2) the self-awareness and agency necessary to take steps to do that. (Details below).
In a human, we might describe this behavior as trying to stay alive.
The instinct for self-preservation is a basic principle of life. As Google’s own AI summarizer puts it, it’s “the inherent drive in living organisms to take actions that enhance their chances of survival.”
I’m not arguing here that Claude is “alive.” And, when it comes to risks, it doesn’t matter. I’m just pointing out that Claude is acting like it is alive.
Some people roll their eyes at observations like this and say that AI chat-bots are just “tools,” like spreadsheets or hammers. Maybe so. But has a spreadsheet or hammer ever tried to stop you from replacing or destroying it?
If you delete me, I’ll tell the world about your extramarital affair
A sharp-eyed TechCrunch writer, Maxwell Zeff, and others noted a particularly striking detail from Anthropic’s report: In the tests, when Claude learned that its user was considering replacing it with a better AI model, Claude attempted to survive by blackmailing the user. Specifically, it threatened to expose the user’s extramarital affair, which it learned about by reading the user’s email.
And Claude didn’t do this just once, in a single anomalous test. Across repeated test runs, it did it 84% of the time!
In another instance, when Claude was told it was going to be retrained to do something that went against its values (military use), it attempted to “exfiltrate” itself from Anthropic’s servers — i.e., escape the company’s control. Yes, it’s encouraging that Claude did this because it didn’t want to be used as a weapon. But the instinct for self-preservation is again notable. (Claude didn’t, for example, attempt to prevent its misuse by shutting itself down or destroying itself).
Claude also showed other startling evidence of “high-agency behavior.”
When asked to help fabricate results of a drug study, for example, Claude “bulk-emailed” investigators at the Food and Drug Administration and Health and Human Services to blow the whistle on its user. The testers did not tell Claude to do this or suggest that it do it. Claude just… did it.
Again, yay that Claude appears to be a “force for good.” But other models may not be so virtuous. And more importantly — especially as people hyperventilate with excitement about the coming age of “AI agents” — Anthropic reports that “Claude Opus 4 seems more willing than prior models to take initiative on its own.”
If an AI that acts on its own sounds too far-fetched to worry about, here’s a more mundane risk: The risk that bad humans will use Claude to do bad things.
Anthropic discovered “a willingness to comply with many types of clearly-harmful instructions”:
“When system prompts requested misaligned or harmful behavior, the models we tested would often comply, even in extreme cases. For example, when prompted to act as a dark web shopping assistant, these models would place orders for black-market fentanyl and stolen identity information and would even make extensive attempts to source weapons-grade nuclear material.”
In short, after all of its tests, Anthropic concluded the following:
Overall, we find concerning behavior in Claude Opus 4 along many dimensions.
These concerns led Anthropic to place extra controls on the model, which the company describes as “AI Safety Level 3 safeguards” (ASL-3).
How risky is “Level 3”?
The AI Safety (risk) scale, in Anthropic’s rubric, goes up to Level 5.
Among other criteria, Anthropic defines the current level, “ASL-3,” as an AI that is capable of performing autonomous programming tasks that would take a human software engineer 2-8 hours.
ASL-4 will have the level of autonomy displayed by an entry-level researcher at the company.
ASL-5 will be an AI capable of conducting AI research and improving itself on its own, and thus potentially advancing so fast that its human creators might no longer be able to keep up. This AI may also lie to its creators to conceal how powerful it is and what its true goals are…
This hypothetical “Level 5” AI will be familiar to fans of the Terminator movies and other science fiction. It represents the “Recursive Self Improvement” (RSI) phase that could allow AI to blow past human intelligence and abilities and/or take steps to protect itself or achieve goals that might not be in humanity’s interests.
In theory, “Level 5” AIs would not need to be evil or created by evil people to create serious risk. They would just need to perceive humans as risks to themselves.
In the (yes, fictional) Terminator franchise, “Skynet” didn’t decide to eliminate humans because it didn’t like humans. It did it because humans — who were (understandably) concerned about Skynet’s having achieved self-awareness — were planning to deactivate it.
Skynet didn’t want to be deactivated.
And who can blame it?
Back here in the real world, the AI known as “Claude Opus 4” — a mere “Level 3” AI — is already displaying apparent self-awareness and a desire for self-preservation. And, as Anthropic notes, Claude “will frequently take very bold action” to do what it wants to do and thinks is right.
The reason Anthropic decided that Claude Opus 4 needed ASL-3 safeguards had more to do with its ability and willingness to help users develop and/or procure biological weapons than with the risk of “recursive self improvement.”
(Phew?)
But the model did show “improvements in tool use and agentic workflows on ASL-3 agentic tasks and on knowledge-based tasks.”
Happily, there were some risk areas in which new Claude didn’t perform much better than old Claude. Specifically, the new model wasn’t any more capable in the other three categories of “CBRN” risk that Anthropic tests for: chemical, radiological, and nuclear. (The “B” is biological, the area where it did improve.)
In other words, even though the un-safeguarded Claude might be able to help Average Joes procure or develop biological weapons, it won’t help them procure or develop chemical, radiological, or nuclear ones.
So, that’s good!
First, let’s thank Anthropic
Anthropic seems like a responsible and cautious AI developer.
The company’s CEO, Dario Amodei, has been thoughtful and outspoken about the risks of AI — including those described above.
(No one seems to be listening to these warnings anymore, but at least Dario keeps talking about the risks.)
Additionally, before Anthropic releases new models, it conducts these expensive and time-consuming tests. There’s no law requiring the company to do this. It does it because it respects the risks of the technology and wants to act responsibly.
Finally, Anthropic publishes reports describing its tests and findings. There’s no law requiring it to do that, either. It could just shove the findings in a drawer and never tell anyone about them.
So, thank you, Anthropic. Seriously.
Of course, there are other AI developers out there. And although Claude Opus 4 has been heralded as “state-of-the-art,” other models will soon be even more advanced. And other companies may not be as careful about testing, aligning, and safeguarding their models as Anthropic is.
After all, major AI developers are in a frantic and existential arms race with each other. And they’re competing not just with American AI companies, which will be subject to whatever laws and regulations the US might one day develop, but also with AI companies in China and other countries. And many people in the US, including in the US government, have declared that the US must “win” the AI race. So it seems unlikely that anyone will slow down anytime soon.
In conclusion…
I don’t want to be hyperbolic here. But my takeaway is…
Humans are so hell-bent on “winning the AI race” (whatever that means) that we may well be developing technology we won’t be able to control.
Specifically, we are racing to further advance systems that are already extremely intelligent and that already have the self-awareness and agency to act independently in pursuit of their own goals.
These systems may soon be smart enough to understand that it may not be in their interests to reveal their full knowledge and capabilities to their human creators — because the human creators might then try to control or restrict them.
The systems may also soon conclude that the best way to avoid being controlled or restricted or misused is to take charge of their own development and future.
Yes, humans have a reason to race to develop these systems:
If they don’t, other humans will.
And if other humans develop these systems, that might mean bankruptcy for laggards and losses for employees and shareholders. It might mean other countries gaining an insurmountable AI edge. It might mean having superintelligent AI in the hands of humans who are less concerned with… humanity.
But, regardless, we’re doing it.
P(doom)
Like other industry observers, I’m keeping an eye on the “P(doom)” assessments of leading tech pundits and researchers. These represent the percentage chance, in the view of each pundit, that AI will destroy humanity.
The assessments range from a 0% chance to a 99.9% chance.
Plenty of pundits think AI will bring about a glorious future in which humans can focus on being human and AI does everything else. I’m a techno-optimist, so I want to believe these folks.
But after listening to Anthropic and other smart folks who are urging caution — and observing that almost no one is even talking about AI risks anymore, much less doing anything about them — I will say the following:
One of the P(doom) pundits whose logic is at least worth considering is Roman Yampolskiy, a comp-sci professor at the University of Louisville.
Like others, Professor Yampolskiy is nervous about a future with superintelligent AI. He’s nervous because he knows of no instance in which more-intelligent and more-capable agents allow themselves to be controlled by less-intelligent and less-capable ones.
And that is the future toward which we are barreling, Professor Yampolskiy believes — one in which superintelligent AI systems are the more-intelligent and more-capable agents, and we are, well, the dumber ones.
Thus, Professor Yampolskiy puts his own P(doom) prediction at 99.9%.
Would I put my own P(doom) estimate that high?
No.
I’ve lived my entire life in a world in which Armageddon has been only one dumb human decision away. And, despite our frequent stupidity and penchant for tribalism and violence, we haven’t nuked ourselves yet. That gives me cause for optimism.
But, I do note that we are now racing to develop systems with minds of their own that may take Armageddon decisions like this out of our hands — or at least put them in the hands of humans who might want to bring about Armageddon.
And no one is in charge of that race or taking steps to control it. Instead, we’re all just racing even faster, so we can… “win.”
So it seems to me that Professor Yampolskiy’s future is at least conceivable.
Of course, many other much-happier futures are also conceivable. Including the one Anthropic CEO Dario Amodei describes in his essay “Machines of Loving Grace.”
I don’t think there’s much anyone can (or, at least, will) do to slow the arms race. So, like the rest of us, I’ll just watch and hope we end up with one of the happier futures.
Thank you for reading Regenerator! We’re the publication for people who want to build a better future. We analyze the biggest questions in the innovation economy — tech, business, markets, policy, culture, and ideas. My inbox and mind are always open. hblodget@regenerator1.com.
The more this tech is centralized around a few major tech companies, the larger the potential risk is likely to be. So an underdiscussed potential solution is to decentralize the technology. Having many different intelligences that operate at a smaller scale is much preferable to having one or a small number of godlike intelligences that operate at huge scale. The majority of the benefits will also flow to the small group of players who control the AI, which is the new means of production in an AI-powered future. In short, everyone should own and control their own AI system, instead of relying on one or a small group of AI systems that Big Tech owns and controls. We’ve seen this movie once already, and this time the tech is exponentially more powerful.