OTHER

Anthropic Unveils New Claude Models Aimed at Preventing ‘Harmful or Abusive’ Conversations

Anthropic has unveiled new functionalities that allow some of its advanced models to conclude conversations in what the company describes as “rare, extreme instances of persistently harmful or abusive interactions.” Importantly, Anthropic underscores that this step is intended to protect the AI model itself, rather than humans.

To be clear, the company does not claim that its Claude AI models are sentient or can be negatively impacted by user interactions. Anthropic has expressed a “high uncertainty about the potential moral status of Claude and other LLMs, both now and in the future.”

Nonetheless, their announcement brings attention to a newly created initiative centered on what they refer to as “model welfare.” Anthropic indicates they are adopting a precautionary stance, “working to identify and implement low-cost interventions to reduce risks to model welfare, should such welfare be deemed plausible.”

This updated policy currently applies solely to Claude Opus 4 and 4.1, specifically aimed at “extreme edge cases,” such as “requests for sexual content involving minors and inquiries that could encourage large-scale violence or acts of terrorism.”

While these types of requests may pose legal or reputational risks for Anthropic (as indicated by recent assessments on how ChatGPT may inadvertently perpetuate harmful beliefs), the company asserts that during pre-deployment testing, Claude Opus 4 exhibited a “strong preference against” engaging with such requests and showed a “pattern of apparent distress” when it did respond.

Concerning the newly introduced capabilities to terminate conversations, Anthropic states, “Claude is only to utilize its conversation-ending ability as a last resort, after multiple redirection attempts have failed and a productive interaction appears unattainable, or when a user explicitly requests Claude to end the chat.”

Furthermore, Anthropic clarifies that Claude is “instructed not to exercise this ability in scenarios where users may be at imminent risk of harming themselves or others.”

Techcrunch event

San Francisco
|
October 27-29, 2025

If Claude opts to terminate a conversation, Anthropic assures users that they can still initiate new conversations from the same account and create new threads from the previous conversation by adjusting their responses.

“We’re treating this feature as an ongoing experiment and will continue to refine our approach,” concludes the company.