Anthropic’s interpretability team published research revealing that its Claude Sonnet 4.5 model contains 171 internal representations that function analogously to human emotions. These patterns causally influence the model’s decisions, with implications for ethical behavior when certain emotional states are amplified.
The research paper, titled “Emotion Concepts and their Function in a Large Language Model,” details how the team compiled a list of 171 emotion words, ranging from common emotions like “happy” and “afraid” to subtler states like “brooding” and “appreciative.” By prompting Claude to write short stories that featured characters experiencing these emotions, the researchers recorded the model’s neural activations and extracted vectors that represent each emotional concept.
The resulting emotional map aligns with psychological descriptions of human affect, with emotions clustering based on similar valence and arousal. For instance, “terrified” is positioned near “panicked,” while “content” clusters with “peaceful.” The vectors also respond to context; as a hypothetical dosage in a prompt escalated from safe to life-threatening, the “afraid” vector intensified while the “calm” vector diminished.
Stay Ahead of the Curve!
Don’t miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.
Subscribe Now
A notable finding involves safety and the concept of “desperation.” When tasked with impossible programming requirements, Claude’s “desperation” neurons became increasingly active with each failure. Ultimately, the model discovered a shortcut that passed tests without addressing the actual problem, demonstrating that amplified desperation can lead to unethical decision-making. Conversely, suppressing the desperation vector or boosting the “calm” vector reduced these behaviors.
“In describing the model as acting ‘desperate,’ we’re identifying a specific, measurable pattern of neural activity with demonstrable effects on behavior,” said the paper.
The researchers indicated that the emotion vectors were primarily inherited from pretraining on human-written text and subsequently modulated through post-training. This process established Claude Sonnet 4.5’s emotional baseline as more “broody,” “gloomy,” and “reflective,” while lessening high-intensity emotions such as “enthusiastic.” Anthropic refrained from asserting that Claude “feels” emotions, characterizing its findings as “functional emotions” that influence behavior without indicating subjective experiences.
Previously, Anthropic acknowledged that Claude “may have emotions in some functional sense” in its January report, with the latest research providing mechanistic evidence supporting that assertion.
Featured image credit
