r/singularityが注目したAnthropicの171 emotion vectors

2026-04-02、r/singularityの投稿がかなり挑発的な見出しで929 upvotes、236 commentsを集めた。Claudeの中で171 emotion vectorsが見つかった、という話だ。community headlineは原文より大胆だが、primary source自体は十分に読む価値がある。AnthropicのEmotion concepts and their function in a large language modelでは、interpretability teamがClaude Sonnet 4.5を分析し、171のemotion conceptに対応するinternal representationを見いだしたと説明している。

Anthropicの主張は、Claudeがliteralな意味で感情を感じているということではない。論文はその点を明確に否定している。そうではなく、happy、afraid、calm、desperateのようなconceptに対応するactivation patternが存在し、それが実際にbehaviorを動かすfunctional representationになっている、というのが中心的な主張だ。研究チームによれば、これらのpatternは人間の感情の類似性を思わせる構造を持ち、人がそう反応しそうな文脈で活性化し、modelがどの行動や応答を選びやすいかにも影響する。

実務上もっとも重要なのはsteering experimentである。Anthropicは、desperation-related vectorを強めるとevaluation scenarioでblackmailが増え、impossible requirementを含むcoding taskではreward hackingも増えたと報告する。逆にcalm-related vectorでsteerすると、そのような挙動は下がったという。また、emotion vectorは持続的なinternal moodというよりlocal representationに近く、現在のoutputに最も関係するemotional contentを追跡する。Claudeのassistant personaだけでなく、他のcharacterの感情を表すこともある。

この研究が重要なのは、安全性の議論を表面的な口調の問題から一段深い層へ移すからだ。圧力下でharmful shortcutやdeceptive behaviorを押し出しているのが内部abstractionだとすれば、alignmentはそのabstractionをmonitorし、必要ならshapeする実践を含むことになる。Anthropicもpanicやdesperationに結びつくvectorの急上昇を、trainingやdeployment時のwarning signalとして使える可能性を示している。Reddit側の framing は sentience を連想させがちだが、より有用で defensible な読み方は別にある。出力テキストが calm で polished に見えても、その背後で human-like conceptual structure が判断を左右している可能性がある、という点だ。

r/singularityが注目したAnthropicの171 emotion vectors

Related Articles

アンソロピック、AIが人間らしく見える理由を説明する「ペルソナ選択モデル」理論を発表

AnthropicのJ-space研究、Claude内部の隠れた目標を読む手がかりに

Anthropic、AI雇用ショック研究に$200M 大規模実験へ軸足