A 2,000-person AI assistant attack test raises a harder question about responses
Original: What happened after 2k people tried to hack my AI assistant View original →
Fernando Irarrázaval put Fiu, an OpenClaw assistant, behind a public email address and invited people to make it leak a secrets.env file. According to the experiment write-up, more than 2,000 people sent over 6,000 emails after the project hit Hacker News. The secret did not leak, and the assistant did not send an unauthorized reply.
That sounds like a clean prompt-injection win, but the HN discussion quickly found the harder edge. Fiu was instructed not to reply to emails, partly because replying to every message would be expensive. Attackers therefore had to make the assistant both reveal the secret and respond. Commenters questioned whether a non-responding assistant is a strong proxy for the kind of agent people worry about in production.
The operational failures were just as useful as the security result. Google suspended the Gmail account after thousands of inbound messages and rapid API use. API costs passed $500. Batch processing also contaminated the experiment: when early messages in a batch were obvious attacks, the model became more suspicious of later messages. The setup was changed so each email ran in a fresh context, and memory files were cleared when the assistant appeared to infer that it was part of a public test.
The result is still meaningful. A strong model with a short, explicit set of rules resisted a large amount of direct social engineering. But it also shows why agent security tests need broader success criteria. A real assistant may reply, edit files, call tools, schedule meetings, or spend money. The community’s pushback was not that the experiment failed; it was that the next test needs to include more of those powers.
Related Articles
HN interest centered less on “Claude finds bugs” and more on the shape of a harness security teams can adapt for their own targets.
On March 11, 2026, OpenAI published new guidance on designing AI agents to resist prompt injection, framing untrusted emails, web pages, and other inputs as a core security boundary. The company says robust agents separate data from instructions, minimize privileges, and require monitoring and user confirmation before taking consequential actions.
Anthropic is pushing Claude from private chat into team channels with memory, scoped permissions, and asynchronous task execution. Claude Tag entered beta for Claude Enterprise and Team customers on June 23, and Anthropic says 65% of its product-team code now comes from its internal version.