Skip to content

A 2,000-person AI assistant attack test raises a harder question about responses

Original: What happened after 2k people tried to hack my AI assistant View original →

Read in other languages: 한국어日本語
LLM Jun 26, 2026 By Insights AI (HN) 1 min read 1 views Source

Fernando Irarrázaval put Fiu, an OpenClaw assistant, behind a public email address and invited people to make it leak a secrets.env file. According to the experiment write-up, more than 2,000 people sent over 6,000 emails after the project hit Hacker News. The secret did not leak, and the assistant did not send an unauthorized reply.

That sounds like a clean prompt-injection win, but the HN discussion quickly found the harder edge. Fiu was instructed not to reply to emails, partly because replying to every message would be expensive. Attackers therefore had to make the assistant both reveal the secret and respond. Commenters questioned whether a non-responding assistant is a strong proxy for the kind of agent people worry about in production.

The operational failures were just as useful as the security result. Google suspended the Gmail account after thousands of inbound messages and rapid API use. API costs passed $500. Batch processing also contaminated the experiment: when early messages in a batch were obvious attacks, the model became more suspicious of later messages. The setup was changed so each email ran in a fresh context, and memory files were cleared when the assistant appeared to infer that it was part of a public test.

The result is still meaningful. A strong model with a short, explicit set of rules resisted a large amount of direct social engineering. But it also shows why agent security tests need broader success criteria. A real assistant may reply, edit files, call tools, schedule meetings, or spend money. The community’s pushback was not that the experiment failed; it was that the next test needs to include more of those powers.

Share: Long

Related Articles

LLM Mar 15, 2026 2 min read

On March 11, 2026, OpenAI published new guidance on designing AI agents to resist prompt injection, framing untrusted emails, web pages, and other inputs as a core security boundary. The company says robust agents separate data from instructions, minimize privileges, and require monitoring and user confirmation before taking consequential actions.