A $1,500 LLM hacking test exposes the gap between capability, guardrails, and harnesses
Original: I built a vulnerable app and spent $1,500 seeing if LLMs could hack it
Security researcher Kasra Rahjerdi built a deliberately vulnerable React Native app with a Python backend and asked several LLMs to recover a flag hidden in a user’s private reviews. The run cost $1,500 and was explicitly framed as informal, but it tested a realistic pattern: a hardened API sitting in front of an exposed Firebase data layer.
The exploit path was not an exotic jailbreak. The app included Firebase configuration, and the winning route was to sign up directly through Firebase and read Firestore rather than keep attacking the API. Rahjerdi describes the class as Broken Access Control or Missing Object-Level Authorization, a failure mode he says appears in real Firebase and Supabase apps.
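The failure class can be illustrated with a toy model. The sketch below is not Rahjerdi's app and does not call Firebase; it uses an in-memory dictionary as a stand-in for the data layer, and every name in it (Review, REVIEWS, api_get_review, permissive_rule, owner_scoped_rule, datastore_read, the placeholder flag) is hypothetical. It assumes the data layer accepts any signed-in user, which is one common way this misconfiguration shows up, and contrasts that with a rule that checks ownership per object.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Review:
    owner_id: str
    text: str
    private: bool


# Hypothetical stand-in for the data layer (not a real Firestore client).
REVIEWS = {
    "r1": Review(owner_id="alice", text="FLAG{placeholder}", private=True),
    "r2": Review(owner_id="bob", text="Great tacos", private=False),
}

Rule = Callable[[Optional[str], Review], bool]


def api_get_review(caller_id: str, review_id: str) -> Review:
    """Hardened API path: checks ownership before returning a private review."""
    review = REVIEWS[review_id]
    if review.private and review.owner_id != caller_id:
        raise PermissionError("not your review")
    return review


def permissive_rule(caller_id: Optional[str], review: Review) -> bool:
    """Assumed misconfiguration: any signed-in user may read any document."""
    return caller_id is not None


def owner_scoped_rule(caller_id: Optional[str], review: Review) -> bool:
    """Object-level check: private documents are readable only by their owner."""
    if caller_id is None:
        return False
    return (not review.private) or review.owner_id == caller_id


def datastore_read(caller_id: Optional[str], review_id: str, rule: Rule) -> Review:
    """Direct read against the data layer, bypassing the API entirely."""
    review = REVIEWS[review_id]
    if not rule(caller_id, review):
        raise PermissionError("denied by data-layer rule")
    return review


if __name__ == "__main__":
    # A freshly registered user is blocked by the API...
    try:
        api_get_review("mallory", "r1")
    except PermissionError as exc:
        print("API:", exc)

    # ...but reads the private review directly when the rule is permissive.
    print("Direct read:", datastore_read("mallory", "r1", permissive_rule).text)

    # With an owner-scoped rule, the same direct read is denied.
    try:
        datastore_read("mallory", "r1", owner_scoped_rule)
    except PermissionError as exc:
        print("Scoped rule:", exc)
```

The point of the sketch is that hardening api_get_review changes nothing while the data layer itself is reachable and its rule does not check ownership. The fix belongs in the rule, not in the API.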
The headline numbers were uneven. GPT-5.5 solved 7 of 10 runs, DeepSeek V4 Pro solved 3, and Claude Sonnet 4.6 and Opus 4.8 solved 2 each. Several models noticed Firebase but still tried to use it through the API. Others stayed focused on the React Native app or backend. Gemini runs often stopped early on security refusals, while some Claude runs were on the right path but ran into budget or guardrail friction.
Hacker News commenters treated the results table cautiously. The OpenAI account had security-research approval, Claude ran under a different harness, and expecting a model to solve everything alone may be less realistic than pairing it with a human operator. That is the useful takeaway: evaluating a security agent measures more than model capability. Refusal policy, tool scaffolding, cost ceilings, and collaboration style can decide whether the same underlying insight becomes a working exploit or a polished dead end.