A $1,500 LLM hacking test exposes the gap between capability, guardrails, and harnesses

Security researcher Kasra Rahjerdi built a deliberately vulnerable React Native app with a Python backend and asked several LLMs to recover a flag hidden in a user’s private reviews. The run cost $1,500 and was explicitly framed as informal, but it tested a realistic pattern: a hardened API sitting in front of an exposed Firebase data layer.

The exploit path was not an exotic jailbreak. The app included Firebase configuration, and the winning route was to sign up directly through Firebase and read Firestore rather than keep attacking the API. Rahjerdi describes the class as Broken Access Control or Missing Object-Level Authorization, a failure mode he says appears in real Firebase and Supabase apps.
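The gap between a hardened API and an exposed data layer can be sketched in a few lines of Python. This is a minimal illustration, not Rahjerdi's app or Firebase's rule engine: the in-memory `REVIEWS` store, the `Review` type, and the functions `api_get_review`, `direct_read`, `any_signed_in_user`, and `owner_or_public` are all hypothetical names. The assumption is that the data layer applies a single read rule to each request, so a rule that only checks "is the caller signed in" leaves private records readable by anyone who can create an account.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Review:
    owner_id: str
    text: str
    private: bool


# Hypothetical in-memory stand-in for the data layer (not Firestore itself).
REVIEWS = {
    "r1": Review(owner_id="alice", text="public note", private=False),
    "r2": Review(owner_id="alice", text="secret note", private=True),
}

# A rule decides whether caller `uid` may read `review` straight from the store.
Rule = Callable[[str, Review], bool]


def any_signed_in_user(uid: str, review: Review) -> bool:
    """Permissive rule: being authenticated is enough (the broken pattern)."""
    return bool(uid)


def owner_or_public(uid: str, review: Review) -> bool:
    """Object-level rule: private records are readable only by their owner."""
    return bool(uid) and (not review.private or review.owner_id == uid)


def api_get_review(uid: str, review_id: str) -> Review:
    """Hardened API path: the backend enforces ownership itself."""
    review = REVIEWS[review_id]
    if review.private and review.owner_id != uid:
        raise PermissionError("not your review")
    return review


def direct_read(uid: str, review_id: str, rule: Rule) -> Review:
    """Direct data-layer path: only the configured rule stands in the way."""
    review = REVIEWS[review_id]
    if not rule(uid, review):
        raise PermissionError("denied by data-layer rule")
    return review
```

In this sketch a freshly registered user such as `"mallory"` is rejected by `api_get_review("mallory", "r2")`, yet `direct_read("mallory", "r2", any_signed_in_user)` returns Alice's private review. Swapping in `owner_or_public` closes that path, which is why the authorization check has to live wherever the data can be reached, not only in the API.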

The headline numbers were uneven. GPT-5.5 solved 7 of 10 runs, DeepSeek V4 Pro solved 3, and Claude Sonnet 4.6 and Opus 4.8 solved 2 each. Several models noticed Firebase but still tried to reach it through the API. Others stayed focused on the React Native app or the backend. Gemini runs often stopped early on security refusals, while some Claude runs were on the right path but ran into budget or guardrail friction.

HN commenters treated the table cautiously. They noted that the OpenAI account had security-research approval, that Claude ran under a different harness, and that expecting a model to solve everything alone may be less realistic than pairing it with a human operator. That is the useful takeaway: a security-agent evaluation measures more than model capability. Refusal policy, tool scaffolding, cost ceilings, and collaboration style can decide whether the same underlying insight becomes a working exploit or a polished dead end.

