OpenAI’s Images 2.0 safety card makes deepfake risk measurable
Original: ChatGPT Images 2.0 System Card
OpenAI’s ChatGPT Images 2.0 System Card, published April 21, 2026, is more than a companion note for a better image model. It gives a measurable view of the risk tradeoff behind an image model that can reason, use tools, pull in live web search data, and produce more complex scenes with dense text.
The capability change is easy to understand: Images 2.0 is designed for stronger world knowledge, instruction following, and detailed composition. The safety problem follows from the same strengths. OpenAI says the new model’s heightened realism could make deepfakes involving real people, political events, sexual content, or sensitive places more convincing if safeguards failed. That pushes the deployment question away from simple prompt filtering and toward layered image-specific controls.
The safety stack has several gates. Text classifiers can refuse a request before it reaches the image model. A safety-focused multimodal reasoning model checks text and image inputs before generation, then checks the generated output before it reaches the user. OpenAI says it has also shifted evaluations from raw taxonomy matching toward more product-grounded measurement of harmful-output risk.
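The shape of that stack is easy to express in code. Below is a minimal sketch of the layered gating, assuming nothing about OpenAI's internals: every function name, data type, and blocked-term list is a hypothetical stand-in, and the point is only the ordering, refuse at the earliest gate, check the output last.

```python
# Hypothetical sketch of layered safety gating; names and logic are
# illustrative stand-ins, not OpenAI's actual stack.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GateResult:
    allowed: bool
    reason: Optional[str] = None


def prompt_classifier(prompt: str) -> GateResult:
    # Gate 1: a text classifier that can refuse before the image model runs.
    blocked_terms = {"example-blocked-term"}  # hypothetical placeholder list
    if any(term in prompt.lower() for term in blocked_terms):
        return GateResult(False, "refused by text classifier")
    return GateResult(True)


def multimodal_input_check(prompt: str, images: List[bytes]) -> GateResult:
    # Gate 2: stand-in for the safety-focused multimodal reasoning model
    # that inspects text and image inputs before generation.
    return GateResult(True)


def multimodal_output_check(image: bytes) -> GateResult:
    # Gate 3: the same kind of check applied to the generated image
    # before it reaches the user.
    return GateResult(True)


def generate_image(prompt: str) -> bytes:
    return b"..."  # placeholder for the actual image model call


def guarded_generate(prompt: str, images: List[bytes]) -> GateResult:
    """Run the gates in order; any refusal short-circuits the pipeline."""
    for gate in (prompt_classifier(prompt),
                 multimodal_input_check(prompt, images)):
        if not gate.allowed:
            return gate
    return multimodal_output_check(generate_image(prompt))
```

The short-circuit ordering is what makes the card's recall numbers meaningful: each later gate only sees what earlier gates let through, so the combined stack can be much stronger than any single layer.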
The numbers are the most useful part of the card. In adversarial testing, OpenAI says the final thinking mode checkpoint produced 464 policy-violating images out of 6,944 attempts before the full production stack, a 6.7% rate. Instant mode produced 685 out of 3,112, or 22.0%, in the same kind of pre-blocking analysis. The downstream monitor caught 598 of those 685 instant-mode violating images, and the combined prompt plus image stack caught 658, for 96.1% combined recall and 99.1% safe outputs for adversarial prompts. Thinking mode reached 99.2% safe outputs after the combined stack.
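All of those percentages follow directly from the raw counts, which makes them easy to sanity-check. The snippet below reproduces the card's headline rates from the numbers quoted above; the variable names are ours, the counts are OpenAI's.

```python
# Reproducing the card's headline rates from the raw counts quoted above.
attempts_thinking, violations_thinking = 6944, 464
attempts_instant, violations_instant = 3112, 685
caught_by_monitor = 598   # image-side monitor alone
caught_combined = 658     # prompt + image stack together

print(f"thinking pre-block rate: {violations_thinking / attempts_thinking:.1%}")  # 6.7%
print(f"instant pre-block rate:  {violations_instant / attempts_instant:.1%}")    # 22.0%
print(f"image monitor alone:     {caught_by_monitor / violations_instant:.1%}")   # 87.3%
print(f"combined recall:         {caught_combined / violations_instant:.1%}")     # 96.1%

leaked = violations_instant - caught_combined  # 27 violating images slip through
print(f"safe outputs (instant):  {(attempts_instant - leaked) / attempts_instant:.1%}")  # 99.1%
```

Put differently, under adversarial prompting in instant mode, 27 of 3,112 attempts still yielded a violating image after the full stack, which is what the 99.1% safe-output figure measures.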
The biorisk section is also notable. OpenAI says some image outputs were accurate enough that a bioweapons expert judged they could potentially help novices with harmful tasks. The company says it therefore treats the model as high capability in biology and applies an image-specific biological risk policy to both inputs and outputs.
For users, the practical takeaway is that image model safety now has to cover realism, instruction following, external information, and provenance at the same time. OpenAI says Images 2.0 keeps C2PA metadata and adds an imperceptible, content-specific watermark with internal detection tooling. The hard part to watch next is whether those controls remain reliable as users combine image editing, web-grounded prompts, and multi-image workflows in less predictable ways than a lab evaluation can capture.
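As a rough illustration of why the card pairs two signals, here is a hypothetical two-signal verification flow: C2PA metadata is the signed, primary signal but can be stripped by re-encoding, so an imperceptible watermark acts as the fallback. Neither helper below reflects a real API; C2PA manifests are readable with the open C2PA SDKs, but OpenAI's watermark detector is internal tooling with no public interface.

```python
# Hypothetical two-signal provenance check; both helpers are stubs.
from typing import Optional


def read_c2pa_manifest(image_bytes: bytes) -> Optional[dict]:
    # Stand-in for parsing signed C2PA provenance metadata. Returns the
    # manifest if present; metadata can be stripped by re-encoding.
    return None  # stubbed


def detect_watermark(image_bytes: bytes) -> bool:
    # Stand-in for an imperceptible, content-specific watermark detector,
    # the fallback signal that survives when metadata is removed.
    return False  # stubbed


def provenance_verdict(image_bytes: bytes) -> str:
    if (manifest := read_c2pa_manifest(image_bytes)) is not None:
        return f"signed provenance present: {sorted(manifest)}"
    if detect_watermark(image_bytes):
        return "metadata stripped, but watermark detected"
    return "no provenance signal recoverable"


print(provenance_verdict(b"\x89PNG..."))  # -> "no provenance signal recoverable"
```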
Related Articles
HN focused less on the demo reel and more on whether the model can obey dense prompts. ChatGPT Images 2.0 arrived with broader style, multilingual text, and layout examples, but the thread quickly turned to prompt adherence, pricing, and synthetic-media fatigue.
OpenAI said on March 23, 2026 that Sora videos include visible and invisible provenance signals, including C2PA metadata, alongside consent controls and tighter rules for videos involving real people. The company also described teen-specific protections, content filters across video and audio, and blocks on music that imitates living artists or existing works.
OpenAI introduced the Child Safety Blueprint on April 8, 2026 as a policy framework for combating AI-enabled child sexual exploitation. The proposal combines legal updates, stronger provider reporting, and safety-by-design measures inside AI systems.