r/LocalLLaMA Highlights Netflix's Open VOID Video Deletion Model

Original: Netflix just dropped their first public model on Hugging Face: VOID: Video Object and Interaction Deletion

AI · Apr 4, 2026 · By Insights AI (Reddit) · 2 min read

A r/LocalLLaMA post surged past 1,100 upvotes after pointing to Netflix's first public model release on Hugging Face: VOID, short for Video Object and Interaction Deletion. What made the post stand out was not just that another company published weights, but that the model targets a harder version of video inpainting. According to the model card and the GitHub repo, VOID tries to remove an object and the physical interactions it induces in the scene, not just obvious traces such as shadows or reflections.

The published materials describe VOID as a fine-tuned system built on CogVideoX-Fun-V1.5-5b-InP. It uses interaction-aware quadmask conditioning, where different mask values represent the primary object to remove, overlap regions, affected regions, and protected background. Netflix says the model can handle cases where deleting a person should also change what happens to nearby objects, such as a guitar that would naturally fall once the person holding it disappears.

  • The base architecture is a 5B CogVideoX 3D Transformer.
  • The default output resolution is 384×672, with support for up to 197 frames.
  • Pass 1 is a base inpainting model, while Pass 2 refines temporal consistency with warped-noise initialization.
  • The quick-start notebook requires a GPU with 40GB+ VRAM such as an A100.
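The interaction-aware quadmask conditioning described above can be sketched as a single-channel mask whose pixel values distinguish the four categories. The value scheme and the `build_quadmask` helper below are illustrative assumptions, not the repo's actual encoding or API:

```python
import numpy as np

# Hypothetical value scheme for the four quadmask categories;
# the real encoding used by VOID may differ.
BACKGROUND, AFFECTED, OVERLAP, OBJECT = 0, 1, 2, 3

def build_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    """Combine binary per-category masks (H, W) into one quadmask.

    Pixels belonging to both the object to remove and an affected
    region (e.g. the hand gripping the guitar) are tagged as overlap;
    everything untouched stays protected background.
    """
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED            # scene content the deletion perturbs
    quad[object_mask] = OBJECT                # the primary object to remove
    quad[object_mask & affected_mask] = OVERLAP
    return quad

# Toy example: a 4x4 frame where object and affected regions share one pixel.
obj = np.zeros((4, 4), dtype=bool); obj[1:3, 1:3] = True
aff = np.zeros((4, 4), dtype=bool); aff[2:4, 2:4] = True
quad = build_quadmask(obj, aff)
```

Encoding all four categories in one mask lets the conditioning signal tell the model not only what to erase, but which surrounding pixels it is allowed to repaint and which must stay pixel-identical.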

The repo also makes the workflow unusually concrete for an open release. The README documents the CLI, the expected input layout, optional two-pass inference, and a mask-generation pipeline that combines SAM2 with Gemini to build quadmasks from raw video. Training details are public as well: the authors say VOID was trained on paired counterfactual videos from HUMOTO and Kubric, using 8× A100 80GB GPUs with DeepSpeed ZeRO Stage 2.
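The warped-noise initialization used in Pass 2 can be illustrated in miniature: instead of sampling fresh noise per frame, the noise from the previous frame is propagated along optical flow, so consecutive frames start denoising from correlated latents. The nearest-neighbor warp below is a simplified sketch under that assumption; the repo's actual implementation may differ:

```python
import numpy as np

def warp_noise(prev_noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Propagate per-pixel noise along optical flow (nearest-neighbor).

    prev_noise: (H, W) noise used for the previous frame.
    flow:       (H, W, 2) forward flow in pixels, (dx, dy) per pixel.
    Returns noise for the current frame, sampled backward along the flow
    so that moving content keeps the same underlying noise values.
    """
    h, w = prev_noise.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    return prev_noise[src_y, src_x]

# Toy example: content shifting right by one pixel drags its noise along.
rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8))
flow = np.zeros((8, 8, 2)); flow[..., 0] = 1.0  # dx = 1 everywhere
warped = warp_noise(noise, flow)
```

Because the warped noise tracks scene motion, the diffusion model sees temporally coherent initial latents in Pass 2, which is what helps suppress the flicker a frame-independent Pass 1 can leave behind.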

The Reddit discussion was enthusiastic for a reason. One widely upvoted reply highlighted the claim that VOID handles physical interactions, calling that especially impressive. Another commenter joked that Netflix is acting more open source than some frontier-model labs. That mix of novelty and reproducibility is why the post landed so well in r/LocalLLaMA: it is not just a flashy demo, but a release with weights, code, a notebook, and enough system detail for people to test the claim themselves.


