FDM-1 Claims a General Computer Action Model Trained on 11M Hours of Video
Original: The First Fully General Computer Action Model
What Was Announced
A Hacker News post titled "The First Fully General Computer Action Model" points to Standard Intelligence's FDM-1 release. The team describes FDM-1 as a foundation model for computer action that operates directly on video and action tokens, rather than relying only on screenshot-centric pipelines. Their claim is broad: one model stack that can handle complex website exploration, multi-step CAD actions, and other long-horizon interaction workflows.
The core pitch is scale plus temporal continuity. Instead of narrow task finetuning on small annotated datasets, the model is presented as being pretrained on internet-scale behavioral video.
Training Recipe and Data Strategy
According to the write-up, training is organized into three stages. First, they train an inverse dynamics model (IDM) on roughly 40,000 hours of contractor-labeled screen recordings. Second, they use that IDM to label a much larger corpus, described as 11 million hours of video data. Third, they train the forward dynamics model (FDM) autoregressively on next-action prediction.
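The three-stage recipe can be summarized as: fit an inverse dynamics model on a small labeled corpus, use it to pseudo-label a much larger unlabeled corpus, then train the action model on the result. The sketch below is illustrative only, with toy lookup-table "models" standing in for the actual networks; none of the function names come from Standard Intelligence's release.

```python
# Illustrative sketch of the IDM -> pseudo-label -> FDM recipe described
# in the post. All names, shapes, and "training" logic are assumptions,
# not Standard Intelligence's actual code.

def train_idm(labeled_clips):
    """Stage 1: fit an inverse dynamics model on human-labeled
    (frame_t, frame_t1) -> action pairs. Here: a trivial lookup table."""
    table = {}
    for frame_t, frame_t1, action in labeled_clips:
        table[(frame_t, frame_t1)] = action
    return table

def pseudo_label(idm, unlabeled_clips):
    """Stage 2: use the IDM to attach action labels to a much larger
    unlabeled video corpus (unseen transitions default to 'noop')."""
    return [(f0, f1, idm.get((f0, f1), "noop")) for f0, f1 in unlabeled_clips]

def train_fdm(pseudo_labeled):
    """Stage 3: next-action prediction on the pseudo-labeled corpus.
    Here, action-frequency counting stands in for gradient training."""
    counts = {}
    for _, _, action in pseudo_labeled:
        counts[action] = counts.get(action, 0) + 1
    return counts

labeled = [("a", "b", "click"), ("b", "c", "scroll")]   # ~40k hours in the post
unlabeled = [("a", "b"), ("b", "c"), ("c", "d")]        # ~11M hours in the post
idm = train_idm(labeled)
corpus = pseudo_label(idm, unlabeled)
fdm_stats = train_fdm(corpus)
print(fdm_stats)  # {'click': 1, 'scroll': 1, 'noop': 1}
```

The key design point the announcement leans on is the data asymmetry: expensive human labels are spent only on stage one, and the IDM then amplifies them across a corpus roughly 275 times larger.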
The article also highlights a video encoder efficiency claim: nearly two hours of 30 FPS video represented in about one million tokens. The stated objective is to preserve long context while keeping inference practical for continuous control tasks where short-window screenshot methods can break down.
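A quick back-of-envelope check clarifies what the stated budget implies: two hours at 30 FPS is 216,000 frames, so one million tokens works out to only a few tokens per frame on average. The numbers below are taken from the article's claim; the arithmetic is ours.

```python
# Implied compression of the claimed encoder budget:
# ~2 hours of 30 FPS video in ~1M tokens.
hours = 2
fps = 30
frames = hours * 3600 * fps           # 216,000 frames
token_budget = 1_000_000
tokens_per_frame = token_budget / frames
print(frames, round(tokens_per_frame, 2))  # 216000 4.63
```

For comparison, screenshot-centric pipelines commonly spend hundreds or thousands of tokens on a single frame, which is why they are limited to short context windows over a handful of screenshots.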
Demos, Evaluation, and Caution
Published demos include CAD manipulation, GUI fuzzing that discovers stateful app bugs, and a self-driving interface scenario where the model reportedly navigates turns after finetuning on less than one hour of data. The team further claims evaluation infrastructure capable of over one million rollouts per hour across 80,000 forking VMs, with low-latency control loops.
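The throughput claim is easier to assess per machine. Dividing the claimed aggregate by the claimed VM count (our arithmetic, not theirs) gives a modest per-VM rate:

```python
# Per-VM rate implied by the claimed evaluation throughput:
# 1M+ rollouts/hour across 80,000 forking VMs.
rollouts_per_hour = 1_000_000
vms = 80_000
per_vm = rollouts_per_hour / vms      # 12.5 rollouts per VM per hour
minutes_per_rollout = 60 / per_vm     # 4.8 minutes per rollout per VM
print(per_vm, minutes_per_rollout)    # 12.5 4.8
```

At 12.5 rollouts per VM per hour, each rollout averages under five minutes, so the headline number depends less on raw per-VM speed than on orchestrating 80,000 environments concurrently, which is why the announcement reads as a systems-engineering claim as much as a modeling one.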
If these results hold up under independent replication, they suggest the computer-use field is moving from small supervised interaction datasets toward video-native pretraining and infrastructure-heavy evaluation. That shift would change where competitive advantage sits: data pipelines, labeling quality, and systems engineering, not just model architecture.
At the same time, most metrics in this announcement are self-reported. External benchmarking, reproducibility checks, and standardized safety evaluation will be necessary before enterprises treat this class of model as production-ready for high-impact automation workflows.
Sources: Standard Intelligence FDM-1 post, Hacker News discussion