Michael Hafftka opens 50 years of work as a Hugging Face dataset
Original post: "I am a painter with work at MoMA and the Met. I just published 50 years of my work as an open AI dataset. Here is what I learned."
The r/artificial post had 173 points and 46 comments at crawl time. The author appears to be painter Michael Hafftka himself, introducing work held by institutions including the Metropolitan Museum of Art, MoMA, SFMOMA, and the British Museum. What he published is not a loose image dump but an ongoing catalogue raisonné dataset on Hugging Face.
In the Reddit post, Hafftka says he released his entire catalog earlier this month, covering roughly 3,000 to 4,000 documented works with full metadata. The dataset card is more precise: it lists 3,780 examples in the train split, spanning work from the 1970s through 2025. Beyond images, the dataset includes fields such as title, year, medium, dimensions, collection, copyright holder, license, and view. That makes it more useful than a generic image archive for retrieval, longitudinal analysis, or model building.
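That per-work `year` field is what makes longitudinal analysis straightforward. As a minimal sketch (the bucketing function and the sample rows below are invented for illustration; only the field names come from the dataset card), works can be grouped by decade:

```python
from collections import Counter

def works_per_decade(records):
    """Count documented works by decade, skipping records
    whose 'year' field is missing or unparsable."""
    counts = Counter()
    for rec in records:
        try:
            decade = (int(rec.get("year")) // 10) * 10
        except (TypeError, ValueError):
            continue  # incomplete metadata, as the card warns for older works
        counts[f"{decade}s"] += 1
    return dict(counts)

# Hypothetical rows shaped like the card's metadata, not real entries.
sample = [
    {"title": "Untitled", "year": 1978, "medium": "oil on canvas"},
    {"title": "Figure", "year": 1981, "medium": "watercolor"},
    {"title": "Study", "year": "1985", "medium": "ink on paper"},
    {"title": "Late work", "year": 2024, "medium": "oil on linen"},
    {"title": "Undated", "year": None, "medium": "oil"},
]

print(works_per_decade(sample))  # {'1970s': 1, '1980s': 2, '2020s': 1}
```

The `try`/`except` matters here: the card itself flags uneven metadata completeness, so any aggregation over the archive has to tolerate missing or string-typed years.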
The licensing and packaging details matter. The card says the dataset is released under CC-BY-NC-4.0, so attribution is required and commercial use is off the table. It also lists about 40.4 GB of download size and about 53.1 GB of total dataset size. Hafftka writes that the release drew more than 2,500 downloads in one week, and frames the decision as a way to engage with AI on his own terms instead of waiting for the technology to absorb his work without his participation.
What makes this dataset unusual is the combination of single-artist consistency, long time coverage, and structured metadata. The Hugging Face card suggests use cases such as LoRA or style-model training, image plus metadata retrieval systems, computer vision research, digital humanities, and generative-art experiments. For art-history work, the archive offers a long view of one painter's development. For ML work, it provides a comparatively coherent corpus rather than a many-source style soup.
- Scale: 3,780 examples in the train split
- Coverage: 1970s-2025
- License: CC-BY-NC-4.0
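The image-plus-metadata retrieval use case mentioned above can be sketched with nothing more than the card's text fields. This is a naive substring search over invented sample rows, assuming each record exposes `title` and `medium` as on the dataset card; a real system would use the images and an embedding index:

```python
def search_catalog(records, query):
    """Naive metadata retrieval: case-insensitive substring match
    over the title and medium fields."""
    q = query.lower()
    return [
        rec for rec in records
        if q in str(rec.get("title", "")).lower()
        or q in str(rec.get("medium", "")).lower()
    ]

# Invented rows for illustration; field names follow the dataset card.
catalog = [
    {"title": "Portrait", "year": 1992, "medium": "oil on canvas"},
    {"title": "Oil Study", "year": 2001, "medium": "watercolor"},
    {"title": "Garden", "year": 2015, "medium": "ink on paper"},
]

print([rec["title"] for rec in search_catalog(catalog, "oil")])
# ['Portrait', 'Oil Study']
```

Even this toy version shows why structured fields beat a bare image folder: medium, year, and collection become queryable axes rather than information locked inside pixels.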
The dataset card also makes the limitations clear. Metadata completeness varies, especially for older works, and image quality is uneven because the archive spans decades of different documentation practices. Even with those caveats, this is a notable example of a creator-led dataset entering the AI ecosystem with explicit terms, rich metadata, and public discussion. The Reddit response is a reminder that the future of training data is not only about scale, but also about provenance, consent, and how creators choose to participate.