Michael Hafftka opens 50 years of work as a Hugging Face dataset
Original post: "I am a painter with work at MoMA and the Met. I just published 50 years of my work as an open AI dataset. Here is what I learned."
The r/artificial post had 173 points and 46 comments at crawl time. The author appears to be painter Michael Hafftka himself, introducing work held by institutions including the Metropolitan Museum of Art, MoMA, SFMOMA, and the British Museum. What he published is not a loose image dump but an ongoing catalogue raisonné dataset on Hugging Face.
In the Reddit post, Hafftka says he released his entire catalog earlier this month, covering roughly 3,000 to 4,000 documented works with full metadata. The dataset card is more precise: it lists 3,780 examples in the train split, spanning work from the 1970s through 2025. Beyond images, the dataset includes fields such as title, year, medium, dimensions, collection, copyright holder, license, and view. That makes it more useful than a generic image archive for retrieval, longitudinal analysis, or model building.
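That per-work `year` field is what makes longitudinal analysis straightforward. As a minimal sketch (the bucketing function and the sample rows below are invented for illustration; only the field names come from the dataset card), works can be grouped by decade:

```python
from collections import Counter

def works_per_decade(records):
    """Count documented works by decade, skipping records
    whose 'year' field is missing or unparsable."""
    counts = Counter()
    for rec in records:
        try:
            decade = (int(rec.get("year")) // 10) * 10
        except (TypeError, ValueError):
            continue  # incomplete metadata, as the card warns for older works
        counts[f"{decade}s"] += 1
    return dict(counts)

# Hypothetical rows shaped like the card's metadata, not real entries.
sample = [
    {"title": "Untitled", "year": 1978, "medium": "oil on canvas"},
    {"title": "Figure", "year": 1981, "medium": "watercolor"},
    {"title": "Study", "year": "1985", "medium": "ink on paper"},
    {"title": "Late work", "year": 2024, "medium": "oil on linen"},
    {"title": "Undated", "year": None, "medium": "oil"},
]

print(works_per_decade(sample))  # {'1970s': 1, '1980s': 2, '2020s': 1}
```

The `try`/`except` matters here: the card itself flags uneven metadata completeness, so any aggregation over the archive has to tolerate missing or string-typed years.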
The licensing and packaging details matter. The card says the dataset is released under CC-BY-NC-4.0, so attribution is required and commercial use is off the table. It also lists about 40.4 GB of download size and about 53.1 GB of total dataset size. Hafftka writes that the release drew more than 2,500 downloads in one week, and frames the decision as a way to engage with AI on his own terms instead of waiting for the technology to absorb his work without his participation.
What makes this dataset unusual is the combination of single-artist consistency, long time coverage, and structured metadata. The Hugging Face card suggests use cases such as LoRA or style-model training, image plus metadata retrieval systems, computer vision research, digital humanities, and generative-art experiments. For art-history work, the archive offers a long view of one painter's development. For ML work, it provides a comparatively coherent corpus rather than a many-source style soup.
- Scale: 3,780 examples in the train split
- Coverage: 1970s-2025
- License: CC-BY-NC-4.0
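The image-plus-metadata retrieval use case mentioned above can be sketched with nothing more than the card's text fields. This is a naive substring search over invented sample rows, assuming each record exposes `title` and `medium` as on the dataset card; a real system would use the images and an embedding index:

```python
def search_catalog(records, query):
    """Naive metadata retrieval: case-insensitive substring match
    over the title and medium fields."""
    q = query.lower()
    return [
        rec for rec in records
        if q in str(rec.get("title", "")).lower()
        or q in str(rec.get("medium", "")).lower()
    ]

# Invented rows for illustration; field names follow the dataset card.
catalog = [
    {"title": "Portrait", "year": 1992, "medium": "oil on canvas"},
    {"title": "Oil Study", "year": 2001, "medium": "watercolor"},
    {"title": "Garden", "year": 2015, "medium": "ink on paper"},
]

print([rec["title"] for rec in search_catalog(catalog, "oil")])
# ['Portrait', 'Oil Study']
```

Even this toy version shows why structured fields beat a bare image folder: medium, year, and collection become queryable axes rather than information locked inside pixels.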
The dataset card also makes the limitations clear. Metadata completeness varies, especially for older works, and image quality is uneven because the archive spans decades of different documentation practices. Even with those caveats, this is a notable example of a creator-led dataset entering the AI ecosystem with explicit terms, rich metadata, and public discussion. The Reddit response is a reminder that the future of training data is not only about scale, but also about provenance, consent, and how creators choose to participate.