NuExtract3 targets local document extraction with a 4B VLM
Original: NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)
NuMind released NuExtract3, a 4B vision-language model built for document understanding. Its main jobs are structured information extraction and document-to-Markdown conversion. The model is based on Qwen3.5-4B, published under Apache-2.0, and aimed at workflows involving scans, receipts, invoices, forms, tables, contracts, and other layout-heavy documents.
The Reddit thread drew attention because the deployment story is unusually practical for local users. According to the post's author, NuMind provides Safetensors, GGUF, and MLX weights, along with multiple quantizations, and positions the model as usable with as little as 4 GB of VRAM. The team has mainly tested vLLM, SGLang, and llama.cpp. That matters for teams that want document extraction without routing sensitive files through hosted OCR or multimodal APIs.
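To put the 4 GB figure in context, here is a rough back-of-the-envelope sketch of how much memory the weights alone might need at different precisions. It assumes a nominal 4 billion parameters and uniform bit widths, which are simplifications: real quantized files mix bit widths, and the estimate ignores the KV cache, activations, and any runtime overhead. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Illustrative precisions only; not the specific quantizations NuMind ships.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(4, bits):.1f} GB for weights")
```

Under these assumptions, a 4-bit quantization needs about 2 GB for weights, which would leave some headroom on a 4 GB card, while 16-bit weights at about 8 GB would not fit.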
The model card describes two major modes. For structured extraction, users provide text or images plus a JSON-like template, and the model returns values in that structure. For Markdown conversion, it turns document images into Markdown, including HTML tables, LaTeX for math, and figure tags for images. NuMind also reports internal benchmark results across roughly 600 diverse documents, where NuExtract3.4_4B-RL scored 0.651 on its structured extraction metric. The company says it plans to open-source the benchmark and publish more technical details later.
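As a minimal sketch of the template idea, the snippet below checks that a model response has the same nested keys as the template it was given. The field names, the type labels such as "string" and "number", and the sample response are all made up for illustration and are not taken from the NuExtract3 model card; the helper `matches_template` is hypothetical and only uses the Python standard library.

```python
import json


def matches_template(template, output):
    """Return True if `output` has the same nested key structure as `template`."""
    if isinstance(template, dict):
        return (
            isinstance(output, dict)
            and set(output) == set(template)
            and all(matches_template(template[k], output[k]) for k in template)
        )
    if isinstance(template, list):
        if not isinstance(output, list):
            return False
        # A one-element list in the template describes the shape of every item.
        return not template or all(matches_template(template[0], item) for item in output)
    return True  # leaf field: any value is accepted, including None


# Hypothetical template and response for an invoice-like document.
template = {
    "vendor": "string",
    "total": "number",
    "items": [{"description": "string", "price": "number"}],
}
raw_response = (
    '{"vendor": "Acme", "total": 42.5, '
    '"items": [{"description": "Widget", "price": 42.5}]}'
)

output = json.loads(raw_response)
print(matches_template(template, output))
```

A structural check like this is one way a local pipeline might reject malformed responses before passing values downstream; it says nothing about whether the extracted values are correct.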
Community discussion quickly moved to edge cases: multi-column pages, dense tables, newspapers, old books, handwriting, Chinese subtitles, and vLLM loading issues. One commenter noted that shipping GGUF and MLX weights on day one changes the adoption curve because users do not have to wait for community conversions. Another described replacing paid cloud extraction in workflows where cost accumulates quickly.
Source thread: r/LocalLLaMA. Model details: Hugging Face NuExtract3 model card.