IBM Releases Granite 4.0 3B Vision for Enterprise Data Extraction

Key Takeaways

  • Introduces a modular architecture that lets enterprises activate vision capabilities only when needed, reducing compute overhead.
  • Improves document data extraction accuracy by using a code-guided training pipeline to better interpret complex charts and tables.
  • Provides an Apache 2.0 licensed, developer-friendly solution that integrates natively with Docling and vLLM for production workflows.

IBM has announced the release of Granite 4.0 3B Vision, a new vision-language model (VLM) engineered specifically for enterprise-grade document data extraction. Moving away from the monolithic architecture common in larger multimodal models, this release functions as a specialized adapter that adds high-fidelity visual reasoning to the Granite 4.0 Micro language backbone. The model prioritizes accurate structured output, such as converting tables to HTML or complex charts into code, over general-purpose image captioning.

Modular Architecture and Vision Integration

The Granite 4.0 3B Vision model is delivered as a 0.5B parameter Low-Rank Adaptation (LoRA) adapter that functions on top of the 3.5B parameter Granite 4.0 Micro base model. This dual-mode deployment allows the base model to process text-only requests independently, activating the vision adapter only when multimodal processing is required.
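
The dual-mode routing described above can be sketched as a simple dispatch rule. This is a minimal illustration, not IBM's serving code: the `Request` type, the `select_adapter` helper, and the `vision-lora` identifier are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical identifier for illustration; not an official model or adapter name.
VISION_ADAPTER = "vision-lora"


@dataclass
class Request:
    prompt: str
    images: List[bytes] = field(default_factory=list)


def select_adapter(request: Request) -> Optional[str]:
    """Return the adapter to activate, or None to use the base model alone."""
    return VISION_ADAPTER if request.images else None


text_only = Request(prompt="Summarize the contract terms.")
with_scan = Request(prompt="Extract the table as HTML.", images=[b"<png bytes>"])

for req in (text_only, with_scan):
    # Text-only requests fall through to the base model.
    print(req.prompt, "->", select_adapter(req) or "base model only")
```

In a real deployment the returned name would be handed to whatever adapter-loading mechanism the serving stack provides; that wiring is left out here because it depends on the stack.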

To handle diverse document layouts, the model uses the google/siglip2-so400m-patch16-384 encoder. A tiling mechanism splits each input image into 384×384 tiles while also processing a downscaled global view of the entire document, which preserves fine details such as small data points in charts or subscripts in formulas. IBM also uses a DeepStack architecture that routes visual tokens into the language model at eight injection points. This creates a tighter alignment between semantic content and spatial layout, which is essential for maintaining structure during document parsing.
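
As a rough illustration of the tiling step, the sketch below computes a 384×384 tile grid for an image and counts the views the encoder would see, including the global view. The clipping behavior at image edges and the helper names `tile_boxes` and `patches_per_view` are assumptions made for this example; the 16-pixel patch size is inferred from the encoder name, and the actual number of tokens reaching the language model may differ after projection.

```python
import math
from typing import List, Tuple

TILE = 384   # tile edge in pixels, from the article
PATCH = 16   # patch edge inferred from the encoder name (patch16)


def tile_boxes(width: int, height: int, tile: int = TILE) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes covering the image in a grid.

    Assumption: edge tiles are clipped to the image bounds. A real pipeline
    would more likely resize or pad the image to a multiple of the tile size.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [
        (c * tile, r * tile, min((c + 1) * tile, width), min((r + 1) * tile, height))
        for r in range(rows)
        for c in range(cols)
    ]


def patches_per_view(tile: int = TILE, patch: int = PATCH) -> int:
    """Number of patch positions in one 384x384 view: (384 / 16) ** 2."""
    return (tile // patch) ** 2


boxes = tile_boxes(1152, 768)   # a 3 x 2 grid of tiles
views = len(boxes) + 1          # tiles plus one downscaled global view
print(len(boxes), views, patches_per_view())  # 6 7 576
```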

Specialized Training and Performance

The training curriculum for Granite 4.0 3B Vision reflects a strategic focus on complex document structures. The model was refined using the ChartNet dataset and a code-guided pipeline that aligns original plotting code, rendered images, and underlying data tables. This approach enables the model to internalize the structural relationship between visual representations and their source data. Additionally, the model underwent fine-tuning on a mixture of datasets focused on Key-Value Pair extraction, table structure recognition, and the conversion of charts into machine-readable formats like CSV and JSON.
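
To make the idea of aligned code, image, and table concrete, here is a minimal sketch of what one such training record could look like, along with the CSV and JSON targets derived from its table. The `ChartSample` class and its field names are hypothetical and do not reflect the ChartNet schema; the plotting code is stored as text and is not executed.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChartSample:
    """One aligned example: plotting code, data table, and rendered image path."""
    plotting_code: str
    table: List[Dict[str, object]]
    image_path: str  # hypothetical; the rendering step is omitted here

    def to_csv(self) -> str:
        """Serialize the data table as a CSV target string."""
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=list(self.table[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(self.table)
        return buf.getvalue()

    def to_json(self) -> str:
        """Serialize the data table as a JSON target string."""
        return json.dumps(self.table)


table = [{"quarter": "Q1", "revenue": 12.5}, {"quarter": "Q2", "revenue": 14.0}]
code = (
    "import matplotlib.pyplot as plt\n"
    "plt.bar(['Q1', 'Q2'], [12.5, 14.0])\n"
    "plt.savefig('chart.png')\n"
)
sample = ChartSample(plotting_code=code, table=table, image_path="chart.png")
print(sample.to_csv())
```

Because all three views come from the same source data, any of them can serve as the supervision target for the others, for example the image as input and the CSV as the expected output.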

Technical evaluations demonstrate the model's effectiveness in structured extraction. As of March 2026, it ranks 3rd among models in the 2–4B parameter class on the VAREX leaderboard, with an 85.5% zero-shot exact match rate on the benchmark's Key-Value Pair extraction task. The model is Apache 2.0 licensed and features native support for vLLM and Docling, IBM’s tool for converting unstructured PDFs into machine-readable formats.
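
For readers unfamiliar with the metric, an exact match rate for Key-Value Pair extraction can be computed along the following lines. This is a sketch under stated assumptions: it scores per field and only normalizes whitespace, whereas the benchmark's own scoring and normalization rules may differ. The sample invoices are made up for the example.

```python
from typing import Dict, List


def normalize(value: str) -> str:
    """Assumed normalization: trim and collapse whitespace only."""
    return " ".join(value.split())


def exact_match_rate(
    predictions: List[Dict[str, str]], references: List[Dict[str, str]]
) -> float:
    """Fraction of reference fields whose predicted value matches exactly."""
    total = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        for key, expected in ref.items():
            total += 1
            if key in pred and normalize(pred[key]) == normalize(expected):
                correct += 1
    return correct / total if total else 0.0


refs = [
    {"invoice_no": "A-1001", "total": "42.00"},
    {"invoice_no": "A-1002", "total": "17.50"},
]
preds = [
    {"invoice_no": "A-1001", "total": "42.00"},
    {"invoice_no": "A-1002", "total": "17.5"},  # "17.5" != "17.50" under exact match
]
print(exact_match_rate(preds, refs))  # 0.75
```

The mismatch on "17.5" versus "17.50" shows why exact match is a strict measure: a value that is numerically equal still counts as an error unless the strings agree.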
