Granite Vision 3.1-2B Preview is a compact and efficient vision-language model designed for visual document understanding and automated content extraction. It processes and interprets complex visual data, including tables, charts, infographics, and diagrams, making it highly valuable for enterprise applications. Trained on a curated mix of public and synthetic datasets, it enhances document analysis, OCR, and visual question answering. With IBM’s Blue Vela supercomputing infrastructure and NVIDIA H100 GPUs, the model is optimized for scalability and precision. Its lightweight yet powerful architecture makes it an ideal solution for businesses looking to integrate multimodal AI into their workflows.
Evaluations
| Benchmark | InternVL2 | Molmo-E | Phi3v | Phi3.5v | Granite Vision |
|---|---|---|---|---|---|
| Document benchmarks | | | | | |
| DocVQA | 0.87 | 0.66 | 0.87 | 0.88 | 0.88 |
| ChartQA | 0.75 | 0.60 | 0.81 | 0.82 | 0.86 |
| TextVQA | 0.72 | 0.62 | 0.69 | 0.70 | 0.76 |
| AI2D | 0.74 | 0.63 | 0.79 | 0.79 | 0.78 |
| InfoVQA | 0.58 | 0.44 | 0.55 | 0.61 | 0.63 |
| OCRBench | 0.75 | 0.65 | 0.64 | 0.64 | 0.75 |
| LiveXiv VQA | 0.51 | 0.47 | 0.61 | – | 0.61 |
| LiveXiv TQA | 0.38 | 0.36 | 0.48 | – | 0.55 |
| Other benchmarks | | | | | |
| MMMU | 0.35 | 0.32 | 0.42 | 0.44 | 0.35 |
| VQAv2 | 0.75 | 0.57 | 0.76 | 0.77 | 0.81 |
| RealWorldQA | 0.34 | 0.55 | 0.60 | 0.58 | 0.65 |
| VizWiz VQA | 0.46 | 0.49 | 0.57 | 0.57 | 0.64 |
| OK VQA | 0.44 | 0.40 | 0.51 | 0.53 | 0.57 |
Model Resource
Hugging Face
Link: https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview
1. GPU Requirements
| Task | Minimum VRAM | Recommended VRAM | Optimal VRAM | Recommended GPU |
|---|---|---|---|---|
| Single Image Inference | 12GB | 24GB | 48GB | RTX 3090 / 4090 / A6000 |
| Batch Image Processing | 24GB | 48GB | 80GB | A100 40GB / H100 80GB |
| Fine-Tuning | 48GB | 80GB | 160GB (Multi-GPU) | 2x A100 80GB / 4x H100 |
| Gradio Deployment | 16GB | 24GB | 48GB | RTX 3090 / A6000 |
- Minimum: 12GB VRAM (RTX 3060 / 4060) for basic inference with 8-bit quantization (see the loading sketch after this list).
- Recommended: 24GB VRAM (RTX 3090 / 4090 / A6000) for smooth inference.
- Optimal: 48GB+ VRAM (A100 / H100) for batch processing of multiple images.
- Fine-Tuning: Requires 2x A100 80GB or 4x H100 for large-scale training.
- For smooth document understanding tasks, at least 24GB VRAM is recommended.
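If you only have around 12GB of VRAM, one option is to load the model with 8-bit quantization via bitsandbytes. This is a minimal sketch rather than an official recipe: it assumes the bitsandbytes and accelerate packages are installed and that the checkpoint quantizes cleanly, so verify output quality on your own documents.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_path = "ibm-granite/granite-vision-3.1-2b-preview"

# 8-bit weight quantization to roughly halve VRAM usage versus fp16
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)
Note that a model loaded this way should not be moved with .to(device) afterwards; device placement is handled by device_map.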
2. CPU Requirements
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| CPU Cores | 8 Cores | 16 Cores | 32 Cores |
| Clock Speed | 2.5 GHz | 3.5 GHz+ | 3.8 GHz+ |
| Processor Type | Intel i7 / Ryzen 7 | Intel i9 / Ryzen 9 | AMD EPYC / Intel Xeon |
- Minimum: 8-core CPU (Intel i7 / Ryzen 7) for inference.
- Recommended: 16-core CPU (Intel i9 / Ryzen 9) for fast document processing.
- Optimal: 32-core+ CPU (AMD EPYC / Intel Xeon) for large-scale parallel image processing.
- Use a high-performance CPU for better image pre-processing and tokenization.
3. RAM Requirements
| Task | Minimum RAM | Recommended RAM | Optimal RAM |
|---|---|---|---|
| Single Image Processing | 16GB | 32GB | 64GB |
| Batch Processing | 32GB | 64GB | 128GB |
| Fine-Tuning | 64GB | 128GB+ | 256GB+ |
- Minimum: 16GB RAM for single-image inference.
- Recommended: 32GB RAM for multi-image document processing.
- Optimal: 64GB+ RAM for fine-tuning and high-throughput workloads.
- For multi-image document processing, 32GB RAM is recommended.
4. Disk Space & Storage
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| Disk Space | 40GB SSD | 100GB SSD | 500GB+ NVMe SSD |
| Disk Type | SATA SSD | NVMe SSD | PCIe 4.0 NVMe SSD |
- Minimum: 40GB SSD for model weights and dependencies.
- Recommended: 100GB SSD for storing additional datasets, logs, and output files.
- Optimal: 500GB+ NVMe SSD for fast caching and document storage.
- Use NVMe SSDs for fast storage and loading of visual datasets.
5. Best Practices for Performance
- Use SSD/NVMe for fast storage – Avoid HDDs.
- Monitor GPU Usage – Run nvidia-smi to check VRAM utilization.
- Enable Flash Attention – Optimize memory use and speed (see the sketch after this list).
- Use Quantization (8-bit) – For 12GB VRAM GPUs (see the 8-bit loading sketch under GPU Requirements).
- Scale Across Multiple GPUs – For batch processing.
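As a sketch of the Flash Attention recommendation above: recent transformers releases accept an attn_implementation argument at load time. This assumes the flash-attn package is installed and the GPU supports it (Ampere or newer); if it is not available, simply drop the argument and the model loads with the default attention implementation.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_path = "ibm-granite/granite-vision-3.1-2b-preview"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,              # half precision reduces VRAM use
    attn_implementation="flash_attention_2", # requires the flash-attn package
).to("cuda")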
Step-by-Step Process to Install Granite Vision 2B Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides affordable Virtual Machines at scale while meeting GDPR, SOC2, and ISO27001 compliance requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button on the Dashboard to create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the Granite Vision 2B model on a Jupyter Virtual Machine. This open-source platform lets you install and run the model directly on your GPU node. Running the model in a Jupyter Notebook avoids terminal work, simplifying the process and reducing setup time, so you can configure the model in just a few steps and a few minutes.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ Button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
Step 8: Install Required Dependencies
Ensure you have all necessary dependencies installed.
Run the following in a Jupyter Notebook cell:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install "transformers>=4.49"
!pip install huggingface_hub
!pip install pillow
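To confirm that the installed transformers version meets the 4.49 requirement, you can run a quick check in a new cell:
import transformers
print(transformers.__version__)  # should print 4.49.0 or later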
Step 9: Import Required Libraries
Once installation is complete, import the necessary libraries:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
from PIL import Image
Step 10: Set Device (GPU or CPU)
Ensure that the model runs on CUDA (GPU) if available:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
Step 11: Load the Model and Processor
Now, download and initialize the model and processor:
model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
Step 12: Load an Example Image
Download an example image from Hugging Face Hub:
img_path = hf_hub_download(repo_id=model_path, filename="example.png")
image = Image.open(img_path).convert("RGB")
display(image)  # Display the image inline in the notebook (image.show() would try to open an external viewer)
Alternatively, if you want to use a local image, replace the hf_hub_download() call with:
image = Image.open("your_local_image.png").convert("RGB")
Step 13: Define a Conversation Prompt
Now, structure the image + text input as a conversation:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_path},  # Image input
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},  # Text prompt
        ],
    },
]
Step 14: Preprocess the Input
Now, prepare the inputs for model inference:
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)
Step 15: Run the Model for Inference
Generate a response from the model:
output = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(output[0], skip_special_tokens=True)
print("Model Response:", response)
Step 16: (Optional) Use vLLM for Faster Inference
If you want to use vLLM (optimized inference), install it and use the following code:
Install vLLM
!pip install vllm==0.6.6
Run the Model with vLLM
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
model = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)
question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"<|system|>\nA chat between a user and an assistant.\n<|user|>\n<image>\n{question}\n<|assistant|>\n"
outputs = model.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling_params=sampling_params)
print("Generated text:", outputs[0].outputs[0].text)
Conclusion
Setting up Granite Vision 3.1-2B Preview locally provides a powerful and efficient solution for visual document understanding and content extraction. By following this step-by-step guide, users can deploy the model on a GPU-powered virtual machine, install dependencies, and run image-to-text inference seamlessly using Jupyter Notebook or vLLM for optimized performance. With support for OCR, chart analysis, and general document interpretation, this model is an ideal choice for businesses and researchers looking to integrate multimodal AI into their workflows. Whether for enterprise applications, data extraction, or automation, Granite Vision 3.1-2B ensures accuracy, scalability, and efficiency in AI-driven document processing.