Granite Vision 3.1-2B Preview is a compact and efficient vision-language model designed for visual document understanding and automated content extraction. It processes and interprets complex visual data, including tables, charts, infographics, and diagrams, making it highly valuable for enterprise applications. Trained on a curated mix of public and synthetic datasets, it enhances document analysis, OCR, and visual question answering. With IBM’s Blue Vela supercomputing infrastructure and NVIDIA H100 GPUs, the model is optimized for scalability and precision. Its lightweight yet powerful architecture makes it an ideal solution for businesses looking to integrate multimodal AI into their workflows.
Evaluations
| Benchmark | InternVL2 | Molmo-E | Phi3v | Phi3.5v | Granite Vision |
|---|---|---|---|---|---|
| Document benchmarks | | | | | |
| DocVQA | 0.87 | 0.66 | 0.87 | 0.88 | 0.88 |
| ChartQA | 0.75 | 0.60 | 0.81 | 0.82 | 0.86 |
| TextVQA | 0.72 | 0.62 | 0.69 | 0.70 | 0.76 |
| AI2D | 0.74 | 0.63 | 0.79 | 0.79 | 0.78 |
| InfoVQA | 0.58 | 0.44 | 0.55 | 0.61 | 0.63 |
| OCRBench | 0.75 | 0.65 | 0.64 | 0.64 | 0.75 |
| LiveXiv VQA | 0.51 | 0.47 | 0.61 | – | 0.61 |
| LiveXiv TQA | 0.38 | 0.36 | 0.48 | – | 0.55 |
| Other benchmarks | | | | | |
| MMMU | 0.35 | 0.32 | 0.42 | 0.44 | 0.35 |
| VQAv2 | 0.75 | 0.57 | 0.76 | 0.77 | 0.81 |
| RealWorldQA | 0.34 | 0.55 | 0.60 | 0.58 | 0.65 |
| VizWiz VQA | 0.46 | 0.49 | 0.57 | 0.57 | 0.64 |
| OK VQA | 0.44 | 0.40 | 0.51 | 0.53 | 0.57 |
Model Resource
Hugging Face
Link: https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview
1. GPU Requirements
| Task | Minimum VRAM | Recommended VRAM | Optimal VRAM | Recommended GPU |
|---|---|---|---|---|
| Single Image Inference | 12GB | 24GB | 48GB | RTX 3090 / 4090 / A6000 |
| Batch Image Processing | 24GB | 48GB | 80GB | A100 40GB / H100 80GB |
| Fine-Tuning | 48GB | 80GB | 160GB (Multi-GPU) | 2x A100 80GB / 4x H100 |
| Gradio Deployment | 16GB | 24GB | 48GB | RTX 3090 / A6000 |
- Minimum: 12GB VRAM (RTX 3060 / 4060) for basic inference with 8-bit quantization (see the loading sketch after this list).
- Recommended: 24GB VRAM (RTX 3090 / 4090 / A6000) for smooth inference.
- Optimal: 48GB+ VRAM (A100 / H100) for batch processing of multiple images.
- Fine-Tuning: Requires 2x A100 80GB or 4x H100 for large-scale training.
- For smooth document understanding tasks, at least 24GB VRAM is recommended.
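If you only have around 12GB of VRAM, one option is to load the model with 8-bit quantization via bitsandbytes. This is a minimal sketch rather than an official recipe: it assumes the bitsandbytes and accelerate packages are installed and that the checkpoint quantizes cleanly, so verify output quality on your own documents.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_path = "ibm-granite/granite-vision-3.1-2b-preview"

# 8-bit weight quantization to roughly halve VRAM usage versus fp16
quant_config = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)
Note that a model loaded this way should not be moved with .to(device) afterwards; device placement is handled by device_map.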
2. CPU Requirements
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| CPU Cores | 8 Cores | 16 Cores | 32 Cores |
| Clock Speed | 2.5 GHz | 3.5 GHz+ | 3.8 GHz+ |
| Processor Type | Intel i7 / Ryzen 7 | Intel i9 / Ryzen 9 | AMD EPYC / Intel Xeon |
- Minimum: 8-core CPU (Intel i7 / Ryzen 7) for inference.
- Recommended: 16-core CPU (Intel i9 / Ryzen 9) for fast document processing.
- Optimal: 32-core+ CPU (AMD EPYC / Intel Xeon) for large-scale parallel image processing.
- Use a high-performance CPU for better image pre-processing and tokenization.
3. RAM Requirements
| Task | Minimum RAM | Recommended RAM | Optimal RAM |
|---|---|---|---|
| Single Image Processing | 16GB | 32GB | 64GB |
| Batch Processing | 32GB | 64GB | 128GB |
| Fine-Tuning | 64GB | 128GB+ | 256GB+ |
- Minimum: 16GB RAM for single-image inference.
- Recommended: 32GB RAM for multi-image document processing.
- Optimal: 64GB+ RAM for fine-tuning and high-throughput workloads.
- For multi-image document processing, 32GB RAM is recommended.
4. Disk Space & Storage
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| Disk Space | 40GB SSD | 100GB SSD | 500GB+ NVMe SSD |
| Disk Type | SATA SSD | NVMe SSD | PCIe 4.0 NVMe SSD |
- Minimum: 40GB SSD for model weights and dependencies.
- Recommended: 100GB SSD for storing additional datasets, logs, and output files.
- Optimal: 500GB+ NVMe SSD for fast caching and document storage.
- Use NVMe SSDs for fast storage and loading of visual datasets.
5. Best Practices for Performance
- Use SSD/NVMe for fast storage – Avoid HDDs.
- Monitor GPU Usage – Run nvidia-smi to check VRAM utilization.
- Enable Flash Attention – Optimize memory use and speed (see the sketch after this list).
- Use Quantization (8-bit) – For 12GB VRAM GPUs (see the 8-bit loading sketch under GPU Requirements).
- Scale Across Multiple GPUs – For batch processing.
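As a sketch of the Flash Attention recommendation above: recent transformers releases accept an attn_implementation argument at load time. This assumes the flash-attn package is installed and the GPU supports it (Ampere or newer); if it is not available, simply drop the argument and the model loads with the default attention implementation.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_path = "ibm-granite/granite-vision-3.1-2b-preview"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,              # half precision reduces VRAM use
    attn_implementation="flash_attention_2", # requires the flash-attn package
).to("cuda")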
Step-by-Step Process to Install Granite Vision 2B Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides affordable Virtual Machines at scale while meeting GDPR, SOC2, and ISO27001 compliance requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button on the Dashboard to create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the Granite Vision 2B model on a Jupyter Virtual Machine. This open-source platform lets you install and run the model directly on your GPU node. Running the model in a Jupyter Notebook avoids terminal work, simplifying the process and reducing setup time, so you can configure the model in just a few steps and a few minutes.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ Button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
Step 8: Install Required Dependencies
Ensure you have all necessary dependencies installed.
Run the following in a Jupyter Notebook cell:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install "transformers>=4.49"
!pip install huggingface_hub
!pip install pillow
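To confirm that the installed transformers version meets the 4.49 requirement, you can run a quick check in a new cell:
import transformers
print(transformers.__version__)  # should print 4.49.0 or later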
Step 9: Import Required Libraries
Once installation is complete, import the necessary libraries:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
from PIL import Image
Step 10: Set Device (GPU or CPU)
Ensure that the model runs on CUDA (GPU) if available:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
Step 11: Load the Model and Processor
Now, download and initialize the model and processor:
model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
Step 12: Load an Example Image
Download an example image from Hugging Face Hub:
img_path = hf_hub_download(repo_id=model_path, filename="example.png")
image = Image.open(img_path).convert("RGB")
display(image)  # Display the image inline in the notebook (image.show() would try to open an external viewer)
Alternatively, if you want to use a local image, replace the hf_hub_download() call with:
image = Image.open("your_local_image.png").convert("RGB")
Step 13: Define a Conversation Prompt
Now, structure the image + text input as a conversation:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_path},  # Image input
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},  # Text prompt
        ],
    },
]
Step 14: Preprocess the Input
Now, prepare the inputs for model inference:
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)
Step 15: Run the Model for Inference
Generate a response from the model:
output = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(output[0], skip_special_tokens=True)
print("Model Response:", response)
Step 16: (Optional) Use vLLM for Faster Inference
If you want to use vLLM (optimized inference), install it and use the following code:
Install vLLM
!pip install vllm==0.6.6
Run the Model with vLLM
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
model = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)
question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"<|system|>\nA chat between a user and an assistant.\n<|user|>\n<image>\n{question}\n<|assistant|>\n"
outputs = model.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling_params=sampling_params)
print("Generated text:", outputs[0].outputs[0].text)
Conclusion
Setting up Granite Vision 3.1-2B Preview locally provides a powerful and efficient solution for visual document understanding and content extraction. By following this step-by-step guide, users can deploy the model on a GPU-powered virtual machine, install dependencies, and run image-to-text inference seamlessly using Jupyter Notebook or vLLM for optimized performance. With support for OCR, chart analysis, and general document interpretation, this model is an ideal choice for businesses and researchers looking to integrate multimodal AI into their workflows. Whether for enterprise applications, data extraction, or automation, Granite Vision 3.1-2B ensures accuracy, scalability, and efficiency in AI-driven document processing.