Qwen2.5-VL-7B-Instruct is an advanced vision-language model designed to understand and process both visual and textual inputs with high accuracy. It excels at recognizing and analyzing objects, text, charts, icons, and layouts within images. The model can function as a visual assistant, interact with various tools, and even comprehend long videos, pinpointing key events effectively.
With improved visual localization, Qwen2.5-VL-7B-Instruct generates structured outputs for scanned documents, tables, and invoices, making it useful in fields like finance and commerce. Its optimized vision encoder ensures faster performance, while dynamic resolution and frame rate training enhance video comprehension. Designed with efficiency in mind, this model offers a powerful tool for tasks requiring detailed visual understanding and interaction.
Image Benchmarks
| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B |
| --- | --- | --- | --- | --- | --- |
| MMMU_val | 56 | 50.4 | 60 | 54.1 | 58.6 |
| MMMU-Pro_val | 34.3 | – | 37.6 | 30.5 | 41.0 |
| DocVQA_test | 93 | 93 | – | 94.5 | 95.7 |
| InfoVQA_test | 77.6 | – | – | 76.5 | 82.6 |
| ChartQA_test | 84.8 | – | – | 83.0 | 87.3 |
| TextVQA_val | 79.1 | 80.1 | – | 84.3 | 84.9 |
| OCRBench | 822 | 852 | 785 | 845 | 864 |
| CC_OCR | 57.7 | – | – | 61.6 | 77.8 |
| MMStar | 62.8 | – | – | 60.7 | 63.9 |
| MMBench-V1.1-En_test | 79.4 | 78.0 | 76.0 | 80.7 | 82.6 |
| MMT-Bench_test | – | – | – | 63.7 | 63.6 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | 67.1 |
| HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 | 52.9 |
| MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 | 68.2 |
| MathVision | – | – | – | 16.3 | 25.07 |
Video Benchmarks
| Benchmark | Qwen2-VL-7B | Qwen2.5-VL-7B |
| --- | --- | --- |
| MVBench | 67.0 | 69.6 |
| PerceptionTest_test | 66.9 | 70.5 |
| Video-MME (wo/w subs) | 63.3/69.0 | 65.1/71.6 |
| LVBench | – | 45.3 |
| LongVideoBench | – | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | – | 71.7 |
| MLVU | – | 70.2 |
| CharadesSTA (mIoU) | – | 43.6 |
Agent Benchmarks
| Benchmark | Qwen2.5-VL-7B |
| --- | --- |
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |
Model Resource
Hugging Face
Link: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
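If you prefer to pre-download the weights before setting up the environment, here is a minimal optional sketch using the huggingface_hub library (this step is not required by the installation process below):

from huggingface_hub import snapshot_download

# Download the full model repository into the local Hugging Face cache
snapshot_download(repo_id="Qwen/Qwen2.5-VL-7B-Instruct")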
Prerequisites for Installing Qwen2.5-VL-7B-Instruct Model Locally
1. GPU Requirements
| GPU Model | VRAM | Recommended Use Case |
| --- | --- | --- |
| RTX 3090 | 24GB | Minimum for inference with quantization. |
| RTX 4090 | 24GB | Ideal for text-image understanding tasks. |
| RTX A6000 | 48GB | Smooth multimodal inference and generation. |
| NVIDIA A100 (40GB) | 40GB | Optimized for vision-language understanding. |
| NVIDIA A100 (80GB) | 80GB | Best for handling long-context vision tasks. |
| NVIDIA H100 (80GB) | 80GB | High-throughput processing for video understanding. |
- Minimum: 24GB VRAM (with quantization)
- Recommended: 48GB+ VRAM for full performance (e.g., NVIDIA A6000 or A100)
- Optimal: 80GB VRAM for long-context, video-heavy tasks (e.g., NVIDIA H100)
For video processing, an 80GB A100/H100 GPU is highly recommended due to extended memory needs.
2. CPU Requirements
- Text-based tasks: 16-core CPU is sufficient.
- Multimodal (image/video) tasks: 32+ cores recommended for fast preprocessing.
| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU Cores | 16 cores | 32+ cores |
| Clock Speed | 2.5 GHz | 3.5+ GHz |
| Processor Type | AMD EPYC / Intel Xeon | AMD Threadripper / Intel Xeon Platinum |
3. RAM Requirements
- Minimum: 32GB RAM for smooth operation with images.
- Recommended: 64GB RAM for vision-heavy tasks.
- Optimal: 128GB+ RAM for handling long videos.
| Task Type | Minimum RAM | Recommended RAM |
| --- | --- | --- |
| Image tasks | 32GB | 64GB |
| Long video tasks | 64GB | 128GB+ |
4. Disk Space & Storage
- Minimum: 50GB free space for model weights and temporary files.
- Recommended: 1TB NVMe SSD for faster model loading and caching.
- High-speed SSD storage is crucial for video-heavy tasks.
| Component | Minimum | Recommended |
| --- | --- | --- |
| Disk Space | 50GB SSD | 1TB NVMe SSD |
| Disk Type | SATA SSD | NVMe SSD |
Summary: Recommended System Build
| Component | Recommended Specification |
| --- | --- |
| GPU | NVIDIA A6000 (48GB) / A100 (80GB) / H100 (80GB) |
| CPU | AMD EPYC 64-core / Intel Xeon 32-core |
| RAM | 64GB (image tasks) / 128GB (video tasks) |
| Storage | 1TB NVMe SSD |
| Power | 850W+ PSU |
| Cooling | Liquid cooling or high-performance air cooling |
Best Practices
- Use SSD Storage – Avoid HDDs for model loading and inference.
- Monitor GPU Usage – Run nvidia-smi to check VRAM consumption.
- Enable Flash Attention – For efficient memory usage in multi-image/video inference.
- Quantize the Model – If using a lower-VRAM GPU, use 8-bit or 4-bit quantization (see the sketch after this list).
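As a sketch of the quantization tip above, here is a minimal example that assumes the bitsandbytes package is installed; the specific 4-bit settings are illustrative starting points rather than tuned values:

from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization sharply reduces the VRAM needed for the weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)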
Step-by-Step Process to Install Qwen2.5-VL-7B-Instruct Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the Qwen2.5-VL-7B-Instruct model on a Jupyter Virtual Machine. This open-source platform lets you install and run the model directly on your GPU node. Working in a Jupyter Notebook instead of the terminal simplifies the process and reduces setup time, so you can configure the model in just a few steps and minutes.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ Button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) Notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
Step 8: Install PyTorch with GPU Support
Run the following command in Jupyter Notebook to install PyTorch with GPU support:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Step 9: Verify PyTorch Installation
Run the following command in Jupyter Notebook to verify PyTorch Installation:
import torch

print("PyTorch Version:", torch.__version__)
print("CUDA Available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected.")
Step 10: Install Dependencies
Run the following command in Jupyter Notebook to install the dependencies:
!pip install git+https://github.com/huggingface/transformers accelerate
!pip install "qwen-vl-utils[decord]==0.0.8"
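Optionally, run a quick import check in the next cell to confirm the packages installed correctly (qwen_vl_utils provides the process_vision_info helper used later):

import transformers
from qwen_vl_utils import process_vision_info

print("Transformers version:", transformers.__version__)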
Step 11: Load the Model and Processor
Run the following code to load the model:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Load the model on the GPU
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto"
)
# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
print("Model and Processor Loaded Successfully!")
Step 12: Run Inference on an Image
Now, let’s test the model by providing an image and asking it to describe the image.
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
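Besides a local file path, qwen-vl-utils also accepts an image given as an HTTP(S) URL or a base64-encoded string. For example (the URL below is only a placeholder), and either version works with the processing code in the next step:

messages = [
    {
        "role": "user",
        "content": [
            # A web image works in place of a local file path
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]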
Step 13: Process the Input and Generate Output
Now, prepare the input for the model:
# Convert input into the required format
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
# Tokenize and move to GPU
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Trim the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
# Decode and print result
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("Generated Output:", output_text)
Example 1
Example 2
Step 14: Video Inference
If you want to run inference on a video, use:
messages = [
{
"role": "user",
"content": [
{"type": "video", "video": "file:///path/to/your/video.mp4"},
{"type": "text", "text": "Summarize this video."},
],
}
]
# Process video input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
# Tokenize and move to GPU
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Trim the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
# Decode and print result
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("Generated Output:", output_text)
Step 15: Improve Performance with Image Resolution Settings
You can bound the input image resolution to balance speed and memory usage:
# The vision encoder counts image tokens in 28x28-pixel patches, so this allows 256–1280 tokens per image
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
Or, define exact image size:
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg", "resized_height": 280, "resized_width": 420},
{"type": "text", "text": "Describe this image."},
],
}
]
Step 16: Handling Long Texts
Qwen2.5-VL supports a context length of up to 32,768 tokens. If you need to handle longer inputs, enable YaRN by adding the following to the model's config.json:
{
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768
}
Conclusion
In conclusion, Qwen2.5-VL-7B-Instruct is a powerful vision-language model designed for high-accuracy text and image processing, video comprehension, and structured data analysis. With its optimized vision encoder, advanced localization, and tool interaction capabilities, it excels in various multimodal tasks across different industries. Its efficient deployment on GPU-powered setups, along with detailed installation and usage instructions, makes it a reliable choice for developers and researchers looking to leverage vision-language capabilities for real-world applications.