InternVideo2.5 is an advanced video multimodal large language model (MLLM) designed to process and understand long-form video content with high accuracy. Built on InternVL2.5, it enhances video perception, capturing fine-grained details and long-term temporal structures through task preference optimization (TPO) and adaptive hierarchical token compression (HiCo). The model is optimized for tasks requiring rich contextual understanding, making it highly effective for video analysis, content generation, and interactive AI applications. With improved spatiotemporal reasoning and robust multimodal capabilities, InternVideo2.5 sets a new standard for AI-driven video comprehension.
Performance
| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
|---|---|---|---|
| InternVideo2.5 | 75.7 | 60.6 | 65.1 |
Model Resource
Hugging Face
Link: https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B
1. GPU Requirements
| Task | Minimum VRAM | Recommended VRAM | Optimal VRAM | Recommended GPU |
|---|---|---|---|---|
| Single Video Processing | 24GB (Quantized) | 48GB (Full Precision) | 80GB+ | RTX 4090 / A6000 / A100 40GB |
| Multi-Video Processing | 48GB | 80GB | 96GB+ (Multi-GPU) | A100 80GB / H100 |
| Gradio Deployment | 48GB | 80GB | 96GB+ (Multi-GPU) | A100 80GB / H100 |
| Training / Fine-tuning | 80GB (Single GPU) | 160GB+ (Multi-GPU) | 320GB+ (TPU/Cluster) | 2x A100 80GB / 4x H100 |
- Minimum: 24GB VRAM with 8-bit quantization for running smaller video inputs.
- Recommended: 48GB VRAM for smooth inference with FP16 precision.
- Optimal: 80GB VRAM (A100 / H100) for processing high-resolution videos efficiently.
- Multi-GPU Scaling: For batch processing of videos, use 2x A100 80GB or 4x H100 GPUs.
For handling videos with long sequences (128+ frames), 80GB VRAM is required.
2. CPU Requirements
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| CPU Cores | 16 Cores | 32 Cores | 64 Cores |
| Clock Speed | 2.5 GHz | 3.5 GHz+ | 3.8 GHz+ |
| Processor Type | Intel Xeon / AMD Ryzen 9 | AMD EPYC / Intel Xeon Platinum | AMD Threadripper Pro |
- Minimum: 16 cores for single-video inference.
- Recommended: 32 cores for multi-video processing.
- Optimal: 64 cores for real-time video processing and Gradio UI deployment.
3. RAM Requirements
| Task | Minimum RAM | Recommended RAM | Optimal RAM |
|---|---|---|---|
| Single Video Inference | 32GB | 64GB | 128GB |
| Multi-Video Processing | 64GB | 128GB | 256GB |
| Fine-Tuning | 128GB | 256GB+ | 512GB+ |
- Minimum: 32GB RAM for single-video processing.
- Recommended: 64GB RAM for batch processing and long-sequence video inputs.
- Optimal: 128GB+ RAM for real-time multi-video processing and Gradio UI.
- For large-scale video models, RAM-intensive processing is required. Use 128GB+ RAM for best performance.
4. Disk Space & Storage
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| Disk Space | 100GB SSD | 200GB SSD | 1TB+ NVMe SSD |
| Disk Type | SATA SSD | NVMe SSD | PCIe 4.0 NVMe SSD |
- Minimum: 100GB SSD for model weights, cache, and dependencies.
- Recommended: 200GB NVMe SSD for fast video frame loading and logs.
- Optimal: 1TB+ SSD for storing large video datasets, logs, and multi-video outputs.
- Use NVMe SSDs for faster loading and storage of video frame data.
5. Best Practices for Performance
- Use SSD/NVMe for fast storage – Avoid HDDs.
- Monitor GPU Usage – run nvidia-smi to check VRAM (a quick programmatic check is sketched just below this list).
- Enable Flash Attention – Optimize memory and speed.
- Quantization (8-bit) – For 24GB VRAM GPUs.
- Scale Across Multiple GPUs – For large video inputs.
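Beyond nvidia-smi, you can also poll VRAM usage from inside the notebook. Below is a minimal sketch using PyTorch's CUDA memory APIs; the helper name print_vram_usage is just an illustrative choice.

import torch

def print_vram_usage(device_index=0):
    # Report allocated/reserved VRAM versus total capacity for one GPU
    props = torch.cuda.get_device_properties(device_index)
    allocated = torch.cuda.memory_allocated(device_index) / 1024**3
    reserved = torch.cuda.memory_reserved(device_index) / 1024**3
    total = props.total_memory / 1024**3
    print(f"{props.name}: {allocated:.1f} GB allocated, {reserved:.1f} GB reserved, {total:.1f} GB total")

print_vram_usage()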
Step-by-Step Process to Install InternVideo2.5 Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the InternVideo2.5 model on a Jupyter Virtual Machine. This open-source platform lets you install and run the model directly on your GPU node. By working in a Jupyter Notebook instead of the terminal, you simplify the process and can configure the model in just a few steps and minutes.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ Button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
Step 8: Install PyTorch with GPU Support
Run the following command in Jupyter Notebook to install PyTorch with GPU support:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Step 9: Verify PyTorch Installation
Run the following command in Jupyter Notebook to verify PyTorch Installation:
import torch

print("PyTorch Version:", torch.__version__)
print("CUDA Available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected.")
Step 10: Install Required Dependencies
Run the following command in a Jupyter Notebook cell to install all necessary dependencies:
!pip install transformers==4.40.1 av imageio decord opencv-python flash-attn --no-build-isolation
!pip install torch torchvision safetensors
This installs:
- torch → for deep learning computations.
- transformers → to load and run InternVideo2.5-Chat-8B.
- flash-attn → for optimized attention mechanisms.
- av, imageio, decord, opencv-python → for video processing.
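Optionally, you can sanity-check the installed versions before moving on. This sketch relies only on standard version attributes and importlib, so it should work as long as the packages above installed cleanly.

import importlib.util
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("flash-attn available:", importlib.util.find_spec("flash_attn") is not None)
print("decord available:", importlib.util.find_spec("decord") is not None)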
Step 11: Import Required Libraries
After installation, import the necessary libraries:
import torch
import numpy as np
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
Step 12: Load the Model and Tokenizer
Set the model path and load the tokenizer and model onto the GPU:
# Set model path
model_path = "OpenGVLab/InternVideo2_5_Chat_8B"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load model in bfloat16 directly onto the GPU
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).cuda()
print("Model and tokenizer loaded successfully!")
Step 13: Define Image Processing Functions
InternVideo2.5 requires preprocessing of video frames before passing them to the model. Define the required functions:
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Convert to RGB, resize, and normalize a PIL frame for the model
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def load_video(video_path, input_size=448, num_segments=32):
    # Uniformly sample num_segments frames and return a stacked tensor
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    transform = build_transform(input_size=input_size)
    frame_indices = np.linspace(0, max_frame, num_segments, dtype=int)
    pixel_values = [transform(Image.fromarray(vr[idx].asnumpy()).convert("RGB")) for idx in frame_indices]
    pixel_values = torch.stack(pixel_values)
    return pixel_values.to(torch.bfloat16).to(model.device), len(pixel_values)
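Before running the full pipeline, you can verify the preprocessing on a short clip and check the tensor shape; sample.mp4 below is a placeholder for any local video file.

pixel_values, num_patches = load_video("sample.mp4", num_segments=32)
print(pixel_values.shape)   # expected: torch.Size([32, 3, 448, 448])
print(pixel_values.dtype)   # torch.bfloat16
print("Frames sampled:", num_patches)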
Step 14: Run Inference on a Video
Now, we load a video and generate text-based insights:
# Define video path (replace with your video file)
video_path = "your_video.mp4"

# Load video and preprocess
pixel_values, num_patches = load_video(video_path, num_segments=128)

# Generate video caption
question = "Describe this video in detail."

with torch.no_grad():
    # One <image> placeholder per sampled frame
    video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(num_patches)])
    question_prompt = video_prefix + question
    output, chat_history = model.chat(
        tokenizer,
        pixel_values,
        question_prompt,
        dict(do_sample=False, temperature=0.0, max_new_tokens=1024, top_p=0.1, num_beams=1),
        num_patches_list=[1] * num_patches,  # one patch per frame, matching the <image> placeholders
        history=None,
        return_history=True
    )

print("\nGenerated Description:\n", output)
Step 15: Optimize for Memory (Optional)
If you’re running on limited VRAM, enable 8-bit quantization using bitsandbytes:
!pip install bitsandbytes accelerate
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto"  # bitsandbytes places the quantized weights on the GPU; do not call .cuda()
)
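To see roughly how much memory the quantized weights occupy, transformers models generally expose get_memory_footprint(), which returns the size in bytes:

print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")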
Step 16: Save Model Outputs (Optional)
To store generated text responses in a file:
with open("internvideo_output.txt", "w") as file:
file.write(output)
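If you process several clips, a structured format such as JSON keeps the video path, question, and answer together per result. A minimal sketch (file and field names are only illustrative):

import json

result = {
    "video": video_path,
    "question": question,
    "description": output,
}
with open("internvideo_output.json", "w") as f:
    json.dump(result, f, indent=2)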
Now, you can generate video descriptions, ask follow-up questions, or fine-tune the model for specific video-based applications.
Conclusion
Setting up InternVideo2.5-Chat-8B locally enables efficient video analysis and multimodal reasoning with high accuracy. By following this step-by-step guide, you can deploy the model on a GPU-powered virtual machine using NodeShift Cloud, install the necessary dependencies, and run video-to-text processing seamlessly in Jupyter Notebook. With optimized hardware requirements, performance tuning, and multi-GPU scaling, this model is well-suited for detailed video understanding, real-time processing, and AI-driven insights. Whether you’re working on video analytics, content generation, or interactive applications, InternVideo2.5 offers cutting-edge performance and flexibility for your AI-powered video projects.