InternVideo2.5 is an advanced video multimodal large language model (MLLM) designed to process and understand long-form video content with high accuracy. Built on InternVL2.5, it enhances video perception, capturing fine-grained details and long-term temporal structures through task preference optimization (TPO) and adaptive hierarchical token compression (HiCo). The model is optimized for tasks requiring rich contextual understanding, making it highly effective for video analysis, content generation, and interactive AI applications. With improved spatiotemporal reasoning and robust multimodal capabilities, InternVideo2.5 sets a new standard for AI-driven video comprehension.
Performance
| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
|---|---|---|---|
| InternVideo2.5 | 75.7 | 60.6 | 65.1 |
Model Resource
Hugging Face
Link: https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B
1. GPU Requirements
| Task | Minimum VRAM | Recommended VRAM | Optimal VRAM | Recommended GPU |
|---|---|---|---|---|
| Single Video Processing | 24GB (Quantized) | 48GB (Full Precision) | 80GB+ | RTX 4090 / A6000 / A100 40GB |
| Multi-Video Processing | 48GB | 80GB | 96GB+ (Multi-GPU) | A100 80GB / H100 |
| Gradio Deployment | 48GB | 80GB | 96GB+ (Multi-GPU) | A100 80GB / H100 |
| Training / Fine-tuning | 80GB (Single GPU) | 160GB+ (Multi-GPU) | 320GB+ (TPU/Cluster) | 2x A100 80GB / 4x H100 |
- Minimum: 24GB VRAM with 8-bit quantization for running smaller video inputs.
- Recommended: 48GB VRAM for smooth inference with FP16 precision.
- Optimal: 80GB VRAM (A100 / H100) for processing high-resolution videos efficiently.
- Multi-GPU Scaling: For batch processing of videos, use 2x A100 80GB or 4x H100 GPUs.
For handling videos with long sequences (128+ frames), 80GB VRAM is required.
2. CPU Requirements
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| CPU Cores | 16 Cores | 32 Cores | 64 Cores |
| Clock Speed | 2.5 GHz | 3.5 GHz+ | 3.8 GHz+ |
| Processor Type | Intel Xeon / AMD Ryzen 9 | AMD EPYC / Intel Xeon Platinum | AMD Threadripper Pro |
- Minimum: 16 cores for single-video inference.
- Recommended: 32 cores for multi-video processing.
- Optimal: 64 cores for real-time video processing and Gradio UI deployment.
3. RAM Requirements
| Task | Minimum RAM | Recommended RAM | Optimal RAM |
|---|---|---|---|
| Single Video Inference | 32GB | 64GB | 128GB |
| Multi-Video Processing | 64GB | 128GB | 256GB |
| Fine-Tuning | 128GB | 256GB+ | 512GB+ |
- Minimum: 32GB RAM for single-video processing.
- Recommended: 64GB RAM for batch processing and long-sequence video inputs.
- Optimal: 128GB+ RAM for real-time multi-video processing and Gradio UI.
- For large-scale video models, RAM-intensive processing is required. Use 128GB+ RAM for best performance.
4. Disk Space & Storage
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| Disk Space | 100GB SSD | 200GB SSD | 1TB+ NVMe SSD |
| Disk Type | SATA SSD | NVMe SSD | PCIe 4.0 NVMe SSD |
- Minimum: 100GB SSD for model weights, cache, and dependencies.
- Recommended: 200GB NVMe SSD for fast video frame loading and logs.
- Optimal: 1TB+ SSD for storing large video datasets, logs, and multi-video outputs.
- Use NVMe SSDs for faster loading and storage of video frame data.
5. Best Practices for Performance
- Use SSD/NVMe for fast storage – Avoid HDDs.
- Monitor GPU Usage – run nvidia-smi to check VRAM (a quick programmatic check is sketched just below this list).
- Enable Flash Attention – Optimize memory and speed.
- Quantization (8-bit) – For 24GB VRAM GPUs.
- Scale Across Multiple GPUs – For large video inputs.
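Beyond nvidia-smi, you can also poll VRAM usage from inside the notebook. Below is a minimal sketch using PyTorch's CUDA memory APIs; the helper name print_vram_usage is just an illustrative choice.

import torch

def print_vram_usage(device_index=0):
    # Report allocated/reserved VRAM versus total capacity for one GPU
    props = torch.cuda.get_device_properties(device_index)
    allocated = torch.cuda.memory_allocated(device_index) / 1024**3
    reserved = torch.cuda.memory_reserved(device_index) / 1024**3
    total = props.total_memory / 1024**3
    print(f"{props.name}: {allocated:.1f} GB allocated, {reserved:.1f} GB reserved, {total:.1f} GB total")

print_vram_usage()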
Step-by-Step Process to Install InternVideo2.5 Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the InternVideo2.5 model on a Jupyter Virtual Machine. This open-source platform lets you install and run the model directly on your GPU node. By working in a Jupyter Notebook instead of the terminal, you simplify the process and can configure the model in just a few steps and minutes.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ Button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
Step 8: Install PyTorch with GPU Support
Run the following command in Jupyter Notebook to install PyTorch with GPU support:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Step 9: Verify PyTorch Installation
Run the following command in Jupyter Notebook to verify PyTorch Installation:
import torch

print("PyTorch Version:", torch.__version__)
print("CUDA Available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected.")
Step 10: Install Required Dependencies
Run the following command in a Jupyter Notebook cell to install all necessary dependencies:
!pip install transformers==4.40.1 av imageio decord opencv-python flash-attn --no-build-isolation
!pip install torch torchvision safetensors
This installs:
- torch → for deep learning computations.
- transformers → to load and run InternVideo2.5-Chat-8B.
- flash-attn → for optimized attention mechanisms.
- av, imageio, decord, opencv-python → for video processing.
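Optionally, you can sanity-check the installed versions before moving on. This sketch relies only on standard version attributes and importlib, so it should work as long as the packages above installed cleanly.

import importlib.util
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("flash-attn available:", importlib.util.find_spec("flash_attn") is not None)
print("decord available:", importlib.util.find_spec("decord") is not None)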
Step 11: Import Required Libraries
After installation, import the necessary libraries:
import torch
import numpy as np
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
Step 12: Load the Model and Tokenizer
Set the model path and load the tokenizer and model onto the GPU:
# Set model path
model_path = "OpenGVLab/InternVideo2_5_Chat_8B"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load model in bfloat16 directly onto the GPU
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).cuda()
print("Model and tokenizer loaded successfully!")
Step 13: Define Image Processing Functions
InternVideo2.5 requires preprocessing of video frames before passing them to the model. Define the required functions:
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Convert to RGB, resize, and normalize a PIL frame for the model
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def load_video(video_path, input_size=448, num_segments=32):
    # Uniformly sample num_segments frames and return a stacked tensor
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    transform = build_transform(input_size=input_size)
    frame_indices = np.linspace(0, max_frame, num_segments, dtype=int)
    pixel_values = [transform(Image.fromarray(vr[idx].asnumpy()).convert("RGB")) for idx in frame_indices]
    pixel_values = torch.stack(pixel_values)
    return pixel_values.to(torch.bfloat16).to(model.device), len(pixel_values)
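Before running the full pipeline, you can verify the preprocessing on a short clip and check the tensor shape; sample.mp4 below is a placeholder for any local video file.

pixel_values, num_patches = load_video("sample.mp4", num_segments=32)
print(pixel_values.shape)   # expected: torch.Size([32, 3, 448, 448])
print(pixel_values.dtype)   # torch.bfloat16
print("Frames sampled:", num_patches)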
Step 14: Run Inference on a Video
Now, we load a video and generate text-based insights:
# Define video path (replace with your video file)
video_path = "your_video.mp4"

# Load video and preprocess
pixel_values, num_patches = load_video(video_path, num_segments=128)

# Generate video caption
question = "Describe this video in detail."

with torch.no_grad():
    # One <image> placeholder per sampled frame
    video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(num_patches)])
    question_prompt = video_prefix + question
    output, chat_history = model.chat(
        tokenizer,
        pixel_values,
        question_prompt,
        dict(do_sample=False, temperature=0.0, max_new_tokens=1024, top_p=0.1, num_beams=1),
        num_patches_list=[1] * num_patches,  # one patch per frame, matching the <image> placeholders
        history=None,
        return_history=True
    )

print("\nGenerated Description:\n", output)
Step 15: Optimize for Memory (Optional)
If you’re running on limited VRAM, enable 8-bit quantization using bitsandbytes:
!pip install bitsandbytes accelerate
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto"  # bitsandbytes places the quantized weights on the GPU; do not call .cuda()
)
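To see roughly how much memory the quantized weights occupy, transformers models generally expose get_memory_footprint(), which returns the size in bytes:

print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")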
Step 16: Save Model Outputs (Optional)
To store generated text responses in a file:
with open("internvideo_output.txt", "w") as file:
file.write(output)
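If you process several clips, a structured format such as JSON keeps the video path, question, and answer together per result. A minimal sketch (file and field names are only illustrative):

import json

result = {
    "video": video_path,
    "question": question,
    "description": output,
}
with open("internvideo_output.json", "w") as f:
    json.dump(result, f, indent=2)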
Now, you can generate video descriptions, ask follow-up questions, or fine-tune the model for specific video-based applications.
Conclusion
Setting up InternVideo2.5-Chat-8B locally enables efficient video analysis and multimodal reasoning with high accuracy. By following this step-by-step guide, you can deploy the model on a GPU-powered virtual machine using NodeShift Cloud, install the necessary dependencies, and run video-to-text processing seamlessly in Jupyter Notebook. With optimized hardware requirements, performance tuning, and multi-GPU scaling, this model is well-suited for detailed video understanding, real-time processing, and AI-driven insights. Whether you’re working on video analytics, content generation, or interactive applications, InternVideo2.5 offers cutting-edge performance and flexibility for your AI-powered video projects.