DeepSeek-VL2 is a powerful vision-language model designed to handle a wide range of visual and text-based tasks, including visual question answering, optical character recognition, document analysis, and object localization. It builds on a Mixture-of-Experts (MoE) architecture, offering efficient processing and improved accuracy.
The model series includes three versions—DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2—with varying numbers of activated parameters to suit different use cases. DeepSeek-VL2 is optimized for accuracy while maintaining efficiency, making it a strong choice for complex multimodal tasks. It supports commercial use and is available under the MIT License.
Resources
- Hugging Face: https://huggingface.co/deepseek-ai/deepseek-vl2
- GitHub: https://github.com/deepseek-ai/DeepSeek-VL2
1. GPU Requirements
Model Variant | VRAM Requirement (Inference) | VRAM Requirement (Gradio) | Recommended GPU
---|---|---|---
DeepSeek-VL2-Tiny (1.0B activated params) | 16GB (8-bit quantization) | 24GB | RTX 3090 / 4090 / A5000
DeepSeek-VL2-Small (2.8B activated params) | 40GB (incremental prefilling) | 48GB+ | A100 40GB / A6000
DeepSeek-VL2 (4.5B activated params) | 80GB (full performance) | 80GB+ | RTX A6000 / A100 80GB / H100
- Minimum: 16GB VRAM (for Tiny variant with quantization).
- Recommended: 48GB VRAM for smooth execution of DeepSeek-VL2-Small.
- Optimal: 80GB VRAM for full performance of DeepSeek-VL2 and high-resolution Gradio demos.
- GPU Type: NVIDIA GPUs with Tensor Cores (e.g., RTX 4090, A6000, A100, H100).
For multiple image inference or Gradio-based interactive UI, 48GB+ VRAM is recommended.
2. CPU Requirements
Component | Minimum | Recommended
---|---|---
CPU Cores | 16 cores | 32+ cores
Clock Speed | 2.5 GHz | 3.5 GHz+
Processor Type | AMD EPYC / Intel Xeon | AMD Threadripper / Intel Xeon Platinum
- Multimodal tasks require efficient CPU preprocessing, especially when handling images, charts, and documents.
3. RAM Requirements
Task Type | Minimum RAM | Recommended RAM
---|---|---
Text-only tasks | 16GB | 32GB
Text + Image | 32GB | 64GB
Text + Multiple Images / Gradio UI | 64GB | 128GB
- Minimum: 32GB RAM (for text and single-image processing).
- Recommended: 64GB+ RAM (for multiple images and longer context window).
- Optimal: 128GB RAM (for Gradio UI with multi-image or complex visual grounding tasks).
4. Disk Space & Storage
Component | Minimum | Recommended
---|---|---
Disk Space | 50GB SSD | 200GB NVMe SSD
Disk Type | SATA SSD | NVMe SSD
- Minimum: 50GB free storage for model weights and inference scripts.
- Recommended: 200GB SSD for storing datasets, checkpoints, logs, and other assets.
- Use an NVMe SSD to reduce model load times.
5. Network Requirements
Component | Minimum | Recommended
---|---|---
Internet Speed | 100 Mbps | 1 Gbps+
Cloud VM | Any GPU VM | Cloud GPUs (A100/H100)
- If running DeepSeek-VL2 on a cloud VM, ensure high-speed networking (1 Gbps) for fast model downloads and dataset handling.
6. Optimizations for Gradio UI
- Reduce the batch size for image processing to optimize VRAM usage.
- Use mixed precision (bfloat16) for faster performance (see the loading sketch after this list).
- Enable memory-efficient attention (flash attention) for better scaling.
- Deploy on multiple GPUs for better parallelism.
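Below is a minimal loading sketch illustrating the mixed-precision and memory-efficient attention settings. It is an assumption-based example, not code taken from the DeepSeek-VL2 repository: it presumes the model class accepts the standard Hugging Face from_pretrained keyword arguments (torch_dtype, attn_implementation) and that the flash-attn package is installed; the bundled web_demo.py may expose these options differently.
import torch
from deepseek_vl.models import DeepseekVLV2ForCausalLM  # package may be named deepseek_vl2 in newer repo versions

# Hedged sketch: load in bfloat16 with FlashAttention-2 style attention, then move to GPU.
model = DeepseekVLV2ForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-vl2-small",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,               # mixed precision for lower VRAM and faster math
    attn_implementation="flash_attention_2",  # memory-efficient attention (requires flash-attn)
)
model = model.cuda().eval()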
7. Best Practices for Performance
- Use SSD/NVMe for storage – avoid HDDs for model loading.
- Monitor GPU usage – run nvidia-smi to check VRAM usage (a small Python-based check follows this list).
- Enable Flash Attention – for efficient memory handling.
- Use incremental prefilling – reduces GPU memory usage.
- Multi-GPU scaling – ideal for parallel image processing.
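If you prefer to check memory from inside Python rather than with nvidia-smi, the small torch-based snippet below reports what PyTorch sees on the first visible GPU (it only accounts for memory managed by PyTorch, so driver-level usage may differ slightly):
import torch

# Print total, reserved, and allocated memory for GPU 0 as seen by PyTorch.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    reserved_gib = torch.cuda.memory_reserved(0) / 1024**3
    allocated_gib = torch.cuda.memory_allocated(0) / 1024**3
    print(f"{props.name}: total {total_gib:.1f} GiB, reserved {reserved_gib:.1f} GiB, allocated {allocated_gib:.1f} GiB")
else:
    print("No CUDA device visible to PyTorch")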
8. Summary: Recommended System Build
Component | Recommended Specification
---|---
GPU | NVIDIA A6000 (48GB) / A100 (80GB) / H100 (80GB)
CPU | AMD EPYC 64-core / Intel Xeon 32-core
RAM | 64GB (image tasks) / 128GB (Gradio UI)
Storage | 200GB NVMe SSD
Network | 1 Gbps Cloud VM (for cloud hosting)
Step-by-Step Process to Install DeepSeek VL2 Small – MoE Vision Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side. Select the GPU Nodes option, create a GPU Node in the Dashboard, click the Create GPU Node button, and create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy DeepSeek VL2 Small – MoE Vision on an NVIDIA CUDA Virtual Machine. This proprietary parallel computing platform allows you to install and run DeepSeek VL2 Small – MoE Vision on your GPU Node.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the Python version currently available on the system:
python3 --version
By default, the system ships with Python 3.8.1. To install a newer version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-distutils python3.11-venv
Step 10: Update the Default Python3 Version
Now, run the following commands to register both Python versions and set the new one as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
python3 -m ensurepip --upgrade
python3 -m pip install --upgrade pip
Then, run the following command to check the version of pip:
pip --version
Step 12: Clone the Repository
Run the following command to clone the Deepseek-vl2 repository:
git clone https://github.com/deepseek-ai/deepseek-vl2.git
cd deepseek-vl2
Step 13: Setup Environment
Run the following commands to set up the environment:
python -m venv deepseek_env
source deepseek_env/bin/activate
# On Windows: deepseek_env\Scripts\activate
Step 14: Install Dependencies
Run the following command to install the dependencies:
pip install -e .
Step 15: Install Gradio
Run the following command to install Gradio:
pip install gradio==3.48.0
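As a quick sanity check that Gradio was installed into the active virtual environment, you can print its version from Python; this only verifies the import, while the actual UI is launched later by web_demo.py:
import gradio as gr
print(gr.__version__)  # should print 3.48.0 for the pinned version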
Step 16: Check Model and Commands
The repository provides example commands to run the web demo with the different model variants. Note that you should set the CUDA_VISIBLE_DEVICES environment variable to the GPU you wish to use (GPU 2 in these examples) and specify the appropriate model name, port, and, if needed, the --chunk_size parameter.
1. For the VL2-Tiny Model
- Model Details:
- Total parameters: 3.37B MoE
- Activated parameters: 1B
- Suitable for a single GPU with less than 40GB memory
CUDA_VISIBLE_DEVICES=2 python web_demo.py \
--model_name "deepseek-ai/deepseek-vl2-tiny" \
--port 37914
2. For the VL2-Small Model
- Model Details:
- Total parameters: 16.1B MoE
- Activated parameters: 2.8B
- Memory Note:
- When running on an A100 40GB GPU, you should set --chunk_size 512 to save memory via incremental prefilling (at the expense of speed).
- On GPUs with more than 40GB of memory, you can omit --chunk_size 512 for a faster response.
- Command (for a 40GB GPU):
CUDA_VISIBLE_DEVICES=2 python web_demo.py \
--model_name "deepseek-ai/deepseek-vl2-small" \
--port 37914 \
--chunk_size 512
3. For the VL2 (Full) Model
- Model Details:
- Total parameters: 27.5B MoE
- Activated parameters: 4.5B
- Command:
CUDA_VISIBLE_DEVICES=2 python web_demo.py \
--model_name "deepseek-ai/deepseek-vl2" \
--port 37914
How to Use These Commands
- Set the GPU: The CUDA_VISIBLE_DEVICES=2 part tells the system to use GPU number 2. Adjust this value according to your system’s GPU configuration.
- Run the Demo Script: The python web_demo.py command launches the Gradio-based web demo.
- Specify the Model Variant: Use the --model_name parameter to choose between the different model variants:
  - "deepseek-ai/deepseek-vl2-tiny"
  - "deepseek-ai/deepseek-vl2-small"
  - "deepseek-ai/deepseek-vl2"
- Set the Port: The --port 37914 argument sets the port on which the web server will run. Open your browser and navigate to http://<your_server_ip>:37914 to access the demo.
- Optional Memory Tuning: For the small model on a GPU with 40GB of memory, the additional --chunk_size 512 argument is recommended for memory-saving incremental prefilling.
Step 17: Verify Your GPU Availability
Run the following command in your terminal to see if your GPU is recognized by the system:
nvidia-smi
Step 18: Run Deepseek-vl2-tiny Model
Execute the following command to run the deepseek-vl2-tiny model:
python3 web_demo.py --model_name "deepseek-ai/deepseek-vl2-tiny" --port 37914
Step 19: Access the Application
The application is now accessible at:
Running on local URL: http://0.0.0.0:37914
Running on public URL: https://8df6de5304350b2ecc.gradio.live
Step 20: Play with Deepseek-vl2-tiny Model
Step 21: Run Deepseek-vl2-small Model
Execute the following command to run the deepseek-vl2-small model:
CUDA_VISIBLE_DEVICES=0 python3 web_demo.py --model_name "deepseek-ai/deepseek-vl2-small" --port 37914 --chunk_size 512
Step 22: Access the Application
The application is now accessible at:
Running on local URL: http://0.0.0.0:37914
Running on public URL: https://8df6de5304350b2ecc.gradio.live
Step 23: Play with Deepseek-vl2-small Model
Step 24: Run Deepseek-vl2 Model
Execute the following command to run the deepseek-vl2 model:
CUDA_VISIBLE_DEVICES=0 python web_demo.py --model_name "deepseek-ai/deepseek-vl2" --port 37914
Step 25: Access the Application
The application is now accessible at:
Running on local URL: http://0.0.0.0:37914
Running on public URL: https://8df6de5304350b2ecc.gradio.live
Step 26: Play with Deepseek-vl2 Model
- For Inference Only: DeepSeek-VL2-Tiny can run on a 16GB GPU with quantization, but the full model requires 80GB VRAM.
- For Gradio Deployment: At least 48GB VRAM is required for multi-image handling, and 80GB VRAM is ideal for full-scale applications.
- Optimization Strategies:
- Chunked Inference (for 40GB GPUs).
- Flash Attention (for efficient multi-image processing).
- Quantization (for limited-VRAM GPUs; a hedged loading sketch follows below).
Deploy DeepSeek-VL2 on the right hardware for best performance! 🚀
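For the quantization route on a limited-VRAM GPU, the sketch below shows one common way to load a checkpoint in 8-bit via bitsandbytes and the Hugging Face API. Treat it as an assumption rather than an official DeepSeek-VL2 recipe: it presumes the weights load through AutoModelForCausalLM with trust_remote_code=True and that bitsandbytes is installed (pip install bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hedged 8-bit loading sketch for the Tiny variant; verify against the repository before relying on it.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-vl2-tiny",
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",  # let accelerate place layers across the available GPU/CPU memory
)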
Note: This is a step-by-step guide for interacting with your models. It covers the first method: installing the Tiny, Small, and full VL2 models and using them through the Gradio interface. If you want to run these models through a standalone inference script instead, follow the steps below:
Step 1: Running a Simple Inference Example
Use the provided sample code to test the model. Create a Python script (for example, inference_example.py) with the following content:
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl.utils.io import load_pil_images
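# Note: depending on the repository version, the Python package may be named deepseek_vl2
# rather than deepseek_vl; match these imports to the package directory in your cloned repo.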
# Specify the model path – you can choose among the variants (tiny, small, or full)
model_path = "deepseek-ai/deepseek-vl2-small"
# Load the processor (includes the tokenizer)
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
# Load the model; adjust torch precision and device as needed
vl_gpt = DeepseekVLV2ForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
## Single image conversation example:
conversation = [
{
"role": "<|User|>",
"content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
"images": ["./images/visual_grounding.jpeg"],
},
{"role": "<|Assistant|>", "content": ""}
]
# Load images and prepare inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
conversations=conversation,
images=pil_images,
force_batchify=True,
system_prompt=""
).to(vl_gpt.device)
# Generate image embeddings using the model’s image encoder
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# Generate the response from the language model
outputs = vl_gpt.language_model.generate(
inputs_embeds=inputs_embeds,
attention_mask=prepare_inputs.attention_mask,
pad_token_id=tokenizer.eos_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=512,
do_sample=False,
use_cache=True
)
# Decode and print the answer
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"Assistant: {answer}")
Step 2: Run the Inference Example
In your terminal (with the virtual environment activated), run:
python inference_example.py
If everything is set up correctly, the model will process the sample conversation and output the generated answer.
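The requirements above also mention multi-image workloads. The snippet below is a hedged sketch of how such a conversation could look, not the repository's official multi-image example: it assumes the processor pairs each <image> placeholder with the corresponding entry in the images list, and the file names and question are placeholders.
# Hypothetical multi-image conversation; image paths and the question are placeholders.
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<image>\nDescribe the difference between these two images.",
        "images": ["./images/example_a.jpeg", "./images/example_b.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]
# The rest of the pipeline (load_pil_images, vl_chat_processor, prepare_inputs_embeds, generate) is unchanged.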
Conclusion
In this guide, we explored DeepSeek-VL2, a powerful vision-language model designed for advanced multimodal understanding. We provided a detailed step-by-step tutorial on setting up DeepSeek-VL2 on a GPU-powered virtual machine using NodeShift, covering hardware requirements, installation steps, and optimization strategies. Additionally, we demonstrated how to deploy and interact with the model using the Gradio UI and simple inference scripts. By following this guide, you’ve learned how to install dependencies, configure your environment, and run DeepSeek-VL2 efficiently. Whether for document analysis, visual question answering, or multi-image tasks, DeepSeek-VL2 offers a robust solution for complex vision-language applications.