ERNIE-4.5-VL-28B-A3B is a large-scale vision-language model crafted to understand and reason across both text and images. With 28 billion total parameters and 3 billion activated per token, it combines high efficiency with strong multimodal capabilities.
What sets it apart is its thoughtful mixture-of-experts design. By routing inputs through specialized pathways for text and vision, the model delivers accurate, context-aware responses — whether you’re analyzing an image, generating descriptions, or solving reasoning tasks that require both visual and textual understanding.
Optimized during post-training using techniques like RLVR (Reinforcement Learning with Verifiable Rewards), this model offers two modes: thinking and non-thinking. You can control how deeply the model reasons based on the task — from lightweight visual description to detailed interpretation. It runs best on high-end GPUs and is deployable via FastDeploy or Jupyter environments.
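To make the thinking/non-thinking switch concrete, here is a minimal sketch (it uses the same processor call that appears in the full inference script in Step 14 below); the only knob is the enable_thinking flag passed to the chat template, and the text-only prompt here is purely illustrative:

from transformers import AutoProcessor

# Load the ERNIE processor (same call as in the full script in Step 14).
processor = AutoProcessor.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT", trust_remote_code=True
)

# A purely illustrative text-only prompt.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Describe what a mixture-of-experts model is."}]}
]

# enable_thinking=True asks the model for deeper, step-by-step reasoning;
# set it to False for a lighter, direct answer.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
print(text)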
Model Overview
ERNIE-4.5-VL-28B-A3B is a multimodal MoE Chat model, with 28B total parameters and 3B activated parameters for each token. The following are the model configuration details:
| Key | Value |
|---|---|
| Modality | Text & Vision |
| Training Stage | Posttraining |
| Params (Total / Activated) | 28B / 3B |
| Layers | 28 |
| Heads (Q/KV) | 20 / 4 |
| Text Experts (Total / Activated) | 64 / 6 |
| Vision Experts (Total / Activated) | 64 / 6 |
| Shared Experts | 2 |
| Context Length | 131072 |
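If you want to verify these numbers against the checkpoint itself, a small sketch like the one below loads only the model configuration (no weights) and prints it; the exact attribute names for layers, heads, and experts are defined by ERNIE's own config class, so printing the whole object is the simplest way to inspect them:

from transformers import AutoConfig

# Load just the configuration (only the config and the model's remote code
# are fetched, not the weights), then print it to inspect layer count,
# attention heads, expert counts, and context length.
config = AutoConfig.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    trust_remote_code=True,
)
print(config)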
Recommended GPU Configuration Table for ERNIE-4.5-VL-28B-A3B
| GPU Model | GPU Memory (GB) | vCPUs | RAM (GB) | Precision | Use Case |
|---|---|---|---|---|---|
| H100 80GB | 80 | 128 | 256 | FP8 / INT4 | Full-scale multimodal reasoning |
| A100 80GB | 80 | 96 | 192 | BF16 / FP16 | Image-text generation + reasoning |
| A100 40GB x2 | 80 (total) | 96 | 192 | FP16 | Efficient 2-GPU multimodal workloads |
| RTX A6000 | 48 | 48 | 96 | FP16 | Lightweight vision-language inference |
| H100 x4 | 320 (total) | 128 | 512 | INT4 / Quant | High-speed batch inference and scaling |
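Before committing to one of these configurations, you can check what your own machine exposes. The sketch below is a small PyTorch helper (PyTorch is installed later in Step 12) that lists every visible GPU and its memory so you can match it against the table above:

import torch

# List each visible GPU with its total memory so it can be compared
# against the recommended configurations in the table above.
if not torch.cuda.is_available():
    print("No CUDA-capable GPU detected.")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")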
Resources
Link: https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT
Step-by-Step Process to Install ERNIE-4.5-VL-28B-A3B-PT Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 2 x H100 SXM GPUs for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running ERNIE-4.5-VL-28B-A3B-PT, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like ERNIE-4.5-VL-28B-A3B-PT
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like ERNIE-4.5-VL-28B-A3B-PT.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the ERNIE-4.5-VL-28B-A3B-PT runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the currently available Python version:
python3 --version
The system has Python 3.8.1 available by default. To install a newer version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Create and Activate a Virtual Environment using Python 3.11
Run the following command to create and activate a virtual environment using Python 3.11:
python3.11 -m venv ernie_env
source ernie_env/bin/activate
Step 12: Install Required Libraries
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate safetensors
pip install decord sentencepiece
pip install moviepy
If you’re using bfloat16, ensure your GPU supports it (A100, H100, etc.).
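If you are not sure whether your GPU supports bfloat16, the quick check below uses PyTorch's built-in helper to pick a safe dtype before you try to load the model (this is just a convenience sketch; the main script in Step 14 uses torch.bfloat16 directly):

import torch

# bfloat16 is supported on Ampere (A100) and newer GPUs such as the H100;
# fall back to float16 on older cards.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16
print(f"Using dtype: {dtype}")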
Step 13: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you’ll see SSH: 38.29.145.28 (your VM IP) in the bottom-left status bar (like in the image).
Step 14: Download & Run the ERNIE-4.5-VL-28B-A3B-PT Model Script
- Inside VS Code (connected via SSH), make sure your script file (e.g., app.py) is saved.
- Paste the following code into your app.py file inside VS Code:
import torch
import time
from transformers import AutoProcessor, AutoModelForCausalLM

# Step 1: Load model
model_path = "baidu/ERNIE-4.5-VL-28B-A3B-PT"
print("📦 Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# Step 2: Load processor
print("📦 Loading processor...")
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
processor.eval()
model.add_image_preprocess(processor)

# Step 3: Use image URL (recommended method for ERNIE)
image_url = "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
prompt = "What do you see in this image? Describe it in detail."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    }
]

# Step 4: Preprocess with chat template and processor
print("🧪 Preprocessing inputs...")
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
)

# Step 5: Inference
device = next(model.parameters()).device
inputs = inputs.to(device)

print("⚙️ Generating response... (please wait 30–60 sec)")
start = time.time()
generated_ids = model.generate(
    inputs=inputs['input_ids'],
    **inputs,
    max_new_tokens=128
)
output_text = processor.decode(generated_ids[0])
end = time.time()

# Step 6: Output
print(f"✅ Done! Inference took {end - start:.2f} seconds\n")
print("🧠 Model Response:\n", output_text)
Then, run it with:
python3 app.py
Once your environment is set up and you run the script, it begins by loading the ERNIE-4.5-VL model and processor, then fetches the image (locally or via URL), preprocesses it, and performs inference. You’ll notice messages like “Loading checkpoint shards,” “Preprocessing inputs,” and finally “Generating response…”, which may take up to a few minutes depending on your GPU. Once complete, you’ll see the model’s detailed response to your visual prompt. In this case, it successfully identified the scene as a “collage of a woman sitting on a mountain top… looking at the top of flowers and mountains”, showing that the ERNIE model can parse and reason over complex visual inputs accurately.
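One optional tweak: processor.decode(generated_ids[0]) prints the prompt tokens together with the answer, because generate returns the full sequence by default. If you only want the newly generated text, you can replace the decode line in app.py with a small slice like the sketch below (this assumes the standard Transformers behaviour of returning the prompt followed by the new tokens):

# Keep only the tokens generated after the prompt, then decode those.
prompt_len = inputs["input_ids"].shape[1]
new_tokens = generated_ids[0][prompt_len:]
response = processor.decode(new_tokens)
print("🧠 Model Response:\n", response)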
Conclusion
ERNIE-4.5-VL-28B-A3B isn’t just another vision-language model — it’s a powerhouse designed to reason, interpret, and understand the world through both text and imagery. From its cutting-edge MoE architecture to its impressive context length and deep multimodal capabilities, ERNIE 4.5 delivers performance that genuinely pushes the boundaries of what’s possible in open-source multimodal AI.
Whether you’re building smarter applications, testing image-text prompts, or exploring new research directions, deploying ERNIE-4.5-VL on a GPU Virtual Machine gives you complete flexibility and raw inference power at your fingertips. And with NodeShift Cloud, getting started is refreshingly simple, fast, and affordable.
If you’ve followed this guide, you now have a fully working ERNIE-4.5-VL setup — capable of real visual reasoning. And this is just the beginning. Go ahead, feed it more complex images, challenge its thinking mode, and see what insights it uncovers.
Stay tuned — more guides, experiments, and hands-on projects with multimodal giants like ERNIE are on the way!