Nemotron-Research-Reasoning-Qwen-1.5B is a compact powerhouse built for solving complex reasoning tasks across math, code, science, and logic. Developed by NVIDIA, it’s designed to think through problems the way a sharp student would—step by step, carefully, and with clarity.
From tackling Olympiad-style math puzzles to debugging code and breaking down scientific explanations, this model punches well above its weight. It’s the result of deep training across diverse and challenging topics, making it ideal for research, development, and anyone curious about how far small models can go when taught to think smart—not just big.
The leading generalist reasoning model for cutting-edge research and development.
Evaluation Results
Table 1: Performance (pass@1) comparison on benchmarks across the Math domain.
Model | AIME24 | AIME25 | AMC | Math | Minerva | Olympiad | Avg |
---|---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 28.54 | 22.71 | 62.58 | 82.90 | 26.38 | 43.58 | 44.45 |
DeepScaleR-1.5B | 40.21 | 31.46 | 73.04 | 89.36 | 41.57 | 51.63 | 54.54 |
DeepSeek-R1-Distill-Qwen-7B | 53.54 | 40.83 | 82.83 | 93.68 | 50.60 | 57.66 | 63.19 |
Nemotron-Research-Reasoning-Qwen-1.5B | 48.13 | 33.33 | 79.29 | 91.89 | 47.98 | 60.22 | 60.14 |
Table 2: Performance (pass@1) comparison across benchmarks for Code. We abbreviate benchmark names for codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
Model | apps | cc | cf | taco | human | LCB | Avg |
---|---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 20.95 | 16.79 | 14.13 | 8.03 | 61.77 | 16.80 | 23.08 |
DeepCoder-1.5B | 30.37 | 23.76 | 21.70 | 13.76 | 73.40 | 22.76 | 30.96 |
DeepSeek-R1-Distill-Qwen-7B | 42.08 | 32.76 | 33.08 | 19.08 | 83.32 | 38.04 | 41.39 |
Nemotron-Research-Reasoning-Qwen-1.5B | 41.99 | 31.80 | 34.50 | 20.81 | 72.05 | 23.81 | 37.49 |
Table 3: Performance comparison on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym) tasks. We also present results on OOD tasks: acre, boxnet, and game_of_life_halting (game).
Model | GPQA | IFEval | Reasoning Gym | acre | boxnet | game |
---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 15.86 | 44.05 | 4.24 | 5.99 | 0.00 | 3.49 |
DeepSeek-R1-Distill-Qwen-7B | 35.44 | 58.01 | 28.55 | 20.21 | 1.71 | 12.94 |
Nemotron-Research-Reasoning-Qwen-1.5B | 41.78 | 66.02 | 59.06 | 58.57 | 7.91 | 52.29 |
Nemotron-Research-Reasoning-Qwen-1.5B — GPU Configuration Table
GPU Model | vCPUs | RAM (GB) | VRAM (GB) | Precision | Use Case | Recommended For |
---|---|---|---|---|---|---|
T4 | 4 | 16 | 16 | 8-bit / BF16 | Basic inference, dev testing | ✅ Minimum viable setup |
RTX A4000 | 6 | 24 | 16 | 8-bit / BF16 | Fast single-user inference | ✅ Budget-friendly, good response time |
RTX A5000 | 8 | 32 | 24 | BF16 / FP16 | Low-latency inference | ✅ Ideal for Gradio or WebUI |
A100 40GB | 24 | 64 | 40 | BF16 / FP16 | Batch inference, high throughput | ✅ High-performance, multi-user support |
H100 80GB | 48 | 96 | 80 | BF16 / FP16 | Large-scale deployment, longest context | ⚡️ Overkill for 1.5B, but blazing fast |
Recommendations:
- Best Budget Pick: ✅ T4 or A4000 — run comfortably with 8-bit or BF16, great for development (see the 8-bit loading sketch below).
- Best for Production UI: ✅ A5000 — Can handle Gradio or REST API calls with smooth response.
- Best for Heavy Users or Batch Serving: ✅ A100 — If you’re planning to serve many users in parallel.
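To make the 16 GB budget tiers concrete, here is a minimal sketch of an 8-bit load using transformers with bitsandbytes (the bitsandbytes package is an assumed extra install, and the local model directory matches the download step later in this guide):
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# 8-bit quantized weights roughly halve VRAM use versus BF16,
# which is what makes a 16 GB T4 or A4000 comfortable for this model
model = AutoModelForCausalLM.from_pretrained(
    "./nemotron-1.5b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./nemotron-1.5b")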
Step-by-Step Process to Install NVIDIA Nemotron-Research-Reasoning-Qwen-1.5B Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines: on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs give you full control over your environment, letting you adjust GPU, CPU, RAM, and storage configurations to match your requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
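For reference, a typical way to generate a new SSH key pair on your local machine looks like this (a standard OpenSSH command; the comment string is just a label you can change):
ssh-keygen -t ed25519 -C "your_email@example.com"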
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy NVIDIA Nemotron-Research-Reasoning-Qwen-1.5B on an NVIDIA CUDA Virtual Machine. This proprietary, closed-source parallel computing platform will allow you to install NVIDIA Nemotron-Research-Reasoning-Qwen-1.5B on your GPU Node.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
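The exact connection string is shown on the instance page; it generally takes this shape (the IP, port, and key path here are placeholders, so substitute the values from your own deployment):
ssh -i ~/.ssh/id_rsa root@<VM_IP> -p <SSH_PORT>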
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer One
Run the following command to check the Python version currently available:
python3 --version
The system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to register the Python versions and link the new version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Install Accelerate & Transformers
Run the following command to install accelerate & transformers:
pip install accelerate transformers
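Transformers needs PyTorch as its backend. Most NVIDIA CUDA images ship with it preinstalled, but you can verify with the command below:
python3 -c "import torch; print(torch.cuda.is_available())"
If that fails or prints False, install PyTorch (see pytorch.org if you need a CUDA-specific wheel):
pip install torch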
Step 13: Install HuggingFace Hub
Run the following command to install huggingface_hub:
pip install huggingface_hub
Step 14: Download Model
Run the following command to download the model:
huggingface-cli download nvidia/Nemotron-Research-Reasoning-Qwen-1.5B --local-dir nemotron-1.5b
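If you prefer to script the download instead of using the CLI, huggingface_hub offers an equivalent Python call (a minimal sketch; the repo ID and target directory match the command above):
from huggingface_hub import snapshot_download
# Fetch the full model repository into ./nemotron-1.5b
snapshot_download(
    repo_id="nvidia/Nemotron-Research-Reasoning-Qwen-1.5B",
    local_dir="nemotron-1.5b",
)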
Step 15: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you’ll see SSH: 149.7.4.3 (your VM IP) in the bottom-left status bar (like in the image).
Step 16: Open the Project Folder on VM and Paste the Code
- Click on “Open Folder”
- Choose the directory where your script is located: /root
- VS Code will reload the window inside the remote environment.
- In the /root folder, right-click → New File
- Name it: run_nemotron.py
Then, paste this full code into run_nemotron.py:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# 1) Load from your local folder
model_dir = "./nemotron-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="auto")
# 2) Prepare a prompt
prompt = "Solve the following math problem step-by-step:\n\nWhat is the derivative of x^3 + 2x?"
# 3) Tokenize & move to GPU
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# 4) Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False,  # deterministic greedy decoding; set do_sample=True (with a temperature) to sample
)
# 5) Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
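Because reasoning models often produce long step-by-step traces, it can be more pleasant to watch tokens appear as they are generated. As an optional variation (a sketch using transformers’ built-in TextStreamer; it reuses the model, tokenizer, and inputs defined above), replace steps 4 and 5 with:
from transformers import TextStreamer
# Stream tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False,
    streamer=streamer,
)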
Step 17: Run the File
- Open the VS Code Terminal (Ctrl + ` or View → Terminal)
- Type:
python3 run_nemotron.py
Check the screenshot below for the output.
Step-by-Step Process to Run the NVIDIA Nemotron-Research-Reasoning-Qwen-1.5B Gradio App on Your GPU VM
Step 1: Install Gradio
Run the following command to install Gradio:
pip install gradio
Step 2: Open the Project Folder on VM and Paste the Code
- Click on “Open Folder”
- Choose the directory where your script is located: /root
- VS Code will reload the window inside the remote environment.
- In the /root folder, right-click → New File
- Name it: nemotron_webui.py
Then, paste this full code into nemotron_webui.py:
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model & tokenizer
model_dir = "./nemotron-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def generate_response(prompt, temperature=0.7, max_tokens=300):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        max_new_tokens=max_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Gradio UI
gr.Interface(
    fn=generate_response,
    inputs=[
        gr.Textbox(lines=6, placeholder="Ask a math, coding, or logic question...", label="Prompt"),
        gr.Slider(minimum=0.2, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=50, maximum=1024, value=300, step=50, label="Max Tokens"),
    ],
    outputs=gr.Textbox(label="Nemotron’s Answer"),
    title="🧠 Nemotron Reasoning Assistant",
    description="Ask complex questions involving math, code, science, or logic. Powered by NVIDIA's ProRL-trained Nemotron-1.5B.",
).launch(server_name="0.0.0.0", server_port=7860)
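If several people will hit the UI at once, Gradio can queue requests so they don’t contend for the GPU simultaneously (a small optional tweak: replace the final line of the script with the one below):
).queue().launch(server_name="0.0.0.0", server_port=7860)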
Step 3: Run the Gradio App
In your terminal (inside the virtual environment):
python3 nemotron_webui.py
You’ll see:
Running on local URL: http://0.0.0.0:7860
Step 4: Run SSH Port Forwarding Command to access the Gradio Web App
Run the following command to access the Gradio web app (or any other port from your VM) on your local machine:
ssh -i ~/.ssh/id_rsa -L 7860:127.0.0.1:7860 root@149.7.4.3 -p 18221
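Once the tunnel is up, you can optionally confirm from another local terminal that the port is being forwarded before opening a browser:
curl -I http://localhost:7860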
Step 5: Access the Gradio Web App
Access the Gradio Web App at:
http://localhost:7860
Conclusion
Whether you’re diving into advanced math, exploring logic puzzles, writing code, or working through scientific problems, Nemotron-Research-Reasoning-Qwen-1.5B is built to help you think through it all — clearly and thoroughly. Thanks to its lightweight architecture and powerful training, it runs smoothly even on modest hardware while delivering exceptional reasoning quality.
This guide showed you how to set up the model locally or on a GPU Virtual Machine, run it in the terminal, and launch a full browser-based interface. From setup to solution, you’re now ready to explore what thoughtful, step-by-step reasoning looks like — anytime, on your own infrastructure.