The upgraded DeepSeek-R1-0528 isn’t just a minor revision; it’s a significant milestone for open-source AI, matching or outperforming well-known closed-source models such as OpenAI’s o3. This new version is built on improved training algorithms and larger-scale computation, which sharpens its ability to handle complex tasks in mathematics, programming, and logical inference. The benchmark gains are dramatic: AIME 2025 accuracy jumped from 70% to 87.5%, and the model now reasons with roughly double the token depth, up to about 23K tokens per question. With a reduced hallucination rate, improved function calling, and strong coding performance, it competes directly with industry leaders like OpenAI’s o3 and Gemini 2.5 Pro. Its distilled variant, DeepSeek-R1-0528-Qwen3-8B, even surpasses models with 30B+ parameters while remaining lightweight and efficient, making it one of the most promising reasoning LLMs available today.
There are several ways to install DeepSeek-R1 locally on your machine (or a VM). In this guide, we cover three of the simplest approaches to quickly set up and run the model. By the end of this article, you’ll be able to decide which method best suits your requirements.
Performance
Prerequisites
The minimum system requirements for running a DeepSeek-R1 model:
- Disk Space: 100 GB (may vary across models)
- NVIDIA CUDA installed
- Anaconda installed
- GPU configuration requirements, depending on the model variant:
  - DeepSeek-R1-0528-Qwen3-8B: 1x RTX 4090 or 1x RTX A6000; at least 24 GB VRAM
  - DeepSeek-R1-0528 (quantized): 2x RTX A6000 or 1x H100; at least 64 GB VRAM
Step-by-step process to install DeepSeek-R1-0528 locally
For the purpose of this tutorial, we’ll use a GPU-powered virtual machine from NodeShift, since it provides high-compute virtual machines at very affordable cost while meeting GDPR, SOC 2, and ISO 27001 compliance requirements. It also offers an intuitive, user-friendly interface, making it easier for beginners to get started with cloud deployments. That said, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.
Step 1: Setting up a NodeShift Account
Visit app.nodeshift.com and create an account by filling in basic details, or continue signing up with your Google/GitHub account.
If you already have an account, login straight to your dashboard.
Step 2: Create a GPU Node
After accessing your account, you should see a dashboard (see image), now:
- Navigate to the menu on the left side.
- Click on the GPU Nodes option.
- Click on Start to create your first GPU node.
These GPU nodes are GPU-powered virtual machines provided by NodeShift. They are highly customizable and let you control configuration options such as the GPU model (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.
Step 3: Selecting configuration for GPU (model, region, storage)
- For this tutorial, we’ll use an RTX 4090 GPU; however, you can choose any GPU that fits your needs.
- Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.
Step 4: Choose GPU Configuration and Authentication method
1. After selecting your required configuration options, you’ll see the available VMs in your region that match (or come close to) your configuration. In our case, we’ll choose a 2x RTX 4090 GPU node with 24GB VRAM per GPU, 64 vCPUs, 129GB RAM, and a 100GB SSD.
2. Next, you’ll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our official documentation.
Step 5: Choose an Image
The final step is to choose an image for the VM, which in our case is NVIDIA CUDA, since we’ll deploy and run the model’s inference through Ollama and vLLM. If you’re deploying with Transformers, choose the Jupyter Notebook image instead.
That’s it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click Create to deploy the node.
Step 6: Connect to active Compute Node using SSH
- As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status Running in green, which means the compute node is ready to use!
- Once your GPU shows this status, navigate to the three dots on the right, click on Connect with SSH, and copy the SSH details that appear.
Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:
1. Open your terminal, paste the SSH command, and run it.
2. In some cases, your terminal may ask for your consent before connecting. Type ‘yes’.
3. A prompt will request a password. Type the SSH password, and you should be connected.
Output:
Installation using Ollama
Ollama is a user-friendly option for quickly running DeepSeek-R1 locally with minimal configuration. It’s best suited for individuals or small-scale projects that don’t require extensive optimization or scaling.
Before starting the installation steps, feel free to check your GPU configuration details by using the following command:
nvidia-smi
The first installation method uses Ollama. To install DeepSeek-R1 with Ollama, follow the steps below:
1. Update the Ubuntu package source list and install the dependencies required by Ollama.
apt-get update
apt-get install pciutils -y
2. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Output:
3. Start the Ollama server.
ollama serve
Output:
Now that our Ollama server has been started, let’s install the model.
4. Open a new terminal window and run the ollama command to check that everything is up and running and to see the list of available Ollama commands.
Output:
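If you want to manage models on this machine later, Ollama also ships a few handy subcommands (ollama ps requires a relatively recent Ollama release):

# List models downloaded on this machine
ollama list

# Show models currently loaded in memory
ollama ps

# Remove a model you no longer need (replace the tag with your model's name)
ollama rm hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL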
5. Install the DeepSeek-R1-0528 model with the following command. (Currently, only a quantized version of DeepSeek-R1-0528-Qwen3-8B is available through Ollama, provided by Unsloth.)
ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
Output:
The model will finish downloading shortly; once it’s done, we can move on to model inference.
6. Give prompts for model inference.
Once the download is complete, Ollama automatically opens a console where you can type and send prompts to the model. This is where you can chat with it. For example, it generated the following response (shown in the images) for the prompt below:
“Explain the difference between monorepos and turborepos”
Output:
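If you prefer to query the model programmatically instead of through the interactive console, the running ollama serve process also exposes a local REST API (on http://localhost:11434 by default). Here’s a minimal sketch using the chat endpoint with the model tag pulled above:

curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL",
  "messages": [
    {"role": "user", "content": "Explain the difference between monorepos and turborepos"}
  ],
  "stream": false
}'

Setting "stream": false returns the full response as a single JSON object instead of a token-by-token stream.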
Installation using vLLM
vLLM is designed for efficient inference with optimized memory usage and high throughput, which makes it ideal for production environments. Choose this if you need to serve large-scale applications with performance and cost efficiency in mind.
In the upcoming steps, you’ll see how to install DeepSeek-R1 using vLLM.
1. Create a virtual environment with Anaconda.
conda create -n deepseek python=3.11 -y && conda activate deepseek
Output:
2. Install vLLM along with all required dependencies.
pip install vllm
Output:
The above command automatically installs all the packages needed to run this model, including torch, transformers, accelerate, and others.
3. Load and run the model.
For the scope of this tutorial, we’ll run the DeepSeek-R1-0528-Qwen3-8B model with vLLM. In the command, don’t forget to include --max-model-len 4096 to cap the context length; otherwise, the server may run out of memory.
vllm serve "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B" --max-model-len 4096
Output:
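If your node has more than one GPU (for example, the 2x RTX 4090 configuration used earlier) or you want tighter control over memory, vLLM also accepts flags such as --tensor-parallel-size and --gpu-memory-utilization. A sketch of how the serve command might look on two GPUs (values are illustrative):

vllm serve "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B" \
  --max-model-len 4096 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90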
4. Open a new terminal and call the model server using the following command.
Replace the “content” attribute with your prompt. For example, our prompt is “Tell me the recipe for tea”.
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
"messages": [
{
"role": "user",
"content": "Tell me the recipe for tea"
}
]
}'
Output:
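Since vLLM serves an OpenAI-compatible API, you can also call it from Python with the openai client instead of curl. A minimal sketch; the api_key value is just a placeholder, since vLLM doesn’t verify it unless the server was started with --api-key:

# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server; the key is a dummy value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    messages=[{"role": "user", "content": "Tell me the recipe for tea"}],
)

print(response.choices[0].message.content)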
Installation using Transformers
Transformers offers maximum flexibility and control for fine-tuning and experimenting with DeepSeek-R1. It’s the best choice for developers and researchers who need to customize models for their specific use cases and experiment with various training or inference configurations.
In this section, you will learn to install the model using Transformers. We’ll install and run the model with Python code on Jupyter Notebook.
1. Start a Jupyter Notebook session on the machine.
(Ensure you’re inside the conda environment before running this)
conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
2. If you’re on a remote machine (e.g., a NodeShift GPU node), you’ll need to set up SSH port forwarding in order to access the Jupyter Notebook session in your local browser.
Run the following command in your local terminal after replacing:
- <YOUR_SERVER_PORT> with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).
- <PATH_TO_SSH_KEY> with the path to the location where your SSH key is stored.
- <YOUR_SERVER_IP> with the IP address of your remote server.
ssh -L 8888:localhost:8888 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
Output:
After this, copy the URL shown in your remote server’s terminal:
And paste this on your local browser to access the Jupyter Notebook session.
(Optional)
In the step above, we started a Jupyter Notebook session directly on the current remote server. However, if you’d rather run a dedicated machine exclusively for JupyterLab, you can also spin up a separate NodeShift GPU node with the Jupyter image, as shown below, and install the model there.
To use the built-in Jupyter Notebook functionality provided by NodeShift, follow the same steps (Steps 1 to 6) to create a new GPU instance, but this time select the Jupyter option instead of NVIDIA CUDA in the Choose an Image section and deploy the GPU.
After the GPU is running, click Connect with SSH to open a Jupyter Notebook session on your browser.
You can completely skip this step if you prefer running a direct Jupyter session from the same CUDA machine as stated in the previous step.
3. Open a Python notebook inside Jupyter.
4. Load and run the model using the Transformers pipeline().
To demonstrate this method, we run the same DeepSeek-R1-0528-Qwen3-8B model. You can replace it with the model of your choice as per your requirements.
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "How can you help me?"},
]
# device_map="auto" places the model on the available GPU(s)
pipe = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    device_map="auto",
)
pipe(messages)
Output:
If you want to increase the output length (to avoid incomplete responses), add the max_new_tokens parameter to the pipe() call to set the maximum number of tokens the model can generate. You can also apply the chat template functionality to better structure the input and output format.
from transformers import pipeline, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Explain neural networks to a 5-year-old."},
]

# Format the conversation with the model's chat template
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

# max_new_tokens caps how long the generated response can be
response = pipe(
    prompt,
    max_new_tokens=1024,
)
print(response[0]["generated_text"])
Output:
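If you need more control than pipeline() provides (for example, custom sampling settings or access to raw token IDs), you can load the model directly with AutoModelForCausalLM. This is a minimal sketch, assuming the same model ID and a GPU with enough VRAM; loading in bfloat16 roughly halves memory usage compared to full precision, and the sampling values are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load weights in bfloat16 and spread them across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain neural networks to a 5-year-old."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Illustrative sampling settings; adjust to your needs
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))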
Conclusion
In this guide, we walked through the three most practical ways to install and run the powerful DeepSeek-R1-0528 model locally, via Ollama, vLLM, and Transformers, each tailored to a different balance of performance, customization, and ease of use. From its advanced reasoning capabilities to its strong benchmark results, DeepSeek-R1-0528 proves itself a top-tier open-source LLM ready for real-world applications. With NodeShift Cloud, deploying these models becomes even easier and more production-ready, providing a seamless environment to test, scale, and iterate with DeepSeek-R1-0528 without worrying about backend complexity. On top of that, you can now install the original DeepSeek-R1 model in just one click with NodeShift’s latest one-click models feature.