Magistral-Small-2506 is the latest evolution in Mistral AI’s line of efficient reasoning models, fine-tuned from the base Mistral-Small-3.1-2503. While it retains its compact size and agility, this model steps into deeper waters—bringing long-form reasoning, step-by-step deduction, and multilingual support into a package that runs comfortably on consumer-grade GPUs.
What makes Magistral stand out is its clarity of thought. The model doesn’t just answer questions—it takes time to think. It writes out its reasoning process like a person solving a math problem on paper, making it especially useful for logic, science, code, and educational tasks. Whether you’re building a chatbot that needs to explain itself or a backend service for research-style outputs, Magistral-Small brings structure and depth with minimal overhead.
Benchmark Results
| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | LiveCodeBench (v5) |
|---|---|---|---|---|
| Magistral Medium | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small | 70.68% | 62.76% | 68.18% | 55.84% |
GPU Configuration Table for Magistral-Small-2506
| GPU Model | vCPUs | RAM (GB) | VRAM (GB) | Use Case | Recommended For |
|---|---|---|---|---|---|
| NVIDIA H100 SXM | 224 | 1024 | 80 | Full-precision inference, long context (40k+) | ✅ Best for research, production inference at scale |
| NVIDIA A100 80GB | 192 | 512 | 80 | Long-chain reasoning with max throughput | ✅ Ideal for multi-user chat endpoints |
| NVIDIA A100 40GB | 96 | 256 | 40 | 4-bit/8-bit quantized mode only | ⚠️ Works well with quantized GGUF versions |
| RTX 6000 Ada | 48 | 192 | 48 | Medium-scale inference (32K-token range) | ✅ Recommended for single-user chat + dev workflows |
| RTX 4090 | 24 | 128 | 24 | 4-bit quantized inference | ✅ Great for local development / Jupyter |
Step-by-Step Process to Install Mistral Magistral Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
To run the Magistral-Small-2506 model smoothly across different workflows, we’ll install and demonstrate it using several interfaces: directly via cURL in the terminal, inside a Gradio UI, in a Jupyter Notebook, and through a custom Magistral Reasoning Chatbot powered by Gradio. We’ll also show how to connect it with Ollama-style local terminal calls using the OpenAI-compatible API. If you don’t want to manually install Jupyter, NodeShift also provides a pre-built Jupyter Notebook image — simply select “Jupyter” when deploying your VM from the NodeShift image list. In this guide, however, we’ll install Jupyter manually since we’ll be continuing all our work on the same VM to maintain consistency and control.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTXA6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy Mistral Magistral on an NVIDIA Cuda Virtual Machine. This proprietary, closed-source parallel computing platform will allow you to install Mistral Magistral on your GPU Node.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the available Python version:
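python3 --version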
By default, the system has Python 3.8.1 available. To install a higher version of Python, you’ll need to use the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Install PyTorch with CUDA
Run the following command to install PyTorch with CUDA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
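Optionally, confirm that the CUDA-enabled build is picked up before continuing. A quick check (run with python3):
import torch

# Should print a version tagged with +cu121 and True on a working GPU VM.
print(torch.__version__)
print(torch.cuda.is_available())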
Step 13: Install vLLM Nightly with Mistral Support
Run the following command to install the vLLM nightly build with support for Mistral:
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
Then, run the following command to verify that mistral_common is installed:
python3 -c "import mistral_common; print(mistral_common.__version__)"
# should be >= 1.6.0
Step 14: Run vLLM Server
Launch the model server (adjust --tensor-parallel-size based on your GPU count):
vllm serve mistralai/Magistral-Small-2506 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.98 \
--max-model-len 12000 \
--max-num-seqs 1
This exposes a local OpenAI-compatible API at:
http://localhost:8000/v1
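Before moving on, you can optionally confirm the endpoint is live by listing the served models. A minimal, standard-library-only check (run with python3; it assumes the server from Step 14 is still running):
import json, urllib.request

# List the models exposed by the local vLLM server; the name returned here is
# the value you pass as "model" in chat completion requests.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))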
Step 15: Test with curl in the Terminal
Run the prompts with curl in the terminal:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Magistral-Small-2506",
"messages": [
{
"role": "system",
"content": "<s>[SYSTEM_PROMPT]system_prompt\nA user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.\n[/SYSTEM_PROMPT]"
},
{
"role": "user",
"content": "What is the square root of 3249?"
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 256
}'
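For reference, the correct final answer is 57 (since 57 × 57 = 3249), so it’s easy to verify the reasoning trace the model writes out before its summary.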
You can run the curl command directly on your GPU VM, inside the same terminal where you started the model — in a new tab or new shell session.
Here’s How:
- Open a new terminal tab or SSH session into your VM (leave vllm serve running in its own tab).
- Run the following curl command inside the new terminal tab:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Magistral-Small-2506",
"messages": [
{
"role": "system",
"content": "<s>[SYSTEM_PROMPT]system_prompt\nA user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.\n[/SYSTEM_PROMPT]"
},
{
"role": "user",
"content": "Write 4 sentences, each with one fewer word than the previous."
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 256
}'
If You Want to Run It from Your Local Mac Terminal:
You’ll need to expose your VM’s port 8000 to the internet by:
- Getting your VM’s public IP
- Ensuring port 8000 is open in firewall/NodeShift
- Running:
curl http://<your-vm-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ ... }'
Part 2: Web-Based Gradio Chat UI
We’ll create a simple Gradio chat app that interacts with your vLLM server running at localhost:8000.
Step 1: Install Gradio and OpenAI Client
Run the following command to install gradio and openai client:
pip install gradio openai
Step 2: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you’ll see SSH: 71.241.245.11 (your VM IP) in the bottom-left status bar (like in the image).
Step 3: Open the Project Folder on the VM and Paste the Code
- Click on “Open Folder”.
- Choose the directory where your script is located: /root
- VS Code will reload the window inside the remote environment.
- In the /root/toto folder, right-click → New File.
- Name it: /root/magistral_gradio_chat.py
Then, paste this full code into magistral_gradio_chat.py:
import gradio as gr
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

SYSTEM_PROMPT = """<s>[SYSTEM_PROMPT]system_prompt
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.
[/SYSTEM_PROMPT]"""

def chat_with_magistral(message, history):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user, bot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": bot})
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=messages,
        temperature=0.7,
        top_p=0.95,
        max_tokens=512
    )
    reply = response.choices[0].message.content
    return reply

gr.ChatInterface(fn=chat_with_magistral).launch(server_name="0.0.0.0", server_port=7860)
Step 4: Run the Server
- Open the VS Code Terminal (Ctrl + ` or View → Terminal)
- Type:
python3 magistral_gradio_chat.py
You’ll see:
Running on local URL: http://0.0.0.0:7860
Step 5: Run SSH Port Forwarding Command to access the Gradio Web App
Run the following command to access the Gradio web app (or any other port from your VM) on your local machine:
ssh -L 7860:localhost:7860 -L 8888:localhost:8888 -p 41609 root@71.241.245.11
Step 6: Access the Gradio Web App
Access the Gradio Web App on:
Running on local URL: http://localhost:7860
Run the reasoning prompts.
Part 3: Python Notebook Client (Jupyter)
Step 1: Install Jupyter and OpenAI Client
Run the following command to install Jupyter and OpenAI client:
pip install notebook openai
Step 2: Start Jupyter Notebook Server
Run the following command to start jupyter notebook server:
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''
Step 3: Run SSH Port Forwarding Command to Access the Jupyter Notebook
Run the following command to access the Jupyter Notebook (or any other port from your VM) on your local machine:
ssh -L 8888:localhost:8888 -p 41609 root@71.241.245.11
Step 4: Access the Jupyter Notebook
In your browser, open http://localhost:8888 to access the Jupyter Notebook.
Step 5: Use this Inside a Notebook
Create a new notebook and paste this cell to test:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

SYSTEM_PROMPT = """<s>[SYSTEM_PROMPT]system_prompt
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.
[/SYSTEM_PROMPT]"""

def ask_magistral(prompt):
    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        top_p=0.95,
        max_tokens=512
    )
    return response.choices[0].message.content
# Example usage
ask_magistral("What is the probability of getting two heads in three coin tosses?")
- Prompt: “What is the probability of getting two heads in three coin tosses?”
- Response: Clear inner monologue, followed by:
  - Outcome enumeration (2^3 = 8)
  - Favorable combinations (HHT, HTH, THH)
  - Final conclusion: 3 favorable out of 8 → 3/8 = 0.375
It’s thinking like a math student — exactly what the system prompt encourages.
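As a quick sanity check, the probability of exactly k heads in n fair tosses is C(n, k) / 2^n, so for n = 3 and k = 2 this gives C(3, 2) / 2^3 = 3/8, matching the model’s answer.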
Run more prompts.
Part 4: Gradio Chat UI with Task Categories
Step 1: Create the Gradio Chat Script
Paste the following code into the magistral_gradio_chat.py file (it replaces the simpler version from Part 2):
import gradio as gr
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

category_prompts = {
    "Reasoning": """<s>[SYSTEM_PROMPT]system_prompt
You are a thoughtful reasoning assistant. Think step by step, explain your reasoning fully before giving your final answer. Use Markdown.
[/SYSTEM_PROMPT]""",
    "Math Problem": """<s>[SYSTEM_PROMPT]system_prompt
You are a math solver. Show all steps clearly and conclude with the final answer. Use Markdown and simple formatting.
[/SYSTEM_PROMPT]""",
    "Code Debug": """<s>[SYSTEM_PROMPT]system_prompt
You are a programming assistant. The user will provide code and you must identify bugs and suggest improvements. Explain your logic. Use Markdown code blocks.
[/SYSTEM_PROMPT]""",
    "Logic Puzzle": """<s>[SYSTEM_PROMPT]system_prompt
You're a logic puzzle solver. Think aloud, evaluate all possibilities, and explain each deduction in detail. Conclude clearly. Use Markdown.
[/SYSTEM_PROMPT]""",
    "Writing Task": """<s>[SYSTEM_PROMPT]system_prompt
You are a creative assistant. Help write clear, compelling, and grammatically sound content. Use paragraphs and Markdown for formatting.
[/SYSTEM_PROMPT]"""
}

def chat_with_magistral(message, history, category):
    system_prompt = category_prompts.get(category, category_prompts["Reasoning"])
    messages = [{"role": "system", "content": system_prompt}]
    for user, bot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": bot})
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=messages,
        temperature=0.7,
        top_p=0.95,
        max_tokens=512
    )
    reply = response.choices[0].message.content
    return reply

with gr.Blocks(title="Magistral Chat") as app:
    gr.Markdown("## 🧠 Magistral Reasoning Chat\nSelect a task category and start chatting.")

    category = gr.Radio(
        label="Select Task Category",
        choices=list(category_prompts.keys()),
        value="Reasoning"
    )
    chatbot = gr.Chatbot(label="Magistral Chat", height=400)
    msg = gr.Textbox(placeholder="Type your question here...", label="Your Question")
    clear = gr.Button("Clear Chat")

    def user_chat(user_message, history, category):
        bot_reply = chat_with_magistral(user_message, history, category)
        history.append((user_message, bot_reply))
        return "", history

    msg.submit(user_chat, [msg, chatbot, category], [msg, chatbot])
    clear.click(lambda: [], None, chatbot)

app.launch(server_name="0.0.0.0", server_port=7860)
Step 2: Run the Server
- Open the VS Code Terminal (Ctrl + ` or View → Terminal)
- Type:
python3 magistral_gradio_chat.py
You’ll see:
Running on local URL: http://0.0.0.0:7860
Step 3: Visit in Browser
From your Mac (with port forwarding set):
http://localhost:7860
Features Included:
- Task-specific reasoning behavior
- Math logic & formatting
- Code blocks and explanations
- Clean Markdown outputs
- Clear chat button
Part 5: Ollama and Terminal
Step 1: Install Ollama
After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.
Website Link: https://ollama.com/
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Step 2: Serve Ollama
Run the following command to start the Ollama server so models can be served and accessed locally:
ollama serve
Step 3: Check Commands
Run the following command to see a list of available commands:
ollama
Step 4: Check Available Models
Run the following command to check which downloaded models are available:
ollama list
Step 5: Pull Magistral:24b Model
Run the following command to pull the magistral:24b model:
ollama pull magistral:24b
Step 6: Run Magistral:24b Model
Now, you can run the model in the terminal using the following command and interact with your model:
ollama run magistral:24b
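If you’d rather call the model programmatically than chat in the terminal, Ollama also exposes an OpenAI-compatible API (on port 11434 by default). Here is a minimal sketch that reuses the openai client installed earlier:
from openai import OpenAI

# Point the client at Ollama's local OpenAI-compatible endpoint; the api_key
# value is ignored by Ollama but the client library requires one.
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

response = client.chat.completions.create(
    model="magistral:24b",
    messages=[{"role": "user", "content": "What is the square root of 3249?"}],
)
print(response.choices[0].message.content)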
Conclusion
Magistral-Small-2506 proves that small models can still think big. With deep reasoning traces, multi-language fluency, and compatibility across tools like vLLM, Gradio, Jupyter, and even Ollama—this 24B model is lightweight in infrastructure needs but heavyweight in performance.
Whether you’re debugging code, solving math puzzles, or building research assistants, this model delivers clarity, structure, and transparency in every response. And the best part? You can run it all locally or on a single affordable GPU VM—no cluster or extra ops needed.
So go ahead—deploy, prompt, and let it think.