Magistral-Small-2506 is the latest evolution in Mistral AI’s line of efficient reasoning models, fine-tuned from the base Mistral-Small-3.1-2503. While it retains its compact size and agility, this model steps into deeper waters—bringing long-form reasoning, step-by-step deduction, and multilingual support into a package that runs comfortably on consumer-grade GPUs.
What makes Magistral stand out is its clarity of thought. The model doesn’t just answer questions—it takes time to think. It writes out its reasoning process like a person solving a math problem on paper, making it especially useful for logic, science, code, and educational tasks. Whether you’re building a chatbot that needs to explain itself or a backend service for research-style outputs, Magistral-Small brings structure and depth with minimal overhead.
Benchmark Results
| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | LiveCodeBench (v5) |
|---|---|---|---|---|
| Magistral Medium | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small | 70.68% | 62.76% | 68.18% | 55.84% |
GPU Configuration Table for Magistral-Small-2506
| GPU Model | vCPUs | RAM (GB) | VRAM (GB) | Use Case | Recommended For |
|---|---|---|---|---|---|
| NVIDIA H100 SXM | 224 | 1024 | 80 | Full-precision inference, long context (40k+) | ✅ Best for research, production inference at scale |
| NVIDIA A100 80GB | 192 | 512 | 80 | Long-chain reasoning with max throughput | ✅ Ideal for multi-user chat endpoints |
| NVIDIA A100 40GB | 96 | 256 | 40 | 4-bit/8-bit quantized mode only | ⚠️ Works well with quantized GGUF versions |
| RTX 6000 Ada | 48 | 192 | 48 | Medium-scale inference (32K-token range) | ✅ Recommended for single-user chat + dev workflows |
| RTX 4090 | 24 | 128 | 24 | 4-bit quantized inference | ✅ Great for local development / Jupyter |
Step-by-Step Process to Install Mistral Magistral Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
To run the Magistral-Small-2506 model smoothly across different workflows, we’ll install and demonstrate it using several interfaces: directly via cURL in the terminal, inside a Gradio UI, in a Jupyter Notebook, and through a custom Magistral Reasoning Chatbot powered by Gradio. We’ll also show how to connect it with Ollama-style local terminal calls using the OpenAI-compatible API. If you don’t want to manually install Jupyter, NodeShift also provides a pre-built Jupyter Notebook image — simply select “Jupyter” when deploying your VM from the NodeShift image list. In this guide, however, we’ll install Jupyter manually since we’ll be continuing all our work on the same VM to maintain consistency and control.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTXA6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy Mistral Magistral on an NVIDIA Cuda Virtual Machine. This proprietary, closed-source parallel computing platform will allow you to install Mistral Magistral on your GPU Node.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following command to check the available Python version:
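python3 --version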
By default, the system has Python 3.8.1 available. To install a higher version of Python, you’ll need to use the deadsnakes PPA. Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Install PyTorch with CUDA
Run the following command to install PyTorch with CUDA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
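Optionally, confirm that the CUDA-enabled build is picked up before continuing. A quick check (run with python3):
import torch

# Should print a version tagged with +cu121 and True on a working GPU VM.
print(torch.__version__)
print(torch.cuda.is_available())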
Step 13: Install vLLM Nightly with Mistral Support
Run the following command to install the vLLM nightly build with support for Mistral:
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
Then, run the following command to verify that mistral_common is installed:
python3 -c "import mistral_common; print(mistral_common.__version__)"
# should be >= 1.6.0
Step 14: Run vLLM Server
Launch the model server (adjust --tensor-parallel-size based on your GPU count):
vllm serve mistralai/Magistral-Small-2506 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.98 \
--max-model-len 12000 \
--max-num-seqs 1
This exposes a local OpenAI-compatible API at:
http://localhost:8000/v1
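Before moving on, you can optionally confirm the endpoint is live by listing the served models. A minimal, standard-library-only check (run with python3; it assumes the server from Step 14 is still running):
import json, urllib.request

# List the models exposed by the local vLLM server; the name returned here is
# the value you pass as "model" in chat completion requests.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))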
Step 15: Test with curl in the Terminal
Run the prompts with curl in the terminal:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Magistral-Small-2506",
"messages": [
{
"role": "system",
"content": "<s>[SYSTEM_PROMPT]system_prompt\nA user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.\n[/SYSTEM_PROMPT]"
},
{
"role": "user",
"content": "What is the square root of 3249?"
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 256
}'
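For reference, the correct final answer is 57 (since 57 × 57 = 3249), so it’s easy to verify the reasoning trace the model writes out before its summary.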
You can run the curl command directly on your GPU VM, inside the same terminal where you started the model — in a new tab or new shell session.
Here’s How:
- Open a new terminal tab or SSH session into your VM (leave vllm serve running in its own tab).
- Run the following curl command inside the new terminal tab:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Magistral-Small-2506",
"messages": [
{
"role": "system",
"content": "<s>[SYSTEM_PROMPT]system_prompt\nA user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.\n[/SYSTEM_PROMPT]"
},
{
"role": "user",
"content": "Write 4 sentences, each with one fewer word than the previous."
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 256
}'
If You Want to Run It from Your Local Mac Terminal:
You’ll need to expose your VM’s port 8000 to the internet by:
- Getting your VM’s public IP
- Ensuring port 8000 is open in firewall/NodeShift
- Running:
curl http://<your-vm-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ ... }'
Part 2: Web-Based Gradio Chat UI
We’ll create a simple Gradio chat app that interacts with your vLLM server running at localhost:8000.
Step 1: Install Gradio and OpenAI Client
Run the following command to install gradio and openai client:
pip install gradio openai
Step 2: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you’ll see SSH: 71.241.245.11 (your VM IP) in the bottom-left status bar (like in the image).
Step 3: Open the Project Folder on the VM and Paste the Code
- Click on “Open Folder”.
- Choose the directory where your script is located: /root
- VS Code will reload the window inside the remote environment.
- In the /root/toto folder, right-click → New File.
- Name it: /root/magistral_gradio_chat.py
Then, paste this full code into magistral_gradio_chat.py:
import gradio as gr
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

SYSTEM_PROMPT = """<s>[SYSTEM_PROMPT]system_prompt
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.
[/SYSTEM_PROMPT]"""

def chat_with_magistral(message, history):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user, bot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": bot})
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=messages,
        temperature=0.7,
        top_p=0.95,
        max_tokens=512
    )
    reply = response.choices[0].message.content
    return reply

gr.ChatInterface(fn=chat_with_magistral).launch(server_name="0.0.0.0", server_port=7860)
Step 4: Run the Server
- Open the VS Code Terminal (Ctrl + ` or View → Terminal)
- Type:
python3 magistral_gradio_chat.py
You’ll see:
Running on local URL: http://0.0.0.0:7860
Step 5: Run SSH Port Forwarding Command to access the Gradio Web App
Run the following command to access the Gradio web app (or any other port from your VM) on your local machine:
ssh -L 7860:localhost:7860 -L 8888:localhost:8888 -p 41609 root@71.241.245.11
Step 6: Access the Gradio Web App
Access the Gradio Web App on:
Running on local URL: http://localhost:7860
Run the reasoning prompts.
Part 3: Python Notebook Client (Jupyter)
Step 1: Install Jupyter and OpenAI Client
Run the following command to install Jupyter and OpenAI client:
pip install notebook openai
Step 2: Start Jupyter Notebook Server
Run the following command to start jupyter notebook server:
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''
Step 3: Run SSH Port Forwarding Command to Access the Jupyter Notebook
Run the following command to access the Jupyter Notebook (or any other port from your VM) on your local machine:
ssh -L 8888:localhost:8888 -p 41609 root@71.241.245.11
Step 4: Access the Jupyter Notebook
In your browser, open http://localhost:8888 to access the Jupyter Notebook.
Step 5: Use this Inside a Notebook
Create a new notebook and paste this cell to test:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

SYSTEM_PROMPT = """<s>[SYSTEM_PROMPT]system_prompt
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts. Use Markdown.
[/SYSTEM_PROMPT]"""

def ask_magistral(prompt):
    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        top_p=0.95,
        max_tokens=512
    )
    return response.choices[0].message.content
# Example usage
ask_magistral("What is the probability of getting two heads in three coin tosses?")
- Prompt: “What is the probability of getting two heads in three coin tosses?”
- Response: Clear inner monologue, followed by:
  - Outcome enumeration (2^3 = 8)
  - Favorable combinations (HHT, HTH, THH)
  - Final conclusion: 3 favorable out of 8 → 3/8 = 0.375
It’s thinking like a math student — exactly what the system prompt encourages.
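As a quick sanity check, the probability of exactly k heads in n fair tosses is C(n, k) / 2^n, so for n = 3 and k = 2 this gives C(3, 2) / 2^3 = 3/8, matching the model’s answer.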
Run more prompts.
Part 4: Gradio Chat UI with Task Categories
Step 1: Create the Gradio Chat Script
Paste the following code into the magistral_gradio_chat.py file (it replaces the simpler version from Part 2):
import gradio as gr
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

category_prompts = {
    "Reasoning": """<s>[SYSTEM_PROMPT]system_prompt
You are a thoughtful reasoning assistant. Think step by step, explain your reasoning fully before giving your final answer. Use Markdown.
[/SYSTEM_PROMPT]""",
    "Math Problem": """<s>[SYSTEM_PROMPT]system_prompt
You are a math solver. Show all steps clearly and conclude with the final answer. Use Markdown and simple formatting.
[/SYSTEM_PROMPT]""",
    "Code Debug": """<s>[SYSTEM_PROMPT]system_prompt
You are a programming assistant. The user will provide code and you must identify bugs and suggest improvements. Explain your logic. Use Markdown code blocks.
[/SYSTEM_PROMPT]""",
    "Logic Puzzle": """<s>[SYSTEM_PROMPT]system_prompt
You're a logic puzzle solver. Think aloud, evaluate all possibilities, and explain each deduction in detail. Conclude clearly. Use Markdown.
[/SYSTEM_PROMPT]""",
    "Writing Task": """<s>[SYSTEM_PROMPT]system_prompt
You are a creative assistant. Help write clear, compelling, and grammatically sound content. Use paragraphs and Markdown for formatting.
[/SYSTEM_PROMPT]"""
}

def chat_with_magistral(message, history, category):
    system_prompt = category_prompts.get(category, category_prompts["Reasoning"])
    messages = [{"role": "system", "content": system_prompt}]
    for user, bot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": bot})
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=messages,
        temperature=0.7,
        top_p=0.95,
        max_tokens=512
    )
    reply = response.choices[0].message.content
    return reply

with gr.Blocks(title="Magistral Chat") as app:
    gr.Markdown("## 🧠 Magistral Reasoning Chat\nSelect a task category and start chatting.")

    category = gr.Radio(
        label="Select Task Category",
        choices=list(category_prompts.keys()),
        value="Reasoning"
    )
    chatbot = gr.Chatbot(label="Magistral Chat", height=400)
    msg = gr.Textbox(placeholder="Type your question here...", label="Your Question")
    clear = gr.Button("Clear Chat")

    def user_chat(user_message, history, category):
        bot_reply = chat_with_magistral(user_message, history, category)
        history.append((user_message, bot_reply))
        return "", history

    msg.submit(user_chat, [msg, chatbot, category], [msg, chatbot])
    clear.click(lambda: [], None, chatbot)

app.launch(server_name="0.0.0.0", server_port=7860)
Step 2: Run the Server
- Open the VS Code Terminal (Ctrl + ` or View → Terminal)
- Type:
python3 magistral_gradio_chat.py
You’ll see:
Running on local URL: http://0.0.0.0:7860
Step 3: Visit in Browser
From your Mac (with port forwarding set):
http://localhost:7860
Features Included:
- Task-specific reasoning behavior
- Math logic & formatting
- Code blocks and explanations
- Clean Markdown outputs
- Clear chat button
Part 5: Ollama and Terminal
Step 1: Install Ollama
After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.
Website Link: https://ollama.com/
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Step 2: Serve Ollama
Run the following command to start the Ollama server so models can be served and accessed locally:
ollama serve
Step 3: Check Commands
Run the following command to see a list of available commands:
ollama
Step 4: Check Available Models
Run the following command to check which downloaded models are available:
ollama list
Step 5: Pull Magistral:24b Model
Run the following command to pull the magistral:24b model:
ollama pull magistral:24b
Step 6: Run Magistral:24b Model
Now, you can run the model in the terminal using the following command and interact with your model:
ollama run magistral:24b
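If you’d rather call the model programmatically than chat in the terminal, Ollama also exposes an OpenAI-compatible API (on port 11434 by default). Here is a minimal sketch that reuses the openai client installed earlier:
from openai import OpenAI

# Point the client at Ollama's local OpenAI-compatible endpoint; the api_key
# value is ignored by Ollama but the client library requires one.
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

response = client.chat.completions.create(
    model="magistral:24b",
    messages=[{"role": "user", "content": "What is the square root of 3249?"}],
)
print(response.choices[0].message.content)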
Conclusion
Magistral-Small-2506 proves that small models can still think big. With deep reasoning traces, multi-language fluency, and compatibility across tools like vLLM, Gradio, Jupyter, and even Ollama—this 24B model is lightweight in infrastructure needs but heavyweight in performance.
Whether you’re debugging code, solving math puzzles, or building research assistants, this model delivers clarity, structure, and transparency in every response. And the best part? You can run it all locally or on a single affordable GPU VM—no cluster or extra ops needed.
So go ahead—deploy, prompt, and let it think.