Meet Voxtral Mini & Small: Breakthrough Audio-Text Models
In the fast-moving world of audio technology, Voxtral Mini (3B) and Voxtral Small (24B) stand out as powerful tools designed to understand not just words — but voices, languages, and meaning.
Imagine a system that listens to speeches, interviews, podcasts, or phone calls and not only transcribes them but summarizes key points, translates languages, answers follow-up questions, or even triggers backend actions — all in one streamlined flow.
Both Voxtral Mini and Voxtral Small are built on top of solid text processing backbones, but they go several steps further by adding state-of-the-art audio input abilities. You can feed them audio clips of up to 30–40 minutes, and they’ll handle it with impressive detail, whether that’s simple transcription or deeper understanding tasks like Q&A or generating summaries.
Here’s a quick feel for each:
- Voxtral Mini (3B)
Small but mighty, this version is perfect for light workloads or setups where GPU resources are limited. It’s great for multilingual transcription, audio understanding, and chat-style interactions, and needs roughly 9.5 GB of GPU memory in optimized (bf16/fp16) modes.
- Voxtral Small (24B)
This bigger sibling offers heavyweight performance, ideal for more demanding tasks and deeper audio-to-text workflows. It comes with support for advanced features like function calling from voice, where spoken commands can directly interact with backend systems. Running it requires around 55 GB of GPU memory, making it suitable for well-resourced server environments.
What makes both models exciting is their native multilingual strength — they handle English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian — and they do so with automatic language detection, so you don’t need to predefine anything.
In short, Voxtral Mini and Small blur the line between speech and action:
✅ Listen → ✅ Understand → ✅ Respond → ✅ Act.
For developers, researchers, and tech teams, they open the door to building next-generation audio apps, whether for customer support, media analysis, accessibility tools, or voice-driven workflows.
GPU Configuration Table for Voxtral Mini 3B and Voxtral Small 24B
| Model | GPU Memory Required (bf16/fp16) | Recommended GPU Types | vLLM Setup Notes |
|---|---|---|---|
| Voxtral Mini (3B) | ~9.5 GB | NVIDIA A100 40GB, H100 80GB, RTX A6000 48GB, RTX 4090 24GB | Runs on a single mid-to-high-end GPU; good for local testing or light server use |
| Voxtral Small (24B) | ~55 GB | NVIDIA H100 80GB (single), A100 80GB (single), or multi-GPU setup (2x A100 40GB with tensor parallelism) | Needs multi-GPU if using lower-memory cards; use --tensor-parallel-size 2 to split across GPUs |
Key Recommendations
✅ For small-scale use or testing →
Start with Voxtral Mini (3B) on a single high-memory GPU like RTX A6000 or A100 40GB.
✅ For production or heavy workloads →
Use Voxtral Small (24B) on a single H100 80GB or split across multiple GPUs (e.g., 2x A100 40GB) with tensor parallelism enabled.
✅ General tips →
- Always use bf16 or fp16 to reduce memory load.
- On multi-GPU, configure --tensor-parallel-size properly.
- Check disk space (~20–40GB) for model weights + audio datasets.
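If you want to sanity-check a machine against these numbers before downloading anything, two quick commands cover it (standard NVIDIA and Linux tooling, nothing Voxtral-specific):
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
df -h /
The first prints each GPU’s total and free memory; the second shows free disk space on the root filesystem.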
Resources
Link: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
Link: https://huggingface.co/mistralai/Voxtral-Small-24B-2507
Step-by-Step Process to Install Mistral Voxtral Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Mistral Voxtral, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Mistral Voxtral
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Mistral Voxtral.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Mistral Voxtral runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Update System & Install Essentials
Run the following commands to update the system and install the essentials:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git ffmpeg
Step 9: Install python3-venv and Activate Virtual Environment
Run the following command to install python3-venv:
sudo apt install python3-venv -y
Then, run the following commands to create and activate the virtual environment:
python3 -m venv voxtral-env
source voxtral-env/bin/activate
Step 10: Install uv and vLLM
Run the following commands to install uv and vLLM:
pip install uv
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Step 11: Load the Model and Start the Server
Run the following command to load the model and start the server:
vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
This will:
- Load the Voxtral-Mini-3B-2507 model
- Start an OpenAI-compatible server at:
http://localhost:8000/v1
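Once the server reports it is listening, you can sanity-check it from another terminal on the VM. This simply lists the models the OpenAI-compatible API exposes and assumes the default port 8000:
curl http://localhost:8000/v1/models
The response should be a small JSON payload that includes mistralai/Voxtral-Mini-3B-2507.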
Step 12: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you’ll see SSH: 45.135.56.11 (your VM IP) in the bottom-left status bar (as shown in the image).
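If you haven’t configured the host yet, a minimal ~/.ssh/config entry is enough for VS Code to find it. The host alias and key path below are placeholders (the IP is the one used in this tutorial); add a Port line if your instance uses a non-default SSH port:
Host nodeshift-voxtral
    HostName 45.135.56.11
    User root
    IdentityFile ~/.ssh/your_private_key
After saving this, the alias nodeshift-voxtral appears in the Remote-SSH host list.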
Step 13: Write the Audio Script
In this step, you create the Python script that will send an audio + text instruct request to your running Voxtral server.
Create a new file named audio.py and add the following code:
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Connect to local server
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
# Get model name
model = client.models.list().data[0].id
# Download sample audio
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
# Prepare instruct message
text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
# Run request
response = client.chat.completions.create(
model=model,
messages=[user_msg],
temperature=0.2,
top_p=0.95,
)
print(response.choices[0].message.content)
Step 14: Run the Audio Script
In this step, you execute the Python script you wrote to send audio + text to the Voxtral server and check the model’s response.
Run the script with the following command:
python3 audio.py
Watch for the output. You should see:
- Progress bars for downloading:
obama.mp3: 100%
bcn_weather.mp3: 100%
- A printed response like:
The speaker who is more inspiring is the one who delivered the farewell address...
The difference lies in the content and purpose of the speeches...
This means the model successfully analyzed the audios and answered the question.
Step 15: Write the Transcription Script
In this step, you create the Python script that will transcribe audio using the Voxtral model.
Create a new file named transcription.py and add the following code:
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.audio import Audio
from mistral_common.protocol.instruct.messages import RawAudio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Connect to local server
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
# Get model name
model = client.models.list().data[0].id
# Download sample audio
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
audio = RawAudio.from_audio(audio)
# Prepare transcription request
req = TranscriptionRequest(
model=model,
audio=audio,
language="en",
temperature=0.0
).to_openai(exclude=("top_p", "seed"))
# Run request
response = client.audio.transcriptions.create(**req)
print(response)
What this script does:
- Downloads the obama.mp3 audio sample.
- Sends it to the Voxtral model for transcription.
- Prints the transcribed text.
Step 16: Use Your Own Audio (or Public Sample)
In this step, you can use your own audio file for transcription — but here, we are using a publicly available audio provided through Hugging Face for demonstration.
By default, the script downloads obama.mp3 → Obama’s farewell address sample.
This audio is publicly hosted and automatically fetched when you run:
python3 transcription.py
If you want to replace it with your own audio file:
Upload your audio to the server, for example:
scp my_audio.mp3 root@<server-ip>:/root/
In transcription.py, change these lines:
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
to:
audio = Audio.from_file("/root/my_audio.mp3", strict=False)
Run the script:
python3 transcription.py
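If your own recording is in a different container or codec, the ffmpeg installed in Step 8 can convert it first. A typical conversion to a mono 16 kHz WAV (the input filename here is just an example, and the sample rate is a conservative choice rather than a hard requirement):
ffmpeg -i my_audio.m4a -ac 1 -ar 16000 /root/my_audio.wav
Then point Audio.from_file at /root/my_audio.wav as shown above.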
For setting up and running mistralai/Voxtral-Small-24B-2507, the installation process and workflow are exactly the same as Voxtral-Mini-3B-2507 — the only key difference is the GPU configuration you choose. While Voxtral Mini can run comfortably on a single mid-to-high-end GPU like an RTX A6000 or A100 40GB, Voxtral Small demands much stronger hardware, such as 2× H100 SXM GPUs or an H100 80GB card, to handle its ~55 GB GPU memory requirement. So, when following the combined setup guide, just remember: the commands and setup flow stay the same — you only need to upgrade your GPU configuration on your NodeShift VM or cloud provider to match the heavier resource needs of Voxtral Small.
Step 1: Launch the NodeShift GPU VM
- Go to your NodeShift dashboard.
- Select:
- CPU: 224 cores (AMD EPYC 9554)
- RAM: 442 GB
- GPU: 2 × H100 SXM (80 GB per GPU)
- Disk: 300 GB
- CUDA Version: 12.1.1 or 12.2
- Region: France, FR (or any available region with these specs)
- Final cost: ~$4.724/hour
- Click Create.
Step 2: SSH into your VM
Once the VM is ready, connect:
ssh -i /path/to/newkey123 root@<VM_PUBLIC_IP>
Step 3: Install vLLM (nightly) with Audio Support
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Check installation:
python -c "import mistral_common; print(mistral_common.__version__)"
It should be ≥ 1.8.0.
Step 4: Serve Voxtral-Small-24B-2507 with vLLM
vllm serve mistralai/Voxtral-Small-24B-2507 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor-parallel-size 2 \
--tool-call-parser mistral \
--enable-auto-tool-choice
Note: The --tensor-parallel-size 2 flag splits the model across both H100 GPUs.
vllm serve mistralai/Voxtral-Small-24B-2507
→ Starts a vLLM server to host the Voxtral-Small-24B-2507 model and expose it via an API.
--tokenizer_mode mistral
→ Tells vLLM to use the Mistral-specific tokenizer (handles how text is split into tokens before processing).
--config_format mistral
→ Loads the Mistral-specific model configuration file format.
--load_format mistral
→ Ensures the model weights are loaded using the Mistral model loading scheme (some frameworks have their own formats).
--tensor-parallel-size 2
→ Splits the model across 2 GPUs (tensor parallelism) so that very large models like 24B can fit into combined GPU memory.
--tool-call-parser mistral
→ Enables Mistral-style tool call parsing, so the server knows how to handle special tool or function call requests from the input.
--enable-auto-tool-choice
→ Allows the system to automatically decide which tool or function to trigger based on the user’s audio/text input, without you manually specifying it.
This starts the vLLM server and sets up all the API routes to interact with the model.
Server routes are initialized
You can see logs like:
Route: /v1/chat/completions, Methods: POST
Route: /v1/completions, Methods: POST
Route: /v1/embeddings, Methods: POST
Route: /v1/audio/transcriptions, Methods: POST
...
These are REST API endpoints where your client (like the Python scripts) will send requests — for:
- Chat completions,
- Audio transcriptions,
- Audio translations,
- Embedding generation,
- Tool/function calling,
- Reranking,
- Invocations,
- And even a metrics endpoint for server health.
Server process starts:
INFO: Started server process [9294]
This means the backend engine (vLLM) is now running and listening.
Application startup completes:
INFO: Waiting for application startup.
INFO: Application startup complete.
At this point, the server is fully live and ready to accept API requests on the default port (usually :8000).
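Before wiring up the Python clients, you can smoke-test the chat route directly with curl. This is a plain text-only request (no audio) that simply confirms the /v1/chat/completions endpoint answers; the model name must match the one you served:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Voxtral-Small-24B-2507", "messages": [{"role": "user", "content": "Reply with one short sentence to confirm you are running."}]}'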
Step 5: Install Python client libraries for Voxtral interaction
Run this command inside your Python virtual environment:
pip install mistral_common[audio] openai huggingface_hub
Step 6: Run Multi-Audio + Text Script
In this step, you create the Python script that will send an audio + text instruct request to your running Voxtral server.
Create a new file named voxtral_client.py
and add the following code:
import sys
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# === CONFIGURATION ===
SERVER_IP = "127.0.0.1" # Use localhost since running on same VM
BASE_URL = f"http://{SERVER_IP}:8000/v1"
print(f"Connecting to Voxtral server at: {BASE_URL}")
# === SETUP CLIENT ===
try:
    client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
    models = client.models.list()
    model_id = models.data[0].id
    print(f"Model loaded on server: {model_id}")
except Exception as e:
    print("Failed to connect to server or list models:", e)
    sys.exit(1)
# === DOWNLOAD SAMPLE AUDIO FILES ===
print("Downloading sample audio files...")
try:
    obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
    bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
except Exception as e:
    print("Failed to download audio samples:", e)
    sys.exit(1)
# === PREPARE AUDIO CHUNKS ===
def file_to_chunk(file):
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
audio_chunk1 = file_to_chunk(obama_file)
audio_chunk2 = file_to_chunk(bcn_file)
# === PREPARE FIRST PROMPT ===
text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
user_msg = UserMessage(content=[audio_chunk1, audio_chunk2, text_chunk]).to_openai()
# === SEND FIRST REQUEST ===
print("\n=== Sending multi-audio + text request ===")
print(text_chunk.text)
try:
    response = client.chat.completions.create(
        model=model_id,
        messages=[user_msg],
        temperature=0.2,
        top_p=0.95,
    )
    content = response.choices[0].message.content
    print("\n=== RESPONSE ===")
    print(content)
except Exception as e:
    print("Failed to get response:", e)
    sys.exit(1)
# === PREPARE FOLLOW-UP ===
followup_msg = UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
# Maintain conversation history
messages = [
user_msg,
AssistantMessage(content=content).to_openai(),
followup_msg
]
# === SEND FOLLOW-UP REQUEST ===
print("\n=== Sending follow-up request ===")
print(followup_msg["content"])
try:
    response2 = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=0.2,
        top_p=0.95,
    )
    print("\n=== FOLLOW-UP RESPONSE ===")
    print(response2.choices[0].message.content)
except Exception as e:
    print("Failed to get follow-up response:", e)
Step 7: Run the Audio Script
In this step, you execute the Python script you wrote to send audio + text to the Voxtral server and check the model’s response.
Run the script with the following command:
python3 voxtral_client.py
Step 8: Create the Python script for audio transcription requests
In this step, you create the Python script that will send an audio transcription request to your running Voxtral server.
Create a new file named transcription.py and add the following code:
# transcription.py
import sys
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
print(f"Connecting to Voxtral server at: {BASE_URL}")
try:
    client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
    model_id = client.models.list().data[0].id
except Exception as e:
    print("Failed to connect or list models:", e)
    sys.exit(1)
audio_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(audio_file, strict=False)
raw_audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model_id, audio=raw_audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
print("Sending transcription request...")
response = client.audio.transcriptions.create(**req)
print("\n=== TRANSCRIPTION RESULT ===")
print(response.text)
What this script does:
- Connects to your local Voxtral server.
- Downloads an example audio file (obama.mp3).
- Prepares it as a raw audio input.
- Sends a transcription request to the model.
- Prints the transcribed text result.
Step 9: Execute the transcription script and check the model’s response
In this step, you execute the Python script you wrote to send the audio transcription request to the Voxtral server and check the model’s response.
Run the script using the following command:
python3 transcription.py
What you’ll see:
- It will connect to the Voxtral server at http://127.0.0.1:8000/v1.
- It will send the prepared audio file for transcription.
- You will see the printed result under:
=== TRANSCRIPTION RESULT ===
<full transcribed text here>
This confirms that the transcription pipeline is working end-to-end!
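If you prefer to hit the HTTP route directly instead of going through mistral_common, the transcription endpoint follows the OpenAI audio API shape. A rough curl equivalent (the form field names follow the OpenAI convention and the file path is a placeholder; adjust both if your vLLM version expects something different):
curl http://127.0.0.1:8000/v1/audio/transcriptions \
  -F model="mistralai/Voxtral-Small-24B-2507" \
  -F file="@/root/my_audio.mp3" \
  -F language="en"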
Step 10: Create the Python script for audio-based function calling
In this step, you create the Python script that will send an audio request with function-calling capability to your running Voxtral server.
Create a new file named voxtral_function_calling.py and add the following code.
# voxtral_function_calling.py
import sys
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
print(f"Connecting to Voxtral server at: {BASE_URL}")
client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
model_id = client.models.list().data[0].id
tool = Tool(
    function=Function(
        name="get_current_weather",
        description="Get the current weather",
        parameters={
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"},
            },
            "required": ["location", "format"],
        },
    )
)
weather_like = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(weather_like, strict=False))
print("Sending function calling request...")
user_msg = UserMessage(content=[audio_chunk]).to_openai()
response = client.chat.completions.create(
model=model_id,
messages=[user_msg],
temperature=0.2,
top_p=0.95,
tools=[tool.to_openai()]
)
print("\n=== FUNCTION CALL RESULT ===")
print(response.choices[0].message.tool_calls)
This script:
- Connects to the Voxtral server using OpenAI-compatible client.
- Defines a get_current_weather function with location + format parameters.
- Downloads a sample audio (fn_calling.wav) asking for weather info.
- Converts the audio to an AudioChunk and sends it as a user message.
- Prints the model’s detected function call and extracted arguments.
Step 11: Execute the function calling script and check the result
In this step, you execute the Python script you wrote to send the audio-based function calling request to the Voxtral server and check the function call result.
Run the script using the following command:
python3 voxtral_function_calling.py
What you’ll see:
- It connects to http://127.0.0.1:8000/v1.
- Downloads the fn_calling.wav audio file.
- Sends the function calling request to the model.
- Prints the function call result, e.g.:
{'location': 'Madrid', 'format': 'celsius'}
This confirms that Voxtral successfully understood the audio and mapped it to a backend function call!
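In a real application you would now execute the returned tool call. Here is a minimal sketch of that dispatch, appended to the end of voxtral_function_calling.py. The get_current_weather implementation is a dummy stand-in (not part of Voxtral or vLLM) so the example stays self-contained:
import json

def get_current_weather(location: str, format: str) -> str:
    # Dummy backend: a real app would call a weather API here.
    return f"It is 22 degrees {format} in {location}."

# The arguments come back as a JSON string, e.g. '{"location": "Madrid", "format": "celsius"}'
tool_call = response.choices[0].message.tool_calls[0]
if tool_call.function.name == "get_current_weather":
    args = json.loads(tool_call.function.arguments)
    print(get_current_weather(**args))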
Step 12: Create a Gradio web app to interact with Voxtral
In this step, you create a Python script with Gradio to build a simple web interface that uploads audio, adds a prompt, and interacts with the Voxtral server.
Create a new file named voxtral_gradio.py and add the following code.
# voxtral_gradio.py
import gradio as gr
from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage
from mistral_common.audio import Audio
from openai import OpenAI
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
model_id = client.models.list().data[0].id
def transcribe(audio_file, prompt):
    # Wrap the uploaded audio and the text prompt as chunks for the instruct request
    audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
    user_msg = UserMessage(content=[audio_chunk, TextChunk(text=prompt)]).to_openai()
    response = client.chat.completions.create(
        model=model_id,
        messages=[user_msg],
        temperature=0.2,
        top_p=0.95,
    )
    return response.choices[0].message.content
gr.Interface(
fn=transcribe,
inputs=[gr.Audio(type="filepath"), gr.Textbox(label="Prompt")],
outputs=gr.Textbox(label="Response"),
title="Voxtral Audio-Text Chat",
description="Upload audio and add a text prompt for Voxtral-Small-24B-2507.",
).launch(server_name="0.0.0.0", server_port=7860)
What this script does:
- Sets up a Gradio interface with:
- Audio upload input (filepath)
- Textbox input (Prompt)
- Textbox output (Response)
- Connects to your Voxtral server (127.0.0.1:8000/v1)
- Sends the audio + prompt to the model
- Returns the model’s response in the web UI
Run the Gradio script on the VM
On your VM terminal, launch the demo:
python3 voxtral_gradio.py
You should see:
* Running on local URL: http://0.0.0.0:7860
* To create a public link, set `share=True` in `launch()`.
This means the Gradio server is running on port 7860 inside the VM.
Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -p 19369 -L 7860:127.0.0.1:7860 root@149.7.4.9
This forwards:
- Local localhost:7860 → Remote VM 127.0.0.1:7860
Step 13: Open the Gradio Web Interface
After you’ve forwarded the port and launched the script, open your browser and go to:
http://localhost:7860
You should see the Gradio web UI titled:
Voxtral Audio-Text Chat
This is your interactive playground to chat with the Mistral Voxtral model.
Step 14: Create a Python benchmark script to measure Voxtral’s speed
In this step, you create a Python script that runs multiple audio + text requests to the Voxtral server and measures the average latency.
Create a new file named voxtral_benchmark.py and add the following code.
# voxtral_benchmark.py
import time
from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# CONFIG
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
model_id = client.models.list().data[0].id
# DOWNLOAD AUDIO SAMPLE
print("Downloading sample audio...")
audio_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
text_chunk = TextChunk(text="Summarize this speech.")
# BENCHMARK PARAMETERS
N_RUNS = 5
latencies = []
print(f"\nRunning {N_RUNS} benchmark runs...")
for i in range(N_RUNS):
    user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()
    start = time.time()
    response = client.chat.completions.create(
        model=model_id,
        messages=[user_msg],
        temperature=0.2,
        top_p=0.95,
    )
    end = time.time()
    duration = end - start
    latencies.append(duration)
    print(f"Run {i+1}/{N_RUNS}: {duration:.2f} sec")
avg = sum(latencies) / len(latencies)
print(f"\n=== Benchmark Complete ===\nAverage latency over {N_RUNS} runs: {avg:.2f} sec")
What this script does:
- Downloads a sample audio (obama.mp3)
- Prepares a text prompt: “Summarize this speech.”
- Runs 5 repeated requests (N_RUNS = 5)
- Measures and prints the time taken for each run
- Calculates and prints the average latency
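The average can hide warm-up effects and outliers, so if you want a slightly fuller picture you could append a few more lines to voxtral_benchmark.py using Python’s statistics module (an optional addition, not part of the original script):
import statistics

# Spread of latencies across the benchmark runs
print(f"Min: {min(latencies):.2f} sec | Max: {max(latencies):.2f} sec")
print(f"Median: {statistics.median(latencies):.2f} sec | Stdev: {statistics.stdev(latencies):.2f} sec")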
Step 15: Execute the benchmark script and check Voxtral’s average latency
In this step, you run the benchmark script you created to measure the Voxtral server’s speed across multiple requests.
Run the script with the following command:
python3 voxtral_benchmark.py
What you’ll see:
- It downloads the sample audio file.
- Runs 5 benchmark tests (by default).
- Prints the time taken for each run (e.g., 2.77 sec, 2.46 sec…).
- Calculates and displays the average latency, e.g.:
Average latency over 5 runs: 2.94 sec
This step helps you understand the performance and response time of your Voxtral deployment.
Conclusion: Voxtral Isn’t Just Tech — It’s a Leap Toward Smarter Audio Understanding
In a world where audio data is everywhere — from calls and interviews to podcasts and live meetings — Voxtral Mini (3B) and Voxtral Small (24B) offer something more than just transcription: they deliver understanding.
These models don’t just turn sound into text; they help you summarize, translate, analyze, and even take action — all through voice. Whether you’re a developer, a researcher, or part of a product team, Voxtral opens the door to building tools that feel intuitive, responsive, and human-aware.
What’s even better? You don’t need a huge data center or a PhD to get started. With straightforward setup, flexible GPU options, and a rich Python ecosystem, you can spin up your own Voxtral playground — locally, on cloud machines like NodeShift, or anywhere you like.
So here’s the real takeaway: this isn’t just another model install guide. It’s your invitation to push audio-to-text workflows into the future — faster, smarter, and more meaningful.
Now, it’s over to you. Go build something remarkable.