Meet Voxtral Mini & Small: Breakthrough Audio-Text Models
In the fast-moving world of audio technology, Voxtral Mini (3B) and Voxtral Small (24B) stand out as powerful tools designed to understand not just words — but voices, languages, and meaning.
Imagine a system that listens to speeches, interviews, podcasts, or phone calls and not only transcribes them but summarizes key points, translates languages, answers follow-up questions, or even triggers backend actions — all in one streamlined flow.
Both Voxtral Mini and Voxtral Small are built on top of solid text processing backbones, but they go several steps further by adding state-of-the-art audio input abilities. You can feed them audio clips of up to 30–40 minutes, and they’ll handle it with impressive detail, whether that’s simple transcription or deeper understanding tasks like Q&A or generating summaries.
Here’s a quick feel for each:
- Voxtral Mini (3B)
Small but mighty, this version is perfect for light workloads or setups where GPU resources are limited. It’s great for multilingual transcription, audio understanding, and chat-style interactions, and needs roughly 9.5 GB of GPU memory in optimized (bf16/fp16) modes.
- Voxtral Small (24B)
This bigger sibling offers heavyweight performance, ideal for more demanding tasks and deeper audio-to-text workflows. It comes with support for advanced features like function calling from voice, where spoken commands can directly interact with backend systems. Running it requires around 55 GB of GPU memory, making it suitable for well-resourced server environments.
What makes both models exciting is their native multilingual strength — they handle English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian — and they do so with automatic language detection, so you don’t need to predefine anything.
In short, Voxtral Mini and Small blur the line between speech and action:
✅ Listen → ✅ Understand → ✅ Respond → ✅ Act.
For developers, researchers, and tech teams, they open the door to building next-generation audio apps, whether for customer support, media analysis, accessibility tools, or voice-driven workflows.
GPU Configuration Table for Voxtral Mini 3B and Voxtral Small 24B
| Model | GPU Memory Required (bf16/fp16) | Recommended GPU Types | vLLM Setup Notes |
|---|---|---|---|
| Voxtral Mini (3B) | ~9.5 GB | NVIDIA A100 40GB, H100 80GB, RTX A6000 48GB, RTX 4090 24GB | Runs on a single mid-to-high-end GPU; good for local testing or light server use |
| Voxtral Small (24B) | ~55 GB | NVIDIA H100 80GB (single), A100 80GB (single), or multi-GPU setup (2x A100 40GB with tensor parallelism) | Needs multi-GPU if using lower-memory cards; use --tensor-parallel-size 2 to split across GPUs |
Key Recommendations
✅ For small-scale use or testing →
Start with Voxtral Mini (3B) on a single high-memory GPU like RTX A6000 or A100 40GB.
✅ For production or heavy workloads →
Use Voxtral Small (24B) on a single H100 80GB or split across multiple GPUs (e.g., 2x A100 40GB) with tensor parallelism enabled.
✅ General tips →
- Always use bf16 or fp16 to reduce memory load.
- On multi-GPU, configure --tensor-parallel-size properly.
- Check disk space (~20–40GB) for model weights + audio datasets.
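If you want to sanity-check a machine against these numbers before downloading anything, two quick commands cover it (standard NVIDIA and Linux tooling, nothing Voxtral-specific):
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
df -h /
The first prints each GPU’s total and free memory; the second shows free disk space on the root filesystem.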
Resources
Link: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
Link: https://huggingface.co/mistralai/Voxtral-Small-24B-2507
Step-by-Step Process to Install Mistral Voxtral Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Mistral Voxtral, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Mistral Voxtral
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Mistral Voxtral.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Mistral Voxtral runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Update System & Install Essentials
Run the following commands to update the system and install the essentials:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git ffmpeg
Step 9: Install python3-venv and Activate Virtual Environment
Run the following command to install python3-venv:
sudo apt install python3-venv -y
Then, run the following commands to create and activate the virtual environment:
python3 -m venv voxtral-env
source voxtral-env/bin/activate
Step 10: Install uv and vLLM
Run the following commands to install uv and vLLM:
pip install uv
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Step 11: Load the Model and Start the Server
Run the following command to load the model and start the server:
vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
This will:
- Load the Voxtral-Mini-3B-2507 model
- Start an OpenAI-compatible server at:
http://localhost:8000/v1
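Once the server reports it is listening, you can sanity-check it from another terminal on the VM. This simply lists the models the OpenAI-compatible API exposes and assumes the default port 8000:
curl http://localhost:8000/v1/models
The response should be a small JSON payload that includes mistralai/Voxtral-Mini-3B-2507.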
Step 12: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you’ll see SSH: 45.135.56.11 (your VM IP) in the bottom-left status bar (as shown in the image).
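If you haven’t configured the host yet, a minimal ~/.ssh/config entry is enough for VS Code to find it. The host alias and key path below are placeholders (the IP is the one used in this tutorial); add a Port line if your instance uses a non-default SSH port:
Host nodeshift-voxtral
    HostName 45.135.56.11
    User root
    IdentityFile ~/.ssh/your_private_key
After saving this, the alias nodeshift-voxtral appears in the Remote-SSH host list.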
Step 13: Write the Audio Script
In this step, you create the Python script that will send an audio + text instruct request to your running Voxtral server.
Create a new file named audio.py and add the following code:
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Connect to local server
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
# Get model name
model = client.models.list().data[0].id
# Download sample audio
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
# Prepare instruct message
text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
# Run request
response = client.chat.completions.create(
model=model,
messages=[user_msg],
temperature=0.2,
top_p=0.95,
)
print(response.choices[0].message.content)
Step 14: Run the Audio Script
In this step, you execute the Python script you wrote to send audio + text to the Voxtral server and check the model’s response.
Run the script with the following command:
python3 audio.py
Watch for the output. You should see:
- Progress bars for downloading:
obama.mp3: 100%
bcn_weather.mp3: 100%
- A printed response like:
The speaker who is more inspiring is the one who delivered the farewell address...
The difference lies in the content and purpose of the speeches...
This means the model successfully analyzed the audios and answered the question.
Step 15: Write the Transcription Script
In this step, you create the Python script that will transcribe audio using the Voxtral model.
Create a new file named transcription.py and add the following code:
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.audio import Audio
from mistral_common.protocol.instruct.messages import RawAudio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Connect to local server
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)
# Get model name
model = client.models.list().data[0].id
# Download sample audio
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
audio = RawAudio.from_audio(audio)
# Prepare transcription request
req = TranscriptionRequest(
model=model,
audio=audio,
language="en",
temperature=0.0
).to_openai(exclude=("top_p", "seed"))
# Run request
response = client.audio.transcriptions.create(**req)
print(response)
What this script does:
- Downloads the obama.mp3 audio sample.
- Sends it to the Voxtral model for transcription.
- Prints the transcribed text.
Step 16: Use Your Own Audio (or Public Sample)
In this step, you can use your own audio file for transcription — but here, we are using a publicly available audio provided through Hugging Face for demonstration.
By default, the script downloads obama.mp3 → Obama’s farewell address sample.
This audio is publicly hosted and automatically fetched when you run:
python3 transcription.py
If you want to replace it with your own audio file:
Upload your audio to the server, for example:
scp my_audio.mp3 root@<server-ip>:/root/
In transcription.py, change these lines:
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
to:
audio = Audio.from_file("/root/my_audio.mp3", strict=False)
Run the script:
python3 transcription.py
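If your own recording is in a different container or codec, the ffmpeg installed in Step 8 can convert it first. A typical conversion to a mono 16 kHz WAV (the input filename here is just an example, and the sample rate is a conservative choice rather than a hard requirement):
ffmpeg -i my_audio.m4a -ac 1 -ar 16000 /root/my_audio.wav
Then point Audio.from_file at /root/my_audio.wav as shown above.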
For setting up and running mistralai/Voxtral-Small-24B-2507, the installation process and workflow are exactly the same as Voxtral-Mini-3B-2507 — the only key difference is the GPU configuration you choose. While Voxtral Mini can run comfortably on a single mid-to-high-end GPU like an RTX A6000 or A100 40GB, Voxtral Small demands much stronger hardware, such as 2× H100 SXM GPUs or an H100 80GB card, to handle its ~55 GB GPU memory requirement. So, when following the combined setup guide, just remember: the commands and setup flow stay the same — you only need to upgrade your GPU configuration on your NodeShift VM or cloud provider to match the heavier resource needs of Voxtral Small.
Step 1: Launch the NodeShift GPU VM
- Go to your NodeShift dashboard.
- Select:
- CPU: 224 cores (AMD EPYC 9554)
- RAM: 442 GB
- GPU: 2 × H100 SXM (80 GB per GPU)
- Disk: 300 GB
- CUDA Version: 12.1.1 or 12.2
- Region: France, FR (or any available region with these specs)
- Final cost: ~$4.724/hour
- Click Create.
Step 2: SSH into your VM
Once the VM is ready, connect:
ssh -i /path/to/newkey123 root@<VM_PUBLIC_IP>
Step 3: Install vLLM (nightly) with Audio Support
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Check installation:
python -c "import mistral_common; print(mistral_common.__version__)"
It should be ≥ 1.8.0.
Step 4: Serve Voxtral-Small-24B-2507 with vLLM
vllm serve mistralai/Voxtral-Small-24B-2507 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor-parallel-size 2 \
--tool-call-parser mistral \
--enable-auto-tool-choice
Note: The --tensor-parallel-size 2 flag splits the model across both H100 GPUs.
vllm serve mistralai/Voxtral-Small-24B-2507
→ Starts a vLLM server to host the Voxtral-Small-24B-2507 model and expose it via an API.
--tokenizer_mode mistral
→ Tells vLLM to use the Mistral-specific tokenizer (handles how text is split into tokens before processing).
--config_format mistral
→ Loads the Mistral-specific model configuration file format.
--load_format mistral
→ Ensures the model weights are loaded using the Mistral model loading scheme (some frameworks have their own formats).
--tensor-parallel-size 2
→ Splits the model across 2 GPUs (tensor parallelism) so that very large models like 24B can fit into combined GPU memory.
--tool-call-parser mistral
→ Enables Mistral-style tool call parsing, so the server knows how to handle special tool or function call requests from the input.
--enable-auto-tool-choice
→ Allows the system to automatically decide which tool or function to trigger based on the user’s audio/text input, without you manually specifying it.
This starts the vLLM server and sets up all the API routes to interact with the model.
Server routes are initialized
You can see logs like:
Route: /v1/chat/completions, Methods: POST
Route: /v1/completions, Methods: POST
Route: /v1/embeddings, Methods: POST
Route: /v1/audio/transcriptions, Methods: POST
...
These are REST API endpoints where your client (like the Python scripts) will send requests — for:
- Chat completions,
- Audio transcriptions,
- Audio translations,
- Embedding generation,
- Tool/function calling,
- Reranking,
- Invocations,
- And even a metrics endpoint for server health.
Server process starts:
INFO: Started server process [9294]
This means the backend engine (vLLM) is now running and listening.
Application startup completes:
INFO: Waiting for application startup.
INFO: Application startup complete.
At this point, the server is fully live and ready to accept API requests on the default port (usually :8000).
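Before wiring up the Python clients, you can smoke-test the chat route directly with curl. This is a plain text-only request (no audio) that simply confirms the /v1/chat/completions endpoint answers; the model name must match the one you served:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Voxtral-Small-24B-2507", "messages": [{"role": "user", "content": "Reply with one short sentence to confirm you are running."}]}'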
Step 5: Install Python client libraries for Voxtral interaction
Run this command inside your Python virtual environment:
pip install mistral_common[audio] openai huggingface_hub
Step 6: Run Multi-Audio + Text Script
In this step, you create the Python script that will send an audio + text instruct request to your running Voxtral server.
Create a new file named voxtral_client.py
and add the following code:
import sys
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# === CONFIGURATION ===
SERVER_IP = "127.0.0.1" # Use localhost since running on same VM
BASE_URL = f"http://{SERVER_IP}:8000/v1"
print(f"Connecting to Voxtral server at: {BASE_URL}")
# === SETUP CLIENT ===
try:
    client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
    models = client.models.list()
    model_id = models.data[0].id
    print(f"Model loaded on server: {model_id}")
except Exception as e:
    print("Failed to connect to server or list models:", e)
    sys.exit(1)
# === DOWNLOAD SAMPLE AUDIO FILES ===
print("Downloading sample audio files...")
try:
    obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
    bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
except Exception as e:
    print("Failed to download audio samples:", e)
    sys.exit(1)
# === PREPARE AUDIO CHUNKS ===
def file_to_chunk(file):
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
audio_chunk1 = file_to_chunk(obama_file)
audio_chunk2 = file_to_chunk(bcn_file)
# === PREPARE FIRST PROMPT ===
text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
user_msg = UserMessage(content=[audio_chunk1, audio_chunk2, text_chunk]).to_openai()
# === SEND FIRST REQUEST ===
print("\n=== Sending multi-audio + text request ===")
print(text_chunk.text)
try:
    response = client.chat.completions.create(
        model=model_id,
        messages=[user_msg],
        temperature=0.2,
        top_p=0.95,
    )
    content = response.choices[0].message.content
    print("\n=== RESPONSE ===")
    print(content)
except Exception as e:
    print("Failed to get response:", e)
    sys.exit(1)
# === PREPARE FOLLOW-UP ===
followup_msg = UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
# Maintain conversation history
messages = [
user_msg,
AssistantMessage(content=content).to_openai(),
followup_msg
]
# === SEND FOLLOW-UP REQUEST ===
print("\n=== Sending follow-up request ===")
print(followup_msg["content"])
try:
    response2 = client.chat.completions.create(
        model=model_id,
        messages=messages,
        temperature=0.2,
        top_p=0.95,
    )
    print("\n=== FOLLOW-UP RESPONSE ===")
    print(response2.choices[0].message.content)
except Exception as e:
    print("Failed to get follow-up response:", e)
Step 7: Run the Audio Script
In this step, you execute the Python script you wrote to send audio + text to the Voxtral server and check the model’s response.
Run the script with the following command:
python3 voxtral_client.py
Step 8: Create the Python script for audio transcription requests
In this step, you create the Python script that will send an audio transcription request to your running Voxtral server.
Create a new file named transcription.py and add the following code:
# transcription.py
import sys
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
print(f"Connecting to Voxtral server at: {BASE_URL}")
try:
    client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
    model_id = client.models.list().data[0].id
except Exception as e:
    print("Failed to connect or list models:", e)
    sys.exit(1)
audio_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(audio_file, strict=False)
raw_audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model_id, audio=raw_audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
print("Sending transcription request...")
response = client.audio.transcriptions.create(**req)
print("\n=== TRANSCRIPTION RESULT ===")
print(response.text)
What this script does:
- Connects to your local Voxtral server.
- Downloads an example audio file (obama.mp3).
- Prepares it as a raw audio input.
- Sends a transcription request to the model.
- Prints the transcribed text result.
Step 9: Execute the transcription script and check the model’s response
In this step, you execute the Python script you wrote to send the audio transcription request to the Voxtral server and check the model’s response.
Run the script using the following command:
python3 transcription.py
What you’ll see:
- It will connect to the Voxtral server at http://127.0.0.1:8000/v1.
- It will send the prepared audio file for transcription.
- You will see the printed result under:
=== TRANSCRIPTION RESULT ===
<full transcribed text here>
This confirms that the transcription pipeline is working end-to-end!
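If you prefer to hit the HTTP route directly instead of going through mistral_common, the transcription endpoint follows the OpenAI audio API shape. A rough curl equivalent (the form field names follow the OpenAI convention and the file path is a placeholder; adjust both if your vLLM version expects something different):
curl http://127.0.0.1:8000/v1/audio/transcriptions \
  -F model="mistralai/Voxtral-Small-24B-2507" \
  -F file="@/root/my_audio.mp3" \
  -F language="en"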
Step 10: Create the Python script for audio-based function calling
In this step, you create the Python script that will send an audio request with function-calling capability to your running Voxtral server.
Create a new file named voxtral_function_calling.py and add the following code.
# voxtral_function_calling.py
import sys
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
print(f"Connecting to Voxtral server at: {BASE_URL}")
client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
model_id = client.models.list().data[0].id
tool = Tool(
    function=Function(
        name="get_current_weather",
        description="Get the current weather",
        parameters={
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"},
            },
            "required": ["location", "format"],
        },
    )
)
weather_like = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(weather_like, strict=False))
print("Sending function calling request...")
user_msg = UserMessage(content=[audio_chunk]).to_openai()
response = client.chat.completions.create(
model=model_id,
messages=[user_msg],
temperature=0.2,
top_p=0.95,
tools=[tool.to_openai()]
)
print("\n=== FUNCTION CALL RESULT ===")
print(response.choices[0].message.tool_calls)
This script:
- Connects to the Voxtral server using OpenAI-compatible client.
- Defines a get_current_weather function with location + format parameters.
- Downloads a sample audio (fn_calling.wav) asking for weather info.
- Converts the audio to an AudioChunk and sends it as a user message.
- Prints the model’s detected function call and extracted arguments.
Step 11: Execute the function calling script and check the result
In this step, you execute the Python script you wrote to send the audio-based function calling request to the Voxtral server and check the function call result.
Run the script using the following command:
python3 voxtral_function_calling.py
What you’ll see:
- It connects to http://127.0.0.1:8000/v1.
- Downloads the fn_calling.wav audio file.
- Sends the function calling request to the model.
- Prints the function call result, e.g.:
{'location': 'Madrid', 'format': 'celsius'}
This confirms that Voxtral successfully understood the audio and mapped it to a backend function call!
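In a real application you would now execute the returned tool call. Here is a minimal sketch of that dispatch, appended to the end of voxtral_function_calling.py. The get_current_weather implementation is a dummy stand-in (not part of Voxtral or vLLM) so the example stays self-contained:
import json

def get_current_weather(location: str, format: str) -> str:
    # Dummy backend: a real app would call a weather API here.
    return f"It is 22 degrees {format} in {location}."

# The arguments come back as a JSON string, e.g. '{"location": "Madrid", "format": "celsius"}'
tool_call = response.choices[0].message.tool_calls[0]
if tool_call.function.name == "get_current_weather":
    args = json.loads(tool_call.function.arguments)
    print(get_current_weather(**args))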
Step 12: Create a Gradio web app to interact with Voxtral
In this step, you create a Python script with Gradio to build a simple web interface that uploads audio, adds a prompt, and interacts with the Voxtral server.
Create a new file named voxtral_gradio.py and add the following code.
# voxtral_gradio.py
import gradio as gr
from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage
from mistral_common.audio import Audio
from openai import OpenAI
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
model_id = client.models.list().data[0].id
def transcribe(audio_file, prompt):
    # Wrap the uploaded audio and the text prompt as chunks for the instruct request
    audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
    user_msg = UserMessage(content=[audio_chunk, TextChunk(text=prompt)]).to_openai()
    response = client.chat.completions.create(
        model=model_id,
        messages=[user_msg],
        temperature=0.2,
        top_p=0.95,
    )
    return response.choices[0].message.content
gr.Interface(
fn=transcribe,
inputs=[gr.Audio(type="filepath"), gr.Textbox(label="Prompt")],
outputs=gr.Textbox(label="Response"),
title="Voxtral Audio-Text Chat",
description="Upload audio and add a text prompt for Voxtral-Small-24B-2507.",
).launch(server_name="0.0.0.0", server_port=7860)
What this script does:
- Sets up a Gradio interface with:
- Audio upload input (filepath)
- Textbox input (Prompt)
- Textbox output (Response)
- Connects to your Voxtral server (127.0.0.1:8000/v1)
- Sends the audio + prompt to the model
- Returns the model’s response in the web UI
Run the Gradio script on the VM
On your VM terminal, launch the demo:
python3 voxtral_gradio.py
You should see:
* Running on local URL: http://0.0.0.0:7860
* To create a public link, set `share=True` in `launch()`.
This means the Gradio server is running on port 7860 inside the VM.
Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -p 19369 -L 7860:127.0.0.1:7860 root@149.7.4.9
This forwards:
- Local localhost:7860 → Remote VM 127.0.0.1:7860
Step 13: Open the Gradio Web Interface
After you’ve forwarded the port and launched the script, open your browser and go to:
http://localhost:7860
You should see the Gradio web UI titled:
Voxtral Audio-Text Chat
This is your interactive playground to chat with the Mistral Voxtral model.
Step 14: Create a Python benchmark script to measure Voxtral’s speed
In this step, you create a Python script that runs multiple audio + text requests to the Voxtral server and measures the average latency.
Create a new file named voxtral_benchmark.py and add the following code.
# voxtral_benchmark.py
import time
from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# CONFIG
SERVER_IP = "127.0.0.1"
BASE_URL = f"http://{SERVER_IP}:8000/v1"
client = OpenAI(api_key="EMPTY", base_url=BASE_URL)
model_id = client.models.list().data[0].id
# DOWNLOAD AUDIO SAMPLE
print("Downloading sample audio...")
audio_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
text_chunk = TextChunk(text="Summarize this speech.")
# BENCHMARK PARAMETERS
N_RUNS = 5
latencies = []
print(f"\nRunning {N_RUNS} benchmark runs...")
for i in range(N_RUNS):
    user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()
    start = time.time()
    response = client.chat.completions.create(
        model=model_id,
        messages=[user_msg],
        temperature=0.2,
        top_p=0.95,
    )
    end = time.time()
    duration = end - start
    latencies.append(duration)
    print(f"Run {i+1}/{N_RUNS}: {duration:.2f} sec")
avg = sum(latencies) / len(latencies)
print(f"\n=== Benchmark Complete ===\nAverage latency over {N_RUNS} runs: {avg:.2f} sec")
What this script does:
- Downloads a sample audio (obama.mp3)
- Prepares a text prompt: “Summarize this speech.”
- Runs 5 repeated requests (N_RUNS = 5)
- Measures and prints the time taken for each run
- Calculates and prints the average latency
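The average can hide warm-up effects and outliers, so if you want a slightly fuller picture you could append a few more lines to voxtral_benchmark.py using Python’s statistics module (an optional addition, not part of the original script):
import statistics

# Spread of latencies across the benchmark runs
print(f"Min: {min(latencies):.2f} sec | Max: {max(latencies):.2f} sec")
print(f"Median: {statistics.median(latencies):.2f} sec | Stdev: {statistics.stdev(latencies):.2f} sec")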
Step 15: Execute the benchmark script and check Voxtral’s average latency
In this step, you run the benchmark script you created to measure the Voxtral server’s speed across multiple requests.
Run the script with the following command:
python3 voxtral_benchmark.py
What you’ll see:
- It downloads the sample audio file.
- Runs 5 benchmark tests (by default).
- Prints the time taken for each run (e.g., 2.77 sec, 2.46 sec…).
- Calculates and displays the average latency, e.g.:
Average latency over 5 runs: 2.94 sec
This step helps you understand the performance and response time of your Voxtral deployment.
Conclusion: Voxtral Isn’t Just Tech — It’s a Leap Toward Smarter Audio Understanding
In a world where audio data is everywhere — from calls and interviews to podcasts and live meetings — Voxtral Mini (3B) and Voxtral Small (24B) offer something more than just transcription: they deliver understanding.
These models don’t just turn sound into text; they help you summarize, translate, analyze, and even take action — all through voice. Whether you’re a developer, a researcher, or part of a product team, Voxtral opens the door to building tools that feel intuitive, responsive, and human-aware.
What’s even better? You don’t need a huge data center or a PhD to get started. With straightforward setup, flexible GPU options, and a rich Python ecosystem, you can spin up your own Voxtral playground — locally, on cloud machines like NodeShift, or anywhere you like.
So here’s the real takeaway: this isn’t just another model install guide. It’s your invitation to push audio-to-text workflows into the future — faster, smarter, and more meaningful.
Now, it’s over to you. Go build something remarkable.