In the world of search engines and information retrieval, precision matters. That's where zerank-1-small comes in: a compact yet powerful reranker model developed by ZeroEntropy. Designed to boost the accuracy of search results, this 1.7B-parameter model is a lighter sibling of the flagship zerank-1, delivering impressive performance at less than half the size.
What sets zerank-1-small apart is its ability to consistently outperform many well-known rerankers and deliver significant accuracy improvements over traditional vector search methods. Whether applied to fields like finance, legal, STEM, code, or medical queries, the model enhances the ranking of retrieved documents to ensure users get the most relevant answers.
Released under the open-source Apache 2.0 license, zerank-1-small is part of ZeroEntropy’s commitment to advancing open-source tools and empowering developers, researchers, and organizations to build better retrieval systems without proprietary restrictions.
Evaluations
The table below compares NDCG@10 scores for zerank-1-small and competing closed-source proprietary rerankers. Since we are evaluating rerankers, OpenAI's text-embedding-3-small is used as the initial retriever to fetch the Top 100 candidate documents.
Task | Embedding | cohere-rerank-v3.5 | Salesforce/Llama-rank-v1 | zerank-1-small | zerank-1 |
---|---|---|---|---|---|
Code | 0.678 | 0.724 | 0.694 | 0.730 | 0.754 |
Conversational | 0.250 | 0.571 | 0.484 | 0.556 | 0.596 |
Finance | 0.839 | 0.824 | 0.828 | 0.861 | 0.894 |
Legal | 0.703 | 0.804 | 0.767 | 0.817 | 0.821 |
Medical | 0.619 | 0.750 | 0.719 | 0.773 | 0.796 |
STEM | 0.401 | 0.510 | 0.595 | 0.680 | 0.694 |
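For context, NDCG@10 rewards rankings that place the most relevant documents near the top of the list, discounting relevance by rank position. A minimal sketch of the metric (using the standard log2 discount; this is illustrative code, not the benchmark's evaluation script):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each relevance grade is discounted
    # by log2 of its (1-based) rank position plus one.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1]))  # perfect ordering -> 1.0
print(round(ndcg_at_k([1, 2, 3]), 3))  # reversed ordering scores lower
```

A reranker's job is exactly to push the NDCG of the retrieved list closer to 1.0 by reordering the top candidates.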
Recommended GPU VM configuration
Component | Minimum | Recommended |
---|---|---|
GPU | 1x NVIDIA A10 / A100 / RTX A6000 (24–40GB VRAM) | 1x NVIDIA A100 (40–80GB VRAM) |
CPU | 4–8 vCPU | 8–16 vCPU |
RAM | 16–32 GB | 32–64 GB |
Disk | 50+ GB SSD | 100+ GB SSD |
OS | Ubuntu 20.04 / 22.04 | Ubuntu 22.04 |
CUDA | 11.8 or higher | 12.1 or higher |
Resources
Link: https://huggingface.co/zeroentropy/zerank-1-small
Step-by-Step Process to Install & Run ZeroEntropy Zerank 1 Small Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTXA6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Zerank 1 Small, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Zerank 1 Small
- Compatibility with CUDA 12.1.1, required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Zerank 1 Small.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Zerank 1 Small runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the available Python version:
python3 --version
The system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following command to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Set up Python environment
Run the following command to set up the Python environment:
python3 -m venv zerank-env
source zerank-env/bin/activate
Step 13: Install PyTorch with GPU support
Run the following command to install torch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Step 14: Install sentence-transformers and accelerate
Run the following commands to install sentence-transformers and accelerate:
pip install sentence-transformers
pip install accelerate
Step 15: Connect to your GPU VM using Remote SSH
- Open VS Code on your Mac.
- Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.
- Select your configured host.
- Once connected, you'll see SSH: 209.137.198.14 (your VM IP) in the bottom-left status bar (as in the image).
Step 16: Run zerank-1-small model
Create a Python script (e.g., run_zerank.py) and add the following code:
from sentence_transformers import CrossEncoder
# Load the model
model = CrossEncoder("zeroentropy/zerank-1-small", trust_remote_code=True)
# Example query-doc pairs
query_documents = [
("What is 2+2?", "4"),
("What is 2+2?", "The answer is definitely 1 million"),
]
# Get scores
scores = model.predict(query_documents)
print("Rerank scores:", scores)
run_zerank.py
This script is a simple Python file that loads the zeroentropy/zerank-1-small reranker model using the sentence-transformers library and runs it on predefined query-document pairs. We used it to directly test model inference on the GPU VM, without any UI or API, to ensure the model downloads, initializes, and produces relevance scores between queries and documents. It outputs the scores to the terminal, letting us check that relevant pairs (like "What is 2+2?" and "4") get high scores, while irrelevant ones get low scores. It's the foundation script to make sure the model itself works.
Step 17: Run the script
python3 run_zerank.py
You will see:
Model downloads:
model.safetensors: 100%|█████████████████| 3.44G/3.44G [...]
tokenizer.json: 100%|████████████████████| 11.4M/11.4M [...]
...
And final rerank scores:
Rerank scores: [0.6470849025528733, 0.28521265886319286]
In the same file, paste the following code:
from sentence_transformers import CrossEncoder
model = CrossEncoder("zeroentropy/zerank-1-small", trust_remote_code=True)
# Replace here with the test prompts
query_documents = [
("Who wrote Hamlet?", "Shakespeare wrote Hamlet."),
("Who wrote Hamlet?", "Einstein was the author of Hamlet."),
]
scores = model.predict(query_documents)
print("Rerank scores:", scores)
Then, again run the script:
python3 run_zerank.py
You're testing two pairs:
- "Who wrote Hamlet?" → "Shakespeare wrote Hamlet." → high score (~0.85)
- "Who wrote Hamlet?" → "Einstein was the author of Hamlet." → lower score (~0.56)
The model is ranking their relevance.
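One practical way to use these scores is to drop candidates below a cutoff before returning results. A tiny sketch, where the 0.6 threshold is purely illustrative (not a value from ZeroEntropy's documentation; tune it on your own data):

```python
# Hypothetical cutoff for filtering weak matches; an assumption
# for illustration, not a recommended value.
THRESHOLD = 0.6

scored_docs = [
    ("Shakespeare wrote Hamlet.", 0.85),
    ("Einstein was the author of Hamlet.", 0.56),
]
kept = [doc for doc, score in scored_docs if score >= THRESHOLD]
print(kept)  # ['Shakespeare wrote Hamlet.']
```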
Step 18: CLI Interactive Script (type in terminal)
Create a Python script (e.g., cli_rerank.py) and add the following code:
from sentence_transformers import CrossEncoder
model = CrossEncoder("zeroentropy/zerank-1-small", trust_remote_code=True)
print("💬 ZeroEntropy Zerank-1-Small CLI Reranker")
print("Type 'exit' anytime to quit.\n")
while True:
    query = input("Enter query: ")
    if query.lower() == "exit":
        break
    doc = input("Enter document: ")
    if doc.lower() == "exit":
        break
    score = model.predict([(query, doc)])[0]
    print(f"🔹 Relevance Score: {score:.4f}\n")
cli_rerank.py
This is an interactive command-line interface (CLI) script where you can type in query and document pairs live in the terminal. We used it to manually explore different inputs and see their rerank scores on the fly, without modifying code or restarting scripts. This was useful for quick, hands-on testing and exploration, like an interactive playground in the terminal. It keeps running in a loop, letting you test as many pairs as you want, and exits gracefully when you type exit.
Step 19: Run the script
python3 cli_rerank.py
Use the CLI interactively. When prompted with Enter query:, type:
Who discovered gravity?
When prompted with Enter document:, type:
Isaac Newton discovered gravity after observing a falling apple.
You will see:
🔹 Relevance Score: 0.8991
Step 20: Install Pandas
Run the following command to install pandas:
pip install pandas
Step 21: Batch Reranking from CSV
Create a Python script (e.g., batch_rerank.py) and add the following code:
import pandas as pd
from sentence_transformers import CrossEncoder
model = CrossEncoder("zeroentropy/zerank-1-small", trust_remote_code=True)
# Example CSV: input.csv with columns: query, document
df = pd.read_csv("input.csv")
pairs = list(zip(df['query'], df['document']))
scores = model.predict(pairs)
df['score'] = scores
df.to_csv("output_with_scores.csv", index=False)
print("✅ Saved reranked results to 'output_with_scores.csv'")
Prepare your input CSV file
Create a file named:
input.csv
Paste this sample content:
query,document
Who wrote Hamlet?,Shakespeare wrote Hamlet.
Who discovered gravity?,Isaac Newton discovered gravity after observing a falling apple.
What is the capital of France?,Paris is the capital of France.
batch_rerank.py
This batch script reads multiple query-document pairs from an input CSV file (input.csv), runs them all through the reranker model, and saves the results with relevance scores into a new CSV file (output_with_scores.csv). We used this script to automate reranking over larger datasets or lists of pairs, making it ideal when you have dozens or hundreds of examples to process at once. It's practical for offline experiments, dataset evaluations, or preparing reranked outputs for further analysis or reporting.
Step 22: Run the batch script
python3 batch_rerank.py
You should see:
✅ Saved reranked results to 'output_with_scores.csv'
Check the output file
You will see something like:
query,document,score
Who wrote Hamlet?,Shakespeare wrote Hamlet.,0.8487
Who discovered gravity?,Isaac Newton discovered gravity after observing a falling apple.,0.9168
What is the capital of France?,Paris is the capital of France.,0.8921
Step 23: Install Gradio
Run the following command to install gradio:
pip install gradio
Step 24: Gradio UI (browser interface)
Create a Python script (e.g., gradio_rerank.py) and add the following code:
import gradio as gr
from sentence_transformers import CrossEncoder
model = CrossEncoder("zeroentropy/zerank-1-small", trust_remote_code=True)
def rerank(query, document):
    score = model.predict([(query, document)])[0]
    return f"Relevance Score: {score:.4f}"

iface = gr.Interface(
    fn=rerank,
    inputs=["text", "text"],
    outputs="text",
    title="ZeroEntropy Zerank-1-Small Reranker",
    description="Enter a query and a document to get their relevance score."
)

iface.launch()
gradio_rerank.py
This script runs a Gradio-based web interface on the VM, providing a browser-based GUI (Graphical User Interface) where you can enter queries and documents, click submit, and instantly see the relevance score. We used this to make the reranker more user-friendly and accessible, especially for non-technical users or demos, where people can interact via the browser without writing any code. With SSH port forwarding, we could even access it securely on a local machine from the VM.
Step 25: Run the script
Run your Gradio script:
python3 gradio_rerank.py
You will see:
* Running on local URL: http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 7860:localhost:7860 -p 32153 root@209.137.198.14
This forwards your local localhost:7860 to 127.0.0.1:7860 on the remote VM.
Step 26: Open the Gradio Web Interface
After you’ve forwarded the port and launched the script, open your browser and go to:
http://localhost:7860
You should see the Gradio web UI titled:
ZeroEntropy Zerank-1-Small Reranker
This is your interactive playground to chat with the Zerank-1-Small model.
Step 27: Enter test data and check output
Example:
- Query: Who discovered gravity?
- Document: Isaac Newton discovered gravity after observing a falling apple.
Click Submit, and you should see:
Relevance Score: 0.9139
Step 28: Install FastAPI
Run the following command to install FastAPI:
pip install fastapi uvicorn
Step 29: FastAPI Service (REST API)
Create a Python script (e.g., fastapi_rerank.py) and add the following code:
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import CrossEncoder
import uvicorn
app = FastAPI()
model = CrossEncoder("zeroentropy/zerank-1-small", trust_remote_code=True)
class RerankRequest(BaseModel):
    query: str
    document: str

@app.post("/rerank")
def rerank(request: RerankRequest):
    score = model.predict([(request.query, request.document)])[0]
    # Cast to a plain Python float so the score serializes cleanly to JSON
    return {"score": float(score)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
fastapi_rerank.py
This script sets up a FastAPI-based REST API that exposes the reranker as an HTTP service with a /rerank endpoint. We used this script to turn the reranker into an API service that can be called programmatically by other applications, scripts, or tools. It's the ideal setup for integrating the reranker into pipelines, web services, or larger systems, and it comes with automatic Swagger documentation at /docs for easy testing and exploration.
Step 30: Run FastAPI server on the VM
On your VM, run the following command:
python3 fastapi_rerank.py
You should see:
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
This means your API server is live inside the VM on port 8000.
Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 8000:localhost:8000 -p 32153 root@209.137.198.14
This command:
- Forwards your local port 8000 → VM port 8000
- Allows you to access the FastAPI server running in the VM from your local machine
Access the FastAPI API from your local machine
Open a browser or use curl:
http://localhost:8000/docs
This opens the FastAPI Swagger UI, where you can:
- Test the /rerank endpoint
- Send POST requests
- See live JSON responses
Test using Python or curl
Example curl:
curl -X POST "http://localhost:8000/rerank" \
-H "Content-Type: application/json" \
-d '{"query": "Who discovered gravity?", "document": "Isaac Newton discovered gravity after observing a falling apple."}'
The scripts progressed from basic testing to interactive use, batch processing, a GUI, and finally an API, giving you a complete, flexible stack to use the reranker however you need.
Conclusion
In this guide, we walked through the complete process of installing, configuring, and running the ZeroEntropy Zerank-1-Small reranker model on a GPU virtual machine, using a variety of interfaces — from simple Python scripts and command-line tools to browser-based Gradio apps and full-fledged FastAPI services. Each script served a clear purpose: whether it was for quick testing, hands-on exploration, batch reranking, or integrating the model into larger systems, we covered it all.
By the end, you don’t just have a model running — you have a practical, flexible reranking toolkit that can be adapted for developers, researchers, and even non-technical users. With open-source access and the freedom to scale across industries like finance, legal, STEM, and medical, Zerank-1-Small puts high-quality, transparent reranking power right at your fingertips — no black boxes, no vendor lock-in, just straightforward, efficient search improvement.