SmolDocling is a compact yet powerful document understanding model designed to turn complex document images, such as academic papers, certificates, invoices, and printed forms, into clean, structured, readable text. Built on top of the IDEFICS3 architecture and trained on real-world scientific and non-scientific documents, SmolDocling doesn't just extract text; it understands layout, structure, and context using a markup format called DocTags. Whether it's identifying tables, code blocks, formulas, or section headers, the model handles everything in one pass with surprising accuracy and speed, averaging just 0.35 seconds per page on an A100 GPU.

It's lightweight enough to run on everyday GPUs (like a T4 or RTX 3060), yet capable enough to handle real-world complexity without breaking a sweat. SmolDocling is fully open source, integrates seamlessly with the Docling ecosystem, and supports exporting results to Markdown, HTML, JSON, and more, making it an ideal choice for developers, researchers, and teams who want fast, reliable, layout-aware document conversion without the bloat.
Resources
Hugging Face
Link: https://huggingface.co/ds4sd/SmolDocling-256M-preview
GitHub
Link: https://github.com/ds4sd/docling.git
Recommended GPU Setup
| Component | Minimum Setup | Recommended Setup |
|---|---|---|
| GPU | 1× NVIDIA T4 (16 GB) | 1× A100 or H100 (40–80 GB) |
| vCPU | 4+ cores | 8+ cores |
| RAM | 16 GB | 32 GB+ |
| Disk | 20 GB+ | 40 GB+ |
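As a sanity check on these requirements: SmolDocling has roughly 256 million parameters, so the weights alone take up very little memory. A quick back-of-envelope sketch (illustrative only; it ignores activations, image tokens, and CUDA overhead):

# Rough VRAM estimate for the model weights alone (illustrative)
params = 256e6  # ~256M parameters, per the model name
print(f"fp32 weights: {params * 4 / 1e9:.2f} GB")  # 4 bytes/param, ~1.02 GB
print(f"fp16 weights: {params * 2 / 1e9:.2f} GB")  # 2 bytes/param, ~0.51 GB

Even with generation-time overhead, this fits comfortably on a 16 GB T4; the larger cards mainly buy you throughput and batch size.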
Step-by-Step Process to Install SmolDocling-256M-Preview Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy SmolDocling-256M-Preview on an NVIDIA CUDA Virtual Machine; CUDA, NVIDIA's proprietary parallel computing platform, will allow you to run SmolDocling-256M-Preview on your GPU Node.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now, open your terminal and paste the proxy SSH or direct SSH command provided to connect to the VM.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
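If you only need the GPU model and total memory rather than the full table, nvidia-smi also supports a query mode:
nvidia-smi --query-gpu=name,memory.total --format=csv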
Step 8: Update the System
Run the following command to update the system:
sudo apt update && sudo apt upgrade -y
Step 9: Check the Available Python Version and Install a Newer Version
Run the following command to check the Python version available by default:
python3 --version
The system ships with Python 3.8 by default. To install a higher version of Python, you'll need to use the deadsnakes PPA. Run the following commands to add it:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 10: Install Python 3.10
Now, run the following command to install Python 3.10 or another desired version:
sudo apt install python3.10 python3.10-venv python3.10-dev -y
Then, run the following command to verify the new installation (note that plain python3 will still point to the system default):
python3.10 --version
Step 11: Install Dependencies
Run the following command to install the dependencies:
sudo apt install build-essential libgl1 -y
Step 12: Install Pip for Python 3.10
Run the following command to install pip for Python 3.10:
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
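To confirm that pip is tied to Python 3.10, check its version:
python3.10 -m pip --version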
Step 13: Create Python Environment
Run the following commands to create and activate the Python environment:
python3.10 -m venv smoldocling-env
source smoldocling-env/bin/activate
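With the environment active, python should resolve to the interpreter inside the venv; you can confirm this with:
which python
python --version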
Step 14: Clone the Docling Repository
Run the following command to clone the docling repository:
git clone https://github.com/ds4sd/docling.git
cd docling
Step 15: Install Poetry
Run the following command to install Poetry:
curl -sSL https://install.python-poetry.org | python3.10 -
After installation, run the following command to add Poetry to your PATH:
export PATH="/root/.local/bin:$PATH"
Then, run the following command to check the Poetry version:
poetry --version
Step 16: Install Project Dependencies
Run the following command to install project dependencies:
poetry install
Then, run the following command to activate the Poetry shell:
poetry shell
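Note: newer Poetry releases (2.x) ship the shell command as a separate plugin. If poetry shell is unavailable in your installation, you can activate the project environment directly instead:
source $(poetry env info --path)/bin/activate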
Step 17: Install Transformers, PyTorch, and Pillow
Run the following command to install Transformers, PyTorch, and Pillow:
pip install transformers torch pillow
Step 18: Download a Sample PDF
Run the following command to download a sample PDF from arXiv:
wget https://arxiv.org/pdf/2307.06435 -O sample.pdf
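The inference script in the next step reads a page image (sample_page.png) rather than the PDF itself, so render the first page to PNG before continuing. One way to do this (our suggestion, not part of the original toolchain) is the pdf2image package, which wraps the poppler utilities:
python3.10 -m pip install pdf2image
sudo apt install -y poppler-utils
Then render the first page:
python3.10 - <<'EOF'
from pdf2image import convert_from_path

# Render only the first page of the downloaded PDF to a PNG
pages = convert_from_path("sample.pdf", dpi=144, first_page=1, last_page=1)
pages[0].save("sample_page.png")
print("Saved sample_page.png")
EOF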
Step 19: Create the Inference Script
Create the script file using cat:
cat > smoldocling_test.py
Then paste the following code:
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
# Load model and processor
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained("ds4sd/SmolDocling-256M-preview").to("cuda")
# Load image
image = Image.open("sample_page.png").convert("RGB")
# Prompt must include <image> token
prompt = "<image> Convert this page to docling."
# Preprocess inputs
raw_inputs = processor(text=prompt, images=image, return_tensors="pt", truncation=True).to("cuda")
# Keep only the keys generate() expects (the model is decoder-only)
allowed_keys = {"input_ids", "attention_mask", "pixel_values"}
inputs = {k: v for k, v in raw_inputs.items() if k in allowed_keys}
# Generate
outputs = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
text = processor.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)[0]
print("\n📄 Extracted Output:\n")
print(text)
Save and exit with Ctrl+D.
Step 20: Run Model and Generate Output
Execute the following command to run the model:
python3.10 smoldocling_test.py
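The intro mentioned exporting results to Markdown, HTML, and JSON. Once you have the DocTags string printed by the script, docling-core (installed as a dependency of the docling project) can load it into a DoclingDocument and export it. The sketch below follows the pattern shown on the SmolDocling model card; the exact names (DocTagsDocument.from_doctags_and_image_pairs, DoclingDocument.load_from_doctags) may differ across docling-core versions, so treat it as a starting point rather than a guaranteed API. Also note that the model card decodes with skip_special_tokens=False to preserve the full DocTags markup.
# Convert DocTags output to Markdown with docling-core (API names assumed
# from the model card; verify against your installed docling-core version)
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from PIL import Image

# The same page image the model saw
image = Image.open("sample_page.png").convert("RGB")
# Paste (or load) the DocTags string produced by smoldocling_test.py;
# the value below is just a placeholder, not real model output
doctags = "<doctag>...</doctag>"

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="sample")
print(doc.export_to_markdown())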
Conclusion
If you’ve followed along, you’ve just turned a scanned scientific paper into clean, structured text using a model that fits in your pocket but punches way above its weight. SmolDocling-256M-preview may be small in size, but it’s built to handle real-world document chaos — from equations and tables to layout and logic — all without requiring a monster GPU or a complicated setup. Whether you’re a researcher scraping data from journals, a dev automating invoice extraction, or just someone tired of messy OCR outputs, SmolDocling gives you exactly what you need: clarity, structure, and speed. And the best part? It’s open, fast, and runs beautifully on platforms like NodeShift Cloud. Give it a try, hook it into your next pipeline, or build a cool tool on top of it. The documents aren’t going to process themselves — but now, you’ve got a sidekick that can.