Granite-3.2-8B-Instruct is an advanced 8-billion-parameter language model designed for long-context reasoning, instruction following, and multi-turn dialogue. Built on the foundation of Granite-3.1-8B-Instruct, it has been fine-tuned with high-quality open-source datasets and synthetic data, ensuring enhanced logical reasoning, structured text generation, and multilingual capabilities. Supporting 12 languages, including English, German, Spanish, French, Arabic, and Chinese, the model is well-suited for tasks such as summarization, retrieval-augmented generation (RAG), text classification, and function calling. Optimized for enterprise applications, research, and AI-driven assistants, Granite-3.2-8B-Instruct delivers precise, structured, and contextually aware responses.
Models | ArenaHard | Alpaca-Eval-2 | MMLU | PopQA | TruthfulQA | BigBenchHard | DROP | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 72.66 | 61.48 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 65.71 | 44.46 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 70.40 | 54.71 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 65.04 | 42.76 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
Granite-3.1-8B-Instruct | 37.58 | 30.34 | 66.77 | 28.7 | 65.84 | 68.55 | 50.78 | 79.15 | 89.63 | 85.79 | 73.20 | 85.73 |
Granite-3.1-2B-Instruct | 23.3 | 27.17 | 57.11 | 20.55 | 59.79 | 54.46 | 18.68 | 67.55 | 79.45 | 75.26 | 63.59 | 84.7 |
Granite-3.2-2B-Instruct | 24.86 | 34.51 | 57.18 | 20.56 | 59.8 | 52.27 | 21.12 | 67.02 | 80.13 | 73.39 | 61.55 | 83.23 |
Granite-3.2-8B-Instruct | 55.25 | 61.19 | 66.79 | 28.04 | 66.92 | 64.77 | 50.95 | 81.65 | 89.35 | 85.72 | 74.31 | 85.42 |
Prerequisites for Installing Granite-3.2-8B-Instruct Model Locally
Ensure you have the following setup before running the model (a quick verification snippet is shown right after this list):
- Ubuntu 22.04+ or Debian-based OS (for GPU VM)
- Python 3.10+
- NVIDIA GPU: A100 80GB, H100 80GB, or RTX A6000 (an RTX A6000 is enough for smooth execution)
- Disk Space: 50 GB free.
- RAM: At least 24 GB.
- CPU: 24 Cores
- CUDA
- PyTorch
- Transformers
- Jupyter Notebook installed and running
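Before moving on, you can sanity-check these prerequisites with a small script. This is just a convenience sketch that uses only the Python standard library (so it works even before the deep-learning packages are installed); the thresholds mirror the list above:
import shutil
import subprocess
import sys

# Python version (3.10+ recommended)
print("Python:", sys.version.split()[0])

# Free disk space on the current filesystem (50 GB+ recommended)
total, used, free = shutil.disk_usage(".")
print(f"Free disk: {free / 1e9:.1f} GB")

# GPU visibility via the NVIDIA driver (requires nvidia-smi on the PATH)
try:
    gpus = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print("GPU(s):", gpus or "none detected")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available - check the NVIDIA driver/CUDA installation")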
Model Resource
Hugging Face
Link: https://huggingface.co/ibm-granite/granite-3.2-8b-instruct
Ollama
Link: https://ollama.com/library/granite3.2:8b
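The from_pretrained call in Step 9 downloads the weights automatically on first use, but if you prefer to pre-fetch them (for example, into a shared cache), a quick sketch using the Hugging Face Hub client looks like this:
from huggingface_hub import snapshot_download

# Download all model files into the local Hugging Face cache
# (roughly 16 GB of weights for the 8B model in bfloat16 safetensors)
local_dir = snapshot_download("ibm-granite/granite-3.2-8b-instruct")
print("Model files cached at:", local_dir)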
Step-by-Step Process to Install Granite-3.2-8B-Instruct Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option in the Dashboard, and click the Create GPU Node button to create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1x RTX A6000 GPU for this tutorial, which offers a good balance of speed and cost. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements (in bfloat16, the 8B model needs roughly 16 GB of VRAM for the weights alone).
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
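For reference, on most Linux and macOS systems a key pair can be generated with the standard OpenSSH tool (the exact upload steps are covered in the documentation linked above):
ssh-keygen -t ed25519 -C "your_email@example.com"
The public key (the .pub file) is what you add to NodeShift; the private key stays on your machine.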
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the Granite-3.2-8B-Instruct Thinking Model on a Jupyter Virtual Machine. This open-source platform lets you install and run the model on your GPU node. By working in a Jupyter Notebook, we avoid the terminal, which simplifies the process and reduces setup time, so you can configure the model in just a few steps and a few minutes.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ Button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
Step 8: Install Dependencies in Jupyter Notebook
Run the following commands in Jupyter Notebook to install dependencies:
!pip install torch torchvision torchaudio accelerate transformers
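Once the installation finishes, an optional quick check that PyTorch can actually see the GPU:
import torch

# Confirm PyTorch detects the CUDA device
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")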
Step 9: Load the Model and Tokenizer
Run the following Python script to load the model and tokenizer:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Model Path
model_path = "ibm-granite/granite-3.2-8b-instruct"
# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Load the model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,          # Place the model on the detected device
    torch_dtype=torch.bfloat16, # Use bfloat16 for optimized performance
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
print("Model and tokenizer loaded successfully!")
Expected Output:
Using device: cuda
Model and tokenizer loaded successfully!
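Optionally, you can check how much memory the loaded weights occupy. get_memory_footprint is a standard Transformers helper; the number is approximate and excludes activations and the KV cache:
# Approximate size of the loaded weights (about 16 GB for 8B parameters in bfloat16)
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")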
Step 10: Run Inference (Chat Example)
Now, test the model with a simple instruction.
Example 1
def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]

    # Apply chat template
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        thinking=True,
        return_dict=True,
        add_generation_prompt=True
    ).to(device)

    # Generate response
    with torch.no_grad():
        output = model.generate(**input_ids, max_new_tokens=512)

    response = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
    return response
# Example Query
prompt = "Explain the theory of relativity in simple terms."
response = generate_response(prompt)
print(response)
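The thinking=True flag passed to apply_chat_template switches the model into its reasoning mode, so the output contains an intermediate thought process before the final answer. If you prefer shorter, direct answers, a small variant of the same function with the flag turned off (a sketch; everything else is unchanged) is:
def generate_direct_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    # Same call as above, but with the reasoning trace disabled
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        thinking=False,
        return_dict=True,
        add_generation_prompt=True
    ).to(device)
    with torch.no_grad():
        output = model.generate(**input_ids, max_new_tokens=512)
    return tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)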
Example 2
Example 1 already defines generate_response, so you can simply reuse it for a new query:
# Example Query (reuses the generate_response function defined above)
prompt = "If all humans are mortal and Socrates is a human, what can you conclude?"
response = generate_response(prompt)
print(response)
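By default, model.generate decodes greedily. If you want more varied answers, you can pass the standard Transformers sampling arguments; the values below are only illustrative:
def generate_sampled_response(prompt, temperature=0.7, top_p=0.9):
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, return_tensors="pt", thinking=True,
        return_dict=True, add_generation_prompt=True,
    ).to(device)
    with torch.no_grad():
        output = model.generate(
            **input_ids,
            max_new_tokens=512,
            do_sample=True,            # sample instead of greedy decoding
            temperature=temperature,   # lower = more deterministic
            top_p=top_p,               # nucleus sampling
        )
    return tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)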
You’re All Set!
✅ Model is loaded
✅ Tokenizer is initialized
✅ Inference function is defined
✅ You can now play with the Granite-3.2-8B-Instruct Thinking Model.
Step 11: Install Gradio
Ensure dependencies are installed
If you haven’t installed Gradio and Transformers, run:
!pip install torch transformers gradio
Step 12: Run the Gradio Chatbot
Run the following Python script to start the chatbot:
import gradio as gr

def chat_with_granite(message, history):
    # gr.ChatInterface manages the chat history for us; we only need to
    # return the model's reply to the latest message
    return generate_response(message)

# Gradio Chatbot UI
chatbot = gr.ChatInterface(
    fn=chat_with_granite,
    title="Granite-3.2-8B AI Assistant",
    description="An advanced instruction-following AI assistant powered by IBM Granite-3.2-8B.",
)

# Launch Gradio UI (share=True also exposes a temporary public link)
chatbot.launch(share=True)
How This Works
- The script reuses the Granite-3.2-8B-Instruct Thinking Model, tokenizer, and generate_response function loaded earlier.
- Uses Gradio's ChatInterface to create an interactive chatbot and keep track of the chat history.
- Generates a response with the model for each message you send.
- Runs a Gradio Web UI where you can interact with the model.
Expected Output
After running this script, it will output a Gradio link, like:
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://xyz.gradio.app
You can click the public URL to chat with the model!
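If you would like replies to appear token by token instead of all at once, gr.ChatInterface also accepts a generator function. Here is a sketch using Transformers' TextIteratorStreamer with the same model and tokenizer loaded earlier:
from threading import Thread

import gradio as gr
from transformers import TextIteratorStreamer

def stream_granite(message, history):
    messages = [{"role": "user", "content": message}]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", thinking=True,
        return_dict=True, add_generation_prompt=True,
    ).to(device)

    # The streamer yields decoded text chunks as they are generated
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    thread = Thread(target=model.generate,
                    kwargs=dict(**inputs, max_new_tokens=512, streamer=streamer))
    thread.start()

    partial = ""
    for chunk in streamer:
        partial += chunk
        yield partial  # ChatInterface updates the chat bubble on each yield

gr.ChatInterface(fn=stream_granite, title="Granite-3.2-8B (streaming)").launch(share=True)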
Step 13: Access Chatbot
Access the Chatbot on:
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://xyz.gradio.app
You can also access the chatbot directly from within the Jupyter Notebook.
Note: This is a step-by-step guide for interacting with your model. It covers the first method for installing the Granite-3.2-8B-Instruct Thinking Model locally, using Jupyter Notebook and Transformers.
Option 2: Using Ollama (Terminal)
- Install Ollama: Download and install the Ollama tool from the official site.
- Pull the Model: Run the following command to download the desired model:
ollama pull granite3.2:8b
- Run the Model: Start the model in the terminal (a sketch for calling it programmatically is shown after this step):
ollama run granite3.2:8b
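Ollama also exposes a local HTTP API (port 11434 by default) while the model is running, so you can call it programmatically. A minimal sketch using the requests library:
import requests

# Chat request against the local Ollama server (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "granite3.2:8b",
        "messages": [{"role": "user", "content": "Summarize what RAG is in two sentences."}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])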
Option 3: Using Open WebUI
- Set Up Open WebUI:
Follow our Open WebUI Setup Guide to configure the interface. Ensure all dependencies are installed and the environment is correctly set up.
- Refresh the Interface:
Confirm that Granite-3.2-8B-Instruct has been downloaded and is visible in the list of available models in Open WebUI.
- Select Your Model:
Choose the Granite-3.2-8B-Instruct model from the list (a smaller Granite-3.2-2B-Instruct variant is also available if you need a lighter model).
- Start Interaction:
Begin using the model by entering your queries in the interface.
Conclusion
Granite-3.2-8B-Instruct is a highly capable language model designed for handling long-context reasoning, structured text generation, and instruction-based interactions. With its advanced architecture and multilingual support, it excels in tasks such as summarization, retrieval-augmented generation, classification, and function calling.
By following this guide, users can easily set up and run the model using Jupyter Notebook, Open WebUI, or Ollama, ensuring flexibility across different platforms. Whether deployed for enterprise solutions, research, or automated assistants, Granite-3.2-8B-Instruct delivers precise and context-aware responses, making it a powerful tool for complex language processing tasks.