Aya Vision 8B is a powerful multilingual vision-language model designed to handle image and text-based tasks with high accuracy. With 8 billion parameters, it excels in image captioning, optical character recognition (OCR), visual reasoning, summarization, and question answering across 23 languages, including English, French, Spanish, German, Chinese, Arabic, and Hindi. The model efficiently processes both images and text using a SigLIP2 vision encoder paired with the C4AI Command R7B language model, ensuring seamless integration of visual and textual data. Its ability to handle 16K tokens makes it suitable for long-form content generation and in-depth analysis. Aya Vision 8B is optimized for scene understanding, document processing, multilingual transcription, and AI-driven research, providing structured and context-aware responses for a wide range of applications.
Model Resource
Hugging Face
Link: https://huggingface.co/CohereForAI/aya-vision-8b
Prerequisites for Installing Aya Vision 8B Model Locally
- GPU:
- Memory (VRAM):
- Minimum: 16GB (with 8-bit or 4-bit quantization).
- Recommended: 24GB for smoother execution.
- Optimal: 48GB for full performance at FP16 precision.
- Type: NVIDIA GPUs with Tensor Cores (e.g., RTX 4090, A6000, A100, H100).
- Disk Space:
- Minimum: 40GB free SSD storage.
- Recommended: 100GB SSD for storing additional checkpoints, logs, and datasets.
- RAM:
- Minimum: 24GB.
- Recommended: 48GB for smoother operation, especially with large datasets.
- CPU:
- Minimum: 16 cores.
- Recommended: 24-48 cores for fast data preprocessing and I/O operations.
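The VRAM figures above follow from simple arithmetic on the parameter count: each FP16 weight takes 2 bytes, an 8-bit quantized weight 1 byte, and a 4-bit weight half a byte. A minimal sketch of the weights-only estimate (activations, the KV cache, and framework overhead add several more GB in practice, which is why the practical minimums above are higher):

```python
# Rough VRAM needed for the model weights alone. This ignores activations,
# the KV cache, and framework overhead, which add several GB in practice.
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

print(f"FP16 : {weight_vram_gb(8, 2.0):.1f} GB")  # ~14.9 GB
print(f"8-bit: {weight_vram_gb(8, 1.0):.1f} GB")  # ~7.5 GB
print(f"4-bit: {weight_vram_gb(8, 0.5):.1f} GB")  # ~3.7 GB
```

This is why a 16GB card only fits the model comfortably with 8-bit or 4-bit quantization, while FP16 wants 24GB or more of headroom.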
Step-by-Step Process to Install Aya Vision 8B Model Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Access model from Hugging Face
Link: https://huggingface.co/CohereForAI/aya-vision-8b
You need to agree to share your contact information to access this model. Fill in the mandatory details, such as your name and email, and then wait for approval from Cohere on Hugging Face to gain access and use the model.
You will be granted access to this model within an hour, provided you have filled in all the details correctly.
Step 2: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 3: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to deploy your first Virtual Machine.
Step 4: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 5: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 6: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy Aya Vision 8B on an NVIDIA CUDA Virtual Machine. This parallel computing platform provides the drivers and toolkit you need to install and run Aya Vision 8B on your GPU Node.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 8: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 9: Install Required Dependencies
Run the following command to install required dependencies and libraries:
sudo apt update && sudo apt upgrade -y
sudo apt install -y git python3 python3-pip python3-venv libsndfile1 ffmpeg libgl1-mesa-glx libglib2.0-0
Step 10: Set Up a Python Virtual Environment
Run the following commands to set up a Python virtual environment:
# Create virtual environment
python3 -m venv aya_env
source aya_env/bin/activate
Step 11: Install Python Dependencies
Run the following commands to install the Python dependencies:
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate diffusers huggingface_hub
Step 12: Login Using Your Hugging Face API Token
Use the huggingface_hub CLI to log in directly from the terminal. Run the following command:
huggingface-cli login
Then paste your token and press Enter. Note that the token input is hidden for security, so nothing will appear on screen as you paste it; press Enter once to submit.
After entering the token, you will see the following output:
Login Successful.
The current active token is (your_token_name).
Check the screenshot below for reference.
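If you prefer not to paste the token interactively, you can set it as an environment variable instead; both transformers and huggingface_hub read HF_TOKEN automatically, and huggingface-cli login also accepts it via its --token flag. A sketch (hf_xxxxxxxx is a placeholder for your real token):

```shell
# Placeholder token — replace hf_xxxxxxxx with your real token.
# transformers and huggingface_hub pick this variable up automatically;
# alternatively, run: huggingface-cli login --token "$HF_TOKEN"
export HF_TOKEN=hf_xxxxxxxx
```

Avoid hard-coding the token in scripts you might commit to a repository.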
How to Generate a Hugging Face Token
- Create an Account: Go to the Hugging Face website and sign up for an account if you don’t already have one.
- Access Settings: After logging in, click on your profile photo in the top right corner and select “Settings.”
- Navigate to Access Tokens: In the settings menu, find and click on the “Access Tokens” tab.
- Generate a New Token: Click the “New token” button, provide a name for your token, and choose a role (either read or write).
- Generate and Copy Token: Click the “Generate a token” button. Your new token will appear; click “Show” to view it and copy it for use in your applications.
- Secure Your Token: Ensure you keep your token secure and do not expose it in public code repositories.
Step 13: Create a Python Script
Next, connect your remote GPU server to VS Code and create a test.py file for running the Aya Vision 8B model. Follow these steps:
Install VS Code Extensions
On your local machine, open VS Code and install:
- Remote – SSH extension
- Python extension
Steps:
- Open VS Code.
- Click on Extensions (Ctrl + Shift + X).
- Search for “Remote – SSH” and install it.
- Search for “Python” and install it.
Connect VS Code to Your GPU Remote Server
Steps to Connect via SSH
- Open VS Code.
- Press Ctrl + Shift + P to open the command palette.
- Type “Remote-SSH: Connect to Host…” and select it.
- Enter your GPU server details:
ssh root@<YOUR_GPU_SERVER_IP>
Example:
ssh root@192.168.1.100
- Enter your password (or use your SSH key if set up).
- Now you are inside your remote GPU server via VS Code!
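Typing the full ssh command each time gets tedious. An entry in ~/.ssh/config lets both your terminal and the Remote-SSH extension connect by name; this is a sketch, and the host alias, IP address, and key path below are placeholders you should replace with your own values:

```
# ~/.ssh/config — hypothetical example; substitute your own values.
Host nodeshift-gpu
    HostName 192.168.1.100
    User root
    IdentityFile ~/.ssh/id_ed25519
```

With this in place, `ssh nodeshift-gpu` connects directly, and the alias also appears in the host list of VS Code's “Remote-SSH: Connect to Host…” command.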
Create test.py File in VS Code
Now, in VS Code, inside your remote connection:
- Open the File Explorer in VS Code.
- Navigate to your remote GPU directory (~/aya_env).
- Create a new file named test.py.
- Copy and paste the following test code inside test.py:
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Load the Aya Vision 8B model
model_id = "CohereForAI/aya-vision-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Define an image URL and a question for testing
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
            {"type": "text", "text": "What is written in the image?"},
        ],
    }
]

# Prepare model inputs with the chat template
inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# Generate a response
gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

# Decode and print only the newly generated tokens
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
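To query your own images, you can swap the URL in the message for a different image reference; based on transformers' image handling, the "url" field should also accept a local file path, though that is an assumption worth verifying on your setup. The helper below is our own convenience function, not part of the library — it just builds the message structure apply_chat_template expects:

```python
# Hypothetical helper (not part of transformers): builds a single-turn
# message in the structure that apply_chat_template expects.
def build_vision_message(image_ref: str, question: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_ref},  # URL or local file path
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_vision_message("my_photo.jpg", "Describe this image in French.")
print(messages[0]["content"][1]["text"])  # Describe this image in French.
```

Since Aya Vision 8B covers 23 languages, the question text itself can be written in (or ask for a response in) any supported language.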
Step 14: Run the Script
Now, run the script in the terminal:
python3 test.py
Output:
Conclusion
Aya Vision 8B is a highly capable vision-language model designed to process and analyze both images and text with precision. With its multilingual support across 23 languages and robust visual reasoning abilities, it excels in tasks such as image captioning, document processing, and question answering. This guide provided a step-by-step approach to setting up the model on a GPU-powered virtual machine, ensuring optimal performance for users working with structured visual data. By following this installation process, researchers, developers, and content creators can seamlessly integrate Aya Vision 8B into their workflows, enhancing automation and efficiency in various vision-language applications.