MarkItDown is an open-source tool developed by Microsoft, designed to convert various file formats into Markdown for seamless use in tasks like indexing, text analysis, and documentation. It supports a wide range of formats, including PDFs, PowerPoint presentations, Word documents, Excel sheets, images (via EXIF metadata and OCR), audio files (through EXIF metadata and transcription), HTML, text-based formats like CSV, JSON, and XML, and ZIP files by processing their contents. Released under the MIT License, it welcomes contributions from open-source enthusiasts.
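As a rough illustration of the format list above, the following sketch filters file names by extension before handing them to a converter. The extension set is an assumption made for this example; it is not read from MarkItDown itself, and the helper name looks_convertible is hypothetical.

```python
from pathlib import Path

# Assumption: an illustrative mapping of the formats named above to common
# file extensions. This set is NOT taken from MarkItDown's own code.
ASSUMED_EXTENSIONS = {
    ".pdf", ".pptx", ".docx", ".xlsx",
    ".jpg", ".jpeg", ".png",
    ".wav", ".mp3",
    ".html", ".htm", ".csv", ".json", ".xml", ".txt",
    ".zip",
}


def looks_convertible(path: str) -> bool:
    """Return True if the file extension is in the assumed list (case-insensitive)."""
    return Path(path).suffix.lower() in ASSUMED_EXTENSIONS


# Hypothetical file names, used only to show the check.
for name in ["report.PDF", "slides.pptx", "notes.md", "archive.zip"]:
    print(name, looks_convertible(name))
```

A check like this is only a pre-filter; whether a given file actually converts depends on the installed MarkItDown version.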
Resource
Link: https://github.com/microsoft/markitdown
Prerequisites for GPU and CPU VMs
Minimum requirements:
For GPU VM
- GPUs: 1x RTX A6000 (for smooth execution).
- Disk Space: 50GB free.
- RAM: 40 GB.
- CPU: 24 Cores
For CPU VM
- Disk Space: 100GB free.
- RAM: 16 GB.
- CPU: 4 CPUs
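Once a VM is running, the minimum requirements can be checked from Python. This is a minimal sketch using only the standard library; the function name check_resources is hypothetical, the thresholds in the usage line are the CPU VM values listed above, and the RAM lookup assumes a Unix-like system (it reports None elsewhere).

```python
import os
import shutil

GB = 10**9  # decimal gigabytes, so a "16 GB" VM is not rejected for rounding


def check_resources(min_disk_gb: float, min_ram_gb: float, min_cpus: int,
                    path: str = ".") -> dict:
    """Compare free disk, total RAM, and CPU count against minimum values."""
    free_disk_gb = shutil.disk_usage(path).free / GB
    cpus = os.cpu_count() or 0
    try:
        # Total physical memory; available on Linux and most Unix systems.
        ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB
    except (ValueError, OSError, AttributeError):
        ram_gb = None  # could not be determined on this platform
    return {
        "disk_ok": free_disk_gb >= min_disk_gb,
        "ram_ok": None if ram_gb is None else ram_gb >= min_ram_gb,
        "cpu_ok": cpus >= min_cpus,
    }


# CPU VM minimums from the list above: 100 GB disk, 16 GB RAM, 4 CPUs.
print(check_resources(min_disk_gb=100, min_ram_gb=16, min_cpus=4))
```

For the GPU VM, the same call with 50, 40, and 24 covers the disk, RAM, and CPU values; the GPU itself is checked separately with nvidia-smi.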
Step-by-Step process to Install Microsoft MarkItDown Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift, as it provides the optimal configuration to achieve the fastest performance while running Microsoft MarkItDown. NodeShift offers affordable Virtual Machines that meet stringent compliance standards, including GDPR, SOC2, and ISO27001, ensuring data security and privacy.
However, if you prefer to use a CPU-powered Virtual Machine, you can still follow this guide. MarkItDown works on CPU-based VMs as well, though performance may be slower compared to a GPU setup. The installation process remains largely the same, allowing you to achieve similar functionality on a CPU-powered machine. NodeShift’s infrastructure is versatile, enabling you to choose between GPU or CPU configurations based on your specific needs and budget.
Let’s dive into the setup and installation steps to get Microsoft MarkItDown running efficiently on your chosen virtual machine.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side and select the GPU Nodes option. In the Dashboard, click the Create GPU Node button to create your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs, and choose the geographical region where you want to launch your Virtual Machine.
We will use 1x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
Next, you will need to choose an image for your Virtual Machine. We will deploy the Microsoft MarkItDown tool on a Jupyter Virtual Machine. Jupyter is an open-source platform that lets you install and run MarkItDown on your GPU node from a notebook. Working in a Jupyter Notebook avoids most terminal work, which simplifies the process and reduces setup time, so the tool can be configured in a few steps.
Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to Jupyter Notebook
Once your GPU VM deployment has been created and has reached the ‘RUNNING’ status, navigate to the page of your GPU Deployment Instance and click the ‘Connect’ button in the top right corner.
After clicking the ‘Connect’ button, you can view the Jupyter Notebook.
Now open a Python 3 (ipykernel) Notebook.
Next, if you want to check the GPU details, run the following command in a Jupyter Notebook cell:
!nvidia-smi
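On a CPU-only VM the nvidia-smi command will not be found. The sketch below wraps the same check in Python so a notebook can continue either way; the function name gpu_summary is hypothetical, and it relies on the nvidia-smi -L option, which lists the detected GPUs.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU list reported by nvidia-smi, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools not installed, e.g. on a CPU VM
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if proc.returncode != 0:
        return None  # tool present but no usable GPU or driver
    return proc.stdout.strip()


summary = gpu_summary()
print(summary if summary else "No NVIDIA GPU detected; continuing on CPU.")
```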
Step 8: Install MarkItDown
Run the following command in a notebook cell to install MarkItDown and the OpenAI Python client into the notebook's environment:
%pip install markitdown openai
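After the install finishes, it can be useful to confirm that both packages are visible to the running kernel. This is a small sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("markitdown", "openai"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

If a package shows as not installed right after installing it, restarting the kernel and running the check again is a reasonable first step.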
Step 9: Create OpenAI API Key
To use the OpenAI API, you need to create an API key. This key will allow you to securely access OpenAI’s services. Follow these steps to generate your API key:
Visit the OpenAI platform and log in to your account. If you do not have an account, you will need to sign up.
Once logged in, navigate to the top right corner of the page where your profile icon is located. Click on it and select API from the dropdown menu. Alternatively, you can directly access the API section by clicking on API in the main dashboard.
In the API section, look for an option that says Create new secret key or View API Key. Click on this option.
After clicking on create, a new API key will be generated for you. Make sure to copy this key immediately as it will only be shown once.
Step 10: Export OpenAI API Key
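Run the following code in a notebook cell to make the OpenAI API key available to the Python kernel. A shell export run with ! only affects a temporary subshell and does not persist, so the key is set from Python instead:
import os
os.environ["OPENAI_API_KEY"] = "your api key"  # replace with your own key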
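Because a shell export issued through ! runs in a temporary subshell, it is worth confirming that the key is actually visible to the Python kernel before continuing. The sketch below reads the variable and prints a masked form so the full key never appears in the notebook output; the helper names mask_key and describe_key are hypothetical.

```python
import os


def mask_key(key: str) -> str:
    """Hide all but the first 3 and last 4 characters of a secret."""
    if len(key) <= 8:
        return "*" * len(key)
    return key[:3] + "*" * (len(key) - 7) + key[-4:]


def describe_key(name: str = "OPENAI_API_KEY") -> str:
    """Report whether the environment variable is set in this kernel."""
    key = os.environ.get(name, "")
    if not key:
        return f"{name} is not set in this kernel"
    return f"{name} is set: {mask_key(key)}"


print(describe_key())
```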
Step 11: Write and Run the Code
Write the following code in a Jupyter Notebook cell and run it:
from markitdown import MarkItDown
# Initialize the MarkItDown object
markitdown = MarkItDown()
# Convert the PDF file
result = markitdown.convert("Ayush-Kumar Resume.pdf")
print(result)
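The conversion call above does not pass the API key anywhere. According to the MarkItDown README, an OpenAI client can be supplied through the llm_client and llm_model arguments, which the library uses to generate descriptions for images. The sketch below assumes that interface; the helper name build_converter, the model name "gpt-4o", and the file name example.jpg are illustrative, and argument names may differ between MarkItDown versions.

```python
import os


def build_converter():
    """Create a MarkItDown instance, attaching an OpenAI client when a key is set."""
    # Imported lazily so this cell can be defined before the packages are installed.
    from markitdown import MarkItDown

    if os.environ.get("OPENAI_API_KEY"):
        from openai import OpenAI

        # OpenAI() reads OPENAI_API_KEY from the environment.
        # llm_client / llm_model follow the README; "gpt-4o" is an example model name.
        return MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
    return MarkItDown()  # plain conversion without LLM features


# Usage (hypothetical image file):
# converter = build_converter()
# print(converter.convert("example.jpg").text_content)
```

For a text-based PDF such as the resume in this step, the plain MarkItDown() instance is sufficient and no API call is made.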
Step 12: Print Result
Then, run the following code to print the converted Markdown text:
# Access the text content of the converted document
print(result.text_content)
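Printing is enough for a quick look, but for indexing or documentation the Markdown usually needs to be kept. This is a minimal sketch that writes the text to a .md file named after the source document; the helper name save_markdown and the out_dir parameter are hypothetical.

```python
from pathlib import Path


def save_markdown(text: str, source_path: str, out_dir: str = ".") -> Path:
    """Write converted text to <out_dir>/<source file name without extension>.md."""
    out_path = Path(out_dir) / (Path(source_path).stem + ".md")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# In the notebook, after the conversion above:
# save_markdown(result.text_content, "Ayush-Kumar Resume.pdf")
```

With the file name used in this tutorial, the output would be written to "Ayush-Kumar Resume.md" in the current directory.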
Conclusion
In this guide, we introduced MarkItDown, the open-source tool developed by Microsoft to convert various file formats into Markdown for tasks like indexing, text analysis, and documentation. We walked through a step-by-step process to install and set up MarkItDown on a NodeShift virtual machine (VM). Having followed these steps, you have installed the required software, configured the essential tools, and converted your first file into Markdown using MarkItDown.