Key Takeaways
- How to convert a fine-tuned Qwen2-VL model to GGUF format.
- How to run the converted model on a CPU using llama.cpp.
Introduction
Previously, I showed you how to fine-tune a vision-language model (VLM) on a custom dataset. Now, let's explore how to deploy the fine-tuned model efficiently using the GGUF format and llama.cpp.
Why GGUF and llama.cpp?
- Universal Compatibility: llama.cpp is a CPU-first C++ library with minimal dependencies, which keeps deployment simple and makes it easy to embed in other languages and environments.
- Comprehensive Feature Integration: llama.cpp bundles the low-level pieces needed for local inference, such as quantization, tokenization, and sampling, giving you streamlined capabilities in one place, much as LangChain does at a higher level of abstraction.
- Focused Optimization: llama.cpp concentrates on a narrow family of model architectures and its own weight formats (GGML, and now GGUF), which enables targeted efficiency improvements such as aggressive quantization.
Step 1: Convert Qwen2-VL to GGUF Format
To convert the fine-tuned Qwen2-VL model into GGUF format, follow these steps:
- Clone the llama.cpp repository and switch to the appropriate branch:
git clone https://github.com/HimariO/llama.cpp.git
cd llama.cpp/
git switch qwen2-vl
- Modify the Makefile:
nano Makefile
Add llama-qwen2vl-cli to the list of build targets in the Makefile.
- Copy the fine-tuned model:
cp -r "OLD/PATH/TO/QWEN2VL-2B-INSTRUCT" "NEW/PATH/TO/QWEN2VL-2B-INSTRUCT"
- Disable CUDA and build for CPU-only inference:
cmake . -DGGML_CUDA=OFF
make -j$(nproc) # -j$(nproc) builds in parallel using all available CPU cores
- Run the vision-encoder surgery script, which extracts the vision encoder into its own GGUF file (this is the file passed to --mmproj in Step 2):
PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "PATH/TO/QWEN2VL-2B-INSTRUCT"
Note: Ensure the model path is correctly specified. If using the 2B model, use Qwen/Qwen2-VL-2B-Instruct; for the 7B model, use Qwen/Qwen2-VL-7B-Instruct.
- Convert the language model to GGUF format:
python3 convert_hf_to_gguf.py "PATH/TO/QWEN2VL-2B-INSTRUCT"
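Optionally, you can sanity-check the resulting file before moving on. The sketch below uses the GGUFReader class from the gguf-py package that ships with llama.cpp (also available via pip install gguf); the file path is an example based on the output used in Step 2, so adjust it to wherever the converter wrote your .gguf file.
from gguf import GGUFReader  # gguf-py ships with llama.cpp; also `pip install gguf`

# Example path; use the .gguf file that convert_hf_to_gguf.py produced for your model.
reader = GGUFReader("models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf")

# Print a few metadata keys (architecture, tokenizer settings, context length, ...).
for key in list(reader.fields)[:10]:
    print(key)

# Print the first few tensors with their shapes to confirm the weights were exported.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape)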
Step 2: Run Inference Using llama.cpp
After converting to GGUF format, you can run inference on the CPU using the following command:
./bin/llama-qwen2vl-cli \
-m models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf \
--mmproj qwen-qwen2-vl-2b-instruct-vision.gguf \
-p "From this image, return JSON with a list of players, each having a name and a score list. The golf score is calculated with 9 front and 9 back scores, with the 10th and 20th elements as sums of the previous scores. Example: [0, 1, 2, 1, 3, 2, 0, 0, 1, 11]" \
--image "data/train_data/score_card_2/IMG_9381.jpg" \
-ngl 33 -n 512
Tip: Resize the image to 640x640 for faster inference.
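A quick way to apply that tip is a small Pillow script (a sketch; Pillow is assumed to be installed, and the paths mirror the example command above):
from PIL import Image  # pip install pillow

# Downscale the score card before passing it to --image in the command above.
img = Image.open("data/train_data/score_card_2/IMG_9381.jpg")
img = img.resize((640, 640), Image.Resampling.LANCZOS)  # Pillow >= 9.1
img.save("data/train_data/score_card_2/IMG_9381_640.jpg")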
Conclusion
Converting a fine-tuned Qwen2-VL model into GGUF format and running it with llama.cpp enables efficient, CPU-based inference. The GGUF format ensures compatibility and performance optimization, while the streamlined llama.cpp library simplifies model deployment across platforms. By following this guide, you can prepare and run your models for a variety of inference tasks. For further exploration, consider experimenting with different quantization strategies and inference settings to balance performance and accuracy.
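As a starting point for such experiments, the sketch below wraps the CLI call from Step 2 with Python's subprocess module to run the converted model over a folder of images. The binary name, model paths, data folder, and (abbreviated) prompt are taken from the example above and are assumptions about your local layout.
import subprocess
from pathlib import Path

MODEL = "models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf"
MMPROJ = "qwen-qwen2-vl-2b-instruct-vision.gguf"
PROMPT = "From this image, return JSON with a list of players, each having a name and a score list."

# Score every image in the folder with the converted model and print the raw output.
for image in sorted(Path("data/train_data/score_card_2").glob("*.jpg")):
    result = subprocess.run(
        ["./bin/llama-qwen2vl-cli",
         "-m", MODEL,
         "--mmproj", MMPROJ,
         "-p", PROMPT,
         "--image", str(image),
         "-n", "512"],
        capture_output=True, text=True, check=True,
    )
    print(image.name, "->", result.stdout.strip())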