Mai Chi Bao

🎯 Run Qwen2-VL on CPU Using a GGUF Model & llama.cpp

Key Takeaways

  • How to convert a fine-tuned Qwen2-VL model to GGUF format.
  • How to run the converted model on a CPU using llama.cpp.

Introduction

Previously, I showed you how to fine-tune a vision-language model (VLM) on your custom dataset. Now, let's look at how to deploy the fine-tuned model efficiently using the GGUF format and llama.cpp.

Why GGUF and llama.cpp?

  • Universal Compatibility: llama.cpp is a CPU-first, dependency-light C++ library, which keeps build complexity low and makes it easy to embed in other programming environments.
  • Comprehensive Feature Integration: llama.cpp bundles the critical low-level pieces of inference (model loading, quantization, sampling) behind one interface, streamlining development much as LangChain does at a higher level.
  • Focused Optimization: llama.cpp began with a tight focus on the LLaMA family of architectures, which enabled targeted efficiency work and compact model formats such as GGML and its successor, GGUF.


Step 1: Convert Qwen2-VL to GGUF Format

To convert the fine-tuned Qwen2-VL model into GGUF format, follow these steps:

  1. Clone HimariO's llama.cpp fork and switch to the qwen2-vl branch:
   git clone https://github.com/HimariO/llama.cpp.git
   cd llama.cpp/
   git switch qwen2-vl
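Before building, it's worth confirming the branch switch took effect:

   # Confirm that the qwen2-vl branch is checked out (requires git >= 2.22).
   git branch --show-current   # should print: qwen2-vl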
  2. Modify the Makefile:
   nano Makefile

Add llama-qwen2vl-cli to the Makefile so that it is built alongside the existing targets; a quick check after editing is sketched below.
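The exact edit depends on how the fork's Makefile is laid out at the time, so treat this as a hedged sanity check rather than the definitive change: after saving the file, the new target name should appear wherever the other CLI binaries are declared.

   # Hypothetical check: the new CLI target should now appear in the Makefile.
   grep -n "llama-qwen2vl-cli" Makefile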

  3. Copy the fine-tuned model:
   cp -r "OLD/PATH/TO/QWEN2VL-2B-INSTRUCT" "NEW/PATH/TO/QWEN2VL-2B-INSTRUCT"
  4. Disable CUDA and build for CPU-only inference:
   cmake . -DGGML_CUDA=OFF
   make -j$(nproc)  # -j$(nproc) allows parallel build
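If the build succeeds, the new CLI binary should exist before you move on. Where it lands can differ between make and cmake builds, so this is just a hedged sanity check:

   # Locate the freshly built qwen2-vl CLI (path may vary by build setup).
   find . -maxdepth 2 -name "llama-qwen2vl-cli" -type f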
  5. Run the surgery script, which extracts the vision encoder into its own GGUF file:
   PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "PATH/TO/QWEN2VL-2B-INSTRUCT"

Note: Ensure the model path is correctly specified. If using the 2B model, use Qwen/Qwen2-VL-2B-Instruct. For the 7B model, use Qwen/Qwen2-VL-7B-Instruct.
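The surgery step is what produces the separate vision-encoder GGUF that Step 2 passes via --mmproj (named qwen-qwen2-vl-2b-instruct-vision.gguf in the inference command below). A quick, hedged way to confirm it was written, assuming it lands in the current directory:

   # List GGUF files; expect a vision/mmproj file such as
   # qwen-qwen2-vl-2b-instruct-vision.gguf
   ls -lh *.gguf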

  6. Convert the language model weights to GGUF format:
   python3 convert_hf_to_gguf.py "PATH/TO/QWEN2VL-2B-INSTRUCT"
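The converter produces an F16 GGUF by default, which is what the inference command below loads. If memory or speed is tight, you can optionally quantize it further; here is a minimal sketch assuming the standard llama-quantize tool was built along with the other binaries and using the common Q4_K_M preset (file paths are illustrative):

   # Optional: quantize the F16 model to 4-bit to reduce memory use.
   ./bin/llama-quantize \
       "PATH/TO/QWEN2VL-2B-INSTRUCT/Qwen2-VL-2B-Instruct-F16.gguf" \
       "PATH/TO/QWEN2VL-2B-INSTRUCT/Qwen2-VL-2B-Instruct-Q4_K_M.gguf" \
       Q4_K_M

If you do quantize, point -m at the quantized file in Step 2 instead of the F16 one.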

Step 2: Run Inference Using llama.cpp

After converting to GGUF format, you can run inference on the CPU using the following command:

./bin/llama-qwen2vl-cli \
    -m models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf \
    --mmproj qwen-qwen2-vl-2b-instruct-vision.gguf \
    -p "From this image, return JSON with a list of players, each having a name and a score list. The golf score is calculated with 9 front and 9 back scores, with the 10th and 20th elements as sums of the previous scores. Example: [0, 1, 2, 1, 3, 2, 0, 0, 1, 11]" \
    --image "data/train_data/score_card_2/IMG_9381.jpg" \
    -ngl 33 -n 512
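A few notes on the flags: -m points to the language-model GGUF from the conversion step, --mmproj to the vision-encoder GGUF from the surgery step, -p is the prompt, --image is the input image, and -n caps the number of generated tokens. -ngl controls GPU layer offloading, so it has no effect in a CPU-only build and can be dropped here.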

Tip: Resize the image to 640x640 for faster inference.
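As an example of that tip, here is one hedged way to downscale the input, assuming ImageMagick is installed (any image tool works; the path is the one from the command above):

   # Downscale the input image; fewer pixels means fewer vision tokens
   # and faster CPU inference. Requires ImageMagick.
   convert "data/train_data/score_card_2/IMG_9381.jpg" \
       -resize 640x640 \
       "data/train_data/score_card_2/IMG_9381_640.jpg"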


Conclusion

Converting a fine-tuned Qwen2-VL model to GGUF and running it with llama.cpp enables efficient CPU-based inference. GGUF keeps the model in a single, portable, quantization-friendly file, while the lightweight llama.cpp runtime simplifies deployment across platforms. By following this guide, you can prepare and run your own fine-tuned models for a variety of inference tasks. For further exploration, experiment with different quantization levels and inference settings to balance speed and accuracy.

Top comments (1)

Mai Chi Bao

This guide is incredibly detailed! Could you elaborate more on the differences between GGUF and GGML for those new to model optimization?