Key Takeaways
- How to convert a fine-tuned Qwen2-VL model to GGUF format.
- How to run the converted model on a CPU using llama.cpp.
Introduction
Previously, I showed you how to fine-tune a vision-language model (VLM) on a custom dataset. Now, let's explore how to deploy the fine-tuned model efficiently using the GGUF format and llama.cpp.
Why GGUF and llama.cpp?
- Universal Compatibility: llama.cpp is a CPU-first C++ library with minimal dependencies, which keeps deployment simple and makes it easy to embed in other languages and environments.
- Comprehensive Feature Integration: llama.cpp bundles the low-level pieces needed for local inference, such as quantization, tokenization, and sampling, giving you streamlined capabilities in one place, much as LangChain does at a higher level of abstraction.
- Focused Optimization: llama.cpp concentrates on a narrow family of model architectures and its own weight formats (GGML, and now GGUF), which enables targeted efficiency improvements such as aggressive quantization.
Step 1: Convert Qwen2-VL to GGUF Format
To convert the fine-tuned Qwen2-VL model into GGUF format, follow these steps:
- Clone the llama.cpp repository and switch to the appropriate branch:
git clone https://github.com/HimariO/llama.cpp.git
cd llama.cpp/
git switch qwen2-vl
- Modify the Makefile:
nano Makefile
Add llama-qwen2vl-cli to the list of build targets in the Makefile.
- Copy the fine-tuned model:
cp -r "OLD/PATH/TO/QWEN2VL-2B-INSTRUCT" "NEW/PATH/TO/QWEN2VL-2B-INSTRUCT"
- Disable CUDA and build for CPU-only inference:
cmake . -DGGML_CUDA=OFF
make -j$(nproc) # -j$(nproc) builds in parallel using all available CPU cores
- Run the vision-encoder surgery script, which extracts the vision encoder into its own GGUF file (this is the file passed to --mmproj in Step 2):
PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "PATH/TO/QWEN2VL-2B-INSTRUCT"
Note: Ensure the model path is correctly specified. If using the 2B model, use Qwen/Qwen2-VL-2B-Instruct; for the 7B model, use Qwen/Qwen2-VL-7B-Instruct.
- Convert the language model to GGUF format:
python3 convert_hf_to_gguf.py "PATH/TO/QWEN2VL-2B-INSTRUCT"
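Optionally, you can sanity-check the resulting file before moving on. The sketch below uses the GGUFReader class from the gguf-py package that ships with llama.cpp (also available via pip install gguf); the file path is an example based on the output used in Step 2, so adjust it to wherever the converter wrote your .gguf file.
from gguf import GGUFReader  # gguf-py ships with llama.cpp; also `pip install gguf`

# Example path; use the .gguf file that convert_hf_to_gguf.py produced for your model.
reader = GGUFReader("models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf")

# Print a few metadata keys (architecture, tokenizer settings, context length, ...).
for key in list(reader.fields)[:10]:
    print(key)

# Print the first few tensors with their shapes to confirm the weights were exported.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape)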
Step 2: Run Inference Using llama.cpp
After converting to GGUF format, you can run inference on the CPU using the following command:
./bin/llama-qwen2vl-cli \
-m models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf \
--mmproj qwen-qwen2-vl-2b-instruct-vision.gguf \
-p "From this image, return JSON with a list of players, each having a name and a score list. The golf score is calculated with 9 front and 9 back scores, with the 10th and 20th elements as sums of the previous scores. Example: [0, 1, 2, 1, 3, 2, 0, 0, 1, 11]" \
--image "data/train_data/score_card_2/IMG_9381.jpg" \
-ngl 33 -n 512
Tip: Resize the image to 640x640 for faster inference.
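A quick way to apply that tip is a small Pillow script (a sketch; Pillow is assumed to be installed, and the paths mirror the example command above):
from PIL import Image  # pip install pillow

# Downscale the score card before passing it to --image in the command above.
img = Image.open("data/train_data/score_card_2/IMG_9381.jpg")
img = img.resize((640, 640), Image.Resampling.LANCZOS)  # Pillow >= 9.1
img.save("data/train_data/score_card_2/IMG_9381_640.jpg")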
Conclusion
Converting a fine-tuned Qwen2-VL model into GGUF format and running it with llama.cpp enables efficient, CPU-based inference. The GGUF format ensures compatibility and performance optimization, while the streamlined llama.cpp library simplifies model deployment across platforms. By following this guide, you can prepare and run your models for a variety of inference tasks. For further exploration, consider experimenting with different quantization strategies and inference settings to balance performance and accuracy.
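As a starting point for such experiments, the sketch below wraps the CLI call from Step 2 with Python's subprocess module to run the converted model over a folder of images. The binary name, model paths, data folder, and (abbreviated) prompt are taken from the example above and are assumptions about your local layout.
import subprocess
from pathlib import Path

MODEL = "models/qwen2_vl_lora_sft_2b/Qwen2-VL-2B-Instruct-F16.gguf"
MMPROJ = "qwen-qwen2-vl-2b-instruct-vision.gguf"
PROMPT = "From this image, return JSON with a list of players, each having a name and a score list."

# Score every image in the folder with the converted model and print the raw output.
for image in sorted(Path("data/train_data/score_card_2").glob("*.jpg")):
    result = subprocess.run(
        ["./bin/llama-qwen2vl-cli",
         "-m", MODEL,
         "--mmproj", MMPROJ,
         "-p", PROMPT,
         "--image", str(image),
         "-n", "512"],
        capture_output=True, text=True, check=True,
    )
    print(image.name, "->", result.stdout.strip())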