Firoj Alam

Benchmarking LLMs Made Easy with LLMeBench

🔹 Are you evaluating Large Language Models (LLMs) for your NLP tasks?
🔹 Do you need a flexible, scalable framework that supports multiple providers?

Look no further: LLMeBench is here!

What is LLMeBench?

LLMeBench is an open-source benchmarking framework designed to help researchers and developers evaluate LLMs across different tasks, providers, and languages.

With LLMeBench 1.1.0, we've added:

✅ Expanded modality support (text, vision, multimodal tasks)
✅ More evaluation metrics for precise comparisons
✅ Improved dataset integration for smoother benchmarking

🔗 GitHub Repo → github.com/qcri/LLMeBench

💡 Why Benchmarking LLMs is Important

The rapid rise of GPT-4, BLOOMZ, Falcon, and LLaMA has created a need for systematic performance evaluation. LLMs behave differently across tasks, datasets, and languages, making standardized benchmarking essential for:

📌 Model Comparison → Which LLM performs best for a specific task?
📌 Cost & Latency Analysis → Is an LLM efficient for real-world deployment?
📌 Fairness & Bias Detection → Does the model exhibit language-specific biases?

LLMeBench addresses these challenges with a structured benchmarking approach that supports various model providers, such as:
🟢 OpenAI (GPT models)
🟢 Hugging Face Inference API
🟢 Azure AI models
🟢 Models deployed through vLLM
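
Most hosted providers need credentials before you can benchmark against them, and LLMeBench reads these from the environment. Here is a minimal sketch, assuming the OpenAI backend uses the standard OPENAI_API_KEY variable; the exact variable names each backend expects are an assumption here, so check the repository documentation for your provider:

    import os

    # Assumption: the OpenAI-backed models read the standard OPENAI_API_KEY
    # variable; Azure, Hugging Face, and vLLM backends expect their own
    # variables -- consult the repository documentation for the exact names.
    os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, never hard-code real keys

In practice you would export the variable in your shell before running the benchmark, but the idea is the same: provider credentials come from the environment, not from the asset files.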

Getting Started with LLMeBench

  1. Install LLMeBench

    pip install 'llmebench[fewshot]'

  2. Download the current assets:

    python -m llmebench assets download

This will fetch assets and place them in the current working directory.

  3. Download one of the datasets, e.g., ArSAS:

    python -m llmebench data download ArSAS

This will download the data into the data folder inside the current working directory.

  4. Evaluate!

For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, you can run:

    python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/

This uses the ArSAS_Random "asset": a Python file that specifies the dataset, model, and task to evaluate. Here, ArSAS_Random refers to the ArSAS dataset paired with the Random baseline model, and assets/ar/sentiment_emotion_others/sentiment/ is the directory where the benchmarking asset for sentiment analysis on the Arabic ArSAS dataset can be found. Results will be saved in a directory called results/.
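
For context, here is a minimal sketch of what such an asset file can look like. The class names and config keys below are illustrative assumptions, not copied from the repository; the downloaded assets under assets/ show the exact structure LLMeBench expects:

    # Illustrative asset sketch -- class names and config keys are assumptions;
    # compare with the real assets shipped under assets/ before relying on them.
    from llmebench.datasets import ArSASDataset   # assumed dataset class name
    from llmebench.models import OpenAIModel      # assumed model class name
    from llmebench.tasks import SentimentTask     # assumed task class name


    def config():
        # Ties together the dataset, task, and model the runner should evaluate.
        return {
            "dataset": ArSASDataset,
            "task": SentimentTask,
            "model": OpenAIModel,
            "model_args": {"class_labels": ["Positive", "Negative", "Neutral", "Mixed"]},
        }


    def prompt(input_sample):
        # Builds the prompt sent to the model for one test sample.
        return [
            {
                "role": "user",
                "content": f"Classify the sentiment of this tweet as Positive, Negative, Neutral, or Mixed: {input_sample}",
            }
        ]


    def post_process(response):
        # Maps the raw model response back to a label the evaluator can score.
        return response["choices"][0]["message"]["content"].strip()

The --filter argument shown in the command above then lets you select any subset of such assets by name.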

  5. View the Results

LLMeBench generates a performance report with:

📊 Accuracy
⏳ Response time
📈 Task-specific metrics
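
If you want to work with these numbers programmatically rather than read them off the console, the snippet below simply walks the output directory and prints any JSON it finds. That per-asset scores land as JSON files under results/ is an assumption on my part; inspect the directory after a run to see the exact layout:

    import json
    from pathlib import Path

    # Assumption: each evaluated asset leaves one or more JSON files with its
    # scores somewhere under results/; walk the tree and print what is there.
    for path in Path("results").rglob("*.json"):
        with path.open(encoding="utf-8") as f:
            print(path, json.dumps(json.load(f), indent=2))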

🎯 Why Use LLMeBench?

✔ Works with any NLP model & dataset
✔ Supports multiple providers (OpenAI, HF, Azure, Petals)
✔ Handles multimodal & multilingual benchmarking
✔ Saves time & effort in evaluation

⭐ Join the Community & Contribute

We're excited to see researchers & developers using LLMeBench for their benchmarking needs! 🚀

🔗 Try LLMeBench today: github.com/qcri/LLMeBench
⭐ If you find it useful, give us a star on GitHub!

💬 Have feedback or feature requests? Open an issue or PR -- we'd love to hear from you!

💡 What's Next?

We're constantly improving LLMeBench with new features & optimizations. Stay tuned for:
✅ More task-specific benchmarking modules
✅ Fine-grained evaluation for multilingual models
✅ Support for additional model providers

🔥 If you're working with LLMs and benchmarking, we'd love to hear how LLMeBench can help your workflow! Drop a comment below or connect with us on GitHub! 🚀✨
