Are you evaluating Large Language Models (LLMs) for your NLP tasks?
Do you need a flexible, scalable framework that supports multiple providers?
Look no further: LLMeBench is here!
What is LLMeBench?
LLMeBench is an open-source benchmarking framework designed to help researchers and developers evaluate LLMs across different tasks, providers, and languages.
With LLMeBench 1.1.0, we've added:
✅ Expanded modality support (text, vision, multimodal tasks)
✅ More evaluation metrics for precise comparisons
✅ Improved dataset integration for smoother benchmarking
GitHub Repo: github.com/qcri/LLMeBench
Why Benchmarking LLMs is Important
The rapid rise of GPT-4, BLOOMZ, Falcon, and LLaMA has created a need for systematic performance evaluation. LLMs behave differently across tasks, datasets, and languages, making standardized benchmarking essential for:
- Model comparison: which LLM performs best for a specific task?
- Cost & latency analysis: is an LLM efficient for real-world deployment?
- Fairness & bias detection: does the model exhibit language-specific biases?
LLMeBench addresses these challenges with a structured benchmarking approach that supports multiple model providers (selected per asset; see the sketch after this list), including:
- OpenAI (GPT models)
- Hugging Face Inference API
- Azure AI models
- Models deployed through vLLM
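In LLMeBench, the provider is not a global setting: each benchmarking asset names a model class from llmebench.models, so switching backends is essentially a one-line change. The snippet below is a minimal sketch of that idea; the specific class name and model_args shown are assumptions, so check llmebench.models in your installed version for the providers actually available.

```python
# Minimal sketch: the provider an asset uses is just the model class placed in its config.
# OpenAIModel is shown as an example; other classes in llmebench.models cover the
# Hugging Face Inference API, Azure, vLLM, etc. (names and arguments may differ by version).
from llmebench.models import OpenAIModel

MODEL_CONFIG = {
    "model": OpenAIModel,             # which backend will execute the prompts
    "model_args": {"max_tries": 3},   # provider-specific options (illustrative)
}
```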
Getting Started with LLMeBench
- Install LLMeBench:
pip install 'llmebench[fewshot]'
- Download the current assets:
python -m llmebench assets download
This will fetch assets and place them in the current working directory.
- Download one of the datasets, e.g., ArSAS:
python -m llmebench data download ArSAS
This will download the data into the data folder inside the current working directory.
- Evaluate!
For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, you can run:
python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/
This command uses the ArSAS_Random "asset": a file that specifies the dataset, model, and task to evaluate (a sketch of such a file is shown below). Here, the asset name ArSAS_Random refers to the ArSAS dataset paired with the Random baseline model, and assets/ar/sentiment_emotion_others/sentiment/ is the directory containing the benchmarking assets for sentiment analysis on the Arabic ArSAS dataset. Results will be saved in a directory called results.
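An asset is just a small Python module. The sketch below shows the general shape: a config() that wires a dataset, task, and model together, a prompt() that builds the request for one sample, and a post_process() that maps the raw response back to a label. It assumes an OpenAI-backed sentiment asset; the class names, model_args, and response handling are illustrative, so treat the shipped assets under assets/ar/sentiment_emotion_others/sentiment/ as the authoritative reference (the Random baseline asset follows the same shape).

```python
# Sketch of an LLMeBench asset for sentiment analysis on ArSAS.
# Names and response handling are illustrative; compare with the real assets in the repo.
from llmebench.datasets import ArSASDataset
from llmebench.models import OpenAIModel
from llmebench.tasks import SentimentTask


def config():
    # Wires together the dataset to load, the task/metric to score with,
    # and the model backend that will run the prompts.
    return {
        "dataset": ArSASDataset,
        "task": SentimentTask,
        "model": OpenAIModel,
        "model_args": {"max_tries": 3},
    }


def prompt(input_sample):
    # Builds the request sent to the model for a single dataset sample.
    return [
        {"role": "system", "content": "You are an Arabic sentiment classifier."},
        {
            "role": "user",
            "content": (
                "Classify the sentiment of the following tweet as "
                f"Positive, Negative, Neutral, or Mixed:\n{input_sample}"
            ),
        },
    ]


def post_process(response):
    # Maps the raw provider response to a label the task's metric understands
    # (the exact response structure depends on the model class).
    return response["choices"][0]["message"]["content"].strip()
```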
- View the Results
LLMeBench generates a performance report with the following (a quick way to inspect the raw output is sketched after this list):
- Accuracy
- Response time
- Task-specific metrics
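The on-disk layout of that report varies by asset and LLMeBench version, but the run above typically writes its outputs (per-sample responses and aggregated scores) as JSON under the results/ directory. Assuming that default location, here is a small stdlib-only sketch for previewing whatever was produced:

```python
# Quick look at whatever the benchmark run wrote under results/.
# Assumes the default results/ directory from the command above; file names and
# layout vary by asset and LLMeBench version, so this only lists and previews JSON files.
import json
from pathlib import Path

for path in sorted(Path("results").rglob("*.json")):
    print(f"\n== {path} ==")
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        # Show only top-level keys so large per-sample dumps stay readable.
        print("keys:", ", ".join(map(str, data.keys())))
    else:
        print(f"list with {len(data)} entries")
```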
Why Use LLMeBench?
✅ Works with any NLP model & dataset
✅ Supports multiple providers (OpenAI, HF, Azure, Petals)
✅ Handles multimodal & multilingual benchmarking
✅ Saves time & effort in evaluation
Join the Community & Contribute
We're excited to see researchers & developers using LLMeBench for their benchmarking needs!
Try LLMeBench today: github.com/qcri/LLMeBench
If you find it useful, give us a star on GitHub!
Have feedback or feature requests? Open an issue or PR -- we'd love to hear from you!
What's Next?
We're constantly improving LLMeBench with new features & optimizations. Stay tuned for:
✅ More task-specific benchmarking modules
✅ Fine-grained evaluation for multilingual models
✅ Support for additional model providers
If you're working with LLMs and benchmarking, we'd love to hear how LLMeBench can help your workflow! Drop a comment below or connect with us on GitHub!