Firoj Alam

Benchmarking LLMs Made Easy with LLMeBench

🔹 Are you evaluating Large Language Models (LLMs) for your NLP tasks?
🔹 Do you need a flexible, scalable framework that supports multiple providers?

Look no further: LLMeBench is here!

What is LLMeBench?

LLMeBench is an open-source benchmarking framework designed to help researchers and developers evaluate LLMs across different tasks, providers, and languages.

With LLMeBench 1.1.0, we've added:

✅ Expanded modality support (text, vision, multimodal tasks)
✅ More evaluation metrics for precise comparisons
✅ Improved dataset integration for smoother benchmarking

🔗 GitHub Repo → github.com/qcri/LLMeBench

💡 Why Benchmarking LLMs is Important

The rapid rise of GPT-4, BLOOMZ, Falcon, and LLaMA has created a need for systematic performance evaluation. LLMs behave differently across tasks, datasets, and languages, making standardized benchmarking essential for:

📌 Model Comparison → Which LLM performs best for a specific task?
📌 Cost & Latency Analysis → Is an LLM efficient for real-world deployment?
📌 Fairness & Bias Detection → Does the model exhibit language-specific biases?

LLMeBench addresses these challenges with a structured benchmarking approach that supports various model providers, such as:
🟢 OpenAI (GPT models)
🟢 Hugging Face Inference API
🟢 Azure AI models
🟢 Models deployed through vLLM
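
Most hosted providers need credentials before you can benchmark against them, and LLMeBench reads these from the environment. Here is a minimal sketch, assuming the OpenAI backend uses the standard OPENAI_API_KEY variable; the exact variable names each backend expects are an assumption here, so check the repository documentation for your provider:

    import os

    # Assumption: the OpenAI-backed models read the standard OPENAI_API_KEY
    # variable; Azure, Hugging Face, and vLLM backends expect their own
    # variables -- consult the repository documentation for the exact names.
    os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, never hard-code real keys

In practice you would export the variable in your shell before running the benchmark, but the idea is the same: provider credentials come from the environment, not from the asset files.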

Getting Started with LLMeBench

  1. Install LLMeBench

    pip install 'llmebench[fewshot]'

  2. Download the current assets:

    python -m llmebench assets download

This will fetch assets and place them in the current working directory.

  3. Download one of the datasets, e.g., ArSAS:

    python -m llmebench data download ArSAS

This will download the data into the data folder inside the current working directory.

  4. Evaluate!

For example, to evaluate the performance of a random baseline for sentiment analysis on the ArSAS dataset, you can run:

    python -m llmebench --filter 'sentiment/ArSAS_Random*' assets/ results/

This uses the ArSAS_Random "asset": a Python file that specifies the dataset, model, and task to evaluate. Here, ArSAS_Random refers to the ArSAS dataset paired with the Random baseline model, and assets/ar/sentiment_emotion_others/sentiment/ is the directory where the benchmarking asset for sentiment analysis on the Arabic ArSAS dataset can be found. Results will be saved in a directory called results/.
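
For context, here is a minimal sketch of what such an asset file can look like. The class names and config keys below are illustrative assumptions, not copied from the repository; the downloaded assets under assets/ show the exact structure LLMeBench expects:

    # Illustrative asset sketch -- class names and config keys are assumptions;
    # compare with the real assets shipped under assets/ before relying on them.
    from llmebench.datasets import ArSASDataset   # assumed dataset class name
    from llmebench.models import OpenAIModel      # assumed model class name
    from llmebench.tasks import SentimentTask     # assumed task class name


    def config():
        # Ties together the dataset, task, and model the runner should evaluate.
        return {
            "dataset": ArSASDataset,
            "task": SentimentTask,
            "model": OpenAIModel,
            "model_args": {"class_labels": ["Positive", "Negative", "Neutral", "Mixed"]},
        }


    def prompt(input_sample):
        # Builds the prompt sent to the model for one test sample.
        return [
            {
                "role": "user",
                "content": f"Classify the sentiment of this tweet as Positive, Negative, Neutral, or Mixed: {input_sample}",
            }
        ]


    def post_process(response):
        # Maps the raw model response back to a label the evaluator can score.
        return response["choices"][0]["message"]["content"].strip()

The --filter argument shown in the command above then lets you select any subset of such assets by name.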

  5. View the Results

LLMeBench generates a performance report with:

📊 Accuracy
⏳ Response time
📈 Task-specific metrics
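
If you want to work with these numbers programmatically rather than read them off the console, the snippet below simply walks the output directory and prints any JSON it finds. That per-asset scores land as JSON files under results/ is an assumption on my part; inspect the directory after a run to see the exact layout:

    import json
    from pathlib import Path

    # Assumption: each evaluated asset leaves one or more JSON files with its
    # scores somewhere under results/; walk the tree and print what is there.
    for path in Path("results").rglob("*.json"):
        with path.open(encoding="utf-8") as f:
            print(path, json.dumps(json.load(f), indent=2))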

🎯 Why Use LLMeBench?

✔ Works with any NLP model & dataset
✔ Supports multiple providers (OpenAI, HF, Azure, Petals)
✔ Handles multimodal & multilingual benchmarking
✔ Saves time & effort in evaluation

⭐ Join the Community & Contribute

We're excited to see researchers & developers using LLMeBench for their benchmarking needs! 🚀

🔗 Try LLMeBench today: github.com/qcri/LLMeBench
⭐ If you find it useful, give us a star on GitHub!

💬 Have feedback or feature requests? Open an issue or PR -- we'd love to hear from you!

💡 What's Next?

We're constantly improving LLMeBench with new features & optimizations. Stay tuned for:
✅ More task-specific benchmarking modules
✅ Fine-grained evaluation for multilingual models
✅ Support for additional model providers

🔥 If you're working with LLMs and benchmarking, we'd love to hear how LLMeBench can help your workflow! Drop a comment below or connect with us on GitHub! 🚀✨
