A week after the famous, or infamous, OpenAI Dev Day, we at Confident AI released JudgementalGPT — an LLM agent built using OpenAI's Assistants API, specifically designed to evaluate other LLM applications. What started off as an experimental idea quickly turned into a prototype we were eager to ship, as early users reported that JudgementalGPT gave more accurate and reliable results than other state-of-the-art LLM-based evaluation approaches such as G-Eval.
Understandably, given that Confident AI is the world's first open-source evaluation infrastructure for LLMs, many users demanded more transparency into how JudgementalGPT was built after our initial public release:
I thought it’s all open source, but it seems like JudgementalGPT, in particular, is a black box for users. It would be great if we had more knowledge on how this is built.
So here you go, dear anonymous internet stranger, this article is dedicated to you.
DeepEval - open-source evaluation framework for LLM applications
DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.
We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to spread the word, please? 🥺❤️🥺
Limitations of LLM-based evaluations
The authors of G-Eval state that:
Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.
For those who don't already know, G-Eval is a framework that uses large language models (LLMs) with chain-of-thought (CoT) reasoning to evaluate the quality of generated text in a form-filling paradigm. If you've ever tried implementing a version of your own, you'll quickly find that using LLMs for evaluation presents its own set of problems:
Unreliability — although G-Eval uses a low-precision grading scale (1–5), which makes scores easier to interpret, these scores can vary a lot even under identical evaluation conditions. This variability comes from an intermediate step in G-Eval that dynamically generates the evaluation steps used later on, which increases the stochasticity of evaluation scores (and is also why providing an initial seed value doesn't help).
Inaccuracy — for certain tasks, one digit usually dominates the output (e.g., 3 on a 1–5 grading scale with gpt-3.5-turbo). A way around this is to take the probabilities of the score tokens output by the LLM, normalize them, and use their weighted summation as the final score (see the sketch after this list). Unfortunately, this isn't an option if you're using OpenAI's GPT models as the evaluator, since they deprecated the logprobs parameter a few months ago.
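To make the weighted-summation idea concrete, here is a minimal sketch of that scoring trick. The token probabilities below are made-up numbers purely for illustration; in a real implementation they would come from the evaluator model's logprobs, which is exactly what isn't exposed at the moment:

```python
# A minimal sketch of probability-weighted scoring. The probabilities of the
# five score tokens are invented for illustration; in practice they would be
# read from the evaluator LLM's logprobs.
score_token_probs = {
    "1": 0.02,
    "2": 0.05,
    "3": 0.70,  # the "dominant digit" problem: most of the mass piles up here
    "4": 0.18,
    "5": 0.05,
}

# Normalize over the five score tokens (their probabilities rarely sum to 1
# once other tokens are filtered out), then take the weighted summation.
total = sum(score_token_probs.values())
weighted_score = sum(int(tok) * p / total for tok, p in score_token_probs.items())

print(round(weighted_score, 2))  # e.g. 3.19 instead of a flat 3
```

The result is a fine-grained, continuous score that reflects how confident the model was in each digit, rather than collapsing everything onto the single most likely one.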
In fact, another paper that explored LLM-as-a-judge pointed out that using LLMs as evaluators is flawed in several ways: GPT-4 gives preferential treatment to self-generated outputs, is not very good at math (but neither am I), and is prone to verbosity bias, meaning it favors longer, verbose responses over accurate, shorter alternatives. (An initial study has shown that GPT-4 exhibits verbosity bias 8.75% of the time.)
Can you see how this becomes a problem if you’re trying to evaluate a summarization task?
The OpenAI Assistants API offers a workaround to these problems
Here's a surprise — JudgementalGPT isn't composed of one evaluator built using the new OpenAI Assistants API, but of several. That's right: behind the scenes, JudgementalGPT is a proxy for multiple assistants that perform different evaluations depending on the evaluation task at hand. Here are the problems JudgementalGPT was designed to solve:
Bias — we're still experimenting with this (another reason for keeping JudgementalGPT closed-source!), but assistants have the ability to write and execute code using the code interpreter tool. With a bit of prompt engineering, this means an evaluator can verify tasks that are prone to logical fallacies, such as asserting coding or math problems, or tasks that require factuality, instead of giving preferential treatment to its own outputs (see the sketch after this list).
Reliability — since we no longer require LLMs to dynamically generate CoTs/evaluation steps, we can enforce a set of rules for specific evaluation tasks. In other words, since we’ve pre-defined multiple sets of evaluation steps based on the evaluation task at hand, we have removed the biggest parameter contributing to stochasticity.
Accuracy — having a set of pre-defined evaluation steps for different tasks also means we can provide more guidance based on what we, as humans, actually expect from each evaluator, and quickly iterate on the implementation based on user feedback.
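To give a feel for what such a task-specific assistant could look like, here is a minimal sketch using OpenAI's Assistants API with the code interpreter tool enabled. The assistant's name and instructions are invented for illustration and are not JudgementalGPT's actual configuration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative only: a task-specific evaluator assistant with pre-defined
# evaluation steps baked into its instructions, and the code interpreter
# tool enabled so it can execute code instead of guessing at the answer.
code_evaluator = client.beta.assistants.create(
    name="code-correctness-evaluator",  # hypothetical name
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
    instructions=(
        "You are an evaluator for coding tasks. Follow these evaluation "
        "steps exactly: (1) run the candidate code on the given input, "
        "(2) compare the actual output against the expected output, "
        "(3) output a score from 1 to 5 with a one-sentence justification."
    ),
)
```

Because the assistant can actually run the candidate code, its score depends on observed behaviour rather than on the model's preference for outputs that resemble its own.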
Another insight we gained while integrating G-Eval into our open-source project DeepEval was that LLM-generated evaluation steps tend to be lengthy, arbitrary, and full of fluff, and generally don't provide useful guidance for evaluation. Some of you might also be wondering what happens when JudgementalGPT can't find a suitable evaluator for a particular evaluation task. For this edge case, we default back to G-Eval. Here's a quick architecture diagram of how JudgementalGPT works:
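To complement the diagram, here is a rough sketch of the routing behaviour just described: one pre-defined, step-based evaluator per supported task, with G-Eval as the fallback. The task names, evaluation steps, and class names are assumptions made for illustration only, not JudgementalGPT's actual internals:

```python
class StepwiseEvaluator:
    """An evaluator whose evaluation steps are written by humans up front."""

    def __init__(self, name: str, steps: list[str]):
        self.name = name
        self.steps = steps  # fixed steps, so no dynamically generated CoT


# Registry of task-specific evaluators (illustrative tasks and steps only).
EVALUATORS = {
    "summarization": StepwiseEvaluator(
        "summarization",
        [
            "Check that every claim in the summary appears in the source text.",
            "Check that no critical information is omitted.",
            "Output a score from 1 to 5.",
        ],
    ),
    "code-correctness": StepwiseEvaluator(
        "code-correctness",
        [
            "Execute the candidate code with the code interpreter.",
            "Compare the actual output against the expected output.",
            "Output a score from 1 to 5.",
        ],
    ),
}

# Fallback for tasks with no pre-defined steps; in the real system this is
# where G-Eval (with its dynamically generated steps) takes over.
G_EVAL_FALLBACK = StepwiseEvaluator("g-eval-fallback", [])


def route(task: str) -> StepwiseEvaluator:
    """Pick the pre-defined evaluator for the task, or fall back to G-Eval."""
    return EVALUATORS.get(task, G_EVAL_FALLBACK)
```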
While writing this article, I discovered a recent paper introducing Prometheus, "a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied", which likewise requires evaluation steps to be explicitly defined rather than dynamically generated.
Still, problems with LLM-based evaluation linger
One unresolved issue is the accuracy challenge stemming from the predominance of a single digit in evaluation scores. This phenomenon, theoretically, isn't exclusive to older models and is likely to affect newer versions like gpt-4-1106-preview as well. So, I'm keeping an open mind about how this might affect JudgementalGPT. We're really looking forward to more research that will either back up what we think or give us a whole new perspective; either way, I'm all ears.
Lastly, there are still intricacies involved in defining our own set of evaluators. For example, just as G-Eval isn't a one-size-fits-all solution, neither is a summarization or relevancy metric. Any metric that is open to interpretation is guaranteed to disappoint users who expect something different. For now, the best solution is to have users clearly define their evaluation criteria, ridding the LLM of any evaluation ambiguity (a sketch of what that can look like follows below).
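As a purely illustrative example of what "clearly defined evaluation criteria" can look like in practice, here is a sketch of a user-written criteria block injected into an evaluation prompt. The wording below is an assumption, not any library's actual template:

```python
# The criteria are written by the user, not generated by the LLM, so the
# evaluator has no room to interpret "summarization quality" differently
# from what the user actually expects.
user_defined_criteria = """\
Summarization quality means:
1. Every claim in the summary appears in the original text.
2. The summary states the main conclusion of the original text.
3. The summary is under 100 words.
"""

# Illustrative prompt template; {original_text} and {summary} are filled
# in per test case.
evaluation_prompt = f"""You are an evaluator. Using ONLY the criteria below,
score the summary from 1 to 5 and justify the score in one sentence.

Criteria:
{user_defined_criteria}
Original text:
{{original_text}}

Summary to evaluate:
{{summary}}
"""
```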
Conclusion
At the end of the day, there’s no one-size-fits-all solution for LLM-based evaluations, which is why engineers/data scientists are frequently disappointed by non-human evaluation scores. However, by defining specific and concise evaluation steps for different use cases, LLMs are able to navigate ambiguity better, as they are provided more guidance into what a human might expect for different evaluation criteria.
P.S. By now, those of you who read between the lines will probably know that the key to building better evaluators is to tailor them to specific use cases, and that OpenAI's new Assistants API, along with its code interpreter functionality, is merely the icing on the cake (and a good marketing strategy!).
So, dear anonymous internet stranger, I hope you’re satisfied, and till next time.