Comparing LLMs for optimizing cost and response quality

Systematically selecting the right model for your specific task

By Bhavishya Pandit

This article was originally published on IBM Developer.

Large language models (LLMs) have become indispensable across various fields, from content creation to automated customer support. As the ecosystem expands with a growing number of open-source models, such as the IBM Granite or Llama family of models, selecting the right model for a specific task has become increasingly complex. Each model excels in different areas, which makes it crucial to compare the LLM outputs systematically.

To facilitate this process, we developed a reusable codebase that anyone can use to compare multiple open-source LLMs effectively. This code will leverage technologies such as LangChain and watsonx.ai to enable seamless integration and querying of various models. Just like comparing products across multiple e-commerce sites to secure the best deal, our code aims to streamline the comparison of LLM outputs.

In this tutorial, we will explore the importance of comparing outputs from multiple LLMs, outline the challenges involved, and present solutions for a systematic evaluation. We will also provide the code necessary to compare various LLMs, highlighting the benefits of efficient model selection and optimized workflows.

Why compare LLM outputs?

To understand the benefits of comparing LLM outputs, let’s first look at the key reasons why it’s important. After that, we’ll discuss the challenges that come with comparing multiple LLMs.

Consider these four core reasons for evaluating LLMs for optimal use:

  • Task-specific performance. Not all LLMs are created equal. One model may excel at creative writing, another at structured summarization, and yet another at factual question answering. Evaluating outputs across tasks like summarization, question answering (QA), or creative generation ensures that you select the best model for the job.

  • Bias and variability. Each LLM reflects biases based on its training data. For instance, some LLMs may demonstrate biases in areas like gender stereotypes. These biases can manifest differently across models, impacting decisions in areas like content moderation or research. Identifying and mitigating these biases is crucial for ensuring fairness and accuracy in the outputs.

  • Cost efficiency. Cost is another important factor. Some LLMs offer high-quality outputs at lower costs, making them more viable for continuous use. Comparing models helps balance the trade-offs between cost and quality, ensuring that resource-intensive tasks don’t unnecessarily strain budgets.

  • Generalization ability. LLMs trained on specific datasets may struggle to generalize to new, unseen scenarios. By comparing outputs across diverse tasks, you can assess which models adapt best to different situations, ensuring robustness in real-world applications.

What are the challenges when comparing multiple LLMs?

Now that we've explored the importance of comparing LLM outputs, let’s turn our attention to the challenges involved in this process.

  • Consistency in output. LLMs often produce varied and sometimes inconsistent responses to the same query. For example, when summarizing a story, one model might focus on emotional aspects while another emphasizes factual content. Quantifying these differences can be challenging, as quality and relevance are often subjective.

  • Latency and cost. Running multiple LLMs simultaneously increases compute time and costs, particularly with API-driven models. Latency issues can arise, especially when querying large models or running chains of queries for complex tasks. High API costs can also make large-scale testing expensive.

  • Evaluation complexity. Choosing the right metrics for comparison is another challenge. While fluency and style might matter more for creative tasks, accuracy and reliability are key for factual QA. Some comparisons require human evaluation, while others can rely on automated metrics like BLEU, ROUGE, or F1, depending on the task (see the short sketch after this list).

  • Scalability. As the number of models and tasks grows, scaling infrastructure to support parallel evaluations becomes complex. Efficiently managing multiple model queries and data pipelines for testing requires significant resources and well-designed systems.

  • Bias and ethical considerations. Different LLMs are trained on diverse datasets, leading to different biases. Identifying these biases is critical, but it’s also essential to understand how the training data influences fairness, especially in sensitive applications.
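
To make the automated-metric idea concrete, here is a toy token-overlap F1 score, a rough stand-in for metrics like ROUGE or F1; in practice you would reach for an established library such as rouge-score or sacrebleu rather than this simplified sketch:

def token_f1(response: str, reference: str) -> float:
    """Toy token-overlap F1 between a model response and a reference answer."""
    resp_tokens = set(response.lower().split())
    ref_tokens = set(reference.lower().split())
    common = resp_tokens & ref_tokens
    if not common:
        return 0.0
    precision = len(common) / len(resp_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: two phrasings of the same fact score highly
print(token_f1("Paris is the capital of France", "The capital of France is Paris"))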

Now, let's discuss how we can build a codebase to make these comparisons easier.

Building a codebase for comparing LLMs

To address these challenges, we propose developing a reusable codebase that allows for efficient comparison of multiple open-source LLMs. This codebase will enable users to input queries, run them across various models, and automatically compare the outputs based on selected criteria, such as task-specific performance, cost, and generalization ability.

By using an automated system, you can compare multiple models in parallel, drastically reducing manual effort. This ensures faster testing, making it easier to find the best-performing model for your specific needs. A systematic comparison generates valuable insights into each model's strengths and weaknesses. For example, one model may excel at conversational fluency while another offers more accurate factual responses.

With a modular codebase, you can easily add new models or adjust testing parameters. This encourages experimentation, which means you can quickly evaluate newly released models or explore how different prompts affect output quality.

Lastly, by comparing outputs, you can optimize workflows by selecting the most suitable LLM for each task. In some cases, combining the outputs of multiple models may yield even better results.

Understanding the value of comparing outputs from different LLMs helps in selecting the best one for your specific needs.

Enough about the why; let's build a system that enables us to compare different LLM outputs.

We need a testing framework that evaluates different LLMs through the IBM watsonx platform. Here's what we're aiming to create:

Core functionality needed

  • Connect to multiple AI models (Llama, Mistral, Granite, etc.) through watsonx
  • Send test prompts and get responses
  • Compare performance metrics (speed, cost, quality)
  • Generate clear comparison reports

Design considerations

  • Need parallel processing for efficiency
  • Must handle different model response formats
  • Should be easy to add new models
  • Must track costs carefully

These are the key components we will build for this testing framework to evaluate LLMs:

Model connection layer

  • Handle watsonx authentication
  • Manage different prompt formats for each model
  • Set up error handling

Testing infrastructure

  • Run tests in parallel to save time
  • Track key metrics like response time
  • Calculate costs per model
  • Measure response quality

Analysis and reporting

  • Compare model performance
  • Generate statistics
  • Create readable reports

Finally, here is the initial test plan:

  • Start with simple prompts ("What's the capital of France?")
  • Add complex prompts (engineering requirements analysis)
  • Test against 3 models initially: ibm/granite-3-8b-instruct, meta-llama/llama-3-1-70b-instruct, and mistralai/mixtral-8x7b-instruct-v01

Phew, let's start!

Setting up the development environment

Before we begin building our LLM testing framework, we need to import all required libraries and set up our API authentication. This first code block handles our essential imports and API key configuration:

# Importing necessary libraries
import os
import re
import time
from langchain_ibm import WatsonxLLM
import concurrent.futures
from typing import List, Dict
from dataclasses import dataclass
from datetime import datetime
import numpy as np

os.environ['WATSONX_APIKEY'] = "ENTER YOUR WATSONX API KEY"
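
The block above hard-codes a placeholder for the key. As an optional variation (not part of the original tutorial), you can prompt for the key at run time with the standard library's getpass so it never lands in source control:

# Optional: prompt for the key at run time instead of hard-coding it
import getpass

if not os.environ.get('WATSONX_APIKEY'):
    os.environ['WATSONX_APIKEY'] = getpass.getpass('Enter your watsonx API key: ')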

Core response generation and metrics tracking

This section defines two critical components: the response generation function and a data structure for tracking metrics. Let's break them down:

def generate_response(prompt, model):
    '''
    Generates a response from a given prompt using a specified model.
    '''
    # Default to an empty wrapper for older Granite models and any model
    # without a known chat template.
    header = ''
    footer = ''
    if 'ibm/granite-3' in model:
        # Granite 3 chat template: the user turn is closed with <|end_of_text|>
        # before the assistant role marker.
        header = '<|start_of_role|>user<|end_of_role|>'
        footer = '<|end_of_text|><|start_of_role|>assistant<|end_of_role|>'
    elif 'mistralai/' in model:
        header = '<s>[INST]'
        footer = '[/INST]</s>'
    elif 'meta-llama/llama' in model:
        header = '<|begin_of_text|><|start_header_id|>user<|end_header_id|>'
        footer = '<|eot_id|><|start_header_id|>assistant<|end_header_id|>'

    parameters = {
        "decoding_method": "sample",
        "min_new_tokens": 1,
        "max_new_tokens": 4096,
        "stop_sequences": [],
        "repetition_penalty": 1,
        "temperature": 0
    }

    prompt_new = prompt.format(header=header, footer=footer)
    watsonx_llm = WatsonxLLM(
        model_id=model,
        url="https://us-south.ml.cloud.ibm.com",
        project_id="ENTER YOUR WATSONX PROJECT ID",
        params=parameters
    )
    response = watsonx_llm.invoke(prompt_new)
    # Strip Markdown bold markers from the response
    return re.sub(r"\*\*([^*]+)\*\*", r"\1", response)

@dataclass
class ResponseMetrics:
    """Class to store response and its metrics"""
    response: str
    processing_time: float
    start_time: datetime
    end_time: datetime
    token_count: int = 0
    cost_per_token: float = 0.0  # price per 1,000 tokens (see get_cost_per_1k_token)
    conciseness_score: float = 0.0

Utility functions for metrics calculation

These essential utility functions handle token counting, response evaluation, and cost calculations for our testing framework.

def count_tokens(text: str) -> int:
    """
    Simple token count estimation.
    """
    # Rough heuristic: about 0.75 tokens per whitespace-delimited word
    return int(len(text.split()) * 0.75)

def calculate_conciseness_score(response: str) -> float:
    """
    Calculate conciseness score based on response length and content.
    """
    words = response.split()
    ideal_length = 50  # adjust based on your needs
    length_score = 1 - min(abs(len(words) - ideal_length) / ideal_length, 1)
    return length_score

def get_cost_per_1k_token(model: str) -> float:
    """
    Get cost per 1,000 tokens for different models.
    """
    cost_mapping = {
        'meta-llama/llama-3-1-70b-instruct': 0.0018,
        'meta-llama/llama-3-70b-instruct': 0.0010,
        'meta-llama/llama-3-8b-instruct': 0.0006,
        'ibm/granite-13b-chat-v2':0.0006,
        'ibm/granite-20b-multilingual':0.0006,
        'mistralai/mistral-large':0.01,
        'mistralai/mixtral-8x7b-instruct-v01':0.0006,
        'ibm/granite-3-8b-instruct':0.0002
    }
    # Default to 0.0 for unknown models so downstream cost math does not fail
    return cost_mapping.get(model, 0.0)
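
Before wiring these helpers into the parallel runner, you can sanity-check them on a sample string. A minimal illustration (the values are only heuristic estimates; note the division by 1,000 because the price table is per 1,000 tokens):

sample = "Paris is the capital of France."
tokens = count_tokens(sample)
# The price table is per 1,000 tokens, so divide by 1000 for an absolute estimate
estimated_cost = tokens * get_cost_per_1k_token('ibm/granite-3-8b-instruct') / 1000
print(f"Estimated tokens: {tokens}")
print(f"Estimated cost: {estimated_cost:.7f}")
print(f"Conciseness score: {calculate_conciseness_score(sample):.2f}")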

Parallel response generation system

This section implements the core parallel processing functionality that allows us to test multiple model-prompt combinations simultaneously, improving efficiency.

def generate_responses_parallel(prompt_list: List[str], model_list: List[str]) -> Dict[str, Dict[str, ResponseMetrics]]:
    """
    Generate responses for multiple prompts and models in parallel using ThreadPoolExecutor.
    """
    results = {}
    total_start_time = time.time()

    def process_single_combination(prompt: str, model: str) -> tuple:
        """Helper function to process a single prompt-model combination"""
        start_time = datetime.now()
        start_processing = time.time()

        try:
            response = generate_response(prompt, model)
            clean_prompt = prompt.replace('{header}', '').replace('{footer}', '')
            print(f'Model: {model}')
            print(f'Prompt: {clean_prompt}')
            print(f'Response: {response}')
            processing_time = time.time() - start_processing
            end_time = datetime.now()

            # Calculate additional metrics
            token_count = count_tokens(prompt+response)
            cost_per_1k_token = get_cost_per_1k_token(model)
            conciseness_score = calculate_conciseness_score(response)

            metrics = ResponseMetrics(
                response=response,
                processing_time=processing_time,
                start_time=start_time,
                end_time=end_time,
                token_count=token_count,
                cost_per_token=cost_per_1k_token,
                conciseness_score=conciseness_score
            )
            return (prompt, model, metrics)
        except Exception as e:
            processing_time = time.time() - start_processing
            end_time = datetime.now()

            metrics = ResponseMetrics(
                response=f"Error: {str(e)}",
                processing_time=processing_time,
                start_time=start_time,
                end_time=end_time
            )

            return (prompt, model, metrics)

    # Create a list of all prompt-model combinations
    combinations = [(prompt, model)
                   for prompt in prompt_list
                   for model in model_list]

    # Initialize results dictionary
    results = {prompt: {} for prompt in prompt_list}

    # Use ThreadPoolExecutor for parallel processing
    with concurrent.futures.ThreadPoolExecutor(max_workers=min(len(combinations), 10)) as executor:
        future_to_combo = {
            executor.submit(process_single_combination, prompt, model): (prompt, model)
            for prompt, model in combinations
        }

        for future in concurrent.futures.as_completed(future_to_combo):
            prompt, model = future_to_combo[future]
            try:
                prompt_result, model_result, metrics = future.result()
                results[prompt_result][model_result] = metrics
            except Exception as e:
                end_time = datetime.now()
                results[prompt][model] = ResponseMetrics(
                    response=f"Error in processing: {str(e)}",
                    processing_time=0.0,
                    start_time=end_time,
                    end_time=end_time
                )
    total_time = time.time() - total_start_time
    print(f"Total processing time: {total_time:.2f} seconds")
    return results

Results analysis and reporting

This section handles the detailed analysis and presentation of our test results, providing both prompt-specific and aggregate statistics.

def print_detailed_statistics(results):
    """
    Print both per-prompt and aggregate statistics with detailed latency analysis.
    """
    # Per-prompt statistics
    print("\nPer-Prompt Statistics:")
    print("=" * 80)

    for prompt in results:
        prompt_latencies = []
        prompt_costs = []
        prompt_conciseness = []
        model_latencies = {}
        model_costs = {}

        # Collect metrics for this prompt
        for model, metrics in results[prompt].items():
            # cost_per_token holds the price per 1,000 tokens, so divide by 1000
            cost = metrics.token_count * metrics.cost_per_token / 1000
            prompt_latencies.append(metrics.processing_time)
            prompt_costs.append(cost)
            prompt_conciseness.append(metrics.conciseness_score)
            model_latencies[model] = metrics.processing_time
            model_costs[model] = cost
        prompt = prompt.replace('{header}', '').replace('{footer}', '')

        # Print prompt-specific statistics
        print(f"\nPrompt: {prompt[:100]}...")  # Show first 100 chars of prompt
        print("Latency Statistics:")
        print(f"  Average Latency: {np.mean(prompt_latencies):.2f} seconds")
        print(f"  Best Latency: {min(prompt_latencies):.2f} seconds")
        print(f"     Model: {min(model_latencies, key=model_latencies.get)}")
        print(f"  Worst Latency: {max(prompt_latencies):.2f} seconds")
        print(f"     Model: {max(model_latencies, key=model_latencies.get)}")

        print("\nCost and Conciseness:")
        print(f"  Best Cost: {min(prompt_costs):.5f}")
        print(f"     Model: {min(model_costs, key=model_costs.get)}")
        print(f"  Worst Cost: {max(prompt_costs):.5f}")
        print(f"     Model: {max(model_costs, key=model_costs.get)}")
        print(f"  Average Conciseness Score: {np.mean(prompt_conciseness):.2f}")
        print("-" * 80)

    # Aggregate metrics across all prompts, per model
    all_model_metrics = {}
    for prompt in results:
        for model, metrics in results[prompt].items():
            if model not in all_model_metrics:
                all_model_metrics[model] = {'latency': [], 'cost': []}

            all_model_metrics[model]['latency'].append(metrics.processing_time)
            # Again, cost_per_token is a per-1,000-token price
            all_model_metrics[model]['cost'].append(metrics.token_count * metrics.cost_per_token / 1000)

    print("\nOverall Aggregate Statistics:")
    print("\nPer-Model Performance:")
    for model in all_model_metrics:
        model_metrics = all_model_metrics[model]
        print(f"\n  {model}:")
        print(f"    Average Latency: {np.mean(model_metrics['latency']):.2f} seconds")
        print(f"    Average Cost: {np.mean(model_metrics['cost']):.5f}")


Test execution configuration

This final section sets up our test cases and executes the parallel testing framework with specific models and prompts.

model_list = [
    'ibm/granite-3-8b-instruct',
    'meta-llama/llama-3-1-70b-instruct',
    'mistralai/mixtral-8x7b-instruct-v01'
]
prompt1 = "{header} What is the capital of France? {footer}"
prompt2 = '''{header}
You are tasked with reviewing system engineering requirements according to INCOSE guidelines. For the given requirement statement,
your objective is to list the specific INCOSE rules that are failing, along with brief explanations of why they are failing.
The output should be formatted as a JSON array where each item follows this pattern:

- Rule number: A reference to the INCOSE rule that is failing (e.g., "R1 - Structured Statements")
- Explanation: A brief explanation of why the requirement is failing that rule
The input is as follows:

Input requirement: The system should be fast.
Output:
{footer}
'''
prompt_list = [prompt1, prompt2]

# Generate responses
results = generate_responses_parallel(prompt_list, model_list)

# Print detailed statistics
print_detailed_statistics(results)



Future enhancements

To improve our LLM comparison framework, we can streamline the logic and add more targeted evaluation metrics. For example, new metrics like faithfulness (how accurately the model reflects the input query) and completeness (how thoroughly it addresses the task) will enhance the depth of analysis.
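
As a starting point, here is a rough keyword-coverage heuristic standing in for a completeness metric. This is only an illustrative assumption about how such a metric could look; a production version would more likely use embeddings, an evaluation library, or an LLM-based scorer.

def completeness_score(query: str, response: str) -> float:
    """Rough placeholder: fraction of content words from the query that appear in the response."""
    stopwords = {"the", "a", "an", "is", "of", "to", "and", "in", "for", "on", "what"}
    query_terms = {w.lower().strip("?.,") for w in query.split() if w.lower() not in stopwords}
    if not query_terms:
        return 0.0
    response_terms = {w.lower().strip("?.,") for w in response.split()}
    return len(query_terms & response_terms) / len(query_terms)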

Additionally, we can introduce an LLM as a judge layer. This meta-model would analyze the reports generated from multiple LLMs and suggest the best model for specific use cases, automating the decision-making process even further.
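
A minimal sketch of what such a judge layer could look like, reusing the generate_response() helper defined earlier; the prompt wording and the choice of mistralai/mistral-large as the judge are assumptions, not part of the current framework:

def judge_responses(question: str, response_a: str, response_b: str,
                    judge_model: str = 'mistralai/mistral-large') -> str:
    """Ask a judge model to pick the stronger of two candidate answers ('A' or 'B')."""
    # Note: candidate answers containing curly braces would need escaping
    # before generate_response() calls .format() on the prompt.
    judge_prompt = (
        "{header}You are an impartial judge. Given the question and two candidate "
        "answers, reply with only 'A' or 'B' for the better answer.\n"
        f"Question: {question}\n"
        f"Answer A: {response_a}\n"
        f"Answer B: {response_b}\n"
        "Verdict:{footer}"
    )
    return generate_response(judge_prompt, judge_model).strip()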

Conclusion

Comparing outputs from multiple LLMs is crucial for selecting the right model that is tailored to specific tasks. With the growing variety of open-source models like Llama, Granite, and Mistral, this comparison ensures you maximize performance, minimize biases, and optimize costs.

By using a structured, automated framework, you can efficiently evaluate models on key metrics, streamlining the decision-making process. As we continue enhancing this system with new metrics and automated reporting, the ability to select the best model for your use case becomes not only simpler but also more reliable.

Top comments (1)

Learn Computer Academy

Good breakdown of LLM comparison challenges and solutions. The reusable codebase approach with LangChain and watsonx.ai makes sense for systematically evaluating models like Granite and Llama. I appreciate how it tackles task-specific performance, cost, and bias—key factors that often get overlooked. The parallel processing setup and detailed metrics (latency, cost, conciseness) are practical for real-world use. Adding faithfulness and completeness metrics, as suggested, would definitely round it out further. Solid work—looking forward to seeing how the judge layer develops!