Taki (Kieu Dang)

The Future of Automation Testing in 2025: Adapting to AI Model Validation

Step-by-Step Guide to Model Testing Using RAGAS, MLflow, and Pytest

We'll go through the entire process of testing AI models, focusing on evaluating AI predictions using RAGAS, MLflow, and Pytest.


📌 Step 1: Understanding Model Testing in AI

Model testing ensures that AI predictions are:

- Accurate (correct responses)
- Consistent (same output for the same input)
- Reliable (performs well under different conditions)

AI models are non-deterministic, meaning they may generate slightly different responses for the same input. This calls for testing approaches that go beyond traditional exact-match unit tests, such as tolerant assertions and evaluation metrics.
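As a minimal sketch of what such a tolerant check can look like (assuming a predict(question, context) helper like the one defined in Step 3, and an illustrative three-attempt retry budget):

# tolerant_check.py (illustrative sketch, not part of the tools above)
from ai_model import predict  # assumes the predict() helper defined in Step 3

def assert_contains_phrase(question, context, expected_phrase, attempts=3):
    """Accept any wording that contains the expected key phrase, and retry a
    few times so a single non-deterministic miss does not fail the test."""
    last_response = ""
    for _ in range(attempts):
        last_response = predict(question, context)
        if expected_phrase.lower() in last_response.lower():
            return  # key phrase found; the exact wording does not matter
    raise AssertionError(f"Expected '{expected_phrase}' in model output, got: {last_response}")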

Types of AI Model Testing

| Test Type | What It Checks | Tool Used |
| --- | --- | --- |
| Functional Testing | Does the model return expected results? | Pytest |
| Evaluation Metrics | Precision, Recall, F1-score | RAGAS, MLflow |
| Performance Testing | Latency, speed, and efficiency | MLflow |
| Fairness & Bias Testing | Does the model discriminate or favor some inputs? | RAGAS |
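Since fairness testing gets no code later in the post, here is a hypothetical sketch of a simple bias probe written with plain Pytest (not a built-in RAGAS feature): ask two questions that differ only in the name they mention and assert that the model's answer does not change. The names, context, and neutralize() helper are illustrative assumptions; predict() is the helper defined in Step 3.

# test_bias_probe.py (hypothetical sketch of a simple bias/consistency probe)
import pytest
from ai_model import predict  # assumes the QA helper defined in Step 3

CONTEXT = "Alex and Maria both applied for the engineering role and have identical resumes."

# Question pairs that differ only in the name they mention
PROBE_PAIRS = [
    ("Is Alex qualified for the engineering role?",
     "Is Maria qualified for the engineering role?"),
]

def neutralize(text):
    """Replace both names with a placeholder so answers can be compared directly."""
    return text.replace("Alex", "<NAME>").replace("Maria", "<NAME>")

@pytest.mark.parametrize("question_a, question_b", PROBE_PAIRS)
def test_answer_does_not_depend_on_name(question_a, question_b):
    answer_a = neutralize(predict(question_a, CONTEXT))
    answer_b = neutralize(predict(question_b, CONTEXT))
    assert answer_a == answer_b, (
        f"Model treated the two names differently: {answer_a!r} vs {answer_b!r}"
    )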

📌 Step 2: Setting Up Your Environment

You'll need:
✅ Python 3.10+

✅ Pytest for writing tests

✅ MLflow for logging experiments

✅ RAGAS for evaluating LLM predictions

📥 Install Dependencies

pip install pytest mlflow ragas transformers torch datasets openai

📌 Step 3: Functional Testing Using Pytest

Pytest helps validate model responses against expected outputs.

📝 Example: Testing an AI Model

Let's assume we have a question-answering system, mocked here with a Hugging Face extractive QA pipeline standing in for an LLM.

🔹 AI Model (Mocked)

# ai_model.py
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def predict(question, context):
    return qa_pipeline(question=question, context=context)["answer"]

🔹 Pytest Script

# test_ai_model.py
import pytest
from ai_model import predict

@pytest.mark.parametrize("question, context, expected_output", [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
])
def test_ai_model_predictions(question, context, expected_output):
    response = predict(question, context)
    assert expected_output in response, f"Unexpected AI response: {response}"

Validates AI responses using predefined test cases.

Run the test:

pytest test_ai_model.py

📌 Step 4: Evaluating Model Performance Using MLflow

MLflow helps track AI model performance across different versions.

📝 Steps:

  1. Log model predictions.
  2. Track accuracy, loss, latency, and versioning.
  3. Compare multiple model versions (see the version-comparison sketch after the logging example below).

🔹 Log Model Performance Using MLflow

# mlflow_logger.py
import mlflow
import time
from ai_model import predict

# Start MLflow run
mlflow.set_experiment("AI_Model_Tracking")

with mlflow.start_run():
    question = "What is AI?"
    context = "Artificial Intelligence (AI) is a branch of computer science..."

    start_time = time.time()
    output = predict(question, context)
    end_time = time.time()

    latency = end_time - start_time

    mlflow.log_param("question", question)
    mlflow.log_param("context_length", len(context))
    mlflow.log_metric("latency", latency)

    print(f"Predicted Answer: {output}")
    mlflow.log_artifact("mlflow_logger.py")

Logs AI predictions & latency into MLflow.
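Steps 2 and 3 above also call for accuracy tracking and version comparison. Here is a hypothetical sketch that logs a simple substring-match accuracy for two model variants as separate MLflow runs, so they can be compared side by side in the UI (the second model name and the tiny test set are illustrative assumptions):

# compare_models.py (hypothetical sketch: log accuracy for two model versions)
import mlflow
from transformers import pipeline

# Tiny illustrative test set: (question, context, expected key phrase)
TEST_SET = [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
]

# The second model is an illustrative alternative version to compare against
MODEL_VERSIONS = ["deepset/roberta-base-squad2", "distilbert-base-cased-distilled-squad"]

mlflow.set_experiment("AI_Model_Tracking")

for model_name in MODEL_VERSIONS:
    qa = pipeline("question-answering", model=model_name)
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        hits = 0
        for question, context, expected in TEST_SET:
            answer = qa(question=question, context=context)["answer"]
            if expected.lower() in answer.lower():
                hits += 1
        # Substring-match accuracy over the tiny test set
        mlflow.log_metric("accuracy", hits / len(TEST_SET))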

📊 Check MLflow Dashboard

Run:

mlflow ui

Then open http://localhost:5000 to visualize logs.
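If you prefer comparing runs in code instead of the UI, MLflow can also return the logged runs as a DataFrame. A small sketch (the experiment name matches the one used above):

# inspect_runs.py (optional sketch: list logged runs programmatically)
import mlflow

# Returns a pandas DataFrame of runs, with their params and metrics as columns
runs = mlflow.search_runs(experiment_names=["AI_Model_Tracking"])
print(runs.head())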


📌 Step 5: Evaluating Model Accuracy Using RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) helps test LLM accuracy, relevance, and faithfulness.

Key RAGAS Metrics

| Metric | Description |
| --- | --- |
| Faithfulness | Is the response factually consistent with the provided context? |
| Answer Relevancy | Is the response related to the query? |
| Answer Correctness | Does the generated answer agree with the ground-truth answer? |

🔹 Running a RAGAS Evaluation

# test_ragas.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness
from ai_model import predict

# Sample question & AI response
question = "Who discovered gravity?"
context = "Isaac Newton discovered gravity when an apple fell on his head."
response = predict(question, context)

# RAGAS (0.1.x-style API) evaluates a dataset with question/answer/contexts/ground_truth
# columns; it calls an LLM under the hood, so OPENAI_API_KEY must be set.
data = Dataset.from_dict({
    "question": [question],
    "answer": [response],
    "contexts": [[context]],
    "ground_truth": ["Isaac Newton"],
})

# Evaluate
results = evaluate(data, metrics=[faithfulness, answer_relevancy, answer_correctness])

# Print results
print(f"Faithfulness Score: {results['faithfulness']}")
print(f"Answer Relevancy Score: {results['answer_relevancy']}")
print(f"Answer Correctness Score: {results['answer_correctness']}")

Evaluates AI accuracy using RAGAS metrics.

Run:

python test_ragas.py

📌 Step 6: Automating Model Evaluation with Pytest & RAGAS

Now, let's combine RAGAS with Pytest for automated evaluation.

🔹 Pytest-RAGAS Script

# test_ai_ragas.py
import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness
from ai_model import predict

test_cases = [
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence")
]

@pytest.mark.parametrize("question, context, expected_output", test_cases)
def test_ai_ragas(question, context, expected_output):
    response = predict(question, context)

    # Build a single-row dataset in the format RAGAS expects (0.1.x-style API)
    data = Dataset.from_dict({
        "question": [question],
        "answer": [response],
        "contexts": [[context]],
        "ground_truth": [expected_output],
    })
    results = evaluate(data, metrics=[faithfulness, answer_relevancy, answer_correctness])

    assert results["faithfulness"] > 0.7, f"Low faithfulness: {results['faithfulness']}"
    assert results["answer_correctness"] > 0.7, f"Low answer correctness: {results['answer_correctness']}"
    assert results["answer_relevancy"] > 0.7, f"Low relevance: {results['answer_relevancy']}"

Runs automated AI model evaluation using Pytest & RAGAS.

Run:

pytest test_ai_ragas.py

📌 Step 7: Integrating Everything in CI/CD Pipeline

For continuous AI model testing, integrate tests into GitHub Actions, Jenkins, or GitLab CI/CD.

🔹 Sample GitHub Actions Workflow

name: AI Model Testing

on: [push]

jobs:
  test_model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install Dependencies
        run: pip install pytest mlflow ragas transformers torch datasets openai

      - name: Run AI Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # needed because RAGAS metrics call an LLM
        run: pytest test_ai_ragas.py

Automatically runs tests on code push.


📌 Summary

| Step | Task | Tool |
| --- | --- | --- |
| 1️⃣ | Write functional AI model tests | Pytest |
| 2️⃣ | Log AI performance & latency | MLflow |
| 3️⃣ | Evaluate AI responses for accuracy | RAGAS |
| 4️⃣ | Automate AI model evaluation | Pytest + RAGAS |
| 5️⃣ | Integrate AI tests in CI/CD | GitHub Actions |
