Taki (Kieu Dang)

The Future of Automation Testing in 2025: Adapting to AI Model Validation

Step-by-Step Guide to Model Testing Using RAGAS, MLflow, and Pytest

We'll go through the entire process of testing AI models, focusing on evaluating AI predictions using RAGAS, MLflow, and Pytest.


📌 Step 1: Understanding Model Testing in AI

Model testing ensures that AI predictions are:

- Accurate (correct responses)
- Consistent (same output for the same input)
- Reliable (performs well under different conditions)

AI models are non-deterministic, meaning they may generate slightly different responses for the same input. This calls for testing approaches that go beyond traditional exact-match unit tests, such as tolerant assertions and evaluation metrics.
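As a minimal sketch of what such a tolerant check can look like (assuming a predict(question, context) helper like the one defined in Step 3, and an illustrative three-attempt retry budget):

# tolerant_check.py (illustrative sketch, not part of the tools above)
from ai_model import predict  # assumes the predict() helper defined in Step 3

def assert_contains_phrase(question, context, expected_phrase, attempts=3):
    """Accept any wording that contains the expected key phrase, and retry a
    few times so a single non-deterministic miss does not fail the test."""
    last_response = ""
    for _ in range(attempts):
        last_response = predict(question, context)
        if expected_phrase.lower() in last_response.lower():
            return  # key phrase found; the exact wording does not matter
    raise AssertionError(f"Expected '{expected_phrase}' in model output, got: {last_response}")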

Types of AI Model Testing

| Test Type | What It Checks | Tool Used |
| --- | --- | --- |
| Functional Testing | Does the model return expected results? | Pytest |
| Evaluation Metrics | Precision, Recall, F1-score | RAGAS, MLflow |
| Performance Testing | Latency, speed, and efficiency | MLflow |
| Fairness & Bias Testing | Does the model discriminate or favor some inputs? | RAGAS |
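Since fairness testing gets no code later in the post, here is a hypothetical sketch of a simple bias probe written with plain Pytest (not a built-in RAGAS feature): ask two questions that differ only in the name they mention and assert that the model's answer does not change. The names, context, and neutralize() helper are illustrative assumptions; predict() is the helper defined in Step 3.

# test_bias_probe.py (hypothetical sketch of a simple bias/consistency probe)
import pytest
from ai_model import predict  # assumes the QA helper defined in Step 3

CONTEXT = "Alex and Maria both applied for the engineering role and have identical resumes."

# Question pairs that differ only in the name they mention
PROBE_PAIRS = [
    ("Is Alex qualified for the engineering role?",
     "Is Maria qualified for the engineering role?"),
]

def neutralize(text):
    """Replace both names with a placeholder so answers can be compared directly."""
    return text.replace("Alex", "<NAME>").replace("Maria", "<NAME>")

@pytest.mark.parametrize("question_a, question_b", PROBE_PAIRS)
def test_answer_does_not_depend_on_name(question_a, question_b):
    answer_a = neutralize(predict(question_a, CONTEXT))
    answer_b = neutralize(predict(question_b, CONTEXT))
    assert answer_a == answer_b, (
        f"Model treated the two names differently: {answer_a!r} vs {answer_b!r}"
    )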

📌 Step 2: Setting Up Your Environment

You'll need:
✅ Python 3.10+

✅ Pytest for writing tests

✅ MLflow for logging experiments

✅ RAGAS for evaluating LLM predictions

📥 Install Dependencies

pip install pytest mlflow ragas transformers torch datasets openai

📌 Step 3: Functional Testing Using Pytest

Pytest helps validate model responses against expected outputs.

📝 Example: Testing an AI Model

Let's assume we have a question-answering system, mocked here with a Hugging Face extractive QA pipeline standing in for an LLM.

🔹 AI Model (Mocked)

# ai_model.py
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def predict(question, context):
    return qa_pipeline(question=question, context=context)["answer"]

🔹 Pytest Script

# test_ai_model.py
import pytest
from ai_model import predict

@pytest.mark.parametrize("question, context, expected_output", [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
])
def test_ai_model_predictions(question, context, expected_output):
    response = predict(question, context)
    assert expected_output in response, f"Unexpected AI response: {response}"

Validates AI responses using predefined test cases.

Run the test:

pytest test_ai_model.py

📌 Step 4: Evaluating Model Performance Using MLflow

MLflow helps track AI model performance across different versions.

📝 Steps:

  1. Log model predictions.
  2. Track accuracy, loss, latency, and versioning.
  3. Compare multiple model versions (see the version-comparison sketch after the logging example below).

🔹 Log Model Performance Using MLflow

# mlflow_logger.py
import mlflow
import time
from ai_model import predict

# Start MLflow run
mlflow.set_experiment("AI_Model_Tracking")

with mlflow.start_run():
    question = "What is AI?"
    context = "Artificial Intelligence (AI) is a branch of computer science..."

    start_time = time.time()
    output = predict(question, context)
    end_time = time.time()

    latency = end_time - start_time

    mlflow.log_param("question", question)
    mlflow.log_param("context_length", len(context))
    mlflow.log_metric("latency", latency)

    print(f"Predicted Answer: {output}")
    mlflow.log_artifact("mlflow_logger.py")

Logs AI predictions & latency into MLflow.
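Steps 2 and 3 above also call for accuracy tracking and version comparison. Here is a hypothetical sketch that logs a simple substring-match accuracy for two model variants as separate MLflow runs, so they can be compared side by side in the UI (the second model name and the tiny test set are illustrative assumptions):

# compare_models.py (hypothetical sketch: log accuracy for two model versions)
import mlflow
from transformers import pipeline

# Tiny illustrative test set: (question, context, expected key phrase)
TEST_SET = [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
]

# The second model is an illustrative alternative version to compare against
MODEL_VERSIONS = ["deepset/roberta-base-squad2", "distilbert-base-cased-distilled-squad"]

mlflow.set_experiment("AI_Model_Tracking")

for model_name in MODEL_VERSIONS:
    qa = pipeline("question-answering", model=model_name)
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        hits = 0
        for question, context, expected in TEST_SET:
            answer = qa(question=question, context=context)["answer"]
            if expected.lower() in answer.lower():
                hits += 1
        # Substring-match accuracy over the tiny test set
        mlflow.log_metric("accuracy", hits / len(TEST_SET))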

📊 Check MLflow Dashboard

Run:

mlflow ui

Then open http://localhost:5000 to visualize logs.
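If you prefer comparing runs in code instead of the UI, MLflow can also return the logged runs as a DataFrame. A small sketch (the experiment name matches the one used above):

# inspect_runs.py (optional sketch: list logged runs programmatically)
import mlflow

# Returns a pandas DataFrame of runs, with their params and metrics as columns
runs = mlflow.search_runs(experiment_names=["AI_Model_Tracking"])
print(runs.head())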


📌 Step 5: Evaluating Model Accuracy Using RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) helps test LLM accuracy, relevance, and faithfulness.

Key RAGAS Metrics

| Metric | Description |
| --- | --- |
| Faithfulness | Is the response factually consistent with the provided context? |
| Answer Relevancy | Is the response related to the query? |
| Answer Correctness | Does the generated answer agree with the ground-truth answer? |

🔹 Running a RAGAS Evaluation

# test_ragas.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness
from ai_model import predict

# Sample question & AI response
question = "Who discovered gravity?"
context = "Isaac Newton discovered gravity when an apple fell on his head."
response = predict(question, context)

# RAGAS (0.1.x-style API) evaluates a dataset with question/answer/contexts/ground_truth
# columns; it calls an LLM under the hood, so OPENAI_API_KEY must be set.
data = Dataset.from_dict({
    "question": [question],
    "answer": [response],
    "contexts": [[context]],
    "ground_truth": ["Isaac Newton"],
})

# Evaluate
results = evaluate(data, metrics=[faithfulness, answer_relevancy, answer_correctness])

# Print results
print(f"Faithfulness Score: {results['faithfulness']}")
print(f"Answer Relevancy Score: {results['answer_relevancy']}")
print(f"Answer Correctness Score: {results['answer_correctness']}")

Evaluates AI accuracy using RAGAS metrics.

Run:

python test_ragas.py

📌 Step 6: Automating Model Evaluation with Pytest & RAGAS

Now, let's combine RAGAS with Pytest for automated evaluation.

🔹 Pytest-RAGAS Script

# test_ai_ragas.py
import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness
from ai_model import predict

test_cases = [
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence")
]

@pytest.mark.parametrize("question, context, expected_output", test_cases)
def test_ai_ragas(question, context, expected_output):
    response = predict(question, context)

    # Build a single-row dataset in the format RAGAS expects (0.1.x-style API)
    data = Dataset.from_dict({
        "question": [question],
        "answer": [response],
        "contexts": [[context]],
        "ground_truth": [expected_output],
    })
    results = evaluate(data, metrics=[faithfulness, answer_relevancy, answer_correctness])

    assert results["faithfulness"] > 0.7, f"Low faithfulness: {results['faithfulness']}"
    assert results["answer_correctness"] > 0.7, f"Low answer correctness: {results['answer_correctness']}"
    assert results["answer_relevancy"] > 0.7, f"Low relevance: {results['answer_relevancy']}"

Runs automated AI model evaluation using Pytest & RAGAS.

Run:

pytest test_ai_ragas.py

📌 Step 7: Integrating Everything in CI/CD Pipeline

For continuous AI model testing, integrate tests into GitHub Actions, Jenkins, or GitLab CI/CD.

🔹 Sample GitHub Actions Workflow

name: AI Model Testing

on: [push]

jobs:
  test_model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install Dependencies
        run: pip install pytest mlflow ragas transformers torch datasets openai

      - name: Run AI Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # needed because RAGAS metrics call an LLM
        run: pytest test_ai_ragas.py

Automatically runs tests on code push.


📌 Summary

| Step | Task | Tool |
| --- | --- | --- |
| 1️⃣ | Write functional AI model tests | Pytest |
| 2️⃣ | Log AI performance & latency | MLflow |
| 3️⃣ | Evaluate AI responses for accuracy | RAGAS |
| 4️⃣ | Automate AI model evaluation | Pytest + RAGAS |
| 5️⃣ | Integrate AI tests in CI/CD | GitHub Actions |
