Step-by-Step Guide to Model Testing Using RAGAS, MLflow, and Pytest
We'll go through the entire process of testing AI models, focusing on evaluating AI predictions using RAGAS, MLflow, and Pytest.
📌 Step 1: Understanding Model Testing in AI
Model testing ensures that AI predictions are:
✅ Accurate (Correct responses)
✅ Consistent (Same output for same input)
✅ Reliable (Performs well under different conditions)
Many AI models, and LLMs in particular, are non-deterministic: the same input can produce slightly different outputs from run to run. Testing them therefore needs approaches that go beyond the exact-match assertions of traditional unit tests, as sketched below.
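For example, instead of asserting an exact string match, a test can check for a key substring after light normalization. A minimal sketch (the `normalize` and `assert_contains_answer` helpers and the sample strings are illustrative, not part of any library):

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes don't fail tests."""
    return " ".join(text.lower().split())

def assert_contains_answer(response: str, expected_fragment: str) -> None:
    """Pass as long as the key fact appears in the response, regardless of phrasing."""
    assert normalize(expected_fragment) in normalize(response), (
        f"Expected '{expected_fragment}' to appear in: {response}"
    )

# Both of these model outputs should pass for the same test case:
assert_contains_answer("Isaac Newton discovered gravity.", "Isaac Newton")
assert_contains_answer("Gravity was discovered by Isaac Newton in the 17th century.", "Isaac Newton")
```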
Types of AI Model Testing
Test Type | What It Checks | Tool Used |
---|---|---|
Functional Testing | Does the model return expected results? | Pytest |
Evaluation Metrics | Precision, Recall, F1-score | RAGAS, MLflow |
Performance Testing | Latency, speed, and efficiency | MLflow |
Fairness & Bias Testing | Does the model discriminate or favor some inputs? | RAGAS |
📌 Step 2: Setting Up Your Environment
You'll need:
✅ Python 3.10+
✅ Pytest for writing tests
✅ MLflow for logging experiments
✅ RAGAS for evaluating LLM predictions
📥 Install Dependencies
pip install pytest mlflow ragas transformers datasets openai
📌 Step 3: Functional Testing Using Pytest
Pytest helps validate model responses against expected outputs.
📝 Example: Testing an AI Model
Let's assume we have a question-answering system. For the examples below we use an extractive Transformer QA model as a lightweight stand-in for an LLM.
🔹 AI Model Under Test
```python
# ai_model.py
from transformers import pipeline

# Extractive question-answering pipeline (downloads the model on first use)
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def predict(question, context):
    return qa_pipeline(question=question, context=context)["answer"]
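To sanity-check the model before writing tests, you can call `predict` directly (the printed answer is illustrative; the exact output depends on the model version):

```python
from ai_model import predict

answer = predict(
    "Who discovered gravity?",
    "Isaac Newton discovered gravity when an apple fell on his head.",
)
print(answer)  # typically something like "Isaac Newton"
```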
🔹 Pytest Script
```python
# test_ai_model.py
import pytest
from ai_model import predict

@pytest.mark.parametrize("question, context, expected_output", [
    ("What is AI?", "Artificial Intelligence (AI) is a branch of computer science...", "Artificial Intelligence"),
    ("Who discovered gravity?", "Isaac Newton discovered gravity when an apple fell on his head.", "Isaac Newton"),
])
def test_ai_model_predictions(question, context, expected_output):
    response = predict(question, context)
    assert expected_output in response, f"Unexpected AI response: {response}"
```
✅ Validates AI responses using predefined test cases.
Run the test:
pytest test_ai_model.py
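The first run of this test downloads the Hugging Face model, which can take a while. A common pattern is to mark model-backed tests so they can be deselected during quick local runs; a minimal sketch, assuming you register the (conventional, not built-in) `slow` marker under `markers` in your pytest.ini:

```python
# test_ai_model_slow.py
import pytest

from ai_model import predict

@pytest.mark.slow  # model-backed test: needs the downloaded model and real inference
def test_real_model_answers_gravity_question():
    answer = predict(
        "Who discovered gravity?",
        "Isaac Newton discovered gravity when an apple fell on his head.",
    )
    assert "Isaac Newton" in answer
```

With the marker registered, `pytest -m "not slow"` skips the model-backed tests and `pytest -m slow` runs only them.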
📌 Step 4: Evaluating Model Performance Using MLflow
MLflow helps track AI model performance across different versions.
📝 Steps:
- Log model predictions.
- Track accuracy, loss, latency, and versioning.
- Compare multiple model versions.
🔹 Log Model Performance Using MLflow
```python
# mlflow_logger.py
import mlflow
import time
from ai_model import predict

# Start MLflow run
mlflow.set_experiment("AI_Model_Tracking")

with mlflow.start_run():
    question = "What is AI?"
    context = "Artificial Intelligence (AI) is a branch of computer science..."

    start_time = time.time()
    output = predict(question, context)
    end_time = time.time()
    latency = end_time - start_time

    mlflow.log_param("question", question)
    mlflow.log_param("context_length", len(context))
    mlflow.log_metric("latency", latency)

    print(f"Predicted Answer: {output}")
    mlflow.log_artifact("mlflow_logger.py")
```
✅ Logs AI predictions & latency into MLflow.
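The steps above also mention tracking accuracy. One way to do that (a rough sketch; the two-item evaluation set and exact-substring scoring are simplifications for illustration) is to score a small labelled batch and log the result as an MLflow metric:

```python
# mlflow_accuracy.py
import mlflow

from ai_model import predict

# Tiny illustrative evaluation set; use a proper labelled dataset in practice
EVAL_SET = [
    ("Who discovered gravity?",
     "Isaac Newton discovered gravity when an apple fell on his head.",
     "Isaac Newton"),
    ("What is AI?",
     "Artificial Intelligence (AI) is a branch of computer science...",
     "Artificial Intelligence"),
]

mlflow.set_experiment("AI_Model_Tracking")

with mlflow.start_run():
    # Count predictions that contain the expected answer substring
    hits = sum(
        1 for question, context, expected in EVAL_SET
        if expected in predict(question, context)
    )
    accuracy = hits / len(EVAL_SET)
    mlflow.log_metric("accuracy", accuracy)
    print(f"Accuracy on eval set: {accuracy:.2f}")
```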
📊 Check MLflow Dashboard
Run:
mlflow ui
Then open http://localhost:5000 to visualize logs.
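To compare runs programmatically instead of in the UI, you can query the tracking store. A sketch assuming the experiment name used above and an MLflow version recent enough to support the `experiment_names` argument:

```python
# compare_runs.py
import mlflow

# Returns a pandas DataFrame with one row per run (params/metrics become columns)
runs = mlflow.search_runs(experiment_names=["AI_Model_Tracking"])

# Inspect the fastest runs first
columns = ["run_id", "metrics.latency", "params.question"]
print(runs[columns].sort_values("metrics.latency").head(10))
```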
📌 Step 5: Evaluating Model Accuracy Using RAGAS
RAGAS (Retrieval-Augmented Generation Assessment) helps test LLM accuracy, relevance, and faithfulness.
Key RAGAS Metrics
Metric | Description |
---|---|
Faithfulness | Is the answer factually consistent with the provided context? |
Answer Relevancy | Is the answer relevant to the question that was asked? |
Answer Correctness | How closely does the answer match the ground-truth (reference) answer? |
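These metrics are scored by an LLM judge; in a default RAGAS setup that judge calls the OpenAI API, so an `OPENAI_API_KEY` normally has to be available in the environment (skip this check if you have configured a different judge model):

```python
import os

# RAGAS delegates scoring to an LLM judge (OpenAI by default in common setups);
# fail fast with a clear message if the key is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running RAGAS evaluations.")
```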
🔹 Running a RAGAS Evaluation
```python
# test_ragas.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

from ai_model import predict

# Sample question & AI response
question = "Who discovered gravity?"
context = "Isaac Newton discovered gravity when an apple fell on his head."
response = predict(question, context)

# RAGAS evaluates a dataset; the column names below follow ragas 0.1.x —
# check the docs for the version you have installed.
dataset = Dataset.from_dict({
    "question": [question],
    "answer": [response],
    "contexts": [[context]],          # list of retrieved context chunks per sample
    "ground_truth": ["Isaac Newton"], # reference answer, used by answer_correctness
})

# Evaluate (this calls the LLM judge, so it needs API access)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, answer_correctness])

# Print all metric scores
print(result)
```
✅ Evaluates AI accuracy using RAGAS metrics.
Run:
python test_ragas.py
📌 Step 6: Automating Model Evaluation with Pytest & RAGAS
Now, let's combine RAGAS with Pytest for automated evaluation.
🔹 Pytest-RAGAS Script
```python
# test_ai_ragas.py
import pytest
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

from ai_model import predict

test_cases = [
    ("Who discovered gravity?", "Isaac Newton discovered gravity...", "Isaac Newton"),
    ("What is AI?", "Artificial Intelligence is a branch...", "Artificial Intelligence"),
]

@pytest.mark.parametrize("question, context, ground_truth", test_cases)
def test_ai_ragas(question, context, ground_truth):
    response = predict(question, context)

    dataset = Dataset.from_dict({
        "question": [question],
        "answer": [response],
        "contexts": [[context]],
        "ground_truth": [ground_truth],
    })
    scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, answer_correctness])

    # The result supports dict-style access to each metric's aggregate score
    assert scores["faithfulness"] > 0.7, f"Low faithfulness: {scores['faithfulness']}"
    assert scores["answer_relevancy"] > 0.7, f"Low relevance: {scores['answer_relevancy']}"
    assert scores["answer_correctness"] > 0.7, f"Low answer correctness: {scores['answer_correctness']}"
```
✅ Runs automated AI model evaluation using Pytest & RAGAS.
Run:
pytest test_ai_ragas.py
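RAGAS can also score many samples in one call, which reduces the number of LLM-judge round trips. An alternative sketch (same ragas 0.1.x column-name assumption) that evaluates the whole suite at once and reports aggregate scores:

```python
# eval_ai_ragas_batch.py
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

from ai_model import predict

questions = ["Who discovered gravity?", "What is AI?"]
contexts = [
    "Isaac Newton discovered gravity when an apple fell on his head.",
    "Artificial Intelligence (AI) is a branch of computer science...",
]
ground_truths = ["Isaac Newton", "Artificial Intelligence"]

dataset = Dataset.from_dict({
    "question": questions,
    "answer": [predict(q, c) for q, c in zip(questions, contexts)],
    "contexts": [[c] for c in contexts],
    "ground_truth": ground_truths,
})

scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, answer_correctness])
print(scores)  # aggregate score per metric across all samples
```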
📌 Step 7: Integrating Everything into a CI/CD Pipeline
For continuous AI model testing, integrate tests into GitHub Actions, Jenkins, or GitLab CI/CD.
🔹 Sample GitHub Actions Workflow
```yaml
name: AI Model Testing
on: [push]

jobs:
  test_model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install Dependencies
        run: pip install pytest mlflow ragas transformers datasets
      - name: Run AI Tests
        run: pytest test_ai_ragas.py
        env:
          # RAGAS needs its LLM judge; store the key as a repository secret
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
✅ Automatically runs tests on code push.
📌 Summary
Step | Task | Tool |
---|---|---|
1️⃣ | Write functional AI model tests | Pytest |
2️⃣ | Log AI performance & latency | MLflow |
3️⃣ | Evaluate AI responses for accuracy | RAGAS |
4️⃣ | Automate AI model evaluation | Pytest + RAGAS |
5️⃣ | Integrate AI tests in CI/CD | GitHub Actions |