I’m a TDD fan!
Don’t get me wrong: I don’t make literally every code change by writing a test first. But I’ve gotten good at running parts of a system individually, without the whole product. This saves me time and even money.
ℹ️ This post is part of the “Crew AI Caveats” series, which I create to fill in the gaps left by official courses and to help you master CrewAI faster and easier.
Stuff must be tested in isolation
Of course, when working with CrewAI in practice, the first obvious need for me was to run tasks individually.
Really, when I start, I don’t know what tasks I need. Will a task, as I imagine it, work in one shot, or should I split it into several? No one knows until we try. And the same is true for improvement ideas about tasks we already have.
Additionally, we tune our internal formats and data transformations, and we debug tools. All of these things should be runnable in isolation, and all the cloud vendor connections should be mockable. If they are not, we waste time and money waiting on irrelevant tasks and LLM calls.
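To make “mockable” concrete: I want a tool to receive its vendor client as a dependency, so a test can swap in a fake and never touch the network. Here is a minimal, self-contained sketch of that idea; ArticlesClient and BacklogTool are hypothetical stand-ins, not classes from my project.

from unittest.mock import MagicMock


class ArticlesClient:
    """Hypothetical cloud client; the real one would make HTTP calls."""
    def list_articles(self) -> list[dict]:
        raise RuntimeError("Network call: not allowed in unit tests")


class BacklogTool:
    """Hypothetical tool that receives its client instead of creating it."""
    def __init__(self, client: ArticlesClient):
        self.client = client

    def _run(self) -> list[dict]:
        return [a for a in self.client.list_articles() if a.get('status') == 'backlog']


def test_backlog_tool_without_network():
    # The fake client answers instantly and costs nothing
    fake_client = MagicMock(spec=ArticlesClient)
    fake_client.list_articles.return_value = [
        {'title': 'Draft 1', 'status': 'backlog'},
        {'title': 'Old post', 'status': 'published'},
    ]

    results = BacklogTool(fake_client)._run()

    assert results == [{'title': 'Draft 1', 'status': 'backlog'}]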
But how?
I’m still on my way to assembling a TDD-friendly CrewAI environment. You can check my topic in the CrewAI community and share your tricks, architecture, and code snippets.
Here are my examples.
A tool-testing script
from dotenv import load_dotenv

from src.publishing.tools.get_articles_backlog import GetArticlesBacklog


def main():
    # Load environment variables
    load_dotenv()

    # Initialize the tool
    backlog_tool = GetArticlesBacklog()

    print("\n=== Testing Articles Backlog Tool ===")
    try:
        results = backlog_tool._run()
        print(results)

        print(f"\nFound {len(results)} articles in backlog:")
        for i, article in enumerate(results, 1):
            print(f"\nArticle {i}:")
            for key, value in article.items():
                print(f"{key}: {value}")
    except Exception as e:
        print(f"Error occurred: {str(e)}")


if __name__ == "__main__":
    main()
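If you prefer assertions to eyeballing a printout, the same check fits into pytest. This is a sketch: it still calls the real backend (an integration test, not a unit test), the file path is hypothetical, and it only assumes what the script above already implies, namely that _run() returns a list of dict-like articles.

# Hypothetical location: tests/test_get_articles_backlog.py
import pytest
from dotenv import load_dotenv

from src.publishing.tools.get_articles_backlog import GetArticlesBacklog


@pytest.fixture(autouse=True)
def env():
    # Same prerequisite as the script above: credentials come from .env
    load_dotenv()


def test_backlog_returns_articles():
    results = GetArticlesBacklog()._run()

    assert isinstance(results, list)
    for article in results:
        # Each entry is a key/value mapping, as the printing loop above assumes
        assert hasattr(article, 'items')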
Invoking partial crews
I don’t have an extensive guide on running tasks and their combinations individually. I’m trying various hacks; they work, but none of them can be recommended as good practice.
Here’s the extract from crew.py, though:
@CrewBase
class Publishing():
    ...

    def mock_tasks(self, context: Dict):
        # Attach previously saved results as the outputs of the named tasks,
        # so downstream tasks can consume them without re-running anything.
        for task_name, prev_result in context.items():
            prev_task = getattr(self, task_name)()
            pydantic_output, json_output = prev_task._export_output(prev_result)
            prev_task.output = TaskOutput(
                name=prev_task.name,
                description=prev_task.description,
                expected_output=prev_task.expected_output,
                raw=prev_result,
                pydantic=pydantic_output,
                json_dict=json_output,
                agent='Mock',
                output_format=prev_task._get_output_format(),
            )

    @crew
    def crew(self, mode: str = None) -> Crew:
        if not mode:
            return Crew(
                agents=self.agents,
                tasks=self.tasks,
                verbose=True,
            )

        if mode == 'ideate':
            return Crew(
                agents=[
                    self.idea_hunter(),
                ],
                tasks=[
                    self.ideate_free(),
                ],
                verbose=True,
            )

        ...

        if mode == 'research_idea':
            return Crew(
                agents=[
                    self.scientific_researcher(),
                ],
                tasks=[
                    self.prepare_idea_research(),
                    self.research_idea(),
                ],
                verbose=True,
            )

        # An example of a completely detached agent for testing-only purposes
        if mode == 'test_tool':
            tool_tester = Agent(
                role="Tool Tester",
                goal="Test and validate the provided tool's functionality",
                backstory="I am an expert at testing tools and validating their outputs. I ensure tools work as expected. Use word 'dog' as a part of the product name and 'language learning' as domain to search for knowledge.",
                tools=[self.search_in_knowledge_base()],
                verbose=True,
                allow_delegation=False,
            )
            tool_test_task = Task(
                description="Test the {tool} tool by using it and validating its output",
                expected_output="A detailed report of the tool's functionality and performance",
                agent=tool_tester,
            )
            return Crew(
                agents=[tool_tester],
                tasks=[tool_test_task],
                process=Process.sequential,
                verbose=True,
            )

        raise ValueError(f"Unknown mode: {mode}")
This way I can:
- Define named sub-sets of tasks.
- Mock results of the preceding tasks (see the sketch just below).
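The exact wiring of mock_tasks isn’t shown in main.py below, but it boils down to something like this sketch: load the raw output saved from an earlier run, key it by the @task method name, and only then kick off the partial crew. The file path is made up for illustration; ideate_free is one of the task methods shown above, and assemble_regular_inputs is defined in main.py just below.

# A sketch of priming a partial crew with a previously saved task output
publishing = Publishing()

# 'current_run/ideate_free.md' is a hypothetical path; use wherever you store raw outputs
with open('current_run/ideate_free.md', 'r') as f:
    ideate_result = f.read()

# Keys are @task method names on Publishing; mock_tasks attaches the saved text as their output
publishing.mock_tasks({'ideate_free': ideate_result})

publishing.crew(mode='research_idea').kickoff(inputs=assemble_regular_inputs())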
And here’s how I call that crew in main.py:
def assemble_regular_inputs():
    with open('current_run/input.md', 'r') as f:
        input = f.read()

    return {
        # Stuff for very different tasks is mixed together,
        # because CrewAI interpolates all the task strings, not only the used ones.
        'session_date': '2025-01-12',
        'task': 'Inspect content plan, suggest ideas to fill the gaps',
        'calendar_principles': get_publication_calendar_principles(),
        'input': input,
    }


def run():
    """
    Run the crew.
    """
    agentops.init()

    cmd = os.getenv('RUN_MODE')
    if not cmd:
        inputs = assemble_regular_inputs()
        Publishing().crew().kickoff(inputs=inputs)
        return

    if cmd == 'ideate':
        inputs = assemble_regular_inputs()
        Publishing().crew(mode='ideate').kickoff(inputs=inputs)
        return

    ...

    if cmd.startswith('test_tool:'):
        tool_name = cmd.split(':', 1)[1]
        inputs = {'tool': tool_name}
        Publishing().crew(mode='test_tool').kickoff(inputs=inputs)
        return

    raise ValueError(f"Unknown command: {cmd}")
This post is rather a heads-up that CrewAI won’t make your crew test-friendly in the traditional sense. It comes with its own “test” concept, which is fantastic, but it’s not a replacement for traditional test engineering.
The end of the series
As you can see, I’ve dived deep into CrewAI, but I’ve also found some problems I’d prefer not to have.
I plan to continue my AI-agents journey along the following paths:
- Keep using CrewAI for at least one of my projects.
- Try an architecture where a CrewAI crew is a node in LangGraph.
- Study new frameworks.
Happy building!