Akshay Ballal

Posted on Sep 29, 2024 • Originally published at akshaymakes.com

Build ReAct Agents using SLMs from Scratch

#llm #ai #python #machinelearning

In this post, I will demonstrate how to build a function-calling agent using Small Language Models (SLMs). Large Language Models (LLMs) are powerful, but they can be resource-intensive and slow. SLMs are more lightweight, which makes them a good fit for limited hardware or for use cases where low latency is critical, and they pair well with LoRA adapters for efficient fine-tuning.

By using an SLM with a LoRA adapter, we can separate reasoning from function execution. The model generates function calls with the adapter enabled and handles reasoning or thinking tasks with the adapter disabled. Because both modes share the same base weights, only one model has to be loaded into memory. This makes it possible to build applications like function-calling agents without the infrastructure required for larger models.

Moreover, SLMs can be easily scaled to run on devices with limited computational power, making them ideal for production environments where cost and efficiency are prioritized. In this example, we'll use a custom model trained on the Salesforce/xlam-function-calling-60k dataset via Unsloth, demonstrating how you can utilize SLMs to create high-performance, low-resource AI applications.

Additionally, the approach discussed here carries over to more powerful models, such as Llama 3.1 8B, which has built-in function-calling capabilities, offering a smooth transition when a larger model is necessary.

1. Initialize the Model and Tokenizer with Unsloth

We’ll first set up the model and tokenizer using Unsloth. Here, we define a max sequence length of 2048, though this can be adjusted. We also enable 4-bit quantization to reduce memory usage, ideal for running models on lower-memory hardware.

from unsloth import FastLanguageModel
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "akshayballal/phi-3.5-mini-xlam-function-calling",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model);
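
To get a feel for what 4-bit quantization saves, here is a rough back-of-the-envelope sketch. It assumes the base model is Phi-3.5-mini with roughly 3.8 billion parameters, and it only counts the weights; activations, the KV cache, and quantization overhead are ignored, so real usage will be higher. The helper name weight_memory_gib is hypothetical.

```python
# Rough weight-memory estimate. Assumption: ~3.8B parameters (Phi-3.5-mini).
# Ignores activations, KV cache, and quantization overhead.
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

PHI_35_MINI_PARAMS = 3.8e9  # approximate parameter count (assumption)

fp16_gib = weight_memory_gib(PHI_35_MINI_PARAMS, 16)
int4_gib = weight_memory_gib(PHI_35_MINI_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what makes a model of this size practical on a single consumer or free-tier GPU.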

2. Implement Stopping Criteria for Controlled Generation

To ensure that the agent pauses after requesting a function call, we define a stopping criterion. It halts generation when the model outputs the keyword "PAUSE", allowing the agent to fetch the result of the function call. The criterion matches on token IDs, so the ID in stop_ids must correspond to the keyword in the tokenizer being used.

from transformers import StoppingCriteria, StoppingCriteriaList
import torch

class KeywordsStoppingCriteria(StoppingCriteria):
    def __init__(self, keywords_ids: list):
        self.keywords = keywords_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recently generated token is one of the stop IDs
        return input_ids[0][-1].item() in self.keywords

stop_ids = [17171]  # token ID that triggers the stop; specific to this tokenizer
stop_criteria = KeywordsStoppingCriteria(stop_ids)
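
Token IDs are tokenizer-specific, and the same word can be split into different tokens depending on the surrounding text, so a token-level check may occasionally miss the keyword. As a possible safety net, the decoded text can also be trimmed at the keyword. This is a minimal sketch, not part of the original notebook; truncate_at_keyword is a hypothetical helper.

```python
def truncate_at_keyword(text: str, keyword: str = "PAUSE") -> str:
    """Cut the decoded text right after the first occurrence of keyword."""
    index = text.find(keyword)
    if index == -1:
        return text  # keyword never generated; keep the text unchanged
    return text[: index + len(keyword)]

# Example: the model kept generating past PAUSE and invented an observation
sample = (
    "Thought: I need the square.\n"
    "Action: square_number: 32\n"
    "PAUSE\n"
    "Observation: made up by the model"
)
trimmed = truncate_at_keyword(sample)
print(trimmed)
```

Applying this to the decoded reply would drop any observation the model invents on its own before the real tool result is appended.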

3. Define the Tools for Function Calling

Next, we define the functions the agent will use during execution. These Python functions will act as "tools" that the agent can call. The return type must be clear, and the function should include a descriptive docstring, as the agent will rely on this to choose the correct tool.

def add_numbers(a: int, b: int) -> int:
    """
    This function takes two integers and returns their sum.

    Parameters:
    a (int): The first integer to add.
    b (int): The second integer to add.
    """
    return a + b 

def square_number(a: int) -> int:
    """
    This function takes an integer and returns its square.

    Parameters:
    a (int): The integer to be squared.
    """
    return a * a

def square_root_number(a: int) -> float:
    """
    This function takes an integer and returns its square root.

    Parameters:
    a (int): The integer to calculate the square root of.
    """
    return a ** 0.5

4. Generate Tool Descriptions for the Agent

These function descriptions will be structured into a list of dictionaries. The agent will use these to understand the available tools and their parameters.

tools = [add_numbers, square_number, square_root_number]  # the tools the agent may call

tool_descriptions = []
for tool in tools:
    spec = {
        "name": tool.__name__,
        "description": tool.__doc__.strip(),
        "parameters": [
            {
                "name": param,
                "type": arg.__name__ if hasattr(arg, '__name__') else str(arg),
            } for param, arg in tool.__annotations__.items() if param != 'return'
        ]
    }
    tool_descriptions.append(spec)
tool_descriptions

This is what the output looks like:

[{'name': 'add_numbers',
  'description': 'This function takes two integers and returns their sum.\n\n    Parameters:\n    a (int): The first integer to add.\n    b (int): The second integer to add.',
  'parameters': [{'name': 'a', 'type': 'int'}, {'name': 'b', 'type': 'int'}]},
 {'name': 'square_number',
  'description': 'This function takes an integer and returns its square.\n\n    Parameters:\n    a (int): The integer to be squared.',
  'parameters': [{'name': 'a', 'type': 'int'}]},
 {'name': 'square_root_number',
  'description': 'This function takes an integer and returns its square root.\n\n    Parameters:\n    a (int): The integer to calculate the square root of.',
  'parameters': [{'name': 'a', 'type': 'int'}]}]
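
Note that the descriptions above keep the docstring's leading indentation. One possible variation, sketched here under the assumption that cleaner descriptions are desirable, uses the standard inspect module: inspect.getdoc removes the common indentation, and inspect.signature also exposes default values. The names describe_tool and power are hypothetical and only serve as an illustration.

```python
import inspect

def describe_tool(tool) -> dict:
    """Build a tool description from a function's signature and docstring."""
    parameters = []
    for name, param in inspect.signature(tool).parameters.items():
        annotation = param.annotation
        if annotation is inspect.Parameter.empty:
            type_name = "any"  # no type hint given
        else:
            type_name = getattr(annotation, "__name__", str(annotation))
        entry = {"name": name, "type": type_name}
        if param.default is not inspect.Parameter.empty:
            entry["default"] = param.default
        parameters.append(entry)
    return {
        "name": tool.__name__,
        "description": inspect.getdoc(tool) or "",  # indentation-free docstring
        "parameters": parameters,
    }

def power(base: int, exponent: int = 2) -> int:
    """
    Raise base to the given exponent.

    Parameters:
    base (int): The number to raise.
    exponent (int): The power to raise it to.
    """
    return base ** exponent

spec = describe_tool(power)
print(spec)
```

Whether the extra "default" field helps depends on what the function-calling model saw during fine-tuning, so treat it as optional.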

5. Create the Agent Class

We then create the agent class. It takes the system prompt, the function-calling prompt, and the tools as input, keeps the message history, and returns the agent's responses.

__call__ is invoked when the agent is called with a message. It appends the message to the message history, runs execute, stores the reply, and returns it.
execute generates the agent's reasoning step. It runs the model with the LoRA adapter disabled.
function_call turns an action into a tool call. It runs the model with the function-calling adapter enabled, parses the generated call, and runs the matching tool.
run_tool looks up a tool by name and calls it with the parsed arguments.

import ast

class Agent:
    def __init__(
        self, system: str = "", function_calling_prompt: str = "", tools=None
    ) -> None:
        self.system = system
        # Avoid a mutable default argument shared between instances
        self.tools = tools if tools is not None else []
        self.function_calling_prompt = function_calling_prompt
        self.messages: list = []
        if self.system:
            self.messages.append({"role": "system", "content": system})

    def __call__(self, message=""):
        if message:
            self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        with model.disable_adapter():  # disable the adapter for thinking and reasoning
            inputs = tokenizer.apply_chat_template(
                self.messages,
                tokenize=True,
                add_generation_prompt=True,
                return_tensors="pt",
            )
            output = model.generate(
                input_ids=inputs,
                max_new_tokens=128,
                stopping_criteria=StoppingCriteriaList([stop_criteria]),
            )
            return tokenizer.decode(
                output[0][inputs.shape[-1] :], skip_special_tokens=True
            )

    def function_call(self, message):
        inputs = tokenizer.apply_chat_template(
            [
                {
                    "role": "user",
                    "content": self.function_calling_prompt.format(
                        tool_descriptions=tool_descriptions, query=message
                    ),
                }
            ],
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        )
        # Greedy decoding keeps the generated function call deterministic
        output = model.generate(input_ids=inputs, max_new_tokens=128, do_sample=False)
        prompt_length = inputs.shape[-1]

        # The function-calling model returns a list of calls; take the first one as a dictionary
        answer = ast.literal_eval(
            tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
        )[0]
        print(answer)
        tool_output = self.run_tool(answer["name"], **answer["arguments"])
        return tool_output

    def run_tool(self, name, *args, **kwargs):
        for tool in self.tools:
            if tool.__name__ == name:
                return tool(*args, **kwargs)
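
Small models do not always emit a well-formed call, and ast.literal_eval raises on anything that is not a valid Python literal. The following sketch shows one way the parsing step could be made more defensive. It assumes the output format used above (a list of dictionaries with "name" and "arguments" keys); parse_tool_call is a hypothetical helper, not part of the original notebook.

```python
import ast

def parse_tool_call(raw: str) -> dict:
    """Parse output like "[{'name': ..., 'arguments': {...}}]" into a single call."""
    try:
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError) as error:
        raise ValueError(f"Model output is not a valid Python literal: {raw!r}") from error

    if isinstance(parsed, dict):
        parsed = [parsed]  # tolerate a single call that is not wrapped in a list
    if not isinstance(parsed, list) or not parsed:
        raise ValueError(f"Expected a non-empty list of tool calls, got: {raw!r}")

    call = parsed[0]
    if not isinstance(call, dict) or "name" not in call:
        raise ValueError(f"Tool call is missing a 'name' field: {call!r}")

    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError(f"'arguments' must be a dictionary, got: {arguments!r}")
    return {"name": call["name"], "arguments": arguments}

call = parse_tool_call("[{'name': 'add_numbers', 'arguments': {'a': 2, 'b': 3}}]")
print(call)
```

Catching the resulting ValueError in the agent loop would let the agent report the failure back as an observation instead of crashing.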

6. Define System and Function-Calling Prompts

We now define two key prompts:

System Prompt: The core logic for the agent's reasoning and tool use, following the ReAct pattern.
Function-Calling Prompt: This enables function calling by passing the relevant tool descriptions and queries.

system_prompt = f"""
You run in a loop of Thought, Action, PAUSE, Observation.
At the end of the loop you output an Answer
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then return PAUSE.
Observation will be the result of running those actions. Stop when you have the Answer. 
Your available actions are:

{tool_descriptions}

Example session:

Question: What is the mass of Earth times 2?
Thought: I need to find the mass of Earth
Action: get_planet_mass: Earth
PAUSE 

Observation: 5.972e24

Thought: I need to multiply this by 2
Action: calculate: 5.972e24 * 2
PAUSE

Observation: 1.1944e25

If you have the answer, output it as the Answer.

Answer: 1.1944e25
PAUSE
Now it's your turn:
""".strip()

function_calling_prompt = """
You are a helpful assistant. Below are the tools that you have access to.  \n\n### Tools: \n{tool_descriptions} \n\n### Query: \n{query} \n
"""

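To see what the function-calling model actually receives, it can help to render the template once by hand. The sketch below is self-contained and uses a shortened copy of the template with a single example tool. Serializing the tools with json.dumps is an assumption made here for readability (the agent above passes the Python list directly), so check which format matches the fine-tuning data before changing it.

```python
import json

# Shortened copy of the function-calling template, for illustration only
function_calling_template = (
    "You are a helpful assistant. Below are the tools that you have access to.\n\n"
    "### Tools:\n{tool_descriptions}\n\n### Query:\n{query}\n"
)

example_tools = [
    {
        "name": "square_number",
        "description": "This function takes an integer and returns its square.",
        "parameters": [{"name": "a", "type": "int"}],
    }
]

rendered = function_calling_template.format(
    tool_descriptions=json.dumps(example_tools, indent=2),  # assumption: JSON formatting
    query="square_number: 32",
)
print(rendered)
```

Because the template is filled with str.format, any literal curly braces added to it later would need to be doubled.
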
7. Implement the ReAct Loop

Finally, we define the loop that enables the agent to interact with the user, execute the necessary function calls, and return the correct answers.

import re

def loop_agent(agent: Agent, question, max_iterations=5):
    next_prompt = question
    result = ""
    for _ in range(max_iterations):
        result = agent(next_prompt)
        print(result)
        if "Answer:" in result:
            return result

        # re.findall returns a list; use the first action found in the reply
        actions = re.findall(r"Action: (.*)", result)
        if actions:
            tool_output = agent.function_call(actions[0])
            next_prompt = f"Observation: {tool_output}"
            print(next_prompt)
        else:
            next_prompt = "Observation: tool not found"
    return result

agent = Agent(system=system_prompt, function_calling_prompt=function_calling_prompt, tools=tools)

loop_agent(agent, "what is the square root of the difference between 32^2 and 54");
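
The control flow of the loop can be checked without a GPU by replacing the model with scripted replies. The following dry run is only a sketch: fake_agent, fake_function_call, and the scripted replies are hypothetical stand-ins for the real agent and the adapter-based function call.

```python
import re

def square_number(a: int) -> int:
    return a * a

TOOLS = {"square_number": square_number}

# Scripted replies standing in for the model's generations
SCRIPTED_REPLIES = iter([
    "Thought: I need to square 32.\nAction: square_number: 32\nPAUSE",
    "Answer: 1024",
])

def fake_agent(prompt: str) -> str:
    """Return the next scripted reply, ignoring the prompt."""
    return next(SCRIPTED_REPLIES)

def fake_function_call(action: str) -> int:
    """Parse 'tool_name: argument' and run the tool (stand-in for the adapter model)."""
    name, _, argument = action.partition(":")
    return TOOLS[name.strip()](int(argument.strip()))

def dry_run(question: str, max_iterations: int = 5) -> list:
    transcript = []
    next_prompt = question
    for _ in range(max_iterations):
        result = fake_agent(next_prompt)
        transcript.append(result)
        if "Answer:" in result:
            break
        actions = re.findall(r"Action: (.*)", result)
        if actions:
            next_prompt = f"Observation: {fake_function_call(actions[0])}"
        else:
            next_prompt = "Observation: tool not found"
        transcript.append(next_prompt)
    return transcript

transcript = dry_run("What is 32 squared?")
print("\n".join(transcript))
```

The transcript alternates between model replies and observations and ends as soon as a reply contains "Answer:", which mirrors how loop_agent terminates.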

Check out the complete notebook on Colab here.

Conclusion

By following this step-by-step guide, you can create a function-calling agent using a custom model trained with Unsloth and a LoRA adapter. This approach keeps memory use low by serving both reasoning and function execution from a single base model.

Explore further by extending this method to larger models or customizing the functions available to the agent.

Top comments (4)

Akshay Ballal • Sep 29 '24

Let me know in the comments if you need a tutorial on how I trained the model for function calling.

Filip Gore • Sep 29 '24

Yes, I really need some training

Akshay Ballal • Sep 29 '24

That's great. You can email me on arballal95@protonmail.com. I can send you the notebook I used for training.

Filip Gore • Sep 29 '24

Thanks for that! I’ll definitely reach out via email. Feel free to send over the notebook, and I’ll take a look. I’m excited to dive into the training data and see how we can work from there.
