In the previous chapters, we built a basic agent and enhanced it with specific skills via plugins and function calling.
In this third chapter, we will add the ability to inspect and debug, in real time, the interactions between our agent and its plugins.
Why do we need an inspector?
In a basic agent-human interaction, the system receives a set of instructions and a chat history, and creates a reply accordingly. Although it might not be a mundane task, the number of variables is limited. However, when we add skills to our agent in the form of plugins, the interactions become much more complex: we need to review descriptions, arguments, and so on. In those scenarios, it is key to understand how our agent interacts with the different plugins in real time, so we can adjust each plugin until the agent calls it when expected. That is the purpose of the inspector we are going to build in this chapter.
Additionally, it is important to identify the number of tokens our model consumes on each function call. That information is needed to estimate the cost of our agent, something that is critical in a business scenario where usage could grow to hundreds or thousands of users. With these estimations, we can decide whether the current solution is cost-effective, or whether we need to improve our prompts or functions, or simply remove some functionality.
Let's start by understanding what tokens are and how we can extract that usage from Semantic Kernel.
Tokens
A Large Language Model decomposes text into tokens to analyze its semantics and the connections between those tokens. Put naively, tokens are how the models see the world. If you want a deeper understanding of how these models are built, decompose text and generate content, there is plenty of literature out there.
Model providers, such as OpenAI or Anthropic, distinguish between input and output tokens:
- Input tokens (aka prompt tokens) are those sent to the model on each call. In a real scenario, the system prompt, chat history and other data, such as function descriptions, are part of the input tokens.
- Output tokens (aka completion tokens) are those generated by the model based on the input tokens. These tokens usually cost ~4 times more than input tokens.
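To make this concrete, here is a minimal sketch of how the cost of a single call could be estimated from its token counts. The prices per million tokens below are made-up placeholders, not real provider pricing, so replace them with the values from your provider's price list.
import math

# Hypothetical prices per 1M tokens -- placeholders, not real provider pricing
INPUT_PRICE_PER_1M = 2.50
OUTPUT_PRICE_PER_1M = 10.00  # output tokens are typically ~4x more expensive

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Convert token counts into an approximate cost in dollars
    return (prompt_tokens * INPUT_PRICE_PER_1M + completion_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

# Example: a call with 1,200 input tokens and 300 output tokens
print(f"${estimate_cost(1200, 300):.6f}")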
We can easily check in many online tools, like this tool from Hugging Face, how text is converted into tokens and compare the differences between models. For each model, the decomposition is defined by the encoding it uses. For example, the gpt-4o family uses the o200k_base encoding, while gpt-3.5-turbo or text-embedding-ada-002 use cl100k_base.
Furthermore, it is possible to count the number of tokens directly in code. In Python, we can use the well-known tiktoken library from OpenAI.
import tiktoken

# Get the encoder for a specific model
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")

# Decompose text into tokens
tokens = encoder.encode('Some text here')

# Calculate the number of tokens
print(f'Number of tokens: {len(tokens)}')
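As a quick check of the encoding differences mentioned above, a small sketch like the following encodes the same sentence with both encodings and prints the resulting token counts:
import tiktoken

text = "Semantic Kernel makes it easy to add plugins to an agent."

# o200k_base is used by the gpt-4o family, cl100k_base by gpt-3.5-turbo and text-embedding-ada-002
for encoding_name in ("o200k_base", "cl100k_base"):
    encoding = tiktoken.get_encoding(encoding_name)
    print(f"{encoding_name}: {len(encoding.encode(text))} tokens")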
In some scenarios, it might be useful to count the number of tokens before actually calling the agent. For example, if you need to calculate multiple embeddings, you can determine the preferred batch size by counting the number of tokens per embedding and taking the model's limit into account. Other scenarios might be keeping the chat history below a threshold to control cost (sketched below), or calculating the optimal number of samples to provide in dynamic few-shot prompting.
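A minimal sketch of the "keep the chat history below a threshold" idea could look like this. It assumes a plain list of message strings and uses tiktoken to count tokens; the budget value is an arbitrary example.
import tiktoken

def trim_history(messages: list[str], max_tokens: int = 3000) -> list[str]:
    """Keep only the most recent messages that fit into the token budget."""
    encoder = tiktoken.get_encoding("cl100k_base")
    trimmed: list[str] = []
    used = 0
    # Walk the history from newest to oldest and stop when the budget is exceeded
    for message in reversed(messages):
        tokens = len(encoder.encode(message))
        if used + tokens > max_tokens:
            break
        trimmed.insert(0, message)
        used += tokens
    return trimmed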
Tokens in Semantic Kernel
In Semantic Kernel, we can easily get the tokens used on each agent call. Let's write some code to gather and track that information.
First, we create a TokenUsage class that we will use to collect the number of input (prompt) and output (completion) tokens per invocation:
class TokenUsage:
    def __init__(self, input_t: int, output_t: int):
        # Prompt tokens sent to the model
        self.input_tokens = input_t
        # Completion tokens generated by the model
        self.output_tokens = output_t
The token usage is part of the metadata of the messages in the ChatHistory:
for message in history.messages:
usage = TokenUsage(
input_t=message.metadata['usage'].prompt_tokens,
output_t=message.metadata['usage'].completion_tokens
)
Alternatively, we can track the usage on each agent call, but the response will only hold the last reply from the agent. You might need to inspect the ChatHistory that has been automatically updated by the call to invoke:
async for response in self.agent.invoke(self.history):
self.history.add_message(response)
usage = TokenUsage(
input_t=response.metadata['usage'].prompt_tokens,
output_t=response.metadata['usage'].completion_tokens
)
Agent interactions
Our existing Librarian agent supports two types of interactions:
- Non-function call: the model uses the chat history and the system prompt (or instructions) to reply directly to the user. These interactions happen when the user asks for something that is not directly related to any function. For example: "Hello, how are you today?"
- Function call: the model invokes one or more functions to generate the response to the user. For example: "Find some books about Harry Potter".
In Semantic Kernel, each of these interaction types is mapped to a specific content class. For a simple user-agent interaction (non-function call), the message in the history is an instance of TextContent. For a user-agent-plugin interaction (function call), there are two different items in the history: first, a FunctionCallContent that includes the function that has been called and its arguments; second, a FunctionResultContent that holds the result of the function call.
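Before building the full inspector, a quick way to see these content types in action is simply to print whatever is in the history. A rough sketch, assuming a populated history object:
from semantic_kernel.contents.text_content import TextContent
from semantic_kernel.contents.function_call_content import FunctionCallContent
from semantic_kernel.contents.function_result_content import FunctionResultContent

# Walk every item of every message and print its kind
for message in history.messages:
    for item in message.items:
        if isinstance(item, FunctionCallContent):
            print(f"[{message.role}] call: {item.plugin_name}.{item.function_name}({item.arguments})")
        elif isinstance(item, FunctionResultContent):
            print(f"[{message.role}] result: {item.result}")
        elif isinstance(item, TextContent):
            print(f"[{message.role}] text: {item.text}")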
Now, to apply these concepts to our agent, we first create some classes to collect the information we want to show in the chatbot's inspector for each type of call:
from abc import ABC


class AgentInvokation(ABC):
    # Common fields for every record shown in the inspector
    role: str
    usage: TokenUsage


class AgentTextInvokation(AgentInvokation):
    # A plain text message (user, system or agent reply)
    text: str

    def __init__(self, role: str, text: str, usage: TokenUsage):
        self.role = role
        self.text = text
        self.usage = usage


class AgentFunctionInvokation(AgentInvokation):
    # A function (plugin) call made by the agent
    plugin_name: str
    function_name: str
    function_result: str
    function_arguments: str

    def __init__(self, role: str, plugin_name: str, function_name: str, arguments: str, usage: TokenUsage):
        self.role = role
        self.plugin_name = plugin_name
        self.function_name = function_name
        self.function_arguments = arguments
        self.usage = usage
        # The result is filled in later, when the FunctionResultContent arrives
        self.function_result = ''

    def add_invokation_result(self, result: str):
        self.function_result = result
Then, we can create a method that maps the current ChatHistory from Semantic Kernel into the classes we have created:
from agent.agent_record import AgentInvokation, AgentTextInvokation, AgentFunctionInvokation, TokenUsage
from semantic_kernel.contents.chat_message_content import ChatMessageContent
from semantic_kernel.contents.text_content import TextContent
from semantic_kernel.contents.function_call_content import FunctionCallContent
from semantic_kernel.contents.function_result_content import FunctionResultContent


class LibrarianAssistant:

    def invokations(self) -> list[AgentInvokation]:
        # The system prompt (instructions) is always the first record, with no usage attached
        invokations = [
            AgentTextInvokation(role='AuthorRole.SYSTEM', text=self.agent.instructions, usage=TokenUsage(0, 0))
        ]
        for message in self.history.messages:
            # Get the role to differentiate between user and agent
            role = message.role
            # Retrieve the usage from the message metadata
            usage = self.__get_usage(message)
            for item in message.items:
                if isinstance(item, TextContent):
                    # For TextContent, we just get the text
                    invokations.append(AgentTextInvokation(role, item.text, usage))
                elif isinstance(item, FunctionCallContent):
                    # For FunctionCallContent, we get the name of the plugin, the function and its arguments
                    invokations.append(AgentFunctionInvokation(role, item.plugin_name, item.function_name, item.arguments, usage))
                elif isinstance(item, FunctionResultContent) and isinstance(invokations[-1], AgentFunctionInvokation):
                    # For FunctionResultContent, we update the last record with the function result
                    invokations[-1].add_invokation_result(item.result)
        return invokations

    def __get_usage(self, message: ChatMessageContent) -> TokenUsage:
        # If the usage is not present in the metadata, return 0
        if 'usage' in message.metadata:
            return TokenUsage(message.metadata['usage'].prompt_tokens, message.metadata['usage'].completion_tokens)
        return TokenUsage(0, 0)
Now we have all the pieces to present the information to the user however we prefer. In my sample chatbot, I decided to use two different tabs: one for the standard chatbot experience, and another with the details about the function calls and tokens. In the latter, the messages are split into four types: user messages, system prompt, agent text replies and agent function calls (or tools).
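As a reference, this is a rough sketch of how the function-call tab could render the collected records as plain text; the actual presentation in your chatbot UI is up to you.
def render_inspector(invokations: list[AgentInvokation]) -> str:
    # Build one line per record, with the token usage appended at the end
    lines = []
    for record in invokations:
        tokens = f"{record.usage.input_tokens} in / {record.usage.output_tokens} out"
        if isinstance(record, AgentFunctionInvokation):
            lines.append(
                f"[{record.role}] {record.plugin_name}.{record.function_name}"
                f"({record.function_arguments}) -> {record.function_result} ({tokens})"
            )
        else:
            lines.append(f"[{record.role}] {record.text} ({tokens})")
    return "\n".join(lines)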
Summary
In this chapter, we have not added any functionality to the agent itself. However, we have enhanced our chatbot with a real-time inspector, so it is easy to see how our agent interacts with the different plugins and to estimate token usage.
Remember that all the code is available in my GitHub repository PyChatbot for Semantic Kernel.
In the next chapter, we will get back to the agent to add voice capabilities, such as speech recognition and text-to-speech.