In the previous chapters, we built a basic agent and enhanced it with specific skills via plugins and function calling.
In this third chapter, we will add the ability to inspect and debug, in real time, the interactions between our agent and its plugins.
Why do we need an inspector?
In a basic agent-human interaction, the system receives a set of instructions and a chat history, and creates a reply accordingly. Although it might not be a mundane task, the number of variables is limited. However, when we add skills to our agent in the form of plugins, the interactions become much more complex: we need to review descriptions, arguments, and so on. In those scenarios, it is key to understand how our agent interacts with the different plugins in real time, so we can adjust each plugin until the agent calls it when expected. That is the purpose of the inspector we are going to build in this chapter.
Additionally, it is important to identify the number of tokens our model consumes on each function call. That information is needed to estimate the cost of our agent, something that is critical in a business scenario where usage could grow to hundreds or thousands of users. With these estimations, we can decide whether the current solution is cost-effective, or whether we need to improve our prompts or functions, or simply remove some functionality.
Let's start by understanding what tokens are and how we can extract that usage from Semantic Kernel.
Tokens
A Large Language Model decomposes text into tokens to analyze its semantics and the connections between those tokens. Put naively, tokens are how the models see the world. If you want a deeper understanding of how these models are built, decompose text and generate content, there is plenty of literature out there.
Model providers, such as OpenAI or Anthropic, distinguish between input and output tokens:
- Input tokens (aka prompt tokens) are those sent to the model on each call. In a real scenario, the system prompt, chat history and other data, such as function descriptions, are part of the input tokens.
- Output tokens (aka completion tokens) are those generated by the model based on the input tokens. These tokens usually cost ~4 times more than input tokens.
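To make this concrete, here is a minimal sketch of how the cost of a single call could be estimated from its token counts. The prices per million tokens below are made-up placeholders, not real provider pricing, so replace them with the values from your provider's price list.
import math

# Hypothetical prices per 1M tokens -- placeholders, not real provider pricing
INPUT_PRICE_PER_1M = 2.50
OUTPUT_PRICE_PER_1M = 10.00  # output tokens are typically ~4x more expensive

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Convert token counts into an approximate cost in dollars
    return (prompt_tokens * INPUT_PRICE_PER_1M + completion_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

# Example: a call with 1,200 input tokens and 300 output tokens
print(f"${estimate_cost(1200, 300):.6f}")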
We can easily check in many online tools, like this tool from Hugging Face, how text is converted into tokens and compare the differences between models. For each model, the decomposition is defined by the encoding it uses. For example, the gpt-4o family uses the o200k_base encoding, while gpt-3.5-turbo or text-embedding-ada-002 use cl100k_base.
Furthermore, it is possible to count the number of tokens directly in code. In Python, we can use the well-known tiktoken library from OpenAI.
import tiktoken

# Get the encoder for a specific model
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")

# Decompose text into tokens
tokens = encoder.encode('Some text here')

# Calculate the number of tokens
print(f'Number of tokens: {len(tokens)}')
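As a quick check of the encoding differences mentioned above, a small sketch like the following encodes the same sentence with both encodings and prints the resulting token counts:
import tiktoken

text = "Semantic Kernel makes it easy to add plugins to an agent."

# o200k_base is used by the gpt-4o family, cl100k_base by gpt-3.5-turbo and text-embedding-ada-002
for encoding_name in ("o200k_base", "cl100k_base"):
    encoding = tiktoken.get_encoding(encoding_name)
    print(f"{encoding_name}: {len(encoding.encode(text))} tokens")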
In some scenarios, it might be useful to count the number of tokens before actually calling the agent. For example, if you need to calculate multiple embeddings, you can determine the preferred batch size by counting the number of tokens per embedding and taking the model's limit into account. Other scenarios might be keeping the chat history below a threshold to control cost (sketched below), or calculating the optimal number of samples to provide in dynamic few-shot prompting.
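A minimal sketch of the "keep the chat history below a threshold" idea could look like this. It assumes a plain list of message strings and uses tiktoken to count tokens; the budget value is an arbitrary example.
import tiktoken

def trim_history(messages: list[str], max_tokens: int = 3000) -> list[str]:
    """Keep only the most recent messages that fit into the token budget."""
    encoder = tiktoken.get_encoding("cl100k_base")
    trimmed: list[str] = []
    used = 0
    # Walk the history from newest to oldest and stop when the budget is exceeded
    for message in reversed(messages):
        tokens = len(encoder.encode(message))
        if used + tokens > max_tokens:
            break
        trimmed.insert(0, message)
        used += tokens
    return trimmed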
Tokens in Semantic Kernel
In Semantic Kernel, we can easily get the tokens used on each agent call. Let's write some code to gather and track that information.
First, we create a TokenUsage class that we will use to collect the number of input (prompt) and output (completion) tokens per invocation:
class TokenUsage:
    def __init__(self, input_t: int, output_t: int):
        # Prompt tokens sent to the model
        self.input_tokens = input_t
        # Completion tokens generated by the model
        self.output_tokens = output_t
The token usage is part of the metadata of the messages in the ChatHistory:
for message in history.messages:
usage = TokenUsage(
input_t=message.metadata['usage'].prompt_tokens,
output_t=message.metadata['usage'].completion_tokens
)
Alternatively, we can track the usage on each agent call, but the response will only hold the last reply from the agent. You might need to inspect the ChatHistory that has been automatically updated by the call to invoke:
async for response in self.agent.invoke(self.history):
self.history.add_message(response)
usage = TokenUsage(
input_t=response.metadata['usage'].prompt_tokens,
output_t=response.metadata['usage'].completion_tokens
)
Agent interactions
Our existing Librarian agent supports two types of interactions:
- Non-function call: the model uses the chat history and the system prompt (or instructions) to reply directly to the user. These interactions happen when the user asks for something that is not directly related to any function. For example: "Hello, how are you today?"
- Function call: the model invokes one or more functions to generate the response to the user. For example: "Find some books about Harry Potter".
In Semantic Kernel, each of these interaction types is mapped to a specific content class. For a simple user-agent interaction (non-function call), the message in the history is an instance of TextContent. For a user-agent-plugin interaction (function call), there are two different items in the history: first, a FunctionCallContent that includes the function that has been called and its arguments; second, a FunctionResultContent that holds the result of the function call.
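Before building the full inspector, a quick way to see these content types in action is simply to print whatever is in the history. A rough sketch, assuming a populated history object:
from semantic_kernel.contents.text_content import TextContent
from semantic_kernel.contents.function_call_content import FunctionCallContent
from semantic_kernel.contents.function_result_content import FunctionResultContent

# Walk every item of every message and print its kind
for message in history.messages:
    for item in message.items:
        if isinstance(item, FunctionCallContent):
            print(f"[{message.role}] call: {item.plugin_name}.{item.function_name}({item.arguments})")
        elif isinstance(item, FunctionResultContent):
            print(f"[{message.role}] result: {item.result}")
        elif isinstance(item, TextContent):
            print(f"[{message.role}] text: {item.text}")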
Now, to apply these concepts to our agent, we first create some classes to collect the information we want to show in the chatbot's inspector for each type of call:
from abc import ABC


class AgentInvokation(ABC):
    # Common fields for every record shown in the inspector
    role: str
    usage: TokenUsage


class AgentTextInvokation(AgentInvokation):
    # A plain text message (user, system or agent reply)
    text: str

    def __init__(self, role: str, text: str, usage: TokenUsage):
        self.role = role
        self.text = text
        self.usage = usage


class AgentFunctionInvokation(AgentInvokation):
    # A function (plugin) call made by the agent
    plugin_name: str
    function_name: str
    function_result: str
    function_arguments: str

    def __init__(self, role: str, plugin_name: str, function_name: str, arguments: str, usage: TokenUsage):
        self.role = role
        self.plugin_name = plugin_name
        self.function_name = function_name
        self.function_arguments = arguments
        self.usage = usage
        # The result is filled in later, when the FunctionResultContent arrives
        self.function_result = ''

    def add_invokation_result(self, result: str):
        self.function_result = result
Then, we can create a method that maps the current ChatHistory from Semantic Kernel into the classes we have created:
from agent.agent_record import AgentInvokation, AgentTextInvokation, AgentFunctionInvokation, TokenUsage
from semantic_kernel.contents.chat_message_content import ChatMessageContent
from semantic_kernel.contents.text_content import TextContent
from semantic_kernel.contents.function_call_content import FunctionCallContent
from semantic_kernel.contents.function_result_content import FunctionResultContent


class LibrarianAssistant:

    def invokations(self) -> list[AgentInvokation]:
        # The system prompt (instructions) is always the first record, with no usage attached
        invokations = [
            AgentTextInvokation(role='AuthorRole.SYSTEM', text=self.agent.instructions, usage=TokenUsage(0, 0))
        ]
        for message in self.history.messages:
            # Get the role to differentiate between user and agent
            role = message.role
            # Retrieve the usage from the message metadata
            usage = self.__get_usage(message)
            for item in message.items:
                if isinstance(item, TextContent):
                    # For TextContent, we just get the text
                    invokations.append(AgentTextInvokation(role, item.text, usage))
                elif isinstance(item, FunctionCallContent):
                    # For FunctionCallContent, we get the name of the plugin, the function and its arguments
                    invokations.append(AgentFunctionInvokation(role, item.plugin_name, item.function_name, item.arguments, usage))
                elif isinstance(item, FunctionResultContent) and isinstance(invokations[-1], AgentFunctionInvokation):
                    # For FunctionResultContent, we update the last record with the function result
                    invokations[-1].add_invokation_result(item.result)
        return invokations

    def __get_usage(self, message: ChatMessageContent) -> TokenUsage:
        # If the usage is not present in the metadata, return 0
        if 'usage' in message.metadata:
            return TokenUsage(message.metadata['usage'].prompt_tokens, message.metadata['usage'].completion_tokens)
        return TokenUsage(0, 0)
Now we have all the pieces to present the information to the user however we prefer. In my sample chatbot, I decided to use two different tabs: one for the standard chatbot experience, and another with the details about the function calls and tokens. In the latter, the messages are split into four types: user messages, system prompt, agent text replies and agent function calls (or tools).
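As a reference, this is a rough sketch of how the function-call tab could render the collected records as plain text; the actual presentation in your chatbot UI is up to you.
def render_inspector(invokations: list[AgentInvokation]) -> str:
    # Build one line per record, with the token usage appended at the end
    lines = []
    for record in invokations:
        tokens = f"{record.usage.input_tokens} in / {record.usage.output_tokens} out"
        if isinstance(record, AgentFunctionInvokation):
            lines.append(
                f"[{record.role}] {record.plugin_name}.{record.function_name}"
                f"({record.function_arguments}) -> {record.function_result} ({tokens})"
            )
        else:
            lines.append(f"[{record.role}] {record.text} ({tokens})")
    return "\n".join(lines)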
Summary
In this chapter, we have not added any functionality to the agent itself. However, we have enhanced our chatbot with a real-time inspector, so it is easy to see how our agent interacts with the different plugins and to estimate token usage.
Remember that all the code is available in my GitHub repository PyChatbot for Semantic Kernel.
In the next chapter, we will get back to the agent to add voice capabilities, such as speech recognition and text-to-speech.