Kevin Naidoo

Posted on • Originally published at kevincoder.co.za

Demystifying the dark arts of AI Agents 🤖

Everyone is talking about AI agents: how great they are, and how they'll replace developers.

Both hold some truth, but there's no need for concern. I am not an AI expert, but over the last few years I have worked with almost every model out there, from GPT-4o to Sonnet 3.5 to even DeepSeek R1.

I have built complex workflows, trained classifiers, built simple chatbots, voice AI solutions, and much more. Throughout this experience, I have learned about the limitations of AI and the difference between actual functionality versus marketing hype.

In this article, we'll go on a mini journey inside the deep underbelly of the AI world and explore where the technology is today and how it'll impact you as a developer in the coming years. And yes, we'll also build an agent from scratch in Python.

💡 If you are an eager beaver and just want to see the code, click here.

What is an AI agent anyway?

This term is thrown around all the time as if it's something out of this world. No, it ain't! An agent is just a program; it just functions a tad differently.

Agents are just orchestration programs that can parse text (or other media formats), analyze the contextual meaning of that text using LLMs, and then execute tasks based on whatever requirement is set out in that text.

To clarify, take a spec for building an exchange rate calculator. The ticket might look like:

Design a program that will take in a dollar amount 
and generate the relevant Rand or 
Euro amount based on the current exchange rate. 


As a programmer, you will now build an interface with an input box. That input box will have some validation to ensure the end user only enters a valid number.

If they try something like this:
"What is $500 in Euros?"

The validation will kick in and alert the user with a message: "Sorry, your input is invalid. Please only enter numbers"

As you can see, a conventional program is stricter: types are enforced and users must input data in a structured way. Agents, on the other hand, don't have this limitation; you can simply chat with the agent and it'll parse the natural language into something the computer can understand.
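To make the contrast concrete, here is a minimal sketch of the conventional, strict path. The function name and messages are made up for illustration:

```python
def convert_input(user_input: str) -> str:
    """Strict validation: accept only a plain number, reject everything else."""
    try:
        amount = float(user_input)
    except ValueError:
        return "Sorry, your input is invalid. Please only enter numbers"
    return f"Got {amount:.2f} dollars"

print(convert_input("500"))                     # → Got 500.00 dollars
print(convert_input("What is $500 in Euros?"))  # rejected outright
```

An agent flips this around: the natural-language version is exactly the kind of input it handles best.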

Instead of the programmer explicitly writing every single pathway a program can take, agents are more flexible in that they do have some level of decision-making, enabling them to dynamically build workflows on the fly.

In our currency conversion program, the programmer would have to build some sort of API integration. When the user clicks calculate, behind the scenes the numerical value entered is sent to a currency conversion API, and then the API responds with the relevant conversion.

The programmer needs to cater for every step in this workflow: take that numerical value and build some kind of JSON object, provide some sort of authentication, make a POST request, get back the response data, parse it, and finally display the result.
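As a rough sketch, here is what just the "build the JSON object and auth" step might look like when written by hand. The endpoint URL, payload shape, and `RATES_API_KEY` environment variable are all assumptions, not a real API:

```python
import json
import os

# Hypothetical exchange-rate API; the URL and payload shape are made up.
API_URL = "https://api.example-rates.com/convert"

def build_conversion_request(amount: float, target: str) -> dict:
    """Explicitly assemble the JSON payload and auth header the API expects."""
    payload = {"amount": amount, "from": "USD", "to": target}
    headers = {
        "Authorization": f"Bearer {os.environ.get('RATES_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    return {"url": API_URL, "headers": headers, "body": json.dumps(payload)}
```

A real implementation would then POST `body` to the endpoint and parse the response; the point is that every one of these steps is spelled out by the programmer.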

In the case of the agent, we don't need to explicitly write this workflow. The agent can build the JSON object all on its own and read auth credentials from your ENV, because it can leverage the coding abilities of modern LLMs.

To give you another practical example:

(Diagram: the agent's execution plan)

In the above diagram, the agent first asks the LLM for a plan of action; next, it works through that plan step by step. At each step, it'll go back to the LLM for any additional information, or it might execute some external task like querying an API.

In the case of this diagram, the user asks the agent to fetch the price of a particular Amazon listing. The agent, in turn, queries the LLM for a list of tasks, which might look like this:

  • Spin up a Python environment.
  • Write a script to visit the product page.
  • Wait for the DOM to load and scrape the HTML.
  • Use a regex or querySelector to find the DOM element with the price.
  • Extract the price and parse it into a float.
  • etc...
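For instance, the "extract the price" step from the list above might boil down to something like this; the HTML snippet and regex are made up for illustration:

```python
import re

def extract_price(html: str) -> float:
    """Find a $-prefixed price in scraped HTML and parse it into a float."""
    match = re.search(r"\$([\d,]+\.\d{2})", html)
    if match is None:
        raise ValueError("no price found in page")
    # Strip thousands separators before converting
    return float(match.group(1).replace(",", ""))

print(extract_price('<span class="a-price">$1,299.99</span>'))  # → 1299.99
```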

When the agent runs the Python script, it may crash. This blocks the agent from executing the remaining steps, so it passes the errors back to the LLM for further diagnosis. This ability to self-correct is both a blessing and a curse; more on that later...

The LLM then returns the fixed code, which the agent executes before continuing with its tasks.
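That feedback loop can be sketched in a few lines. Here `ask_llm` stands in for a real model call and is stubbed out, so this is an illustration rather than a production pattern:

```python
def run_with_self_correction(code: str, ask_llm, max_attempts: int = 3):
    """Run generated code; on failure, hand the error back to the LLM for a fix."""
    last_error = None
    for _ in range(max_attempts):
        scope = {}
        try:
            exec(code, scope)           # run the generated script
            return scope.get("result")  # convention: the script sets `result`
        except Exception as ex:
            last_error = ex
            # Feed the error back and ask for a corrected version
            code = ask_llm(f"This script raised {ex!r}. Return fixed code:\n{code}")
    raise RuntimeError(f"Gave up after {max_attempts} attempts: {last_error}")

# Stub LLM that "fixes" the bug on the first retry
def fake_llm(prompt: str) -> str:
    return "result = 21 * 2"

print(run_with_self_correction("result = 1 / 0", fake_llm))  # → 42
```

Note the `max_attempts` cap: without it, a model that keeps returning broken code would loop forever, which is the curse half of the trade-off.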

In summary: agents can generate code, query LLMs, and even talk to APIs. They are semi-autonomous programs that can function without explicit instructions for every single pathway. Still, as powerful as agents are, they still need you! The programmer.


Similar to how you needed your parents to provide you with a support system, education, clothing, food, and so forth... agents need programmers to build the "Lego blocks" of functionality and define the rules of the environment in which they run.

By "Lego blocks" I mean adapters. While an LLM could generate code for an API integration, this is probably not a good idea, because a) it can hallucinate and just error out half the time, and b) there's no guarantee we can trust the source API.

A better approach would be to build an adapter for the API of your choosing and just provide the agent with some information on how to use that adapter (LLMs usually refer to these as tools).
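A minimal sketch of such an adapter registry follows; the tool name, description, and stubbed exchange rate are all invented for illustration:

```python
# Registry of hand-built adapters ("tools") the agent is allowed to call.
TOOLS = {}

def tool(name: str, description: str):
    """Register a function as a tool, with a description the LLM can read."""
    def register(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register

@tool("usd_to_eur", "Convert a USD amount to EUR via our vetted rates adapter.")
def usd_to_eur(amount: float) -> float:
    rate = 0.92  # stub; a real adapter would call the trusted API here
    return round(amount * rate, 2)

# The agent picks a tool by name instead of generating integration code itself
print(TOOLS["usd_to_eur"]["fn"](100))  # → 92.0
```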

Agents are only as good as the best LLM

Take a set of twins. They are identical in most aspects, including height, body weight, schooling, etc...

One drives a Toyota Corolla and the other a Ferrari. They both have similar driving skills, they swap cars from time to time and they love racing each other.

Regardless of which brother drives which car, the Ferrari always wins.


Why!!?

I won't go into the details, but clearly the Ferrari has a far superior engine, which is why it can beat the Corolla any day without much fuss.

Coming back to agents: agents do not have any reasoning capability of their own; they rely heavily on LLMs.

These LLMs have no measure of quality or completeness. They just generate tokens that mostly look correct, but they don't apply the same level of quality checking that a human would, nor do they have the same level of understanding as a human.

Even worse, as you keep prompting the model, it'll keep giving you some sort of response with confidence even if that response is just, well, plain garbage.

Throw in an agent and you've got big trouble, because the agent's execution plan is built by the LLM: the agent is just going to loop through each step and execute it. Sure, agents can detect problems and re-ask the LLM to fix issues, but since the LLM will most likely hallucinate at some point, the agent could get stuck in an endless loop or, more likely, fail to complete tasks holistically.

This is why agents need to be programmed in a certain way with loads of guardrails. They are far from autonomous. They need constant fine-tuning, and can only be used for very niche tasks.
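One such guardrail is a hard cap on LLM round-trips, so a hallucinating model can't trap the agent in an endless fix-and-retry loop. A minimal sketch, with the class name and limit invented for illustration:

```python
class StepBudget:
    """Hard cap on how many LLM round-trips a single agent run may make."""

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.used = 0

    def spend(self) -> None:
        """Record one round-trip; abort the run once the cap is exceeded."""
        self.used += 1
        if self.used > self.max_steps:
            raise RuntimeError(f"Step budget of {self.max_steps} exhausted")

budget = StepBudget(max_steps=3)
for step in ["plan", "generate code", "fix error"]:
    budget.spend()  # three round-trips: still within budget
# a fourth spend() would raise RuntimeError and halt the agent
```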

Programmers rule the world!


Does Zuck really want to promote agents to mid-level software engineers? Can you imagine the agent getting stuck in an endless loop, constantly adding and deleting lines of code for 4 hours 😂 and then submitting a PR with broken code?

Then another agent merges that code, and 💣 half of Facebook is down for hours! Hey, it's possible; looking at the crappy code that even the so-called best LLMs generate, I wouldn't be surprised.

No! This is just a marketing gimmick. Yes, of course a company like Facebook has some percentage of developers who perform very niche tasks.

These tasks can then be mostly automated with agents, supervised by one or two senior engineers who review and merge code, allowing Facebook to let go of some percentage of its workforce. This is highly possible, most likely already happening, and will continue to happen.

For the rest of us, agents are going to bring on more work. Someone's going to have to build those APIs and write those guardrails to prevent agents from getting stuck in endless loops.

Furthermore, APIs and software are constantly changing, so someone needs to constantly keep the agent code in sync with the rest of the software changes, and of course make those software changes in the first place.

Besides, I'm pretty sure salespeople are just going to sell more, because developers can now automate more things, freeing up their time to work on new features.

Finally, no tech CTO or founder is just going to hand over their AWS keys to an agent and fire all their developers. That's insane, AI is simply not at that level and probably won't reach that level anytime soon.

PS: You should read my earlier article to learn more about "Tunnel Syndrome" here.

AI agents are the future just like the mobile phone


Remember the doomsayers who were saying computers are going to die off because of mobile phones?

We developers are the computers in this era.

You'll find that while a certain group of computer users now opt for a tablet or a mobile phone instead of a laptop, the number of laptop and desktop users is still pretty huge.

This is because mobile phones are mostly communication and consumption devices; they serve a slightly different market, and developers now have to make everything responsive so that we can support more devices.

Similar to how we now have to build two layouts, one for mobile and one for desktop, or add a ton of media queries to make a website responsive, the workload for developers is just going to increase, because every company is going to want to deploy some AI feature or agent.

Soon, agents will be the new chatbot that everybody needs, so we'll be building agents.

Thus agents are here to stay; they may replace some jobs, but ultimately they will create more jobs, because we're going to need more developers to build and maintain these agents.

Let's build an agent from scratch

Okay enough talk, let's roll up our sleeves and actually build an Agent. Before we get started, let me just stress this is a very basic example for educational purposes.

In the real world, you would want to handle errors a lot better and probably break this up into multiple classes.

The agent:

from pydantic import BaseModel
from openai import OpenAI
import time

class ExecutionPlan(BaseModel):
    steps: list[str]

class AIAgent:
    def __init__(self, debug=False):
        self.client = OpenAI()
        self.debug = debug


    def landing_page(self, **kwargs):
        prompt = kwargs['prompt']
        goal = "build a modern landing page"
        execution_plan = self.build_execution_plan(goal, prompt)
        messages = [
            {"role": "system", "content": "You must build a landing page for this user based on their requirement."},
            {"role": "user", "content": prompt},
        ]

        for step in execution_plan.steps:
            if self.debug:
                print(f"Executing step: {step}.")
            messages.append({"role": "system", "content": step})
            response = self.ask_model(messages)
            if self.debug:
               print(f"Done: {response}")
            messages.append({"role": "assistant", "content": response})

        messages.append({"role": "user", "content": "Please respond with the final design for my landing page requirement. Please return only HTML and no extra commentary."})

        landing_page = self.ask_model(messages)

        with open("landing.html", "w+") as f:
            f.write(landing_page)

        return "Done. Please view landing.html in your browser."

    def invoke_action(self, action, **kwargs):
        if not hasattr(self, action):
            return "Sorry, I have no idea what to do with this request!?"

        return getattr(self, action)(**kwargs)

    def ask_model(self, messages, model="gpt-4o-mini", temperature=0.7, response_format=None):
        i = 0
        while i < 3:
            i += 1
            try:
                if response_format is not None:
                    response = self.client.beta.chat.completions.parse(
                        model=model,
                        messages=messages,
                        temperature=temperature,
                        response_format=response_format
                    )
                    return response.choices[0].message.parsed
                else:
                    response = self.client.chat.completions.create(
                        model=model,
                        messages=messages,
                        temperature=temperature
                    )
                    return response.choices[0].message.content

            except Exception as ex:
                print(ex)
                time.sleep(2)

        raise RuntimeError("Model request failed after 3 attempts.")

    def intent_router(self, user_prompt) -> str:
        prompt = """
            You must analyze the following user prompt and determine which action below best describes
            the user's request. Respond only with the action e.g. [email]

            1) [email] - the user wishes to send an email.
            2) [landing_page] - the user wishes to build a landing page.
            3) [book_calendar_date] - the user wishes to book a slot in their calendar.
        """

        intent = self.ask_model([
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_prompt}
        ])

        if self.debug:
            print(f"RAW intent: {intent}")

        return intent.replace("[", "").replace("]", "").strip()

    def build_execution_plan(self, goal, prompt):
        system_prompt = f"""
            Given the current goal:'{goal}' and the user prompt, return a step-by-step execution plan
            for an Agent to work through, so that it can adequately and efficiently fulfill the users request.
        """
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]

        return self.ask_model(messages, response_format=ExecutionPlan)

    def run(self, prompt):
        intent = self.intent_router(prompt)
        if self.debug:
            print(f"Intent for: '{prompt}' === {intent}")

        return self.invoke_action(intent, prompt=prompt)

def main():
    agent = AIAgent(debug=True)
    result = agent.run("I would like to build a landing page for my plumbing business. Please include images for pixels.com and generate copy that makes sense for industry.")

    print(result)

if __name__ == "__main__":
    main()

Let's break this down step by step. The first thing you want to do in the agent process is determine what the user is actually trying to do. This is very similar to routing in a traditional web application, where you map a URL route to a controller.

In the case of our agent, we don't have URLs; instead, the LLM parses the text prompt and determines a keyword (i.e. the "intent") that best describes what the user wants to achieve:


    def intent_router(self, user_prompt) -> str:
        prompt = """
            You must analyze the following user prompt and determine which action below best describes
            the user's request. Respond only with the action e.g. [email]

            1) [email] - the user wishes to send an email.
            2) [landing_page] - the user wishes to build a landing page.
            3) [book_calendar_date] - the user wishes to book a slot in their calendar.
        """

        intent = self.ask_model([
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_prompt}
        ])

        if self.debug:
            print(f"RAW intent: {intent}")

        return intent.replace("[", "").replace("]", "").strip()

Once we know the intent, we can easily determine what method/action to execute:

    def run(self, prompt):
        intent = self.intent_router(prompt)
        if self.debug:
            print(f"Intent for: '{prompt}' === {intent}")

        return self.invoke_action(intent, prompt=prompt)

In our "run" method which is the entry point that starts up the agent, we use the intent to dynamically execute a method on the "AIAgent" class.

Taking a peek inside "invoke_action", you will see that it's basically just checking whether the class has a method named exactly the same as the relevant intent.

    def invoke_action(self, action, **kwargs):
        if not hasattr(self, action):
            return "Sorry,...."

        return getattr(self, action)(**kwargs)

If the method exists, we execute it, passing any extra arguments along via "kwargs".

In my example I am asking the agent to build a landing page, thus the intent is "[landing_page]" and if you look inside the "AIAgent" class you will notice a method with this same name:

    def landing_page(self, **kwargs):
        prompt = kwargs['prompt']
        goal = "build a modern landing page"
        execution_plan = self.build_execution_plan(goal, prompt)
        messages = [
            {"role": "system", "content": "You must build a landing page for this user based on their requirement."},
            {"role": "user", "content": prompt},
        ]

        for step in execution_plan.steps:
            if self.debug:
                print(f"Executing step: {step}.")
            messages.append({"role": "system", "content": step})
            response = self.ask_model(messages)
            if self.debug:
               print(f"Done: {response}")
            messages.append({"role": "assistant", "content": response})

        messages.append({"role": "user", "content": "Please respond with the final design for my landing page requirement. Please return only HTML and no extra commentary."})

        landing_page = self.ask_model(messages)

        with open("landing.html", "w+") as f:
            f.write(landing_page)

        return "Done. Please view landing.html in your browser."

The method starts off by defining a "goal". This is more descriptive than the single-keyword intent and will be used to prompt the model again to get a detailed step-by-step workflow.

execution_plan = self.build_execution_plan(goal, prompt)

Looking inside the execution plan method, we use structured outputs to tell the LLM to return a list of steps in the format of "ExecutionPlan", which is just a plain Pydantic model:

class ExecutionPlan(BaseModel):
    steps: list[str]
...
    def build_execution_plan(self, goal, prompt):
        system_prompt = f"""
            Given the current goal:'{goal}' and the user prompt, return a step-by-step execution plan
            for an Agent to work through, so that it can adequately and efficiently fulfill the users request.
        """
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]

        return self.ask_model(messages, response_format=ExecutionPlan)


Finally, we just loop through all the steps, prompting the model one by one until the final landing page is built.

Conclusion: AI Agents are here to stay

Don't be fooled by big-name CEOs and tech entrepreneurs who spread propaganda to drive more investment into their companies. They use AI agents to stir up market interest and get non-technical people all excited.

This is an age-old marketing trick! These people are obsessed with the bottom line, a sad side-effect of capitalism: greed for growth at any cost is the endgame.

The reality is that the technology, while useful, is over-hyped. It's been a good few years since the first version of ChatGPT was released, and yet AI products still fall into two categories: AI wrappers or value-adds.

At the end of the day, agents, AI, and all these tools are just that: tools. They are here to stay and plenty useful, but they do not have reasoning and thinking capabilities on the same level as a human being, and probably never will!
