hayerhans

Web Automation in Plain English: Browser Use Changes Everything

Introduction

Hey everyone! Today we're diving into Browser Use, a new library that lets you automate the browser by describing the task in plain English and letting an LLM drive. If you've ever struggled with Selenium or Playwright, wrestling with selectors and timeouts, you're going to love this. Let's build something cool together!

Prefer video content? Check out my detailed walkthrough on YouTube: https://youtu.be/RsGTT7J7Po8

Setup Section

First, let's get our environment ready. I'll walk you through this step by step:

  1. Create a fresh project folder and open your favorite IDE
  2. Install uv, a super fast alternative to pip:
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create a virtual environment with Python 3.11 (Browser Use requirement):
uv venv --python 3.11
source .venv/bin/activate
  4. Install Browser Use and Playwright:
uv pip install browser-use
playwright install
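
Before moving on, it can help to confirm the setup actually took. Here is a small sanity-check sketch using only the standard library. It assumes that "Python 3.11" means 3.11 or newer, that the packages import as browser_use and langchain_openai (the module names used in the snippets in this post), and that you will be using an OpenAI model, which reads its key from the OPENAI_API_KEY environment variable. The check_environment helper is my own illustration, not part of Browser Use.

```python
import importlib.util
import os
import sys


def check_environment(min_version=(3, 11), packages=("browser_use", "langchain_openai")):
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    # Assumption: the Python 3.11 requirement means "3.11 or newer"
    if sys.version_info < min_version:
        found = sys.version.split()[0]
        problems.append(f"Python {min_version[0]}.{min_version[1]}+ required, found {found}")
    # find_spec returns None when a top-level module cannot be located
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    # ChatOpenAI picks up the API key from this environment variable
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks ready.")
```

Run it inside the activated virtual environment; anything it prints as a warning is worth fixing before you create your first agent.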

Creating Our First Agent

Let's write our first Browser Use agent. Here's the minimal code you need:

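import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task="Search for latest news about AI",
        llm=ChatOpenAI(model="gpt-4o"),  # reads OPENAI_API_KEY from the environment
    )
    await agent.run()  # run() is a coroutine, so it has to be awaited

asyncio.run(main())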

What's cool here is that we only need two main parameters:

  • task: Just tell it what you want to do in plain English
  • llm: Specify which language model to use

Advanced Configuration

Now, let's look at some powerful features. Browser Use gives us tons of configuration options:

# Assumes llm and custom_controller were created earlier
agent = Agent(
    task="your task",
    llm=llm,
    controller=custom_controller,  # For custom tool calling
    use_vision=True,               # Enable vision capabilities
    save_conversation_path="logs/conversation.json",  # Save chat logs
)

The use_vision parameter is particularly interesting - it lets your agent actually see and understand what's on the webpage. Just keep in mind that for GPT-4o, each image processed costs about 800-1000 tokens (roughly $0.002 USD).
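
To get a feel for what that adds up to, here is a back-of-the-envelope sketch. It takes the roughly $0.002 per image figure above at face value and assumes one screenshot is sent per agent step, which is my simplification; the real number depends on the model, pricing, and screen size. The estimate_vision_cost helper is purely illustrative.

```python
def estimate_vision_cost(num_screenshots, usd_per_image=0.002):
    """Rough cost in USD for a run that sends num_screenshots images to the model."""
    if num_screenshots < 0:
        raise ValueError("num_screenshots must be non-negative")
    return num_screenshots * usd_per_image


# Assumption: one screenshot per agent step
for steps in (10, 50, 200):
    print(f"{steps} steps: about ${estimate_vision_cost(steps):.2f} in image tokens")
```

So vision is cheap for a single short run, but it is worth keeping in mind if you plan to run long tasks or many agents.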

Working with Browser Sessions

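One of the coolest features is control over the browser session: you can create a Browser instance yourself and reuse it across agents, and Browser Use can also connect to your existing Chrome instance, which is super helpful when you need to be logged in. The snippet below shows the simpler case, creating one Browser instance and reusing it: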

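import asyncio

from browser_use import Agent, Browser
from langchain_openai import ChatOpenAI

async def main():
    llm = ChatOpenAI(model="gpt-4o")

    # Create and reuse a browser instance
    browser = Browser()
    agent = Agent(
        task="your task",
        llm=llm,
        browser=browser,  # Browser instance will be reused
    )

    await agent.run()

    # Don't forget to close when done
    await browser.close()

asyncio.run(main())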

Structured Output

If you need structured data, Browser Use has you covered. You can define custom output formats using Pydantic:

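from typing import List

from browser_use import Controller
from pydantic import BaseModel

class Post(BaseModel):
    post_title: str
    post_url: str
    num_comments: int
    hours_since_post: int

class Posts(BaseModel):
    posts: List[Post]

# Pass this controller to the Agent via controller=...
controller = Controller(output_model=Posts)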
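
Once the agent hands back JSON matching that schema, Pydantic can turn it into typed objects. Here is a minimal sketch of that step on its own, assuming Pydantic v2 (model_validate_json is a v2 method). The models are repeated so the snippet runs standalone, and the raw string is made-up sample data standing in for whatever your agent actually returns; how you pull the final output out of a run is not covered here.

```python
from typing import List

from pydantic import BaseModel


class Post(BaseModel):
    post_title: str
    post_url: str
    num_comments: int
    hours_since_post: int


class Posts(BaseModel):
    posts: List[Post]


# Hypothetical JSON, standing in for the agent's structured output
raw = (
    '{"posts": [{"post_title": "Example post", '
    '"post_url": "https://example.com/1", '
    '"num_comments": 12, "hours_since_post": 3}]}'
)

# Validates the JSON against the schema and builds typed objects
parsed = Posts.model_validate_json(raw)

for post in parsed.posts:
    print(f"{post.post_title} ({post.num_comments} comments, {post.hours_since_post}h ago)")
```

If the JSON is missing a field or has the wrong type, model_validate_json raises a ValidationError, so you find out right away instead of three functions later.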

Getting Results and History

After running your agent, you get access to tons of useful information:

history = await agent.run()

# Access various types of information
urls = history.urls()                    # URLs visited
screenshots = history.screenshots()      # Screenshot paths
actions = history.action_names()         # Actions taken
content = history.extracted_content()    # Extracted data
errors = history.errors()                # Any errors
model_actions = history.model_actions()  # All actions with parameters
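
A quick way to make sense of a run is to count what the agent actually did. The sketch below assumes history.action_names() gives you a plain list of strings, as the comment above suggests. To keep it runnable without a browser, it uses a hand-written sample list; the action names in it are placeholders, and summarize_actions is my own helper, not a Browser Use function.

```python
from collections import Counter


def summarize_actions(action_names):
    """Count how often each action name occurs, most frequent first."""
    return Counter(action_names).most_common()


# Illustrative sample; in a real run this would be history.action_names()
sample_actions = ["go_to_url", "click_element", "input_text", "click_element", "done"]

for name, count in summarize_actions(sample_actions):
    print(f"{name}: {count}")
```

Swap the sample list for the real one and you get a compact picture of where the agent spent its steps.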

Bonus: Using a Planner Model

For complex tasks, you can even use a separate model for high-level planning:

from browser_use import Agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-4o')
planner_llm = ChatOpenAI(model='o3-mini')

agent = Agent(
    task="your task",
    llm=llm,
    planner_llm=planner_llm,           # Planning model
    use_vision_for_planner=False,      # Disable vision for planner
    planner_interval=4                 # Plan every 4 steps
)

This setup lets you use a smaller, cheaper model for planning while keeping the powerful GPT-4o for execution.

Closing

That's it for today's tutorial! We've covered everything from basic setup to advanced features like browser session management and structured output. Drop a comment below if you'd like to see more Browser Use tutorials, maybe something about custom functions or system prompts?
