hayerhans

Posted on Feb 11

Web Automation in Plain English: Browser Use Changes Everything

#ai #webdev #opensource

Introduction

Hey everyone! Today we're diving into Browser Use, an incredible new library that's revolutionizing web automation. If you've ever struggled with Selenium or Playwright, dealing with selectors and timeouts, you're going to love this. Let's build something cool together!

Prefer video content? Check out my detailed walkthrough on YouTube: https://youtu.be/RsGTT7J7Po8

Setup Section

First, let's get our environment ready. I'll walk you through this step by step:

Create a fresh project folder and open your favorite IDE
Install uv, a super fast alternative to pip:

curl -LsSf https://astral.sh/uv/install.sh | sh

Create a virtual environment with Python 3.11 (Browser Use requires Python 3.11 or newer):

uv venv --python 3.11
source .venv/bin/activate  # macOS/Linux; on Windows use .venv\Scripts\activate

Install Browser Use and Playwright:

uv pip install browser-use
playwright install
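Before writing any agent code, a quick preflight check can save some confusion. This is a hypothetical helper, not part of Browser Use: it checks that the interpreter meets the 3.11 minimum and that the uv and playwright commands are on your PATH (playwright only shows up once the virtual environment is activated).

```python
import shutil
import sys


def meets_python_requirement(version=None, minimum=(3, 11)):
    """Return True if the given (or running) interpreter version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


def missing_tools(tools=("uv", "playwright")):
    """Return the command-line tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python OK:", meets_python_requirement())
    print("Missing tools:", missing_tools() or "none")
```

Run it inside the activated environment; if playwright is reported missing, the install step above did not land in the venv you are using.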

Creating Our First Agent

Let's write our first Browser Use agent. Here's the minimal code you need:

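import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task="Search for latest news about AI",
        llm=ChatOpenAI(model="gpt-4o"),  # reads OPENAI_API_KEY from the environment
    )
    await agent.run()  # the agent does nothing until you run it

asyncio.run(main())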

What's cool here is that we only need two main parameters:

task: Just tell it what you want to do in plain English
llm: Specify which language model to use
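ChatOpenAI looks for your key in the OPENAI_API_KEY environment variable, so the agent cannot talk to the model if it is missing. A small guard like the one below (require_env is a hypothetical helper, not part of Browser Use or LangChain) fails fast with a clear message before any browser starts:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise with a helpful message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell (or load it from a .env file) "
            "before creating the agent."
        )
    return value
```

Call require_env("OPENAI_API_KEY") at the top of main(), before constructing ChatOpenAI.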

Advanced Configuration

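Now, let's look at some powerful features. Browser Use exposes many more configuration options on Agent:

from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
custom_controller = Controller()  # custom actions can be registered on this

agent = Agent(
    task="your task",
    llm=llm,
    controller=custom_controller,                     # For custom tool calling
    use_vision=True,                                  # Enable vision capabilities
    save_conversation_path="logs/conversation.json",  # Save chat logs
)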

The use_vision parameter is particularly interesting: it lets your agent actually see the rendered page through screenshots instead of relying on the page structure alone. Keep in mind that vision costs tokens: with GPT-4o, each processed image costs roughly 800-1000 tokens (about $0.002 USD), depending on the screenshot size.
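To put that per-image figure in context, here is a back-of-the-envelope estimate for a whole run. It assumes one screenshot per step and an input price of $2.50 per million tokens for GPT-4o; both are assumptions for this sketch, so check OpenAI's current pricing before relying on the numbers. At that price, 800 tokens come to $0.002 and 1000 tokens to $0.0025 per image.

```python
def estimate_vision_cost(
    steps: int,
    tokens_per_image: int = 1000,
    usd_per_million_input_tokens: float = 2.50,  # assumed GPT-4o input price
) -> float:
    """Rough image-token cost (USD) of a run that sends one screenshot per step."""
    image_tokens = steps * tokens_per_image
    return image_tokens * usd_per_million_input_tokens / 1_000_000


# 20 steps at ~1000 tokens per screenshot
print(f"${estimate_vision_cost(20):.3f}")  # $0.050
```

This only counts image tokens; the text of each prompt and the model's replies add to the bill on top of it.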

Working with Browser Sessions

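One of the coolest features is browser session management. You can create a single Browser instance and reuse it across agents, and Browser Use can also be configured to connect to your existing Chrome instance, which is super helpful when you need to be logged in. Here's the basic reuse pattern: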

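import asyncio

from browser_use import Agent, Browser
from langchain_openai import ChatOpenAI

async def main():
    llm = ChatOpenAI(model="gpt-4o")
    task1 = "Search for latest news about AI"

    # Create and reuse a browser instance
    browser = Browser()
    try:
        agent = Agent(
            task=task1,
            llm=llm,
            browser=browser,  # Browser instance will be reused
        )
        await agent.run()
    finally:
        # Don't forget to close when done, even if the run fails
        await browser.close()

asyncio.run(main())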
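If you reuse one Browser across several agents, it's easy to forget the final close() when something throws. One way to guarantee cleanup is a small async context manager that works with any object exposing an async close() method, as Browser does in the snippet above. closing_browser and FakeBrowser are hypothetical helpers for this sketch; FakeBrowser only stands in for the real Browser so the example runs without a browser installed.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_browser(browser):
    """Yield `browser`, then always await browser.close(), even if the body raises."""
    try:
        yield browser
    finally:
        await browser.close()


class FakeBrowser:
    """Hypothetical stand-in for browser_use.Browser, used only to demo the lifecycle."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def demo():
    fake = FakeBrowser()
    async with closing_browser(fake) as browser:
        pass  # e.g. run one or more Agent(..., browser=browser) here
    return fake


if __name__ == "__main__":
    print(asyncio.run(demo()).closed)  # True
```

With the real library you would write async with closing_browser(Browser()) as browser: and create your agents inside the block.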

Structured Output

If you need structured data, Browser Use has you covered. You can define custom output formats using Pydantic:

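from typing import List

from browser_use import Controller
from pydantic import BaseModel

class Post(BaseModel):
    post_title: str
    post_url: str
    num_comments: int
    hours_since_post: int

class Posts(BaseModel):
    posts: List[Post]

# Pass this controller to Agent(controller=controller) to get output in this shape
controller = Controller(output_model=Posts)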
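Once the run finishes, the agent's final answer should be JSON matching the Posts schema, and you still have to turn that string into objects. With Pydantic, Posts.model_validate_json(raw) does it in one call; the sketch below shows the same parsing step with only the standard library (dataclasses instead of Pydantic, so it runs without dependencies). parse_posts and the sample payload are made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    post_title: str
    post_url: str
    num_comments: int
    hours_since_post: int


def parse_posts(raw: str) -> List[Post]:
    """Parse a JSON string shaped like {"posts": [{...}, ...]} into Post objects."""
    data = json.loads(raw)
    posts = []
    for item in data["posts"]:
        # Coerce numeric fields so "12" and 12 both become int
        posts.append(
            Post(
                post_title=str(item["post_title"]),
                post_url=str(item["post_url"]),
                num_comments=int(item["num_comments"]),
                hours_since_post=int(item["hours_since_post"]),
            )
        )
    return posts


# Hypothetical agent output, for illustration only
sample = (
    '{"posts": [{"post_title": "Show HN: A tiny tool", "post_url": "https://example.com",'
    ' "num_comments": "12", "hours_since_post": 3}]}'
)
for post in parse_posts(sample):
    print(post.post_title, post.num_comments)
```

A missing key raises KeyError and a non-numeric count raises ValueError, which is roughly the safety net Pydantic's validation gives you for free.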

Getting Results and History

agent.run() returns a history object that records what happened during the run:

# inside an async function, after creating `agent`
history = await agent.run()

# Access various types of information
urls = history.urls()                    # URLs visited
screenshots = history.screenshots()      # Screenshot paths
actions = history.action_names()         # Actions taken
content = history.extracted_content()    # Extracted data
errors = history.errors()                # Any errors
model_actions = history.model_actions()  # All actions with parameters
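history.errors() is handy for a quick post-run check. The helper below is a hypothetical sketch that assumes errors() returns one entry per step, with None (or an empty value) for steps that succeeded; it keeps only the real messages and tags each with its step number. Plain lists stand in for the history object so it runs on its own.

```python
from typing import List, Optional, Sequence


def summarize_errors(errors: Sequence[Optional[str]]) -> List[str]:
    """Keep only real error messages, labelled with the 1-based step they came from."""
    return [f"step {i}: {err}" for i, err in enumerate(errors, start=1) if err]


print(summarize_errors([None, "timeout", None, "click failed"]))
# ['step 2: timeout', 'step 4: click failed']
```

In a real run you would call summarize_errors(history.errors()) and log or assert on the result.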

Bonus: Using a Planner Model

For complex tasks, you can even use a separate model for high-level planning:

from browser_use import Agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-4o')
planner_llm = ChatOpenAI(model='o3-mini')

agent = Agent(
    task="your task",
    llm=llm,
    planner_llm=planner_llm,           # Planning model
    use_vision_for_planner=False,      # Disable vision for planner
    planner_interval=4                 # Plan every 4 steps
)

This setup lets you use a cheaper reasoning model (o3-mini) for planning while keeping the powerful GPT-4o for execution. Vision is turned off for the planner because o3-mini does not accept image input.
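planner_interval trades planning frequency for cost. As a rough mental model (an assumption, not Browser Use's actual scheduling code), suppose the planner runs on every step number that is a multiple of planner_interval. planning_steps is a hypothetical helper that lists those steps, so you can see how many extra planner calls a run of a given length adds.

```python
from typing import List


def planning_steps(total_steps: int, planner_interval: int) -> List[int]:
    """Toy model: 1-based steps at which a planner fires if it runs on every
    multiple of `planner_interval`. Not taken from Browser Use's source."""
    if planner_interval < 1:
        raise ValueError("planner_interval must be >= 1")
    return [step for step in range(1, total_steps + 1) if step % planner_interval == 0]


print(planning_steps(12, 4))  # [4, 8, 12]
```

Under that model, a 12-step run with planner_interval=4 calls o3-mini three times while GPT-4o still handles all 12 steps. The library may schedule the first plan differently, so treat this only as an estimate.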

Closing

That's it for today's tutorial! We've covered everything from basic setup to advanced features like browser session management and structured output. Drop a comment below if you'd like to see more Browser Use tutorials, maybe something about custom functions or system prompts?
