Introduction
Hey everyone! Today we're diving into Browser Use, a Python library that lets a language model drive a real browser from plain-English instructions. If you've ever struggled with Selenium or Playwright, wrestling with brittle selectors and timeouts, you're going to love this. Let's build something cool together!
Prefer video content? Check out my detailed walkthrough on YouTube: https://youtu.be/RsGTT7J7Po8
Setup Section
First, let's get our environment ready. I'll walk you through this step by step:
- Create a fresh project folder and open your favorite IDE
- Install uv, a much faster alternative to pip:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Create a virtual environment with Python 3.11 (Browser Use needs Python 3.11 or newer) and activate it:
uv venv --python 3.11
source .venv/bin/activate
- Install Browser Use and Playwright:
uv pip install browser-use
playwright install
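Before moving on, it can be worth confirming that the environment is in shape. Here's a minimal sketch that uses only the standard library. `check_environment` is a hypothetical helper name (not part of Browser Use), and the 3.11 minimum is the requirement from the steps above.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the setup steps


def check_environment(packages=("browser_use", "playwright")):
    """Report whether the Python version is new enough and the packages are importable."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'OK' if ok else 'MISSING'}")
```

Run it inside the activated virtual environment. Anything reported as MISSING means the matching install step needs another look.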
Creating Our First Agent
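Let's write our first Browser Use agent. Here's the minimal code you need to create one and run it:
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task="Search for latest news about AI",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()  # run() is a coroutine, so it needs an event loop

asyncio.run(main())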
What's cool here is that we only need two main parameters:
- task: Just tell it what you want to do in plain English
- llm: Specify which language model to use
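One thing the snippet takes for granted: `ChatOpenAI` picks up your OpenAI key from the `OPENAI_API_KEY` environment variable. Here's a small pre-flight check you could run before creating the agent. `require_env` is a hypothetical helper written for this post, not something Browser Use ships.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise a clear error if it is unset."""
    env = os.environ if env is None else env  # env can be injected for testing
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the agent")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of your script turns a confusing authentication failure halfway through a run into an immediate, readable error.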
Advanced Configuration
Now, let's look at some powerful features. Browser Use gives us tons of configuration options:
agent = Agent(
    task="your task",
    llm=llm,
    controller=custom_controller,  # For custom tool calling
    use_vision=True,  # Enable vision capabilities
    save_conversation_path="logs/conversation.json"  # Save chat logs
)
The use_vision parameter is particularly interesting - it lets your agent actually see and understand what's on the webpage. Just keep in mind that for GPT-4o, each image processed costs about 800-1000 tokens (roughly $0.002 USD).
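To get a feel for how that adds up, here's a back-of-the-envelope estimate. It assumes one screenshot per agent step, which is my assumption; the real number depends on the run. The token and dollar figures are just the rough ones quoted above.

```python
TOKENS_PER_IMAGE = (800, 1000)  # rough range quoted above for GPT-4o
USD_PER_IMAGE = 0.002           # rough per-image cost quoted above


def estimate_vision_cost(steps, images_per_step=1):
    """Estimate the extra tokens and dollars spent on screenshots for one run."""
    images = steps * images_per_step
    low, high = (images * tokens for tokens in TOKENS_PER_IMAGE)
    return {
        "images": images,
        "tokens": (low, high),
        "usd": round(images * USD_PER_IMAGE, 4),
    }


print(estimate_vision_cost(steps=25))
```

Under those assumptions, a 25-step run spends somewhere between 20,000 and 25,000 extra tokens on images, or about $0.05.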
Working with Browser Sessions
One of the coolest features is browser session management: you can create a browser instance once and reuse it across agents, which is super helpful when a task needs you to stay logged in. (Connecting to your own Chrome installation takes extra browser configuration that isn't shown here.) Here's how:
from browser_use import Agent, Browser
# Run this inside an async function; task1 and llm are defined elsewhere
# Create and reuse a browser instance
browser = Browser()
agent = Agent(
    task=task1,
    llm=llm,
    browser=browser,  # Browser instance will be reused
)
await agent.run()
# Don't forget to close when done
await browser.close()
Structured Output
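If you need structured data, Browser Use has you covered. You can define a custom output format using Pydantic and hand it to a Controller:
from typing import List
from browser_use import Controller
from pydantic import BaseModel
class Post(BaseModel):
    post_title: str
    post_url: str
    num_comments: int
    hours_since_post: int
class Posts(BaseModel):
    posts: List[Post]
# Pass this controller to Agent(controller=controller) so results follow the Posts schema
controller = Controller(output_model=Posts)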
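Once you have the agent's output, you can validate it against the same model. Here's a sketch of that parsing step. It assumes the agent hands back its final answer as a JSON string and that you're on Pydantic v2 (`model_validate_json`). The sample string is made up and stands in for a real result, so this runs without a browser.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Post(BaseModel):
    post_title: str
    post_url: str
    num_comments: int
    hours_since_post: int


class Posts(BaseModel):
    posts: List[Post]


# Made-up sample standing in for the agent's final JSON output
sample = (
    '{"posts": [{"post_title": "Example", "post_url": "https://example.com", '
    '"num_comments": 3, "hours_since_post": 2}]}'
)

parsed = Posts.model_validate_json(sample)
for post in parsed.posts:
    print(f"{post.post_title} ({post.num_comments} comments, {post.hours_since_post}h ago)")

# Output that doesn't match the schema raises ValidationError instead of slipping through
try:
    Posts.model_validate_json('{"posts": [{"post_title": "Missing fields"}]}')
except ValidationError as exc:
    print(f"rejected: {exc.error_count()} validation errors")
```

The nice part of this design is that downstream code works with typed attributes like `post.num_comments` rather than digging through raw dictionaries.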
Getting Results and History
After running your agent, you get access to tons of useful information:
history = await agent.run()
# Access various types of information
urls = history.urls() # URLs visited
screenshots = history.screenshots() # Screenshot paths
actions = history.action_names() # Actions taken
content = history.extracted_content() # Extracted data
errors = history.errors() # Any errors
model_actions = history.model_actions() # All actions with parameters
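Those accessors are easy to roll up into a quick run report. Here's a minimal sketch, with some assumptions: `summarize_run` is a hypothetical helper, and the `fake_history` object is a stand-in with made-up data so the example runs offline. It also assumes `errors()` may contain empty entries for steps that went fine, so those get filtered out.

```python
from types import SimpleNamespace


def summarize_run(history):
    """Build a short summary from the history accessors shown above."""
    errors = [e for e in history.errors() if e]  # drop empty/None entries
    return {
        "pages_visited": len(history.urls()),
        "actions_taken": len(history.action_names()),
        "error_count": len(errors),
        "clean_run": not errors,
    }


# Hypothetical stand-in for a real history object (made-up data)
fake_history = SimpleNamespace(
    urls=lambda: ["https://example.com", "https://example.com/news"],
    action_names=lambda: ["go_to_url", "click_element", "done"],
    errors=lambda: [None, "Timeout while clicking", None],
)

print(summarize_run(fake_history))
```

In a real script you'd pass in the object returned by `await agent.run()` instead of the stand-in.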
Bonus: Using a Planner Model
For complex tasks, you can even use a separate model for high-level planning:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model='gpt-4o')
planner_llm = ChatOpenAI(model='o3-mini')
agent = Agent(
    task="your task",
    llm=llm,
    planner_llm=planner_llm,  # Planning model
    use_vision_for_planner=False,  # Disable vision for planner
    planner_interval=4  # Plan every 4 steps
)
This setup lets you use a smaller, cheaper model for planning while keeping the powerful GPT-4o for execution.
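To picture what "plan every 4 steps" means, here's a tiny sketch of the schedule. It rests on an assumption I haven't verified against the library internals: steps are numbered from 1 and the planner fires whenever the step number is a multiple of `planner_interval`. `planner_steps` is a hypothetical helper for illustration only.

```python
def planner_steps(total_steps, planner_interval):
    """List the step numbers at which the planner would run.

    Assumption: steps are numbered from 1 and the planner fires whenever
    the step number is a multiple of planner_interval.
    """
    if planner_interval < 1:
        raise ValueError("planner_interval must be at least 1")
    return [step for step in range(1, total_steps + 1) if step % planner_interval == 0]


print(planner_steps(total_steps=12, planner_interval=4))
```

Under that assumption, a 12-step run with `planner_interval=4` calls the planner three times. A larger interval means fewer planner calls and lower cost, at the price of less frequent course correction.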
Closing
That's it for today's tutorial! We've covered everything from basic setup to advanced features like browser session management and structured output. Drop a comment below if you'd like to see more Browser Use tutorials, maybe something about custom functions or system prompts?