One of the most common reasons scrapers get blocked is the use of an improper or default user-agent. Fortunately, adding random fake user-agents to your Go Colly scrapers is straightforward.
What Are Fake User-Agents?
User-agents are strings used by websites to identify the client making the request, providing information about the application, operating system (e.g., Windows, macOS, Linux), and browser (e.g., Chrome, Firefox, Safari) being used. These strings are sent to servers as part of the HTTP request headers.
For instance, here’s an example of a user-agent when accessing a website using Chrome on an Android device:
"User-Agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36"
When scraping a website, it’s crucial to set a user-agent for each request. If you don’t, the website can detect non-human traffic and block your scraper.
By default, Go Colly uses this user-agent for its requests:
"User-Agent": "colly - https://github.com/gocolly/colly",
This default user-agent immediately reveals that Colly is being used, making your scraper trivial to detect and block. That is why we need to manage the user-agents Go Colly sends with our requests.
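You can confirm what Colly sends out of the box by fetching httpbin.org/headers, which echoes back the request headers it receives. A minimal check:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// A bare collector with no custom headers
	c := colly.NewCollector()

	// httpbin.org/headers echoes the request headers back in the body,
	// so the output will include the default Colly user-agent
	c.OnResponse(func(r *colly.Response) {
		fmt.Println(string(r.Body))
	})

	c.Visit("http://httpbin.org/headers")
}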
How To Set A Fake User-Agent In Go Colly
Implementing a fake user-agent with Go Colly is a breeze. You can modify the request headers by setting a custom user-agent in the OnRequest() callback, which runs before every outgoing request.
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set a fake user-agent on every request
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")
	})

	// Print the response body on a single line
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
From here our scraper will use this user-agent for every request.
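As a side note, if you only ever need one static string, Colly also exposes a UserAgent collector option, so you can set it once at collector creation instead of in an OnRequest callback:

c := colly.NewCollector(
	colly.UserAgent("Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"),
	colly.AllowURLRevisit(),
)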
However, if you are scraping at scale, using the same user-agent for every request isn't best practice, as it makes it easier for the website to detect you as a scraper.
To solve this problem we will need to configure our Go Colly scraper to use a random user-agent with every request.
How To Rotate Through Random User-Agents
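The simplest approach is to maintain your own pool of user-agent strings and pick one at random for each request. Here is a minimal sketch of the idea using math/rand; the strings in the pool are illustrative:

package main

import (
	"log"
	"math/rand"

	"github.com/gocolly/colly"
)

// A small hand-picked pool; in practice you would keep this list fresh
var userAgents = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
	"Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0",
}

func main() {
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Pick a random user-agent from the pool for every outgoing request
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
	})

	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", r.Body)
	})

	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}

This works, but the pool goes stale quickly and you have to curate it yourself.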
A dedicated package can supply fresh, realistic user-agent strings for you. In the following examples we will use github.com/lib4u/fake-useragent.
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
	uaFake "github.com/lib4u/fake-useragent"
)

func main() {
	// Init user-agent faker
	ua, err := uaFake.New()
	if err != nil {
		log.Fatal(err)
	}

	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set a random fake user-agent on every request
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", ua.Filter().GetRandom())
	})

	// Print the response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
With just a couple of extra lines of code, every request now goes out with a random user-agent.
However, a purely random user-agent is not always enough; below we will look at options for requesting specific kinds of fake user-agents.
The github.com/lib4u/fake-useragent library offers thousands of user-agent strings drawn from a database of real-world browsers, and its Filter() API lets you narrow the pool down:
// Get a random user-agent string
fmt.Println(ua.GetRandom())
// Mozilla/5.0 (iPhone; CPU iPhone OS 18_1_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.1.1 Mobile/15E148 Safari/604.1

// Get a user-agent string for a specific browser
fmt.Println(ua.Filter().Chrome().Get())
// Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36

fmt.Println(ua.Filter().Firefox().Get())
// Mozilla/5.0 (Android 14; Mobile; rv:133.0) Gecko/133.0 Firefox/133.0

fmt.Println(ua.Filter().Safari().Get())
// Mozilla/5.0 (iPhone; CPU iPhone OS 18_1_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.1.1 Mobile/15E148 Safari/604.1
Here is how that looks in a complete Go Colly example:
package main

import (
	"bytes"
	"log"

	"github.com/gocolly/colly"
	uaFake "github.com/lib4u/fake-useragent"
)

func main() {
	// Init user-agent faker
	ua, err := uaFake.New()
	if err != nil {
		log.Fatal(err)
	}

	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// Set a random desktop Chrome user-agent on every request
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", ua.Filter().Chrome().Platform(uaFake.Desktop).Get())
	})

	// Print the response
	c.OnResponse(func(r *colly.Response) {
		log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
	})

	// Fetch httpbin.org/headers five times
	for i := 0; i < 5; i++ {
		c.Visit("http://httpbin.org/headers")
	}
}
This configures the scraper to generate random user-agents for the desktop version of Google Chrome.
Now, on every visit, the website will identify us as a random desktop Chrome user. A simple user-agent substitution goes a long way when scraping websites, but don't forget to combine it with proxies and other browser-like headers.
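For completeness, here is a sketch of combining a random user-agent with Colly's round-robin proxy switcher and a couple of common browser headers. The proxy URLs are placeholders you would replace with your own, and the header values are illustrative:

package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
	uaFake "github.com/lib4u/fake-useragent"
)

func main() {
	// Init user-agent faker
	ua, err := uaFake.New()
	if err != nil {
		log.Fatal(err)
	}

	c := colly.NewCollector(colly.AllowURLRevisit())

	// Rotate requests across a pool of proxies (placeholder addresses)
	rp, err := proxy.RoundRobinProxySwitcher(
		"http://proxy1.example.com:8080",
		"http://proxy2.example.com:8080",
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	// Send a random user-agent plus headers a real browser would send
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", ua.GetRandom())
		r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
		r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
	})

	c.Visit("http://httpbin.org/headers")
}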
https://github.com/lib4u/fake-useragent
https://github.com/gocolly/colly