If you're like me, sometimes you want to scrape a web page so badly. You probably want some data in a readable format, or just need a way to re-crunch that data for other purposes.
I solemnly swear that I am up to no good.
I've found my optimal setup after many tries with Guzzle, BeautifulSoup, etc... Here it is:
- Node.js
- Puppeteer: check https://github.com/GoogleChrome/puppeteer
- A little Raspberry Pi where my scripts can run all day long.
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
What does it mean? It means you can run a Chrome instance and put it at your service. Cool, isn't it?
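To give you a taste, this is all it takes to drive a headless browser (a minimal sketch; the URL and the screenshot path are just placeholders):

```js
// launch headless Chromium, open a page, take a screenshot, shut down
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await page.screenshot({ path: 'example.png' });
    await browser.close();
});
```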
Let's see how to do it.
Setup
Yes, the usual setup. Fire up your terminal, create a folder for your project and run `npm init` in the folder. When you're set up you'll probably have a `package.json` file. We're good to go. Now run `npm i -S puppeteer` to install Puppeteer.
A little warning: Puppeteer will download a full version of Chromium into your `node_modules` folder.
Don't worry: since version 1.7.0 Google publishes the `puppeteer-core` package, a version of Puppeteer that doesn't download Chromium by default. So, if you're willing to try it, just run `npm i -S puppeteer-core` instead.
`puppeteer-core` is intended to be a lightweight version of Puppeteer for launching an existing browser installation or for connecting to a remote one.
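On a Raspberry Pi, for example, you could point it at the system Chromium instead of a downloaded one (a minimal sketch; the executable path is an assumption and varies by system):

```js
// puppeteer-core needs an existing browser: point it at the system Chromium
const puppeteer = require('puppeteer-core');

puppeteer.launch({
    executablePath: '/usr/bin/chromium-browser' // assumed install location
}).then(async browser => {
    console.log(await browser.version());
    await browser.close();
});
```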
Ok, we're good to go now.
Your first scraper
Touch an `index.js` file in the project folder and paste this code in it.
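Something like this (a sketch of the original snippet, reconstructed from the walkthrough below; the `h2 a` selector, the jQuery CDN URL and the exact user-agent string are assumptions, and the line numbers referenced below are approximate):

```js
const puppeteer = require('puppeteer');
const url = 'https://coding.napolux.com';

puppeteer.launch({ headless: true }).then(async browser => {
    const page = await browser.newPage();
    // pretend to be a good old iPhone
    await page.setViewport({ width: 375, height: 667 });
    await page.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');
    await page.goto(url);
    // wait for a selector we know is in the page, so we're sure it's loaded
    await page.waitForSelector('body.blog');

    // inject jQuery for its wonderful CSS selectors
    await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.3.1.min.js' });

    const result = await page.evaluate(() => {
        let data = [];
        // grab title and URL from every post link on the homepage
        $('h2 a').each(function() {
            data.push({
                title: $(this).text(),
                url: $(this).attr('href')
            });
        });
        return data;
    });

    await browser.close();

    for (var i = 0; i < result.length; i++) {
        console.log('Post: ' + result[i].title + ' URL: ' + result[i].url);
    }
});
```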
That's all you need to set up a web scraper. You can also find it in my repo https://github.com/napolux/puppy.
Let's dig into the code a bit.
For the sake of our example we'll just grab all the post titles and URLs from my blog homepage. To add a nice touch we'll change our user-agent in order to look like a good old iPhone while browsing the webpage we're scraping.
And because we're lazy, we'll inject jQuery into the page in order to use its wonderful CSS selectors.
So... Let's go line by line:
- Line 1-2: we require Puppeteer and configure the website we're going to scrape
- Line 4: we launch Puppeteer. Please remember we're in the kingdom of Lord Asynchronous, so everything is a Promise, is async, or has to wait for something else ;) As you can see the config is self-explanatory: we're telling the script to run Chromium headless (no UI).
- Line 5-10: the browser is up, we create a new page, we set the viewport size to a mobile screen, we set a fake user-agent and we open the webpage we want to scrape. In order to be sure that the page is loaded, we wait for the selector `body.blog` to be there.
- Line 11: as I said, we inject jQuery into the page
- Line 13-28: here is where the magic happens: we evaluate our page and run some jQuery code in order to extract the data we need. Nothing fancy, if you ask me.
- Line 31-37: we're done: we close the browser and print out our data.
Run `node index.js` from the project folder and you should end up with something like...
```
Post: Blah blah 1? URL: https://coding.napolux.com/blah1/
Post: Blah blah 2? URL: https://coding.napolux.com/blah2/
Post: Blah blah 3? URL: https://coding.napolux.com/blah3/
```
Recap
So, welcome to the world of web scraping. It was easier than expected, right? Just remember that web scraping is a controversial matter: please scrape only websites you're authorized to scrape.
No. As the owner of https://coding.napolux.com, I don't authorize you.
I'll leave it to you to figure out how to scrape AJAX-based webpages ;)
Originally published @ https://coding.napolux.com
Top comments (6)
This is a great, concise and well-explained article. I decided to try using the whole block within lines 13-28, and I keep getting this error:

```
(node:65901) UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: reject is not defined
```

How could I resolve this error?
Well, the `puppeteer.launch().then(async browser => {` etc. is a promise itself, so the reject is there. Just tried the code and it still works.
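If the error persists, my guess is that the copied block references `reject` inside `page.evaluate` without defining it; wrapping the body in an explicit Promise should clear the ReferenceError (a sketch, assuming the jQuery loop from the article):

```js
const result = await page.evaluate(() => {
    // resolve and reject only exist if we create the Promise ourselves;
    // page.evaluate waits for a returned Promise to settle
    return new Promise((resolve, reject) => {
        try {
            let data = [];
            $('h2 a').each(function() {
                data.push({ title: $(this).text(), url: $(this).attr('href') });
            });
            resolve(data);
        } catch (err) {
            // Error objects don't serialize well across the DevTools protocol
            reject(err.toString());
        }
    });
});
```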
Hi Francesco Napoletano,
Your code is great!!!
But I can't save the data to a .txt file. It reports an undefined error. Help me fix it. Why does

```js
for(var i = 0; i < result.length; i++) {
    console.log('Post: ' + result[i].title + ' URL: ' + result[i].url);
}
```

only print to the screen? I can't export the value. If I export it to a .txt file, it comes up undefined. Please help me export the .txt file!!! Thanks
No error, but `devnew` is undefined!!!

```js
var devnew = result.title;
fs.writeFile('devnew.txt', devnew, 'utf8');
```
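For what it's worth, `result` is an array, so `result.title` comes back undefined; mapping each post to a line and giving `fs.writeFile` a callback should work (a sketch reusing the `devnew.txt` name from the comment):

```js
const fs = require('fs');

// result is an array of { title, url } objects, so build one line per post
const devnew = result.map(post => 'Post: ' + post.title + ' URL: ' + post.url).join('\n');

fs.writeFile('devnew.txt', devnew, 'utf8', err => {
    if (err) throw err;
    console.log('Saved ' + result.length + ' posts to devnew.txt');
});
```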
Title says "scrap" instead of "scrape".
How do I save the result to a MySQL database?
Hi, if you want to export the result to a MySQL db, you can download the mysql library and use it to connect to your db, then run queries from your script to export the data.
Ref: npmjs.com/package/mysql
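Roughly like this (a sketch with the mysql package; the connection settings and the `posts` table are assumptions):

```js
const mysql = require('mysql');

// connection settings and the posts(title, url) table are assumptions
const connection = mysql.createConnection({
    host: 'localhost',
    user: 'user',
    password: 'secret',
    database: 'scraper'
});

connection.connect();

// insert each scraped post; the ? placeholders escape the values for us
result.forEach(post => {
    connection.query(
        'INSERT INTO posts (title, url) VALUES (?, ?)',
        [post.title, post.url]
    );
});

connection.end();
```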