
Tutorial - How to build your own LinkedIn Profile Scraper in 2022

LinkedIn is the world's largest professional network on the internet. You can use LinkedIn to find the right job or internship, connect and strengthen professional relationships, and learn the skills you need to succeed in your career. A complete LinkedIn profile can help you connect with opportunities by showcasing your unique professional story through experience, skills, and education.

In this tutorial, let's look at how to implement a web scraper that gathers job details and company profiles from a posted jobs list on LinkedIn and saves them in a JSON file using Python.

This tutorial is a complete beginner's guide to web scraping with Python.

What is Web Scraping?

Web scraping refers to the extraction of data from a website. The information is collected and then exported into a format that is more useful for the user, be it a spreadsheet or an API. In most cases, automated tools are preferred when scraping web data as they can be less costly and work at a faster rate.

Getting started

In order to complete this task, we need two Python libraries that are widely used in web scraping.

  1. Selenium
  2. BeautifulSoup

The first, Selenium, is used to navigate web pages and interact with them. The second, BeautifulSoup, is used to parse pages and extract data from them.

So let's install them.

pip install selenium
pip install beautifulsoup4

Installing the web driver

To work with Selenium, you need to install the web driver for your browser. WebDriver is an open-source tool for automated testing of web apps across many browsers. It provides capabilities for navigating web pages, user input, JavaScript execution, and more. If you are using Chrome, you can download the driver from this link.

It's important to check your browser version before downloading the driver.

Task explanation

On LinkedIn, a user account is not required to search for jobs. We can simply navigate to linkedin.com/jobs and search for any job vacancies available in a given area.


So our task is to automatically search for jobs on LinkedIn and save the job list and company profile URLs as JSON.


Navigating to the webpage

As mentioned earlier, we use Selenium for navigation purposes. Let's import it into our program.

from selenium import webdriver

Then we need to instantiate the web driver as a driver object.

driver = webdriver.Chrome(location)

Replace location with the path to your web driver executable. Selenium also supports other browsers.
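
Note: in Selenium 4 and newer, passing the driver path directly to webdriver.Chrome() is deprecated. If you are on such a version, you can wrap the path in a Service object instead; a minimal sketch (the path is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service("path/to/chromedriver") # placeholder; point this at your downloaded driver
driver = webdriver.Chrome(service=service)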

Now we can use the get() method to navigate to the website by its URL.

driver.get("https://www.linkedin.com/jobs") #URL

If you run the program now, you can see that it spins up a new browser and navigates to the URL. You'll also notice a notification bar saying that Chrome is being controlled by automated test software.

Interacting with the page

The next step is to search for jobs. First, let's save the job title and location that we want to search for in separate strings.

job_title = 'software engineer' 
job_location = 'sri lanka' 

A web page consists of HTML elements. In order to interact with a page, we need to find the elements we want to act on and their selector or locator information. The easiest way is to inspect the page using developer tools: place the cursor anywhere on the webpage, right-click to open the pop-up menu, then select the Inspect option. In the Elements window, move the cursor over the DOM structure of the page until it reaches the desired element. From there, we can find the HTML tag, the defined attributes, and the attribute values.


Next, we need to pass this information to the Selenium web driver to simulate user actions on the elements. Selenium provides various find_element methods that find elements based on the attribute/value criteria or selector value we supply in our script. For that, the By class needs to be imported from Selenium.

from selenium.webdriver.common.by import By

These are the various locator strategies that can be used to find elements on a page.

find_element(By.ID, "id")
find_element(By.NAME, "name")
find_element(By.XPATH, "xpath")
find_element(By.LINK_TEXT, "link text")
find_element(By.PARTIAL_LINK_TEXT, "partial link text")
find_element(By.TAG_NAME, "tag name")
find_element(By.CLASS_NAME, "class name")
find_element(By.CSS_SELECTOR, "css selector")

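For example, a search box could be located by its name attribute and a link by its visible text (the values here are hypothetical, just to illustrate the syntax):

search_box = driver.find_element(By.NAME, "q") # hypothetical input name
jobs_link = driver.find_element(By.LINK_TEXT, "Jobs") # hypothetical link text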

In our case, I am using the XPath method to locate the input tags for the job title and location.

XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications.

You can copy the XPath of an element by right-clicking on it in the Elements window.

Now save them in separate variables.

search_title = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[1]/input')
search_location = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[2]/input')

Now, to pass the string values to the inputs, we can use the send_keys() method.

search_title.send_keys(job_title)
search_location.clear()
search_location.send_keys(job_location, Keys.ENTER)

The location input is sometimes auto-filled with a default location based on your IP address. The clear() method is used to clear any default value in the input.

The Keys.ENTER argument is used to send the ENTER key after passing the input values. Before that, Keys should be imported.

from selenium.webdriver.common.keys import Keys

It is important to pause the program for a few seconds so that the search results are properly loaded before the next steps. For this, we can use the built-in time library.

import time

time.sleep(3) # sleeps for 3 seconds
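
A fixed sleep is the simplest option, but it can be too short on a slow connection. Alternatively, Selenium's explicit waits poll until an element appears; here is a minimal sketch, assuming the same results-list class we target below:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the results list to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'jobs-search__results-list'))
)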

Finally, we can get the ul element that contains the job list by locating it with By.CLASS_NAME.

jobs_list = driver.find_element(By.CLASS_NAME,'jobs-search__results-list')


Scrape data from the web page

Now we can use BeautifulSoup to scrape the necessary data from the job list. Let's import it first.

from bs4 import BeautifulSoup

As an initial step, we need to pass the HTML of jobs_list to BeautifulSoup.

soup = BeautifulSoup(jobs_list.get_attribute('outerHTML'), 'html.parser')

Similar to Selenium, we can retrieve all the li items in the ul by tag name. Calling the soup object directly, as below, is shorthand for find_all(). See the BeautifulSoup documentation for more information.

jobs = soup('li')

Now let's make a list of the information we need to extract from every job item.

  • job title
  • company name
  • location
  • link to the job details
  • link to the company profile

We use a for loop to iterate through each item in the jobs list and retrieve information.

data = []
for job in jobs:
    item = {}
    # text content of the title, company, and location tags
    item["job_title"] = job.find("h3", class_="base-search-card__title").text.strip(" \n")
    item["company"] = job.find("h4", class_="base-search-card__subtitle").text.strip(" \n")
    item["location"] = job.find("span", class_="job-search-card__location").text.strip(" \n")

    # href attributes, with the tracking query string removed
    job_details = job.find("a", class_="base-card__full-link")
    item["job_details"] = job_details["href"].split('?', 1)[0]

    company_profile = job.find("a", class_="hidden-nested-link")
    item["company_profile"] = company_profile.attrs["href"].split('?', 1)[0]
    data.append(item)

In the above code, I declared an empty data list to store the information. In every iteration of the for loop, I used find methods to locate the elements we need to get data from.

The text attribute is used to get the text content of the HTML tags, and attrs[] is used to get the attribute values of an element. The strip() and split() methods are used for basic text formatting.
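
For example, splitting a job URL on the first '?' drops the tracking query string (the URL below is made up for illustration):

url = "https://www.linkedin.com/jobs/view/example-123?refId=abc&trackingId=xyz" # illustrative URL
url.split('?', 1)[0] # 'https://www.linkedin.com/jobs/view/example-123'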


In every iteration, all the information is saved in a separate dictionary, which becomes a JSON object in the output.

Save data in a JSON file

Now that all the retrieved data is collected in the data list, we can use the built-in json library to save it in a new JSON file.

import json

with open("jobs.json", "w") as writeJSON:
    json.dump(data, writeJSON, indent=4)

The final output will be a jobs.json file in the working directory.
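Each entry has the structure below; the values here are made up for illustration.

[
    {
        "job_title": "Software Engineer",
        "company": "Example Company",
        "location": "Colombo, Sri Lanka",
        "job_details": "https://www.linkedin.com/jobs/view/...",
        "company_profile": "https://www.linkedin.com/company/..."
    }
]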

Conclusion

The final program

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import json
import time


job_title = 'software engineer' #replace job title
job_location = 'sri lanka' #replace location


driver = webdriver.Chrome('webdriver/chrome/chromedriver') #replace the webdriver location

driver.get("https://www.linkedin.com/jobs")


# fill in the search inputs and submit
search_title = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[1]/input')
search_location = driver.find_element(By.XPATH, '//*[@id="JOBS"]/section[2]/input')
search_title.send_keys(job_title)
search_location.clear()
search_location.send_keys(job_location, Keys.ENTER)
time.sleep(3) # wait for the search results to load

# parse the results list with BeautifulSoup
jobs_list = driver.find_element(By.CLASS_NAME, 'jobs-search__results-list')
soup = BeautifulSoup(jobs_list.get_attribute('outerHTML'), 'html.parser')

jobs = soup('li')

data = []
for job in jobs:
    item = {}
    item["job_title"] = job.find("h3", class_="base-search-card__title").text.strip(" \n")
    item["company"] = job.find("h4", class_="base-search-card__subtitle").text.strip(" \n")
    item["location"] = job.find("span", class_="job-search-card__location").text.strip(" \n")

    job_details = job.find("a", class_="base-card__full-link")
    item["job_details"] = job_details["href"].split('?', 1)[0]

    company_profile = job.find("a", class_="hidden-nested-link")
    item["company_profile"] = company_profile.attrs["href"].split('?', 1)[0]
    data.append(item)

# save the results as JSON
with open("jobs.json", "w") as writeJSON:
    json.dump(data, writeJSON, indent=4)

driver.quit()

With this program, you can easily scrape job details and company profile URLs from LinkedIn.

You can also download the program from my GitHub repository.

Thank You
