Web Scraping job postings with salaries

In this article I will show you how I scraped job posts from a local labor market platform in Latvia. We will scrape only the offers that include some information about the expected salary (a range or an approximate amount).

Recently a new regulation came into force requiring all employers to state at least an approximate salary figure or range in their job posts. Now I check out new positions even though I don’t want to change my job; I am just curious how much companies offer now and for what. So let’s stop this manual and time-consuming process and build a web scraper instead.

The plain & simple approach to web scraping

I had done some web scraping before, but just recently I learned one trick that turned web scraping from complicated to really easy (at least for non-JavaScript pages). I want to share this basic thing with you right now.

If you are more experienced in web scraping than I am, then you might already know all about HTML and BS4. But I was never willing to learn more HTML than was needed to scrape whatever I wanted scraped, so I struggled with each page.

That is, until now. The Beautiful Soup 4 module has plenty of features and functions, but two of the most used are find and find_all. To find what you need on a page, you will need to inspect the code behind it, and as you might notice, it is very structured: elements are separated by these “<>” brackets.

Whenever you use find or find_all, you are first of all looking for a specific element type, but you can also specify more exact parameters for the elements you want to find (e.g. a specific class). In order to find the element li (which stands for list item) with the class “match”, we need to type the following:
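Here is a minimal, self-contained example; the HTML snippet is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A toy page with three list items, two of which have the class "match"
html = """
<ul>
  <li class="match">First matching item</li>
  <li class="other">Not a match</li>
  <li class="match">Second matching item</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element (or None if nothing matches)
first = soup.find("li", {"class": "match"})
print(first.text)  # First matching item

# find_all returns a list of every matching element
all_matches = soup.find_all("li", {"class": "match"})
print(len(all_matches))  # 2
```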

As you noticed, we can state parameters in the form of a dictionary. From here I believe find and find_all should be very easy to understand.

Another thing is how to use the .get() function. To put it in a single sentence: you state the attribute name (the thing you would type as a key in find or find_all) in the brackets to get its value. The most common example is .get(“href”).
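A small example, again with made-up HTML:

```python
from bs4 import BeautifulSoup

# .get() reads an attribute value from an element that has already
# been found; the most common use is pulling the link target ("href").
html = '<a class="job-link" href="/vacancy/12345">Data Analyst</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", {"class": "job-link"})
print(link.get("href"))   # /vacancy/12345
print(link.get("class"))  # ['job-link'] - multi-valued attributes come back as a list
```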

Please don’t laugh if this was self-explanatory to you, because it really wasn’t for me. I can’t say the materials on the web are always beginner-friendly.

Who needs this information

To my mind – every participant of the labor market.

Employers

Companies are the ones most interested in an analytical approach here, as many of them pay survey companies to get exactly this kind of data. Even companies in neighboring countries could use these figures to approximate salaries in their own country based on differences in macroeconomic factors.

For companies it is important to know not only how much competitors are willing to pay for labor, but also to estimate where the market is going and how much money they will have to allocate for salaries in the next budget period.

Employees

Well, obviously we as employees would like to know whether our salaries are fairly priced according to the market.

And if we are only deciding which career path to take, we might want to see in which areas labor prices are either stable or above average.

Another thing we might want to know: which companies are more generous than others? What skills are worth more than others, and by how much exactly? Actually, this one is relevant to companies as well.

What the fully completed project should look like

  1. Scrape all job posts from job lists
  2. Save to data frame and hard drive
  3. Drop duplicates if any
  4. Clean salary figures
  5. Go through each job post link. Look for
    • Responsibilities 
    • Requirements 
    • Additional bonuses 
    • Check whether an industry is stated within the HTML code (the site groups open positions into categories)
  6. Find keywords for responsibilities and requirements 
  7. Calculate attribution to total salary for responsibilities, requirements and additional bonuses 
  8. Analyze skill worth within each category 
  9. Visualize results and write a kick-ass article  

What we aim for in this project

  1. Scrape all job posts from job lists
  2. Save to data frame and hard drive
  3. Clean salary figures
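To illustrate step 3, here is one possible way to clean the salary figures with pandas. The column names and the raw salary strings below are my own assumptions; the real formats depend on how the site displays salaries:

```python
import re
import pandas as pd

# Hypothetical raw salary strings as they might appear on the site;
# the exact formats are assumptions, not the site's actual markup.
df = pd.DataFrame({
    "position": ["Data Analyst", "Developer", "Accountant"],
    "salary_raw": ["1200 - 1800 EUR", "2500 EUR", "900-1100 EUR"],
})

def clean_salary(raw):
    """Extract the numbers; return the midpoint of a range, or the single figure."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

df["salary_eur"] = df["salary_raw"].apply(clean_salary)
print(df[["position", "salary_eur"]])
```

Taking the midpoint of a range is just one design choice; keeping separate min/max columns would preserve more information for later analysis.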

I know, I know. Only 3 out of 9. But it would actually take some serious effort to complete all 9 steps described above. Maybe I will slowly work toward completing them, but for now let’s focus on these first steps.

What are the current outcomes?

This is just to show what kind of information is available right off the bat, without any machine learning or exploratory data analysis.

Best paying companies

Just to be clear from the start, I want to point out that these figures are not to be taken seriously. We would need more observations per company to actually present this kind of chart.

Highest paying companies in Latvia

This plot will get closer to reality as we get more and more data.

Average salary by most needed professions  

Here we face the same issue as before: a lack of observations. But this time we also have a problem with very different position titles. Companies can get really creative with them, which is why I suggest analyzing professions by category and by skill requirements and/or responsibilities. But it will be anything but easy to accomplish.

Salaries for most common positions

As you can see, the column name “Monthly Salary” is misleading; it should simply be called “Count”. I suppose you are not familiar with the Latvian language, so I will also point out that there are duplicate positions that have been spelled differently.

Right here at the very start we see a couple of issues that will be handled by data cleaning, a topic we will have to look into right after we complete this web scraping part.

Future work

Obviously, if one would like to develop this project further, there is a need to generalize positions, parse position requirements, and build an offline database to store the data for longer periods.

This can become a serious tool for HR if taken seriously.

The code
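Below is a minimal sketch of steps 1 and 2 (scrape the job lists, save to a data frame and to disk) using requests, BS4, and pandas. The URL, element types, and class names (BASE_URL, “match”, “salary”, “company”) are placeholders I made up; the real ones depend on the job board’s markup and must be adapted:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# NOTE: this URL and all element/class names below are placeholders -
# inspect the actual job board's HTML and substitute the real ones.
BASE_URL = "https://example-job-board.lv/vacancies?page={}"

def parse_listing(html):
    """Parse one listing page; keep only posts that state a salary."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.find_all("li", {"class": "match"}):
        salary = li.find("span", {"class": "salary"})
        if salary is None:  # skip posts without salary information
            continue
        rows.append({
            "position": li.find("a").text.strip(),
            "company": li.find("span", {"class": "company"}).text.strip(),
            "salary_raw": salary.text.strip(),
            "link": li.find("a").get("href"),
        })
    return rows

def scrape_all(pages):
    """Steps 1-2: scrape every listing page, save to a data frame and to disk."""
    rows = []
    for page in range(1, pages + 1):
        response = requests.get(BASE_URL.format(page), timeout=10)
        rows.extend(parse_listing(response.text))
    df = pd.DataFrame(rows).drop_duplicates()
    df.to_csv("job_posts.csv", index=False)
    return df
```

Separating parsing (parse_listing) from fetching (scrape_all) makes the parsing logic easy to test against saved HTML without hitting the site.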
