Evaluating apartment market price with Machine Learning

I have taken a pause from blogging for quite some time because we (my fiance and I) have been busy lately. Busy looking for our first home: an apartment which we could design as we please and later rent out once we can afford to buy a house. #DreamBig

If you have gone through this process of manually scraping post boards in the hope of finding the “perfect one”, then you know it’s a nightmare. It does not matter what country or city you live in. It’s terrible. So I decided to build a tool that would assist me with this task. A web scraping project was started, and it quickly turned into a machine learning project once I understood that the data can be misleading and that I am not competent enough to judge what the market price of a given apartment should be.

Since the web scraping part is pretty darn easy (and surprisingly long due to all the cleaning that has to be done), I will skip it and review the machine learning part instead. Also, the script is pretty messy and has been running stably for only 15 days so far. I just don’t want to clean it up so that I could show it in public. Not this week.

As usual, I will give the full code at the end of the article.

The beginning: Imports

So for this project I currently use two algorithms: Gradient Boosting and Random Forest. (Yes, I love that lazy random forest… 🙂) And as luck would have it, both are part of the same scikit-learn module: sklearn.ensemble.

Other important mentions are pandas and the default helper for splitting data into training and test sets (sklearn.cross_validation.train_test_split; since scikit-learn 0.18 it lives in sklearn.model_selection).
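A minimal sketch of the imports described above. It assumes the regressor variants of both algorithms (the target is a price, so a continuous value) and a current scikit-learn, where train_test_split is imported from sklearn.model_selection. The MODEL_CLASSES name is only there for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumption: regressors rather than classifiers, because price is continuous.
MODEL_CLASSES = (GradientBoostingRegressor, RandomForestRegressor)
```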

The setup

As you might have suspected, the dataset (even though it’s built “in-house”) is neither split into training and testing sets nor perfectly filtered and formatted. I did make sure it’s formatted as well as possible within a reasonable number of code lines, so we won’t go overboard with formatting and pre-processing.

What we will have to do, however, is read the data into a dataframe from the csv that my previous code exported, keep only the columns we need and drop the rows with NaN values.
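Roughly, that step could look like the sketch below. The column names and the inline sample data are assumptions made up for illustration (the real export is a csv file on disk); the point is the order of operations: select columns first, then drop rows with missing values, so that a gap in an unused column does not throw away a good row.

```python
import io
import pandas as pd

# Hypothetical stand-in for the scraper's csv export; column names are assumed.
RAW_CSV = """\
Link,Region,District,Street,Project,Rooms,Floor,MaxFloor,Size,Price,Comment
http://example.com/1,CityA,North,Oak,Brick,2,3,5,50,60000,sunny
http://example.com/2,CityA,North,Elm,Brick,,2,9,41,39000,no room count
http://example.com/3,CityB,West,Birch,Panel,1,1,5,30,,no price
http://example.com/4,CityB,West,Birch,Panel,3,4,5,70,56000,
"""

KEEP_COLUMNS = ["Link", "Region", "District", "Street", "Project",
                "Rooms", "Floor", "MaxFloor", "Size", "Price"]


def load_posts(csv_source):
    """Read the export, keep the needed columns, drop incomplete rows."""
    df = pd.read_csv(csv_source)
    df = df[KEEP_COLUMNS]   # filter on columns first
    df = df.dropna()        # then drop rows with any NaN left
    return df.reset_index(drop=True)


posts = load_posts(io.StringIO(RAW_CSV))
print(posts)
```

With a real file you would pass its path to load_posts instead of the StringIO object.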


Nuts and bolts: The functions

So here comes the stuff that makes things more interesting – functions for calculating average prices on 3 different levels:

  • Street – Region – District – Project
  • Region – District – Project
  • Region – Project

In this data Street is the first word of the street name, Region is the state or city in which the Street and District are located, and District is the district in which the apartment is located. By Project we mean the type of building in which the apartment is located. This is a mandatory field in Sell posts, so sellers should be aware of what type their building is.

Important note: the data consists of information provided by sellers. In case it’s not obvious in the country where you live, I will point out that sellers are not always concerned about the correctness of the information in their posts. Sometimes mistakes are made by accident, sometimes intentionally. A frequent issue in this dataset is an incorrect deal type (Sell/Rent) or an incorrect Project.

With the three functions we will try to get average prices excluding the apartment in question. So if we are reading data for the 15th apartment in the dataset, we start by excluding it from the data and only then calculate the mean price per square meter of the remaining matches. If a function fails to get the value we are looking for (there are no other matching posts), we fall back on the next larger aggregation: Street Price -> Local Price -> Project Price.

We aim for two things with this:

  1. Letting the machine learning algorithm know that this specific flat is somewhat different from several others based on average values. Each city and street should have different average values, given that there is enough data for it.
  2. Providing a simple hint on what the price should be like, since most if not all real estate valuation methods rely on comparison with similar properties.
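Here is a small sketch of that idea, not the original functions. It uses one generic helper instead of three separate functions, and the column names (Street, Region, District, Project, Size, Price) and sample rows are assumptions for illustration. For each level it averages the price per square meter of all other matching posts, and falls back to the next broader level when there are none.

```python
import pandas as pd

# From the narrowest to the broadest aggregation level.
LEVELS = [
    ["Street", "Region", "District", "Project"],  # Street Price
    ["Region", "District", "Project"],            # Local Price
    ["Region", "Project"],                        # Project Price
]


def mean_sqm_price_excluding(df, idx, keys):
    """Mean price per m2 of rows matching row `idx` on `keys`, without `idx`.

    Returns None when no other row matches.
    """
    row = df.loc[idx]
    others = df.drop(index=idx)  # exclude the apartment in question
    mask = pd.Series(True, index=others.index)
    for key in keys:
        mask &= others[key] == row[key]
    peers = others[mask]
    if peers.empty:
        return None
    return float((peers["Price"] / peers["Size"]).mean())


def sqm_price_with_fallback(df, idx, start_level=0):
    """Try the given level first, then each broader one."""
    for keys in LEVELS[start_level:]:
        price = mean_sqm_price_excluding(df, idx, keys)
        if price is not None:
            return price
    return None


flats = pd.DataFrame({
    "Street":   ["Oak", "Oak", "Elm", "Pine", "Birch"],
    "Region":   ["CityA", "CityA", "CityA", "CityA", "CityB"],
    "District": ["North", "North", "North", "South", "West"],
    "Project":  ["Brick", "Brick", "Brick", "Brick", "Panel"],
    "Size":     [50, 40, 60, 50, 50],
    "Price":    [50000, 48000, 54000, 40000, 30000],
})

street_price_0 = sqm_price_with_fallback(flats, 0)  # another Oak flat exists
street_price_2 = sqm_price_with_fallback(flats, 2)  # no other Elm flat: falls back
street_price_4 = sqm_price_with_fallback(flats, 4)  # nothing comparable at all
```

In this toy data the first flat gets its street average, the Elm flat falls back to the district-level average, and the lone CityB flat gets None, which the caller has to handle (for example by dropping the row).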


The loop

The big event is here: the loop through the dataframe in order to clean, reformat and add data to the existing dataset.

Basically we will just make sure that we have integers for Rooms, Floor and Max Floor, and floats for prices and apartment size. We will also add data using our previous 3 functions, but first we will check that the link is a string and that the floors are not astronomical. 🙂

Once we have gathered everything in a list of lists, we will build another dataframe named xData, which will be the actual source for our machine learning practice.
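A stripped-down sketch of such a loop is below. The column names, the sample rows and the floor cut-off (MAX_PLAUSIBLE_FLOOR) are assumptions, and the three average-price features are left out to keep it short; they would be appended to each row at the marked spot.

```python
import pandas as pd

MAX_PLAUSIBLE_FLOOR = 50  # assumed cut-off for "astronomical" floors


def build_xdata(df):
    """Clean and cast each post, collect rows in a list of lists."""
    rows = []
    for _, post in df.iterrows():
        if not isinstance(post["Link"], str):
            continue  # no usable link
        try:
            rooms = int(float(post["Rooms"]))
            floor = int(float(post["Floor"]))
            max_floor = int(float(post["MaxFloor"]))
            size = float(post["Size"])
            price = float(post["Price"])
        except (TypeError, ValueError):
            continue  # value cannot be read as a number
        if floor > max_floor or max_floor > MAX_PLAUSIBLE_FLOOR or size <= 0:
            continue  # implausible post
        # The street / local / project average prices would be added here.
        rows.append([post["Link"], rooms, floor, max_floor,
                     size, price, price / size])
    columns = ["Link", "Rooms", "Floor", "MaxFloor",
               "Size", "Price", "SqmPrice"]
    return pd.DataFrame(rows, columns=columns)


raw = pd.DataFrame({
    "Link": ["http://example.com/1", None,
             "http://example.com/3", "http://example.com/4"],
    "Rooms": ["2", "3", "1", "x"],
    "Floor": ["3", "2", "120", "1"],
    "MaxFloor": ["5", "9", "125", "5"],
    "Size": ["50", "41", "30", "20"],
    "Price": ["60000", "39000", "30000", "20000"],
})

xData = build_xdata(raw)
print(len(raw), "->", len(xData))  # how much of the dataset is left
print(xData.dtypes)                # check that the types are what we expect
```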

I had some bizarre errors along the way when writing this short program, so I added a few more lines to it just to check what’s left of my initial dataset and whether the data types are correct.

Finally: The ML part

So finally we got to the fun part: training our machine learning models and testing them. Actually this part is so straightforward that I could have copied it from the examples shown on http://scikit-learn.org (but I didn’t). Still, I won’t go into too much detail here, as you have seen this code dozens of times if you are interested in ML.

I would recommend playing around with the features I used here until you get the best results for yourself. From what I noticed, changing the other features does not improve the resulting model scores.

Tip: I observed that on my data a Gradient Boosting model with max_depth equal to the number of features gives the highest scores!
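For reference, a training run along these lines might look like the sketch below. The data is synthetic and randomly generated only so that the example runs on its own; the feature names are assumptions, and the scores it prints say nothing about the real dataset. It follows the tip above by setting max_depth to the number of features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for xData (assumed feature names, made-up values).
rng = np.random.default_rng(0)
n = 300
xData = pd.DataFrame({
    "Rooms": rng.integers(1, 5, n),
    "Floor": rng.integers(1, 10, n),
    "Size": rng.uniform(25, 120, n),
    "StreetPrice": rng.uniform(600, 1500, n),
})
xData["Price"] = xData["Size"] * xData["StreetPrice"] * rng.normal(1.0, 0.1, n)

features = ["Rooms", "Floor", "Size", "StreetPrice"]
X_train, X_test, y_train, y_test = train_test_split(
    xData[features], xData["Price"], test_size=0.25, random_state=0)

models = {
    # max_depth equal to the number of features, as in the tip above.
    "gradient_boosting": GradientBoostingRegressor(
        max_depth=len(features), random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out set
    print(name, round(scores[name], 3))
```

For regressors, score returns the coefficient of determination (R²), so values closer to 1 are better.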


Despite the simplicity of this model and the limited number of observations, it is surprisingly accurate: it misses the price by more than 30% only around 10% of the time. To my mind that is quite low, taking into account that some flats are in extremely bad shape (in which case this value represents the potential price after renovation) and some are extremely luxurious (in which case they are not for you; otherwise you wouldn’t be looking at the actual market value of a property).
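One way to compute such a miss rate is sketched below. It assumes the error is measured relative to the actual price; the function name and the numbers in the example are made up for illustration.

```python
def miss_rate(actual, predicted, tolerance=0.30):
    """Share of predictions that are off by more than `tolerance`.

    The error is taken relative to the actual price (an assumption).
    """
    misses = sum(
        1 for a, p in zip(actual, predicted)
        if abs(p - a) / a > tolerance
    )
    return misses / len(actual)


# Made-up prices: the errors are 0%, 20%, 31% and 40%.
actual = [100, 100, 100, 100]
predicted = [100, 120, 131, 60]
rate = miss_rate(actual, predicted)
print(rate)  # two of the four predictions miss by more than 30%
```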

[Figure: Accuracy (% of total)]

As always, there are things to be improved here, but overall I am satisfied with the result, since the outcome gets more precise with each day of new data. To improve this algorithm I would need to bring dates and the rent/sell field into play, as well as treat the post text somehow, possibly with a bag-of-words method. But I am not sure if I will be doing any of that, since I have found the apartment we were looking for and bought it. 🙂

Full code as promised:
