
Analysis of apartment prices in Warsaw – linear regression

For as long as I can remember, I have been interested in the real estate market. This is largely because I changed apartments often and each time handled all of the sale and purchase formalities myself. Searching for a good location is an extremely exciting, but also time-consuming, process. Besides gaining insight into prices, I became familiar with the topography of the city and learned first-hand how important a good transport connection to the city center is.

Recently, I decided to analyze apartment prices with machine learning algorithms and see whether, starting from raw data, I could find the relationships that shape the final value of a property.

The ideal model for this kind of problem is linear regression. It is a relatively simple algorithm that can find relationships between the features of an object (the explanatory variables) and the response (here, the final price).

The simplest case is simple linear regression, in which a single explanatory variable affects the response (in our case, the price). The model can, of course, be generalized to multiple linear regression, where the response is shaped by many factors.
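In notation, simple linear regression fits a model of the form

y = w0 + w1·x

while the multiple variant has one coefficient per explanatory variable:

y = w0 + w1·x1 + w2·x2 + … + wn·xn

where w0 is the intercept and the remaining coefficients (the slopes) are estimated from the data.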

Using the scraper I created while writing the article How to teach a program to understand colloquial speech, I obtained a database of approximately 850 properties with features such as the number of rooms, the area, the district, the price per square meter and the total price. This gave me data on the vast majority of apartments sold in Warsaw at that time.

Data quality

Getting down to specifics, let’s display the data using the DataFrame structure from the Pandas library:

import pandas as pd

# load the scraped data; the columns are: district, total price, price per m2, number of rooms, area
df = pd.read_csv('C:\\UczenieMaszynowe\\mieszkania.csv', header=None, sep=';')
df.columns = ['DZIELNICA', 'CENA', 'CENA_ZA_M2', 'LICZBA_POKOI', 'POWIERZCHNIA']

df

and let’s sort them by the CENA_ZA_M2 (price per m²) column:

df.sort_values("CENA_ZA_M2")

As you can see, some of the records are damaged and contain NaN (“Not a Number”) values, a special numeric value denoting an undefined quantity.
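Before dropping them, we can quickly check how many rows are affected; a minimal sketch using the DataFrame defined above:

df.isna().sum()              # number of missing values in each column
df[df.isna().any(axis=1)]    # preview of the damaged rows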

Let’s get rid of the worthless rows using the dropna method of the DataFrame object:

df = df.dropna()

This leaves 825 records, which is still plenty of data.

We can now look at the column types. In problematic cases, a column can be converted explicitly:

df = df.astype({'LICZBA_POKOI':'int'})

A cursory analysis shows that Praga-Południe, Ursus, Wesoła and Białołęka are the cheapest districts, while in Śródmieście prices reach PLN 30,000 and in three cases even exceed PLN 40,000 per m²! The last record, an apartment costing PLN 20 million with a price per square meter above PLN 100,000, is an almost 200 m² apartment in the Złota 44 (Sail) development.
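Such a ranking can also be read directly from the data by grouping by district; a minimal sketch using the columns defined above:

# average price per m2 in each district, from cheapest to most expensive
df.groupby('DZIELNICA')['CENA_ZA_M2'].mean().sort_values()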

Data mining

Let’s now create a scatter plot matrix and check whether any correlations are visible in the data, i.e. whether any of the explanatory variables influences the response (the price).

For this purpose, let’s use the pairplot function from the seaborn library.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', context='notebook')
cols = [ 'CENA_ZA_M2', 'LICZBA_POKOI', 'POWIERZCHNIA', 'CENA']

sns.pairplot(df[cols], height=2.5)
plt.show()

You can see that there is a linear relationship between the features:

  • AREA and PRICE,
  • PRICE_PER_M2 and PRICE.

Using the functions of the seaborn library, let’s now generate the correlation matrix in the form of a heat map.

import numpy as np

# Pearson correlation matrix of the selected columns
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)

hm = sns.heatmap(cm,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 15},
                 yticklabels=cols,
                 xticklabels=cols)
plt.show()

The map confirms the correlation between AREA and PRICE (0.82) and between PRICE_PER_M2 and PRICE (0.69). Both are fairly strong positive correlations, with values closer to 1 than to 0.

I decided to enrich the data and look for further features that could influence the final price of a property. For this purpose, using data from the publication Panorama of Warsaw’s districts in 2019 [1], I added to each property the total number of crimes recorded in its district.

This turned out to be a dead end: I found no correlation between crime and price. After thinking about it, I came to the conclusion that crime is high in both Praga-Południe and Śródmieście, yet these two districts differ significantly in the price per square meter of real estate.
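For completeness, here is a sketch of how such an enrichment could be checked. The file przestepczosc.csv and the column LICZBA_PRZESTEPSTW are hypothetical placeholders for the per-district crime counts taken from [1]:

# hypothetical file with per-district crime counts derived from [1]
przestepczosc = pd.read_csv('przestepczosc.csv', sep=';')   # columns: DZIELNICA, LICZBA_PRZESTEPSTW

df_crime = df.merge(przestepczosc, on='DZIELNICA', how='left')
df_crime[['CENA_ZA_M2', 'LICZBA_PRZESTEPSTW']].corr()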

Pursuing the topic further, I used the scraper again, but this time I extracted from the portal’s page source the latitude and longitude of selected properties, which are used there to mark the location on the map.

"location":{"latitude":52.2856221,"longitude":21.0445192

Then, using Google Maps, I determined the approximate travel time to Śródmieście from several dozen randomly selected apartments.

Even with such a small sample, I was able to determine that CENA_ZA_M2 is inversely related to the number of minutes of travel to the center: as the travel time increases, the price per square meter decreases.
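This can be checked with a simple correlation coefficient; a sketch, assuming a hypothetical CZAS_DOJAZDU column (travel minutes to Śródmieście) has been added to the sampled apartments:

# correlation between travel time and price per m2 for the sampled rows;
# a negative value indicates the inverse relationship
probka = df[df['CZAS_DOJAZDU'].notna()]
print(np.corrcoef(probka['CZAS_DOJAZDU'], probka['CENA_ZA_M2'])[0, 1])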

Linear regression

Let’s move on to linear regression analysis. Let’s use the LinearRegression object from the scikit-learn library for this purpose.

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

Let’s assign the values of the AREA and PRICE fields to the X and y variables.

X = df[['POWIERZCHNIA']].values
y = df['CENA'].values

But this time let’s work with standardized data, which we will assign to the new variables X_std and y_std.

Standardizing data for machine learning models means transforming the original values so that each feature has a mean of 0 and a standard deviation of 1.
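The transformation itself is simple. For each value x in a column:

z = (x − μ) / σ

where μ is the mean of the column and σ its standard deviation.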

This procedure is used when we correlate variables on completely different scales, for example when one variable ranges from 1 to 10 and another takes values in the millions, i.e. several orders of magnitude higher.

After standardizing the variables, the scale of their values ceases to matter, and only their spread, i.e. variance, matters.

Note that standardizing data does not guarantee that it will fall between 0 and 1.

Standardization should be performed using the StandardScaler class from the scikit-learn library.

Let’s now display the POWIERZCHNIA column before and after standardization.
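A minimal sketch of that comparison, showing only the first few values:

from sklearn.preprocessing import StandardScaler

print(df['POWIERZCHNIA'].head())

sc = StandardScaler()
powierzchnia_std = sc.fit_transform(df[['POWIERZCHNIA']].values)
print(powierzchnia_std[:5].flatten())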

Now let’s move on to determining the regression line:

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
slr = LinearRegression()
slr.fit(X_std, y_std)
y_pred = slr.predict(X_std)

print('Nachylenie: %.3f' % slr.coef_[0])
print('Punkt przecięcia: %.3f' % slr.intercept_)

Nachylenie: 0.837
Punkt przecięcia: 0.000

def lin_regplot(X, y, model):
    # scatter the data points and draw the fitted regression line
    plt.scatter(X, y, c='lightblue')
    plt.plot(X, model.predict(X), color='red', linewidth=2)
    return

lin_regplot(X_std, y_std, slr)

plt.xlabel('POWIERZCHNIA')
plt.ylabel('CENA')

plt.tight_layout()
plt.show()

Thanks to the above algorithm, we obtained a graph of the relationship between standardized apartment prices and their standardized areas, with a linear regression line plotted. Note that with standardized variables the intercept comes out as zero by construction, since both variables have a mean of 0.

However, if we stay with the raw data, which are still available in the X and y variables (area in m² and price in PLN), the chart becomes easier to read.

slr = LinearRegression()
slr.fit(X, y)
y_pred = slr.predict(X)

print('Nachylenie: %.3f' % slr.coef_[0])
print('Punkt przecięcia: %.3f' % slr.intercept_)

Nachylenie: 20484.317
Punkt przecięcia: -235808.352

lin_regplot(X, y, slr)

plt.xlabel('POWIERZCHNIA')
plt.ylabel('CENA')

plt.tight_layout()
plt.show()

As you can see, the values on the PRICE axis are now given in millions of PLN and on the AREA axis in m². Data standardization should therefore be applied when it is actually needed, not as a matter of course.
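As a quick usage example, the model fitted on the raw data can estimate the price of a hypothetical apartment, e.g. one of 60 m²:

# predicted price of a 60 m2 apartment; with the coefficients above this is
# roughly 20484 * 60 - 235808, i.e. about PLN 993 thousand
print(slr.predict(np.array([[60.0]])))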

In this article, I showed how machine learning algorithms can be used to find the relationships that shape real estate prices.

Linear regression is also widely used in other fields, such as economics, sociology and business.


Sources:

[1] Panorama of Warsaw’s districts in 2019 https://warszawa.stat.gov.pl/publikacje-i-foldery/inne-opracowania/panorama-dzielnic-warszawy-w-2019-r-,5,21.html

GitHub repository of Dr. Sebastian Raschka: https://github.com/rasbt