The real estate industry is one of the industries that generates the most activity and growth in the world. In the United States, according to annualized data from the Bureau of Economic Analysis (BEA) for the second quarter of 2021, the housing-related sector is one of the industries that generates the greatest economic activity in that country, contributing $2.908 trillion to GDP. According to data from the IMF, also annualized, for October 2021, this figure exceeds the GDP of countries such as Spain, Italy, Brazil, Australia, Canada, Russia and South Korea, and is equivalent to 48.38 times that of Uruguay.
The value of a property depends on many factors, among which experts highlight the following.
Location: One of the main factors is captured by the famous real estate phrase "location, location, location", referring to the fact that a large part of a property's value depends on the quality of the opportunities and services available by living in that place, such as the chances of getting a good job, security, proximity to quality schools, etc.
Space: It is a fundamental factor in determining value, with special emphasis on usable space, which excludes the area corresponding to attics, basements and places that people rarely use in their day to day. Many people choose to live in suburbs where they can have larger interiors, gardens or backyards rather than in a smaller property in the city center, even though this deprives them of the ability to access opportunities and services with the same ease. This can be seen to a great extent in countries such as the United States or Canada, but also in Uruguay with Ciudad de la Costa.
Format: Buyers are willing to pay for a home that meets their current needs or their future prospects, as in the case of a family that expects to have more children and is looking for a house large enough to do so. Characteristics such as the number of rooms, the number of bathrooms and whether or not the property has a garage stand out here. The format of the house itself is also strongly related to its price: a house is not the same as an apartment, with all the variants each may have.
Condition: The condition of a property is a fundamental factor in its price, which is why it is economically worthwhile to recondition a property before selling it. Moreover, the older a construction is, the more likely it is to have defects, and therefore modernity is valued.
Understanding the business
The purpose of this case study is to develop the CRISP-DM methodology on a problem for academic purposes, not with a deployment objective or as part of a larger product.
Data preparation, modeling and evaluation will be carried out to predict the value of homes from characteristics of the properties and their surroundings, using Python and libraries such as SciKitLearn, Pandas, Numpy and Seaborn, among others. The complete code referred to in this work is available in the last section of the Annex.
Data understanding
For this project we have property sales data for Ames, a town of 66,000 inhabitants in the state of Iowa, United States, covering 2006 to 2010, obtained from the Ames City Assessor's Office and preprocessed by Dean De Cock, PhD in Probability from Iowa State University. The records come in two datasets, one for training and one for testing.
Downloads:
- Data description
- train.csv
- test.csv
Analysis of attributes
The datasets to be used have 78 predictors plus, in the case of the training set, the continuous numerical target variable. This variable will be predicted with the modeling to be developed, making this a supervised regression problem.
Of the 78 input variables, it is very important to take their different types into account, which in this case are defined in the data description for each one. In this way we have:
- 44 categorical (23 nominal and 21 ordinal)
- 34 numerical (19 continuous and 15 discrete)
Although both nominal and ordinal attributes are categorical, they differ in that ordinal values can be ordered, while nominal values are simply characteristics whose values are hardly comparable to each other, in the sense that one cannot be considered greater or less than another by itself.
This is better explained by considering two attributes of the problem: ExterQual (ordinal) and Foundation (nominal).
ExterQual: refers to the quality of the materials on the exterior of the house, and its possible values are Ex (excellent), Gd (good), TA (typical/average), Fa (fair) and Po (poor). It is easy to transfer these to a scale in which the first values are greater than the last, since it is indisputable that in any case excellent quality is better than average or poor.
Foundation: describes the foundation of a house, with the classes BrkTil (brick and tile), CBlock (cinder block), PConc (poured concrete), Slab (slab), Stone (stone) and Wood (wood). Given that in some areas it might be better to have one type of foundation and in others another, so that any ordering would depend on circumstances unrelated to the variable itself, it can hardly be ordered.
Analysis of examples
As for the number of examples, there are 1,460 labeled records in the training set and 1,459 unlabeled records in the test set. The case study will focus on the training dataset, since it allows us to measure the performance of the prediction models by comparing their outputs with the labels provided. All of the analysis that follows therefore refers to that dataset.
Missing Data
Analyzing the missing data, it can be seen that it all corresponds to the continuous variables LotFrontage (28), GarageYrBlt (81) and MasVnrArea (8).
Incidentally, it is worth pointing out that many categorical variables include NA as one of their possible values, an example being GarageType, where it refers to properties that do not have a garage. These observations with NA should not be confused with missing data.
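As a reference, a minimal sketch of how this count can be obtained with Pandas (the variable name train and the file path are assumptions, not part of the original code):

```python
import pandas as pd

# Load the training set (path assumed)
train = pd.read_csv("train.csv")

# Count missing values per column and keep only the columns that have any.
# Note: if the categorical NA placeholders have not yet been replaced
# (see the data preparation section), pandas parses them as NaN and
# they would also be counted here.
missing = train.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```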
Correlations
It is interesting to observe the correlation that each predictor maintains both with the target variable and with the other predictors.
We will study the correlation matrix generated from the Pearson coefficient, which measures the linear correlation between two variables on a scale from -1 to 1: the closer to zero, the weaker the linear dependence; the closer to 1, the stronger the direct relationship; and the closer to -1, the stronger the inverse relationship. This coefficient can only be calculated for numerical variables or, as will be done below, also for ordinal variables after converting them to a numerical scale.
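A minimal sketch of how such a matrix can be computed and visualized, assuming train is the DataFrame from before with the ordinal variables already encoded numerically:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation matrix over the numeric (and encoded ordinal) columns
corr = train.select_dtypes(include="number").corr(method="pearson")

# Heatmap to visualize the linear dependencies at a glance
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```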
Filtering for the attributes whose Pearson correlation with the target variable is greater than 0.5 or less than -0.5, the following 13 predictors are obtained, in order.
Performing the correlation calculations against the target variable for the 55 predictors, taking the numeric ones together with the ordinal ones after encoding them numerically, only 13 show a correlation greater than 0.5, which is 23%. In addition, these relationships are all direct; none are inverse.
Next, the linear correlations between the predictors are calculated, resulting in this Pearson matrix.
Regarding the correlations between the predictors themselves, positive values were obtained in all cases, so they all have some degree of direct linear dependence, however minimal. Values greater than 0.66 were considered high correlations, which occurs for the pairs GrLivArea and TotRmsAbvGrd with 0.82, GarageArea and GarageCars with 0.88, and TotalBsmtSF and 1stFlrSF with 0.81.
Outliers
Skewness
Viewing the histograms, right skew could be seen in YearBuilt and YearRemodAdd, which correspond respectively to the year of construction and the year of the last remodeling of the properties. The registry contains much more data on houses built or remodeled in recent years, with 40% and 59% of the sample, respectively, having a year after 1980.
Nominal attribute distributions
When analyzing the nominal variables, it is useful to observe their different categories in relation to the target variable. This is possible with box plots, which let us visualize how much variability there is between the classes of a nominal category with respect to the target variable, through the range, median and outliers they show. If the variability is high, then it is a good variable to learn from (see the sketch after the list below).
The most notable diagrams are shown below:
- Neighborhood
- HouseStyle
- Foundation
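For illustration, a minimal sketch of one of these diagrams with Seaborn, assuming train holds the training DataFrame:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the target variable per class of a nominal predictor,
# e.g. Neighborhood: wide differences between boxes suggest a useful variable
plt.figure(figsize=(12, 5))
sns.boxplot(x="Neighborhood", y="SalePrice", data=train)
plt.xticks(rotation=45)
plt.show()
```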
Data preparation
Training sets generated
Three datasets with different characteristics were assembled in order to train models on them and make comparisons.
Training data 1: Contains all the predictors of the dataset, with the numerical variables and the encoded categorical variables, and without missing data.
Training data 2: With attributes selected as explained below, the numerical variables and the encoded categorical variables, but without further pre-processing. Its objective is to test the feature selection performed.
Training data 3: Includes the same attributes selected for the previous case, with the numerical and encoded categorical variables, but the other data pre-processing techniques shown below were also applied to it. Its purpose is to test the applied pre-processing methods.
Loading data
The data was loaded, and the NA values of the categorical variables were replaced by a word indicating the absence of whatever the predictor refers to, which can be garages, basements, etc.
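A minimal sketch of this step; the exact list of columns is illustrative (these are categorical attributes of the dataset where NA means the feature is absent), and the placeholder word "None" is an assumption:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Categorical columns where NA means "feature absent" rather than missing data
na_means_absent = ["GarageType", "GarageFinish", "GarageQual", "GarageCond",
                   "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
                   "BsmtFinType2", "FireplaceQu", "PoolQC", "Fence", "Alley",
                   "MiscFeature"]

# pandas parses the NA strings as NaN, so fill them with a placeholder word
train[na_means_absent] = train[na_means_absent].fillna("None")
```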
Coding ordinal attributes (for training data 1, 2 and 3)
Ordinal attributes were encoded by transforming their scales to discrete values with SKLearn's OrdinalEncoder(). The important thing here is to maintain the intended ordering: by default the encoder assigns values following the alphabetical order of the categories, which does not match the intended scale, so that when calculating correlations, for example with the target variable, the opposite value can be obtained from what it should be. This can be easily fixed as shown in the Annex.
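As an illustration, a minimal sketch using ExterQual, passing the categories explicitly so the intended order (worst to best) is preserved instead of the default alphabetical one:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit scale from worst to best, so Po -> 0, ..., Ex -> 4
quality_scale = ["Po", "Fa", "TA", "Gd", "Ex"]
encoder = OrdinalEncoder(categories=[quality_scale])

# fit_transform expects a 2D input, hence the double brackets
train[["ExterQual"]] = encoder.fit_transform(train[["ExterQual"]])
```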
Selection of numerical and ordinal attributes (for training data 2 and 3)
We kept the predictors that have a correlation greater than 0.5 or less than -0.5 with respect to the target attribute, and a correlation with each of the others between -0.66 and 0.66 in all cases, obtaining the following variables.
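A minimal sketch of the first filter, assuming corr is the Pearson matrix computed in the correlations section:

```python
# Keep predictors whose |correlation| with the target exceeds 0.5
target_corr = corr["SalePrice"].drop("SalePrice")
selected = target_corr[target_corr.abs() > 0.5].index.tolist()

# The second filter (dropping one of each pair of predictors whose mutual
# |correlation| exceeds 0.66) can be done by inspecting this submatrix
print(corr.loc[selected, selected])
```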
Nominal attribute selection (for training data 2 and 3)
Box plot diagrams were examined to analyze which attributes provide the greatest variability with respect to the target variable. From this, the predictors HouseStyle, Foundation, CentralAir and Neighborhood were obtained. The others were discarded because no considerable variability between their classes with respect to the target variable, decisive for the problem, could be justified.
Coding of nominal attributes (for training data 1, 2 and 3)
The nominal attributes were encoded with one-hot encoding, available in both SKLearn and Pandas. It generates dummy variables, which are simply new binary attributes for each of the different values that the selected nominal variables can take, so the original variables do not lose their nominal nature of having no defined order.
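A minimal sketch using the Pandas variant, applied to the nominal predictors selected above:

```python
import pandas as pd

# One dummy (binary) column is generated per class of each nominal attribute,
# and the original columns are dropped
nominal = ["HouseStyle", "Foundation", "CentralAir", "Neighborhood"]
train = pd.get_dummies(train, columns=nominal)
```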
Removal of outliers (for training data 3)
After selecting the attributes to work with, and based on the analysis done earlier, outliers were eliminated by removing the observations with GrLivArea > 4000 and TotalBsmtSF > 3000, as sketched after the plots below.
(Figures: GrLivArea, TotalBsmtSF)
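A minimal sketch of this filter, assuming train is the DataFrame being prepared:

```python
# Keep only the observations inside the chosen thresholds
train = train[(train["GrLivArea"] < 4000) & (train["TotalBsmtSF"] < 3000)]
```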
Treatment of skewness (for training data 3)
Skewness in the YearBuilt and YearRemodAdd attributes was treated with logarithmic transformations, as sketched after the plots below.
(Figures: YearBuilt, YearRemodAdd)
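A minimal sketch, assuming np.log1p as the concrete logarithmic transform (the text only specifies that a logarithm was used):

```python
import numpy as np

# Logarithmic transformation to reduce the skew of the two year attributes
for col in ["YearBuilt", "YearRemodAdd"]:
    train[col] = np.log1p(train[col])
```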
Missing data (for training data 1)
For set 1, the rows that contained missing data had those values replaced with the mean of the corresponding attribute. Sets 2 and 3 had no missing values.
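A minimal sketch of this mean imputation, applied to the three variables identified in the missing data analysis:

```python
from sklearn.impute import SimpleImputer

# Replace missing values with the column mean (training data 1 only)
cols = ["LotFrontage", "GarageYrBlt", "MasVnrArea"]
imputer = SimpleImputer(strategy="mean")
train[cols] = imputer.fit_transform(train[cols])
```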
Standardization (for training data 3)
Numerical and ordinal variables were standardized so that differences in their scales are not a determining factor.
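A minimal sketch, assuming all numeric columns except the target are to be standardized:

```python
from sklearn.preprocessing import StandardScaler

# Rescale to zero mean and unit variance so no variable dominates by scale
numeric_cols = train.select_dtypes(include="number").columns.drop("SalePrice")
scaler = StandardScaler()
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
```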
Modeling
In deciding which models to develop, it is essential to consider the type of problem, which in this case consists of predicting the value of a property from a set of predictors; it is therefore a supervised regression problem. There is a set of algorithms that can be used under these conditions, among them linear regression (including its Lasso and Ridge variants), decision trees, random forests, gradient boosting regression, k-nearest neighbors and support vector regression, among others. To decide between the applicable models, it is necessary to look at the structure of the data in this particular problem.
As studied in the data analysis section with the Pearson coefficients and the plots, the predictors maintain a linear relationship with the output, which is accentuated for the attributes selected and worked on in the data pre-processing. Since linear regression assumes a linear relationship between the inputs and the output, applying it to this case is a good idea, so it will be used below. In addition, we will apply the Random Forest ensemble algorithm to make a comparison between the two.
The selected models will be trained with the 3 training datasets. For this, 10-fold cross-validation will be used in order to avoid the problem in which the testing part happens to be easier or harder for a specific model, as could occur with the hold-out technique.
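A minimal sketch of this setup, assuming train is one of the already prepared datasets (fully encoded, no missing values) and using the Random Forest hyperparameters reported in the results section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X = train.drop(columns="SalePrice")
y = train["SalePrice"]

models = {
    "Linear regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=1000, max_features=20),
}

for name, model in models.items():
    # 10-fold cross-validation; sklearn returns negated MSE, so flip the sign
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores).mean()
    print(f"{name}: RMSE = {rmse:.0f}")
```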
Results
RMSE, or root mean squared error, will be used to measure the results. It is the square root of the average of the squared differences between the predictions and the labels of the dataset, so the smaller the better.
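In formula form (a standard definition, where y_i are the labels, ŷ_i the predictions and n the number of examples):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```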
The following figures were obtained.
Linear regression
Training data 1: RMSE = 25935
Training data 2: RMSE = 25744
Training data 3: RMSE = 24911
Random Forest
With 1,000 trees and a maximum of 20 attributes to consider at each split.
Training data 1: RMSE = 24564
Training data 2: RMSE = 27302
Training data 3: RMSE = 27692
Conclusions
The results show that carefully selecting both ordinal and nominal attributes, and applying other data pre-processing methods such as standardization, removal of outliers and transformations, is worthwhile in the case of linear regression, since it improved the metric used to measure performance, as can be seen both from case 1 to 2 and from 2 to 3.
Regarding Random Forest, it gives better results with the original attributes, without attribute selection or the other data processing methods. This may be because those modifications lose part of the good representation of the problem, which is the only requirement this algorithm demands in terms of data preparation.
This verifies, in a practical case, that what may be good data preparation for one machine learning algorithm may be bad for another. Thus, to obtain better performance on a problem, it is not enough to apply different algorithms to the same prepared dataset and observe which one performs best; it is also necessary to test different data preparations on the different machine learning algorithms so as to get the most out of each of them.
On the other hand, it is interesting that, through the development carried out both in the data analysis (for example, the high correlation observed between available size and property price) and in the data preparation and modeling (a concrete case being the selection of certain nominal variables, such as the neighborhood, that greatly improve model performance), it is possible to see in a real case the effect of the property characteristics that the experts cited at the beginning, in the context section, highlight as the most important. Correctly including and processing the factors that stand out the most is what ultimately determines that the models we build make better predictions.