Kaggle House Prices challenge
This is an overview of the approach used on the Kaggle House Prices challenge. The training dataset consists of 1460 house sale price observations with 81 features, while the test set consists of 1459 houses with the sale price omitted. The evaluation metric is the RMSE of the log of the sale price.
Data Cleaning
In the R Markdown file https://github.com/richcorrado/Housing-Prices-ART/blob/master/EDA_plus_feateng.Rmd, I do an exhaustive exploratory data analysis. I find that around 30 of the feature columns contain missing entries, so the first step is to determine the nature of the missing data. Where appropriate, I use decision trees from the rpart package to impute missing values, using a subset of relevant features as predictors. I expect this to be slightly better than mean/median or nearest-neighbor imputation.
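The imputation in the Rmd is done with rpart in R. Below is a minimal Python sketch of the same idea, assuming the combined data sits in a pandas DataFrame; the choice of LotFrontage as the target and the predictor columns are illustrative, not the exact recipe used in the analysis.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def impute_with_tree(df, target="LotFrontage", predictors=("LotArea", "GrLivArea")):
    """Fill missing values of `target` with predictions from a shallow decision tree.

    The predictor columns are an illustrative choice and are assumed to be complete.
    """
    known = df[df[target].notna()]
    missing_mask = df[target].isna()
    if not missing_mask.any():
        return df
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    tree.fit(known[list(predictors)], known[target])
    df.loc[missing_mask, target] = tree.predict(df.loc[missing_mask, list(predictors)])
    return df
```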
A complication here is that 11 features describe the houses' basement, while around 80 houses from the combined training + test set have no basement at all. Similarly, 7 features describe the garage, while around 159 houses have no garage. In the analysis, I explore fitting models to the various subsets defined by the NoBasement and NoGarage boolean features.
I also examine every feature for typos or mutual inconsistencies. For instance, the MSSubClass feature is a number corresponding to an industry code for the type of dwelling. In many cases, it was found to be inconsistent with the HouseStyle feature, which contains roughly the same information stored as a factor whose levels are the dwelling types. In reconciling errors, I assumed that HouseStyle was less prone to coding error, since its levels are words rather than numerical codes.
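As a sketch of the kind of consistency check involved (assuming the combined data is in a DataFrame `df`), a cross-tabulation of the two features makes conflicting combinations easy to spot:

```python
import pandas as pd

# Cross-tabulate the numeric dwelling code against the descriptive style.
ct = pd.crosstab(df["MSSubClass"], df["HouseStyle"])

# Codes whose counts are spread over several HouseStyle levels are candidates
# for hand reconciliation, trusting HouseStyle over MSSubClass.
suspect_codes = ct.index[(ct > 0).sum(axis=1) > 1]
print(ct.loc[suspect_codes])
```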
Feature Engineering
In a second round through the list of features, I consider the tidy data obtained in the first round and begin engineering new features.
For categorical features, I generally used two types of encoding. Uniformly, I used one-hot encoding via the caret package during a later round of preprocessing. In addition, during this second round, in most cases I computed the median log(SalePrice) for each level and defined an ordinal encoding based on those median prices. The ordinals for SaleCondition, GarageQual and MSZoning turned out to be among the top 20 features in my Lasso models. The Neighborhood feature had 25 levels, so I also defined a feature assigning each house to one of 6 bins based on median price.
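A minimal sketch of the median-price ordinal encoding and the neighborhood binning, assuming the training data is loaded into a DataFrame `train` (the file path and the use of `pd.qcut` for the 6 bins are illustrative choices):

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")          # hypothetical path to the Kaggle training data
train["LogSalePrice"] = np.log(train["SalePrice"])

def ordinal_from_median_price(df, col):
    """Rank the levels of `col` by median log(SalePrice) and map them to 1, 2, 3, ..."""
    medians = df.groupby(col)["LogSalePrice"].median().sort_values()
    return df[col].map({level: rank + 1 for rank, level in enumerate(medians.index)})

for col in ["SaleCondition", "GarageQual", "MSZoning"]:
    train[col + "_ord"] = ordinal_from_median_price(train, col)

# Neighborhood has ~25 levels; collapse it to 6 bins by median price instead.
nbhd_medians = train.groupby("Neighborhood")["LogSalePrice"].median()
nbhd_bins = pd.qcut(nbhd_medians, q=6, labels=False)
train["NbhdPriceBin"] = train["Neighborhood"].map(nbhd_bins)
```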
After dealing with all of the categorical features, I move on to the original numerical features, starting with their correlation with the response.
GrLivArea, the above-grade living area of the house, had a 0.7 correlation with log(SalePrice). Based on this correlation and the general idea that house size is an important component of house price, it was expected to be a strong regressor. I found the frequency distribution of the variable to be skewed, with additional pressure from outliers at high area and low price. After taking log(GrLivArea), much of the pressure from outliers was removed. However, I still identified 4 houses in the training set with extremely large areas lying far from the log-linear trend, as well as 5 houses with lower areas and extremely low prices compared to the trend. These 9 points were considered outliers, and I later examined the effect of their removal on the validation error for linear models. Ultimately I selected a dataset with 5 outliers removed as the basis for my Kaggle submissions.
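In the Rmd these points were identified from scatterplots; the following is a rough programmatic sketch of the same idea, flagging points far below the log-linear trend (continuing with the `train` DataFrame from the sketch above; the 3-sigma threshold is illustrative):

```python
import numpy as np

x = np.log(train["GrLivArea"])
y = np.log(train["SalePrice"])

# Simple least-squares line through the log-log cloud.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Points far below the trend (large area, low price) are outlier candidates.
candidates = train[residuals < residuals.mean() - 3 * residuals.std()]
```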
For the other numerical variables, I included log versions whenever that reduced skewness. In several cases, I could also improve linearity by engineering a new feature that was a linear combination of the feature and its square root.
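A sketch of this "feature plus square root" linearization, using LotArea as an illustrative example (continuing with the `train` DataFrame above); the fitted combination is kept as a single new regressor:

```python
import numpy as np

def linearized_feature(x, y):
    """Fit y ≈ a*x + b*sqrt(x) + c by least squares and return a*x + b*sqrt(x)."""
    X = np.column_stack([x, np.sqrt(x), np.ones_like(x)])
    (a, b, _), *_ = np.linalg.lstsq(X, y, rcond=None)
    return a * x + b * np.sqrt(x)

train["LotArea_lin"] = linearized_feature(train["LotArea"].to_numpy(float),
                                          np.log(train["SalePrice"]).to_numpy())
```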
The presence of houses without a garage or basement caused problems for several of the associated features. In particular, a numerical feature like GarageArea has an accumulation of values at 0. In this particular case, these values did not seem to exert undue pressure on a linear fit.
In addition to examining the subsets based on presence of a garage or basement, I also engineered various features based on the sums of the numerical features. These had the benefit of not suffering from accumulation around zero, even when basement or garage features were included. I also included log versions and linearized versions where appropriate.
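One illustrative sum feature (column names as in the Kaggle data dictionary; continuing with the `train` DataFrame above): a house without a basement simply contributes zero from TotalBsmtSF, so the sum has no artificial spike at zero the way the individual basement features do.

```python
import numpy as np

# Total indoor square footage and its log version.
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]
train["LogTotalSF"] = np.log(train["TotalSF"])
```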
Final Preprocessing
I then performed a final round of preprocessing. To begin, I used caret to convert my categorical variables to one-hot encoding. This step ballooned the number of features to over 400.
During my EDA, I noticed that some categorical feature levels were present in the training set but not in the test set. Since I don't want my models to learn features that can't be used to predict on the test set, I had to remove the corresponding one-hot features from the data.
Next I looked for near-zero-variance predictors and features with extremely low counts. I chose to drop any feature that had 5 or fewer observations in the training set, since I could not have statistical confidence in parameters learned from such a small sample.
I then removed features which had > 0.99 correlation with other features. This was mainly due to one-hot variables that had perfect correlation with NoGarage and NoBasement.
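The preprocessing in the analysis is done with caret; the steps above correspond roughly to the following Python sketch, where `train_df` and `test_df` are assumed to hold the engineered training and test features and the thresholds match those stated above:

```python
import numpy as np
import pandas as pd

# One-hot encode (the Rmd uses caret; pd.get_dummies is a rough analogue).
X_train = pd.get_dummies(train_df, dtype=int)
X_test = pd.get_dummies(test_df, dtype=int)

# Keep only columns present in both sets, so no level learned on the training
# data is unusable when predicting on the test data.
common = X_train.columns.intersection(X_test.columns)
X_train, X_test = X_train[common], X_test[common]

# Drop near-zero-variance dummies: levels observed 5 or fewer times in training.
rare = [c for c in X_train.columns
        if set(X_train[c].unique()) <= {0, 1} and X_train[c].sum() <= 5]
X_train, X_test = X_train.drop(columns=rare), X_test.drop(columns=rare)

# Drop one member of any pair of features correlated above 0.99
# (mostly dummies perfectly correlated with NoGarage / NoBasement).
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.99).any()]
X_train, X_test = X_train.drop(columns=redundant), X_test.drop(columns=redundant)
```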
I was left with 346 features, of which roughly half are one-hot features of varying sparseness.
Later I added a section to the Rmd file to make it easy to generate datasets in which some subset of outliers had been removed. In particular, this allowed me to refit the engineered features to the retained data. These tidy datasets were saved in csv format for later use.
Analysis of Outliers
Further analysis was done in python using scikit-learn and associated toolkits. First was an analysis of the outliers I had identified in the EDA of GrLivArea. This was done in the notebook https://github.com/richcorrado/Housing-Prices-ART/DetectOutliers.ipynb. For validation purposes, I made a validation set using 20% of the training data. I then chose the LassoLarsCV model and RMSE as my metric.
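A sketch of this validation setup, assuming `X` and `y` hold the preprocessed features and log sale prices for one candidate dataset:

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def validation_rmse(X, y, seed=42):
    """Hold out 20% of the training data and score LassoLarsCV by RMSE on it."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LassoLarsCV(cv=10).fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

# Each candidate dataset (a different subset of outliers removed) gets the same
# treatment, and the resulting RMSE values are compared.
```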
The best validation error came from dropping 5 outliers, which resulted in a 6% improvement in RMSE. A number of other datasets had similar performance.
Feature Selection
During my feature engineering, I created many features that were highly correlated with other original or new features. For the most part, I was planning on using regularized models to avoid problems arising from this. However, I felt that I should still consider whether models would be improved by dropping some features by hand. This analysis was performed for the best dataset in https://github.com/richcorrado/Housing-Prices-ART/FeatureSelection-111001001.ipynb.
The first thing I note in this notebook is that the improvement in the baseline validation error, after dropping the five outliers and then refitting the engineered features, is now almost 13%. I attribute this improvement entirely to the refit of the engineered features, many of which depend heavily on GrLivArea.
Next, I find highly correlated sets of features and examine the effect on validation error by refitting in the absence of subsets of them. In the end, I identify a set of 4 features which, if dropped, results in a 1.6% increase in validation error.
Modeling
Next I selected various candidate models:
Linear models: Ridge, Lasso, Lasso Lars, Elastic Net, Orthogonal Matching Pursuit
Tree models: Random Forest, XGBoost
Support Vector Regressor
Multilayer Perceptron Regressor
This was done in the notebook https://github.com/richcorrado/Housing-Prices-ART/Modeling-111001001.ipynb. My strategy was to again hold out a 20% validation set. I then used random search and grid search with 10-fold cross-validation to select optimal hyperparameters over the remaining 80% training set. Each model, with its optimal hyperparameters, was then fit on the training set and its validation error computed.
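A sketch of the tuning loop for one model family (Ridge here, with an illustrative grid); the same pattern, with grid or random search, was repeated for each candidate model. `X` and `y` are the preprocessed features and log prices as before:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 30.0, 100.0]},   # illustrative grid
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)                 # best model is refit on the full 80% split

val_rmse = np.sqrt(-search.score(X_val, y_val))
print(search.best_params_, val_rmse)
```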
My general result was that linear models had the best performance, likely due to all of the linearized features that I engineered. The nonlinear models XGBoost and MLP were slightly behind LassoLars. I also decided against using the datasets from which features had been dropped.
Subsetting
With promising models, I returned to the question of how to best deal with houses with no basement or no garage. This was analyzed in https://github.com/richcorrado/Housing-Prices-ART/blob/master/Subsetting%20111001001.ipynb using the same validation strategy as before.
In my total training set (with outliers removed), there were 80 houses with no basement. Fits on 80% of this subset had a worse validation error on the remaining 20% than the baseline of no subsetting. Similarly, subsetting to the 37 houses with no garage gave worse results than the baseline. This is to be expected given the small samples involved.
I found 1345 houses with both a garage and basement. Validation error on this subset was 7% better than the baseline. Therefore I chose a strategy where I subsetted the houses with both a garage and basement for a best case fit. Then I generated predictions for the remaining subset of houses with no basement or garage by fitting a model on the total training data, which was my baseline.
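A sketch of this subsetting strategy, using the NoGarage and NoBasement indicators from the EDA; the DataFrame names (`train`, `test`, `X_train`, `X_test`, `y`) and the choice of LassoLarsCV for both fits are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

has_both_tr = ((train["NoGarage"] == 0) & (train["NoBasement"] == 0)).values
has_both_te = ((test["NoGarage"] == 0) & (test["NoBasement"] == 0)).values

# "Best case" model fit only on houses with both a garage and a basement,
# plus a baseline model fit on all houses for the remainder.
model_subset = LassoLarsCV(cv=10).fit(X_train[has_both_tr], y[has_both_tr])
model_base = LassoLarsCV(cv=10).fit(X_train, y)

log_pred = np.where(has_both_te,
                    model_subset.predict(X_test),
                    model_base.predict(X_test))
```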
I generated submissions for 3 models:
1. Lasso Lars alone. This had a Private Leaderboard score of 0.12700 which placed in only the top 30% of submissions.
2. Blended Lasso Lars, XGBoost and MLP. Blending was a weighted average using the inverse validation errors as weights (see the sketch after this list). This model had a Private Leaderboard score of 0.12242, which would have been a top 16% submission. However, its Public Leaderboard score was worse than the other submissions, so I did not select it.
3. Blend of Lasso Lars and Elastic Net. This had a Private Leaderboard score of 0.12740, which was worse than Lasso Lars alone.
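A minimal sketch of the inverse-validation-error weighting used for the blend in submission 2; the prediction names and error values shown are placeholders, not the actual values:

```python
import numpy as np

def blend(predictions, val_errors):
    """Average the model predictions, weighting each by its inverse validation RMSE."""
    weights = 1.0 / np.asarray(val_errors, dtype=float)
    weights /= weights.sum()
    return sum(w * np.asarray(p) for w, p in zip(weights, predictions))

# e.g. blended = blend([pred_lassolars, pred_xgb, pred_mlp], [0.105, 0.110, 0.112])
```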
Conclusions
This was an excellent project for learning many data science techniques. The dataset was messy enough to require extensive tidying, but small enough that this was manageable. However, there is no doubt that the small size of the dataset precluded some of the subsetting techniques that I might have used to lower my validation errors.
The nature of the problem and the features allowed a large amount of feature engineering. In early modeling with fewer engineered features, linear models performed behind the tree and other nonlinear models. Given more time, I would like to explore how tree models like XGBoost behave when subsets of features are removed.