Use case: Kaggle House Prices (Advanced Regression Techniques)

Image source:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

The Kaggle House Prices competition is a well-known and very useful regression case study for freshers, newcomers, and even experienced people who want hands-on practice with data science. Solving it requires advanced regression and feature engineering techniques.

Regression was also my second data science project two years ago, when I started learning data science. In this blog post, I will walk through the case study step by step.

Table of contents

  1. Understanding the problem and hypothesis generation
  2. Data acquiring and exploring
  3. Feature engineering 
  4. Model building
  5. Submission

Understanding the problem and hypothesis generation

The problem is to predict the sale price of each house on the basis of the 79 given explanatory variables. The features and other information are described in a file called data_description.txt.

Some of the variables that, from a user's perspective, could influence the predicted price:

  • OverallQual: Rates the overall material and finish of the house
  • LotArea: Lot size in square feet
  • Neighborhood: Physical locations within Ames city limits
  • YearBuilt: Original construction date
  • TotalBsmtSF: Total square feet of basement area
  • GrLivArea: Above grade (ground) living area square feet
  • FullBath: Full bathrooms above grade
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • GarageCars: Size of garage in car capacity
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet

These are just a few examples to think about and to test as hypotheses later. Readers can also think further about the explanatory variables and form their own hypotheses.
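One simple way to start checking such hypotheses is to rank the numeric candidates by their correlation with the target. The following is only a minimal sketch, not the approach used in this post: the helper name rank_by_correlation is hypothetical, the toy rows are made up for illustration, and Neighborhood is left out because it is categorical.

```python
import pandas as pd

# Numeric candidate features taken from the list above.
CANDIDATES = ["OverallQual", "LotArea", "YearBuilt", "TotalBsmtSF", "GrLivArea",
              "FullBath", "TotRmsAbvGrd", "GarageCars", "1stFlrSF", "2ndFlrSF"]

def rank_by_correlation(df, features, target="SalePrice"):
    """Pearson correlation of each feature with the target, strongest first."""
    present = [f for f in features if f in df.columns]  # skip absent columns
    corr = df[present].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up rows for illustration only; use pd.read_csv("train.csv") in practice.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [900, 1500, 1200, 2100, 2000],
    "SalePrice": [100000, 150000, 200000, 250000, 300000],
})
ranking = rank_by_correlation(toy, CANDIDATES)
print(ranking)
```

Correlation only captures linear relationships, so a low value here does not by itself rule a hypothesis out.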

Data acquiring and exploring

Data is an essential part of any data science project. The dataset is available on the competition page; you can either download it locally or work in a Kaggle kernel for the competition.

A brief look at the different files:

  1. train.csv: The training dataset, including the house prices.
  2. test.csv: The test dataset, for which prices must be predicted.
  3. data_description.txt: Full description of each column.
  4. sample_submission.csv: A submission file in the correct format.
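As a rough sketch, the two CSV files can be read with pandas as shown below. The helper name load_competition_data and the folder layout are assumptions, and the demonstration writes two tiny stand-in files with made-up values so that the snippet runs without the real download.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_competition_data(data_dir):
    """Read train.csv and test.csv from data_dir and return both DataFrames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    return train, test

# Stand-in files with made-up rows; point data_dir at the real folder instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "train.csv").write_text("Id,LotArea,SalePrice\n1,8000,200000\n2,9500,180000\n")
    Path(tmp, "test.csv").write_text("Id,LotArea\n3,11000\n")
    train, test = load_competition_data(tmp)

# shape is (rows, columns); the test file has no SalePrice column.
print(train.shape, test.shape)
```

On the real files, the same shape attribute gives the record and column counts for each dataset.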

Some insights from the datasets

Total number of columns in the training data: 81

Total number of records (data points) in the training data: 1460

Total number of columns in the test data: 80

Total number of records (data points) in the test data: 1459

Missing values per column in the training dataset

Electrical 1
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtFinType1 37
BsmtExposure 38
BsmtFinType2 38
GarageCond 81
GarageQual 81
GarageFinish 81
GarageType 81
GarageYrBlt 81
LotFrontage 259
FireplaceQu 690
Fence 1179
Alley 1369
MiscFeature 1406
PoolQC 1453

Missing values per column in the test dataset

TotalBsmtSF 1
GarageArea 1
GarageCars 1
KitchenQual 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
SaleType 1
Exterior1st 1
Exterior2nd 1
Functional 2
Utilities 2
BsmtHalfBath 2
BsmtFullBath 2
MSZoning 4
MasVnrArea 15
MasVnrType 16
BsmtFinType2 42
BsmtFinType1 42
BsmtQual 44
BsmtExposure 44
BsmtCond 45
GarageType 76
GarageFinish 78
GarageQual 78
GarageCond 78
GarageYrBlt 78
LotFrontage 227
FireplaceQu 730
Fence 1169
Alley 1352
MiscFeature 1408
PoolQC 1456
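Listings like the two above can be produced with a few lines of pandas. This is a minimal sketch on a made-up toy frame; the helper name missing_value_counts is hypothetical and not taken from the original kernel.

```python
import numpy as np
import pandas as pd

def missing_value_counts(df):
    """Count NaN entries per column; keep columns with at least one, ascending."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values()

# Made-up rows for illustration only; pass the real train or test frame instead.
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 7200, 10100],
    "LotFrontage": [65.0, np.nan, 70.0, np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
})
missing = missing_value_counts(toy)
print(missing)
```

Note that data_description.txt uses NA in columns such as PoolQC or Alley to mean that the house has no such feature, and pandas reads that NA as a missing value by default. A high count in those columns therefore does not necessarily mean the data was not recorded.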

Total number of quantitative variables in the training set: 38

Total number of qualitative variables in the test set: 43
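One plausible way to arrive at such counts is to split the columns by dtype. The sketch below assumes that "quantitative" means a numeric dtype and "qualitative" means everything else; the helper name split_by_kind and the toy rows are made up for illustration.

```python
import pandas as pd

def split_by_kind(df):
    """Return (quantitative, qualitative) column names based on dtype."""
    quantitative = df.select_dtypes(include="number").columns.tolist()
    qualitative = df.select_dtypes(exclude="number").columns.tolist()
    return quantitative, qualitative

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "LotArea": [8000, 9500],
    "Neighborhood": ["NAmes", "CollgCr"],
    "SalePrice": [200000, 180000],
})
quantitative, qualitative = split_by_kind(toy)
print(len(quantitative), "quantitative;", len(qualitative), "qualitative")
```

A dtype-based split is only a first pass: a column stored as integers can still encode categories, so the lists are worth checking against data_description.txt.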

Distribution of the SalePrice column

SalePrice (target variable) distribution
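Besides plotting a histogram, the shape of the target can be summarized numerically through its skewness. The sketch below compares the raw prices with a log1p-transformed version, which is a common choice for right-skewed targets; whether that transform was applied in this project is not stated here, and both the helper name summarize_target and the toy prices are made up.

```python
import numpy as np
import pandas as pd

def summarize_target(prices):
    """Return the skewness of the raw and the log1p-transformed target."""
    prices = pd.Series(prices, dtype="float64")
    return {
        "skew_raw": prices.skew(),
        "skew_log": np.log1p(prices).skew(),
    }

# Made-up prices with one expensive outlier; use train["SalePrice"] in practice.
toy_prices = [100000, 120000, 130000, 150000, 160000, 180000, 250000, 600000]
summary = summarize_target(toy_prices)
print(summary)
```

A positive skewness indicates a long right tail. On this toy sample the log transform reduces the skew without removing it entirely.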
 


Result

My current rank is 658 out of 4498 participants, which puts me within the top 15%. I can improve this rank further with more feature engineering, ensemble methods, or an ANN.

I read and took advantage of the different techniques people have shared in Kaggle kernels. I was very inspired by this profile’s kernel and also used some code from there.

End Notes

In this article, I described my approach. This is a good competition for people who are new to machine learning.

If you liked my method or have any questions, please feel free to drop a note in the comment box; I will be glad to discuss.

About Mitra N Mishra 35 Articles
Mitra N Mishra is working as a full-stack data scientist.
