
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
The Kaggle House Prices competition is a well-known and very useful regression case study for freshers, newbies, and even experienced people who want some hands-on practice with data science. Solving it requires advanced regression and feature engineering techniques.
Regression was also my second data science project two years ago, when I started learning data science. In this blog post, I will describe the case study in detail.
Table of contents
- Understanding the problem and hypothesis generation
- Data acquiring and exploring
- Feature engineering
- Model building
- Submission
Understanding the problem and hypothesis generation
The problem is to predict the price of each house on the basis of the 79 given explanatory variables. The features and other information are described in a file called data_description.txt.
Some of the variables that, from a user's perspective, could potentially influence the predicted price:
- OverallQual: Rates the overall material and finish of the house
- LotArea: Lot size in square feet
- Neighborhood: Physical locations within Ames city limits
- YearBuilt: Original construction date
- TotalBsmtSF: Total square feet of basement area
- GrLivArea: Above grade (ground) living area square feet
- FullBath: Full bathrooms above grade
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- GarageCars: Size of garage in car capacity
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
These are just a few examples to think about and to test as hypotheses later. Readers can also look at the other explanatory variables and form their own hypotheses.
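A minimal sketch of how such hypotheses could be checked, assuming pandas is available: it ranks the numeric candidates from the list by their Pearson correlation with SalePrice. The helper name rank_by_correlation and the tiny demo frame are hypothetical and only illustrate the call; this is not necessarily how I tested them in my own kernel.

```python
import pandas as pd

# Numeric candidates from the list above. Neighborhood is categorical,
# so it is left out of this simple correlation check.
CANDIDATES = [
    "OverallQual", "LotArea", "YearBuilt", "TotalBsmtSF", "GrLivArea",
    "FullBath", "TotRmsAbvGrd", "GarageCars", "1stFlrSF", "2ndFlrSF",
]


def rank_by_correlation(df: pd.DataFrame, target: str = "SalePrice") -> pd.Series:
    """Return Pearson correlations of the candidate features with the target,
    ordered from strongest to weakest by absolute value."""
    present = [c for c in CANDIDATES if c in df.columns]
    corr = df[present].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up miniature frame, only to show the call. With the real data you
# would pass pd.read_csv("train.csv") instead.
demo = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "LotArea": [9000, 8000, 11000, 9500, 10000],
    "SalePrice": [120000, 150000, 200000, 260000, 330000],
})
ranking = rank_by_correlation(demo)
print(ranking)
```

Correlation only captures linear relationships, so a low value here does not rule a feature out; it is just a quick first filter.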
Data acquiring and exploring
Data is an essential part of a data science project. The dataset is available on the competition page; you can either download it locally or use a Kaggle kernel for the competition.
A brief look at the different files:
- train.csv:- The training dataset for predicting house prices.
- test.csv:- The test dataset for house pricing.
- data_description.txt:- Full description of each column.
- sample_submission.csv:- A submission file in the correct format.
Some insights from the datasets
Total number of columns in the training data: 81
Total number of records (data points) in the training data: 1460
Total number of columns in the test data: 80
Total number of records (data points) in the test data: 1459
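These counts can be reproduced with a few lines of pandas. The sketch below is an illustration rather than my exact code: describe_shapes is a hypothetical helper, and the two small frames are made-up stand-ins for train.csv and test.csv.

```python
import pandas as pd


def describe_shapes(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarise row and column counts, plus columns absent from the test set."""
    return {
        "train_rows": train.shape[0],
        "train_columns": train.shape[1],
        "test_rows": test.shape[0],
        "test_columns": test.shape[1],
        "only_in_train": sorted(set(train.columns) - set(test.columns)),
    }


# Made-up miniature frames. With the real files you would use
# pd.read_csv("train.csv") and pd.read_csv("test.csv").
train_demo = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [1000, 2000, 3000],
    "SalePrice": [100000, 150000, 200000],
})
test_demo = pd.DataFrame({"Id": [4, 5], "LotArea": [1500, 2500]})

summary = describe_shapes(train_demo, test_demo)
print(summary)
```

On the real files this should report 81 and 80 columns, the one-column difference being the SalePrice target, which is absent from the test set.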
Missing values per column in the training data set
Electrical 1
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtFinType1 37
BsmtExposure 38
BsmtFinType2 38
GarageCond 81
GarageQual 81
GarageFinish 81
GarageType 81
GarageYrBlt 81
LotFrontage 259
FireplaceQu 690
Fence 1179
Alley 1369
MiscFeature 1406
PoolQC 1453

Missing values per column in the test data set

TotalBsmtSF 1
GarageArea 1
GarageCars 1
KitchenQual 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
SaleType 1
Exterior1st 1
Exterior2nd 1
Functional 2
Utilities 2
BsmtHalfBath 2
BsmtFullBath 2
MSZoning 4
MasVnrArea 15
MasVnrType 16
BsmtFinType2 42
BsmtFinType1 42
BsmtQual 44
BsmtExposure 44
BsmtCond 45
GarageType 76
GarageFinish 78
GarageQual 78
GarageCond 78
GarageYrBlt 78
LotFrontage 227
FireplaceQu 730
Fence 1169
Alley 1352
MiscFeature 1408
PoolQC 1456
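Listings like the two above can be produced by counting nulls per column and keeping only the columns that have any. The following is a minimal sketch assuming pandas and numpy; missing_counts is a hypothetical helper name and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keep only columns that have any,
    and sort from fewest to most missing."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values()


# Made-up miniature frame. With the real data you would pass
# pd.read_csv("train.csv") or pd.read_csv("test.csv").
demo = pd.DataFrame({
    "LotArea": [100, 200, 300, 400],
    "Electrical": ["SBrkr", None, "SBrkr", "SBrkr"],
    "PoolQC": [None, None, None, "Gd"],
    "LotFrontage": [60.0, np.nan, np.nan, 80.0],
})
missing = missing_counts(demo)
print(missing)
```

Keep in mind that for several columns (PoolQC, Fence, Alley and others) data_description.txt uses NA to mean that the house does not have that feature, so a high count there is not necessarily a data quality problem.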
Total number of quantitative variables in the training set: 38
Total number of qualitative variables in the testing set: 43
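One way to arrive at such counts is to split the columns by dtype. The sketch below assumes pandas and treats every numeric column as quantitative and everything else as qualitative; split_by_type is a hypothetical helper and the demo frame is made up.

```python
import pandas as pd


def split_by_type(df: pd.DataFrame) -> tuple:
    """Split column names into numeric (quantitative) and
    non-numeric (qualitative) groups based on dtype."""
    quantitative = df.select_dtypes(include="number").columns.tolist()
    qualitative = df.select_dtypes(exclude="number").columns.tolist()
    return quantitative, qualitative


# Made-up miniature frame. With the real data you would pass
# pd.read_csv("train.csv") or pd.read_csv("test.csv").
demo = pd.DataFrame({
    "LotArea": [100, 200],
    "YearBuilt": [1990, 2005],
    "Neighborhood": ["A", "B"],
    "GarageType": ["Attchd", None],
})
quantitative, qualitative = split_by_type(demo)
print(len(quantitative), len(qualitative))
```

A dtype-based split is only a first approximation: a column stored as a number can still encode a category, so the result is worth checking against data_description.txt.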
Distribution of SalePrice column
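A quick numeric look at the target's distribution could be sketched as follows, assuming pandas and numpy. The helper summarize_distribution and the demo prices are made up for illustration. The log1p comparison is a common check for right-skewed prices and is my assumption here, not something this post reports results for.

```python
import numpy as np
import pandas as pd


def summarize_distribution(prices: pd.Series) -> dict:
    """Return basic location and skewness statistics for a price column,
    before and after a log1p transform."""
    return {
        "mean": prices.mean(),
        "median": prices.median(),
        "skew": prices.skew(),
        "skew_after_log1p": np.log1p(prices).skew(),
    }


# Made-up right-skewed prices. With the real data you would pass
# pd.read_csv("train.csv")["SalePrice"].
demo_prices = pd.Series([100000, 120000, 130000, 150000, 500000])
stats = summarize_distribution(demo_prices)
print(stats)
```

A positive skew means a long tail of expensive houses; if the log-transformed skew is clearly smaller, training on the transformed target may be worth trying.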

Result
My present rank is 658 out of 4498 participants, which is within the top 15%. I can now improve my rank with more feature engineering, ensemble methods, or an ANN.
I read and took advantage of different techniques that people share in Kaggle kernels. I was very inspired by this profile’s kernel and also used some code from it.
End Notes
In this article, I described my approach. This is a good competition for people who are new to machine learning.
If you liked my method or have any questions, please feel free to drop a note in the comment box. I will be glad to discuss.