The Kaggle House Prices competition is a well-known and very useful regression case study for beginners, and even for experienced practitioners who want some hands-on practice with data science. Solving it requires advanced regression and feature engineering techniques.
It was also my second data science project, two years back when I started learning data science. In this blog post, I will walk through the case study step by step.
Table of contents
- Understand the problem and hypothesis generation
- Acquiring and exploring the data
- Feature engineering
- Model building
Understanding the problem and hypothesis generation
The problem is to predict the price of each house on the basis of 79 given explanatory variables. The description of the features and other information is given in a file called data_description.txt.
From a buyer's perspective, some of the variables that can potentially influence the predicted price are:
- OverallQual: Rates the overall material and finish of the house
- LotArea: Lot size in square feet
- Neighborhood: Physical locations within Ames city limits
- YearBuilt: Original construction date
- TotalBsmtSF: Total square feet of basement area
- GrLivArea: Above grade (ground) living area square feet
- FullBath: Full bathrooms above grade
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- GarageCars: Size of garage in car capacity
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
These are just a few examples to start thinking about and to test hypotheses against. You can also look at the other explanatory variables and form your own hypotheses.
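One quick way to sanity-check such hypotheses is to look at each candidate's correlation with the target. A minimal sketch with pandas, using a tiny made-up frame as a stand-in for the real training data (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Tiny synthetic stand-in for train.csv (illustrative values only).
df = pd.DataFrame({
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "GrLivArea":   [1710, 1262, 1786, 1717, 2198, 1362],
    "GarageCars":  [2, 2, 2, 3, 3, 2],
    "SalePrice":   [208500, 181500, 223500, 140000, 250000, 143000],
})

# Rank candidate predictors by absolute correlation with the target.
corr = (df.corr(numeric_only=True)["SalePrice"]
          .drop("SalePrice")
          .abs()
          .sort_values(ascending=False))
print(corr)
```

On the real data you would run the same two lines over all numeric columns and focus your feature engineering on the strongest candidates.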
Acquiring and exploring the data
Data is an essential part of any data science project. The dataset is available here: you can either download it locally or use a Kaggle kernel for the competition.
The competition provides the following files:
- train.csv: the training set, used to build a model for predicting house prices.
- test.csv: the test set, for which the house prices must be predicted.
- data_description.txt: a full description of each column.
- sample_submission.csv: a submission file in the correct format.
Some insights from the datasets:
- Total number of columns in the training data: 81
- Total number of records in the training data: 1460
- Total number of columns in the test data: 80
- Total number of records in the test data: 1459
- Total missing values in the training data
- Total missing values in the test data
- Total number of quantitative variables in the training set: 38
- Total number of qualitative variables in the test set: 43
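These counts come straight from the DataFrame. A sketch of how to reproduce them, shown on a tiny stand-in frame (the real train.csv has 81 columns and 1460 rows):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame (illustrative values only).
train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan],
    "OverallQual": [7, 6, 7],
    "Neighborhood": ["CollgCr", "Veenker", None],
    "SalePrice": [208500, 181500, 223500],
})

# Quantitative = numeric dtypes; qualitative = everything else.
quantitative = train.select_dtypes(include="number").columns
qualitative = train.select_dtypes(exclude="number").columns
print("quantitative:", len(quantitative))
print("qualitative:", len(qualitative))
print("missing values:", int(train.isna().sum().sum()))
```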
Distribution of the SalePrice column
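SalePrice in this competition is right-skewed, and a common first move is to model log(1 + SalePrice) instead. A sketch of checking the skew before and after the transform, using simulated lognormal-ish prices as a stand-in for the real column:

```python
import numpy as np
import pandas as pd

# Simulated right-skewed prices standing in for the real SalePrice column.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, size=500)), name="SalePrice")

# log1p = log(1 + x), the usual transform before fitting linear models.
log_prices = np.log1p(prices)
print("skew before:", round(prices.skew(), 2))
print("skew after:", round(log_prices.skew(), 2))
```

If you train on the log scale, remember to apply `np.expm1` to the predictions before submitting.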
My current rank is 658 out of 4498 participants, which is within the top 15%. I can still improve it with more feature engineering, ensemble methods, or an ANN.
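As a taste of the ensemble direction, one simple option is to blend a linear model with a gradient-boosted one by averaging their predictions. A sketch on synthetic data (a blend does not always beat the best single model, so cross-validate the weights on the real features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in for the engineered house features.
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
gbm = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Simple blend: equal-weight average of the two models' predictions.
blend = 0.5 * ridge.predict(X_te) + 0.5 * gbm.predict(X_te)
for name, pred in [("ridge", ridge.predict(X_te)),
                   ("gbm", gbm.predict(X_te)),
                   ("blend", blend)]:
    print(name, "RMSE:", round(mean_squared_error(y_te, pred) ** 0.5, 2))
```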
I read and took advantage of the different techniques people shared in Kaggle kernels. I was especially inspired by this profile's kernels and also used some code from there.
In this article, I described my approach. This is a good competition for people who are new to machine learning.
If you liked my method or have any questions, please feel free to drop a note in the comments; I will be glad to discuss.