# Use case: Kaggle House Prices (Advanced Regression Techniques)

The Kaggle House Prices competition is a well-known and very useful regression case study for beginners, and even for experienced people who want some hands-on practice with data science. It calls for advanced regression and feature engineering techniques.

A regression problem was also my second data science project two years ago, when I started learning data science. In this blog post, I will walk through the case study in the following steps:

1. Understand the problem and hypothesis generation
2. Data acquiring and exploring
3. Feature engineering
4. Model building
5. Submission

Understanding the problem and hypothesis generation

The problem is to predict the sale price of each house from the 79 explanatory variables provided. The features and other details are described in a file called data_description.txt.

From a user's perspective, these are some of the variables that could plausibly influence the price:

• OverallQual: Rates the overall material and finish of the house
• LotArea: Lot size in square feet
• Neighborhood: Physical locations within Ames city limits
• YearBuilt: Original construction date
• TotalBsmtSF: Total square feet of basement area
• GrLivArea: Above grade (ground) living area square feet
• FullBath: Full bathrooms above grade
• TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
• GarageCars: Size of garage in car capacity
• 1stFlrSF: First Floor square feet
• 2ndFlrSF: Second floor square feet

These are just a few examples to think about and to test as hypotheses later. You can also look at the other explanatory variables and form hypotheses of your own.
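
One simple way to check a hypothesis like "OverallQual influences the price" is to look at the correlation between a feature and the sale price. The sketch below is only an illustration under stated assumptions: the rows in `SAMPLE_CSV` are made up (only the column names come from the competition), the helper names `pearson` and `correlations_with_target` are hypothetical, and it uses the Python standard library so that it runs without extra packages.

```python
import csv
import io
import math

# Hypothetical rows for illustration only; real values come from train.csv.
SAMPLE_CSV = """OverallQual,GrLivArea,SalePrice
5,1000,120000
6,1300,150000
7,1600,200000
8,2000,280000
9,2600,400000
"""

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlations_with_target(csv_text, target="SalePrice"):
    """Correlate every non-target column with the target column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    target_values = [float(r[target]) for r in rows]
    result = {}
    for name in rows[0]:
        if name == target:
            continue
        result[name] = pearson([float(r[name]) for r in rows], target_values)
    return result

correlations = correlations_with_target(SAMPLE_CSV)
# Print the strongest relationships first.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:.3f}")
```

A high correlation only suggests that a variable is worth keeping in mind; it does not capture non-linear effects or categorical variables such as Neighborhood, which need a different kind of check.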

Data acquiring and exploring

Data is an essential part of any data science project. The dataset is available here; you can either download it locally or use a Kaggle kernel for the competition.

The competition provides the following files:

1. train.csv: The training dataset for predicting house prices.
2. test.csv: The test dataset for house pricing.
3. data_description.txt: Full description of each column.
4. sample_submission.csv: A submission file in the correct format.

Some insights from the datasets

Number of columns in the training data: 81

Number of records (data points) in the training data: 1460

Number of columns in the test data: 80

Number of records (data points) in the test data: 1459

Missing values per column in the training set

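| Column | Missing values |
|---|---|
| Electrical | 1 |
| MasVnrType | 8 |
| MasVnrArea | 8 |
| BsmtQual | 37 |
| BsmtCond | 37 |
| BsmtFinType1 | 37 |
| BsmtExposure | 38 |
| BsmtFinType2 | 38 |
| GarageCond | 81 |
| GarageQual | 81 |
| GarageFinish | 81 |
| GarageType | 81 |
| GarageYrBlt | 81 |
| LotFrontage | 259 |
| FireplaceQu | 690 |
| Fence | 1179 |
| Alley | 1369 |
| MiscFeature | 1406 |
| PoolQC | 1453 |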

Missing values per column in the test set

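| Column | Missing values |
|---|---|
| TotalBsmtSF | 1 |
| GarageArea | 1 |
| GarageCars | 1 |
| KitchenQual | 1 |
| BsmtUnfSF | 1 |
| BsmtFinSF2 | 1 |
| BsmtFinSF1 | 1 |
| SaleType | 1 |
| Exterior1st | 1 |
| Exterior2nd | 1 |
| Functional | 2 |
| Utilities | 2 |
| BsmtHalfBath | 2 |
| BsmtFullBath | 2 |
| MSZoning | 4 |
| MasVnrArea | 15 |
| MasVnrType | 16 |
| BsmtFinType2 | 42 |
| BsmtFinType1 | 42 |
| BsmtQual | 44 |
| BsmtExposure | 44 |
| BsmtCond | 45 |
| GarageType | 76 |
| GarageFinish | 78 |
| GarageQual | 78 |
| GarageCond | 78 |
| GarageYrBlt | 78 |
| LotFrontage | 227 |
| FireplaceQu | 730 |
| Fence | 1169 |
| Alley | 1352 |
| MiscFeature | 1408 |
| PoolQC | 1456 |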
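
Missing-value counts like the ones above can be produced by tallying the empty or "NA" cells in each column. The following is a minimal sketch, not the exact code used for this post: the rows in `SAMPLE_CSV` are invented, the function name `count_missing` is hypothetical, and treating the strings `""` and `"NA"` as missing is an assumption about how the CSV files encode gaps. With pandas, the same tally is commonly written as `df.isnull().sum()`.

```python
import csv
import io
from collections import Counter

# Tiny made-up extract; only the column names follow the competition files.
SAMPLE_CSV = """Id,LotFrontage,Alley,PoolQC,SalePrice
1,60,NA,NA,100000
2,NA,NA,NA,150000
3,70,Pave,NA,200000
"""

# Assumption: a cell is missing if it is empty or holds the literal text "NA".
MISSING_MARKERS = {"", "NA"}

def count_missing(csv_text):
    """Return {column: missing count} for columns with at least one gap."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if value.strip() in MISSING_MARKERS:
                counts[column] += 1
    # Sort ascending by count, like the listings above.
    return dict(sorted(counts.items(), key=lambda kv: kv[1]))

missing = count_missing(SAMPLE_CSV)
for column, n in missing.items():
    print(f"{column}: {n}")
```

Keep in mind that for several columns (for example PoolQC or Alley) data_description.txt defines "NA" as "the house does not have this feature", so a high count there is not necessarily a data quality problem.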

Number of quantitative variables in the training set: 38

Number of qualitative variables in the training set: 43
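
A rough way to arrive at such a split is to call a column quantitative when all of its non-missing values parse as numbers, and qualitative otherwise. The sketch below illustrates that rule on invented rows; the helper names `is_number` and `split_columns` are hypothetical, and the rule itself is an assumption rather than the exact method behind the counts above.

```python
import csv
import io

# Hypothetical mini-extract; column names follow train.csv, values are made up.
SAMPLE_CSV = """Id,LotArea,Neighborhood,GarageCars,PoolQC
1,8000,NAmes,2,NA
2,9500,CollgCr,NA,Gd
3,11000,NAmes,1,NA
"""

# Assumption: empty cells and the literal text "NA" mark missing values.
MISSING_MARKERS = {"", "NA"}

def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def split_columns(csv_text):
    """Split column names into (quantitative, qualitative) lists."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    quantitative, qualitative = [], []
    for column in rows[0]:
        present = [r[column] for r in rows if r[column] not in MISSING_MARKERS]
        # Quantitative only if the column has values and all of them are numeric.
        if present and all(is_number(v) for v in present):
            quantitative.append(column)
        else:
            qualitative.append(column)
    return quantitative, qualitative

quantitative, qualitative = split_columns(SAMPLE_CSV)
print(len(quantitative), "quantitative:", quantitative)
print(len(qualitative), "qualitative:", qualitative)
```

A value-based rule like this also counts numeric identifiers such as Id as quantitative, so the resulting lists are a starting point to review by hand rather than a final classification.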

Distribution of the SalePrice column
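
House prices are typically right-skewed: most houses sit in a moderate range and a few expensive ones stretch the tail. The sketch below shows one way to quantify that and a common follow-up step, a log transform. It is an illustration only: the prices are made up, the `skewness` helper is hypothetical, and the log transform is a widely used technique that is not necessarily what was done for this post.

```python
import math

# Made-up, right-skewed prices for illustration; not taken from train.csv.
prices = [100000, 120000, 130000, 150000, 180000, 250000, 600000]

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# log1p(x) = log(1 + x); it compresses the long right tail.
log_prices = [math.log1p(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of raw prices:    {skew_raw:.2f}")
print(f"skewness of log1p(prices): {skew_log:.2f}")
```

A positive value means a longer right tail, and a value closer to zero after the transform means the target is closer to symmetric, which many regression models handle better.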


Result

My current rank is 658 out of 4498 participants, which places me in the top 15%. I can improve it further with more feature engineering, ensemble methods, or an ANN.

I read and learned from the different techniques people share in Kaggle kernels. I was very inspired by this profile’s kernel and also used some code from there.

End Notes

In this article, I described my approach. This is a good competition for people who are new to machine learning.

If you liked my method or have any questions, please feel free to drop a note in the comment box. I will be glad to discuss.