# Location, Location, Location?

#### Machine Learning regression using the Ames Housing dataset.

Reading time: 15 minutes

This post introduces Machine Learning regression. Regression is the prediction of a quantity/number, in contrast with classification where a class is predicted. We will use the Ames Housing Dataset compiled by Dean De Cock to predict the sale price of houses in Ames, Iowa.

In this post, you'll learn how to:

- present data using the Seaborn data visualisation package;
- train and evaluate a regression model; and
- combine multiple models for a better result.

By the end, you'll know how to predict the sale price of all 1,459 houses in `test.csv`. Here is a preview of the first 5 houses:

## Import Packages

We'll import the same packages as the previous post, namely `scipy`, `sklearn` and `pandas`. We'll also include the data visualisation library `seaborn`, since it contains some useful plots.

```
from scipy.special import boxcox1p, inv_boxcox
from scipy.stats import skew
from pandas.api.types import CategoricalDtype
from sklearn.base import clone, BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import Lasso, Ridge
import numpy as np
import pandas as pd
import seaborn as sns
```

## Load Data

Let's load the training data in `train.csv` into a Pandas DataFrame with `Id` as the index.

Unlike the previous post, where we only had one row to predict, we wish to predict all 1,459 rows inside `test.csv`. So, load the training and test data from two separate files into two DataFrames: `df_train` (1,460 rows) and `df_test` (1,459 rows). Any analysis and insights should be obtained only from `df_train` and not be informed by `df_test`.

```
df_train = pd.read_csv('data/house_prices/train.csv', index_col='Id')
df_test = pd.read_csv('data/house_prices/test.csv', index_col='Id')
df_train.head()
```

The DataFrame has 80 columns:

```
df_train.info(verbose=False)
```

Some of the features include `SalePrice`, `MSSubClass`, `MSZoning` and `LotFrontage`:

| Feature | Description |
|---|---|
| SalePrice | Property's sale price in dollars |
| MSSubClass | Building class |
| MSZoning | General zoning classification |
| LotFrontage | Linear feet of street connected to property |
| ... | ... |

The data visualisation library, Seaborn, has lots of interesting plots like Hexbin, Kernel Density Estimate (KDE) and Violin plots.

This Heatmap plot shows the relationships between `SalePrice` and the 7 features most correlated with it:

```
k = 8 # SalePrice plus its 7 most correlated features
correlation_matrix = df_train.corr(numeric_only=True)
columns = correlation_matrix.nlargest(k, 'SalePrice').SalePrice.index
sns.heatmap(df_train[columns].corr(), annot=True)
```

From the Heatmap, you can see that the top 7 features most highly correlated with `SalePrice` are:

- `OverallQual`: overall material and finish quality;
- `GrLivArea`: above grade (ground) living area, in square feet;
- `GarageCars`: size of garage, in car capacity;
- `GarageArea`: size of garage, in square feet;
- `TotalBsmtSF`: total square feet of basement area;
- `1stFlrSF`: size of first floor, in square feet; and
- `FullBath`: full bathrooms, above grade.

## Data Pre-processing

The data pre-processing steps here are similar to the ones in the previous post. We want to:

- convert all columns to their correct data types, turning columns of `object` data type into categories;
- impute (meaningfully fill in) or remove missing data;
- one-hot encode categories; and
- fix skewness for numerical features.
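Before working through the real columns, here is a minimal sketch of all four steps on a toy DataFrame (the `Zone` and `Area` columns are invented for illustration):

```
import pandas as pd
from scipy.special import boxcox1p

toy = pd.DataFrame({
    'Zone': ['RL', 'RM', None, 'RL'],        # object column -> category
    'Area': [8450.0, 9600.0, 11250.0, None], # numeric column with a gap
})
toy['Zone'] = toy['Zone'].astype('category') # 1. correct data types
toy = toy.dropna()                           # 2. remove missing data
toy = pd.get_dummies(toy)                    # 3. one-hot encode categories
toy['Area'] = boxcox1p(toy['Area'], 0.15)    # 4. reduce skew
print(sorted(toy.columns)) # ['Area', 'Zone_RL', 'Zone_RM']
```

We'll now apply each step to `df_train` and `df_test` properly.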

### Convert Columns to Correct Data Types

Columns of `object` data type are either actual strings or categories in the form of strings.

Let's check all the columns of `object` data type and see how many unique values they contain:

```
df_train.select_dtypes(include='object').nunique().sort_values(ascending=False)
```

Since `df_train` has a total of 1,460 rows, we can assume that even the column with 25 unique values is a category column. So, we convert all columns of `object` data type into `category`:

```
object_columns = df_train.select_dtypes(include='object').columns
for object_column in object_columns:
    # ensure the same categories across df_train and df_test
    categories = set(pd.concat([df_train[object_column], df_test[object_column]]).dropna())
    df_train[object_column] = df_train[object_column].astype(CategoricalDtype(categories=categories))
    df_test[object_column] = df_test[object_column].astype(CategoricalDtype(categories=categories))
```
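The shared `CategoricalDtype` matters because `pd.get_dummies` derives its output columns from the declared categories: if the two splits learn their categories independently, the encoded columns no longer line up. A small illustration, using hypothetical zoning values:

```
import pandas as pd
from pandas.api.types import CategoricalDtype

train_zone = pd.Series(['RL', 'RM', 'RH']) # the test split never saw 'RH'
test_zone = pd.Series(['RL', 'RM'])

# naive: each split infers its own categories, so the columns differ
print(list(pd.get_dummies(test_zone.astype('category')).columns))  # ['RL', 'RM']

# shared dtype: both splits encode to the same three columns
shared = CategoricalDtype(categories=['RL', 'RM', 'RH'])
print(list(pd.get_dummies(test_zone.astype(shared)).columns))      # ['RL', 'RM', 'RH']
```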

### Impute/Remove Missing Data

Let's count the missing rows for each feature:

```
missing_data = pd.DataFrame({
    'Count': df_train.isnull().sum(),
    'MissingPercentage': df_train.isnull().sum() / len(df_train) * 100,
})
missing_data[missing_data.Count > 0].sort_values(by='Count', ascending=False)
```

We can remove features with missing data in more than 1% of their rows. Since we can't have any missing values after this step, for every feature we keep, we must instead remove its incomplete rows. The key is to strike a balance between removing too many features and removing too many rows. The 1% figure is arbitrary, but it seems a good balance from observing the data.

```
columns_to_remove = list(missing_data[missing_data.MissingPercentage > 1].index)
df_train.drop(columns_to_remove, axis=1, inplace=True)
df_test.drop(columns_to_remove, axis=1, inplace=True)
```

Now, we delete the remaining training rows that contain missing values. We can't drop rows from `df_test`, though, since every one of its houses needs a prediction, so we fill its remaining numeric gaps with the training-set medians instead (missing categories will simply one-hot encode to all zeros):

```
df_train.dropna(inplace=True)
# every row of df_test needs a prediction, so impute rather than drop
df_test = df_test.fillna(df_train.median(numeric_only=True))
```

### One-Hot Encoding

```
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)
```

### Fix Skewness for Numerical Features

Although we briefly covered the need to fix skewness for numerical features in the previous post, let's use Seaborn to plot the target feature, `SalePrice`, as a distribution plot:

```
sns.histplot(df_train.SalePrice, kde=True);
```

Now, we apply the Box-Cox transform to the features with skew > 0.75:

```
lambda_ = 0.15
skewness = pd.DataFrame({'Skew': df_train.apply(lambda x: skew(x.dropna())).sort_values(ascending=False)})
skewed_columns = skewness[abs(skewness) > 0.75].dropna().index
print(f"There are {skewed_columns.shape[0]} skewed features to Box-Cox transform.")
for column in skewed_columns:
    df_train[column] = boxcox1p(df_train[column], lambda_)
    if column != 'SalePrice':
        df_test[column] = boxcox1p(df_test[column], lambda_)
```
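For reference, `boxcox1p(x, λ)` computes ((1 + x)^λ − 1)/λ, which compresses large values far more than small ones. Its exact inverse is `inv_boxcox1p` (note the `1p`). A quick round trip on some toy prices:

```
import numpy as np
from scipy.special import boxcox1p, inv_boxcox1p

prices = np.array([100000.0, 200000.0, 450000.0])
transformed = boxcox1p(prices, 0.15)   # compresses the scale
recovered = inv_boxcox1p(transformed, 0.15)
print(np.allclose(recovered, prices))  # True
```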

Let's look at a distribution plot of `SalePrice` again after skew correction:

```
sns.histplot(df_train.SalePrice, kde=True);
```

Now the features are ready for the machine learning regression model!

## Model

To prepare the data for the models, we separate the features (X) from the target label (y):

```
X = df_train.drop('SalePrice', axis=1)
y = df_train.SalePrice.values.ravel()
```

First, we'll go through a simple regression model that is similar to the classification model in the previous post. Then, we'll delve into a more complicated ensemble model.

### Simple Model

In the previous post, we used `train_test_split` to split `X` and `y` into a train/test set before fitting and scoring the model. We can get a more reliable score by using the `cross_val_score` function to do this multiple times (on a different train/test split each time) and take the average. This is called $k$-fold cross-validation, where $k$ is the number of train/test splits. The value of $k$ can range from 2 to $N$, the total number of rows in the data set. The extreme case where $k = N$ is also called leave-one-out cross-validation.
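To see what `cross_val_score` does internally, here is 5-fold cross-validation written out by hand on a small synthetic regression problem (the data here is generated, not the housing set):

```
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo):
    # train on 4 folds, score (R^2) on the held-out fold
    model = Ridge().fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
print(len(scores))  # 5 scores, one per held-out fold
```

`cross_val_score(model, X_demo, y_demo, cv=5)` produces the same kind of per-fold scores in a single call.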

Here, we use 5-fold cross-validation ($k = 5$) to evaluate the model:

```
model = RandomForestRegressor()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print('Score: {:.5f} ({:.5f})'.format(scores.mean(), scores.std()))
```

Now let's train the model using all of `X` and `y`, and then use the trained model to make predictions on `df_test`:

```
model = RandomForestRegressor()
model.fit(X, y)
pred = model.predict(df_test)
pred
```

Wait, these predicted sale prices don't look quite right. Since we applied the Box-Cox transform to help reshape `SalePrice`, we need to undo it. Note that the inverse of `boxcox1p` is `inv_boxcox1p`, not `inv_boxcox`:

```
from scipy.special import inv_boxcox1p
inv_boxcox1p(pred, lambda_)
```

```
pd.read_csv('data/house_prices/test.csv', index_col='Id').head()
```

So, the first 5 houses of `df_test` (above) have a predicted `SalePrice` of:

```
from scipy.special import inv_boxcox1p
inv_boxcox1p(pred[:5], lambda_)
```

### Ensemble Model

An ensemble model is one that combines the strengths of several simpler base models. There are three common types of ensemble models:

- **boosting:** uses multiple, identical models trained in sequence, with each subsequent model emphasising the errors of the previous one;
- **bootstrap aggregation (or bagging):** uses multiple, identical models, each trained on a randomly drawn subset of the training set; and
- **stacking:** trains different models on the same data and combines their predictions.
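We'll build a stacking example for the housing data shortly. As an aside, here is what bagging looks like with scikit-learn's `BaggingRegressor`, fitted on synthetic data (the data and hyperparameters here are illustrative only):

```
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.2, size=300)

# 50 identical trees, each fit on a bootstrap sample of the rows
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bagged.fit(X_demo, y_demo)
print(len(bagged.estimators_))  # 50
```

Averaging the 50 trees' predictions smooths out the high variance of any single deep tree.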

Actually, the `RandomForestClassifier` and `RandomForestRegressor` models which you've already been using are themselves ensemble models, since they combine multiple decision trees (each grown using a random subset of features) by majority voting for classification, or by averaging for regression.

Here is an example of stacking. We create an instance of a `StackedRegressor` that combines the `Lasso`, `RandomForestRegressor` and `Ridge` regression models for a better prediction:

```
class StackedRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # fit a fresh clone of each base model on the full training set
        self.models_ = [clone(model) for model in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        # average the base models' predictions
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)

stacked_model = StackedRegressor([
    Lasso(),
    RandomForestRegressor(),
    Ridge(),
])
scores = cross_val_score(stacked_model, X, y, cv=5, scoring='r2')
print('Score: {:.5f} ({:.5f})'.format(scores.mean(), scores.std()))
```

We see that the `StackedRegressor` (86.763%) performed better than the `RandomForestRegressor` by itself (85.688%).

## Summary and Next Steps

In this post, you've learnt how to:

- present data using the Seaborn data visualisation package;
- train and evaluate a regression model; and
- combine multiple models for a better result.

Now you're familiar with both types of Supervised Learning: classification (predicting a class) and regression (predicting a quantity). There are two other broad types of Machine Learning: (i) Unsupervised Learning, where you are given only inputs and tasked with finding interesting attributes of the dataset, such as clusters or outliers; and (ii) Reinforcement Learning, where an agent interacts with an environment and adapts its actions based on rewards. These will be explored in future posts. Stay tuned!

## Acknowledgements

This post has been inspired by posts from: