Tabular Modeling with Random Forest and Neural Network
Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value in one column based on the values in the other columns.
This article is based on a Jupyter notebook I forked from the fastai/fastbook repository. Feel free to use it as a reference and to follow along with the fully functioning code.
We will see how we can feed columns that contain categories into a model that expects numbers by using embeddings.
An embedding is a mathematical representation of a categorical variable, such as words in natural language processing or categorical features in tabular data. Multiplying a one-hot-encoded vector by a weight matrix is equivalent to indexing directly into that matrix, so an embedding layer uses this computational shortcut instead of the full matrix multiply.
one_hot_3 = one_hot(3, 5).float()
one_hot_3
tensor([0., 0., 0., 1., 0.]) # output
# https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html
n_users = 10
n_factors = 5
matrix = torch.randn(n_users, n_factors)
one_hot_4 = one_hot(4, n_users).float()
# Indexing into the matrix and multiplying by the one-hot vector
# give the same result; indexing directly is much more efficient.
matrix[4]
tensor([ 1.5109, -0.2732, -1.3178, -0.0542, 0.1486]) # output
matrix.t() @ one_hot_4 == matrix[4]
tensor([True, True, True, True, True]) # output
%%timeit
matrix[4]
653 ns ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%%timeit
matrix.t() @ one_hot_4
1.92 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Categorical Embeddings
In tabular data, some columns may contain numerical data, like “age,” while others contain string values, like “sex.” The numerical data can be fed directly to the model (with some optional preprocessing), but the other columns need to be converted to numbers first. These columns are variables (or features), and they come in two types (a small example follows the list):
- Continuous variables: numerical values that can be fed directly into the model, such as age, height, etc.
- Categorical variables: non-numerical values that correspond to different categories, such as eye color, dog breed, etc.
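As a tiny illustration with made-up data (the column names below are purely hypothetical), a categorical column can be converted to integer codes before being handed to a model or an embedding lookup:
import pandas as pd

# A made-up table with one continuous and two categorical columns.
toy = pd.DataFrame({'age': [34, 28, 51],                       # continuous
                    'sex': ['male', 'female', 'male'],         # categorical
                    'eye_color': ['blue', 'brown', 'green']})  # categorical
toy['sex'] = toy['sex'].astype('category')
toy['eye_color'] = toy['eye_color'].astype('category')
# The integer codes are what a model (or an embedding lookup) actually consumes.
print(toy[['sex', 'eye_color']].apply(lambda c: c.cat.codes))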
Dataset
I will be using the dataset from the old Blue Book for Bulldozers Kaggle competition. You can read more about the data on the competition page.
# You can run this code to download and extract the dataset.
comp = 'bluebook-for-bulldozers'
path = URLs.path(comp)
!kaggle competitions download -c 'bluebook-for-bulldozers' -p {path}
!cd {path} && unzip 'bluebook-for-bulldozers.zip' && rm 'bluebook-for-bulldozers.zip'
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
# Since SalePrice is log-transformed below, RMSE on the logs is the RMSLE on the prices.
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)  # RMSE of a model's predictions
Preprocessing data
Key fields in the training dataset are:
- SalesID: the unique identifier of the sale
- MachineID: the unique identifier of a machine. A machine can be sold multiple times
- SalePrice: what the machine sold for at auction (only provided in train.csv)
- saledate: the date of the sale
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df.columns
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
dtype='object')
Ordinal columns
A good next step is to handle ordinal columns. These are columns containing strings (or similar values) that have a natural ordering. For instance, here are the levels of ProductSize:
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
Dependent variable
The most important data column is the dependent variable. Recall that a model’s metric is a function that uses the dependent variable to reflect how good the predictions are.
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
df[dep_var]
0 11.097410
1 10.950807
2 9.210340
3 10.558414
4 9.305651
...
412693 9.210340
412694 9.259131
412695 9.433484
412696 9.210340
412697 9.472705
Name: SalePrice, Length: 412698, dtype: float64
Decision Trees
A decision tree asks a series of binary (yes or no) questions to make predictions.
The basic steps to train a decision tree are as follows (a small code sketch of the split search comes after the list):
- Loop through each column of the dataset in turn.
- For each column, loop through each possible level of that column in turn.
- Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
- Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple “model” where our predictions are simply the average sale price of the item’s group.
- After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
- We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each by going back to step 1 for each group.
- Continue this process recursively, until you have reached some stopping criterion for each group — for instance, stop splitting a group further when it has only 20 items in it.
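To make the split search in steps 1 to 5 concrete, here is a minimal sketch (illustrative code of my own, not fastai's or sklearn's) that scores candidate thresholds for a single numeric column by how tightly the sale prices cluster around each group's mean:
import numpy as np

def split_score(col, y, threshold):
    # Split the rows into two groups and measure, for each group, how far the
    # targets fall from that group's mean "prediction" (lower is better).
    mask = col <= threshold
    score = 0.0
    for grp in (y[mask], y[~mask]):
        if len(grp):
            score += ((grp - grp.mean()) ** 2).sum()
    return score

def best_split(col, y):
    # Try every observed level of the column and keep the best-scoring one.
    candidates = np.unique(col)
    scores = [split_score(col, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

# e.g. best_split(df['YearMade'].values, df['SalePrice'].values)
A real decision tree repeats this search over every column, then recurses into the two resulting groups.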
Handling dates
Dates are a bit different from most ordinal values in that some dates are qualitatively different from others in a way that is often relevant to the systems we are modeling. We want our tabular model to know more than just whether one date is more recent than another. We might want it to make decisions based on that date's day of the week, on whether the day is a holiday, on what month it is in, and so on; these derived columns often turn out to be useful categorical variables.
from fastai.tabular.all import *
df = add_datepart(df, 'saledate')
' '.join(o for o in df.columns if o.startswith('sale'))
'saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
Using TabularPandas and TabularProc
TabularPandas wraps a pandas DataFrame and provides a few conveniences, such as handling missing values, converting columns to numeric categorical columns, and splitting the dataset into training and validation sets.
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx),list(valid_idx))
# TabularPandas needs to be told which columns are continuous and which are categorical.
# We can handle that automatically using the helper function cont_cat_split:
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
cont,cat
(['SalesID',
'MachineID',
'ModelID',
...
'saleYear',
'saleMonth',
'saleWeek',
'saleDay',
'saleDayofweek',
'saleDayofyear',
'saleElapsed'],
['UsageBand',
'fiModelDesc',
...
'Differential_Type',
'Steering_Controls',
'saleIs_month_end',
'saleIs_month_start',
'saleIs_quarter_end',
'saleIs_quarter_start',
'saleIs_year_end',
'saleIs_year_start'])
procs = [Categorify, FillMissing]  # TabularProcs: map categories to numbers, fill in missing values
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train),len(to.valid)
(404710, 7988)
to1 = TabularPandas(df, procs, ['state', 'ProductGroup', 'Drive_System', 'Enclosure'], ['fiBaseModel'], y_names=dep_var, splits=splits)
to1.show(5) # output shown below
to.classes['ProductSize']
['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
to1.items[['state', 'ProductGroup', 'Drive_System', 'Enclosure']].head(3)
Creating the Decision Tree
Once all the data are numeric, we can start building the tree.
# Independent (x) and dependent (y) variables
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
len(xs)
404710
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y);
draw_tree(m, xs, size=10, leaves_parallel=True, precision=2)
The total number of samples is 404710. The DecisionTreeRegressor picks the column to do the binary split recursively, and there will be a total of 4 leaf nodes at the end. We can also visualize the data to see if there are any anomalies.
# visualize decision tree
import dtreeviz
dep_var = 'SalePrice'
samp_idx = np.random.permutation(len(y))[:500]
viz_model = dtreeviz.model(m,
X_train=xs.iloc[samp_idx],
y_train=y.iloc[samp_idx],
feature_names=xs.columns,
target_name=dep_var)
viz_model.view(
scale=1.6,
orientation='LR',
fontname='DejaVu Sans',
label_fontsize=10
)
It looks like we do have some anomalies. The YearMade column indicates that some machines were made in the year 1000, which is clearly a data-entry placeholder. We can fix that by replacing those values with 1950.
xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950
xs,y = to.train.xs,to.train.y
# print(xs,y)
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.211677, 0.268163)
Random Forests
A random forest is built using a procedure called bagging. The procedure is as follows:
- Randomly choose a subset of the rows of your data (i.e., “bootstrap replicates of your learning set”).
- Train a model using this subset.
- Save that model, and then return to step 1 a few times.
- This will give you a number of trained models. To make a prediction, predict using all of the models, and then take the average of each of those models’ predictions.
It is based on a deep and important insight: although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. It means that we can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of the data, and averaging its predictions.
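The rf() helper defined below wraps sklearn's RandomForestRegressor, but the bagging procedure itself is simple enough to sketch by hand. This is an illustrative sketch only, reusing the xs, y, valid_xs, and valid_y variables from the decision tree section:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
trees = []
for i in range(10):
    # Step 1: randomly choose a subset of the rows (a bootstrap sample).
    idx = rng.choice(len(xs), size=len(xs), replace=True)
    # Step 2: train a model on this subset.
    tree = DecisionTreeRegressor(min_samples_leaf=25).fit(xs.iloc[idx], y.iloc[idx])
    # Step 3: save that model and go back to step 1.
    trees.append(tree)

# Step 4: predict with every model and average the predictions.
bagged_preds = np.stack([t.predict(valid_xs) for t in trees]).mean(0)
r_mse(bagged_preds, valid_y)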
'''
n_estimators - number of decision trees to create in random forest.
max_samples - max. number of rows to sample for training each tree.
max_features - max. ratio of columns to sample at each binary split.
min_samples_leaf - min. number of samples per leaf node.
'''
def rf(xs, y, n_estimators=40, max_samples=200_000,
max_features=0.5, min_samples_leaf=5, **kwargs):
return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
max_samples=max_samples, max_features=max_features,
min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
# m is now the random forest; it is reused below for the OOB error, tree variance, and feature importance.
m = rf(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.211677, 0.268163)
Out-of-Bag Error
The OOB error is a way of measuring prediction error on the training set: for each row, it only includes predictions from trees whose bootstrap sample did not contain that row. This lets us see whether the model is overfitting without needing a separate validation set, i.e., it estimates how well the model can predict on data it has not seen.
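To see where OOB predictions come from, here is a rough hand-rolled sketch (illustrative only; when oob_score=True, sklearn computes the oob_prediction_ attribute used right after for us). Each row is predicted only by the trees whose bootstrap sample skipped it:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = len(xs)
sum_pred, n_pred = np.zeros(n), np.zeros(n)

for i in range(20):
    idx = rng.choice(n, size=n, replace=True)                    # bootstrap sample
    tree = DecisionTreeRegressor(min_samples_leaf=25).fit(xs.iloc[idx], y.iloc[idx])
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                                             # rows this tree never saw
    sum_pred[oob] += tree.predict(xs[oob])
    n_pred[oob] += 1

oob_preds = sum_pred / np.maximum(n_pred, 1)                     # average over the "outside" trees
r_mse(oob_preds[n_pred > 0], y[n_pred > 0])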
r_mse(m.oob_prediction_, y)
0.211107
We can see that our OOB error is much lower than our validation set error. This means that something else is causing that error, in addition to normal generalization error. We’ll discuss the reasons for this later.
Model Interpretation
For tabular data, model interpretation is particularly important. For a given model, the things we are most likely to be interested in are:
- How confident are we in our predictions using a particular row of data?
- What were the most important factors for predicting a particular row of data, and how did they influence that prediction?
- Which columns are the strongest predictors, and which can we ignore?
- Which columns are effectively redundant with each other, for purposes of prediction?
- How do predictions vary, as we vary these columns?
Tree Variance for Prediction Confidence
The random forest averages the trees' predictions to get an overall prediction. To gauge confidence in a prediction, we can also look at the standard deviation across the trees' predictions: a low standard deviation means the trees largely agree, so confidence is high.
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds_std = preds.std(0)  # standard deviation across the trees, one value per validation row
preds_std[:5]
array([0.31865899, 0.13626066, 0.08452598, 0.2605348 , 0.09774961])
Feature Importance
We can gain insight into which features (or independent variables) determine the sales price. The feature importance algorithm loops through each tree, and then recursively explores each branch. At each branch, it looks to see what feature was used for that split, and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is summed across all branches of all trees, and finally, the scores are normalized such that they add to 1.
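Here is a rough sketch of that accumulation, walking one fitted sklearn tree at a time. It is illustrative and may differ from sklearn's exact implementation in small details; the helper defined next simply reads sklearn's precomputed feature_importances_:
import numpy as np

def tree_importance(tree, n_features):
    t = tree.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1: continue               # leaf node: no split to credit
        # Improvement from this split: weighted impurity of the parent minus
        # the weighted impurities of its two children.
        gain = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
        imp[t.feature[node]] += gain
    return imp / imp.sum()                    # normalize so the scores add to 1

# Average the per-tree scores across the forest and compare with m.feature_importances_.
manual_imp = np.mean([tree_importance(t, xs.shape[1]) for t in m.estimators_], axis=0)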
def rf_feat_importance(m, df):
return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)
fi[:10] # Result below
Removing Low-Importance variables
We can check if the removal of low-importance features alters our accuracy.
to_keep = fi[fi.imp>0.005].cols
to_keep
57 YearMade
6 ProductSize
30 Coupler_System
7 fiProductClassDesc
54 ModelID
65 saleElapsed
3 fiSecondaryDesc
32 Hydraulics_Flow
31 Grouser_Tracks
12 Enclosure
1 fiModelDesc
9 ProductGroup
52 SalesID
2 fiBaseModel
59 saleYear
10 ProductGroupDesc
53 MachineID
5 fiModelDescriptor
23 Hydraulics
28 Tire_Size
11 Drive_System
Name: cols, dtype: object
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
new_rf_model = rf(xs_imp, y)
m_rmse(new_rf_model, xs_imp, y), m_rmse(new_rf_model, valid_xs_imp, valid_y)
(0.181304, 0.232254)
Previously, the result was (0.211677, 0.268163). By keeping only the important features, the model has improved its result while becoming simpler.
def plot_fi(fi): return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
plot_fi(rf_feat_importance(new_rf_model, xs_imp));
Tree Interpreter
We can use a tree interpreter to check how these columns influence the predictions. (In the snippets below, xs_final and valid_xs_final refer to the training and validation sets reduced to the important columns, m should be a random forest trained on those columns, and rows holds a handful of validation rows to explain.)
#hide
import warnings
warnings.simplefilter('ignore', FutureWarning)
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
prediction,bias,contributions = treeinterpreter.predict(m, rows.values)
prediction[0], bias[0], contributions[0].sum()
(array([9.96608845]), 10.104403498584746, -0.13831504470110959)
prediction is simply the prediction that the random forest makes. bias is the prediction based on taking the mean of the dependent variable (i.e., the model that is the root of every tree). contributions is the total change in prediction due to each of the independent variables. Therefore, the sum of contributions plus bias must equal the prediction, for each row.
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
rotation_value=45,formatting='{:,.3f}');
You can use this visualized data to provide useful information to users of your data product about the underlying reasoning behind the predictions.
Extrapolation in Random Forests and Neural Networks
Extrapolation is a problem that can occur in random forests when the model tries to make predictions outside the range of the data it was trained on. This can happen when the model is presented with input data that is outside the range of the feature values seen during training, or when the model is presented with input data that is in a different domain than the training data. Let’s take a look at the example below:
x_lin = torch.linspace(0,20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
xs_lin = x_lin.unsqueeze(1)  # sklearn expects a 2D feature matrix
m_lin = RandomForestRegressor().fit(xs_lin[:30],y_lin[:30])
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);
Red dots represent the random forest's predictions, and blue dots are the actual values we want the model to predict. The model has been trained on the first 30 of the 40 data points. Beyond those 30 points, the predictions flatten out: a tree can only ever predict the average of the training values in a leaf, so random forests cannot extrapolate outside the range of values they have seen during training.
Finding out-of-domain data
The random forest can help identify which features (or columns) differ significantly between training and validation sets.
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
Once we identify the columns with the highest feature importance, we can check if dropping each column will offer better accuracy.
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))
for c in ('SalesID','saleElapsed','MachineID'):
m = rf(xs_final.drop(c,axis=1), y)
print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))
orig 0.232106
SalesID 0.230855
saleElapsed 0.235484
MachineID 0.231039
Dropping SalesID and MachineID columns will improve the accuracy of our model.
time_vars = ['SalesID','MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)
m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
0.229883
Removing these variables has slightly improved the model’s accuracy; but more importantly, it should make it more resilient over time, and easier to maintain and understand. It can often uncover subtle domain shift issues that you may otherwise miss.
One thing that might help in our case is simply to avoid using old data. Old data may reflect relationships that no longer hold. Let's try using just the most recent few years of the data.
xs['saleYear'].hist();
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))
filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]
m = rf(xs_filt, y_filt)
m_rmse(m, valid_xs_time, valid_y)
orig 0.232511
0.22889
The result is slightly better than using the entire dataset, which shows that training on a recent subset of the data can give better accuracy.
Using a neural network
We can use the same approach to build a neural network model. Let's first replicate the steps we took to set up the TabularPandas object.
from fastai.tabular.all import *
# Create a DataFrame Panda object.
df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
# Set ProductSize as category.
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df_nn['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
# Take log() of the dependent variable so it is more evenly distributed
# and less skewed.
dep_var = 'SalePrice'
df_nn[dep_var] = np.log(df_nn[dep_var])
# add_datepart breaks the date down into more granular columns:
# saleYear, saleMonth, saleWeek, etc.
df_nn = add_datepart(df_nn, 'saledate')
# xs_final_time.columns are the ones from random forest model.
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
Categorical columns are handled very differently in neural networks compared to decision tree approaches: the neural network handles categorical variables by using embeddings. To create embeddings, fastai needs to determine which columns should be treated as categorical variables. It does this by comparing the number of distinct levels in the variable to the value of the max_card parameter. If the number of levels is lower, fastai will treat the variable as categorical. Embedding sizes larger than 10,000 should generally only be used after you've tested whether there are better ways to group the variable, so we'll use 9,000 as max_card:
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
cont_nn,cat_nn
(['saleElapsed'],
['YearMade',
'ProductSize',
'Coupler_System',
'fiProductClassDesc',
'ModelID',
'fiSecondaryDesc',
'Hydraulics_Flow',
'Enclosure',
'fiModelDesc',
'ProductGroup',
'fiModelDescriptor',
'Hydraulics',
'Tire_Size',
'Drive_System'])
Let’s take a look at the cardinality of each of the categorical variables that we have chosen so far:
df_nn_final[cat_nn].nunique()
YearMade 73
ProductSize 6
Coupler_System 2
fiProductClassDesc 74
ModelID 5281
fiSecondaryDesc 177
Hydraulics_Flow 3
Enclosure 6
fiModelDesc 5059
ProductGroup 6
fiModelDescriptor 140
Hydraulics 12
Tire_Size 17
Drive_System 4
dtype: int64
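For a sense of how large the embedding for each of these variables would be, the sketch below mirrors what I understand to be fastai's default sizing heuristic (treat the exact formula as an assumption; fastai's own rule lives in get_emb_sz / emb_sz_rule):
def approx_emb_sz(n_cat):
    # Believed to mirror fastai's default: grows slowly with cardinality, capped at 600.
    return min(600, round(1.6 * n_cat ** 0.56))

for name, n in [('ProductSize', 6), ('Hydraulics', 12), ('fiModelDesc', 5059), ('ModelID', 5281)]:
    print(f'{name}: {n} levels -> embedding size of roughly {approx_emb_sz(n)}')
High-cardinality columns like fiModelDesc and ModelID dominate the size of the embedding layers, which is why the next step checks whether one of them can be dropped.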
There are two variables (ModelID and fiModelDesc) with similarly high cardinalities, which suggests they may contain similar, redundant information. We can use the random forest to quickly check the impact of removing one of these model columns.
xs_filt2 = xs_filt.drop('fiModelDesc', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDesc', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
(0.180523, 0.232354) # validation RMSE 0.232354 vs. 0.22889 before dropping: only slightly worse
The result shows that removing fiModelDesc has only a tiny impact on the random forest model. If we kept fiModelDesc, we would be adding an embedding matrix with 5,059 rows to the model. Therefore, we will remove it from our deep learning model, since doing so improves performance and barely affects accuracy according to our quick random forest test.
cat_nn.remove('fiModelDesc')
cat_nn
['YearMade',
'ProductSize',
'Coupler_System',
'fiProductClassDesc',
'ModelID',
'fiSecondaryDesc',
'Hydraulics_Flow',
'Enclosure',
'fiModelDescriptor',
'ProductGroup',
'Hydraulics',
'Tire_Size',
'Drive_System']
We can create our TabularPandas object in the same way as when we created our random forest, with one significant addition: normalization. A random forest does not need any normalization, because the tree-building procedure cares only about the order of values in a variable, not about how they are scaled. But as we have seen, a neural network definitely does care about this. Therefore, we add the Normalize processor when we build our TabularPandas object.
procs_nn = [Categorify, FillMissing, Normalize]
# Split the dataset into training and validation sets based on the year.
cond = (df_nn.saleYear<2011) | (df_nn.saleMonth<10)
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx),list(valid_idx))
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
splits=splits, y_names=dep_var)
By default, for tabular data fastai creates a neural network with two hidden layers, with 200 and 100 activations, respectively. This works quite well for small datasets, but here we’ve got quite a large dataset, so we increase the layer sizes to 500 and 250.
# Set batch size to 1024.
dls = to_nn.dataloaders(1024)
learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
n_out=1, loss_func=F.mse_loss)
learn.fit_one_cycle(5, 1e-2)  # train before predicting; fastbook fits for 5 epochs at lr=1e-2
preds,targs = learn.get_preds()
r_mse(preds,targs)
0.228958 # similar to random forest accuracy: 0.22889
Ensembling
An ensemble of the random forest and the neural network can further increase the accuracy of our model. The errors of the two sets of predictions are not fully correlated with each other, so averaging them reduces the overall error.
rf_preds = m.predict(valid_xs_time)
ens_preds = (to_np(preds.squeeze()) + rf_preds) /2
r_mse(ens_preds,valid_y)
0.222134
Boosting
Boosting is another ensembling method. Unlike bagging, which averages the predictions of many independently trained models, boosting trains models sequentially, each one fitting the errors left by the previous ones. Here is how boosting works (a small code sketch follows the list):
- Train a small model that under-fits your dataset.
- Calculate the predictions in the training set for this model.
- Subtract the predictions from the targets; these are called the “residuals” and represent the error for each point in the training set.
- Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
- Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.
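Here is a minimal residual-boosting sketch of those steps. It is illustrative only (not part of the article's pipeline); it reuses the xs_filt, y_filt, valid_xs_time, and valid_y variables from the random forest sections and adds a small learning rate, a standard refinement that shrinks each tree's contribution:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.3
pred_train = np.zeros(len(y_filt))
pred_valid = np.zeros(len(valid_y))

for i in range(50):
    # Step 1: train a small model that under-fits (a shallow tree) on the current targets.
    residuals = y_filt - pred_train          # steps 2-3: the error left to explain
    tree = DecisionTreeRegressor(max_depth=3).fit(xs_filt, residuals)
    # Step 4: add this tree's (shrunk) predictions and repeat on the new residuals.
    pred_train += learning_rate * tree.predict(xs_filt)
    pred_valid += learning_rate * tree.predict(valid_xs_time)

# Step 5 in practice: stop once the validation error stops improving.
r_mse(pred_valid, valid_y)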
Conclusion
There are two main approaches to tabular data: ensembles of decision trees (random forests and gradient boosting machines) and neural networks.
- Random forests are the easiest to train because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.
- Gradient boosting machines in theory are just as fast to train as random forests, but in practice, you will have to try lots of different hyperparameters. They can overfit, but they are often a little more accurate than random forests.
- Neural networks take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.
Start the analysis with a random forest. This gives a strong baseline and determines if it’s a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data. From that foundation, you can try neural nets and GBMs, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them. If decision tree ensembles are working well for you, try adding the embeddings for the categorical variables to the data, and see if that helps your decision trees learn better.