A stroll in the Random Forest as a software engineer
What is a random forest and how is it used? A software engineer's deep dive into a machine learning concept: the random forest.
Introduction
A random forest is composed of decision trees, and a decision tree is made of binary splits. I will explain from the bottom up (from binary splits to the random forest). The code snippets shown here are from my FastAI notebook (Kaggle).
What is a binary split?
A binary split is where all rows are placed into one of two groups, based on whether they’re above or below some column threshold.
For example, we could split the rows of our dataset into males and females by using the threshold 0.5 and the column Sex (since the values in the column are 0 for females and 1 for males).
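To make that concrete, here is a minimal sketch (assuming the df DataFrame and the 0/1 Sex encoding from the notebook): a binary split is just a boolean mask over the rows.
# Minimal sketch of a binary split: every row lands in exactly one of two groups.
is_female = df["Sex"] <= 0.5   # boolean mask, one entry per row
females, males = df[is_female], df[~is_female]
print(len(females), len(males))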
import seaborn as sns
import matplotlib.pyplot as plt
# df is the Titanic training DataFrame and dep is the name of the dependent
# variable column (survival), both defined earlier in the notebook.
fig,axs = plt.subplots(1,2, figsize=(15,5))
# Left graph: survival rate by sex.
sns.barplot(data=df, y=dep, x="Sex", ax=axs[0]).set(title="Survival rate")
# Right graph: passenger count by sex.
sns.countplot(data=df, x="Sex", ax=axs[1]).set(title="Histogram (count)");
There are multiple columns (a.k.a. features) in this dataset. We can determine which column gives the best binary split by using an impurity measure, which tells us how similar the rows within each of the two resulting groups are to each other. Basically, if everyone in a group (e.g. the female group) survived, that group's impurity is 0, and the same holds if nobody in the group survived. The lower the impurity value, the more homogeneous (pure) the groups.
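As a toy illustration of that idea (not from the notebook), the ingredient used in the code below is the standard deviation of the survived values within a group: a group where everyone has the same outcome has a standard deviation of 0.
import numpy as np
# A group where everyone has the same outcome is perfectly "pure".
all_survived = np.array([1, 1, 1, 1])  # everyone in the group survived
mixed = np.array([1, 0, 1, 0])         # half survived
print(all_survived.std())  # 0.0 -> homogeneous group
print(mixed.std())         # 0.5 -> less homogeneous group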
# y is the dependent variable (whether the passenger survived).
def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    # Standard deviation of the y values on this side of the split, weighted
    # by the group size: lower means the group is more homogeneous.
    return y[side].std()*tot
def score(col, y, split):
    '''
    Determine the impurity score of splitting a column at `split`.
    col   - x values of the column (independent variable).
    y     - y values (dependent variable), the outcome we want to predict.
    split - value used to do the binary split; works for categorical or
            continuous variables.
    '''
    lhs = col<=split
    return (_side_score(lhs,y) + _side_score(~lhs,y))/len(y)
import numpy as np
def min_col(df, nm):
    '''
    Find the split value of column `nm` that results in the lowest impurity
    score, by trying every distinct value in the column as the threshold.
    Returns (best split value, its score).
    '''
    col,y = df[nm],df[dep]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx],scores[idx]
# cols is the list of candidate feature columns defined earlier in the notebook.
{o:min_col(trn_df, o) for o in cols}
# Output: (best split value, impurity score) per column.
{'Sex': (0, 0.40787530982063946),  # Sex wins! :)
 'Embarked': (0, 0.47883342573147836),
 'Age': (6.0, 0.478316717508991),
 'SibSp': (4, 0.4783740258817434),
 'Parch': (0, 0.4805296527841601),
 'LogFare': (2.4390808375825834, 0.4620823937736597),
 'Pclass': (2, 0.46048261885806596)}
# LogFare - log(fare price)
# Pclass - passenger class
According to this, Sex<=0 is the best split we can use. Let's try it out by computing the mean_absolute_error between the actual survived values (booleans) and the predictions this single rule makes; the lower the error, the better the rule. A model built from a single rule like this is called a OneR classifier.
from sklearn.metrics import mean_absolute_error
# Convert Sex (0 - female, 1 - male) into boolean.
# We are predicting that if you are female, you are more likely to survive.
preds = val_xs.Sex==0
mean_absolute_error(val_y, preds)
0.21524663677130046
# Compare with LogFare.
preds = val_xs.LogFare>2.7
mean_absolute_error(val_y, preds)
0.336322869955157
How can we improve our OneR classifier, which predicts survival based only on Sex? We can improve it with a decision tree.
What is a decision tree?
At a high level, a decision tree is a tree structure of binary splits over a range of independent variables: each row follows the splits from the root down to a leaf, and the leaf gives the predicted answer.
Previously, we did the binary split on the column Sex, into female and male. For a decision tree, we want to create one more binary split for each of those groups. That is: find the single best split for females, and the single best split for males. To do this, all we have to do is repeat the previous steps, once for males, and once for females.
# Remove the Sex column since it's already been used.
cols.remove("Sex")
ismale = trn_df.Sex==1
males,females = trn_df[ismale],trn_df[~ismale]
# Find the split with the lowest impurity for each group.
{o:min_col(males, o) for o in cols}
{'Embarked': (0, 0.3875581870410906),
'Age': (6.0, 0.3739828371010595), # Age wins.
'SibSp': (4, 0.3875864227586273),
'Parch': (0, 0.3874704821461959),
'LogFare': (2.803360380906535, 0.3804856231758151),
'Pclass': (1, 0.38155442004360934)}
{o:min_col(females, o) for o in cols}
{'Embarked': (0, 0.4295252982857327),
'Age': (50.0, 0.4225927658431649),
'SibSp': (4, 0.42319212059713535),
'Parch': (3, 0.4193314500446158),
'LogFare': (4.256321678298823, 0.41350598332911376),
'Pclass': (2, 0.3335388911567601)} # Pclass wins.
# ...and so forth. The idea is that we repeat this process until we reach
# the maximum number of leaf nodes that we set.
We can see that the best next binary split for males is Age<=6, and for females is Pclass<=2.
By adding these rules, we have created a decision tree, where our model will first check whether Sex is female or male, and depending on the result will then check either the Age or the Pclass rule above, as appropriate. We could then repeat the process, creating new additional rules for each of the four groups we've now created.
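Before automating this, here is a minimal sketch (not from the original notebook) of applying the two hand-derived rules as a predictor, assuming the val_xs and val_y objects from the earlier cells and, for illustration, that the "purer" side of each split predicts survival.
import numpy as np
from sklearn.metrics import mean_absolute_error
# Hand-built two-level rule (illustrative): predict survival for females in
# Pclass 1-2, and for males aged 6 or younger.
preds = np.where(val_xs.Sex==0, val_xs.Pclass<=2, val_xs.Age<=6)
mean_absolute_error(val_y, preds)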
We can automate this with DecisionTreeClassifier from sklearn, which does exactly that for us:
from sklearn.tree import DecisionTreeClassifier
m = DecisionTreeClassifier(min_samples_leaf=50)
m.fit(trn_xs, trn_y)
# draw_tree is a helper defined in the notebook (not part of sklearn).
draw_tree(m, trn_xs, size=12)
mean_absolute_error(val_y, m.predict(val_xs))
0.18385650224215247
Compared to ~0.21 from the single binary split on the column Sex, 0.18 is a good improvement!
Can we improve this further?
What is a random forest?
A random forest averages the predictions of many decision trees, each trained on a different random subset of the data. The idea is that we want each model's predictions in the averaged ensemble to be as uncorrelated as possible with the other models. That way, when we average the predictions, the errors tend to cancel out and the average moves toward the true target value, because the average of lots of uncorrelated random errors is close to zero.
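As a rough sketch of that idea (hand-rolled bagging, not the actual RandomForestClassifier implementation, and assuming the trn_xs, trn_y, val_xs, and val_y pandas objects from the earlier cells), we can train several trees on bootstrap samples of the rows and average their predictions:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error
def fit_bagged_trees(xs, y, n_trees=50, min_samples_leaf=5):
    # Each tree sees a different bootstrap sample (rows drawn with replacement),
    # so the trees make somewhat different, less correlated errors.
    rng = np.random.default_rng(42)
    trees = []
    for _ in range(n_trees):
        idxs = rng.integers(0, len(xs), len(xs))
        trees.append(DecisionTreeClassifier(min_samples_leaf=min_samples_leaf)
                     .fit(xs.iloc[idxs], y.iloc[idxs]))
    return trees
trees = fit_bagged_trees(trn_xs, trn_y)
# Average the per-tree predictions, then threshold at 0.5.
avg = np.mean([t.predict(val_xs) for t in trees], axis=0)
mean_absolute_error(val_y, avg > 0.5)
scikit-learn's RandomForestClassifier does this for us, and additionally considers a random subset of the columns at each split, which decorrelates the trees even further: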
from sklearn.ensemble import RandomForestClassifier
# 200 trees; each leaf must contain at least 5 rows.
rf = RandomForestClassifier(200, min_samples_leaf=5)
rf.fit(trn_xs, trn_y);
mean_absolute_error(val_y, rf.predict(val_xs))
0.18385650224215247
In this case, the random forest did not improve on our single decision tree; it performed about the same. Why is that? Let's analyze.
# Plot the feature importances of the decision tree m that we fit earlier.
pd.DataFrame(dict(cols=trn_xs.columns, imp=m.feature_importances_)).plot('cols', 'imp', 'barh');
The diagnostic chart above shows that two features, Sex and Pclass, account for roughly 90% of the importance in predicting whether a passenger survives. That explains why creating many different variants of trees won't help much: only a few features dominate the fate of a passenger. Sad.
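As a quick numeric check on that reading (a small sketch using the feature_importances_ attribute that fitted sklearn trees and forests expose), we can sort the importances and look at their running total:
import pandas as pd
imp = pd.Series(m.feature_importances_, index=trn_xs.columns).sort_values(ascending=False)
print(imp.cumsum())  # the top two features should account for most of the total
# The same can be done with rf.feature_importances_ for the random forest.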
Conclusion
I think the first thing I’d note from this is that, clearly, more complex models aren’t always better. Our OneR
model, consisting of a single binary split, was nearly as good as our more complex models. Perhaps in practice a simple model like this might be much easier to use, and could be worth considering. Our random forest wasn’t an improvement on the single decision tree at all.
So we should always be careful to benchmark simple models, to see if they're good enough for our needs. In practice, you will often find that simple models will have trouble providing adequate accuracy for more complex tasks, such as recommendation systems, NLP, computer vision, or multivariate time series. But there's no need to guess: it's so easy to try a few different models, there's no reason not to give the simpler ones a go too!
References
https://www.kaggle.com/code/stephenslee/how-random-forests-really-work