Elena's AI Blog

Decision Tree versus Random Forest, and Hyperparameter Optimisation

06 Nov 2023 / 25 minutes to read

Elena Daehnhardt


Midjourney, November 2023


Introduction

Decision trees, with their elegant simplicity and transparency, stand in stark contrast to the robust predictive power of Random Forest, an ensemble of trees. In this post, we compare the key distinctions, advantages, and trade-offs between these two approaches. We will use Scikit-Learn to train and test both models and perform hyperparameter optimisation to find the parameter settings that improve each model's performance.

Machine Learning with Scikit-learn

Scikit-learn, often called sklearn, is a versatile and comprehensive machine-learning library in Python. It offers a rich collection of tools and functions for building, training, and evaluating machine learning models.

Scikit-learn supports a wide variety of algorithms and covers many machine-learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and more. It provides a solid foundation for machine learning experiments, from data preprocessing to model evaluation.

Scikit-learn also provides helpful tools for data splitting, cross-validation, hyperparameter tuning and metrics for assessing model performance.
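
For orientation, here is a minimal sketch of the corresponding imports (all standard scikit-learn modules):

# Data splitting and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Metrics for assessing model performance
from sklearn.metrics import accuracy_score, classification_report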

You can install scikit-learn and its dependencies using pip, a popular Python package manager. Open your terminal or command prompt and enter the following command to install scikit-learn:

pip install scikit-learn

Once installed, you can import scikit-learn into your Python code using the following import statement:

import sklearn

Scikit-learn exposes a well-defined, consistent interface across its algorithms; the following methods work the same way for most estimators:

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)
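
The same unified interface also covers evaluation; a minimal sketch, assuming model is any fitted scikit-learn classifier:

# Evaluate the model directly on the test data
# (for classifiers, score returns the mean accuracy)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")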

Titanic dataset

We will follow the essential machine-learning steps described in my previous post Machine Learning Process.

We will work with the Titanic dataset, which is one of the most famous and widely used datasets in the field of data science and machine learning. It contains information about the passengers aboard the RMS Titanic, the ill-fated ocean liner that sank on its maiden voyage in April 1912 after striking an iceberg.

The Titanic dataset has become a standard resource for data analysis, predictive modelling, and machine learning exercises due to its historical significance and the variety of features it includes.

Code in GitHub

You can follow the code in the Colab notebook. I'd appreciate a star on the repository, which is shared for free. Thanks!

A decision tree

A decision tree is a widely used machine learning model and a fundamental tool in data analysis. It is a tree-like structure that represents a set of decisions and their possible consequences.

Decision trees are used for both classification and regression tasks. They are prevalent for their simplicity, interpretability, and effectiveness in various domains.

Here are the key characteristics and concepts associated with decision trees:

  1. Tree Structure: A decision tree is structured like a flowchart or a tree, with nodes representing decisions, branches representing possible outcomes, and leaves (terminal nodes) representing the final predictions or classifications.

  2. Nodes: The decision tree consists of different types of nodes, including:
    • Root Node: The top node of the tree, representing the initial decision or feature used to split the data.
    • Internal Nodes: Nodes in the middle of the tree that represent subsequent decisions or feature splits.
    • Leaf Nodes: Terminal nodes that provide the final predictions or classifications.
  3. Splits: At each internal node, the data is split into two or more branches based on a particular feature or condition. These splits are determined through questions or criteria about the input data.

  4. Decision Rules: Each split is guided by a decision rule or feature condition. These rules are based on the values of input features and help determine which branch to follow.

  5. Predictions: At the leaf nodes, decision trees provide the final predictions (in regression tasks) or class labels (in classification tasks) for the given input.

  6. Interpretability: Decision trees are highly interpretable, as one can easily trace the decision path from the root to the leaves, making them valuable for understanding and explaining the model’s reasoning.

  7. Training: Decision trees are trained on labelled data, and the tree structure is learned with algorithms such as ID3, C4.5, or CART.

  8. Overfitting: Decision trees are prone to overfitting, where they become too complex and fit the noise in the training data. Techniques like pruning and ensemble methods such as random forests are employed to address this issue (a minimal pruning sketch follows this list).

  9. Applications: Decision trees are used in various applications, including classification problems (e.g., spam email detection, disease diagnosis) and regression tasks (e.g., price prediction, demand forecasting).
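
As promised in point 8, here is a minimal sketch of the complexity-limiting settings that scikit-learn's DecisionTreeClassifier offers (the values are illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

# Limit tree complexity to reduce overfitting (illustrative values):
# - max_depth caps how deep the tree may grow
# - min_samples_leaf forces each leaf to cover enough samples
# - ccp_alpha applies cost-complexity (post-)pruning
pruned_tree_clf = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=10,
    ccp_alpha=0.001,
    random_state=42
)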

Next, we will walk you through the process with an example using the Titanic dataset, a classic dataset for binary classification tasks. In this example, we’ll build a decision tree to predict whether passengers survived or not based on various features.

Import Libraries

We start by importing the necessary libraries and modules:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load and Prepare Data

We load the Titanic dataset from a URL (my GitHub repository) and prepare it for modelling:

# Load the dataset
url = 'https://raw.githubusercontent.com/edaehn/python_tutorials/main/titanic/train.csv'
titanic = pd.read_csv(url)

# Drop unnecessary columns and handle missing values
titanic = titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
titanic['Fare'] = titanic['Fare'].fillna(titanic['Fare'].mean())
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic['Survived'] = titanic['Survived'].astype(int)

# Define features (X) and target (y)
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']

Let’s check the X features table.

X.head()
   Pclass  Sex  Age  SibSp  Parch     Fare
0       3    0   22      1      0     7.25
1       1    1   38      1      0  71.2833
2       3    1   26      0      0    7.925
3       1    1   35      1      0     53.1
4       3    0   35      0      0     8.05

Split the Data

Split the data into training and testing sets:

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the data shapes
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}\ny_train: {y_train.shape}, y_test: {y_test.shape}")
X_train: (712, 6), X_test: (179, 6)
y_train: (712,), y_test: (179,)

Create and Train the Decision Tree Model

Initialise a DecisionTreeClassifier, fit it to the training data, and make predictions:

# Create a DecisionTreeClassifier
decision_tree_clf = DecisionTreeClassifier()

# Fit the model to the training data
decision_tree_clf.fit(X_train, y_train)

# Make predictions on the test data
decision_tree_y_pred = decision_tree_clf.predict(X_test)

Evaluate the Model

Assess the model’s performance:

# Calculate the accuracy of the model
decision_tree_accuracy = accuracy_score(y_test, decision_tree_y_pred)
print(f"Accuracy: {decision_tree_accuracy:.2f}")
Accuracy: 0.77

The accuracy score is not bad; however, it can be further improved. This example demonstrates using a decision tree classifier from scikit-learn to build a predictive model.

Decision trees are highly interpretable, and you can visualise them using scikit-learn’s tools to better understand how the model makes decisions.

# Import tree and matplotlib for plotting
from sklearn import tree
import matplotlib.pyplot as plt

# Plot the decision tree (a larger figure keeps the node labels readable)
plt.figure(figsize=(20, 10))
tree.plot_tree(decision_tree_clf, feature_names=X_train.columns, class_names=['Died','Survived'], filled=True)
plt.show()
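
If the plotted tree is too large to read, scikit-learn can also print the same rules as plain text; a small sketch using export_text:

from sklearn.tree import export_text

# Print the learned decision rules as indented text
tree_rules = export_text(decision_tree_clf, feature_names=list(X_train.columns))
print(tree_rules)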

Please note that I have reused the decision tree graph from my previous post below.

Decision Tree trained on the Titanic data
Machine Learning Tests

Interested in machine learning with the Titanic dataset? Refer to my post Machine Learning Tests using the Titanic dataset, where we compare the performance of Logistic Regression, Decision Tree, and Random Forest from Python's scikit-learn library with a Neural Network created in TensorFlow. The Random Forest performed the best!

Random forest

Why is Random Forest one of my favourite models? A few light-hearted reasons:

1. Random Forest is my go-to ML model because it's like a magician who can handle missing data and still pull an accurate prediction out of the hat.

2. It's like the Swiss Army Knife of ML – it can handle classification, regression, and even feature selection with ease.

3. Because it's like a forest party, and everyone's invited – even the outliers!

4. Random Forest is my favorite ML model because it's like a team of detectives solving a complex crime – they all have different clues, but together, they crack the case!

We use the steps above to load and prepare data and split it into training and test sets.

To use the Random Forest classifier from scikit-learn, we have to import it first.

from sklearn.ensemble import RandomForestClassifier

Initialise a RandomForestClassifier, fit it to the training data, and make predictions.

# Create a RandomForestClassifier
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model to the training data
random_forest_clf.fit(X_train, y_train)

# Make predictions on the test data
random_forest_y_pred = random_forest_clf.predict(X_test)

Assess the RandomForestClassifier’s performance with the accuracy_score function.

# Calculate the accuracy of the Random Forest model
random_forest_clf_accuracy = accuracy_score(y_test, random_forest_y_pred)
print(f"Accuracy: {random_forest_clf_accuracy:.2f}")
Accuracy: 0.80

The accuracy is slightly improved. Note that we have not yet performed hyperparameter tuning.

Performance Graphs

Next, we create performance graphs for both classifiers to visualise their accuracy. We draw graphs with the matplotlib library, which we first import.

# import matplotlib
import matplotlib.pyplot as plt

# Create a bar chart to compare model accuracies
models = ['Decision Tree', 'Random Forest']
accuracies = [decision_tree_accuracy, random_forest_clf_accuracy]

plt.bar(models, accuracies, color=['blue', 'green'])
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Classifier Performance Comparison')
plt.ylim(0, 1)

plt.show()

Hyperparameter optimisation techniques

You can further fine-tune the model's hyperparameters, such as the maximum depth of the tree or the minimum number of samples required to split a node, to improve its performance or control overfitting (I am going to explore overfitting further in my next post).
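
For instance, a hand-set sketch before we automate the search (the values below are guesses, not tuned):

# Manually constrain the random forest (illustrative values only)
manual_rf_clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42
)
manual_rf_clf.fit(X_train, y_train)
print(f"Accuracy with hand-picked settings: {manual_rf_clf.score(X_test, y_test):.2f}")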

Here are Python examples of using three different hyperparameter optimisation techniques: Grid Search, Random Search, and Bayesian Optimization. We’ll use the scikit-learn library for Grid and Random Search and the scikit-optimize library for Bayesian Optimization.

In these examples, we’ll perform hyperparameter tuning for the Random Forest classifier.

We will also measure the execution time of each hyperparameter optimisation technique using ipython-autotime.
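
A minimal setup sketch for a Colab or Jupyter notebook (once the extension is loaded, every cell reports its execution time):

pip install ipython-autotime

Then, in a notebook cell:

%load_ext autotime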

Grid Search exhaustively searches the entire hyperparameter space defined by a predefined grid. It’s a systematic but computationally expensive method.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Perform Grid Search
grid_search = GridSearchCV(random_forest_clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
time: 35.4 s (started: 2023-11-06 11:57:52 +00:00)
best_params
{'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}
best_score
0.8313897370235399
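
GridSearchCV refits the best configuration on the whole training set (refit=True by default), so we can also check it on the held-out test data; a small sketch (your exact score may differ):

# Evaluate the refitted best model on the held-out test set
best_rf = grid_search.best_estimator_
print(f"Test accuracy with tuned parameters: {accuracy_score(y_test, best_rf.predict(X_test)):.2f}")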

Random Search explores the hyperparameter space by randomly selecting parameter combinations. It’s less computationally intensive than Grid Search and can often discover good parameter settings quickly.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the parameter distribution
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Perform Random Search
random_search = RandomizedSearchCV(random_forest_clf, param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X_train, y_train)

# Best parameters and score
best_params = random_search.best_params_
best_score = random_search.best_score_
time: 28.5 s (started: 2023-11-06 12:00:46 +00:00)
best_params
{'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 186}
best_score
0.8313897370235399

Random Search performs quite well. In this example, we have found that we can achieve a comparable accuracy score with fewer trees in the forest :)

Bayesian Optimisation

Bayesian Optimisation uses probabilistic models to estimate the objective function, making it an efficient method for hyperparameter tuning. We’ll use the scikit-optimize library for Bayesian Optimization.

You will have to install the library with the following:

pip install scikit-optimize

Once installed, import the search class:

from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter search space
param_space = {
    'n_estimators': (50, 200),
    'max_depth': (None, 10, 20),
    'min_samples_split': (2, 5, 10)
}


# Perform Bayesian Optimization
bayes_search = BayesSearchCV(random_forest_clf, param_space, cv=5, n_iter=10, n_jobs=-1)
bayes_search.fit(X_train, y_train)

# Best parameters and score
best_params = bayes_search.best_params_
best_score = bayes_search.best_score_
time: 14.8 s (started: 2023-11-06 12:01:57 +00:00)
best_params
OrderedDict([('max_depth', 10),
             ('min_samples_split', 10),
             ('n_estimators', 106)])
best_score
0.8299911356249385

Choose the hyperparameter optimisation technique that best fits your problem’s computational constraints and the desired level of exploration in the parameter space. Bayesian Optimisation is often the most efficient, but it may require additional libraries like scikit-optimize.

Random Forests versus Decision Trees

What are the key characteristics, advantages and drawbacks of decision trees and random forests?

Decision Trees:

Key Characteristics:

  1. Interpretable: Decision trees are highly interpretable, allowing you to understand the decision-making process by following the tree’s branches and nodes.

  2. Simplicity: They are relatively simple to understand and implement, making them accessible for beginners.

  3. Nonlinear Relationships: Decision trees can model complex, nonlinear relationships in the data.

  4. Feature Importance: They provide information about feature importance, helping you identify which features are most relevant to the target variable (see the sketch after this list).
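
As mentioned in point 4, a fitted tree exposes its importance scores; a minimal sketch using the decision_tree_clf trained earlier (the output will vary with the data and the tree):

import pandas as pd

# Rank the Titanic features by the trained tree's importance scores
importances = pd.Series(decision_tree_clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))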

Advantages:

  1. Interpretability: Decision trees are transparent and easy to explain, making them useful for decision support and domain-specific insights.

  2. Versatility: They can be used for classification and regression tasks.

  3. Handles Mixed Data: Decision trees can handle a mix of categorical and numerical data without extensive preprocessing.

  4. Resistance to Outliers: Decision trees are less affected by outliers in the data than other models.

Drawbacks:

  1. Overfitting: Decision trees can be prone to overfitting, where they capture noise in the training data, resulting in poor generalisation to new data.

  2. Instability: Small changes in the data can lead to significantly different tree structures, making them unstable.

  3. Bias: They can have high bias, mainly when the tree is too shallow or the dataset is imbalanced.

  4. Inadequate Handling of Missing Values: Traditional decision tree algorithms do not handle missing data well, requiring imputation or additional preprocessing.

Random Forest:

Key Characteristics:

  1. Ensemble Method: Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive performance.

  2. Bagging: It uses a bagging (Bootstrap Aggregating) technique to create subsets of the data and train individual trees on each subset.

  3. Feature Sampling: Random Forest employs feature sampling, meaning that not all features are considered at each split, reducing the risk of overfitting (see the sketch after this list).

  4. Predictive Accuracy: Random Forest offers improved predictive accuracy and generalisation compared to single decision trees.
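
Bagging and feature sampling (points 2 and 3) map directly onto RandomForestClassifier parameters; a minimal sketch with illustrative settings (max_features='sqrt' is the classification default):

from sklearn.ensemble import RandomForestClassifier

# Bagging and feature sampling are controlled in the constructor
rf_sketch = RandomForestClassifier(
    n_estimators=100,       # number of trees in the ensemble
    bootstrap=True,         # bagging: each tree sees a bootstrap sample of the rows
    max_samples=None,       # None: bootstrap samples are as large as the training set
    max_features='sqrt',    # feature sampling: features considered at each split
    random_state=42
)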

Advantages:

  1. Improved Performance: Random Forest tends to provide higher predictive performance than individual decision trees, thanks to the ensemble of trees.

  2. Reduced Overfitting: Combining multiple trees and feature sampling reduces the risk of overfitting.

  3. Handles High-Dimensional Data: Random Forest can handle high-dimensional data with many features.

  4. Automatic Feature Selection: It assesses feature importance during training, making it easier to identify relevant features.

Drawbacks:

  1. Reduced Interpretability: Random Forest models are less interpretable than single decision trees, making it harder to extract insights.

  2. Computationally Intensive: Training a Random Forest can be computationally intensive, particularly when dealing with many trees and features.

  3. Memory Usage: Random Forests can consume significant memory, especially when the dataset is large.

  4. Hyperparameter Tuning: Finding the optimal hyperparameters for a Random Forest can be time-consuming.

Are you curious about bias and variance and their trade-off? Check back soon; I am preparing a short post about these concepts.

In summary, decision trees offer transparency and simplicity but can suffer from overfitting, while Random Forest addresses this drawback with an ensemble of trees, providing better predictive performance at the cost of reduced interpretability.

The choice between them depends on the problem, the need for interpretability, and the trade-offs between model complexity and accuracy.

Conclusion

In summary, a decision tree is a graphical representation of a decision-making process that helps make predictions or classifications by recursively splitting data based on features and conditions.

Random Forest is an ensemble machine learning method that combines multiple decision trees to improve predictive accuracy and reduce overfitting. It is often considered superior to individual decision trees because it leverages the collective predictions of many trees, which improves generalisation and robustness, makes it more resilient to overfitting, and yields higher predictive accuracy.

By comparing the accuracy of decision trees and Random Forests, we gained insights into the trade-offs between interpretability and predictive power. Decision trees, with their transparent decision paths, provide a valuable window into the model’s reasoning. In contrast, Random Forest, as an ensemble of trees, emerged as a robust and high-performing alternative.

We have also explored hyperparameter optimisation to find better parameter settings for improved performance. Bayesian Optimisation performed very well in the shortest time, as measured with ipython-autotime.

Next, we will go in-depth about the model complexity and overfitting problem.

Did you like this post? Please let me know if you have any comments or suggestions.


References

1. All the code in the Colab notebook

2. Machine Learning Process

3. ipython-autotime

4. The Titanic dataset from Kaggle

5. Python tutorials repository

6. Machine Learning Tests using Titanic dataset

7. ChatGPT by OpenAI


About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.




Citation
Elena Daehnhardt. (2023) 'Decision Tree versus Random Forest, and Hyperparameter Optimisation', daehnhardt.com, 06 November 2023. Available at: https://daehnhardt.com/blog/2023/11/06/decision_trees_vs_random_forest_hyperparameters/