Elena's AI Blog

Feature preprocessing

29 Jan 2022 (updated: 03 Jul 2026) / 17 minutes to read

Elena Daehnhardt


Midjourney AI-generated art


TL;DR:
  • Preprocess ML data: one-hot encode categoricals, scale numericals. Use Pandas to explore the data and scikit-learn's make_column_transformer to encode and scale it. Mixed data types need different handling, so do this before training.


Feature Preprocessing for Machine Learning: One-Hot Encoding and Scaling

Feature preprocessing is the set of transformations that convert raw inputs into a form a Machine Learning algorithm can consume, applied before model training. When a dataset mixes feature types, we must prepare the data before feeding it into a Machine Learning algorithm. This happens when inputs (also called features or covariates) include categories such as gender or geographic region alongside features on different numerical scales, for instance a person’s weight or height.

A Machine Learning algorithm typically requires data of a specific type, often numerical only. ML algorithms also tend to perform better or converge faster when data is preprocessed before training. Because the step happens before model training, we call it preprocessing. This article focuses on two feature-preprocessing steps: one-hot encoding of categorical features and feature scaling, through either normalisation or standardisation.

Data Exploration with Pandas: info(), describe(), groupby()

To decide what to do with the data and how to apply Machine Learning to it, we first need to analyse the dataset. We want to determine what features we have, whether they are helpful for our ML goals, how clean the dataset is, and whether it contains missing or noisy data. Quite often, we also need to perform data cleaning or wrangling.

It is useful to visualise features, create tables, remove irrelevant features, or convert input values to different data types. To start playing with data, we first download our dataset, which is provided by Kaggle. We use the Pandas Python library to read the data directly from GitHub:

# Importing libraries further used in code
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance.head(10)

The table shows the first ten rows of the insurance charges dataset we just downloaded. Our main goal is to predict medical insurance charges having a person’s age, sex, BMI, number of children, geographic region, and smoking status.

Insurance table

We can use Pandas functions such as info() and describe() to explore our features in more depth. The info() function prints a summary of the data: column names, data types, and non-null counts, which reveal any missing values.

insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
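
Besides the non-null counts printed by info(), missing values can be counted directly with isna().sum(). The sketch below is a minimal illustration on a small made-up frame; the values are hypothetical and are not taken from the insurance dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not the insurance dataset)
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "smoker": ["yes", None, "no", "no"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

In the insurance dataset every column shows 1338 non-null entries out of 1338 rows, so the same call would report zero missing values for each column.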

If we need to change a data type, we use the astype() function. For instance, we can change the age and number of children from int64 (the default type assigned when the dataset was loaded) to int8, which stores values from -128 to 127 and needs less memory. If you are interested in playing with different data types, read the “Overview of Pandas Data Types” by Chris Moffitt.

insurance['age']= insurance['age'].astype('int8')
insurance['children']= insurance['children'].astype('int8')
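
To see where the saving comes from, here is a minimal sketch on a made-up Series of ages (an assumption for illustration, not the dataset column). It checks that the values fit into int8 before converting and compares the bytes used by the values.

```python
import numpy as np
import pandas as pd

# Hypothetical column of ages stored as 64-bit integers
ages = pd.Series(np.arange(18, 65), dtype="int64")

# int8 holds values from -128 to 127; check the range before downcasting
limits = np.iinfo(np.int8)
assert ages.min() >= limits.min and ages.max() <= limits.max
ages_small = ages.astype("int8")

# Memory used by the values only (index excluded), in bytes
bytes_before = ages.memory_usage(index=False)
bytes_after = ages_small.memory_usage(index=False)
print(bytes_before, bytes_after)
```

Each value shrinks from 8 bytes to 1 byte. For the insurance dataset, two columns of 1338 rows save 2 × 1338 × 7 bytes, which is roughly 18.3 KB.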

When we run the info() function once again, we can see that changing to int8 has already saved about 18 KB of memory.

insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int8   
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int8   
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int8(2), object(3)
memory usage: 55.0+ KB

The describe() function summarises numerical features with statistics such as count, mean, and standard deviation.

insurance.describe()
       age                 bmi                 children            charges
count  1338.0              1338.0              1338.0              1338.0
mean   39.20702541106129   30.663396860986538  1.0949177877429     13270.422265141257
std    14.049960379216172  6.098186911679017   1.2054927397819095  12110.011236693994
min    18.0                15.96               0.0                 1121.8739
25%    27.0                26.29625            0.0                 4740.28715
50%    39.0                30.4                1.0                 9382.033
75%    51.0                34.69375            2.0                 16639.912515
max    64.0                53.13               5.0                 63770.42801

We can analyse the data further with a grouping function and observe that, on average, smokers have larger medical insurance charges.

insurance.groupby("smoker")["charges"].mean()
smoker
no      8434.268298
yes    32050.231832
Name: charges, dtype: float64

We can also draw plots with Pandas. For instance, we can plot a histogram of the “bmi” feature.

insurance["bmi"].plot(kind="hist")

BMI Frequency Plot

Since data exploration is not our main topic, you can read more about it in “Data Exploration 101 with Pandas” by Günter Röhrich. For tools that automate data exploration, read “4 Tools to Speed Up Exploratory Data Analysis (EDA) in Python” by Abdishakur.

Data Preprocessing: Normalisation vs. Standardisation

To prepare our data for ML, we transform it into a more machine-readable form. For instance, we can convert string categories into numerical features and rescale numerical features with normalisation or standardisation.
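
As a quick illustration of converting string categories into numerical features, the sketch below uses pandas get_dummies on a few made-up rows. This is an alternative to the scikit-learn OneHotEncoder applied later in this post, and the toy values are assumptions that only borrow the insurance column names.

```python
import pandas as pd

# Hypothetical toy rows using column names from the insurance dataset
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["smoker", "region"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one 1 among the smoker columns and one 1 among the region columns, which is why the technique is called one-hot encoding.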

Normalisation rescales data to a common range (0 to 1); scikit-learn implements it as MinMaxScaler. Standardisation removes the mean and divides each value by the standard deviation; scikit-learn implements it as StandardScaler. Both methods can improve performance or speed up convergence. The contrast between them:

Method           scikit-learn class  Output                    When to prefer
Normalisation    MinMaxScaler        Values in [0, 1]          Neural networks; bounded inputs
Standardisation  StandardScaler      Zero mean, unit variance  Linear/logistic regression, nearest neighbours, features with outliers
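
The two rescaling formulas can be written out by hand and checked against scikit-learn. This is a minimal sketch on made-up BMI-like values (hypothetical, not taken from the dataset), assuming the default settings of both scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical BMI-like values, shaped as a single feature column
x = np.array([[16.0], [25.0], [30.0], [53.0]])

# Normalisation: (x - min) / (max - min)
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: (x - mean) / std, with the population standard deviation
manual_standard = (x - x.mean()) / x.std()

# The scikit-learn scalers should give the same numbers
sk_minmax = MinMaxScaler().fit_transform(x)
sk_standard = StandardScaler().fit_transform(x)
print(np.allclose(manual_minmax, sk_minmax))
print(np.allclose(manual_standard, sk_standard))
```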

Linear and logistic regression, nearest neighbours, and neural networks all benefit from feature scaling. Neural networks often work better with normalisation, but it is worth experimenting with both methods to see which gives better training speed or model performance (accuracy or error) for your data.
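
One way to experiment with both methods is to build the column transformer from a helper that accepts the scaler as an argument. The helper name build_transformer and the toy rows below are hypothetical; this is a sketch of the idea rather than the code used later in this post.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical toy rows with the insurance column names
toy = pd.DataFrame({
    "age": [19, 33, 45, 62],
    "bmi": [27.9, 22.7, 30.4, 26.3],
    "smoker": ["yes", "no", "no", "yes"],
})

def build_transformer(scaler):
    """Scale the numerical columns with `scaler` and one-hot encode smoker."""
    return make_column_transformer(
        (scaler, ["age", "bmi"]),
        (OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    )

# Same pipeline, two different scalers
normalised = build_transformer(MinMaxScaler()).fit_transform(toy)
standardised = build_transformer(StandardScaler()).fit_transform(toy)
print(normalised.shape, standardised.shape)
```

Both outputs have the same shape, so the same model can be trained on either one and the results compared.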

Transforming Features with MinMaxScaler and OneHotEncoder

Let’s transform features with MinMaxScaler and OneHotEncoder (scikit-learn) for our insurance charges dataset. Both methods are combined in a single preprocessing step with make_column_transformer(). We fit the column transformer on the training data only, then apply it to the testing data — this ordering prevents data leakage, where test-set statistics contaminate training. Read more in “How to Avoid Data Leakage When Performing Data Preparation” by Jason Brownlee.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split

# Create column transformer for feature preprocessing
column_transformer = make_column_transformer(
    (MinMaxScaler(), ["age", "bmi", "children"]),
    (OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"])
)

# Create X and y
X = insurance.drop("charges", axis=1)
y = insurance["charges"]

# Build our train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state=57)

# Fit the column transformer to our training data
column_transformer.fit(X_train)

# Transform training and test data with normalisation (MinMaxScaler) 
# and OneHotEncoder
X_train_normalised = column_transformer.transform(X_train)
X_test_normalised = column_transformer.transform(X_test)

# Normalised features example
X_train_normalised[0] 
array([0.02173913, 0.34217191, 0.        , 1.        , 0.        ,
       0.        , 1.        , 0.        , 0.        , 0.        ,
       1.        ])

Creating and Evaluating a Keras Neural Network on Preprocessed Data

We build a neural network to fit the normalised, preprocessed data. We use a Keras Sequential model with two hidden ReLU layers and a single linear output neuron for regression. Evaluating the model on the test set yields MAE=1697.4192.

# Build a neural network to fit the normalised data
model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.mae,
                 optimizer=tf.keras.optimizers.Adam(lr=0.1),
                 metrics=["mae"])

history = model.fit(X_train_normalised, y_train, epochs=100, verbose=0)

model.evaluate(X_test_normalised, y_test)
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adam.py:105: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(Adam, self).__init__(name, **kwargs)
9/9 [==============================] - 0s 2ms/step - loss: 1697.4192 - mae: 1697.4192
[1697.419189453125, 1697.419189453125]

Fixing “UserWarning: The lr argument is deprecated, use learning_rate instead”

This warning appears when compiling a Keras optimiser with the old lr= keyword. Cause: TensorFlow 2.3+ renamed the optimiser argument from lr to learning_rate. Fix it by renaming the keyword:

# Deprecated:
optimizer=tf.keras.optimizers.Adam(lr=0.1)

# Current:
optimizer=tf.keras.optimizers.Adam(learning_rate=0.1)

Conclusion: A One-Step Column Transformer Before Model Training

In this post, we downloaded the insurance charges dataset from GitHub and preprocessed it with the Pandas and scikit-learn libraries. scikit-learn let us combine one-hot encoding and feature scaling in a single column-transforming step performed before building the ML model. We then trained and evaluated a Keras neural network on the prepared data. A make_column_transformer pipeline fitted on training data only is the standard scikit-learn pattern for preprocessing mixed numerical and categorical features without data leakage.

Feature Preprocessing FAQ

What is the difference between normalisation and standardisation?

Normalisation rescales features to a fixed range (typically 0 to 1) using scikit-learn’s MinMaxScaler. Standardisation removes the mean and divides by the standard deviation using StandardScaler, producing zero-mean, unit-variance features. Neural networks often converge faster with normalisation, but test both for your data.

How do you one-hot encode categorical features in scikit-learn?

Use OneHotEncoder from sklearn.preprocessing, ideally inside a make_column_transformer that applies it only to categorical columns (here sex, smoker, region) while a MinMaxScaler handles numerical columns. Set handle_unknown="ignore" so categories unseen during fitting do not raise an error at transform time.
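
A small sketch of what handle_unknown="ignore" does in practice. The training categories below are a hypothetical subset chosen so that one region is unseen at transform time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two regions only (hypothetical training data)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["southwest"], ["southeast"]]))

# "northeast" was not seen during fit, so its row is encoded as all zeros
encoded = encoder.transform(np.array([["southeast"], ["northeast"]])).toarray()
print(encoded)
```

Without handle_unknown="ignore", the default setting raises an error for the unseen category.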

How do you prevent data leakage when scaling features?

Fit the MinMaxScaler / OneHotEncoder (or the make_column_transformer) on the training set only, then apply the fitted transformer to both train and test sets. Fitting on the full dataset before splitting leaks test-set statistics into training and inflates measured performance.
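
The fit-on-train-only rule can be seen on a tiny made-up example (the ages below are hypothetical). The scaler learns the minimum and maximum from the training values and reuses them for the test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages: the test set contains a value outside the training range
train_age = np.array([[20.0], [30.0], [40.0]])
test_age = np.array([[25.0], [60.0]])

# Fit on the training data only, then reuse the fitted scaler for the test data
scaler = MinMaxScaler().fit(train_age)
train_scaled = scaler.transform(train_age)
test_scaled = scaler.transform(test_age)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value can land outside the 0 to 1 range, as the age of 60 does here. That is expected: the scaler only knows the training minimum and maximum, which is exactly what keeps test-set statistics out of training.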

I am affiliated with and recommend the following fantastic books for learning Python and data wrangling and analysis skills.

Python for Data Analysis. Data Wrangling with Pandas, Numpy, and Jupyter

Get the ultimate guide for manipulating and analyzing datasets in Python, updated for Python 3.10 and pandas 1.4. This hands-on resource, written by pandas creator Wes McKinney, includes practical case studies to tackle a variety of data analysis problems. It's perfect for analysts new to Python and Python programmers venturing into data science. Data files and additional resources are available on GitHub.

  • Author - Wes McKinney
  • Paperback – Big Book
  • Publication date - 20 Sep. 2022
  • Number of pages - 579
  • Language - English
  • Publisher - O'Reilly Media
  • ISBN-13 - 978-1098104030

Python Data Analysis - Third Edition. Perform data collection, data processing, wrangling, visualization, and model building using Python

This book will help you learn how to use Python for data analysis. You’ll explore the steps and methods people use to analyze data and discover how to use modern Python tools to create efficient ways to work with data.

  • Authors - Avinash Navlani, Armando Fandango, Ivan Idris
  • Paperback
  • Publication date - 5 Feb. 2021
  • Number of pages - 478
  • Language - English
  • Publisher - Packt Publishing
  • ISBN-13 - 978-1789955248

Did you like this post? Please let me know if you have any comments or suggestions.


Bibliography

I am thankful to “TensorFlow Developer Certificate in 2022: Zero to Mastery” and to the following authors for the information used in preparing this post.

  1. “Overview of Pandas Data Types” by Chris Moffitt.
  2. “Data Exploration 101 with Pandas” by Günter Röhrich.
  3. “4 Tools to Speed Up Exploratory Data Analysis (EDA) in Python” by Abdishakur.
  4. “How to Avoid Data Leakage When Performing Data Preparation” by Jason Brownlee.

About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.

Citation
Elena Daehnhardt. (2022) 'Feature preprocessing', daehnhardt.com, 29 January 2022. Available at: https://daehnhardt.com/blog/2022/01/29/tensorflow-python-pandas-keras-one-hot-encoding-feature-preprocessing-kaggle-dataset/