What Is Data Preprocessing?
Data preprocessing is the task of cleaning and transforming raw data to make it suitable for analysis and modeling. Preprocessing steps include data cleaning, data normalization, and data transformation. The goal of data preprocessing is to improve both the accuracy and efficiency of downstream analysis and modeling.
Raw data often includes missing values and outliers, which can lead to erroneous conclusions during analysis. You can use MATLAB® to apply data preprocessing techniques such as filling missing data, removing outliers, and smoothing, enabling you to visualize attributes such as magnitude, frequency, and nature of periodicity.


Data smoothed with the smoothdata function. (See MATLAB code.)

Data Preprocessing Techniques
Data preprocessing techniques can be grouped into three main categories: data cleaning, data transformation, and structural operations. These steps can happen in any order and iteratively.
Data Cleaning
Data cleaning is the process of addressing anomalies in the data set using techniques such as filling missing data, removing outliers, and smoothing.

Time-series plot of a solar irradiance raw data set, including missing values.

Solar irradiance data preprocessed with the fillmissing function to fill in missing values. (See MATLAB code.)
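As a minimal sketch of filling missing values, the snippet below applies fillmissing with linear interpolation to a short vector. The irradiance numbers and variable names are hypothetical and are not the data set shown in the figure.

```matlab
% Hypothetical hourly irradiance readings (W/m^2); NaN marks lost samples.
t = 0:9;
irradiance = [0 120 260 NaN NaN 610 580 NaN 300 140];

% Fill each gap by linear interpolation between the neighboring valid samples.
% The second output flags which entries were filled.
[filled, wasMissing] = fillmissing(irradiance, 'linear', 'SamplePoints', t);
```

Linear interpolation is only one option here; other fill methods, such as using the previous valid value, may suit data with abrupt changes better.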
Data Transformation
Data transformation is the process of modifying a data set into a preferred format by using operations such as normalization and detrending.

Raw data, its trend, and its preprocessed version with trend bias eliminated using the detrend function. (See MATLAB code.)
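The idea behind trend removal can be sketched on a synthetic signal. The signal, the drift, and the variable names below are assumptions made for illustration; only the detrend call reflects the function named in the caption.

```matlab
% Synthetic example: a periodic signal riding on a linear drift (assumed data).
t = (0:0.1:10)';
trend = 0.5*t + 2;
raw = sin(2*pi*0.5*t) + trend;

% Subtract the best-fit straight line from the data.
detrended = detrend(raw);

% The removed trend can be recovered as the difference.
estimatedTrend = raw - detrended;
```

Because detrend fits the line by least squares, the detrended signal has zero mean, which makes the periodic component easier to inspect.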
Structural Operations
Structural operations are often used for combining, reorganizing, and categorizing data sets.
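As an illustration of combining, reorganizing, and categorizing, the following sketch joins two small tables, sorts the result, and bins a variable into labeled levels. The tables, variable names, and bin edges are hypothetical.

```matlab
% Hypothetical sensor readings and site metadata.
readings = table([1; 2; 1; 3], [410; 385; 520; 95], ...
    'VariableNames', {'SiteID', 'Irradiance'});
sites = table([1; 2; 3], ["Roof"; "Field"; "Garage"], ...
    'VariableNames', {'SiteID', 'Location'});

% Combine: attach site metadata to each reading through the shared key.
combined = innerjoin(readings, sites, 'Keys', 'SiteID');

% Reorganize: order rows from highest to lowest irradiance.
combined = sortrows(combined, 'Irradiance', 'descend');

% Categorize: bin irradiance into labeled levels (edges are assumed).
edges = [0 200 450 Inf];
combined.Level = discretize(combined.Irradiance, edges, ...
    'categorical', {'low', 'medium', 'high'});
```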
Data preprocessing steps differ depending on the type of data. The following examples show preprocessing methods for three data types.
| Time-Series Data | Tabular Data | Image Data |
|---|---|---|
| You can perform a variety of data preprocessing tasks, such as removing missing values, filtering, smoothing, and synchronizing timestamped data with different time steps. | When a table has messy data, you can use different data preprocessing techniques to clean the table by filling in or removing missing values and rearranging table rows and variables in a different order. | Data preprocessing is useful for applications involving images, including AI. You can preprocess your data by resizing or cropping the images, or even by increasing the amount of training data for deep learning models. |
| Preprocess and Explore Time-Stamped Data | Clean Messy and Missing Data in Tables | Preprocess Images for Deep Learning |
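For time-series data, synchronizing timestamped data with different time steps might look like the sketch below. The two timetables, their values, and the 5-minute target grid are assumptions chosen for illustration.

```matlab
% Two hypothetical sensors logged at different time steps.
t1 = datetime(2024,1,1,0,0,0) + minutes(0:10:60)';
t2 = datetime(2024,1,1,0,0,0) + minutes(0:15:60)';
tempTT = timetable((20:26)', 'RowTimes', t1, 'VariableNames', {'Temperature'});
humTT  = timetable([40; 42; 45; 43; 41], 'RowTimes', t2, 'VariableNames', {'Humidity'});

% Resample both onto a common 5-minute grid using linear interpolation.
syncTT = synchronize(tempTT, humTT, 'regular', 'linear', 'TimeStep', minutes(5));
```

After synchronization, both variables share one time vector, so they can be plotted or modeled together.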
Best Practices in Data Preprocessing
Data preprocessing is not a one-size-fits-all approach. It varies based on the characteristics of the data, the machine learning algorithm, and the problem to be solved. Following best practices can help you select appropriate data preprocessing techniques.
Data Preprocessing in Machine Learning Workflows
Data preprocessing is a crucial step in the machine learning pipeline, ensuring that the data set is clean, relevant, and ready for modeling. Properly preprocessed data can significantly improve the performance of machine learning models by providing them with accurate, relevant, and standardized input.
Once you have preprocessed your data, you may need to take a few more steps before creating and training a machine learning model. Feature engineering, which follows data preprocessing, is an iterative process of turning raw data into features to be used by machine learning.
Different data preprocessing techniques suit different types of machine learning algorithms. All of them aim to improve model accuracy, efficiency, and generalizability.
| Preprocessing Technique | Purpose | Applicable to Machine Learning Algorithms |
|---|---|---|
|  | Handle missing data, remove outliers, and correct errors | All types |
| Data Standardization and Normalization | Scale features to ensure uniformity and improve model performance | All types, especially support vector machines (SVMs) and neural networks |
|  | Convert categorical variables for use in algorithms | Neural networks, decision trees, forests |
|  | Adjust the scale of features for distance computation and convergence | SVMs, neural networks, k-nearest neighbor (KNN) |
|  | Reduce model complexity, improve interpretability, and model fit | Decision trees, forests, regression models |
|  | Focus on the most informative aspects by reducing variables | Clustering, PCA |
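Two of the techniques in the table, scaling features and converting categorical variables, can be sketched as follows. The feature matrix and category labels are hypothetical, and one-hot encoding is shown as just one possible way to convert categorical variables.

```matlab
% Hypothetical features: columns are irradiance (W/m^2) and temperature (degC).
X = [100 15; 400 22; 700 30; 250 18];

Xz = normalize(X);            % z-score: each column gets zero mean, unit std
Xr = normalize(X, 'range');   % min-max: each column is rescaled to [0, 1]

% One-hot encode a categorical variable (one possible encoding).
sky = categorical(["clear"; "cloudy"; "clear"; "overcast"]);
cats = categories(sky);                           % clear, cloudy, overcast
oneHot = double(double(sky) == 1:numel(cats));    % one column per category
```

Scaling matters most for algorithms that rely on distances or gradient-based convergence, which is why the table singles out SVMs, neural networks, and KNN.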
Data Preprocessing with MATLAB
Choosing the right data preprocessing approach is not always obvious. MATLAB provides both interactive capabilities (apps and Live Editor tasks) and high-level functions that make it easy to try different methods and determine which is right for your data. Iterating through different configurations and selecting the optimal settings will help you prepare your data for further analysis.
Interactive Capabilities
The Data Cleaner app enables you to preprocess time-series data without writing code. You can import your data and then clean it, fill in missing data, and remove outliers. You can then save your modified data to the MATLAB workspace for further analysis. You can also automatically generate MATLAB code to document your steps and reproduce them later.
Live Editor tasks are simple point-and-click interfaces that you can add directly to your script to perform a specific set of operations. These tasks can be configured interactively to iterate through different settings and identify the optimal configuration for your application. As with the Data Cleaner app, you can also automatically generate MATLAB code to reproduce your work.
You can interactively preprocess data using a sequence of Live Editor tasks such as Clean Missing Data, Clean Outlier Data, and Normalize Data by visualizing the data at each step.

Data Preprocessing toolbar in MATLAB with a collection of live tasks.
Clean Outlier Data Live Editor task detecting outliers using median thresholding and filling them using linear interpolation. (See MATLAB code.)
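A programmatic counterpart to the task described in the caption might look like the sketch below. It assumes the task's median thresholding corresponds to the 'median' detection method of filloutliers with its default threshold; the data values are made up.

```matlab
% Hypothetical signal with two obvious spikes.
x = [10 11 10 12 95 11 10 9 -60 11 10];

% Detect outliers with the median method (points far from the median,
% measured in scaled median absolute deviations) and replace each one
% by linear interpolation between its neighbors.
[cleaned, isOut] = filloutliers(x, 'linear', 'median');
```

Filling rather than removing outliers keeps the vector length unchanged, which is convenient when the samples must stay aligned with a time vector.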
Using MATLAB Functions
MATLAB provides thousands of high-level, built-in functions for common mathematical, scientific, and engineering calculations, including data preprocessing.
You can start exploring your raw data set by visualizing it in MATLAB. For example, a data set of solar irradiance received on a typical day includes missing values and outliers. Harsh weather conditions could interfere with wireless telemetry transmission, resulting in a raw data set with imperfections.
Five common data preprocessing techniques can be applied to this raw solar irradiance data set using MATLAB.
| Data Preprocessing Technique | MATLAB Plot |
|---|---|
| Addressing Outliers: Anomalies in the telemetry data show up as outliers, which are then removed. |  |
| Filling Missing Data: Loss of communication results in missing data in telemetry, which is then filled in. |  |
| Smoothing Data: Noise in the solar irradiance data is removed by smoothing. |  |
| Normalize Data |  |
| Grouping |  |
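The five techniques can be chained as in the sketch below. The data is a synthetic stand-in for daytime irradiance telemetry, not the article's data set, and the specific function choices (rmoutliers, fillmissing, smoothdata, normalize, groupsummary), window length, and period labels are assumptions made for illustration.

```matlab
% Synthetic stand-in for daytime irradiance telemetry (assumed data).
rng(0)                                            % reproducible noise
hours = (6:0.25:18)';                             % 49 samples, 15-minute steps
irr = 800*sin(pi*(hours - 6)/12) + 15*randn(size(hours));
irr([10 30]) = [3000 -1500];                      % injected outliers
irr(20:23) = NaN;                                 % injected telemetry dropout

% 1) Address outliers: drop flagged samples and their time stamps.
[irrClean, wasRemoved] = rmoutliers(irr);
hoursClean = hours(~wasRemoved);

% 2) Fill missing data by linear interpolation over the time stamps.
irrFilled = fillmissing(irrClean, 'linear', 'SamplePoints', hoursClean);

% 3) Smooth with a 5-sample moving average.
irrSmooth = smoothdata(irrFilled, 'movmean', 5);

% 4) Normalize to the range [0, 1].
irrNorm = normalize(irrSmooth, 'range');

% 5) Group by period of the day and compute the mean irradiance per group.
Period = discretize(hoursClean, [6 10 14 18], 'categorical', ...
    {'morning', 'midday', 'afternoon'});
T = table(Period, irrSmooth, 'VariableNames', {'Period', 'Irradiance'});
G = groupsummary(T, 'Period', 'mean', 'Irradiance');
```

The order shown is one reasonable choice: removing outliers before filling and smoothing keeps the spikes from leaking into interpolated or averaged values.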
Data can be messy, but data preprocessing techniques can help improve data quality and prepare your data for further analysis. See the resources below for more information.