Data Preprocessing in Machine Learning



Data preprocessing is a technique used to convert raw data into a clean data set.
In machine learning, data preprocessing is a vital, fundamental step that structures the data so that a model can fit it well and produce accurate results.

The data, whether big or small, may be incomplete, unstructured, or have missing attributes, and it may arrive in different formats. We have to process the data into a specific format so that any machine learning or deep learning algorithm can be applied to it.

Data preprocessing is divided into four stages:

  • Data cleaning
  • Data integration
  • Data reduction
  • Data transformation


Data cleaning refers to techniques that clean the data by removing outliers, smoothing noisy data, replacing missing values, and correcting inconsistent data.

1) Missing data: Datasets often have missing values, which are introduced during data collection or data validation.

Some of the common reasons are:

  • Data can be lost while transferring the database.
  • Some fields were not filled in by the user.
  • There could be errors in processing the dataset.

Some approaches to deal with missing data are:

  • Eliminating rows with missing data.
  • Filling approximate values in the missing places.
  • Using a standard value to replace the missing value.
  • Using algorithms like regression and decision trees to predict and replace the missing values.
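The simpler approaches above can be sketched with pandas. This is a minimal illustration on a hypothetical dataset; the column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical dataset with missing values (NaN)
df = pd.DataFrame({
    "age": [25, None, 35, 40],
    "salary": [50000, 60000, None, 80000],
})

# Approach 1: eliminate rows that contain any missing value
dropped = df.dropna()

# Approach 2: fill missing places with an approximate value (the column median)
filled = df.fillna(df.median())

# Approach 3: replace missing values with a fixed standard value
constant = df.fillna(0)
```

Dropping rows is safest when few values are missing; imputing with a median or a standard value preserves the sample size at the cost of some bias.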

2) Duplicate data: These are data points that are repeated in the dataset and contribute no new information.

Duplicate data mostly arises during data collection, in scenarios where:

  • The user combines data sets from multiple sources.
  • The user scrapes data from the web.
  • The user receives data from other clients.
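Detecting and removing such repeats is straightforward in pandas; the rows below are a hypothetical example of data combined from two sources.

```python
import pandas as pd

# Hypothetical data combined from multiple sources, with one repeated row
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": ["a", "b", "b", "c"],
})

# Count exact duplicate rows, then remove them
num_duplicates = df.duplicated().sum()
deduped = df.drop_duplicates()
```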

3) Inconsistent data: Sometimes, data may be incorrectly recorded in the dataset. It is advisable to assess the data and check whether its features are recorded consistently across all data objects.

4) Outliers in the data: These are values that deviate significantly from the other observations and can degrade the performance of many models. Outliers can also appear when comparing relationships between two sets of data.

Outliers can appear in the data for reasons such as:

  • Data corruption
  • Input errors when data is entered manually
  • Faulty measurements

To deal with these anomalous values, data smoothing techniques such as binning, regression, outlier analysis, and specifying absolute bounds on the data can be applied.
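One common way to specify bounds is the interquartile-range (IQR) rule. The sketch below, on a made-up series with one anomalous value, flags points far outside the middle 50% of the data and then clips them to the computed bounds as a simple smoothing step.

```python
import pandas as pd

# Hypothetical numeric column with one anomalous value
s = pd.Series([10, 12, 11, 13, 12, 300])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]

# One smoothing option: clip values to the computed bounds
clipped = s.clip(lower, upper)
```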


Since data is often collected from multiple sources, data integration is a vital part of the process. Combining sources can introduce redundant and inconsistent data, which in turn hurts accuracy.

To deal with these issues and maintain data integrity, approaches such as tuple duplication detection and data conflict detection can be used.
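A minimal sketch of integration followed by tuple duplication detection, assuming two hypothetical sources that share one overlapping record:

```python
import pandas as pd

# Hypothetical records of the same entities from two sources
source_a = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Delhi"]})
source_b = pd.DataFrame({"id": [2, 3], "city": ["Delhi", "Mumbai"]})

# Integrate the sources, then detect and drop the duplicated tuple
combined = pd.concat([source_a, source_b], ignore_index=True)
integrated = combined.drop_duplicates()
```

Real conflict detection (the same `id` with *different* values in each source) needs extra logic, such as grouping by key and checking for disagreements.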


Most datasets have a large number of features. As dimensionality increases, the volume of space occupied by the data grows, making it harder to model and visualize.

A few major benefits of dimensionality reduction are:

  •  Data Analysis algorithms work better if the dimensionality of the dataset is lower. 
  •  The models which are built on top of lower-dimensional data are more understandable and explainable.
  •  The data also becomes easier to visualize.

A few methods to reduce the volume of data are:

  • Principal component analysis (PCA), a statistical method that reduces the number of attributes by combining highly correlated attributes into new components.
  • Singular value decomposition (SVD), a factorization of a real or complex matrix that underlies many dimensionality reduction techniques.
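PCA can in fact be computed with SVD, which ties the two methods above together. The sketch below uses NumPy on a small made-up dataset whose two features are highly correlated, so a single component captures almost all of the variance.

```python
import numpy as np

# Hypothetical 2-D data where the two features are highly correlated
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]])

# PCA via SVD: centre the data, factorize, keep the top component
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first principal component (2 dimensions -> 1)
X_reduced = X_centered @ Vt[0]

# Fraction of total variance explained by the kept component
explained = S[0] ** 2 / (S ** 2).sum()
```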

The final step of data preprocessing is transforming the data into a form appropriate for data modeling. The data available may not be in the right format or may require transformations to make it more useful. Getting the data into a proper format helps the applied model achieve better results in machine learning projects.

Data Transformation activities and techniques include:

  • Categorical encoding.
  • Scaling.
  • Normalization.
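Two of these techniques can be sketched with pandas on a hypothetical two-column frame: one-hot encoding for the categorical column, and min-max scaling to bring the numeric column into the [0, 1] range.

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "red"],  # categorical feature
    "height": [150.0, 180.0, 165.0],   # numeric feature
})

# Categorical encoding: one-hot encode the categorical column
encoded = pd.get_dummies(df, columns=["colour"])

# Scaling: min-max rescale the numeric column to the [0, 1] range
h = encoded["height"]
encoded["height"] = (h - h.min()) / (h.max() - h.min())
```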

Implementing Data Preprocessing in Machine Learning

  • 1. Getting the dataset
  • 2. Importing libraries
  • 3. Importing the dataset
  • 4. Treating missing values
  • 5. Feature scaling
  • 6. Dimensionality reduction
  • 7. Splitting the dataset
  • Getting the dataset – Datasets are the major component of machine learning. They can be collected manually, or large datasets can be obtained from websites such as Kaggle.
  • Importing libraries – Python is the most preferred language for machine learning because of its wide range of libraries. We can import the libraries required for data processing, such as pandas, NumPy, and scikit-learn.
  • Importing the dataset – The dataset can be imported into the code; for example, a CSV file can be loaded using the “read_csv()” function of the pandas library.
  • Treating missing values – To handle missing values, we can delete rows containing them or replace them with a value such as the median of the remaining values.
  • Feature scaling – A method to standardize the independent variables of a dataset within a specific range.
  • Splitting the dataset – The dataset is split into two separate sets: a training set and a test set. The training set is the subset of the dataset used to train the machine learning model; the test set is the subset used to test it.
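The split can be done with scikit-learn's `train_test_split`, but a minimal sketch needs only NumPy: shuffle the row indices, then hold out a fraction as the test set. The dataset below is randomly generated for illustration, and the 80/20 ratio is an assumption, not a rule.

```python
import numpy as np

# Hypothetical dataset of 10 samples with 2 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Shuffle the row indices, then hold out 20% as the test set
indices = rng.permutation(len(X))
split = int(len(X) * 0.8)
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
```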

