Data Preprocessing
In any machine learning process, data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.
Steps for data preprocessing
- Step 1: Import the libraries
- Step 2: Import the dataset
- Step 3: Check out the missing values
- Step 4: See the categorical values
- Step 5: Feature scaling
Import the libraries
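The original does not list the specific libraries, but a typical preprocessing stack for the steps below looks like this (the exact set is an assumption):

```python
# Common data-preprocessing stack (assumed; your project may use a subset)
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
```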
Import the data-set
For this tutorial, I imported the dataset from Kaggle.
Now we import our dataset.
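A sketch of loading the data with pandas. The file name and columns are placeholders, since the original does not name the Kaggle dataset; here a tiny stand-in CSV is written first so the snippet runs on its own:

```python
import pandas as pd

# Stand-in CSV so the example is self-contained; in practice,
# point read_csv at the file downloaded from Kaggle.
with open("dataset.csv", "w") as f:
    f.write("Name,Age,Salary\nAlice,25,50000\nBob,,60000\nCara,35,\n")

df = pd.read_csv("dataset.csv")  # the dataset is now a DataFrame
```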
Next we print the first five rows of our dataset, and we print its shape. The shape of a dataset is the total number of rows and columns it contains.
Then we describe our dataset, which gives us information such as the count, mean, min, 25%, 50%, 75%, and max values of each numeric column.
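These three inspection steps can be sketched as follows, using a toy frame standing in for the Kaggle data (the columns are assumptions):

```python
import pandas as pd

# Toy data standing in for the real dataset
df = pd.DataFrame({"Age": [25, 30, 35, 40],
                   "Salary": [50000, 60000, 70000, 80000]})

print(df.head())      # first five rows (here only four exist)
print(df.shape)       # (4, 2): 4 rows, 2 columns
print(df.describe())  # count, mean, std, min, 25%, 50%, 75%, max per numeric column
```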
Check out the missing values
Our dataset doesn’t have any null values, but in this tutorial we talk about how to handle null values and the different techniques for doing so.
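A sketch of detecting and handling nulls, shown on made-up data since the tutorial's own dataset has none. Dropping rows and mean imputation are two common techniques; which one fits depends on how much data you can afford to lose:

```python
import numpy as np
import pandas as pd

# Made-up data containing two missing values
df = pd.DataFrame({"Age": [25, np.nan, 35],
                   "Salary": [50000, 60000, np.nan]})

print(df.isnull().sum())  # count of missing values per column

# Technique 1: drop any row that contains a null
dropped = df.dropna()

# Technique 2: impute nulls with the column mean
filled = df.fillna(df.mean(numeric_only=True))
```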
See the Categorical Values
Machine learning models are based on mathematical equations, and you can intuitively see that keeping categorical data in those equations would cause problems, because we only want numbers in the equations.
So, we have to convert our categorical values into numerical ones. Label encoding is the technique I use here; it helps us transform categorical data into numerical data.
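A minimal sketch of label encoding with scikit-learn's `LabelEncoder` (the example column is made up):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]  # hypothetical categorical column

le = LabelEncoder()
encoded = le.fit_transform(colors)

# Classes are sorted alphabetically: blue -> 0, green -> 1, red -> 2
print(encoded)      # [2 1 0 1]
print(le.classes_)  # ['blue' 'green' 'red']
```

Note that the integer codes imply an ordering (blue < green < red) that the categories may not actually have, which is exactly why one-hot encoding is often preferred.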
There is one more technique, called one-hot encoding. Here we create dummy variables. A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
First, we apply dummy variables to the Name column.
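A sketch using `pd.get_dummies`, which creates one 0/1 column per category; the "Name" column and its values are assumptions standing in for the tutorial's dataset:

```python
import pandas as pd

# Hypothetical frame with a categorical "Name" column
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Salary": [50000, 60000, 70000]})

# One dummy column per distinct name: Name_Alice, Name_Bob
dummies = pd.get_dummies(df, columns=["Name"])
print(dummies)
```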
Feature Scaling
Feature scaling is a method of limiting the range of variables so that they can be compared on common ground.
In this tutorial we will learn about the standard scaler and the min-max scaler.
Standard Scaler:
The standard scaler assumes your data is normally distributed within each feature and will scale it such that the distribution is centered around 0 with a standard deviation of 1.
The mean and standard deviation are calculated for each feature, and the feature is then scaled as z = (x − μ) / σ, where μ is the feature's mean and σ its standard deviation.
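A sketch of both scalers from scikit-learn on a made-up single-feature array; the standard scaler centers the feature at 0 with unit standard deviation, while the min-max scaler squeezes it into [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up single feature, shaped (n_samples, n_features)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

std = StandardScaler().fit_transform(X)  # (x - mean) / std
mm = MinMaxScaler().fit_transform(X)     # (x - min) / (max - min)

print(std.mean(), std.std())  # ~0.0 and 1.0
print(mm.min(), mm.max())     # 0.0 and 1.0
```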