Data Preprocessing
In any machine learning process, data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.
Steps for data preprocessing
- Step 1: Import the libraries
- Step 2: Import the dataset
- Step 3: Check out the missing values
- Step 4: See the categorical values
- Step 5: Feature scaling
Import the libraries
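The original does not list the specific libraries, but a typical preprocessing stack for the steps below looks like this (the exact set is an assumption):

```python
# Common data-preprocessing stack (assumed; your project may use a subset)
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
```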
Import the data-set
For this tutorial, I imported the dataset from Kaggle.
Now we import our dataset.
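A sketch of loading the data with pandas. The file name and columns are placeholders, since the original does not name the Kaggle dataset; here a tiny stand-in CSV is written first so the snippet runs on its own:

```python
import pandas as pd

# Stand-in CSV so the example is self-contained; in practice,
# point read_csv at the file downloaded from Kaggle.
with open("dataset.csv", "w") as f:
    f.write("Name,Age,Salary\nAlice,25,50000\nBob,,60000\nCara,35,\n")

df = pd.read_csv("dataset.csv")  # the dataset is now a DataFrame
```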
Next we print the first five rows of our dataset, and we print its shape. The shape of a dataset is the total number of rows and columns it contains.
Then we describe our dataset, which gives us information such as the count, mean, min, 25%, 50%, 75%, and max values of each numeric column.
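These three inspection steps can be sketched as follows, using a toy frame standing in for the Kaggle data (the columns are assumptions):

```python
import pandas as pd

# Toy data standing in for the real dataset
df = pd.DataFrame({"Age": [25, 30, 35, 40],
                   "Salary": [50000, 60000, 70000, 80000]})

print(df.head())      # first five rows (here only four exist)
print(df.shape)       # (4, 2): 4 rows, 2 columns
print(df.describe())  # count, mean, std, min, 25%, 50%, 75%, max per numeric column
```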
Check out the missing values
Our dataset doesn’t have any null values, but in this tutorial we talk about how to handle null values and the different techniques for doing so.
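A sketch of detecting and handling nulls, shown on made-up data since the tutorial's own dataset has none. Dropping rows and mean imputation are two common techniques; which one fits depends on how much data you can afford to lose:

```python
import numpy as np
import pandas as pd

# Made-up data containing two missing values
df = pd.DataFrame({"Age": [25, np.nan, 35],
                   "Salary": [50000, 60000, np.nan]})

print(df.isnull().sum())  # count of missing values per column

# Technique 1: drop any row that contains a null
dropped = df.dropna()

# Technique 2: impute nulls with the column mean
filled = df.fillna(df.mean(numeric_only=True))
```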
See the Categorical Values
Machine learning models are based on mathematical equations, and you can intuitively see that keeping categorical data in those equations would cause problems, because we only want numbers in the equations.
So, we have to convert our categorical values into numerical ones. Label encoding is the technique I use here; it helps us transform categorical data into numerical data.
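A minimal sketch of label encoding with scikit-learn's `LabelEncoder` (the example column is made up):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]  # hypothetical categorical column

le = LabelEncoder()
encoded = le.fit_transform(colors)

# Classes are sorted alphabetically: blue -> 0, green -> 1, red -> 2
print(encoded)      # [2 1 0 1]
print(le.classes_)  # ['blue' 'green' 'red']
```

Note that the integer codes imply an ordering (blue < green < red) that the categories may not actually have, which is exactly why one-hot encoding is often preferred.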
There is one more technique, called one-hot encoding. Here we create dummy variables. A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
First, we apply dummy variables to the Name column.
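A sketch using `pd.get_dummies`, which creates one 0/1 column per category; the "Name" column and its values are assumptions standing in for the tutorial's dataset:

```python
import pandas as pd

# Hypothetical frame with a categorical "Name" column
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Salary": [50000, 60000, 70000]})

# One dummy column per distinct name: Name_Alice, Name_Bob
dummies = pd.get_dummies(df, columns=["Name"])
print(dummies)
```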
Feature Scaling
Feature scaling is a method of limiting the range of variables so that they can be compared on common ground.
In this tutorial we will learn about the standard scaler and the min-max scaler.
Standard Scaler:
The standard scaler assumes your data is normally distributed within each feature and will scale it such that the distribution is centered around 0 with a standard deviation of 1.
The mean and standard deviation are calculated for each feature, and the feature is then scaled as z = (x − μ) / σ, where μ is the feature's mean and σ its standard deviation.
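A sketch of both scalers from scikit-learn on a made-up single-feature array; the standard scaler centers the feature at 0 with unit standard deviation, while the min-max scaler squeezes it into [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up single feature, shaped (n_samples, n_features)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

std = StandardScaler().fit_transform(X)  # (x - mean) / std
mm = MinMaxScaler().fit_transform(X)     # (x - min) / (max - min)

print(std.mean(), std.std())  # ~0.0 and 1.0
print(mm.min(), mm.max())     # 0.0 and 1.0
```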