Data Preprocessing in sklearn

Table of Contents

  1. What is pre-processing?
  2. Scaling
  3. Normalization
  4. Categorical encoding
  5. One-hot encoding
  6. Handling missing values
  7. Video Tutorial (2 Videos)

1 What is pre-processing?

  1. It is a well-known fact that a data scientist often spends about 80% of their time on data pre-processing rather than on modeling; it is one of the most important steps in the data science life cycle. Data from real-life scenarios contains a lot of noise: features may have different scales, anomalies, categorical values, and NaN values. So it is very important to handle all these cases and make sure the data is clean and ready to feed to the model.
  2. Pre-processing can improve the performance of the model considerably, so the quality of the model depends heavily on how well the pre-processing steps parse and clean the data.
  3. scikit-learn makes this process easy: we just import the specific class we need from its preprocessing module.

2 Scaling the values

  1. If feature1 in a dataset ranges from 0 to 1 while feature2 ranges from 0 to 1000, feeding that data to the model as-is can distort its performance.
  2. So we need to scale all features in the data to a comparable range. Let's check how we can do that.

a. Standardization

Standardization is one of the most used scaling techniques in machine learning. After standardization, each feature is transformed such that its mean is centered at zero and its standard deviation is one.

Figure-1

Magical line in sklearn to implement Standardization 

sklearn.preprocessing.scale(input)

Figure-2
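
Here is a minimal runnable sketch of that line in action; the small feature matrix is made-up example data:

import numpy as np
from sklearn import preprocessing

# A toy feature matrix with two features on very different scales (made-up data).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# scale() applies z = (x - mean) / std to each column.
X_scaled = preprocessing.scale(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]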

b. Min-Max scaling

After Min-Max scaling, each feature is transformed such that all its values fall in the range 0 to 1.

Figure-3

Magical line in sklearn to implement MinMaxScaler

scaler = sklearn.preprocessing.MinMaxScaler()

scaler.fit_transform(input)

Figure-4
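
A minimal runnable sketch of those two lines, using the same made-up matrix as above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The same toy feature matrix (made-up data).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
X_minmax = scaler.fit_transform(X)

print(X_minmax)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]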

3 Normalization

  1. Rescaling each sample (row) so that it becomes a unit vector is known as normalization.
  2. Normalization is a very important concept, since the data fed into the model is usually in the form of an array or vector.

Magical line in sklearn to implement normalization:

sklearn.preprocessing.normalize(input)

Figure-5
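
A minimal runnable sketch of that line; the two-row matrix is made-up example data:

import numpy as np
from sklearn.preprocessing import normalize

# Toy data (made up); normalize() works row-wise with the L2 norm by default.
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

X_norm = normalize(X)

print(X_norm)
# [[0.6 0.8]
#  [1.  0. ]]
print(np.linalg.norm(X_norm, axis=1))  # [1. 1.] -> every row is a unit vector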

4 Categorical encoding

  1. If the features in a dataset have discrete values, they are considered categorical features. Example: months, weeks, countries.
  2. Similarly, if the features in a dataset have continuous values, they are considered numerical features. Example: height, prices.
  3. Most of the time these categorical features come in the form of strings.
  4. A machine learning model cannot take a string as input, so we need to convert those strings to numerical values; for that we use OrdinalEncoder from sklearn.
Figure-6

The ordinal encoder assigns a numerical value such as 0, 1, 2, ... to each distinct string. In the above example each feature has two distinct strings, so the ordinal encoder labels male as 1 and female as 0, and does the same for the other features. Using these fitted values we can then transform new data.
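
Here is a minimal runnable sketch of that idea; the two string columns are hypothetical stand-ins for the data in Figure-6:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data with two string features (made-up stand-in for Figure-6).
X = np.array([["male", "yes"],
              ["female", "no"],
              ["female", "yes"]])

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X)

# Categories are sorted alphabetically, so female -> 0 and male -> 1.
print(X_encoded)
# [[1. 1.]
#  [0. 0.]
#  [0. 1.]]

# The fitted encoder transforms new data consistently.
print(encoder.transform([["male", "no"]]))  # [[1. 0.]]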

5 One-hot encoding

  1. With ordinal encoding, each category is mapped to an integer such as 0 and 1, or 0, 1, 2, 3.
  2. The model cannot understand these integers as a categorical variable; it assumes the features are numerical values, which may affect the performance of the model.
  3. To handle this scenario we need a special type of encoding called one-hot encoding.
Figure-7
  1. We can perform one-hot encoding in scikit-learn by using DictVectorizer from the feature_extraction module, as shown in the sketch after this list.
  2. Consider a list of dictionaries containing the 3 major cities in our dataset, that is London, Paris, and New York; we need to convert them to numeric values without implying any ordering between them.
  3. We fit-transform that data and convert it into an array format: with three values, London gets represented as 1 0 0, Paris as 0 0 1, and New York as 0 1 0.
  4. One-hot encoding thus returns three features, a combination of which gives us the categorical value; once that is done, the model can treat it as a categorical value rather than as an ordered number.
  5. The only drawback of one-hot encoding is that it increases the size of the data.
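
A minimal runnable sketch of that city example using DictVectorizer (the list of dictionaries is made-up data):

from sklearn.feature_extraction import DictVectorizer

# A toy list of dictionaries with one categorical feature (made-up data).
data = [{"city": "London"}, {"city": "Paris"}, {"city": "New York"}]

vec = DictVectorizer(sparse=False)  # dense output so we can print it
encoded = vec.fit_transform(data)

# Columns are sorted by feature name: city=London, city=New York, city=Paris.
print(encoded)
# [[1. 0. 0.]   London
#  [0. 0. 1.]   Paris
#  [0. 1. 0.]]  New York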

6 Handling the missing values 

  1. One of the major issues with data from real-life scenarios is handling the missing values.
  2. For example, if you consider any historical data or any IoT data, there is a real chance of errors or missing values whenever human interaction is involved.
  3. The model cannot take an empty value, so you have to replace the missing data with the most suitable, least error-prone value.
  4. While handling missing values, one should be very careful about selecting the strategy, since a poor choice can introduce different kinds of biases and errors. The main objective is to deal with these unusable values during processing so that our model works properly.
Figure-8
  1. Let's take an example of a data frame df with NaN (not a number) values, as in the sketch after this list.
  2. Null values in your dataset cannot be directly fed to the model, so we need to impute them.
  3. For that, we use: from sklearn.impute import SimpleImputer.
  4. You need to specify which placeholder represents the missing values in your data.
  5. A missing value cannot always be interpreted as NaN: in our specific case we have NaN, but a missing value can also be represented as zero, as a very high number (for example, 999), or by a placeholder symbol.
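
A minimal runnable sketch of that imputation step; the small data frame is made-up example data:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A toy data frame with NaN values (made-up data).
df = pd.DataFrame({"age": [25.0, np.nan, 35.0],
                   "salary": [50000.0, 60000.0, np.nan]})

# Replace each NaN with the mean of its column; other strategies
# include 'median', 'most_frequent', and 'constant'.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(df)

print(imputed)  # the NaNs become 30 (mean age) and 55000 (mean salary)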

7 Video Tutorial

Video Tutorial – 1

Video Tutorial – 2
