Basics of Data Preparation Using Keras

Table of Contents

  1. Keras built-in datasets
  2. Dataset from an external source
  3. Video Tutorial

1. Keras built-in datasets:

  1. Datasets are essential for deep learning models. When you are new to Keras and deep learning, you certainly do not want your models to fail simply because of the data.
  2. That is why the Keras deep learning framework ships with a good number of standard datasets.
  3. Keras offers four datasets for image classification: CIFAR10, CIFAR100, MNIST, and Fashion MNIST.
  4. It also provides two datasets for text classification and sentiment analysis (IMDB movie reviews and Reuters newswire) and, finally, the Boston housing price dataset for regression analysis.

a) CIFAR10 and CIFAR100:

CIFAR10 is used for small-image classification. It contains a total of 60,000 RGB images across 10 labels, so around 6,000 images per class; each image is 32 by 32 pixels.

  1. They are split into a training set of 50,000 images and a test set of 10,000 images. The 10 categories are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
  2. When we run this for the very first time, the dataset is downloaded to local storage under ~/.keras/datasets; subsequent runs load it from there.
  3. The load_data function then directly assigns the data to the train and test sets; we can check the shape of the data using the .shape attribute.
  4. Similarly, the CIFAR100 dataset also has 60,000 samples, but this time across 100 non-overlapping classes; instead of 6,000 samples per class, it contains 600. For the rest, the structure is similar to CIFAR10.
  5. X_train of CIFAR100 contains 50,000 samples and X_test contains 10,000.
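The loading steps above can be sketched as follows (assuming TensorFlow's bundled Keras):

```python
from tensorflow.keras.datasets import cifar10, cifar100

# Downloads to ~/.keras/datasets on the first run, then loads from the cache.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3)
print(x_test.shape)   # (10000, 32, 32, 3)

# CIFAR100: same structure, but 100 classes with 600 images each.
(x_train100, y_train100), (x_test100, y_test100) = cifar100.load_data()
print(x_train100.shape)  # (50000, 32, 32, 3)
```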

b) MNIST dataset:

  1. It contains 60,000 training and 10,000 test images of handwritten digits, all 28 by 28 pixels. We can easily import it from the Keras datasets API; load_data will split the dataset and assign it to the training and test sets.
  2. This dataset is used as a benchmark in many studies and is, in fact, overused.
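A minimal sketch of loading MNIST from the Keras datasets API:

```python
from tensorflow.keras.datasets import mnist

# load_data splits the dataset into training and test sets.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)  # (60000, 28, 28)
print(x_test.shape)   # (10000, 28, 28)
```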

c) Fashion MNIST:

It is a drop-in replacement for MNIST. It also contains 60,000 training and 10,000 test images, but the samples are different in nature: they represent classes such as t-shirt, coat, bag, dress, etc.
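Because it is a drop-in replacement, only the import changes:

```python
from tensorflow.keras.datasets import fashion_mnist

# Same shapes as MNIST: 28x28 grayscale images, labels 0-9
# (t-shirt/top, trouser, pullover, dress, coat, sandal,
#  shirt, sneaker, bag, ankle boot).
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
print(x_train.shape)  # (60000, 28, 28)
```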


d) IMDB movie reviews dataset:

  1. It is used for sentiment classification. This dataset contains 25,000 movie reviews from IMDB, labeled by sentiment.
  2. We can import IMDB easily from the Keras datasets API; as in the previous examples, load_data will assign the data to the train and test sets.
  3. This time we may use some arguments to make proper use of the dataset. The first one is path: if we do not have the data locally, it will be downloaded to this path.
  4. num_words keeps only the top most frequent words; anything beyond this value will be encoded as an out-of-vocabulary token.
  5. maxlen sets the maximum length of a sequence, and seed sets the random seed for reproducible data shuffling.
  6. These arguments help us do this job easily.
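A sketch of load_data with the arguments discussed above (the values shown for path, num_words, and seed are the Keras defaults or illustrative choices, not requirements):

```python
from tensorflow.keras.datasets import imdb

# num_words=10000 keeps only the 10,000 most frequent words; rarer
# words are replaced by the out-of-vocabulary index. seed makes the
# shuffling of the data reproducible.
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    path='imdb.npz', num_words=10000, seed=113)

print(len(x_train), len(x_test))  # 25000 25000
print(y_train[:5])                # sentiment labels: 0 or 1
```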

e) Reuters:

It is another dataset for text classification. It is pre-processed in the same way as the IMDB dataset and can be used to classify texts into one of 46 topics; the arguments discussed for IMDB are also available here. It contains 8,982 samples in the training set and 2,246 samples in the test set.
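Loading Reuters works the same way as IMDB (num_words=10000 is again an illustrative choice):

```python
import numpy as np
from tensorflow.keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=10000)

print(len(x_train), len(x_test))   # 8982 2246
print(np.unique(y_train).size)     # 46 topics
```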


f) Boston housing price dataset:

  1. This is our final dataset, and it is used for regression analysis.
  2. It contains 506 observations of 13 variables. The code snippet is similar to the previous ones, except that we concatenate y_train and y_test into y.
  3. So finally we have 404 observations in our training set and 102 in our test set. For this specific case, the minimum house price is 5,000 dollars and the maximum is 50,000 dollars.
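The steps above can be sketched as follows (house prices in this dataset are recorded in thousands of dollars):

```python
import numpy as np
from tensorflow.keras.datasets import boston_housing

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()
print(x_train.shape, x_test.shape)  # (404, 13) (102, 13)

# Concatenate the targets to inspect the full price range.
y = np.concatenate([y_train, y_test])
print(y.min(), y.max())  # 5.0 50.0  (i.e. $5,000 to $50,000)
```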

2. Dataset from an external source

  1. Now we are going to import external data, process it, and make it available for deep learning analysis.
  2. First, we will read the data file, which is available in our working directory, and then convert it to arrays for our machine to process.
  3. In the next step, we split our dataset into input features and the target, and scale the input features so that they have similar orders of magnitude.
  4. Finally, we split the dataset into training, validation, and test sets.
  5. Here we import the dataset from our external source; this dataset, house price data.csv, is already available in our working directory.
  6. Using pandas, we import this dataset into a variable.
  7. Our house price dataset contains 1,460 rows and 11 columns. We then convert this data frame into arrays: we simply take the values from the data frame and store them in a new variable, data.
  8. We scale the input features so that they have similar orders of magnitude. For this we import preprocessing from scikit-learn; running it gives us the scaled version of our input features.
  9. Then we split the data into a training set and a combined validation-and-test set with a ratio of 0.3, so 0.7 of the data goes to the training set and 0.3 to the combined validation and test set.
  10. We then split that combined set into separate validation and test sets; this time the ratio is 0.5, so 50% of it goes to the test set and 50% to the validation set.

So finally, in the training set we have 1,022 observations of 10 variables, in the validation set we have 219 observations of 10 variables, and in the test set we have another 219 observations.
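The whole external-data pipeline can be sketched as follows. Since house price data.csv is specific to the tutorial, this sketch substitutes a synthetic data frame of the same shape (1,460 rows, 11 columns); the split into 10 feature columns plus 1 target column, and the use of MinMaxScaler, are assumptions based on the shapes reported above.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# In the tutorial this would be:
#   df = pd.read_csv('house price data.csv')
# Here we use a synthetic stand-in with the same shape: 1460 rows, 11 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1460, 11)),
                  columns=[f'col{i}' for i in range(11)])

# Convert the data frame to a plain NumPy array.
data = df.values

# Split into input features (first 10 columns, assumed) and target (last column).
X, Y = data[:, :10], data[:, 10]

# Scale the features to similar orders of magnitude.
X_scaled = preprocessing.MinMaxScaler().fit_transform(X)

# 70% training, 30% combined validation + test ...
X_train, X_val_test, Y_train, Y_val_test = train_test_split(
    X_scaled, Y, test_size=0.3, random_state=42)

# ... then split that 30% in half: 15% validation, 15% test.
X_val, X_test, Y_val, Y_test = train_test_split(
    X_val_test, Y_val_test, test_size=0.5, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
# (1022, 10) (219, 10) (219, 10)
```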

Video Tutorial
