Table of Contents
- Sequence Pre-processing
- Text Pre-processing
- Image Pre-processing
- Video Tutorial
1. Sequence Pre-processing
Sequence pre-processing is a basic type of pre-processing for variable-length sequence prediction problems. These problems require that our data be transformed so that every sequence has the same length. The techniques below are what we use to prepare variable-length sequence data for a sequence prediction problem in Keras.
a) pad_sequences function:
- The pad_sequences function in the Keras library can be used to pad variable-length sequences.
- By default it pads with the value 0, on either the left or the right side of each sequence.
Let's run this in a Jupyter Notebook:
- We have sequences of varying lengths, and our goal is to produce a set of sequences that all have the same length. First we use the pad_sequences function, which is available in the Keras preprocessing API.
- The function takes the variable-length sequences as input. In this example the longest sequence has length 6, so every sequence is converted to length 6, padded with 0.
- The zeros are placed before the numbers when we set padding='pre'. If maxlen is set to less than the length of the longest sequence,
- the function will also truncate numbers, from the left when we set truncating='pre'. Let's execute it: Keras generates an array of sequences, each of length 6, headed by zeros.
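The steps above can be sketched as follows (the sequence values are illustrative, not the ones from the notebook):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Three variable-length sequences (illustrative values)
seqs = [[1, 2, 3, 4, 5, 6],
        [1, 2, 3],
        [1]]

# Default: pad with 0 on the left ('pre') up to the longest length (6)
padded = pad_sequences(seqs)
print(padded)

# With maxlen shorter than the longest sequence, values are
# truncated from the left, since truncating='pre' is the default
truncated = pad_sequences(seqs, maxlen=3)
print(truncated)
```

Here `padded` becomes `[[1,2,3,4,5,6],[0,0,0,1,2,3],[0,0,0,0,0,1]]`, matching the behavior described above.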
b) skipgrams
The skip-gram model tries to predict the source context words given a target word. The skipgrams function generates skip-gram word pairs.
- We use the padded sequence as input. The vocabulary size is 6, and the window size is 1, which tells how far forward and back we can look; we can specify the ratio of negative samples as well. Let's execute it: down here we have a stream of 0s and 1s.
- A 1 means the pair actually occurred within the same window, and a 0 marks a negative sample, which pairs the target with a random word from the vocabulary.
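A minimal sketch with the same vocabulary size (6) and window size (1) as above; the input word indices are assumed:

```python
from tensorflow.keras.preprocessing.sequence import skipgrams

sequence = [1, 2, 3, 4, 5]  # word indices (illustrative)
pairs, labels = skipgrams(sequence, vocabulary_size=6, window_size=1)

for pair, label in zip(pairs, labels):
    # pair is [target, context]; label is 1 for a real pair,
    # 0 for a negative sample drawn randomly from the vocabulary
    print(pair, label)
```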
c) make_sampling_table:
It generates a word-rank-based probabilistic sampling table for use with skipgrams.
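A short sketch of the sampling table (the table size of 10 is an assumption for illustration):

```python
from tensorflow.keras.preprocessing.sequence import make_sampling_table

# One keep-probability per word rank (rank 0 = most frequent word);
# frequent words are sampled less often, as in word2vec-style subsampling
table = make_sampling_table(size=10)
print(table)
```

The resulting array can be passed to skipgrams via its sampling_table argument.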
d) TimeseriesGenerator:
It is used for generating batches of temporal data.
- LSTM models can be used for time series forecasting problems, and at the beginning of this type of analysis we need to pre-process our data. The generator returns sequences of instances.
- We have input data and target data of the same length. We pass both to the TimeseriesGenerator function, along with a length of 2, which assigns two input values to each time step. Let's execute it: since we chose length 2, the generator takes the first two inputs and, with batch_size 1, assigns them to one batch with target 17; in the second case, inputs two and three map to target 18.
- This pre-processing is very important for time series analysis. With that, we have finished sequence pre-processing.
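The generator described above can be sketched as follows; the input and target values are assumptions chosen so that the first two targets come out as 17 and 18, as in the walkthrough:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = np.arange(10)          # input series (assumed values)
targets = np.arange(15, 25)   # target series of the same length (assumed)

# length=2: two consecutive inputs per sample; batch_size=1: one sample per batch
gen = TimeseriesGenerator(data, targets, length=2, batch_size=1)

x0, y0 = gen[0]  # inputs [0, 1] -> target 17
x1, y1 = gen[1]  # inputs [1, 2] -> target 18
print(x0, y0)
print(x1, y1)
```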
2. Text Pre-processing
- Since we cannot feed raw text directly into a model, text data must be encoded as numbers before it can be used as input or output for models.
- The Keras deep learning library provides some basic functions to help us prepare our text data.
a) text_to_word_sequence:
- text_to_word_sequence splits text into a list of words. These words are called tokens, and the process of splitting text into tokens is called tokenization.
- By default, this function automatically does three things: splits the words by space, filters out punctuation, and converts the text to lowercase.
- We have an input text: "Keras is a high-level neural networks API Keras Keras".
- We have ten tokens in this text; the text_to_word_sequence function filters out the punctuation.
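A minimal sketch of the tokenization step (note that the default filters also split on the hyphen, so "high-level" becomes two tokens):

```python
from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "Keras is a high-level neural networks API Keras Keras"
tokens = text_to_word_sequence(text)
print(tokens)  # lowercased, punctuation removed, split on whitespace
```

This yields ten tokens, with "keras" appearing three times.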
b) one_hot function:
We can use this function to tokenize and integer-encode a text document in one step. It is a wrapper for the hashing_trick function, using Python's hash as the hashing function. As with the text_to_word_sequence function in the previous section, the one_hot function lowercases the text, filters out punctuation, and splits words based on whitespace. In addition, the vocabulary size must be specified, which defines the hashing space into which words are hashed; ideally this should be larger than the vocabulary by some percentage.
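A short sketch of one_hot; the vocabulary size of 50 is an assumption, chosen to be comfortably larger than the actual vocabulary:

```python
from tensorflow.keras.preprocessing.text import one_hot

text = "Keras is a high-level neural networks API"
vocab_size = 50  # hashing space, larger than the real vocabulary

# One integer per token, in the range [1, vocab_size)
encoded = one_hot(text, vocab_size)
print(encoded)
```

Because the encoding is hash-based, unrelated words can collide onto the same integer, which is why the hashing space should exceed the vocabulary size.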
c) Tokenizer API:
- Keras provides this more sophisticated API for preparing text: a Tokenizer can be fit once and reused to prepare multiple text documents, which may be the preferred approach for large projects. First we fit the tokenizer on our word corpus.
- After fitting, the tokenizer contains the corpus tokens. The Tokenizer provides four attributes that can be used to describe the documents.
- The first one is word_counts: it returns each token with the number of times it occurs. "Keras" appeared three times, so its count is 3.
- The next one is document_count, which is 10 here.
- word_index displays the words and their uniquely assigned integers.
- word_docs displays the words and the number of documents each appeared in.
- texts_to_sequences returns a sequence of numbers: each word is replaced by its uniquely assigned integer.
- texts_to_matrix constructs a matrix with one row per document, which in this example is 10, so there will be a total of ten rows. Word indexing starts at 1, and an extra column is added for index 0, which is reserved for unknown words.
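The Tokenizer workflow above can be sketched as follows; the three documents are illustrative, not the corpus from the notebook:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

docs = ["Keras is great",
        "Keras makes deep learning easy",
        "I really like Keras"]

tok = Tokenizer()
tok.fit_on_texts(docs)          # fit once, reuse on many documents

print(tok.word_counts["keras"])  # how often each token occurs (3 here)
print(tok.document_count)        # number of documents fit on (3 here)
print(tok.word_index)            # word -> unique integer; most frequent word gets 1
print(dict(tok.word_docs))       # word -> number of documents it appears in

seqs = tok.texts_to_sequences(docs)          # each word replaced by its integer
matrix = tok.texts_to_matrix(docs, mode="binary")
print(seqs)
print(matrix.shape)  # (num_docs, vocab_size + 1); column 0 is reserved
```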
3. Image Pre-processing
Here we are going to pre-process our image data using ImageDataGenerator, which is available in the keras.preprocessing.image API.
- We will use the MNIST dataset in this example. First we load the dataset and display a set of images; then we reshape our training and test data so that we can fit them to the image data generator. The expected data type is float32, so we convert the arrays to float32.
- Now we create the generator and fit our training data on it. ImageDataGenerator accepts many different augmentation arguments.
- Here we have used the two arguments horizontal_flip and vertical_flip, set to True, so the generator flips the images horizontally and vertically. Finally, a code snippet displays the flipped image.
- If we compare this to the original image of the digit four: first it is flipped horizontally, then vertically, giving the final flipped image.
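The pipeline above can be sketched as follows; to keep the sketch self-contained, random MNIST-shaped arrays stand in for the downloaded dataset:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in for reshaped MNIST training data: (samples, 28, 28, 1), float32
x_train = np.random.rand(8, 28, 28, 1).astype("float32")

# Randomly flip images horizontally and vertically, as in the text
datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
datagen.fit(x_train)  # only required for feature-wise options, but harmless

# Draw one augmented batch of four images
batch = next(datagen.flow(x_train, batch_size=4, shuffle=False))
print(batch.shape)
```

With real MNIST data, the images in `batch` could then be plotted and compared against the originals to see the flips.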