Table of Contents
- NLP with RNN using LSTM model
- Video Explanation
1. NLP with RNN using LSTM model
Here we discuss another natural language processing task using a recurrent neural network (RNN). We will use an LSTM model, which stands for Long Short-Term Memory. Before proceeding, note that this is a deep model and the program takes a long time to run, so we won't show the run itself here, but we will explain the code.
- First, to run the natural language processing program, we import pandas, numpy, torch, torch.nn, torch.nn.functional and torch.optim.
- For natural language processing work you also need to import nltk. NLTK is a very useful library, as it ships with modules and corpora for several languages, especially English.
- After importing nltk, we download treebank and universal_tagset. Treebank is a corpus package that nltk provides: a sample of the Penn Treebank containing English sentences in which every word has already been annotated with its part of speech.
- Universal_tagset is the mapping that converts the detailed corpus tags into a small set of universal part-of-speech tags such as noun, verb and adjective. The problem we are trying to solve is called part-of-speech tagging: we want to build a model that can tell which part of speech each word belongs to, by training it on a corpus of example sentences, and we can then check the model on further examples.
- In this case, we are not learning character by character as we did before; we are working word by word.
- Let's check how many sentences we have: as you can see in figure 1, we have 3914 tagged sentences.
The tagged sentences look like the ones shown in figure 2.
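The loading step described above can be sketched roughly as follows. The names `load_tagged_sentences` and `summarize` are hypothetical helpers, not taken from the original program. The loader is only defined here, not called, because it downloads data; the hand-written `sample` just illustrates the (word, tag) format.

```python
def load_tagged_sentences():
    """Download the two NLTK resources and return the tagged sentences."""
    import nltk  # imported here so the rest of the sketch runs without NLTK

    nltk.download("treebank", quiet=True)
    nltk.download("universal_tagset", quiet=True)
    from nltk.corpus import treebank

    # tagset="universal" maps the Penn Treebank tags to the universal tags
    return list(treebank.tagged_sents(tagset="universal"))


def summarize(tagged_sentences):
    """Return (number of sentences, number of tokens)."""
    n_sentences = len(tagged_sentences)
    n_tokens = sum(len(sentence) for sentence in tagged_sentences)
    return n_sentences, n_tokens


# Hand-written sample in the same format: a list of sentences,
# each of which is a list of (word, tag) pairs.
sample = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", ".")],
    [("She", "PRON"), ("sleeps", "VERB"), (".", ".")],
]
print(summarize(sample))
```

Calling `summarize(load_tagged_sentences())` on the real corpus should report the sentence count quoted above as its first value.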
- To help with training, we build a few small helper functions.
- The first helper is word to index, which converts the words into tensors of indices.
- Second, the characters are converted into tensors, and then the tags are converted into tensors, as you can see in figure 3.
Then we build the dictionaries that map each word, character and tag to an index, as shown in figure 4.
Figure 5 shows what the resulting word-to-index, character-to-index and tag-to-index mappings look like.
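A minimal sketch of these mappings, assuming the dictionaries are built in order of first appearance. The helper names `build_index` and `sequence_to_idx` and the two toy sentences are illustrative, not the original code.

```python
def build_index(items):
    """Map each distinct item to an integer, in order of first appearance."""
    index = {}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index


def sequence_to_idx(sequence, index):
    """Convert a sequence of items into a list of integer indices."""
    return [index[item] for item in sequence]


# Toy data in the (word, tag) format of the tagged corpus.
tagged_sentences = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

words = [word for sent in tagged_sentences for word, _ in sent]
tags = [tag for sent in tagged_sentences for _, tag in sent]
chars = [char for word in words for char in word]

word_to_idx = build_index(words)
tag_to_idx = build_index(tags)
char_to_idx = build_index(chars)

sentence = [word for word, _ in tagged_sentences[1]]
print(sequence_to_idx(sentence, word_to_idx))
# Vocabulary sizes for words, tags and characters
print(len(word_to_idx), len(tag_to_idx), len(char_to_idx))
```

To feed a model, each index list would be wrapped as `torch.tensor(indices, dtype=torch.long)`; plain lists are used here so the sketch runs without PyTorch.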
- After that we check how many unique words, tags and characters the dictionaries contain; these counts give us the vocabulary sizes for words, tags and characters.
- As you can see in figure 6, there are 12408 unique words, 12 unique tags (noun, pronoun, adjective, etc.) and 79 unique characters.
- Here we convert each word into an embedding. An embedding is a tensor (a vector of learned numbers) that represents a word or character, so that our deep learning model can read it and train on it. The word embedding dimension is 1024 and the character embedding dimension is 128. The word hidden dimension is 1024, the character hidden dimension is 1024, and we run 70 epochs.
- Then we enter the main part of the program, the dual LSTM tagger, which generates the tags (parts of speech) for our words. Inside the dual LSTM tagger we have the constructor, the init function, to which the word embedding dimension and the other sizes are provided.
- It also defines the char_embedding and the char_lstm. As discussed before, we are building an LSTM model, and hidden to tag is the final layer that produces the tag from the hidden layer. As mentioned, in an LSTM each hidden state is passed on to the next one, the final output comes from the last hidden layer, and the tag is predicted from the probabilities.
- For this we use a linear layer, as we don't need anything else. Its input size is the hidden dimension of the preceding LSTM and its output size is the tag vocabulary size, which is 12 unique tags. It gives a score for each tag, which is turned into a probability, and we take the tag with the highest probability.
- In the forward pass we generate the embeddings for the sentence and pass them through the model.
- For each word we look up the character embeddings and pass them through the char LSTM.
- The layer shown in figure 7 gives us the output of that LSTM module along with the state that is passed on to the next step.
- Then we stack up all the hidden states and finally take the output from the last hidden layer.
- The output is produced from the last hidden layer in the forward pass; in back-propagation we compute the derivatives for each LSTM layer.
- The tag scores are computed with the softmax function, or more precisely the log of the softmax (log-softmax) in this case.
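The architecture described above might look roughly like the sketch below. This is an assumption about how the pieces fit together, not the original code: the class layout, the `forward(sentence, words)` signature, and the choice to concatenate each word embedding with the final hidden state of the character LSTM are all illustrative. The dimensions in the demo are deliberately tiny rather than the 1024/128 used in the tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualLSTMTagger(nn.Module):
    def __init__(self, word_emb_dim, word_hidden_dim, char_emb_dim,
                 char_hidden_dim, word_vocab_size, char_vocab_size,
                 tag_vocab_size):
        super().__init__()
        self.word_embedding = nn.Embedding(word_vocab_size, word_emb_dim)
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim)
        self.char_lstm = nn.LSTM(char_emb_dim, char_hidden_dim)
        # The word LSTM sees the word embedding plus the character summary.
        self.lstm = nn.LSTM(word_emb_dim + char_hidden_dim, word_hidden_dim)
        self.hidden2tag = nn.Linear(word_hidden_dim, tag_vocab_size)

    def forward(self, sentence, words):
        # sentence: LongTensor of word indices, shape (seq_len,)
        # words: list of LongTensors, the character indices of each word
        embeds = self.word_embedding(sentence)           # (seq_len, word_emb)
        char_summaries = []
        for word in words:
            char_embeds = self.char_embedding(word)      # (n_chars, char_emb)
            # nn.LSTM expects (seq_len, batch, features); batch size is 1.
            _, (h_n, _) = self.char_lstm(char_embeds.unsqueeze(1))
            char_summaries.append(h_n[-1].squeeze(0))    # (char_hidden,)
        char_summaries = torch.stack(char_summaries)     # (seq_len, char_hidden)
        combined = torch.cat((embeds, char_summaries), dim=1)
        lstm_out, _ = self.lstm(combined.unsqueeze(1))   # (seq_len, 1, hidden)
        tag_space = self.hidden2tag(lstm_out.squeeze(1)) # (seq_len, n_tags)
        return F.log_softmax(tag_space, dim=1)


torch.manual_seed(0)
model = DualLSTMTagger(word_emb_dim=8, word_hidden_dim=16, char_emb_dim=4,
                       char_hidden_dim=6, word_vocab_size=10,
                       char_vocab_size=20, tag_vocab_size=12)
sentence = torch.tensor([1, 4, 7], dtype=torch.long)
words = [torch.tensor([0, 3]), torch.tensor([5]), torch.tensor([2, 9, 11])]
tag_scores = model(sentence, words)
print(tag_scores.shape)  # one row of 12 log-probabilities per word
```

Because the last step is a log-softmax, exponentiating each row gives probabilities that sum to 1, and the position of the largest value in a row is the predicted tag.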
Now let's move on to the output part, in which we build the model.
- Then we pass it to the GPU.
- The loss function is the negative log likelihood (NLL) function, which works well here, as we have previously discussed. When you double click on it, Google Colab shows its documentation comment: the negative log likelihood loss is useful for training a classification problem with C classes.
- We could also use the cross-entropy loss, but since the model already outputs log-softmax scores, NLL loss is the matching choice here; the cross-entropy loss in PyTorch expects raw scores and applies the log-softmax itself.
- Our optimizer is Adam.
- We take a sample sentence here: “everybody eating the food, I kept looking out the window trying to find the one I was waiting for”.
- Before training the model, we check what output it gives for this sentence.
- We convert the sentence into tensors: first each word, then its characters, and then the whole sentence.
- Then we compute the tag scores by passing the sentence through the model; note that the model is still untrained at this point, as you can see in figure 8.
- Because it has the LSTM structure it will generate some values, but they won't be accurate.
- Then we apply torch.max, as you can see in figure 8. It acts as the argmax function: it tells you the tag with the highest probability, so if noun has the highest probability, the output is noun.
- Then we train the model and print the predictions again to see what has changed.
- In the training code we have an accuracy list, a loss list, an interval, and the number of epochs, which is 70; progress is reported at epoch intervals. We use the same functions as before to convert the words, characters and tags to tensors, as shown in figure 8.
- Then we call model.zero_grad, which resets the gradients to zero so that they do not accumulate from the previous step. After that, the forward pass generates the tag scores, and we compute the loss with the negative log likelihood loss function.
- In loss.backward, back-propagation computes the gradients.
- Optimizer.step then updates the weights and biases using those gradients.
- Then we add the loss to the cumulative loss.
- Finally we take the part of speech with the maximum probability as the output, and we keep track of the accuracy and loss after each epoch.
- Run the program.
- Let's look at the result. As you may remember, the sentence we have to generate predictions for is “everybody eating the food, I kept looking out the window trying to find the one I was waiting for”.
- As you can see in figure 9, the result tags everybody as pronoun, eats as num, the as num, food as conjunction, etc.
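The training steps above can be sketched as follows. To keep the example short, `TinyTagger` is a word-only stand-in for the dual LSTM tagger, and the toy data, learning rate and 30 epochs are assumptions chosen so the sketch runs in a moment; none of these come from the original program.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class TinyTagger(nn.Module):
    """Word-only LSTM tagger, a simplified stand-in for the dual LSTM tagger."""

    def __init__(self, vocab_size, tag_size, emb_dim=16, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tag_size)

    def forward(self, sentence):
        embeds = self.embedding(sentence).unsqueeze(1)  # (seq_len, 1, emb_dim)
        lstm_out, _ = self.lstm(embeds)
        return F.log_softmax(self.hidden2tag(lstm_out.squeeze(1)), dim=1)


torch.manual_seed(0)
# Toy training data: (word indices, tag indices) for two short sentences.
training_data = [
    (torch.tensor([0, 1, 2]), torch.tensor([0, 1, 2])),
    (torch.tensor([0, 3, 4]), torch.tensor([0, 1, 2])),
]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyTagger(vocab_size=5, tag_size=3).to(device)
loss_function = nn.NLLLoss()  # pairs with the log-softmax output
optimizer = optim.Adam(model.parameters(), lr=0.01)

loss_list, accuracy_list = [], []
for epoch in range(30):
    epoch_loss, correct, total = 0.0, 0, 0
    for sentence, targets in training_data:
        sentence, targets = sentence.to(device), targets.to(device)
        model.zero_grad()                  # reset gradients from the last step
        tag_scores = model(sentence)       # forward pass
        loss = loss_function(tag_scores, targets)
        loss.backward()                    # back-propagation
        optimizer.step()                   # update weights and biases
        epoch_loss += loss.item()          # cumulative loss for this epoch
        _, predicted = torch.max(tag_scores, dim=1)  # argmax over the tags
        correct += (predicted == targets).sum().item()
        total += targets.numel()
    loss_list.append(epoch_loss / len(training_data))
    accuracy_list.append(correct / total)

print(loss_list[0], loss_list[-1], accuracy_list[-1])
```

The two lists hold one value per epoch, which is what you would inspect or plot to check that the loss is falling and the accuracy is rising.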
After we train the model, figure 10 shows the trend over the epochs and suggests how it would continue if we ran more epochs, for example another 40 or 50.
- As you can see from the result, the model performs well, although how well depends on the dataset.
- One thing to remember is that the accuracy should generally increase from epoch to epoch. If you are not sure about the trend of the accuracies over the epochs, plot them, and use the plot to judge whether training is going in the right direction.
- So now our model has been trained on the full tagged corpus.
- If you check the sentence used above, you will still see a few mistakes, but the result is much better: the model reaches 96 percent accuracy. Of course the accuracy can't be 100 percent, but running more epochs may improve it further.
- So now we have a trained model for tagging sequences of words. It can be used not only for generating tags but also to support word prediction.
- For that, you first need to know which part of speech is likely to come after the current part of speech, and then work out which word has the highest probability within that tag.
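One simple way to estimate which part of speech tends to follow the current one is to count tag pairs in the tagged sentences. The sketch below is an illustration of that idea under this counting assumption, not part of the original program; `tag_transition_probs` and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def tag_transition_probs(tagged_sentences):
    """Estimate P(next tag | current tag) from adjacent tag pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for current_tag, next_tag in zip(tags, tags[1:]):
            counts[current_tag][next_tag] += 1
    probs = {}
    for current_tag, followers in counts.items():
        total = sum(followers.values())
        probs[current_tag] = {t: c / total for t, c in followers.items()}
    return probs


sample = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("The", "DET"), ("old", "ADJ"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
probs = tag_transition_probs(sample)
print(probs["DET"])   # tags seen after a determiner, with their probabilities
print(probs["NOUN"])  # tags seen after a noun
```

With the most likely next tag in hand, candidate words can then be ranked within that tag, which is the second step described above.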