Natural Language Processing: Bag of Words Scikit-learn Tutorial

Table of Contents

  1. Bag of Words without Stopwords
  2. Bag of Words with Stopwords
  3. TF-IDF
  4. Hashing
  5. Video Tutorials (2 videos)

1 Bag of Words without Stopwords

  1. Bag of words is an important concept in natural language processing, generally used to transform text into vectors.
  2. The bag of words of a text simply describes the occurrence of each word in the corpus.
  3. A corpus is nothing more than simple textual data: a collection of the words in your documents.
  4. We have taken a list of three sentences that we need to convert into vectors; after that, we will use the stop words from sklearn.
Figure-1
  1. When we execute this, each sentence is converted into a vector, and each word becomes its count in the corpus.
  2. We can do this using CountVectorizer from sklearn: we pass the corpus to it and transform it into vectors (a code sketch follows this list).
  3. We can see that each word present in the corpus is given a specific position in the vector; those are the unique words.
  4. This simply means that these are the bag of words: the words selected from the sentences, which we will compute and feed to the model.
  5. When we do this, we sometimes also get words that do not hold much information.
  6. Those words should be removed; for example, 'in' is given a vector value, and 'the' is also given a vector value.
  7. Now we need to use stop words to decrease the size of the vocabulary.
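Below is a minimal sketch of this step. The exact corpus from Figure-1 is not reproduced here, so the three sentences are stand-ins:

    from sklearn.feature_extraction.text import CountVectorizer

    # Stand-in corpus of three sentences (the original corpus is shown in Figure-1).
    corpus = [
        "the cat sat in the hat",
        "the dog sat in the yard",
        "the cat chased the dog",
    ]

    vectorizer = CountVectorizer()        # no stop-word removal yet
    X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

    # get_feature_names_out() requires sklearn >= 1.0 (older versions: get_feature_names())
    print(vectorizer.get_feature_names_out())  # unique words, including 'in' and 'the'
    print(X.toarray())                         # each row is one sentence's word counts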

2 Bag of Words with Stopwords

  1. Now we have taken the same corpus, but this time we will pass English as the stop words.
  2. We will remove the extra or unnecessary words present inside our corpus (see the sketch after this list).
Figure-2
  1. We can see that here, before any count is assigned, the stop words are removed, and we receive only the words that are necessary.
  2. All the informative words remain after the removal of the stop words, and, as you can see, we now also have far fewer values in our bag of words.
  3. It is very important to reduce our bag of words to the important information present in our corpus; that will drastically improve the accuracy of the model.
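A sketch of the same step with stop-word removal enabled, again using the stand-in sentences:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat in the hat",
        "the dog sat in the yard",
        "the cat chased the dog",
    ]

    # stop_words='english' drops common low-information words such as 'in' and 'the'.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # ['cat' 'chased' 'dog' 'hat' 'sat' 'yard']
    print(X.toarray())                         # fewer columns than before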

3 TF-IDF

  1. In this section, we will discuss the TF-IDF bag of words in natural language processing. TF-IDF means Term Frequency-Inverse Document Frequency.
  2. It gives us weights that we can use to convert our corpus from textual data to vector data. The conversion uses tf-idf weights, which build the vector from the word frequencies of the given data. If a word occurs often within a single document, more importance is given to that word; but if a word occurs many times across the whole corpus, less importance is given to it. It does not assign random numbers or an arbitrary distribution to the corpus; it gives each word a vector value according to its frequencies.
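Concretely, scikit-learn (with its default smooth_idf=True) computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents containing term t; the tf-idf weight is the raw term count multiplied by this idf, and each document vector is then L2-normalized.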
Figure-3

So, we import TfidfVectorizer from the sklearn module and simply fit our corpus into the vectorizer.
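A minimal sketch, using the same stand-in sentences as above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cat sat in the hat",
        "the dog sat in the yard",
        "the cat chased the dog",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)  # tf-idf weighted document-term matrix

    print(vectorizer.get_feature_names_out())
    print(X.toarray())  # words appearing in every sentence get lower weights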

4 Hashing

There is one more step one can use in machine learning, called hashing. Hashing is a small trick that can save a large amount of memory. Normally, when you convert text to vectors, the words are first collected into a dictionary (the vocabulary), and only after that are they converted into a vector. A hashing vectorizer instead converts your raw corpus directly into vectors, skipping that two-step process and thereby saving a lot of memory. So, we import HashingVectorizer from sklearn.feature_extraction.text.

Figure-4

In HashingVectorizer we pass the number of features we want to produce; here we use 6. When we run this, each sentence is mapped to a 6-dimensional vector of hashed values.
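A sketch of this step, again with the stand-in sentences; note that HashingVectorizer's default output is signed and L2-normalized:

    from sklearn.feature_extraction.text import HashingVectorizer

    corpus = [
        "the cat sat in the hat",
        "the dog sat in the yard",
        "the cat chased the dog",
    ]

    # n_features fixes the output width; words are mapped straight to column
    # indices by a hash function, so no vocabulary dictionary is ever stored.
    vectorizer = HashingVectorizer(n_features=6)
    X = vectorizer.transform(corpus)  # stateless: no fit step is required

    print(X.toarray())  # three rows of six signed, L2-normalized values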

Video Tutorial-1

Video Tutorial-2
