Word Count on Alexa Review Data Using Python Pandas

In this practical, we are going to analyze a data set of reviews obtained from Alexa. The data set contains the date, the variation, the reviews, and the feedback given by users for some products. We want to find out how people feel about the products by looking at the words that are repeated most often in the reviews.

To start the practical, open Anaconda Navigator and launch JupyterLab. Open a new Jupyter notebook, import Pandas, and load the data set into a variable. Here we have used a variable called alexa_data, as shown in figure 1.

Figure 1: Importing the data set
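
As a rough sketch of this import step: the snippet below reads a few made-up, tab-separated rows from memory so that it runs on its own. The column names follow the description above; the sample rows and the file name mentioned in the comment are assumptions, not taken from the original notebook.

```python
import io
import pandas as pd

# Stand-in for the real file: a few invented tab-separated rows.
# With the actual data set you would pass its path instead, for example
# pd.read_csv("amazon_alexa.tsv", sep="\t")  (file name is an assumption).
sample_tsv = io.StringIO(
    "date\tvariation\tverified_reviews\tfeedback\n"
    "31-Jul-18\tCharcoal Fabric\tLove my Echo!\t1\n"
    "31-Jul-18\tHeather Gray Fabric\tWorks great\t1\n"
    "30-Jul-18\tBlack\tSound is terrible\t0\n"
)

alexa_data = pd.read_csv(sample_tsv, sep="\t")
print(alexa_data.head())
```
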
  1. Next, we are going to convert each review in the verified_reviews column into a list of words and store the result in a new column named desc_conver.

To do this, the data in the verified_reviews column should first be converted to lowercase. The str.lower() function can be used for this purpose. Then the str.strip() function is used to remove any spaces at the beginning and end of each review. Every review also has spaces between its words.

To break a review at those spaces, the str.split() function can be used. It returns a list of the words, which is displayed with a comma after each word. The full code snippet is shown below.

alexa_data['desc_conver'] = alexa_data.verified_reviews.str.lower().str.strip().str.split()

Execute the code as shown in figure 2. Observe that the values in the new column are lists, with the words separated by commas.

Figure 2: Generating a new column
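
To see what each step of that chain does, here is a small self-contained sketch. The two reviews are invented for illustration; only the column names come from the tutorial.

```python
import pandas as pd

# Made-up reviews standing in for the real verified_reviews column.
alexa_data = pd.DataFrame({
    "variation": ["Charcoal Fabric", "Black"],
    "verified_reviews": ["  Love my Echo! ", "Works  GREAT"],
})

alexa_data["desc_conver"] = (
    alexa_data["verified_reviews"]
    .str.lower()   # "  love my echo! "
    .str.strip()   # "love my echo!"
    .str.split()   # ["love", "my", "echo!"]
)

print(alexa_data["desc_conver"].tolist())
# [['love', 'my', 'echo!'], ['works', 'great']]
```

Note that str.split() with no argument splits on any run of whitespace, and that punctuation stays attached to the word.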

Now we can check which word is repeated most often. A loop can be used for this purpose.

First, initialize the conv_list_reviews variable as an empty list. Then apply the for loop as shown in figure 3.

Figure 3: The for loop

The loop selects the variation and desc_conver columns of the alexa_data data set and iterates over the rows one by one using the iterrows() function. For example, the first row yields the words love, then my, then echo. Then it moves on to the second row, and so on. This is shown in figure 4.

Figure 4: The iteration

Once row_data is available, we add a second for loop inside the first one to get the individual words. This inner loop goes through the list stored in row_data.desc_conver and passes each value to final_word. Finally, we append the data to the conv_list_reviews list by passing row_data.variation (from the first loop) and final_word (from the second loop).

Display conv_list_reviews as shown in figure 5. The output is now in list format. Observe that every word is listed along with its variation (Charcoal Fabric, Heather Gray Fabric, etc.).
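
The two loops described above might look roughly like this. The sample rows are invented, and appending each pair as a tuple is an assumption, since the exact code is only visible in the figures.

```python
import pandas as pd

# Invented rows that already contain the desc_conver word lists.
alexa_data = pd.DataFrame({
    "variation": ["Charcoal Fabric", "Black"],
    "desc_conver": [["love", "my", "echo"], ["works", "great"]],
})

conv_list_reviews = []

# Outer loop: one row (variation plus its list of words) at a time.
for index, row_data in alexa_data[["variation", "desc_conver"]].iterrows():
    # Inner loop: one word of that row's review at a time.
    for final_word in row_data.desc_conver:
        conv_list_reviews.append((row_data.variation, final_word))

print(conv_list_reviews)
# [('Charcoal Fabric', 'love'), ('Charcoal Fabric', 'my'), ('Charcoal Fabric', 'echo'),
#  ('Black', 'works'), ('Black', 'great')]
```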

  1. Now we need to convert the data into tabular form. As we discussed in the previous sessions, this can be done using the DataFrame() function in Pandas. Pass the relevant list (conv_list_reviews) and the column names (produc_type and Repeated_word) as arguments.

If there are any empty values among the repeated words, they should be removed. To do this, get the length of each word using the str.len() function and keep only the rows where it is greater than 0. Display the data frame as shown in figure 6.

Figure 6: Table containing the product type and the repeated word
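
A minimal sketch of building the table and dropping empty words follows. The list contents are invented, and the second column is spelled Repeated_word here to match the name used in the later grouping steps.

```python
import pandas as pd

# Invented (variation, word) pairs, including one empty word to filter out.
conv_list_reviews = [
    ("Charcoal Fabric", "love"),
    ("Charcoal Fabric", ""),
    ("Black", "works"),
]

data_set = pd.DataFrame(conv_list_reviews, columns=["produc_type", "Repeated_word"])

# Keep only rows whose word has at least one character.
data_set = data_set[data_set["Repeated_word"].str.len() > 0]
print(data_set)
```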

Once we have the data as shown above, we need to group it. The groupby() function can be used for this. The data should be grouped by product type, so pass produc_type as the argument.

To find out how many records there are for each repeated word, that is, the value count, use the value_counts() function on the Repeated_word column.

Then, to convert the result into tabular format, use the to_frame() function. Finally, we rename Repeated_word to Cnt_of_word using the rename() function. The code is shown in figure 7.

Figure 7: The code

Display the data_set data frame. This is shown in figure 8. As you can see, the repeated words and their counts are shown for each product type.

Figure 8: Most repeated word with the count
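
The grouping and counting steps can be sketched as below, on invented data. One deliberate difference from the description: the default name of the Series returned by value_counts() can differ between pandas versions, so this sketch names the Series Cnt_of_word before calling to_frame(), instead of renaming the column afterwards.

```python
import pandas as pd

# Invented (product type, word) rows.
data_set = pd.DataFrame(
    [("Black", "works"), ("Black", "works"), ("Black", "great"),
     ("White", "love"), ("White", "love"), ("White", "music")],
    columns=["produc_type", "Repeated_word"],
)

data_set = (
    data_set.groupby("produc_type")["Repeated_word"]
    .value_counts()          # count of each word within each product type
    .rename("Cnt_of_word")   # name the resulting Series
    .to_frame()              # turn it into a one-column data frame
)
print(data_set)
```

The result is indexed by both produc_type and Repeated_word, which is why the next step can group on index level 0.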

To display the top words for all our variations, a helper function called display_top_rec() can be used. It takes records and index_lvl as its inputs.

Inside this function, we apply groupby() to the data set. The data is grouped by index_lvl, which in this case is 0. Level 0 refers to the produc_type column. Then we use the apply() function with a Python lambda to get the top 3 values in each group, as shown in figure 9.

Figure 9: The code to display all the variations
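
One possible shape for such a helper is sketched here. The original body is only visible in the figure, so the details are assumptions: records is taken to be the counted data frame, top_n is a hypothetical extra parameter defaulting to 3, and nlargest() is one of several ways to pick the top rows.

```python
import pandas as pd

def display_top_rec(records, index_lvl, top_n=3):
    """Return the top_n rows with the highest Cnt_of_word in each group.

    records   -- data frame with a Cnt_of_word column and a MultiIndex
    index_lvl -- index level to group by (0 is produc_type here)
    top_n     -- hypothetical parameter; the tutorial uses 3
    """
    return (
        records.groupby(level=index_lvl, group_keys=False)
        .apply(lambda grp: grp.nlargest(top_n, "Cnt_of_word"))
    )

# Invented counts indexed by (produc_type, Repeated_word).
counts = pd.DataFrame(
    {"Cnt_of_word": [5, 4, 3, 2, 7, 1]},
    index=pd.MultiIndex.from_tuples(
        [("Black", "works"), ("Black", "great"), ("Black", "love"),
         ("Black", "music"), ("White", "love"), ("White", "sound")],
        names=["produc_type", "Repeated_word"],
    ),
)

top = display_top_rec(counts, index_lvl=0)
print(top)
```

With this sample, the Black group keeps works, great, and love, while music (the fourth-highest count) is dropped.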

Execute the function with the input parameters. Refer to the output shown in figure 10. The most repeated words and their counts are given for each product type. However, the words are I, the, it, to, etc. These are not useful to us.

Figure 10: The unusual output

We need to skip these kinds of words. The solution is to tell the program, at the point where we collect the words, which words should be skipped. These words (commonly called stop words) are specified inside the frozenset() function and stored in stop_words. Put this code (figure 11) before the loops. Then, inside the second for loop, use an if condition so that words in stop_words are not collected.

Figure 11: Specifying the words to skip
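
The filtering step can be sketched as follows. The stop-word list is an assumption: it contains the words named above (i, the, it, to) plus a few more added for illustration, and the sample rows are invented.

```python
import pandas as pd

# Words to skip; "my", "and", and "a" are illustrative additions.
stop_words = frozenset(["i", "the", "it", "to", "my", "and", "a"])

# Invented rows that already contain the desc_conver word lists.
alexa_data = pd.DataFrame({
    "variation": ["Charcoal Fabric", "Black"],
    "desc_conver": [["i", "love", "my", "echo"], ["it", "works"]],
})

conv_list_reviews = []
for index, row_data in alexa_data[["variation", "desc_conver"]].iterrows():
    for final_word in row_data.desc_conver:
        # Collect the word only if it is not a stop word.
        if final_word not in stop_words:
            conv_list_reviews.append((row_data.variation, final_word))

print(conv_list_reviews)
# [('Charcoal Fabric', 'love'), ('Charcoal Fabric', 'echo'), ('Black', 'works')]
```

Because the reviews were lowercased earlier, the stop words should be written in lowercase as well.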

Then execute all the code again and observe the final result. The output is now much more useful. We can now analyze the reviews effectively and come to a conclusion about how other people feel about the products. Observe the output shown in figure 12. For the Black product, the most repeated word is works, with a count of 63.


