Table of Contents
- Getting the count of a particular column
- Getting the count of all columns
- Getting maximum or minimum
- Standard Deviation
- Getting the information
- describe function
- info function
- Video Tutorial
In this session, the most common methods for obtaining statistics of a data set are discussed: count, min, max, mean, median, mode, and standard deviation. The basic meanings of some of these are:
- Mean – Average value of given values
- Median – Middle value
- Mode – Most repeated value
- Standard Deviation – Subtract the mean from each value and square the result, add the squares and divide by the number of values, then take the square root
To start the practical, open JupyterLab and launch a Jupyter notebook.
Import pandas, read the CSV file “car_sales.csv”, and display the data frame as shown in figure 1.
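As a rough sketch of this step, the snippet below reads CSV data into a data frame. Because the real car_sales.csv is not reproduced here, a small in-memory text with made-up values stands in for it; only the column names (Make, Year, Pct, Quantity, Price) and the first Quantity value (2884) are taken from this tutorial, everything else is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of car_sales.csv (values are made up).
csv_text = """Make,Year,Pct,Quantity,Price
Toyota,2007,18.2,2884,12090
Honda,2007,12.5,1885,13800
Ford,2008,9.7,1833,12090
Volvo,2009,3.1,1441,24500
Audi,2008,4.4,1056,27300
"""

# With the real file on disk this would be: car_sales = pd.read_csv("car_sales.csv")
car_sales = pd.read_csv(io.StringIO(csv_text))

# In a notebook, a bare `car_sales` on the last line of a cell displays the frame.
print(car_sales)
```

pd.read_csv accepts either a file path or a file-like object, which is why the in-memory text can be swapped for the real file name without other changes.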
To find the number of records in the data set, the count() function can be used. The data frame name must be specified when calling it.
2.1 Getting the count of a particular column
The number of records in a particular column can be printed by specifying the data frame and the column name, followed by the count function, as shown in figure 2. Assume that the count of the records in the Quantity column needs to be printed.
Please note that the count function does not take null values into account. To demonstrate this, delete a value in the Quantity column (here the first cell of the Quantity column is deleted) and re-import the file. Then execute the code again. As shown in figure 3, the count is now 9 because the first cell is a null value.
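The effect of a null value on the column count can be sketched as follows. The frame is a small made-up stand-in for car_sales.csv (five rows rather than the ten in the tutorial), and the missing first cell is simulated with NaN instead of editing the file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real car_sales.csv has more rows and columns.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
})

# Count of non-null records in one column.
full_count = car_sales["Quantity"].count()
print(full_count)

# Same data with the first Quantity cell missing, as after deleting it in the CSV.
with_gap = car_sales.assign(Quantity=[np.nan, 1885, 1833, 1441, 1056])
gap_count = with_gap["Quantity"].count()
print(gap_count)

# len() still reports every row, null or not.
print(len(with_gap))
```

Comparing count() with len() is a quick way to see how many values are missing in a column.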
2.2 Getting the count of all columns
In the previous section we specified the column name. If we do not specify it, the record count of every column is returned. See figure 4, where the count function is called directly on the data frame.
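A minimal sketch of the column-wise count, again on made-up sample data with one missing Quantity value so that the per-column counts differ:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one null in the Quantity column.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Year": [2007, 2007, 2008, 2009, 2008],
    "Quantity": [np.nan, 1885, 1833, 1441, 1056],
})

# Calling count() on the frame returns one non-null count per column.
column_counts = car_sales.count()
print(column_counts)
```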
3 Getting the maximum or minimum
The max() function can be used to find out the maximum value in a column.
- First, let’s find the maximum value in the Quantity column. (Please change the first cell value back to 2884, as we set it to null in the previous section.) Specify the data frame and then the column name with the max function, as shown in figure 5.
- If we want to find the column-wise max value, remove the column name from the code and execute it. As you can see in figure 5, it gives the max value of each column. The max value of the Make column is Volvo because strings are compared alphabetically and V is the highest first letter (A-Z) among the makes. Refer to figure 6 to observe the data set.
The minimum can be obtained in the same way as the maximum, but using the min() function. Figure 7 shows the minimum of all the columns. Refer to figure 6 to verify whether the printed values are correct.
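The max and min calls described above might look like the sketch below. The data is a hypothetical stand-in for car_sales.csv; Volvo is included so that the alphabetical comparison on the Make column can be seen.

```python
import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
    "Price": [12090, 13800, 12090, 24500, 27300],
})

# Maximum and minimum of a single column.
max_quantity = car_sales["Quantity"].max()
min_quantity = car_sales["Quantity"].min()
print(max_quantity, min_quantity)

# Column-wise maximum and minimum; strings are compared alphabetically.
column_max = car_sales.max()
column_min = car_sales.min()
print(column_max)
print(column_min)
```

Note that each column is reduced independently, so the values in the result do not come from a single row.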
The mean is the average value of a given set of values. The mean can be calculated by using the mean() function. Like the functions discussed previously, it can be used to get the mean of a particular column or of all the columns.
- Assume that we need to calculate the mean of the Quantity column. First, specify the data frame (car_sales), then the column name (Quantity). Then use the mean function as shown in figure 8.
To get the column-wise mean, remove the column name from the above code. Then execute it as shown in figure 9. Observe that the mean of the Make column is not shown, because the column contains strings. Older pandas versions skip such columns automatically; on pandas 2.0 or later, pass numeric_only=True to mean(), otherwise a TypeError is raised.
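A short sketch of both mean calls, using made-up sample values in place of car_sales.csv. numeric_only=True is passed explicitly so the example also runs on pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
    "Price": [12090, 13800, 12090, 24500, 27300],
})

# Mean of one column.
quantity_mean = car_sales["Quantity"].mean()
print(quantity_mean)

# Column-wise mean; numeric_only=True skips string columns such as Make.
column_means = car_sales.mean(numeric_only=True)
print(column_means)
```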
The median is the middle value of a given data set. It can be calculated using the median() function. Specify the data frame whose median you want to find and then call the median function. As in the sections above, this function can be used to find the median of a particular column or of all the columns (figure 10). On pandas 2.0 or later, pass numeric_only=True when calling it on the whole data frame so that string columns such as Make are skipped.
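The median calls can be sketched the same way. The values below are hypothetical; with five Quantity values, the median is simply the third one after sorting.

```python
import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Year": [2007, 2007, 2008, 2009, 2008],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
})

# Median of one column: the middle value once the values are sorted.
quantity_median = car_sales["Quantity"].median()
print(quantity_median)

# Column-wise median of the numeric columns only.
column_medians = car_sales.median(numeric_only=True)
print(column_medians)
```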
The mode is the most frequently repeated value in a given data set. It can be obtained with the mode() function, which can be applied to a particular column.
To obtain a clear mode, let’s first change multiple cell values to 2884 as shown in figure 11.
Then let’s find the mode of the Quantity column. First specify the data frame, then the column, and finally the mode() function, as shown in figure 12, and execute it. As you can see, the mode is shown as 2884.
For demonstration purposes, now let’s apply the mode() function to all the columns, as shown in figure 13, and execute it. For the Year column the mode is 2007; there is no other mode, so the remaining cells show NaN. For both the Pct and Quantity columns there are no repeated values, so all the values are shown. The Price column mode is 12090, so it appears in the first cell and the other cells in that column are NaN.
Let’s assume that there are no repeated values in the Quantity column, and then execute the code to calculate the mode of the Quantity column. As shown in figure 14, it outputs all the values in that column because no value occurs more often than the others.
As another example, execute the mode for the Pct column; it returns all the values in that column as well. As shown in figure 15, there is no single mode.
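The three situations described for the mode (a repeated value, no repeated values, and all columns at once) can be reproduced with the sketch below. The data is made up; only the values 2884, 2007 and 12090 echo the modes mentioned in this tutorial.

```python
import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2009, 2007],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
    "Price": [12090, 13800, 12090, 24500, 27300],
})

# A column with a repeated value: mode() returns a Series holding that value.
repeated = pd.Series([2884, 2884, 2884, 1441, 1056], name="Quantity")
repeated_mode = repeated.mode()
print(repeated_mode)

# A column with no repeats: every value ties, so all are returned (sorted).
unique_mode = car_sales["Quantity"].mode()
print(unique_mode)

# All columns at once: columns with fewer modes are padded with NaN.
all_modes = car_sales.mode()
print(all_modes)
```

mode() always returns a Series or data frame rather than a single number, because a data set can have several modes.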
7 Standard Deviation
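In the same way as the other functions, the std() function can be used to find the standard deviation. Note that pandas divides by the number of values minus one by default (the sample standard deviation); pass ddof=0 to divide by the number of values instead.
To calculate the standard deviation of the Quantity column, specify the data frame, then the column name, and lastly the std function, as shown in figure 16.
The column-wise std can be obtained if we remove the column name. This is shown in figure 17. On pandas 2.0 or later, pass numeric_only=True so that string columns such as Make are skipped.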
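The sketch below computes the standard deviation with pandas and checks it against a by-hand calculation that follows the definition from the introduction (divide by the number of values). The data is hypothetical, and the by-hand part is only there to illustrate the formula.

```python
import math

import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
    "Price": [12090, 13800, 12090, 24500, 27300],
})

quantity = car_sales["Quantity"]

# pandas default: sample standard deviation (divides by n - 1).
sample_std = quantity.std()

# ddof=0 divides by n, matching the definition given in the introduction.
population_std = quantity.std(ddof=0)

# The same population value computed step by step.
values = quantity.tolist()
mean = sum(values) / len(values)
by_hand = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

print(sample_std, population_std, by_hand)

# Column-wise standard deviation of the numeric columns only.
column_std = car_sales.std(numeric_only=True)
print(column_std)
```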
8 Getting the information
There are two functions that can be used to obtain the statistical or concise summary of the data frame. They are the describe() and info() functions.
8.1 describe function
Rather than obtaining the mean, median, std, etc. column by column using the relevant functions, the describe() function can be used. It gives a summary of the count, mean, std, min, max, and percentile values (the 50% row is the median) of each numeric column, as shown in figure 18. The mode is not part of this summary.
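As an illustration, describe() can be called on a made-up stand-in for car_sales.csv as follows; by default only the numeric columns are summarized.

```python
import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
    "Price": [12090, 13800, 12090, 24500, 27300],
})

# One row per statistic, one column per numeric column of the frame.
summary = car_sales.describe()
print(summary)

# Individual figures can be looked up by row label and column name.
print(summary.loc["50%", "Quantity"])
```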
8.2 info function
The info() function can be used to get a summary of index and column data types, non-null values, and the memory usage as shown in figure 19.
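A minimal sketch of info() on hypothetical sample data. info() prints its report and returns None, so the second call captures the text through the buf parameter in case it needs to be inspected or saved.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for car_sales.csv.
car_sales = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Ford", "Volvo", "Audi"],
    "Quantity": [2884, 1885, 1833, 1441, 1056],
    "Price": [12090, 13800, 12090, 24500, 27300],
})

# Prints the index range, column dtypes, non-null counts and memory usage.
car_sales.info()

# Capture the same report as a string instead of printing it.
buffer = io.StringIO()
car_sales.info(buf=buffer)
info_text = buffer.getvalue()
```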