Table of Contents
- Predicting values for missing dates
- More about date_range function
- Video Tutorial
In the previous session, we discussed how to predict missing values in the records. In this session, we are going to discuss predicting values for the missing dates.
Refer to figure 1. It shows the data set, but there are no records for some dates. As an example, look at the dates in rows 9 and 10: the records for the dates in between (9, 10, 11, 12) are missing. Hence, in this session, we are going to predict the values for those missing days.
2. Predicting values for missing dates
First, import the Pandas and NumPy libraries and load the data set into a Jupyter notebook as shown in figure 2. Note that the dteday column is not continuous, as some dates and their records are missing. In addition, some values within the existing records are missing (represented as NaN).
We know that there are missing dates, so we can use the date_range() function to get all the dates within a range. Pass the starting date and the ending date into the date_range() function as shown below. Since we need all the dates from January 1st to January 31st, we give the starting date as 01-01-2011 and the ending date as 01-31-2011. Then let’s assign the result to a variable named date_range.
date_range = pd.date_range('01-01-2011', '01-31-2011')
Evaluate the date_range variable. The output is a DatetimeIndex containing every date in the range, as shown in figure 3.
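As a quick check of the step above, here is a minimal sketch that builds the same range and inspects it. It assumes only that Pandas is installed; the print calls are there for illustration.

```python
import pandas as pd

# Every calendar day from January 1st to January 31st, 2011 (both ends included).
date_range = pd.date_range('01-01-2011', '01-31-2011')

# The result is a DatetimeIndex, not a plain Python list.
print(type(date_range).__name__)
print(len(date_range))
print(date_range[0], date_range[-1])
```

Both end points are included, so the index holds one entry per day of the month.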
Then we need to use the date_range variable as the date index of our bike_sharing data frame. To do this, the reindex() function in Pandas can be used. reindex() conforms a data frame to a new set of row or column labels: each existing record stays with its matching date, and every date that has no record gets a new row. We specify that these new rows should be represented as NaN by using the fill_value parameter (NaN is also the default fill value).
reindex() returns a new data frame, so we need to overwrite the existing data frame with the re-indexed result. The code is shown in figure 4.
Display the bike_sharing data frame and observe the updated table in figure 5. Note that the index now contains all the dates within the given range, and the missing values are represented as NaN as we specified.
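The re-indexing step can be sketched on a tiny stand-in table. The data below and the column name cnt are hypothetical, and the sketch assumes the dates have already been set as a datetime index; the real code is the one in figure 4.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bike_sharing: January 3rd and 4th have no record.
bike_sharing = pd.DataFrame(
    {'cnt': [985.0, 801.0, 1349.0]},
    index=pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-05']),
)

# Full range of dates that the index should cover.
date_range = pd.date_range('01-01-2011', '01-05-2011')

# Conform the frame to the full range; dates without a record become NaN rows.
bike_sharing = bike_sharing.reindex(date_range, fill_value=np.nan)
print(bike_sharing)
```

The three original records keep their values, and two new rows filled with NaN appear for the missing dates.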
Now let’s use linear interpolation to predict the missing records. To do this, use the interpolate() method and set the method parameter to 'linear'. Use the around() function in the NumPy library to round the estimated values, as shown in figure 6.
Execute the code and observe the output. The values for the missing records are predicted and rounded, as shown in figure 7.
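A minimal sketch of the interpolation and rounding step follows. The numbers and the column name cnt are made up for illustration, so the values differ from those in figure 7.

```python
import numpy as np
import pandas as pd

# Hypothetical re-indexed frame: two consecutive days have no value.
bike_sharing = pd.DataFrame(
    {'cnt': [100.0, np.nan, np.nan, 130.0]},
    index=pd.date_range('2011-01-01', periods=4),
)

# Linear interpolation fills each gap with evenly spaced values between
# the nearest known neighbours; np.around rounds the estimates.
bike_sharing = np.around(bike_sharing.interpolate(method='linear'))
print(bike_sharing)
```

Here the two gaps between 100 and 130 are filled with 110 and 120, which is what a straight line between the known points gives.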
3. More about date_range function
a. The start and end parameters
We have seen that the start and end dates don’t have to be passed by name: the first two positional arguments are taken as the start and the end. If we want to, we can name them explicitly by using the start and end parameters. This is shown in figure 8. There, the freq parameter is set to 'D', which is discussed in the next point.
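For reference, the keyword form might look like the sketch below. The variable name dates is only illustrative.

```python
import pandas as pd

# Same range as before, with the parameters named explicitly.
dates = pd.date_range(start='2011-01-01', end='2011-01-31', freq='D')
print(dates)
```

This yields the same 31 dates as the positional call used earlier.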
b. The freq and periods parameters
If we want to obtain the dates according to a specific frequency, the freq parameter can be used.
The periods parameter can be used to specify the number of dates that should be generated. Refer to figure 9. The periods value is given as 10, along with the start date. Hence, it shows only 10 dates starting from 2011-01-01.
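A short sketch of the periods parameter, using the same start date as the text (the variable name dates is illustrative):

```python
import pandas as pd

# Ten consecutive days starting from the given date; no end date is needed.
dates = pd.date_range(start='2011-01-01', periods=10)
print(dates)
```

With the default daily frequency, the last date generated is 2011-01-10.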
Let’s look at some possible values to pass into the freq parameter:
D – The calendar day frequency (the default)
This value outputs every date as it appears in the calendar. Observe figure 10.
B – The business days
This displays only the business days (Monday to Friday), as shown in figure 11.
M – The month end frequency (figure 12)
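The three frequencies above can be compared side by side with the sketch below. One caveat, stated as an assumption about your environment: recent Pandas releases spell the month end alias as 'ME' and deprecate 'M', so the sketch tries 'ME' first and falls back to 'M' on older versions.

```python
import pandas as pd

# D: every calendar day in January 2011.
calendar_days = pd.date_range(start='2011-01-01', end='2011-01-31', freq='D')

# B: business days only (Monday to Friday); weekends are skipped.
business_days = pd.date_range(start='2011-01-01', end='2011-01-31', freq='B')

# Month end: the last day of each month in 2011.
try:
    month_ends = pd.date_range(start='2011-01-01', end='2011-12-31', freq='ME')
except ValueError:
    # Older Pandas versions only know the 'M' alias.
    month_ends = pd.date_range(start='2011-01-01', end='2011-12-31', freq='M')

print(len(calendar_days), len(business_days), len(month_ends))
```

January 1st, 2011 fell on a Saturday, so the business day range starts on January 3rd.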