Predicting missing data using the linear method in Python pandas

Table of Contents

  1. Introduction
  2. Linear method
  3. Time series method
  4. Video Tutorial
1. Introduction

In this practical, we approximate missing values in tabular data using the pandas interpolate() function. This function performs the same calculation as mathematical interpolation: it estimates an unknown value from the known values around it.
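To make the link to mathematical interpolation concrete, here is a minimal sketch using made-up numbers (not taken from the data set). A missing value between two known points is estimated on the straight line joining them. The helper name `linear_estimate` is hypothetical and only serves the illustration.

```python
import numpy as np
import pandas as pd


def linear_estimate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


# Made-up example: the value at position 1 is missing between 10 and 20.
by_hand = linear_estimate(1, 0, 10.0, 2, 20.0)

# pandas fills the same gap with its default (linear) interpolation.
series = pd.Series([10.0, np.nan, 20.0])
by_pandas = series.interpolate().iloc[1]

print(by_hand, by_pandas)
```

Both approaches should give the midpoint of the two neighbouring values, because the missing point sits halfway between them.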

Let’s look at the data set used in this practical, shown in figure 1. It contains the number of vehicles booked by non-subscribed users and subscribed users in January 2011. Note that a few values are missing (highlighted in red). Let’s see how the interpolate() function can be used to approximate them.

Figure-1

2. Linear method

First, import the Pandas and NumPy libraries, then load the data file. Next, convert the dteday column to the datetime data type and set the dteday column as the index. The code snippet is shown in figure 2.

Figure-2
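Since the snippet itself only appears as an image, the following is a rough sketch of those steps. It makes several assumptions: the file name is not given in the text, so a small in-memory CSV with made-up values stands in for the data file, and the column names `non_subscribed` and `subscribed` are hypothetical. Only `dteday` and the variable name `stock_data` come from the article.

```python
import io

import numpy as np  # used later for rounding
import pandas as pd

# Hypothetical stand-in for the data file: made-up values, and the
# column names other than dteday are assumptions.
csv_text = """dteday,non_subscribed,subscribed
2011-01-01,100,500
2011-01-02,120,
2011-01-03,,700
2011-01-04,90,800
"""

stock_data = pd.read_csv(io.StringIO(csv_text))

# Convert dteday from text to the datetime data type.
stock_data["dteday"] = pd.to_datetime(stock_data["dteday"])

# Use dteday as the index column.
stock_data = stock_data.set_index("dteday")

print(stock_data)
```

With a real file, the `io.StringIO(csv_text)` argument would be replaced by the path to that file.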

Use the interpolate() function to approximate the missing values in the data set, as shown in figure 3. By default, this function uses linear interpolation. Note that values now appear where the missing values were.

Figure 3: Using interpolation function to approximate missing values
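A small sketch of this step on made-up data may help. The column names and values are hypothetical, and the result is stored in a new variable `filled` (an illustrative name), because interpolate() returns a new object and leaves the original untouched unless told otherwise.

```python
import numpy as np
import pandas as pd

# Made-up values; the column names are hypothetical.
stock_data = pd.DataFrame(
    {
        "non_subscribed": [100.0, np.nan, 140.0, 120.0],
        "subscribed": [500.0, 600.0, np.nan, 900.0],
    },
    index=pd.to_datetime(
        ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04"]
    ),
)

filled = stock_data.interpolate()  # default method is linear

print(stock_data.isna().sum().sum())  # count of missing values before
print(filled.isna().sum().sum())      # count of missing values after
print(filled)
```

Each gap is filled with a value on the straight line between its nearest known neighbours in the same column.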

Now we need to round the values. This can be done with the around() function from the NumPy library.

np.around(stock_data.interpolate())

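Note that the values are rounded to whole numbers, as shown in figure 4. NumPy rounds values that lie exactly halfway between two integers to the nearest even integer.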

Figure-4
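The rounding behaviour is easy to overlook, so here is a short sketch with made-up numbers. It shows how np.around treats values that fall exactly halfway between two integers, first on a plain array and then on an interpolated series.

```python
import numpy as np
import pandas as pd

# Halves go to the nearest even integer, not always upward.
halves = np.around(np.array([0.5, 1.5, 2.5, 3.5]))
print(halves)

# Made-up series whose interpolated gaps land exactly on a half.
series = pd.Series([1.0, np.nan, 2.0, np.nan, 3.0])
rounded = np.around(series.interpolate())
print(rounded.tolist())
```

If always rounding halves upward matters for your data, a different rounding rule would be needed; this sketch only illustrates the default.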

As discussed above, interpolate() performs linear interpolation by default. If you prefer, you can state this explicitly with the method parameter by passing the value ‘linear’, as shown below.

np.around(stock_data.interpolate(method="linear"))

When executed as shown in figure 5, it returns the same values as in figure 4.

Figure-5

3. Time series method

Look at the dates in the dteday column: many dates are not recorded. For example, there are no records between 2011-01-08 and 2011-01-13. The linear method ignores the index and treats the rows as equally spaced, so its approximations are inaccurate across such gaps. The elapsed time should therefore also be considered when interpolating, which is what the time series method does.

Set the method parameter to ‘time’ (in lowercase) to interpolate using the time series method, as shown below. This method requires a datetime index, which is why dteday was set as the index earlier.

np.around(stock_data.interpolate(method="time"))

Observe the approximated values in figure 6. They differ from the values obtained using the linear method.

Figure-6
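To see why the two methods disagree, consider this minimal sketch. The values are made up; only the gap between 2011-01-08 and 2011-01-13 mirrors the example mentioned above. The missing value is one day after the first record and five days before the last.

```python
import numpy as np
import pandas as pd

# Made-up counts on irregular dates: one day apart, then a five-day gap.
dates = pd.to_datetime(["2011-01-07", "2011-01-08", "2011-01-13"])
series = pd.Series([100.0, np.nan, 160.0], index=dates)

# Linear: rows are treated as evenly spaced, so the gap gets the midpoint.
linear = series.interpolate(method="linear")

# Time: the estimate is weighted by how much time has actually elapsed.
by_time = series.interpolate(method="time")

print(linear.iloc[1])
print(by_time.iloc[1])
```

The time-based estimate stays closer to the earlier record, because the missing date is only one sixth of the way through the six-day span.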

Video Tutorial
