how to apply union all and union using python pandas

Table of Contents

  1. Union All
  2. Union
  3. Video Tutorial

Open a Jupyter notebook and import Pandas and the data sets. In this session, we are going to use two sets of data, india_car_sales.csv and india_car_sales_1.csv data sets.

The first data set is shown in figure 1. It contains the columns: Year, Make, Quantity, and Pct for the year 2007. The second data set is shown in figure 2. It contains the same columns as in the data set one, but the year is different. The year values are 2008.

Figure 1: The second data set contains data from 2008
Figure-2

Union All operation 

The data frame which contains the first data set is called car_sales_2007 and the data frame which contains the second data set is called car_sales_2008.

First, let’s see how the union all operation can be used to for these two data frames and combine them in to one data frame.

Pass the two data frames as a list into a new list names car_sales as shown below.

car_sales = [car_sales_2007, car_sales_2008]

Execute the car_sales list as shown in figure 2. It can be observed that the values in two data frames are joined as one. So, it contains all the records indexed from 0-5 and 0-6. The list also contains the headers for the first data set and the second data set as well.

Figure 3: The new list containing two data frames

Then, this list should be converted into a tabular form. This can be done using the concat() function in Pandas as shown below.

car_sales = pd.concat(car_sales)

 Execute the data frame and display the data. The resulting data frame (figure 4) contains the both data sets. Which means they are concatenated. But, the issue is the index column. It is numbered from 0-5 then again 0-6.This means, index starts from 0 at the start of the new data set. This should be changed as 0-12.

Figure 4: Data set after concatenated

The issue with the index can be solved by using an additional parameter called ignore_index. The value should be put as True. This is shown below.

car_sales = pd.concat(car_sales, ignore_index = True)

Execute the code. Observe thefigure 5. Now, the index column is 0-12.

Figure 5: Index start from 0 and ends with 11

If we want to find out, for which data set each record belonged to, we can use the keys parameter. Let’s put keys values as 2007 and 2008. Pass these as a list in to the parameter as shown below.

car_sales = pd.concat(car_sales, keys= ([‘2007’, ‘2008’]))

As you can see in figure 6, for the first data set the key is shown as 2007 and for the second data set the key values is shown as 2008.

Figure 6: Using keys to identify the data sets

We can pass any name for the keys. As an example, car_sales_2007 and car_sales_2008 can be put as the keys. This is shown in figure 7.

Figure 7: Putting the key values as car_sales_2007 and car_sales_2008

Union Operation

  1. Refer the figure 5. It can be observed that the record 5 and record 12 are the same. Since the same record contained in both data sets, once the Union All operation was done,the both records were added in to the final data set. Now let’s see how the union operation is done in Python.
  2. We only need one record from the records. The duplicate should not be considered. To do this we can use the drop_duplicates() function as shown below.

car_sales = car_sales.drop_duplicates()

Execute the data frame as shown in figure 8. Observe that the duplicate records are absent.  Since there were two duplicate values now there are only 11 records.

Figure 8: Dropping the duplicates using drop_duplicates function

Video Tutorial

Leave a Reply

Your email address will not be published. Required fields are marked *