Table of Contents
- Union All
- Union
- Video Tutorial
Open a Jupyter notebook and import Pandas and the data sets. In this session, we are going to use two sets of data, india_car_sales.csv and india_car_sales_1.csv data sets.
The first data set is shown in figure 1. It contains the columns: Year, Make, Quantity, and Pct for the year 2007. The second data set is shown in figure 2. It contains the same columns as in the data set one, but the year is different. The year values are 2008.
Union All operation
The data frame which contains the first data set is called car_sales_2007 and the data frame which contains the second data set is called car_sales_2008.
First, let’s see how the union all operation can be used to for these two data frames and combine them in to one data frame.
Pass the two data frames as a list into a new list names car_sales as shown below.
car_sales = [car_sales_2007, car_sales_2008]
Execute the car_sales list as shown in figure 2. It can be observed that the values in two data frames are joined as one. So, it contains all the records indexed from 0-5 and 0-6. The list also contains the headers for the first data set and the second data set as well.
Then, this list should be converted into a tabular form. This can be done using the concat() function in Pandas as shown below.
car_sales = pd.concat(car_sales)
Execute the data frame and display the data. The resulting data frame (figure 4) contains the both data sets. Which means they are concatenated. But, the issue is the index column. It is numbered from 0-5 then again 0-6.This means, index starts from 0 at the start of the new data set. This should be changed as 0-12.
The issue with the index can be solved by using an additional parameter called ignore_index. The value should be put as True. This is shown below.
car_sales = pd.concat(car_sales, ignore_index = True)
Execute the code. Observe thefigure 5. Now, the index column is 0-12.
If we want to find out, for which data set each record belonged to, we can use the keys parameter. Let’s put keys values as 2007 and 2008. Pass these as a list in to the parameter as shown below.
car_sales = pd.concat(car_sales, keys= ([‘2007’, ‘2008’]))
As you can see in figure 6, for the first data set the key is shown as 2007 and for the second data set the key values is shown as 2008.
We can pass any name for the keys. As an example, car_sales_2007 and car_sales_2008 can be put as the keys. This is shown in figure 7.
Union Operation
- Refer the figure 5. It can be observed that the record 5 and record 12 are the same. Since the same record contained in both data sets, once the Union All operation was done,the both records were added in to the final data set. Now let’s see how the union operation is done in Python.
- We only need one record from the records. The duplicate should not be considered. To do this we can use the drop_duplicates() function as shown below.
car_sales = car_sales.drop_duplicates()
Execute the data frame as shown in figure 8. Observe that the duplicate records are absent. Since there were two duplicate values now there are only 11 records.