Table of Contents
- How to Pass a Delimiter
- Skipping the First n Rows
- Changing the Column Names
- Skipping n Footer Rows
- Display n Number of Rows
- Display Blank Lines
- Index the Columns while Importing
- Managing Duplicate Column Names
- Converting Column to Dates
- Displaying Data Types of the Columns
- Video Tutorial ( 2 videos )
1 How to Pass a Delimiter
Firstly, launch the Jupyterlab and open the Jupyter notebook that we are done practicals in the last time. In order to pass the delimiter use delimiter =’,’ command as shown in figure 1. A comma has to be put here because this is a csv file.
As an example, when a text file is used, pipe (|)should be used instead of the comma. That is delimiter = ‘|’ ( Pipe ). This is shown in figure 2. As shown in figure 3, If the comma delimiter is used for a txt file, all the columns will take as a one record.
2 How to Skip First n Rows
The first n rows can be skipped by using the “skiprows” parameter. If the first 3 rows should be skipped, type the parameter as skiprows=3. Then execute the code. In the output, it can be observed that the first 3 rows are skipped. This is shown in figure 4.
If the first 10 rows should be skipped, simply change the parameter to skiprows= 10, and then execute it. You will get the output accordingly as shown in figure 5.
3 Changing Column Names
As shown in figure 5, when you skip rows, the record is taken as the header where the column names should be. The header should be Year, Maker, sales_qty, and Pct.
This can be changed using the “names” parameter. The data should be passed to the “names” parameter as a list, which means names=[‘Year’,’Maker’,’sales_qty’, ‘Pct’] as shown in figure 6. After executing, it can be observed that the header contains the column names that we passed, not the records as previously.
4 Skipping n Footer Rows
In a situation where the bottom n number of records should be skipped, “skipfooter” parameter can be used. As an example, if the bottom 5 records should be skipped insert the skipfooter= 5 in to the command and execute it. This is shown in figure 7.
In here, it gives us a warning that,python should be added as the engine because c engine doesn’t support the “skipfooter” parameter. Hence, make sure to specify engine=’python’ when “skipfooter” parameter is used as shown in figure 8.
** Please note that
If “skipfooter” or “skiprows” parameters were not used and still we pass column names, using the “names” parameter, the output will take the column names given by using “names” parameter. It will not take the column names from the original file.
Use of header parameter
The “header” parameter can be used to skip the first n rows. The default value of the header is 0. If we put header=1, it’s going to skip the first record as shown in figure 9.
5 Display n Number of Rows
In order to display the first n number of rows, “nrows” parameter can be used. As an example, if the first 5 rows should be displayed, nrows=5 should be added to the command as shown in figure 9. It can be observed that only the first 5 rows are displayed in the output.
6 Display Blank Lines
Launch Jupyterlab and open a Jupyter Notebook. Let’s start with importing Pandas and importing the txt file as we did in the previous session as shown in figure 11.
There are no blank rows in our original txt file. Hence, for demonstration purposes add some blank lines in the txt file as shown in figure 12.
- Then again execute the code in the Jupyter Notebook. As you can see in figure 1, it doesn’t show the blank lines in the output.
- In order to display the blank lines, an additional parameter should be used as shown below.
Skip_blank_lines= False
By setting skip_blank_lines parameter to False, the blank lines can be displayed. As shown in figure 13, the blank cell values are displayed as “NaN”.
The default value of skip_blank_lines parameter is set to be true. This was the reason behind not displaying the blank lines at the beginning. Hence, to display the blank lines, set the parameter to false.
7 Index the Columns while Importing
- The index_col parameter can be used to index the required column. As an example, if you want the 0 th column to be the index column,index_col = 0 should be added to the code as shown in figure 14. It can be observed that, in the output the 0 th column is displayed as the index column.
- Likewise, other columns can be set as the index column as well. In figure 15, the column 1 is set as the index column. If the Pct column should be set as the index column as in figure 6, set index_col =3.
- If you put the index column as column 4, it will give an error message because, in here there is no 4th column.
Likewise, other columns can be set as the index column as well. In figure 5, the column 1 is set as the index column. If the Pct column should be set as the index column as in figure 6, set index_col =3. If you put the index column as column 4, it will give an error message because, in here there is no 4th column
- Moreover, indexing can be done from the backward order as well as from the forward order. As an example if the Pct column should be set as the index column using backward indexing, Index_col = -1 the parameter value should be set to -1, whereas in forwarding indexing it is set as 3. This is shown in figure 17.
By setting the value as -2,-3 and -4, the index column can be set as Quantity, Make, and year. Please note that, in backward indexing there is no 0th column. If the parameter is set to -5, it will give an error message because, there is no -5th column in this data set.
8 Managing Duplicate Column Names
Let us think that mistakenly we have put the same column name twice as shown in figure 8. The name Quantity has set for two columns. After executing the code, it can be observed that from figure 9Python has automatically detected this and by default, it will manage the column names as Quantity, Quantity 1.
Likewise, if Python detects there are 3 columns using the name as Quantity, it will automatically handle this by renaming the columns as Quantity, Quantity 1, and Quantity 2 as shown in figure 21
- This can be also handled by passing a parameter as well as shown in figure 10, by setting the, mangle_dupe_col= True.
- Even if the parameter is not set to True, duplicate columns will be handled by Python by default. If the parameter is set to False, error message will be pop up saying that its not supported by Python.
9 Converting Column to Dates
- In order to convert strings to date data type, parse_date parameter can be used. Lets assume that the year column should be converted to dates. The column name which is needed to be converted should be typed inside the square bracket as shown below.
- parse_date = [‘Year’]
- Execute the code. It can be seen that the Python has taken 01-01 as the default values for the month and the date. In order to verify if this is converted to date data type, lets display the data types of each column.
10 Displaying Data Types of Columns
The data type object dtypes can be used to display the data type of each column. This can be done by executing the code: car_sales.dtypes
Figure 23 shows the output. It can be observed that, before converting to date data type, the data type of the Year column is a string. But after converting, it has become datetime64[ns] data type. The output also shows the data types of the other columns.