Introduction to pandas

Pandas is a free and open source Python library released under the BSD license. Originally developed by Wes McKinney in 2008, and supported by companies like Intel and Nvidia, it is targeted at data analysis and manipulation. It can be easily used to read and write data from a variety of sources such as Excel spreadsheets, Comma-separated Values (CSV) files, and many more.

In this tutorial we explore the basic concepts behind pandas, and we learn how to manipulate tabular data and create statistics and plots.

In this tutorial you will learn:

  • How to install the pandas library
  • How to manipulate tabular data
  • How to create statistics
  • How to create plots
Introduction to pandas
Introduction to pandas
Software Requirements and Linux Command Line Conventions
Category Requirements, Conventions or Software Version Used
System Distribution-agnostic
Software python3, pandas, matplotlib
Other basic knowledge of Python and object oriented programming
Conventions # – requires given linux-commands to be executed with root privileges either directly as a root user or by use of sudo command
$ – requires given linux-commands to be executed as a regular non-privileged user

Installation

The pandas library is available in the official repositories of all the most used Linux distributions. To install pandas on Fedora we use the dnf  package manager:

$ sudo dnf install python3-pandas

On Debian and derivatives, instead, we can use apt:

$ sudo apt install python3-pandas

To install the package on Archlinux we use pacman:

$ sudo pacman -S python-pandas



Native installations of software via package managers provide the maximum system integration; sometimes, however, the version of an application or a library maintained by a distribution may lag behind upstream; other times we may need a specific version of it, perhaps for compatibility reasons. In those cases, the recommended way to install pandas is to use pip, the Python package manager:

$ pip install pandas

This installation method is universal, and doesn’t require privilege escalation. Using pip we can also install  packages inside project-specific virtual environments.

The DataFrame

The first class we need to get familiar with when working with the pandas library is DataFrame: it is used to  represent a bi-dimensional data structure which can be seen as an equivalent of an SQL Table or an Excel Spreadsheet. Later in this tutorial we will actually see how a DataFrame object can be initialized by reading data from external sources as the ones we just mentioned; for the moment, however, let’s see how to create one from scratch.



Creating a DataFrame from scratch is pretty simple. One way to do it is to invoke the class constructor and pass a dictionary as the first argument. When we pass data this way, dictionary keys are used as column labels, and the associated values are used to populate the “column”. Let’s clarify this with an example. Suppose we want to create a DataFrame object with the following information about some characters from “The Lord of the Rings” book:

Name Age Race
Frodo 50 Hobbit
Galadriel 8000 Elf
Aragorn 87 Man

Names and races are represented by strings, while integers are used to represent the age of each character. Here is how we could initialize our DataFrame:

import pandas as pd

data = {
  "Name": ["Frodo", "Galadriel", "Aragorn"],
  "Age": [50, 8000, 87],
  "Race": ["Hobbit", "Elf", "Man"]
}
characters = pd.DataFrame(data)

As an alternative, we can pass the same data by using a list of lists, each one representing a “row”. When we initialiaze the object this way, we must provide the column labels explicitly, using the columns parameter:

data = [
    ['Frodo', 50, 'Hobbit'],
    ['Galadriel', 8000, 'Elf'],
    ['Aragorn', 87, 'Man']
]
characters = pd.DataFrame(data, columns=['Name', 'Age', 'Race'])

To get a “pretty” representation of a DataFrame object we can just “print” it:

print(characters)

The result:

	        Name   Age    Race
	0      Frodo    50  Hobbit
	1  Galadriel  8000     Elf
	2    Aragorn    87     Man

Index labels and index positions

If not otherwise specified, as you can see above, numeric labels are used to identify rows. These labels are often referred to as “index”. In such cases they correspond to their positions, which starts from 0. Index labels, however, can be explicitly defined at initialization time via the index parameter, or modified by using the rename method:

characters.rename(index={0: "r0", 1: "r1", 2: "r2"}, inplace=True)

Here is how our DataFrame object looks like after invoking the method above (the object is changed in place since we used inplace=True):

         Name   Age    Race
r0      Frodo    50  Hobbit
r1  Galadriel  8000     Elf
r2    Aragorn    87     Man

The label of the first index in the DataFrame above is r0, but the position of the index is 0. We can select rows and columns by using both index labels and index positions. To select the first row by its index label, for example, we would use the loc property:

characters.loc['r0']

To select it by index position, instead, we would use the iloc property:

characters.iloc[0]

Working with rows

We saw some examples of how to select specific rows by their index labels and positions. Now let’s see how to filter rows using conditions. Suppose we want to retrieve all the rows relative to characters older than 50 years. We need to test where the value of the “Age” column is > 50. Here is how we can do it:

older_than_fifty = characters[characters['Age'] > 50]



In the example above we used the characters['Age'] > 50 condition: it is applied to the DataFrame object and returns a Series object. In pandas, a “Series” object represents a one-dimensional array which can contain any data type. In this case, it will contain boolean values: True when a value in the “Age” column matches the condition, False, otherwise. Here is the “printed” representation of the object:

0    False
1     True
2     True

The Series object returned by the condition is used as a “filter” on the original DataFrame object. Only the corresponding rows are returned:

      	Name   Age Race
1  Galadriel  8000  Elf	
2    Aragorn    87  Man

Multiple conditions can be specified using logical operators. Suppose we want to select rows relative to characters older than 50 years and whose names start with an “A”. We would write:

characters[(characters['Age'] > 50) & (characters['Name'].str.startswith('A'))]

We enclosed each condition between parentheses and used one of the pandas logical operators (&, |, ~), which are the equivalent to Python’s and, or and not. In the example above you can also see we used an accessor: str. Accessors let us access values of Series and modify them by using the available methods, which obviously change according to the data type. In this case, since characters names are represented by strings, we used the str accessor and applied the startswith method, which works similarly to the native Python method with the same name. As expected, applying the condition returns the following result:

      Name  Age Race
2  Aragorn   87  Man

Adding and replacing rows

Adding rows to an existing DataFrame object is very simple. We saw how to use the loc and iloc properties to retrieve specific rows, but they can be also be used to assign values. To create a new row we just have to provide the new index label (or the index position if using iloc) we want to use for it, and assign the row values. Suppose we want to add a new character to our DataFrame object. We could write:

characters.loc[3] = {'Name': 'Bilbo', 'Age': 110, 'Race': 'Hobbit'}



If we use an already existing index, the existing row will be replaced. For instance, to replace the row with index label 1 (the one about Galadriel) with a row holding information about the Theoden character, we would write:

characters.loc[1] = {'Name': 'Theoden', 'Age': 71, 'Race': 'Man'}

What if we want to add multiple rows at once? To accomplish such task we can create a new DataFrame object with the rows we want to append, and than merge it with the original DataFrame using the concat function:

additional_characters = pd.DataFrame({ 
  'Name': ['Boromir', 'Denethor', 'Elrond'], 
  'Age': [41, 89, 6000], 
  'Race': ['Man', 'Man', 'Elf'] 
}) 
all_characters = pd.concat([characters, additional_characters], ignore_index=True)

In the example above we called concat and passed the list of the DataFrame objects we want to merge as first argument. We also specified we don’t want to preserve existing index labels, by setting the ignore_index argument to True.

Removing rows

To remove a row, we can use the drop method of the DataFrame class. The drop method works with index labels, so supposing we want to remove the first row from the table (index label 0), we would run:

characters.drop(0, inplace=True)

The first and only positional argument accepted by the method can be a single label or a list of index labels. By default, the drop method doesn’t modify the original object in place but returns a new object. If we want to alter this behavior, we need to set the inplace parameter to True, as we did in the example above.

Working with columns

Suppose we want to retrieve only the values of the “Age” column from our DataFrame object. It is a very simple operation. All we have to do is to use the column label as a dictionary key:

ages = characters['Age']

Accessing the DataFrame object column this way, as we already saw, returns a Series object:

0      50
1    8000
2      87
Name: Age, dtype: int64



Multiple columns can be “sliced” from a DataFrame object at once. For example:

subset = characters[['Name', 'Age']]

Selecting multiple columns returns a new DataFrame object (Series object are one-dimensional).

Adding columns

Now, let’s say we want to add a new column to our DataFrame object to hold information about characters residence. Once again, we can do it as if we were adding a new key to a dictionary. Here is what we would write:

characters['Residence'] = ['The Shire', 'Lothlorien', 'Minas Tirith']

The original DataFrame now should look like that:

        Name   Age    Race     Residence
0      Frodo    50  Hobbit     The Shire
1  Galadriel  8000     Elf    Lothlorien
2    Aragorn    87     Man  Minas Tirith

The methods above modifies the DataFrame object in place. If instead we want the result to be assigned to a new object, we would use the assign method:

result = characters.assign(Residence=['The Shire', 'Lothlorien', 'Minas Tirith'])



To insert a column at a specific location, instead, we can use the insert method. Say for example we want to insert the column after the one labeled “Name”, which is the first (position 0). We would write:

characters.insert(1, "Residence", ['The Shire', 'Lothlorien', 'Minas Tirith'])

Deleting columns

To delete a column we can use the same method we use to remove rows. The only thing we need to change is the axis parameter, setting it to 1 (the default is 0), to specify we want to work with columns. For instance, to remove the column with the “Name” label, we would write:

characters.drop("Name", axis=1, inplace=True)

Alternatively, we could specify the name of columns to drop via the columns parameter:

characters.drop(columns=["Name"], inplace=True)

Accessing single values

Until now we saw some examples of how to manage entire rows and columns. Suppose, instead, we want to read or modify only the age of the “Frodo” character in our DataFrame object, without replacing the entire row. To retrieve a single value of a row/column pair (a “cell” in spreadsheet terminology), we can use the at property:

frodo_age = characters.at[0, 'Age'] 

To assign a new value, instead:

characters.at[0, 'Age'] = 51 



The at property, similarly to loc works with “labels”, so the first argument we used in the example above, between square brackets, is not the row index position, but the index label. If we want to work with index positions, we need use iat, instead. Since the row related to “Frodo” is the first one (position 0), and the “Age” column is the second one (position 1), we would write:

characters.iat[0, 1] = 51

Perform calculations

Being able to perform calculations based on existing data is crucial. The pandas library let us perform such operations quite easily. Here is an example. Suppose we want to calculate the average age of the non-Elf characters in the “all_characters” DataFrame object we defined previously. Here is how we could proceed. First we select the rows in which the value of the “Race” column is not “Elf”: this operation, as we already saw, returns a new DataFrame object. We “slice” the “Age” column from it, obtaining a Series object on which we invoke the mean method:

all_characters[all_characters['Race'] != 'Elf']['Age'].mean() 



As expected, we obtain 66.75 as result. Another example. This time we want to find the youngest non-elf character:

characters[characters['Race'] != 'Elf']['Age'].min()

What if we want to obtain the average age of characters based on their race?. In such situations we have to use the groupby method:

all_characters.groupby('Race')['Age'].mean()

We obtain the following result:

Race
Elf          7000.000000
Hobbit         50.000000
Man            72.333333

These are only few basic examples of the computations we can perform on the data contained in our DataFrame!

Reading and writing data from and to external sources

In previous examples we created DataFrame objects from scratch. Most of the time, however, we need to read data from external sources such as Excel spreadsheets, CSV files, etc. Such operations can be performed using dedicated read_* and to_* methods. Let’s see some examples.

Suppose we want to create a DataFrame object from the data contained in a Microsoft Excel spreadsheet called “characters.xlsx”. As a first thing we need to ensure the openpyxl library is installed:

$ pip install openpyxl

Than, to read the spreadsheet and create a DataFrame we would write:

characters = pd.read_excel('characters.xlsx')

Similarly, to import data from a CSV file, we would write:

characters = pd.read_csv('characters.csv')

Exporting data contained in a DataFrame to an Excel spreadsheet, is just as easy:

characters.to_excel('characters.xlsx')

To write to a CSV file, instead:

characters.to_csv('characters.csv')



Those methods accepts many parameters which can be used to customize how data is loaded or written into a file. For obvious reasons we cannot explore all of them here; take a look at the pandas API reference for details.

Creating plots

To create plots the matplotlib library is used as a backend, therefore it must be installed:

$ pip install matplotlib

By default matplotlib uses the “agg” backend, which is used to write plots to files. If we want to visualize created plots using a graphical interface, we must install the corresponding backend. For the sake of this example we will use the Qt5 one:

$ pip install pyqt5



Only numeric data is reported in plots. Here is how to create one to visualize the age difference between characters in our DataFrame:

import matplotlib.pyplot as plt
all_characters.plot(x="Name")  

The plot method we can be applied both on DataFrame and Series objects. In the example we used the x parameter to specify the label we want to use for the x axis: we did this to use the name of the characters instead of numeric labels. In order to visualize the plot we call the show function provided by the plt module:

plt.show()
Linear plot representing age difference between some Lord of The Rings characters
Linear plot representing age difference between some Lord of The Rings characters

There are many kind of plots which can be created: the default is the “line” one. We can specify which one we want to use via the kind parameter of the plot method, or use one of the dedicated methods directly. In our case, to create a “bar” vertical plot, we could either write:

all_characters.plot(x="Name", kind="bar")

Or just call:

all_characters.plot.bar(x="Name")  
Characters age difference reported in a "bar" plot
Characters age difference reported in a “bar” plot



To save the plot to a file directly, we can call the savefig function of the plt module and pass the filename as argument:

plt.savefig('lotr_characters_age_comparison')

Closing thoughts

In this tutorial we learned the basic concepts behind the usage of the Python pandas library. We saw how to create and manage tabular data and how to create statistics and plots. We really just scratched the surface on what can be accomplished using this fantastic library. The best way to master it is by using it, and, of course, reading the official documentation.



Comments and Discussions
Linux Forum