Pandas is a free and open source Python library released under the BSD license. Originally developed by Wes McKinney in 2008, and supported by companies like Intel and Nvidia, it is targeted at data analysis and manipulation. It can be easily used to read and write data from a variety of sources such as Excel spreadsheets, Comma-separated Values (CSV) files, and many more.
In this tutorial we explore the basic concepts behind pandas, and we learn how to manipulate tabular data and create statistics and plots.
In this tutorial you will learn:
- How to install the pandas library
- How to manipulate tabular data
- How to create statistics
- How to create plots

Category | Requirements, Conventions or Software Version Used |
---|---|
System | Distribution-agnostic |
Software | python3, pandas, matplotlib |
Other | basic knowledge of Python and object oriented programming |
Conventions | # – requires given linux-commands to be executed with root privileges either directly as a root user or by use of sudo command$ – requires given linux-commands to be executed as a regular non-privileged user |
Installation
The pandas library is available in the official repositories of all the most used Linux distributions. To install pandas on Fedora we use the dnf
package manager:
$ sudo dnf install python3-pandas
On Debian and derivatives, instead, we can use apt
:
$ sudo apt install python3-pandas
To install the package on Archlinux we use pacman
:
$ sudo pacman -S python-pandas
Native installations of software via package managers provide the maximum system integration; sometimes, however, the version of an application or a library maintained by a distribution may lag behind upstream; other times we may need a specific version of it, perhaps for compatibility reasons. In those cases, the recommended way to install pandas is to use
pip
, the Python package manager:
$ pip install pandas
This installation method is universal, and doesn’t require privilege escalation. Using pip
we can also install packages inside project-specific virtual environments.
The DataFrame
The first class we need to get familiar with when working with the pandas library is DataFrame
: it is used to represent a bi-dimensional data structure which can be seen as an equivalent of an SQL Table or an Excel Spreadsheet. Later in this tutorial we will actually see how a DataFrame
object can be initialized by reading data from external sources as the ones we just mentioned; for the moment, however, let’s see how to create one from scratch.
Creating a DataFrame from scratch is pretty simple. One way to do it is to invoke the class constructor and pass a dictionary as the first argument. When we pass data this way, dictionary keys are used as column labels, and the associated values are used to populate the “column”. Let’s clarify this with an example. Suppose we want to create a
DataFrame
object with the following information about some characters from “The Lord of the Rings” book:
Name | Age | Race |
---|---|---|
Frodo | 50 | Hobbit |
Galadriel | 8000 | Elf |
Aragorn | 87 | Man |
Names and races are represented by strings, while integers are used to represent the age of each character. Here is how we could initialize our DataFrame
:
import pandas as pd
data = {
"Name": ["Frodo", "Galadriel", "Aragorn"],
"Age": [50, 8000, 87],
"Race": ["Hobbit", "Elf", "Man"]
}
characters = pd.DataFrame(data)
As an alternative, we can pass the same data by using a list of lists, each one representing a “row”. When we initialiaze the object this way, we must provide the column labels explicitly, using the columns
parameter:
data = [
['Frodo', 50, 'Hobbit'],
['Galadriel', 8000, 'Elf'],
['Aragorn', 87, 'Man']
]
characters = pd.DataFrame(data, columns=['Name', 'Age', 'Race'])
To get a “pretty” representation of a DataFrame
object we can just “print” it:
print(characters)
The result:
Name Age Race
0 Frodo 50 Hobbit
1 Galadriel 8000 Elf
2 Aragorn 87 Man
Index labels and index positions
If not otherwise specified, as you can see above, numeric labels are used to identify rows. These labels are often referred to as “index”. In such cases they correspond to their positions, which starts from 0. Index labels, however, can be explicitly defined at initialization time via the index
parameter, or modified by using the rename
method:
characters.rename(index={0: "r0", 1: "r1", 2: "r2"}, inplace=True)
Here is how our DataFrame
object looks like after invoking the method above (the object is changed in place since we used inplace=True
):
Name Age Race
r0 Frodo 50 Hobbit
r1 Galadriel 8000 Elf
r2 Aragorn 87 Man
The label of the first index in the DataFrame
above is r0
, but the position of the index is 0. We can select rows and columns by using both index labels and index positions. To select the first row by its index label, for example, we would use the loc
property:
characters.loc['r0']
To select it by index position, instead, we would use the iloc
property:
characters.iloc[0]
Working with rows
We saw some examples of how to select specific rows by their index labels and positions. Now let’s see how to filter rows using conditions. Suppose we want to retrieve all the rows relative to characters older than 50 years. We need to test where the value of the “Age” column is > 50. Here is how we can do it:
older_than_fifty = characters[characters['Age'] > 50]
In the example above we used the
characters['Age'] > 50
condition: it is applied to the DataFrame
object and returns a Series
object. In pandas, a “Series” object represents a one-dimensional array which can contain any data type. In this case, it will contain boolean values: True
when a value in the “Age” column matches the condition, False
, otherwise. Here is the “printed” representation of the object:
0 False
1 True
2 True
The Series
object returned by the condition is used as a “filter” on the original DataFrame
object. Only the corresponding rows are returned:
Name Age Race
1 Galadriel 8000 Elf
2 Aragorn 87 Man
Multiple conditions can be specified using logical operators. Suppose we want to select rows relative to characters older than 50 years and whose names start with an “A”. We would write:
characters[(characters['Age'] > 50) & (characters['Name'].str.startswith('A'))]
We enclosed each condition between parentheses and used one of the pandas logical operators (&
, |
, ~
), which are the equivalent to Python’s and
, or
and not
. In the example above you can also see we used an accessor: str
. Accessors let us access values of Series
and modify them by using the available methods, which obviously change according to the data type. In this case, since characters names are represented by strings, we used the str
accessor and applied the startswith
method, which works similarly to the native Python method with the same name. As expected, applying the condition returns the following result:
Name Age Race
2 Aragorn 87 Man
Adding and replacing rows
Adding rows to an existing DataFrame
object is very simple. We saw how to use the loc
and iloc
properties to retrieve specific rows, but they can be also be used to assign values. To create a new row we just have to provide the new index label (or the index position if using iloc
) we want to use for it, and assign the row values. Suppose we want to add a new character to our DataFrame
object. We could write:
characters.loc[3] = {'Name': 'Bilbo', 'Age': 110, 'Race': 'Hobbit'}
If we use an already existing index, the existing row will be replaced. For instance, to replace the row with index label 1 (the one about Galadriel) with a row holding information about the Theoden character, we would write:
characters.loc[1] = {'Name': 'Theoden', 'Age': 71, 'Race': 'Man'}
What if we want to add multiple rows at once? To accomplish such task we can create a new DataFrame
object with the rows we want to append, and than merge it with the original DataFrame
using the concat
function:
additional_characters = pd.DataFrame({
'Name': ['Boromir', 'Denethor', 'Elrond'],
'Age': [41, 89, 6000],
'Race': ['Man', 'Man', 'Elf']
})
all_characters = pd.concat([characters, additional_characters], ignore_index=True)
In the example above we called concat
and passed the list of the DataFrame
objects we want to merge as first argument. We also specified we don’t want to preserve existing index labels, by setting the ignore_index
argument to True
.
Removing rows
To remove a row, we can use the drop
method of the DataFrame
class. The drop
method works with index labels, so supposing we want to remove the first row from the table (index label 0), we would run:
characters.drop(0, inplace=True)
The first and only positional argument accepted by the method can be a single label or a list of index labels. By default, the drop
method doesn’t modify the original object in place but returns a new object. If we want to alter this behavior, we need to set the inplace parameter to True
, as we did in the example above.
Working with columns
Suppose we want to retrieve only the values of the “Age” column from our DataFrame
object. It is a very simple operation. All we have to do is to use the column label as a dictionary key:
ages = characters['Age']
Accessing the DataFrame
object column this way, as we already saw, returns a Series
object:
0 50
1 8000
2 87
Name: Age, dtype: int64
Multiple columns can be “sliced” from a
DataFrame
object at once. For example:
subset = characters[['Name', 'Age']]
Selecting multiple columns returns a new DataFrame
object (Series
object are one-dimensional).
Adding columns
Now, let’s say we want to add a new column to our DataFrame
object to hold information about characters residence. Once again, we can do it as if we were adding a new key to a dictionary. Here is what we would write:
characters['Residence'] = ['The Shire', 'Lothlorien', 'Minas Tirith']
The original DataFrame
now should look like that:
Name Age Race Residence
0 Frodo 50 Hobbit The Shire
1 Galadriel 8000 Elf Lothlorien
2 Aragorn 87 Man Minas Tirith
The methods above modifies the DataFrame
object in place. If instead we want the result to be assigned to a new object, we would use the assign
method:
result = characters.assign(Residence=['The Shire', 'Lothlorien', 'Minas Tirith'])
To insert a column at a specific location, instead, we can use the
insert
method. Say for example we want to insert the column after the one labeled “Name”, which is the first (position 0). We would write:
characters.insert(1, "Residence", ['The Shire', 'Lothlorien', 'Minas Tirith'])
Deleting columns
To delete a column we can use the same method we use to remove rows. The only thing we need to change is the axis
parameter, setting it to 1
(the default is 0), to specify we want to work with columns. For instance, to remove the column with the “Name” label, we would write:
characters.drop("Name", axis=1, inplace=True)
Alternatively, we could specify the name of columns to drop via the columns
parameter:
characters.drop(columns=["Name"], inplace=True)
Accessing single values
Until now we saw some examples of how to manage entire rows and columns. Suppose, instead, we want to read or modify only the age of the “Frodo” character in our DataFrame
object, without replacing the entire row. To retrieve a single value of a row/column pair (a “cell” in spreadsheet terminology), we can use the at
property:
frodo_age = characters.at[0, 'Age']
To assign a new value, instead:
characters.at[0, 'Age'] = 51
The
at
property, similarly to loc
works with “labels”, so the first argument we used in the example above, between square brackets, is not the row index position, but the index label. If we want to work with index positions, we need use iat
, instead. Since the row related to “Frodo” is the first one (position 0), and the “Age” column is the second one (position 1), we would write:
characters.iat[0, 1] = 51
Perform calculations
Being able to perform calculations based on existing data is crucial. The pandas library let us perform such operations quite easily. Here is an example. Suppose we want to calculate the average age of the non-Elf characters in the “all_characters” DataFrame
object we defined previously. Here is how we could proceed. First we select the rows in which the value of the “Race” column is not “Elf”: this operation, as we already saw, returns a new DataFrame
object. We “slice” the “Age” column from it, obtaining a Series
object on which we invoke the mean
method:
all_characters[all_characters['Race'] != 'Elf']['Age'].mean()
As expected, we obtain
66.75
as result. Another example. This time we want to find the youngest non-elf character:
characters[characters['Race'] != 'Elf']['Age'].min()
What if we want to obtain the average age of characters based on their race?. In such situations we have to use the groupby
method:
all_characters.groupby('Race')['Age'].mean()
We obtain the following result:
Race
Elf 7000.000000
Hobbit 50.000000
Man 72.333333
These are only few basic examples of the computations we can perform on the data contained in our DataFrame
!
Reading and writing data from and to external sources
In previous examples we created DataFrame
objects from scratch. Most of the time, however, we need to read data from external sources such as Excel spreadsheets, CSV files, etc. Such operations can be performed using dedicated read_*
and to_*
methods. Let’s see some examples.
Suppose we want to create a DataFrame
object from the data contained in a Microsoft Excel spreadsheet called “characters.xlsx”. As a first thing we need to ensure the openpyxl
library is installed:
$ pip install openpyxl
Than, to read the spreadsheet and create a DataFrame
we would write:
characters = pd.read_excel('characters.xlsx')
Similarly, to import data from a CSV file, we would write:
characters = pd.read_csv('characters.csv')
Exporting data contained in a DataFrame
to an Excel spreadsheet, is just as easy:
characters.to_excel('characters.xlsx')
To write to a CSV file, instead:
characters.to_csv('characters.csv')
Those methods accepts many parameters which can be used to customize how data is loaded or written into a file. For obvious reasons we cannot explore all of them here; take a look at the pandas API reference for details.
Creating plots
To create plots the matplotlib
library is used as a backend, therefore it must be installed:
$ pip install matplotlib
By default matplotlib uses the “agg” backend, which is used to write plots to files. If we want to visualize created plots using a graphical interface, we must install the corresponding backend. For the sake of this example we will use the Qt5 one:
$ pip install pyqt5
Only numeric data is reported in plots. Here is how to create one to visualize the age difference between characters in our
DataFrame
:
import matplotlib.pyplot as plt
all_characters.plot(x="Name")
The plot
method we can be applied both on DataFrame
and Series
objects. In the example we used the x
parameter to specify the label we want to use for the x axis: we did this to use the name of the characters instead of numeric labels. In order to visualize the plot we call the show
function provided by the plt
module:
plt.show()

There are many kind of plots which can be created: the default is the “line” one. We can specify which one we want to use via the kind
parameter of the plot
method, or use one of the dedicated methods directly. In our case, to create a “bar” vertical plot, we could either write:
all_characters.plot(x="Name", kind="bar")
Or just call:
all_characters.plot.bar(x="Name")

To save the plot to a file directly, we can call the
savefig
function of the plt
module and pass the filename as argument:
plt.savefig('lotr_characters_age_comparison')
Closing thoughts
In this tutorial we learned the basic concepts behind the usage of the Python pandas library. We saw how to create and manage tabular data and how to create statistics and plots. We really just scratched the surface on what can be accomplished using this fantastic library. The best way to master it is by using it, and, of course, reading the official documentation.