How to manipulate Excel spreadsheets with Python and openpyxl

Python is a general purpose programming language that needs no presentations. It was originally written by Guido Van Rossum, and saw its first release in the year 1991. At the moment of writing, the latest stable version of the language is 3.10. In this tutorial we see how to use it together with the openpyxl library to manipulate Excel spreadsheets.

In this tutorial you will learn:

  • How to create a workbook in memory
  • How to retrieve,create,copy,move and remove sheets from a workbook
  • How to create a workbook from a file
  • How to access a range of cells
  • How to iterate over rows and columns
How to manipulate Excel spreadsheets with Python and openpyxl
How to manipulate Excel spreadsheets with Python and openpyxl

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions
Category Requirements, Conventions or Software Version Used
System Distribution independent
Software Python and the openpyxl library
Other None
Conventions # – requires given linux-commands to be executed with root privileges either directly as a root user or by use of sudo command
$ – requires given linux-commands to be executed as a regular non-privileged user

Installing the openpyxl library

To obtain the openpyxl library on our operating system, we can use two methods: the fist consists into installing the package available in the repository of our favorite distribution using its native package manager, the second, universal method, is to make use of pip, the python package manager. Let’s explore both of them.

The openpyxl package is available in the default repositories of some of the most used Linux distributions, such as Debian (and its derivatives), Fedora, and Archlinux. To install the package in the mentioned distributions, we can run, the following commands, respectively:

# Install openpyxl on Debian and derivatives
$ sudo apt install python3-openpyxl

# Install openpyxl on Fedora
$ sudo dnf install python3-openpyxl

# Install openpyxl on Archlinux
$ sudo pacman -S python-openpyxl

The commands above are distribution-specific. If we want to use a cross-distribution method to install openpyxl (or any other python library), we can use pip, the python package manager (pip itself should be installed on our system, of course):

$ pip install openpyxl --user

You can notice that we launched pip without administrative privileges, and with the --user option to install the package only for our user. This is the recommended way to use the package manager. Once the library is installed on our system, we can start working.

Creating a simple spreadsheet in memory

Let’s start easy. To create a workbook all we have to do is to import and work with the Workbook class, which represents a container for all the other parts of a document. When we create an instance of the Workbook class, a new spreadsheet is also created by default. We can access it via the active property:

from openpyxl import Workbook

workbook = Workbook()
spreadsheet = workbook.active



Twitter icon Follow LinuxConfig.org on Twitter for the latest tips and tricks about Linux!


When a new spreadsheet is created it contains no cells. They are created on the fly, so its better to access them directly in order to avoid wasting precious memory. We can reference a cell of the spreadsheet just like a dictionary key. For example, to get the value of the ‘A1’ cell, we would write:

a1_value = spreadsheet['A1']

Similarly, to assign a value to the same cell we would write:

spreadsheet['A1'] = 'Hello World'

An alternative way to access the spreadsheet cells is to use the cell() method of the Worksheet object, and pass the row/column coordinates as arguments:

# Get the value of the cell
a1_value = spreadsheet.cell(row=1, column=1)

# Populate the cell
spreadsheet.cell(row=1, column=1, value='Hello World')

To save the worksheet we created and manipulated, all we have to do is to use the save method of the Workbook object, and pass the name of the destination file as argument. For example, to save the worksheet as worksheet.xlsx, we would run:

workbook.save('worksheet.xlsx')
The simple spreadsheet we just created
The simple spreadsheet we just created

As soon as we invoke this method, a file with the specified name will be created on our filesystem. Here is its content (in this case I opened with Libreoffice calc):

Adding a sheet to a workbook

In the previous example we saw how to access the active spreadsheet of a workbook. As we know, however, a workbook can contain multiple worksheets, so what if we want to create a new one? We can do it via the create_sheet method of the Workbook object:

new_sheet = workbook.create_sheet('new')

The create_sheet method accepts two optional arguments: title and index. We can use the former (it should be a string) to assign a name to the new sheet, and the latter (int) to specify in what position the sheet should be inserted. The method creates and returns the new sheet. In the example above we created a new sheet using “new” as title. The title can be used to later retrieve the spreadsheet:

sheet = workbook['new']

Copying and moving sheets

To copy an existing sheet we can use the copy_worksheet method and pass the worksheet that should be copied as argument. To copy the active worksheet, for example, we would write:

sheet_copy = workbook.copy_worksheet(workbook.active)

The method returns the created copy of the sheet, which in this case we referenced via the sheet_copy variable.

To move an existing sheet into a determined position in the workbook, instead, we can use the move_sheet method which accepts two arguments. The first one, sheet, is mandatory, and is the worksheet that we want to move around, the second is optional (it defaults to 0), and is the offset to use to specify the sheet position. Let’s see an example. In this case the default worksheet “Sheet” is the first one in the workbook. To move it in the second position, we would write:

workbook.move_sheet(workbook["Sheet"], 1)

We can obtain a list of all the sheets belonging to a workbook via the worksheets property.

Removing a sheet

To remove a sheet from a workbook we use the remove method of the Workbook class. The method accepts one mandatory argument, which is the object representing the sheet we want to remove. Supposing we want to remove the “new” sheet from our workbook, we would write:

workbook.remove(workbook['new'])

Creating a workbook from file

Reading an existing excel spreadsheet file is quite simple with openpyxl. All we have to do is to load the load_workbook function from the library. This function only mandatory parameter is filename, which must be the path of the file we want to open. Supposing this file is called spreadsheet.xlsx, we would write:

from openpyxl import load_workbook
workbook = load_workbook('spreadsheet.xlsx')




The method accepts also some optional parameters which are useful to modify how the file is handled:

Parameter Explanation Default
read_only Open the file in reading-optimized mode. It can’t be edited False
keep_vba Whether to preserve vba content False
data_only Whether to preserve formula in cells or report just the value contained in it False
keep_links Whether links to external workbooks should be preserved True

Once the we load the spreadsheet file we can access the spreadsheet(s) via the instance of the Workbook class returned by load_workbook.

Accessing multiple cells

What if we want to get the value of a range of cells, instead of the value of a single one? All we have to do is to specify the range with the following syntax:

cell_values = spreadsheet['A1':'D1']

The result of specifying a range will be a tuple containing a tuple for each row that is specified. In the example above, there is only one row, since we specified cells from A1 to D1 (they are indeed on the same row), so that would be the result:

((<Cell 'Sheet'.A1>, <Cell 'Sheet'.B1>, <Cell 'Sheet'.C1>, <Cell 'Sheet'.D1>),)

If we wanted to get the object representing the cells of the columns ‘A’ to ‘D’ of the first 3 rows, instead, we would write:

cell_values = spreadsheet['A1':'D3']

We would obtain the following result:

(
  (<Cell 'Sheet'.A1>, <Cell 'Sheet'.B1>, <Cell 'Sheet'.C1>, <Cell 'Sheet'.D1>),
  (<Cell 'Sheet'.A2>, <Cell 'Sheet'.B2>, <Cell 'Sheet'.C2>, <Cell 'Sheet'.D2>),
  (<Cell 'Sheet'.A3>, <Cell 'Sheet'.B3>, <Cell 'Sheet'.C3>, <Cell 'Sheet'.D3>)
)

The tuple this time contains three tuples, one for each row, as we said before. To access all the cells of a column, instead we would just specify the column name, without any row number. For example, to get all the cells of the ‘A’ column we would write:

cells = spreadsheet['A']

To obtain all the cells of the columns A to D, instead, we would write:

cells = spreadsheet['A':'D']

Similarly, we can access whole rows by specifying a range of their numbers. To obtain all the cells of the first two rows, we would write:

cells = spreadsheet[1:3]

Iterate over rows and columns Using the iter_rows() and iter_cols() methods

Instead of specifying a range to access the value of a series of cells we can use the iter_rows() and iter_cols() methods of the spreadsheet. Both methods accepts the same optional arguments:

Parameter Explanation
min_row The smallest row index
max_row The largest row index
min_col The smallest column index
max_col The largest column index
values_only Whether only cell values should be returned

In both methods with the min_row/max_row and min_col/max_col parameters we specify the range of rows and columns on which the iteration should take place. The difference between the two is that iter_rows() returns the cells organized by rows, where iter_cols(), instead returns them organized by columns. Let’s see some practical examples. Suppose we want to iterate over the first three rows from the first to the fifth column and want to obtain the cells organized by row. Here is what we would run:

for i in spreadsheet.iter_rows(min_row=1, max_row=3, min_col=1, max_col=5):
  for cell in i:
    print(cell)




Executing the code above returns the following result:

<Cell 'Sheet'.A1>
<Cell 'Sheet'.B1>
<Cell 'Sheet'.C1>
<Cell 'Sheet'.D1>
<Cell 'Sheet'.E1>
<Cell 'Sheet'.A2>
<Cell 'Sheet'.B2>
<Cell 'Sheet'.C2>
<Cell 'Sheet'.D2>
<Cell 'Sheet'.E2>
<Cell 'Sheet'.A3>
<Cell 'Sheet'.B3>
<Cell 'Sheet'.C3>
<Cell 'Sheet'.D3>
<Cell 'Sheet'.E3>

As you can see, the cells are organized by row. To obtain the same cells, this time organized by columns, we would use pass the same arguments to the iter_cols() method:

for i in spreadsheet.iter_rows(min_row=1, max_row=3, min_col=1, max_col=5):
  for cell in i:
    print(i)

The same cells are returned; this time, organized in columns:

<Cell 'Sheet'.A1>
<Cell 'Sheet'.A2>
<Cell 'Sheet'.A3>
<Cell 'Sheet'.B1>
<Cell 'Sheet'.B2>
<Cell 'Sheet'.B3>
<Cell 'Sheet'.C1>
<Cell 'Sheet'.C2>
<Cell 'Sheet'.C3>
<Cell 'Sheet'.D1>
<Cell 'Sheet'.D2>
<Cell 'Sheet'.D3>
<Cell 'Sheet'.E1>
<Cell 'Sheet'.E2>
<Cell 'Sheet'.E3>

Conclusions

In this tutorial we learned how to work with Excel spreadsheet files using the Python openpyxl library. We saw how to create a workbook in memory or from a file, how to retrieve, create, copy, move and remove sheets, how to access a cell and a range of cells, and, finally how to iterate over rows and columns. Are you interested in more tutorials like this? Take a look at our Python articles!



Comments and Discussions
Linux Forum