Date: 2020-05-22
On Linux and other Unix-like operating systems, tar is undoubtedly one of the most widely used archiving utilities; it lets us create archives, often called “tarballs”, which we can use for source code distribution or backup purposes. In this tutorial we will see how to read, create and modify tar archives with Python, using the tarfile module.
You will learn:
- The modes in which a tar archive can be opened using the tarfile module
- What are the TarInfo and TarFile classes and what they represent
- How to list the content of a tar archive
- How to extract the content of a tar archive
- How to add files to a tar archive
Software requirements and conventions used
| Category | Requirements, Conventions or Software Version Used |
| System | Distribution-independent |
| Software | Python3 |
| Other | Basic knowledge of Python3 and object-oriented programming |
| Conventions | # - requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command; $ - requires given Linux commands to be executed as a regular non-privileged user |
Basic usage
The tarfile module is included in the Python standard library, so we don’t need to install it separately; to use it, we just need to import it. The recommended way to access a tarball with this module is the open function; in its most basic usage, we provide as the first and second arguments:
- The name of the tarball we want to access
- The mode in which it should be opened
The “mode” used to open a tar archive depends on the action we want to perform and on the type of compression (if any) in use. Let’s see them together.
Opening an archive in read-only mode
If we want to examine or extract the content of a tar archive, we can use one of the following modes, to open it read-only:
| Mode | Meaning |
| ‘r’ | Read-only mode - the compression type is detected and handled automatically |
| ‘r:’ | Read-only mode without compression |
| ‘r:gz’ | Read-only mode - gzip compression explicitly specified |
| ‘r:bz2’ | Read-only mode - bzip2 compression explicitly specified |
| ‘r:xz’ | Read-only mode - lzma compression explicitly specified |
In most cases, where the compression method can be easily detected, the recommended mode to use is ‘r’.
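As a rough illustration of the automatic detection, the sketch below writes the same file into three archives and reads each one back with the plain ‘r’ mode. The temporary directory and the archive names are made up for the example.

```python
import os
import tarfile
import tempfile

# Throwaway working directory; every path below is illustrative.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file1.txt")
with open(source, "w") as handle:
    handle.write("hello\n")

# Write the same file into an uncompressed, a gzip and a bzip2 archive.
detected = {}
for suffix, write_mode in ((".tar", "w"), (".tar.gz", "w:gz"), (".tar.bz2", "w:bz2")):
    path = os.path.join(workdir, "archive" + suffix)
    with tarfile.open(path, write_mode) as archive:
        archive.add(source, arcname="file1.txt")
    # Plain 'r' lets tarfile work out the compression on its own.
    with tarfile.open(path, "r") as archive:
        detected[suffix] = archive.getnames()

print(detected)
```

The same reading code works for all three archives, which is why ‘r’ is usually the most convenient choice.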
Opening an archive to append files
If we want to append files to an existing archive we can use the ‘a’ mode. It’s important to notice that it’s possible to append to an archive only if it is not compressed: there is no compressed variant of this mode, so requesting one (for example ‘a:gz’) raises a ValueError exception, while trying to open an existing compressed archive with ‘a’ fails with a tarfile.ReadError. If we reference a non-existing archive, it is created on the fly.
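A minimal sketch of the append mode follows, assuming a scratch directory and two small files created just for the demonstration. It appends to the same uncompressed archive twice, then shows that a compressed append mode is rejected.

```python
import os
import tarfile
import tempfile

# Scratch directory and file names are illustrative.
workdir = tempfile.mkdtemp()
for name in ("file1.txt", "file2.txt"):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(name + "\n")

archive_path = os.path.join(workdir, "archive.tar")

# First call: the archive does not exist yet, so 'a' creates it.
with tarfile.open(archive_path, "a") as archive:
    archive.add(os.path.join(workdir, "file1.txt"), arcname="file1.txt")

# Second call: the existing member is kept and a new one is appended.
with tarfile.open(archive_path, "a") as archive:
    archive.add(os.path.join(workdir, "file2.txt"), arcname="file2.txt")

with tarfile.open(archive_path, "r") as archive:
    names = archive.getnames()

# Asking for a compressed append mode is rejected with ValueError.
try:
    tarfile.open(os.path.join(workdir, "archive.tar.gz"), "a:gz")
    append_error = None
except ValueError as error:
    append_error = error

print(names, append_error)
```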
Opening an archive for writing
If we want to explicitly create a new archive and open it for writing, we can use one of the following modes:
| Mode | Meaning |
| ‘w’ | Open the archive for writing - use no compression |
| ‘w:gz’ | Open the archive for writing - use gzip compression |
| ‘w:bz2’ | Open the archive for writing - use bzip2 compression |
| ‘w:xz’ | Open the archive for writing - use lzma compression |
If an existing archive file is opened for writing, it is truncated, so all its content is discarded. To avoid such situations, we may want to open the archive exclusively, as described in the next section.
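The truncation is easy to overlook, so here is a small sketch of it, using file and archive names invented for the example. The archive is written twice with ‘w:gz’, and only the member added the second time survives.

```python
import os
import tarfile
import tempfile

# Scratch directory and file names are illustrative.
workdir = tempfile.mkdtemp()
for name in ("file1.txt", "file2.txt"):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(name + "\n")

archive_path = os.path.join(workdir, "backup.tar.gz")

# Create a gzip-compressed archive holding file1.txt.
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(os.path.join(workdir, "file1.txt"), arcname="file1.txt")

# Opening the same path for writing again discards the previous content.
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(os.path.join(workdir, "file2.txt"), arcname="file2.txt")

with tarfile.open(archive_path, "r") as archive:
    remaining = archive.getnames()

print(remaining)
```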
Create an archive only if it doesn’t exist
When we want to be sure an existing file is not overwritten when creating an archive, we must open it exclusively. If we use the ‘x’ mode and a file with the same name as the one we specified for the archive already exists, a FileExistsError is raised. The compression methods can be specified as follows:
| Mode | Meaning |
| ‘x’ | Create the archive without compression only if it doesn’t exist |
| ‘x:gz’ | Create the archive with gzip compression only if it doesn’t exist |
| ‘x:bz2’ | Create the archive with bzip2 compression only if it doesn’t exist |
| ‘x:xz’ | Create the archive with lzma compression only if it doesn’t exist |
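One possible way to use the exclusive mode is sketched below. The create_once helper and the archive path are hypothetical names introduced for this example; they are not part of the tarfile module.

```python
import os
import tarfile
import tempfile

# Illustrative location for the archive.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "backup.tar.gz")


def create_once(path):
    """Hypothetical helper: create an empty gzip archive unless the path is taken."""
    try:
        with tarfile.open(path, "x:gz"):
            pass  # members would be added here
    except FileExistsError:
        return False
    return True


first_attempt = create_once(archive_path)   # the file does not exist yet
second_attempt = create_once(archive_path)  # the file is already there
print(first_attempt, second_attempt)
```

Catching FileExistsError keeps the existing archive untouched, whereas ‘w:gz’ would have silently replaced it.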
Working with archives
The tarfile module provides two classes that are used to interact with tar archives and their contents: TarFile and TarInfo. The former represents a tar archive in its entirety and can be used as a context manager with the Python with statement; the latter represents an archive member and contains various information about it. As a first step, we will focus on some of the most often used methods of the TarFile class: we can use them to perform common operations on tar archives.
Retrieving a list of the archive members
To retrieve a list of the archive members we can use the getmembers method of a TarFile object. This method returns a list of TarInfo objects, one for each archive member. Here is an example of its usage with a dummy compressed archive containing two files:
>>> import tarfile
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.getmembers()
...
[<TarInfo 'file1.txt' at 0x7f58dab50d00>, <TarInfo 'file2.txt' at 0x7f58dab50ac0>]
As we will see later, we can access the attributes of an archived file, such as its ownership and modification time, via the properties and methods of the corresponding TarInfo object.
Displaying the content of a tar archive
If all we want to do is display the content of a tar archive, we can open it in read mode and use the list method of the TarFile class.
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.list()
...
?rw-r--r-- egdoc/egdoc 0 2020-05-16 15:45:45 file1.txt
?rw-r--r-- egdoc/egdoc 0 2020-05-16 15:45:45 file2.txt
As you can see, the list of the files contained in the archive is displayed as output. The list method accepts a positional parameter, verbose, which is True by default. If we change its value to False, only the file names are reported, with no additional information.
The method also accepts an optional named parameter, members. If used, the argument provided must be a subset of the list of TarInfo objects returned by the getmembers method; only information about the specified members is then displayed.
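The two parameters can be combined. The sketch below, which builds its own small archive in a temporary directory (all names are illustrative), prints only the names of the members and then restricts the listing to a single member.

```python
import os
import tarfile
import tempfile

# Build a small sample archive; directory and file names are illustrative.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    for name in ("file1.txt", "file2.txt"):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write(name + "\n")
        archive.add(path, arcname=name)

with tarfile.open(archive_path, "r") as archive:
    # Names only, for every member.
    archive.list(verbose=False)

    # Names only, restricted to a subset of the TarInfo objects.
    wanted = [member for member in archive.getmembers() if member.name == "file2.txt"]
    archive.list(verbose=False, members=wanted)
```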
Extracting all members from the tar archive
Another very common operation we may want to perform on a tar archive is to extract all its content. To perform such an operation we can use the extractall method of the corresponding TarFile object. Here is what we would write:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.extractall()
The first parameter accepted by the method is path: it is used to specify where the members of the archive should be extracted. The default value is '.', so the members are extracted into the current working directory.
The second parameter, members, can be used to specify a subset of members to extract from the archive and, as in the case of the list method, it should be a subset of the list returned by the getmembers method.
The extractall method also has a named parameter, numeric_owner. It is False by default: if we change it to True, the numeric uid and gid are used to set the ownership of the extracted files instead of user and group names.
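Here is a sketch of extractall with both path and members, using a sample archive and a destination directory invented for the example. One detail goes beyond what is described above: recent Python releases also accept a filter argument on the extraction methods, and the code passes filter='data' only when the running interpreter advertises support for it. Treat that part as an optional hardening step, not as something the tutorial relies on.

```python
import os
import tarfile
import tempfile

# Build a sample archive with three members; all names are illustrative.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    for name in ("file1.txt", "file2.txt", "file1.md"):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write(name + "\n")
        archive.add(path, arcname=name)

destination = os.path.join(workdir, "extracted")

# Assumption: the 'filter' argument exists only on newer Python releases,
# so it is passed only when tarfile exposes data_filter.
extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

with tarfile.open(archive_path, "r") as archive:
    # Extract only the .txt members, into a dedicated directory.
    txt_members = [m for m in archive.getmembers() if m.name.endswith(".txt")]
    archive.extractall(path=destination, members=txt_members, **extra)

extracted = sorted(os.listdir(destination))
print(extracted)
```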
Extracting only one member from the archive
What if we want to extract only a single file from the archive? In that case we want to use the extract method and reference the file that should be extracted by its name (or as a TarInfo object). For example, to extract only the file1.txt file from the tarball, we would run:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.extract('file1.txt')
Easy, isn’t it? The file is extracted into the current working directory by default, but a different location can be specified using the second parameter accepted by the method: path.
Normally the attributes the file has inside the archive are applied when it is extracted to the filesystem; to avoid this behavior we can set the third parameter of the method, set_attrs, to False.
The method also accepts the numeric_owner parameter: its usage is the same as we saw in the context of the extractall method.
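To make the effect of set_attrs visible, the following sketch archives a file whose modification time was deliberately set far in the past, then extracts it with set_attrs=False. The paths and the chosen timestamp are arbitrary values picked for the example.

```python
import os
import tarfile
import tempfile

# Illustrative scratch directory and source file.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file1.txt")
with open(source, "w") as handle:
    handle.write("hello\n")

# Give the file an old, recognisable modification time (one day after the epoch).
os.utime(source, (86400, 86400))

archive_path = os.path.join(workdir, "archive.tar")
with tarfile.open(archive_path, "w") as archive:
    archive.add(source, arcname="file1.txt")

destination = os.path.join(workdir, "out")
with tarfile.open(archive_path, "r") as archive:
    stored_mtime = archive.getmember("file1.txt").mtime
    # With set_attrs=False the stored mode and modification time are not applied.
    archive.extract("file1.txt", path=destination, set_attrs=False)

extracted_mtime = os.path.getmtime(os.path.join(destination, "file1.txt"))
print(stored_mtime, extracted_mtime)
```

The extracted file carries the time at which it was written to disk, not the old timestamp recorded in the archive.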
Extracting an archive member as a file-like object
We saw how, by using the extractall and extract methods, we can extract one or multiple tar archive members to the filesystem. The tarfile module provides another extraction method: extractfile. When this method is used, the specified file is not extracted to the filesystem; instead, a read-only file-like object representing it is returned:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     fileobj = archive.extractfile('file1.txt')
...     fileobj.writable()
...     fileobj.read()
...
False
b'hello\nworld\n'
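One detail worth keeping in mind when looping over members: extractfile returns None for members that are neither regular files nor links, such as directories. The sketch below, built around a made-up docs directory, reads every regular file into a dictionary and skips the rest.

```python
import os
import tarfile
import tempfile

# Build an archive containing a directory and one file; names are illustrative.
workdir = tempfile.mkdtemp()
docs_dir = os.path.join(workdir, "docs")
os.mkdir(docs_dir)
with open(os.path.join(docs_dir, "readme.txt"), "wb") as handle:
    handle.write(b"hello\nworld\n")

archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(docs_dir, arcname="docs")

contents = {}
with tarfile.open(archive_path, "r") as archive:
    for member in archive.getmembers():
        fileobj = archive.extractfile(member)
        if fileobj is None:
            # Directories and other special members have no data to read.
            continue
        with fileobj:
            contents[member.name] = fileobj.read()

print(contents)
```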
Adding files to an archive
So far we have seen how to obtain information about an archive and its members, and the different methods we can use to extract its content; now it’s time to see how we can add new members.
The easiest way to add a file to an archive is by using the add method. We reference the file to be included in the archive by name, which is the first parameter accepted by the method. The file is archived with its original name, unless we specify an alternative one using the second positional parameter: arcname. Suppose we want to add file1.txt to a new archive, but we want to store it as archived_file1.txt; we would write:
>>> with tarfile.open('new_archive.tar', 'w') as archive:
...     archive.add('file1.txt', 'archived_file1.txt')
...     archive.list()
...
-rw-r--r-- egdoc/egdoc 12 2020-05-16 17:49:44 archived_file1.txt
In the example above, we created a new uncompressed archive using the ‘w’ mode and added file1.txt as archived_file1.txt, as you can see from the output of list().
Directories can be archived in the same way: by default they are added recursively, that is, together with their content. This behavior can be changed by setting the third positional parameter accepted by the add method, recursive, to False.
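The difference between the two behaviors can be sketched as follows, assuming a small project directory created just for the example. The same directory is added once with recursive=False and once with the default setting.

```python
import os
import tarfile
import tempfile

# Illustrative directory with two files inside.
workdir = tempfile.mkdtemp()
project_dir = os.path.join(workdir, "project")
os.mkdir(project_dir)
for name in ("a.txt", "b.txt"):
    with open(os.path.join(project_dir, name), "w") as handle:
        handle.write(name + "\n")

# recursive=False stores only the directory entry itself.
with tarfile.open(os.path.join(workdir, "shallow.tar"), "w") as archive:
    archive.add(project_dir, arcname="project", recursive=False)
    shallow = archive.getnames()

# The default (recursive=True) also stores everything below the directory.
with tarfile.open(os.path.join(workdir, "full.tar"), "w") as archive:
    archive.add(project_dir, arcname="project")
    full = sorted(archive.getnames())

print(shallow)
print(full)
```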
What if we want to apply a filter, so that only specified files are included in the archive? For this purpose we can use the optional filter named parameter. The value passed to this parameter must be a function that takes a TarInfo object as argument and returns said object if it must be included in the archive, or None if it must be excluded. Let’s see an example. Suppose we have three files in our current working directory: file1.txt, file2.txt and file1.md. We want to add only the files with the .txt extension to the archive; here is what we could write:
>>> import os
>>> import tarfile
>>> with tarfile.open('new_archive.tar', 'w') as archive:
...     for i in os.listdir():
...         archive.add(i, filter=lambda x: x if x.name.endswith('.txt') else None)
...     archive.list()
...
-rw-r--r-- egdoc/egdoc 0 2020-05-16 18:26:20 file2.txt
-rw-r--r-- egdoc/egdoc 0 2020-05-16 18:22:13 file1.txt
In the example above we used the os.listdir function to obtain a list of the files contained in the current working directory. Iterating over that list, we used the add method to add each file to the archive. We passed a function as the argument of the filter parameter, in this case an anonymous one, a lambda. The function takes the TarInfo object as its argument (x) and returns it if its name (name is one of the attributes of the TarInfo object) ends with “.txt”. Otherwise, the function returns None, so the file is not archived.
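A filter does not have to be a lambda, and it may also modify the TarInfo object before returning it. The sketch below uses a named function, keep_txt_as_root (a hypothetical name chosen for this example), that keeps directories so the recursion can continue, drops every file that is not a .txt file, and resets the ownership recorded for the members it keeps.

```python
import os
import tarfile
import tempfile

# Illustrative directory with three files inside.
workdir = tempfile.mkdtemp()
project_dir = os.path.join(workdir, "project")
os.mkdir(project_dir)
for name in ("file1.txt", "file2.txt", "file1.md"):
    with open(os.path.join(project_dir, name), "w") as handle:
        handle.write(name + "\n")


def keep_txt_as_root(tarinfo):
    """Hypothetical filter: keep directories and .txt files, owned by root."""
    if not tarinfo.isdir() and not tarinfo.name.endswith(".txt"):
        return None  # excluded from the archive
    # Overwrite the ownership stored in the archive (not the one on disk).
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = "root"
    return tarinfo


archive_path = os.path.join(workdir, "filtered.tar")
with tarfile.open(archive_path, "w") as archive:
    archive.add(project_dir, arcname="project", filter=keep_txt_as_root)

with tarfile.open(archive_path, "r") as archive:
    members = archive.getmembers()

names = sorted(member.name for member in members)
owners = {member.uname for member in members}
print(names, owners)
```

Returning None for a directory would exclude it together with everything below it, which is why the function lets directories through.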
The TarInfo object
We already learned that a TarInfo object represents a tar archive member: it stores the attributes of the referenced file and provides some methods which can help us identify the file type. The TarInfo object doesn’t contain the actual file data. Some of the attributes of the TarInfo object are:
- name (name of the file)
- size (file size)
- mtime (file modification time)
- uid (the user id of the file owner)
- gid (the id of the file group)
- uname (the user name of the file owner)
- gname (the name of the file group)
The object also has some very useful methods; here are some of them:
- isfile() - Returns True if the file is a regular file, False otherwise
- isdir() - Returns True if the file is a directory, False otherwise
- issym() - Returns True if the file is a symbolic link, False otherwise
- isblk() - Returns True if the file is a block device, False otherwise
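The attributes and methods listed above can be put together in a short sketch that summarizes each member of an archive. The archive is created on the spot from a made-up project directory, and the summary format is only one possible choice.

```python
import os
import tarfile
import tempfile
from datetime import datetime, timezone

# Build a sample archive with a directory and one file; names are illustrative.
workdir = tempfile.mkdtemp()
project_dir = os.path.join(workdir, "project")
os.mkdir(project_dir)
with open(os.path.join(project_dir, "notes.txt"), "wb") as handle:
    handle.write(b"hello\nworld\n")

archive_path = os.path.join(workdir, "archive.tar")
with tarfile.open(archive_path, "w") as archive:
    archive.add(project_dir, arcname="project")

summary = []
with tarfile.open(archive_path, "r") as archive:
    for member in archive.getmembers():
        # Use the type-checking methods to label the member.
        if member.isdir():
            kind = "dir"
        elif member.isfile():
            kind = "file"
        else:
            kind = "other"
        # mtime is a timestamp in seconds since the epoch.
        modified = datetime.fromtimestamp(member.mtime, tz=timezone.utc)
        summary.append((member.name, kind, member.size))
        print(member.name, kind, member.size, member.uname, modified.isoformat())
```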
Conclusions
In this tutorial we learned the basic usage of the tarfile Python module, and we saw how we can use it to work with tar archives. We saw the various operating modes, what the TarFile and TarInfo classes represent, and some of the most used methods to list the content of an archive, to add new files or to extract them. For more in-depth knowledge of the tarfile module, please take a look at the official documentation of the module.