How to create incremental backups using rsync on Linux

In previous articles, we already talked about how we can perform local and remote backups using rsync and how to set up the rsync daemon. In this tutorial we will learn a very useful technique we can use to perform incremental backups, and schedule them using the good old cron.

In this tutorial you will learn:

  • The difference between hard and symbolic links
  • What is an incremental backup
  • How the rsync --link-dest option works
  • How to create incremental backups using rsync
  • How to schedule backups using cron

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions

Category      Requirements, Conventions or Software Version Used
System        Distribution independent
Software      Rsync
Other         None
Conventions   # – linux-commands to be executed with root privileges either directly as a root user or by use of sudo command
              $ – linux-commands to be executed as a regular non-privileged user


Hard vs symbolic links

Before we proceed further and learn how to create incremental backups with rsync, we should take some time to clearly grasp the difference between symbolic and hard links, since the latter will have a crucial role in our implementation (you can skip this part if it sounds obvious to you).

On Unix-based systems like Linux we have two types of “links”: hard and symbolic. The ln command generates hard links by default; if we want to create symbolic links we must invoke it with the -s option (short for --symbolic).

To understand how hard links work, we must focus on the concept of an inode. An inode is a data structure on the filesystem which contains various information about a file or a directory (which, by the way, is just a “special” kind of file), such as its permissions and the location of the hard disk blocks containing the actual data.

At this point you may think the name of a file is also “stored” in its inode: this is not the case. What we commonly call “file names” are just human-friendly references to inodes established inside directories.

A directory can contain more than one reference to the same inode, and references to it can also live in different directories of the same filesystem: those references are what we call hard links. All files have (of course) at least one hard link.

Hard links have two major limitations: they don’t work across filesystems and cannot be used for directories.

When the count of hard links for an inode reaches 0 (and no process still has the file open), the inode itself is freed and the blocks it references become available for reuse by the filesystem. The actual data is not wiped immediately and can sometimes be recovered, unless it is overwritten by new data. The count of hard links associated with an inode is reported in the output of the ls command when it is called with the -l option:

$ ls -l ~/.bash_logout
-rw-r--r--. 1 egdoc egdoc 18 Jan 28 13:45 /home/egdoc/.bash_logout

In the output above, just after the permissions notation, we can clearly see that ~/.bash_logout is the only reference (the only hard link) to its specific inode. Let’s create another hard link, and see how the output of the command changes:

$ ln ~/.bash_logout ~/bash_logout && ls -l ~/.bash_logout
-rw-r--r--. 2 egdoc egdoc 18 Jan 28 13:45 /home/egdoc/.bash_logout


As expected, the hard link count has been incremented by one and is now 2. Again: ~/.bash_logout and ~/bash_logout are not two different files; they are just two directory entries pointing to the same inode. This can easily be demonstrated by running ls, this time with the -i option (short for --inode), which includes the inode number in the output:

$ ls -li ~/.bash_logout ~/bash_logout
131079 -rw-r--r--. 2 egdoc egdoc 18 Jan 28 13:45 /home/egdoc/.bash_logout
131079 -rw-r--r--. 2 egdoc egdoc 18 Jan 28 13:45 /home/egdoc/bash_logout

As you can see, the referenced inode is 131079 in both lines.
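The same check can be scripted. What follows is only a minimal sketch, assuming GNU stat (from coreutils) is available; it works in a throwaway directory created with mktemp, and the file names original.txt and alias.txt are made up for the example:

```shell
#!/bin/bash
# Sketch: verify that two names refer to the same inode.
# Assumes GNU stat; the file names are illustrative only.

workdir="$(mktemp -d)"

echo "some content" > "${workdir}/original.txt"
ln "${workdir}/original.txt" "${workdir}/alias.txt"

# %i prints the inode number, %h the number of hard links
inode_original="$(stat -c '%i' "${workdir}/original.txt")"
inode_alias="$(stat -c '%i' "${workdir}/alias.txt")"
link_count="$(stat -c '%h' "${workdir}/original.txt")"

echo "inode of original.txt: ${inode_original}"
echo "inode of alias.txt:    ${inode_alias}"
echo "hard link count:       ${link_count}"
```

The two inode numbers printed should be identical, and the link count should be 2, matching what ls -li shows.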

Symbolic links are different. They are a more modern concept and overcome both limitations of hard links: they can point to directories and can cross filesystem boundaries. A symbolic link is a special kind of file which stores the path of another file (its target). Removing a symbolic link doesn’t affect its target: deleting all the symbolic links to a file doesn’t cause the original file to be deleted. On the other hand, deleting the “target” file breaks the symbolic link(s) pointing to it.
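To see the difference in practice, here is a small sketch that creates both kinds of link to the same file and then removes the original name. It runs in a temporary directory, and the names target.txt, hard.txt and sym.txt are invented for the example:

```shell
#!/bin/bash
# Sketch: what happens to each kind of link when the original name is removed.
# The file names are illustrative only.

workdir="$(mktemp -d)"

echo "hello" > "${workdir}/target.txt"
ln "${workdir}/target.txt" "${workdir}/hard.txt"   # second name for the same inode
ln -s target.txt "${workdir}/sym.txt"              # separate file storing the path "target.txt"

rm "${workdir}/target.txt"

# The data is still reachable through the remaining hard link
hard_content="$(cat "${workdir}/hard.txt")"
echo "hard.txt contains: ${hard_content}"

# The symbolic link itself still exists, but its target is gone (dangling link)
if [ -L "${workdir}/sym.txt" ] && [ ! -e "${workdir}/sym.txt" ]; then
  sym_state="dangling"
else
  sym_state="valid"
fi
echo "sym.txt is ${sym_state}"
```

The hard link keeps the data alive because the inode still has one reference, while the symbolic link only remembers a path that no longer exists.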

At this point it should be clear why, in terms of space occupied on disk, creating hard links is more convenient than copying files: when we add a hard link we don’t create a new file, but a new reference to an already existing one.



Creating incremental backups with rsync

First of all, what is a so-called incremental backup? An incremental backup stores only the data that has changed since the previous backup was made. In an incremental backup strategy, only the first backup of the series is a “full backup”; the subsequent ones just store the incremental differences. This has the advantage of requiring less space on disk and less time to complete compared to full backups.

How can we use rsync to create incremental backups? Say we want to create incremental backups of our $HOME directory: first we will create a full backup of it and store it in a directory we will name after the current timestamp. We will then create a symbolic link to this directory, and we will call it latest in order to have an easily identifiable reference.

The subsequent backups will be made by calculating the differences between the current state of the $HOME directory and the most recent existing backup. Each time a new backup is created, the current latest link, still pointing to the previous backup, will be removed; it will then be recreated with the new backup directory as target. The link will always point to the latest available backup.

Even if the backups are incremental, by taking a look inside each directory we will always see the complete set of files, not only the ones that changed: this is because the unchanged files will be represented by hard links. The files that were modified since the last backup will be the only ones to occupy new space on the disk.

To implement our backup strategy we will make use of the --link-dest option of rsync. This option takes a directory as argument. When invoking rsync we will then specify:

  • The source directory
  • The destination directory
  • The directory to be used as argument of the --link-dest option

The content of the source directory will be compared to that of the directory passed to the --link-dest option. New and modified files existing in the source directory will be copied to the destination directory as usual (and files deleted in the source will also not appear in the backup if the --delete option is used); unchanged files (by default, those whose size, modification time and preserved attributes match) will also appear in the backup directory, but they will just be hard links pointing to inodes created in the previously made backups.

Implementation

Here is a simple bash script with an actual implementation of our strategy:

#!/bin/bash

# A script to perform incremental backups using rsync

set -o errexit
set -o nounset
set -o pipefail

readonly SOURCE_DIR="${HOME}"
readonly BACKUP_DIR="/mnt/data/backups"
readonly DATETIME="$(date '+%Y-%m-%d_%H:%M:%S')"
readonly BACKUP_PATH="${BACKUP_DIR}/${DATETIME}"
readonly LATEST_LINK="${BACKUP_DIR}/latest"

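# Make sure the directory that holds all the backups exists
mkdir -p "${BACKUP_DIR}"

# Copy new and changed files; hard-link unchanged ones to the previous backup
rsync -av --delete \
  "${SOURCE_DIR}/" \
  --link-dest "${LATEST_LINK}" \
  --exclude=".cache" \
  "${BACKUP_PATH}"

# Repoint the "latest" symbolic link to the backup just created
rm -rf "${LATEST_LINK}"
ln -s "${BACKUP_PATH}" "${LATEST_LINK}"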


The first thing we did was to declare some read-only variables: SOURCE_DIR, which contains the absolute path of the directory we want to back up (our home directory in this case); BACKUP_DIR, which contains the path of the directory where all the backups will be stored; DATETIME, which stores the current timestamp; and BACKUP_PATH, which is the absolute path of the backup directory obtained by ‘joining’ BACKUP_DIR and the current DATETIME. Finally we set the LATEST_LINK variable, which contains the path of the symbolic link that will always point to the latest backup.

We then launch the rsync command providing the -a option (short for --archive) to preserve the most important attributes of the source files, the -v option to make the command more verbose (optional), and the --delete option so that files deleted from the source are also deleted on the destination (we explained this and other rsync options in a previous article).

Notice that we added a trailing slash to the SOURCE_DIR in the rsync command: this ensures that only the content of the source directory is synced, not the directory itself.

We run the command with the --link-dest option, passing the LATEST_LINK directory as argument. The first time we launch the script this directory will not exist: rsync will only print a warning about it rather than fail, and a full backup will be performed, as expected.

We decided to exclude the .cache directory from the backup with the --exclude option, and finally, we provided the BACKUP_PATH to instruct rsync where to create the backup.

After the command is successfully executed, the link pointing to the previous backup is removed, and another one with the same name, pointing to the new backup, is created.
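Removing the link and then recreating it leaves a short window in which latest does not exist. If that matters, one possible variation (not what the script above does) is to create the new link under a temporary name and rename it over the old one. The sketch below assumes GNU mv, for its -T option; the directory names and the latest.tmp name are made up for the example:

```shell
#!/bin/bash
# Sketch: repoint "latest" without a moment in which the link is missing.
# Assumes GNU mv (for -T); the directory names are illustrative only.

backup_dir="$(mktemp -d)"
mkdir "${backup_dir}/2024-01-01_00-00-00" "${backup_dir}/2024-01-02_00-00-00"

# Existing state: "latest" points at the older backup
ln -s "${backup_dir}/2024-01-01_00-00-00" "${backup_dir}/latest"

# Create the new link under a temporary name, then rename it over the old one.
# -T makes mv replace "latest" itself instead of moving into the directory it points to.
ln -s "${backup_dir}/2024-01-02_00-00-00" "${backup_dir}/latest.tmp"
mv -T "${backup_dir}/latest.tmp" "${backup_dir}/latest"

latest_target="$(readlink "${backup_dir}/latest")"
echo "latest -> ${latest_target}"
```

Because the replacement is a single rename, anything that reads latest sees either the old target or the new one.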

That’s it! Before we use the script in the real world we’d better add some error handling to it (for example we could delete the new backup directory if the backup is not completed successfully), and, since the rsync command can potentially run for quite a long time (at least the first time, when a full backup is created), we may want to implement some form of signal propagation from the parent script to the child process (how to do this could be a nice topic for another tutorial).
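As a starting point for the error handling mentioned above, here is one possible sketch of the "delete the new backup directory on failure" idea. The backup_with_cleanup function is a hypothetical helper, not part of the original script, and the demonstration uses a stand-in command that fails on purpose instead of a real rsync run:

```shell
#!/bin/bash
# Sketch: run a backup command and delete its target directory if the command fails.
# backup_with_cleanup is a hypothetical helper, not part of the original script.

backup_with_cleanup() {
  local backup_path="$1"
  shift
  local status=0

  # Run the remaining arguments as the backup command, remembering its exit status
  "$@" || status=$?

  if [ "${status}" -ne 0 ]; then
    echo "Backup failed with status ${status}, removing ${backup_path}" >&2
    rm -rf -- "${backup_path}"
  fi
  return "${status}"
}

# Demonstration with a stand-in command that creates the directory and then fails
demo_path="$(mktemp -d)/incomplete-backup"
demo_status=0
backup_with_cleanup "${demo_path}" \
  sh -c 'mkdir -p "$1" && touch "$1/partial-file" && exit 3' _ "${demo_path}" \
  || demo_status=$?

echo "exit status: ${demo_status}"
```

In the backup script, the rsync invocation could be passed to the helper in place of the stand-in command, with BACKUP_PATH as first argument. Since the script sets errexit, a non-zero return would then stop it before the latest link is repointed, so the link keeps pointing to the last complete backup.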



Run the script periodically with cron

This script is not meant to be launched manually: the most convenient thing would be to schedule its execution by creating an entry in our personal crontab. For cron to run the script directly it must be executable (chmod +x /path/to/backup-script.sh). To edit our crontab and add a new cron job, all we have to do is to execute the following command:

$ crontab -e

The crontab will be opened in the default text editor. In it we can create the new cron job. For example, for the script to be executed every 12 hours (at midnight and at noon) we could add this entry:

0 */12 * * * /path/to/backup-script.sh

Conclusions

In this tutorial we explained the difference between symbolic and hard links on Linux and we learned why it is important in the context of an incremental backup strategy implemented with rsync. We saw how and why we use the rsync --link-dest option to accomplish our task and we created a simple bash script to illustrate the strategy flow; finally we saw how to schedule the invocation of the script periodically using cron.


