July 31, 2009
By Pierre Vignéras
The first question you may ask is why there are so many file-systems, and what their differences are, if any. In short (see Wikipedia for details):
There are other file-systems, in particular newer ones such as Btrfs, ZFS and Nilfs2, that may sound very interesting too. We will deal with them later in this article (see section 5).
So now the question is: which file-system is the most suitable for your particular situation? The answer is not simple. But if you don't really know, if you have any doubt, I would recommend XFS for various reasons:
The only problem I see with XFS is that you cannot shrink an XFS file-system. You can grow an XFS partition even when it is mounted and in active use (hot-grow), but you cannot reduce its size. Therefore, if you expect to shrink file-systems, choose another one such as ext2/3/4 or reiserfs (as far as I know, you cannot hot-shrink either ext3 or reiserfs file-systems anyway). Another option is to keep XFS and always start with a small partition size, as you can always hot-grow it afterwards.
If you have a low-end computer (or file server) and you really need your CPU for something other than input/output operations, then I would suggest JFS.
If you have many directories and/or small files, reiserfs may be an option.
If you need performance at all costs, I would suggest ext2.
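As a minimal sketch of the hot-grow path, assuming the XFS file-system sits on an LVM logical volume (LVM is discussed later in this article): the logical volume is extended first, then the mounted file-system is grown to fill it. The device /dev/vg0/data, the mount point /data and the 10G increment are illustrative assumptions, and the `run` helper is hypothetical: it only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: grow a mounted XFS file-system that lives on an LVM logical volume.
# All device names, mount points and sizes are assumptions for illustration.

# Hypothetical helper: print the command by default;
# execute it for real only when DRY_RUN=0 (requires root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# grow_xfs_lv LV_DEVICE MOUNT_POINT EXTRA_SIZE
grow_xfs_lv() {
    run lvextend -L "+$3" "$1"   # enlarge the logical volume
    run xfs_growfs "$2"          # grow XFS to fill it, while still mounted
}

grow_xfs_lv /dev/vg0/data /data 10G
```

There is no matching shrink step for XFS, which is why starting small and growing on demand is the safer direction.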
Honestly, I don't see any reason for choosing ext3/4 (performance? really?).
So much for the choice of file-system. The other question is: which layout should I use? Two partitions? Three? A dedicated /home? A read-only /? A separate /tmp?
Obviously, there is no single answer to this question. Many factors should be considered in order to make a good choice. I will first define those factors:
Finding the perfect layout is a trade-off between those factors.
Often, a desktop end-user with little knowledge of Linux will follow the default settings of his distribution where (usually) only two or three partitions are made for Linux: the root file-system `/', /boot and the swap. The advantage of such a configuration is simplicity. The main problem is that this layout is neither flexible nor performant.
The lack of flexibility is obvious for many reasons. First, if the end-user wants another layout (for example to resize the root file-system, or to use a separate /tmp file-system), he will have to reboot the system and use partitioning software (from a live CD, for example). He will have to take care of his data, since re-partitioning is a brute-force operation that the operating system is not aware of.
Also, if the end-user wants to add some storage (for example a new hard drive), he will end up modifying the system layout (/etc/fstab), and after a while his system will depend on the underlying storage layout (number and location of hard drives, partitions, and so on).
By the way, having separate partitions for your data (/home, but also audio, video, databases, ...) makes changing the system (for example from one Linux distribution to another) much easier. It also makes sharing data between operating systems (BSD, OpenSolaris, Linux and even Windows) easier and safer. But this is another story.
A good option is to use Logical Volume Management (LVM). LVM solves the flexibility problem in a very nice way, as we will see. The good news is that most modern distributions support LVM and some use it by default. LVM adds an abstraction layer on top of the hardware removing hard dependencies between the OS (/etc/fstab) and the underlying storage devices (/dev/hda, /dev/sda, and others). This means that you may change the layout of storage -- adding and removing hard drives -- without disturbing your system. The main problem of LVM, as far as I know, is that you may have trouble reading an LVM volume from other operating systems.
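To illustrate how LVM decouples /etc/fstab from the physical disks, here is a minimal sketch of adding a new drive to an existing volume group. The volume group name vg0 and the partition /dev/sdc1 are assumptions for illustration, and the `run` helper is hypothetical: it prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: add a new partition to an existing LVM volume group.
# The names vg0 and /dev/sdc1 are assumptions for illustration.

# Hypothetical helper: print the command by default;
# execute it for real only when DRY_RUN=0 (requires root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# add_disk_to_vg VG_NAME PARTITION
add_disk_to_vg() {
    run pvcreate "$2"        # initialise the partition as an LVM physical volume
    run vgextend "$1" "$2"   # add its space to the existing volume group
}

add_disk_to_vg vg0 /dev/sdc1
```

Nothing in /etc/fstab changes here: mount points keep referring to logical volumes, whatever drives sit underneath.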
Whatever file-system is used (ext2/3/4, xfs, reiserfs, jfs), it is not perfect for all sorts of data and usage patterns (aka workloads). For example, XFS is known to be good at handling big files such as video files. On the other hand, reiserfs is known to be efficient at handling small files (such as configuration files in your home directory or in /etc). Therefore, having one file-system for all sorts of data and usage is definitely not optimal. The only good point of this layout is that the kernel does not need to support many different file-systems, which reduces the amount of memory the kernel uses to its bare minimum (this is also true with modules). But unless we focus on embedded systems, I consider this argument irrelevant with today's computers.
When a system is designed, it is usually done with a bottom-up approach: hardware is purchased according to criteria that are not related to its usage. Thereafter, a file-system layout is defined according to that hardware: "I have one disk, I may partition it this way, this partition will appear there, that other one there, and so on".
I propose the reverse approach. We define what we want at a high level. Then we travel through the layers from top to bottom, down to the real hardware -- storage devices in our case -- as shown in Figure 1. This illustration is just an example of what can be done. There are many options, as we will see. The next sections will explain how we can come to such a global layout.

Before installing a new system, the target usage should be considered, first from a hardware point of view. Is it an embedded system, a desktop, a server, an all-purpose multi-user computer (with TV/Audio/Video/OpenOffice/Web/Chat/P2P, ...)?
As an example, I always recommend that end-users with simple desktop needs (web, mail, chat, a little media watching) purchase a low-cost processor (the cheapest one), plenty of RAM (the maximum) and at least two hard drives.
Nowadays, even the cheapest processor is more than enough for web surfing and movie watching. Plenty of RAM gives a good cache (Linux uses free memory for caching -- reducing the amount of costly input/output to storage devices). By the way, purchasing the maximum amount of RAM your motherboard can support is an investment for two reasons:
Having two hard drives allows them to be used as a mirror. Therefore, if one fails, the system will continue to work normally and you will have time to get a new hard drive. This way, your system will remain available and your data quite safe (this is not sufficient: back up your data as well).
When choosing hardware, and specifically the file-system layout, you should consider the applications that will use it. Different applications have different input/output workloads. Consider the following applications: loggers (syslog), mail readers (thunderbird, kmail), search engines (beagle), databases (mysql, postgresql), p2p (emule, gnutella, vuze), shells (bash)... Can you see their input/output patterns and how much they differ?
Therefore, I define the following abstract storage locations, known as logical volumes -- lv -- in LVM terminology:
You may add/suggest any other categories here with different patterns such as sequential.read.lv, for example.
Let's suppose that we already have all those abstract storage locations in the form of /dev/TBD/LV where:
So we suppose we already have /dev/TBD/tmp.lv, /dev/TBD/read.lv, /dev/TBD/write.lv, and so on.
By the way, we consider that each volume group is optimized for its usage pattern (a trade-off has been found between performance and flexibility).
We would like to have /tmp, /var/tmp, and any $HOME/tmp all mapped to /dev/TBD/tmp.lv.
What I suggest is the following:
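# Replace auto by the real file-system if you want
# Replace defaults 0 2 by your own needs (man fstab)
/dev/TBD/tmp.lv /.tmp auto defaults 0 2
# Option 1: one shared directory bound to both /tmp and /var/tmp
/.tmp/ALL_TMP /tmp none bind 0 0
/.tmp/ALL_TMP /var/tmp none bind 0 0
# Option 2 (alternative to option 1): separate directories for /tmp and /var/tmp
/.tmp/FHS_TMP /tmp none bind 0 0
/.tmp/FHS_VAR_TMP /var/tmp none bind 0 0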
if test ! -e "$HOME/tmp" -a ! -e "/tmp/kde-$USER"; then
    mkdir "/tmp/kde-$USER"
    ln -s "/tmp/kde-$USER" "$HOME/tmp"
fi
Since the root file-system contains /etc, /bin, /usr/bin and so on, it is a perfect fit for read.lv. Therefore, in /etc/fstab I would place the following:
/dev/TBD/read.lv / auto defaults 0 1
For configuration files in user home directories, things are not as simple as you may guess. One may try to use the XDG_CONFIG_HOME environment variable (see FreeDesktop). But I would not recommend this solution, for two reasons. First, few applications actually conform to it nowadays (the default location is $HOME/.config when it is not set explicitly). Second, if you set XDG_CONFIG_HOME to a read.lv sub-directory, end users will have trouble finding their configuration files. Therefore, for that case, I don't have any good solution, and I will store home directories and all config files in the general write.lv location.
For that case, I will more or less reproduce the pattern used for tmp.lv: I will bind different directories for different applications. For example, I will have something similar to this in the fstab:
/dev/TBD/write.lv /.write auto defaults 0 2
/.write/db /db none bind 0 0
/.write/p2p /p2p none bind 0 0
/.write/home /home none bind 0 0
Of course, this supposes that the db, p2p and home directories have been created in write.lv.
Note that you have to be aware of access rights. One option is to provide the same rights as for /tmp, where anyone can write and read their own data. This is achieved by the following command, for example: chmod 1777 /p2p.
That volume has been tuned for logger-style applications such as syslog (and its variants, syslog-ng for example) and any other loggers (Java loggers for example). The /etc/fstab should be similar to this:
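The bind pattern above is regular enough to be generated rather than typed by hand. The following is a minimal sketch: the function name bind_entries is hypothetical, and it only prints fstab lines. It neither edits /etc/fstab nor creates the directories.

```shell
#!/bin/sh
# Sketch: print fstab bind entries for sub-directories of a mounted volume.
# bind_entries is a hypothetical helper name, not an existing tool.

# bind_entries VOLUME_MOUNT_POINT NAME...
# For each NAME, map VOLUME_MOUNT_POINT/NAME onto /NAME.
bind_entries() {
    base=$1
    shift
    for name in "$@"; do
        printf '%s/%s /%s none bind 0 0\n' "$base" "$name" "$name"
    done
}

bind_entries /.write db p2p home
```

The directories themselves still have to exist on the volume before the bind mounts can succeed, for example with `mkdir -p /.write/db /.write/p2p /.write/home` once write.lv is mounted on /.write.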
/dev/TBD/append.lv /.append auto defaults 0 2
/.append/syslog /var/log none bind 0 0
/.append/ulog /var/ulog none bind 0 0
Again, syslog and ulog are directories previously created in append.lv.
For multimedia files, I just add the following line:
/dev/TBD/mm.lv /mm auto defaults 0 2
Inside /mm, I create Photos, Audios and Videos directories. As a desktop user, I usually share my multimedia files with other family members. Therefore, access rights should be correctly designed.
You may prefer having distinct volumes for photo, audio and video files. Feel free to create logical volumes accordingly: photos.lv, audios.lv and videos.lv.
You may add your own logical volumes according to your needs. Logical volumes are quite cheap to deal with. They do not add a big overhead, and they provide a lot of flexibility, helping you get the most out of your system, particularly when choosing the right file-system for your workload.
Now that our mount points and our logical volumes have been defined according to our application usage patterns, we may choose the file-system for each logical volume. And here we have many choices, as we have already seen. First of all, you have the file-system itself (e.g. ext2, ext3, ext4, reiserfs, xfs, jfs and so on). For each of them you also have tuning parameters (such as block size, number of inodes, log options (XFS), and so on). Finally, when mounting, you may also specify different options according to some usage pattern (noatime, data=writeback (ext3), barrier (XFS), and so on). File-system documentation should be read and understood so you can map options to the correct usage pattern. If you don't have any idea which fs to use for which purpose, here are my suggestions:
At that high level, you may also decide whether you need encryption or compression support. This may help in choosing the file-system. For example, for mm.lv, compression is useless (as multimedia data is already compressed), whereas it may be useful for /home.
At this step we have chosen the file-systems for all of our logical volumes. It is now time to go down to the next layer and define our volume groups.
The next step is to define volume groups. At that level, we will define our needs in terms of performance tuning and fault tolerance. I propose defining VGs according to the following schema: [r|s].[R|W].[n] where:
Letters determine the type of optimization the named volume has been tuned for. The number gives an abstract representation of the fault tolerance level. For example:
We then have to map each logical volume to a given volume group. I suggest the following:
Of course, this is a 'may' and not a 'must' statement, as it depends on the number of storage devices that you can put into the equation. Defining VGs is actually quite difficult, since you cannot always completely abstract the underlying hardware. But I believe that defining your requirements first may help in defining the layout of your storage system globally.
We will see at the next level how to implement those volume groups.
That level is where you actually implement the requirements of a given volume group (defined using the rs.RW.n notation described above). Fortunately, there are not -- as far as I know -- many ways of implementing a VG requirement. You may use some LVM features (mirroring, striping), software RAID (with Linux MD), or hardware RAID. The choice depends on your needs and on your hardware. However, I would not recommend hardware RAID (nowadays) for a desktop computer or even a small file server, for two reasons:
So if you don't have any idea how to implement a given specification using RAID, please see the RAID documentation.
A few hints, however:
When you map storage space to a given physical volume, do not attach two storage spaces from the same storage device (i.e. partitions). You will lose the advantages of both performance and fault tolerance! For example, making /dev/sda1 and /dev/sda2 part of the same RAID1 physical volume is quite useless.
Finally, if you are not sure what to choose between LVM and MDADM, I would suggest MDADM, as it is a bit more flexible (it supports RAID0, 1, 5 and 10, whereas LVM only supports striping (similar to RAID0) and mirroring (similar to RAID1)).
Even if it is not strictly required, if you use MDADM you will probably end up with a one-to-one mapping between VGs and PVs. In other words, you may map many PVs to one VG, but this is a bit useless in my humble opinion. MDADM provides all the flexibility required in the mapping of partitions/storage devices into VG implementations.
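The [r|s].[R|W].[n] naming schema can be checked mechanically before a volume group is created. The sketch below is only an illustration: the function name parse_vg_name and the output labels (access, operation, fault_tolerance) are my own assumptions, and the combined forms rs and RW are accepted because names such as rs.RW.1 are used in this article.

```shell
#!/bin/sh
# Sketch: validate a volume group name against the [r|s].[R|W].[n] schema
# and split it into its three fields. The function name and the output
# labels are assumptions made for this illustration.

# parse_vg_name NAME
# Prints the three fields on success; returns 1 on an invalid name.
parse_vg_name() {
    if ! printf '%s\n' "$1" | grep -Eq '^(r|s|rs)\.(R|W|RW)\.[0-9]+$'; then
        echo "invalid volume group name: $1" >&2
        return 1
    fi
    access=${1%%.*}          # text before the first dot
    rest=${1#*.}             # text after the first dot
    operation=${rest%%.*}    # middle field
    level=${rest#*.}         # trailing number
    echo "access=$access operation=$operation fault_tolerance=$level"
}

parse_vg_name rs.RW.1
```

A name that does not follow the schema, such as raid.1, is rejected with a non-zero exit status, which makes the function usable as a guard in a provisioning script.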
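The one-to-one mapping between a RAID-backed PV and a VG can be sketched as follows. This is only an illustration under stated assumptions: the partitions /dev/sda1 and /dev/sdb1 (on two different disks), the array /dev/md0 and the volume group name rs.RW.1 are examples, and the `run` helper is hypothetical: it prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: one RAID1 array -> one physical volume -> one volume group.
# Device names and the volume group name are assumptions for illustration.

# Hypothetical helper: print the command by default;
# execute it for real only when DRY_RUN=0 (requires root, destroys data).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_mirrored_vg VG_NAME MD_DEVICE PARTITION1 PARTITION2
build_mirrored_vg() {
    # Mirror two partitions taken from two different disks.
    run mdadm --create "$2" --level=1 --raid-devices=2 "$3" "$4"
    run pvcreate "$2"        # the whole array becomes a single PV
    run vgcreate "$1" "$2"   # one VG on top of that single PV
}

build_mirrored_vg rs.RW.1 /dev/md0 /dev/sda1 /dev/sdb1
```

Fault tolerance is handled entirely by MD here, so LVM only has to deal with naming and resizing.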
Finally, you may want to make some partitions out of your different storage devices in order to fulfill your PV requirements (for example, RAID5 requires at least 3 different storage spaces). Note that in the vast majority of cases, your partitions will have to be of the same size.
If you can, I would suggest using storage devices directly (or making only one single partition out of a disk). But it may be difficult if you are short of storage devices. Moreover, if you have storage devices of different sizes, you will have to partition at least one of them.
You may have to find some trade-off between your PV requirements and your available storage devices. For example, if you have only two hard drives, you definitely cannot implement a RAID5 PV. You will have to rely on a RAID1 implementation only.
Note that if you really follow the top-bottom process described in this document (and if you can afford the price of your requirements of course), there is no real trade-off to deal with! ;-)
We didn't mention in our study the /boot file-system where the boot-loader is stored. Some would prefer having only one single / where /boot is just a sub-directory. Others prefer to separate / and /boot. In our case, where we use LVM and MDADM, I would suggest the following idea:
Swap is another topic we have not discussed so far. You have many options here:
Using LVM, it is quite easy to set up a new logical volume created from some volume group (depending on what you want to test and on your hardware) and to format it with some file-system. LVM is very flexible in this regard. Feel free to create and remove file-systems at will.
But in some ways, future file-systems such as ZFS, Btrfs and Nilfs2 will not fit perfectly with LVM. The reason is that LVM leads to a clear separation between application/user needs and the implementation of these needs, as we have seen. On the other hand, ZFS and Btrfs integrate both needs and implementation into one thing. For example, both ZFS and Btrfs support RAID levels directly. The good thing is that it eases the making of a file-system layout. The bad thing is that it somewhat violates the separation-of-concerns strategy.
Therefore, you may end up with both an XFS/LV/VG/MD1/sd{a,b}1 and a Btrfs/sd{a,b}2 inside the same system. I would not recommend such a layout, and I suggest using ZFS or Btrfs for everything or not at all.
Another file-system that may be interesting is Nilfs2. This log-structured file-system should have very good write performance (but maybe poor read performance). Therefore, it may be a very good candidate for the append logical volume, or for any logical volume created from an rs.W.n volume group.
If you want to use one or several USB drives in your layout consider the following:
Therefore, it may be interesting to use several USB drives (or even sticks) as part of a RAID system, especially a RAID1 system. With such a layout, you can pull one USB drive out of a RAID1 array and use it (in read-only mode) elsewhere. Then you plug it back into your original RAID1 array, and with a magic mdadm command such as:
mdadm /dev/md0 --add /dev/sda1
the array will reconstruct automagically and come back to its original state. I would not recommend making any other kind of RAID array out of USB drives, however. For RAID0, it is obvious: if you remove one USB drive, you lose all your data! For RAID5, having USB drives, and thus the hot-plug capability, does not offer any advantage: the USB drive you have pulled out is useless in RAID5 mode (same remark for RAID10).
Finally, new SSD drives may be considered while defining physical volumes. Their properties should be taken into account:
Therefore, SSD drives are suitable for implementing rs.R.n volume groups. As an example, the mm.lv and read.lv volumes can be stored on SSDs, since data is usually written once and read many times. This usage pattern is perfect for SSDs.
In the process of designing a file-system layout, the top-bottom approach starts with high-level needs. This method has the advantage that you can rely on requirements previously made for similar systems. Only the implementation will change. For example, if you design a desktop system, you may end up with a given layout (such as the one in figure 1). If you install another desktop system with different storage devices, you can rely on your first requirements. You just have to adapt the bottom layers: PVs and partitions. Therefore, the big piece of work, the analysis of the usage pattern or workload, needs to be done only once per kind of system, naturally.
In the next and final section, I will give some layout examples, roughly tuned for some well known computer usages.
This (see the top layout of figure 2) is a rather strange situation in my opinion. As already said, I consider that any computer should be sized according to some usage pattern. And having only one disk attached to your system means that you somehow accept its complete failure. But I know that the vast majority of computers today -- especially laptops and netbooks -- are sold (and designed) with only a single disk. Therefore, I propose the following layout, which focuses on flexibility and performance (as much as possible):


Here (see the bottom layout of figure 2), our concern is high availability. Since we have only two disks, only RAID1 can be used. This configuration provides:
Here (see the top layout of figure 3), our concern is high performance. Note however that I still consider it unacceptable to lose data. This layout provides the following:
Figure 3: Top: Layout for high performance desktop usage with two disks. Bottom: Layout for file server with four disks.


Here (see the bottom layout of figure 3), our concern is both high performance and high availability. This layout provides the following:
We could have used RAID10 for the whole system, as it provides a very good implementation of an rs.RW.1 VG (and in some ways also rs.RW.2). Unfortunately, this comes at a cost: 4 storage devices are required (here partitions), each of the same capacity S (let us say S = 500 Gigabytes). But the RAID10 physical volume does not provide a 4*S capacity (2 Terabytes) as you may expect. It only provides half of it, 2*S (1 Terabyte). The other 2*S (1 Terabyte) is used for high availability (mirror). See the RAID documentation for details. Therefore, I choose to use RAID5 for implementing rs.R.1. RAID5 will provide 3*S capacity (1.5 Terabytes); the remaining S (500 Gigabytes) is used for high availability. The mm.lv usually requires a large amount of storage space since it holds multimedia files.
Note 2: If you export 'home' directories through NFS or SMB, you should consider their location carefully. If your users need a lot of space, making homes on write.lv (the 'fit-all' location) may be storage-expensive because it is backed by a RAID10 PV, where half of the storage space is used for mirroring (and performance). You have two options here:
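The capacity arithmetic above can be reproduced with a few lines of shell. This is a minimal sketch: the function name raid_capacity is hypothetical, sizes are plain integers in whatever unit you pass in, and RAID10 is assumed to keep two copies of each block.

```shell
#!/bin/sh
# Sketch: usable capacity of an array of N members of equal SIZE.
# raid_capacity is a hypothetical helper; only the RAID levels discussed
# in this article are handled, and RAID10 is assumed to keep two copies.

# raid_capacity LEVEL DEVICES SIZE
raid_capacity() {
    level=$1 n=$2 size=$3
    case $level in
        0)  echo $(( n * size )) ;;         # striping only, nothing reserved
        1)  echo "$size" ;;                 # every member holds a full copy
        5)  echo $(( (n - 1) * size )) ;;   # one member's worth of parity
        10) echo $(( n * size / 2 )) ;;     # half of the space holds mirrors
        *)  echo "unsupported RAID level: $level" >&2
            return 1 ;;
    esac
}

raid_capacity 10 4 500   # prints 1000 (four 500 GB partitions in RAID10)
raid_capacity 5 4 500    # prints 1500 (the same partitions in RAID5)
```

With S = 500 Gigabytes, the 500 Gigabyte difference between the two results is the space gained by choosing RAID5 over RAID10 for the read-mostly volume group.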
If you have any questions, comments or suggestions about this document, feel free to contact me.
This document is licensed under a Creative Commons Attribution-Share Alike 2.0 France License.
The information contained in this document is for general information purposes only. The information is provided by Pierre Vignéras and while I endeavor to keep the information up to date and correct, I make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the document or the information, products, services, or related graphics contained in the document for any purpose.
Any reliance you place on such information is therefore strictly at your own risk. In no event will I be liable for any loss or damage including, without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this document.
Through this document you are able to link to other documents which are not under the control of Pierre Vignéras. I have no control over the nature, content and availability of those sites. The inclusion of any links does not necessarily imply a recommendation or an endorsement of the views expressed within them.