PROUHD: RAID for the end-user.

April 13, 2010
By Pierre Vignéras


Abstract:

RAID has still not been adopted by most end-users despite its inherent qualities such as performance and reliability. Reasons such as the complexity of RAID technology (levels, hard/soft), set-up, or support may be given. We believe the main reason is that most end-users own a vast amount of heterogeneous storage devices (USB stick, IDE/SATA/SCSI internal/external hard drives, SD/XD Card, SSD, …), and that RAID-based systems are mostly designed for homogeneous (in size and technology) hard disks. Therefore, there is currently no storage solution that manages heterogeneous storage devices efficiently.

In this article, we propose such a solution and we call it PROUHD (Pool of RAID Over User Heterogeneous Devices). This solution supports heterogeneous (in size and technology) storage devices, maximizes the usage of the available storage space, is tolerant to device failure up to a customizable degree, still makes automatic addition, removal and replacement of storage devices possible, and performs well under a typical end-user workload.

Although this article makes some references to Linux, the algorithms described are independent of the operating system and thus may be implemented on any of them.

Introduction

Whereas RAID1 has been massively adopted by the industry, it is still not common on end-users' desktops. The complexity of RAID systems might be one reason… among many others. Actually, in a state-of-the-art data center, the storage is designed according to some requirements (the ”top-bottom” approach already discussed in a previous article2). Therefore, from a RAID perspective, the storage is usually composed of a pool of disks of the same size and characteristics, including spares3. The focus is often on performance. The global storage capacity is usually not a big deal.

The average end-user case is rather different in that their global storage capacity is composed of various storage devices such as:

  • Hard drives (internal IDE, internal/external SATA, external USB, external Firewire);
  • USB Sticks;
  • Flash Memory such as SDCard, XDCard, …;
  • SSD.

By contrast, performance is not a major concern for the end-user: most usage does not require very high throughput. Cost and capacity are the main factors, along with ease of use. Moreover, the end-user does not usually have any spare devices.

We propose in this paper an algorithm for disk layout using (software) RAID that has the following characteristics:

  • it supports heterogeneous storage devices (size and technology);
  • it maximizes storage space;
  • it is tolerant to device failure up to a certain degree that depends on the number of available devices and on the RAID level chosen;
  • it still makes automatic addition, removal and replacement of storage devices possible under certain conditions;
  • it performs well under a typical end-user workload.

Algorithm

Description

Conceptually, we first stack storage devices one over the other as shown in figure 1.

Figure 1: Stacking storage devices (same size, ideal RAID case).

In that example, with $n$ devices, each of capacity $c$ (terabytes), we end up with a global storage capacity of $n.c$ (here $4\times1=4$ Tb). From that global storage space, using RAID, you can get (a small helper after the list makes the arithmetic explicit):

  • a 4 Tb ($n.c$) virtual storage device (called PV for Physical Volume4 in the following) using RAID0 (level 0), but then you have no fault tolerance (if a physical device fails, the whole virtual device is lost);
  • a 1 Tb ($c$) PV using RAID1; in that case, you have a fault tolerance degree of 3 (the PV remains valid in the face of 3 drive failures, and this is the maximum);
  • a 3 Tb ($(n-1).c$) PV using RAID5; in that case, you have a fault tolerance degree of 1;
  • a 2 Tb ($(n/2).c$) PV using RAID10; in that case, the fault tolerance degree is also 15 ($n/2$ is the number of mirrored sets, 2 in our case).
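
The capacities above follow directly from the usual RAID formulas. A small Python helper (ours, for illustration only; RAID10 is assumed to be made of 2-way mirrors) makes the arithmetic explicit:

    # Usable capacity of a RAID array made of n partitions of size c (in Tb).
    # Standard formulas: RAID0 stripes, RAID1 mirrors, RAID5 loses one partition
    # to parity, RAID10 keeps one copy out of two.
    def raid_capacity(level, n, c):
        if level == 0:
            return n * c
        if level == 1:
            return c
        if level == 5:
            return (n - 1) * c
        if level == 10:
            return (n // 2) * c
        raise ValueError("unsupported RAID level")

    for level in (0, 1, 5, 10):
        print("RAID%-2d: %d Tb" % (level, raid_capacity(level, 4, 1)))
    # RAID0 : 4 Tb, RAID1 : 1 Tb, RAID5 : 3 Tb, RAID10: 2 Tb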

The previous example hardly represents a real (end-user) case. Figure 2 represents such a scenario, also with 4 disks (though the listed capacities do not represent common use cases, they ease mental capacity calculation for the algorithm description). In this case, we face 4 devices, hda, hdb, hdc and hdd, of respective capacities 1 Tb, 2 Tb, 1 Tb and 4 Tb. Hence the global storage capacity is:

\begin{displaymath} G=\sum_{d}c_{d}=1+2+1+4=8  Tb\end{displaymath}

Since a traditional RAID array requires devices of the same size, the minimum device capacity is used in that case: $c_{min}=1$ Tb. Therefore, we can have:

  • 4 Tb, using RAID0;
  • 1 Tb, using RAID1;
  • 3 Tb, using RAID5;
  • 2 Tb, using RAID10.

Figure 2: Stacking storage devices (different size = usual end-user case).

Thus, we end up with exactly the same possibilities as in the previous example. The main difference, however, is the wasted storage space — defined as the storage space on each disk that is used neither for storage nor for fault tolerance6.

In our example, the 1 Tb capacities of devices hda and hdc are fortunately fully used. But only 1 Tb out of the 2 Tb of device hdb and 1 Tb out of the 4 Tb of device hdd are really used. Therefore, in this case, the wasted storage space is given by the formula:

\begin{displaymath} W=\sum_{d}(c_{d}-c_{min})=(1-1)+(2-1)+(1-1)+(4-1)=4  Tb\end{displaymath}

In this example, 4 Tb out of 8 Tb, i.e. 50% of the global storage space, is actually unused. For an end-user, such an amount of wasted space is definitely an argument against using RAID, despite all the other advantages RAID provides (flexibility for adding/removing devices, fault tolerance and performance).
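
The 50% figure can be checked mechanically; the following throwaway Python lines (ours) simply apply the formula above:

    # Wasted space of a traditional RAID array over heterogeneous disks:
    # every disk is truncated to the size of the smallest one.
    capacities = [1, 2, 1, 4]                       # hda, hdb, hdc, hdd, in Tb
    wasted = sum(c - min(capacities) for c in capacities)
    print(wasted, "Tb wasted out of", sum(capacities), "Tb")    # 4 Tb out of 8 Tb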

The algorithm we propose is very simple indeed. First, we sort the device list in ascending capacity order. Then, we partition each disk in such a way that an array with the maximum number of other partitions of the same size can be made. Figure 3 shows the process in our preceding example with 4 disks.

Figure 3: Illustration of the vertical RAID layout.

A first partition $p_1$ is made on all disks. The size of that partition is the size of the first disk, hda, which is the minimum — 1 Tb in our case. Since the second disk in our sorted list, named hdc, is also of 1 Tb capacity, no room is available for making a new partition. Therefore, it is skipped. The next disk in our sorted list is hdb. Its capacity is 2 Tb. The first $p_1$ partition takes 1 Tb already. Another 1 Tb is available for partitioning and it becomes $p_2$. Note that this other 1 Tb partition $p_2$ is also made on each following disk in our sorted list. Therefore, our last device, hdd, already has 2 partitions: $p_1$ and $p_2$. Since it is the last disk, the remaining storage space (2 Tb) will be wasted. Now, a RAID array can be made from partitions of the same size on different disks. In this case, we have the following choices:

  • making a RAID array $R_1$ from the 4 $p_1$ partitions, we can get:
    • 4 Tb in RAID0;
    • 1 Tb in RAID1;
    • 3 Tb in RAID5;
    • 2 Tb in RAID10;
  • making another array $R_2$ from the 2 $p_2$ partitions, we can get:
    • 2 Tb in RAID0;
    • 1 Tb in RAID1.

Therefore, we have maximized the storage space we can get from multiple devices. Actually, we have minimized the wasted space, which is given — with this algorithm — by the last partition of the last drive, in this case the remaining 2 Tb of hdd. Only 25% of the global storage space is wasted, and this is the minimum we can get. Said otherwise, 75% of the global storage space is used either for storage or for fault tolerance, and this is the maximum we can get using RAID technology.

The amount of available storage space depends on the RAID level chosen for each PV made from the vertical partitions $p_1$ and $p_2$. It can vary from 2 Tb {RAID1, RAID1} up to 6 Tb {RAID0, RAID0}. The maximum storage space available with a fault tolerance degree of 1 is 4 Tb {RAID5, RAID1}.
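
The partitioning step is easy to express in code. The following sketch (ours, not the actual layout-disks implementation) sorts the devices, derives the vertical partitions and reproduces the figures quoted above:

    # Vertical PROUHD layout: sort devices by capacity, then, for each increase
    # in capacity, create a partition spanning all the remaining (larger) devices.
    def prouhd_layout(devices):
        """devices: dict name -> capacity (Tb). Returns ([(size, members)], wasted)."""
        items = sorted(devices.items(), key=lambda kv: kv[1])    # ascending capacity
        arrays, previous = [], 0
        for i, (name, capacity) in enumerate(items):
            size = capacity - previous
            members = [n for n, _ in items[i:]]
            if size > 0 and len(members) >= 2:      # a RAID array needs >= 2 members
                arrays.append((size, members))
            previous = capacity
        wasted = items[-1][1] - items[-2][1]        # last partition of the last drive
        return arrays, wasted

    arrays, wasted = prouhd_layout({"hda": 1, "hdb": 2, "hdc": 1, "hdd": 4})
    for size, members in arrays:
        print("%d Tb partitions on %s" % (size, ", ".join(members)))
    print("wasted:", wasted, "Tb")
    # R1: 1 Tb partitions on hda, hdc, hdb, hdd -- R2: 1 Tb partitions on hdb, hdd
    # wasted: 2 Tb; RAID5 on R1 plus RAID1 on R2 gives the 4 Tb quoted above:
    print((len(arrays[0][1]) - 1) * arrays[0][0] + arrays[1][0], "Tb")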

Analysis

In this section, we give an analysis of our algorithm. We consider $n$ storage devices of respective capacity $c_{i}$ for $i\in[1,n]$, where $c_{1}\le c_{2}\le\dots\le c_{n}$. Said otherwise, the $n$ drives are sorted by their capacity in ascending order, as illustrated in figure 4. We also define $c_{0}=0$ for simplification purposes.

Figure 4: Illustration of the general algorithm.

We also define:

  • the global storage space:
    \begin{displaymath} G(n)=\sum_{i=1}^{n}c_{i}=c_{1}+c_{2}+\dots+c_{n}\end{displaymath}

    naturally, we also define $G(0)=0$ (no device gives no storage);

  • the wasted storage space $W(n)=c_{n}-c_{n-1}$; we also define $W(0)=0$ (no device gives no waste); note anyway that $W(1)=c_{1}$ (with only one device you cannot make any RAID array and therefore, the wasted space is maximum!);
  • the maximum (safe) available storage space (using RAID57):
    \begin{eqnarray*} C_{max}(n) & = & c_{1}.(n-1)+(c_{2}-c_{1}).(n-2)+\dots+(c_{n-1}-c_{n-2}).1\\ & = & \sum_{i=1}^{n-1}(c_{i}-c_{i-1}).(n-i)\\ & = & \sum_{i=1}^{n-1}W(i).(n-i)\end{eqnarray*}
  • we also define $C_{max}(0)=0$ and $C_{max}(1)=0$ (you need at least 2 drives to make a RAID array);
  • the lost storage space, defined as $G(n)-C_{max}(n)$; it represents the amount of space not used for storage (it includes both the space used for fault tolerance and the wasted space); note that $G(0)-C_{max}(0)=0$ and that $G(1)-C_{max}(1)=c_{1}=W(1)$ (with one drive, the wasted space is maximum, and is equal to the lost space). These definitions are checked numerically in the short sketch below.
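
These definitions translate directly into a few lines of Python; the following sketch (ours) checks them on the example of figure 3:

    # c is the sorted list of capacities; the text's c_i is c[i-1], and c_0 = 0.
    def G(c, n):                        # global storage space
        return sum(c[:n])

    def W(c, n):                        # wasted space: last partition of last drive
        if n == 0:
            return 0
        if n == 1:
            return c[0]
        return c[n - 1] - c[n - 2]

    def Cmax(c, n):                     # maximum safe space, RAID5 on every array
        return sum(W(c, i) * (n - i) for i in range(1, n))

    def lost(c, n):                     # space not usable for plain storage
        return G(c, n) - Cmax(c, n)

    c = [1, 1, 2, 4]
    print(G(c, 4), Cmax(c, 4), W(c, 4), lost(c, 4))     # 8 4 2 4
    print(Cmax(c, 4) == G(c, 3))                        # True: C_max(n) = G(n-1)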

We also have, $\forall n\ge1$, $C_{max}(n)=G(n-1)$: the maximum storage space at level $n$ is the global storage space at the previous level, $n-1$. By the way, when a new storage device is added, with a capacity of $c_{n+1}$, we have:

  • the new global storage space: $G(n+1)=G(n)+c_{n+1}$;
  • the new maximum available storage space: $C_{max}(n+1)=C_{max}(n)+c_{n}$;
  • the new wasted space: $W(n+1)=c_{n+1}-c_{n}$;
  • the new lost space: $G(n+1)-C_{max}(n+1)=c_{n+1}$.

When a new storage device bigger than any other in the configuration is added, the maximum available storage space is increased by an amount equal to the size of the last device of the previous configuration (i.e. the configuration without the new device). Moreover, the new lost space is exactly equal to the size of that new device.

As a conclusion, purchasing a device much bigger than the last one in the configuration is not a big win in the first place, since it mainly increases the wasted space! That wasted space will only be used when a drive of an even higher capacity gets introduced later.
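
A quick numerical check (ours) of these two statements, using the closed form $C_{max}(n)=G(n-1)$ derived above; here a 5 Tb drive is added to the {1, 1, 2, 4} Tb pool:

    # Adding a 5 Tb drive to the {1, 1, 2, 4} Tb pool: before and after.
    old = [1, 1, 2, 4]
    new = sorted(old + [5])
    g = lambda c: sum(c)
    cmax = lambda c: sum(c[:-1])          # C_max(n) = G(n-1)
    print(cmax(new) - cmax(old))          # +4 Tb: the size of the previous last drive
    print(g(new) - cmax(new))             # 5 Tb lost: the size of the new drive
    print(new[-1] - new[-2])              # 1 Tb wasted: W(n+1) = c_{n+1} - c_n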

You may compare our algorithm with the usual RAID layout (i.e. using the same device size $c_{1}$, the minimum) on the same set of devices:

  • the global storage space remains unchanged:

\begin{displaymath} G'(n)=G(n)\end{displaymath}

  • the maximum storage space becomes:

\begin{displaymath} C'_{max}(n)=c_{1}.(n-1)\end{displaymath}

  • the wasted space becomes:

\begin{displaymath} W'(n)=\sum_{i=2}^{n}(c_{i}-c_{1})=G'(n)-n.c_{1}\end{displaymath}

  • the lost space becomes:

\begin{displaymath} G'(n)-C'_{max}(n)=G(n)-c_{1}.(n-1)\end{displaymath}

When a new device of capacity $c_{n+1}$ is added to the device set, we get:

  • $C'_{max}(n+1)=C'_{max}(n)+c_{1}$ (the available storage space is increased by $c_{1}$ only);
  • $W'(n+1)=W'(n)+(c_{n+1}-c_{1})$ (whereas the wasted space is increased by $c_{n+1}-c_{1}$);
  • $G'(n+1)-C'_{max}(n+1)=G'(n)-C'_{max}(n)+(c_{n+1}-c_{1})$ (and the lost space is increased by the same amount).

As seen formally, the traditional algorithm is very weak in the handling of heterogeneous storage device sizes. When you add a new device of a higher capacity to the configuration, you increase both the wasted space and the lost space by an amount equal to the difference in size between that new device and the first (smallest) one. Figure 5 gives a graphical comparison of the wasted and lost storage space on the whole set of devices, for the traditional RAID algorithm (left) and for PROUHD (right).

Figure 5: Graphical representation of the wasted and lost storage space for the traditional RAID algorithm (left) and the PROUHD algorithm (right).

By the way, formally, since $c_{1}\le c_{n-1}$, it is clear that $c_{n}-c_{n-1}\le c_{n}-c_{1}\le\sum_{i=2}^{n}(c_{i}-c_{1})$. Thus, $W(n)\le W'(n)$. Therefore, the heterogeneous algorithm always gives a better result in terms of wasted space, as expected. It can be shown easily that the heterogeneous algorithm also systematically gives a better result for the lost space: $G(n)-C_{max}(n)\le G'(n)-C'_{max}(n)$.
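
A quick numerical comparison (ours) on the 4-disk example confirms it:

    # Wasted and lost space: traditional RAID (everything cut to c_1) vs PROUHD.
    c = [1, 1, 2, 4]                                   # sorted capacities, in Tb
    n = len(c)

    w_trad = sum(ci - c[0] for ci in c)                # W'(n) = G(n) - n.c_1
    lost_trad = sum(c) - c[0] * (n - 1)                # G(n) - c_1.(n-1)

    w_prouhd = c[-1] - c[-2]                           # W(n) = c_n - c_{n-1}
    lost_prouhd = sum(c) - sum(c[:-1])                 # G(n) - C_max(n) = c_n

    print("traditional: %d Tb wasted, %d Tb lost" % (w_trad, lost_trad))      # 4, 5
    print("PROUHD     : %d Tb wasted, %d Tb lost" % (w_prouhd, lost_prouhd))  # 2, 4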

Conversely, our algorithm can be seen as an extension of the traditional layout to the case where all devices are of the same size. This translates formally to $c_{i}=c, \forall i\in[1,n]$, and we have:

  • a global storage space of:

    $G(n)=n.c$;

  • a maximum storage space of:

    $C_{max}(n)=(n-1).c$ (RAID5);

  • a wasted space of:

    $W(n)=0$;

  • a lost space of:

    $G(n)-C_{max}(n)=c$.

And we get back to what we are used to, where only one disk is lost for $n$ drives of the same size (using RAID5).

    Implementation (layout-disks)

    We propose an open-source Python software called layout-disks, available at http://www.sf.net/layout-disks, that, given a list of device labels and sizes, returns the possible layout using this algorithm. As an example, with the 4 disks taken from figure 3, the software proposes the following:

        [sample output of layout-disks omitted]

    The software reports that from the first partition of each of the 4 drives, several RAID level options are available (from RAID1 up to RAID5)8. For the second partition, present on devices hdb and hdd, only RAID1 is available.

    Performance

    From a performance point of view, this layout is definitely not optimal for every usage. Traditionally, in the enterprise case, two different virtual RAID devices map to different physical storage devices. Here, on the contrary, distinct PROUHD devices share some of their physical storage devices. If no care is taken, this can lead to very poor performance, as any request made to a PROUHD device may be queued by the kernel until requests made to other PROUHD devices have been served. Note however that this is not different from the single disk case, except from a strict performance point of view: the throughput of a RAID array — especially on reads — may well outperform the throughput of a single disk thanks to parallelism.

    For most end-user cases, this layout is perfectly fine from a performance point of view, especially for storing multimedia files such as photo, audio or video files where, most of the time, files are written once and read multiple times, sequentially. A file server with such a PROUHD disk layout will easily serve multiple end-user clients simultaneously. Such a layout may also be used for backup storage. The only reason not to use such a configuration is if you have strong performance requirements. On the other hand, if your main concern is storage space management, such a configuration is very sound.

    By the way, you may combine such a layout with the Linux Logical Volume Manager (LVM). For example, if your main concern is storage space with a fault tolerance level of 1, you may combine the 3 Tb RAID5 region with the 1 Tb RAID1 region of the previous example into a volume group, resulting in a virtual device of 4 Tb, from which you can define Logical Volumes (LV) at will.

    The advantage of such a combined RAID/LVM layout versus a strict LVM layout (without any RAID array in between) is that you can benefit from the advantages of the various RAID levels (all levels 0, 1, 5, 10, 50, or 6), whereas LVM provides, as far as I know, a ”poor” (compared to RAID) mirroring and striping implementation. By the way, note that specifying mirror or stripe options at logical volume creation will not give the expected performance and/or fault tolerance improvement, since the physical volumes are (already) RAID arrays sharing real physical devices.

    SSD special case

    Our solution makes good use of the available storage space at the expense of a raw performance penalty in some cases: when concurrent accesses are made to distinct RAID arrays sharing the same physical devices. Concurrent accesses usually imply random access to the data.

    Hard drives have a hard limit on their I/O throughput with random access patterns due to their mechanical constraints: after the data has been located, the reading (or writing) head must seek to the correct cylinder and wait until the correct sector passes under it thanks to platter rotation. Obviously, reading or writing to hard disks is mainly a sequential process. A read/write request is pushed onto a queue (in software or in hardware), and it has to wait for the previous ones. Of course, many improvements were made to speed up the reading/writing process (for example, the use of buffers and caches, smart queue management, bulk operations, data locality computation, among others), but the performance of hard drives is physically limited anyhow, especially on random accesses. In some ways, this random (concurrent) access problem is the reason why RAID was introduced in the first place.

    SSDs are very different from hard disks. In particular, they do not have such mechanical constraints. They handle random accesses much better than hard disks. Therefore, the performance penalty of PROUHD discussed above may not be so pronounced with SSDs. Concurrent accesses made to distinct RAID arrays sharing physical SSDs will result in several requests with a random access pattern made to each underlying SSD. But as we have seen, SSDs handle random requests quite well. Some investigations should be made to compare the performance of PROUHD over hard disks versus PROUHD over SSDs. Any help in this regard will be appreciated.

    Partitioning drives

    PROUHD requires that storage devices are properly partitioned into slices of the same size. Depending on the number of different-sized storage devices, the algorithm may lead to the creation of a vast number of partitions on each device. Fortunately, it is not required to use primary partitions, which are limited to 4 by the PC BIOS for legacy reasons. Logical partitions can be used in order to create all the required slices: there is almost no limit to their number. On the other hand, if you need partitions of more than 2 terabytes, then logical partitions are no longer an option.
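
    The number of slices is easy to bound: a device carries one partition per distinct capacity value lower than or equal to its own capacity (the last slice of the largest device being the wasted one). A quick check (ours) on the example of figure 3:

        # Number of PROUHD slices carried by each device: one per distinct
        # capacity value lower than or equal to its own capacity.
        capacities = sorted([1, 1, 2, 4])
        for i, c in enumerate(capacities, start=1):
            slices = len(set(capacities[:i]))
            print("device %d (%d Tb): %d slice(s)" % (i, c, slices))
        # device 1 (1 Tb): 1, device 2 (1 Tb): 1, device 3 (2 Tb): 2, device 4 (4 Tb): 3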

    For this specific case (partition sizes of more than 2 TB), the GUID Partition Table (GPT) might be an option. As far as I know, only parted9 supports it.

    It might be tempting to use LVM for partitioning purposes. While it is a perfect choice in the usual case of partitioning, I would not recommend it for PROUHD. Actually, the other way round is the better option: RAID arrays are the perfect choice for LVM Physical Volumes (PV). I mean, each RAID array becomes a PV. From some PVs, you create a Volume Group (VG). From those VGs, you create Logical Volumes (LV) that you finally format and mount into your filesystem. Therefore, the chain of layers is as follows:

     Device -> RAID -> PV -> VG -> LV -> FS.

    If you use LVM for partitioning drives, you end up with a huge number of layers that (probably) kill performance and complicate the design:

     Device -> PV -> VG -> LV -> RAID -> PV -> VG -> LV -> FS.

    Honestly, I have not tested such a complex configuration. I would be interested in feedback though. 😉

    Handling Disk Failure

    Of course, any disk will fail, one day or another. The later, the better. But planning for disk replacement is not something that can be postponed until a failure occurs; failures usually do not happen at a good time (Murphy's law!). Thanks to RAID (for level 1 and above), a disk failure does not prevent the whole system from working normally. This is a problem, since you may not even notice that something went wrong. Again, if nothing is planned, you will discover it the hard way, when a second disk actually fails and you have no way to recover your RAID arrays. The first thing to do is to monitor your storage devices. You have (at least) 2 tools for that purpose:

    smartmontools:
    SMART is a standard implemented in most IDE and SATA drives that monitors the health of a disk, performs some tests (online and offline), and can send reports by email, especially when one or more tests went wrong. Note that SMART does not give any guarantee that it will anticipate failure, nor that its failure forecasts are accurate. Anyway, when SMART tells you that something is wrong, it is better to plan for a disk replacement very soon. By the way, in such a case, do not stop the drive unless you have a spare: drives usually dislike being restarted, especially after such forecasted failures. Configuring smartmontools is quite simple. Install that software and look at the file smartd.conf, usually in /etc.
    mdadm:
    mdadm is the Linux tool for (software) RAID management. When something happens to a RAID array, an email can be sent. See the file mdadm.conf, usually in /etc, for details. A minimal monitoring sketch is also given below.
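
    Beyond email notification, it is also easy to poll /proc/mdstat periodically for degraded arrays. A minimal sketch (ours; it assumes the usual mdstat format where the member status appears as a [UU_]-style field):

        # Report md arrays that currently have missing or failed members, by
        # looking for a '_' in the [UU_]-style status field of /proc/mdstat.
        import re, sys

        def degraded_arrays(path="/proc/mdstat"):
            text = open(path).read()
            degraded = []
            for match in re.finditer(r"^(md\d+)\s*:.*?\[([U_]+)\]", text,
                                     re.MULTILINE | re.DOTALL):
                name, status = match.groups()
                if "_" in status:
                    degraded.append((name, status))
            return degraded

        if __name__ == "__main__":
            bad = degraded_arrays()
            for name, status in bad:
                print("array %s is degraded: [%s]" % (name, status))
            sys.exit(1 if bad else 0)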

    In traditional RAID, when one device of a RAID array fails, the array is in a so-called ”degraded” mode. In such a mode, the array still works and data remains accessible, but the whole system may suffer a performance penalty. When you replace the faulty device, the array gets reconstructed. Depending on the RAID level, this operation is either very simple (mirroring requires only a single copy) or very complex (RAID5 and 6 require parity computation). In either case, the time required to complete this reconstruction is usually quite huge (depending on the array size). But the system is normally able to perform this operation online. It can even limit the overhead as much as possible when the RAID array is serving clients. Note that RAID5 and RAID6 levels can stress a file server quite heavily during array reconstruction.

    In the case of PROUHD, the effect on the whole system is worse, since one drive failure impacts many RAID arrays. Traditionally, degraded RAID arrays can all be reconstructed at the same time. The main point is to reduce the time spent in degraded mode, in order to minimize the probability of data loss globally (the more time in degraded mode, the more probable data loss becomes). But parallel reconstruction is not a good idea in the PROUHD case, because the RAID arrays share storage devices. Therefore, any reconstruction impacts all arrays. Parallel reconstructions will just add stress on all the storage devices, and thus the global reconstruction will probably not complete sooner than a simpler sequential one. Fortunately, the Linux md driver already serializes the reconstruction of arrays that share physical devices, as the following kernel log shows:

    Sep  6 00:57:02 phobos kernel: md: syncing RAID array md0
    Sep  6 00:57:02 phobos kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
    Sep  6 00:57:02 phobos kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
    Sep  6 00:57:02 phobos kernel: md: using 128k window, over a total of 96256 blocks.
    Sep  6 00:57:02 phobos kernel: md: delaying resync of md1 until md0 has finished resync (they share one or more physical units)
    Sep  6 00:57:02 phobos kernel: md: syncing RAID array md2
    Sep  6 00:57:02 phobos kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
    Sep  6 00:57:02 phobos kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
    Sep  6 00:57:02 phobos kernel: md: using 128k window, over a total of 625137152 blocks.
    Sep  6 00:57:02 phobos kernel: md: delaying resync of md3 until md2 has finished resync (they share one or more physical units)
    Sep  6 00:57:02 phobos kernel: md: delaying resync of md1 until md0 has finished resync (they share one or more physical units)
    Sep  6 00:57:02 phobos kernel: md: delaying resync of md4 until md2 has finished resync (they share one or more physical units)
    Sep  6 00:57:02 phobos kernel: md: delaying resync of md1 until md0 has finished resync (they share one or more physical units)
    Sep  6 00:57:02 phobos kernel: md: delaying resync of md3 until md4 has finished resync (they share one or more physical units)
    Sep  6 00:57:25 phobos kernel: md: md0: sync done.
    Sep  6 00:57:26 phobos kernel: md: delaying resync of md3 until md4 has finished resync (they share one or more physical units)
    Sep  6 00:57:26 phobos kernel: md: syncing RAID array md1
    Sep  6 00:57:26 phobos kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
    Sep  6 00:57:26 phobos kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
    Sep  6 00:57:26 phobos kernel: md: using 128k window, over a total of 2016064 blocks.
    Sep  6 00:57:26 phobos kernel: md: delaying resync of md4 until md2 has finished resync (they share one or more physical units)
    Sep  6 00:57:26 phobos kernel: RAID1 conf printout:
    Sep  6 00:57:26 phobos kernel: --- wd:2 rd:2
    

    Therefore, we can rely on mdadm to do the right thing with RAID, whether it is a homogeneous configuration, a heterogeneous one, or a combination of both.

    Replacement Procedure

    Replacing a failed device with a same-size one.

    This is the ideal situation, and it mostly follows the traditional RAID approach, except that you now have more than one RAID array to manage for each device. Let’s take our example (figure 6 left), and let’s suppose that a failure has been detected on hdb. Note that a failure may have been detected locally on hdb2, and not on hdb1 for example. Anyway, the whole disk will have to be replaced and therefore, all arrays are concerned. In our example, we have set up the storage with the following PROUHD configuration:

    /dev/md0: hda1, hdb1, hdc1, hdd1 (RAID5, (4-1)*1Tb = 3 Tb)

    /dev/md1: hdb2, hdd2 (RAID1, (2*1Tb)/2 = 1Tb)

    1. Logically remove each faulty device partition from its corresponding RAID array:
      mdadm /dev/md0 --fail /dev/hdb1 --remove /dev/hdb1
      mdadm /dev/md1 --fail /dev/hdb2 --remove /dev/hdb2
    2. Physically remove the faulty device — unless you have a hot-plug system such as USB, you will have to power off the whole system;
    3. Physically add the new device — unless you have a hot-plug system such as USB, you will have to power on the whole system;
    4. Partition the new device (let's say /dev/sda) with the exact same layout as the failed device: 2 partitions of 1 Tb each, /dev/sda1 and /dev/sda2;
    5. Logically add each new partition to its corresponding RAID array:
      mdadm /dev/md0 --add /dev/sda1
      mdadm /dev/md1 --add /dev/sda2

    After a while, all your RAID arrays will get re-constructed.

    Replacing a failed device with a larger one.

    This case is not so simple indeed. The main issue is that the whole layout is not at all related to the old one. Let’s take the previous example, and see what happens if /dev/hdb fails. If we replace that 2 Tb device with a new 3 Tb device, we end up with the layout of figure 6 (right).

    Figure 6: Replacing a failed device by a larger one. Layout before (left) and after (right) the replacement of /dev/hdb:2 with /dev/sda:3.

    Notice that partition $p_{2}$ is now of 2 Tb and not of 1 Tb as was previously the case (see figure 3). This means that the previous RAID array made from /dev/hdb2:1Tb and /dev/hdd2:1Tb is no longer relevant after the replacement: it does not appear in the layout algorithm. Instead, we have a RAID array made of /dev/sda2:2Tb and /dev/hdd2:2Tb.

    Figure 7: Replacing a failed device (f) by a larger one (k), general case before (top) and after (bottom).


    In the general case, as shown on figure 7, the last partition $p_{f}$ of the failed device $d_{f}$ is no longer relevant. Therefore, the whole RAID array labeled $R_{f}$, of size $p_{f}=c_{f}-c_{f-1}$ and made from the partitions $p_{f}$ of devices $d_{f},\dots,d_{n}$, should be removed. The following array, $R_{f+1}$, which was made from the last partition of the following disk, $d_{f+1}$, should be resized according to the new layout. Its partitions $p_{f+1}$ were having a size of $c_{f+1}-c_{f}$. These partitions can now be ”merged” with $p_{f}$ since there is no ”in-between” capacity $c_{f}$ anymore. Therefore, the new ”merged” partitions become $p'_{f}$, with a size of $c_{f+1}-c_{f-1}$.

    Finally, the new device is inserted between the devices at ranks $k$ and $k+1$ because its capacity $c'_{k}$ is such that $c_{k}\le c'_{k}<c_{k+1}$. (Note that all devices $d_{i}, i\in[f+1,k]$ will shift to rank $i-1$ because the new device is added after the failed device $d_{f}$.) The new device should be partitioned so that all partitions from $p'_{1}$ up to $p'_{f-1}$ are of the same size as in the previous layout: $p'_{i}=p_{i}, \forall i\in[1,f-1]$. The size of partition $p'_{f}$ is given by $p'_{f}=c_{f+1}-c_{f-1}$, as we have seen previously. All the following partitions, up to $p'_{k-1}$, are also of the same size as in the old layout: $p'_{i}=p_{i+1}, \forall i\in[f+1,k-1]$. The new device then adds its own modification to the layout, according to the difference between its size $c'_{k}$ and the size $c_{k}$ of the k-th device of the old layout. Therefore, in the new layout, partition $k$ has a size given by $p'_{k}=c'_{k}-c_{k}$. Finally, the next partition should be modified: it was previously of size $c_{k+1}-c_{k}$, but this is no longer relevant in the new layout. It should be reduced to $p'_{k+1}=c_{k+1}-c'_{k}$. The following partitions should not be changed. Note that the new device replaces the failed partitions $p_{1},\dots,p_{f}$ of the failed device, but also adds one more partition to the RAID arrays made from partitions $p'_{f+1},\dots,p'_{k-1}$. We note $dev(R)$ the number of partitions that make up RAID array $R$. Therefore, we have: $dev(R'_{i})=dev(R_{i+1})+1, \forall i\in[f+1,k-1]$. Fortunately, it is possible to grow a RAID array under Linux thanks to the great mdadm --grow command.

    In summary, old layout:

    \begin{displaymath} p_{1},p_{2},\ldots,p_{f},\ldots,p_{k},\ldots,p_{n}\end{displaymath}

    becomes new layout:

    \begin{displaymath} p'_{1},p'_{2},\ldots,p'_{f},\ldots,p'_{k},\ldots,p'_{n}\end{displaymath}

    with:

    \begin{eqnarray*} p'_{i} & = & p_{i}, \forall i\in[1,  f-1]\\ p'_{f} & = & c_{f+1}-c_{f-1}\\ p'_{i} & = & p_{i+1}, \forall i\in[f+1,  k-1]\\ p'_{k} & = & c'_{k}-c_{k}\\ p'_{k+1} & = & c_{k+1}-c'_{k}\\ p'_{i} & = & p_{i}, \forall i\in[k+2,  n]\\ dev(R'_{i}) & = & dev(R_{i+1})+1, \forall i\in[f+1,  k-1]\end{eqnarray*}

    As we see, replacing a faulty device by a larger one leads to quite a lot of modifications. Fortunately, they are somewhat local: in a large set of devices, modifications happen only to a bounded number of devices and partitions. Anyway, the whole operation is obviously very time-consuming and error-prone if done without proper tools.
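
    The extent of those modifications can be previewed by simply running the layout computation on both the old and the new device sets and comparing the results. A rough sketch (ours, reusing the layout routine sketched earlier):

        # Preview the layout change caused by replacing the failed 2 Tb drive (hdb)
        # with a larger 3 Tb one (sda): recompute both layouts and compare.
        def layout(devices):
            items = sorted(devices.items(), key=lambda kv: kv[1])
            partitions, previous = [], 0
            for i, (name, capacity) in enumerate(items):
                size = capacity - previous
                if size > 0 and len(items) - i >= 2:
                    partitions.append((size, [n for n, _ in items[i:]]))
                previous = capacity
            return partitions

        before = layout({"hda": 1, "hdb": 2, "hdc": 1, "hdd": 4})
        after = layout({"hda": 1, "sda": 3, "hdc": 1, "hdd": 4})
        print("before:", before)
        print("after :", after)
        # The 1 Tb array over {hdb, hdd} disappears; a 2 Tb array over {sda, hdd}
        # replaces it, so the corresponding md device has to be rebuilt and resized,
        # not merely repaired.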

    Fortunately, the whole process can be automated. The algorithm presented below uses LVM advanced volume management. It supposes that the RAID arrays are physical volumes belonging to some volume groups (VG) from which logical volumes (LV) are created for making filesystems. As such, we note $PV_{i}$ the LVM physical volume backed by RAID array $R_{i}$.

    We suppose disk $d_{f}$ is dead. We thus have $f$ degraded RAID arrays ($R_{1},\dots,R_{f}$), while the other RAID arrays remain safe. An automatic replacement procedure is defined step-by-step below.

    1. Backup your data (this should be obvious: we are playing with degraded arrays since one disk is out of order, therefore any mistake may eventually lead to data loss!). For that purpose, you may use any storage space available that does not belong to the failed disk. The next RAID arrays in the layout ($R_{f+1}$ and following) are fine, for example.
    2. Mark all partitions $p_{i}, i\in[1,f]$ of the broken device as faulty in their corresponding RAID arrays $R_{i}$ and remove them (mdadm --fail --remove).
    3. Remove the failed storage device $d_{f}$.
    4. Insert the new storage device $d'_{k}$.
    5. Partition the new device $d'_{k}$ according to the new layout (fdisk). In particular, the merged partition and the last new-device partition should have the correct sizes: $p'_{f}=c_{f+1}-c_{f-1}$ and $p'_{k}=c'_{k}-c_{k}$. At that stage, we still have $f$ degraded arrays: $R_{1},\dots,R_{f}$.
    6. Replace each failed partition by adding the corresponding new device partition $p'_{i}, i\in[1,f-1]$ to its RAID array $R_{i}$ (mdadm --add). After this step, only $R_{f}$ is a degraded RAID array.
    7. Remove $PV_{f}$ and $PV_{f+1}$ from their corresponding VG (pvmove). LVM will handle that situation quite well, but it requires enough free space in the VG (and time!): it will actually copy the data to other PVs in the (same) VG.
    8. Stop both RAID arrays $R_{f}$ and $R_{f+1}$ corresponding to $PV_{f}$ and $PV_{f+1}$ (mdadm --stop).
    9. Merge (fdisk) partitions $p_{f}$ and $p_{f+1}$ into one single partition $p'_{f}$. This should work fine, since other partitions are not impacted by that. It should be done on each device following the failed device $d_{f}$: that is $n-f$ storage devices in total (device $d'_{k}$ was already partitioned in step 5).
    10. Create a new RAID array $R'_{f}$ from the merged partition $p'_{f}$ (mdadm --create).
    11. Create the corresponding $PV'_{f}$ (pvcreate), and add it to the previous VG (vgextend). At that step, we are back to a safe global storage space: all RAID arrays are now safe. But the layout is not optimal: partition $p'_{k}$ of the new device, for example, is still unused.
    12. Remove $PV_{k+1}$ from its corresponding VG (pvmove). Again, you will need some available storage space.
    13. Stop the corresponding RAID array $R_{k+1}$ (mdadm --stop).
    14. Split the old partition $p_{k+1}$ into the new partitions $p'_{k}$ and $p'_{k+1}$ (fdisk); this should be done on each device following $d_{k}$, that is $n-k$ devices in total. This should not cause any problem: other partitions are not impacted.
    15. Create two new RAID arrays $R'_{k}$ and $R'_{k+1}$ from those 2 new partitions $p'_{k}$ and $p'_{k+1}$ (mdadm --create).
    16. Create $PV'_{k}$ and $PV'_{k+1}$ accordingly (pvcreate). Insert them back into the VG (vgextend).
    17. Finally, add each new device partition $p'_{i}, i\in[f+1,k-1]$ to its corresponding RAID array $R'_{i}$. You will have to grow those RAID arrays so that $dev(R'_{i})=dev(R_{i+1})+1$ (mdadm --grow).
    18. We are back with the new correct layout, with all RAID arrays safe.

    Note that this process focuses on the end-user: it makes the replacement as convenient as possible, sparing the user a long wait between the removal of the failed device and the insertion of the new one: everything that requires manual intervention is done at the beginning. Of course, the time required before the whole pool of RAID arrays runs non-degraded can be quite huge. But it is somewhat transparent from the end-user point of view.

    Replacing a failed drive with a smaller one

    This case is the worst one, for two reasons. First, the global capacity is obviously reduced, since the new device is smaller than the one it replaces. Second, since some bytes of the failed larger drive were used for fault tolerance10, some of those bytes are no longer present in the new device. This has quite a consequence on the practical algorithm, as we will see.

    When a device $d_{f}$ fails, all RAID arrays $R_{i}$, where $i\le f$, become degraded. When we replace the failed device $d_{f}$ by a new device $d'_{k+1}$ whose capacity $c'_{k+1}$ is such that $c_{k}\le c'_{k+1}<c_{k+1}$, with $k+1<f$, then the RAID arrays $R_{i}, i\in[1,k]$ become repaired, but the RAID arrays $R_{i}, i\in[k+1,f]$ remain degraded (see figure 8) because there is not enough storage space in the new device for taking over the failed partitions. (Note that all devices $d_{i}, i\in[k+1,f-1]$ will shift to rank $i+1$ because the new device is added before the failed device $d_{f}$.)

    Figure 8: Replacing a failed device (f) by a smaller one (k), general case before (top) and after (bottom).

    As in the previous case, the solution requires merging partition $p_{f}$ with $p_{f+1}$, since there is no more capacity step at $c_{f}$. Hence, $p'_{f+1}=c_{f+1}-c_{f-1}$ on all devices $d_{i}, i\ge f+1$. Also, the new device $d'_{k+1}$ should be partitioned correctly: in particular, its last partition is $p'_{k+1}=c'_{k+1}-c_{k}$. The devices $d_{i}, i\in[k+1,n]$ should change their partitioning according to that new partition: for those devices, the old partition $p_{k+1}$ gets split into $p'_{k+1}$ and a new partition $p'_{k+2}=c_{k+1}-c'_{k+1}$. The most important modifications concern all the RAID arrays that remain degraded, since their number of (virtual) devices should be decreased by one: the failed device supported one of their ”vertical” partitions, and the new device is not wide enough to take it over. Therefore, $dev(R'_{i})=dev(R_{i-1})-1, \forall i\in[k+2,f]$.

    In summary, old layout:

    \begin{displaymath} p_{1},p_{2},\ldots,p_{k},\ldots,p_{f},\ldots,p_{n}\end{displaymath}

    becomes new layout:

    \begin{displaymath} p'_{1},p'_{2},\ldots,p'_{k},\ldots,p'_{f},\ldots,p'_{n}\end{displaymath}

    with

    \begin{eqnarray*} p'_{i} & = & p_{i}, \forall i\in[1,  k]\\ p'_{k+1} & = & c'_{k+1}-c_{k}\\ p'_{k+2} & = & c_{k+1}-c'_{k+1}\\ p'_{i} & = & p_{i-1}, \forall i\in[k+3,  f]\\ p'_{f+1} & = & c_{f+1}-c_{f-1}\\ p'_{i} & = & p_{i}, \forall i\in[f+2,  n]\\ dev(R'_{i}) & = & dev(R_{i-1})-1, \forall i\in[k+2,  f]\end{eqnarray*}

    Unfortunately, as far as we know, it is not (currently) possible to shrink a RAID device using Linux RAID. The only option is to remove the whole set of affected arrays entirely, and to create new ones with the correct number of devices. An automatic replacement procedure is therefore defined step-by-step below:

    1. Backup your data! 😉
    2. Mark all partitions $p_{i}, i\in[1,f]$ of the broken device as faulty in their corresponding RAID arrays $R_{i}$ and remove them (mdadm --fail --remove).
    3. Remove the failed storage device $d_{f}$.
    4. Insert the new storage device $d'_{k+1}$.
    5. Partition the new device according to the new layout (fdisk). In particular, its last partition should have the correct size: $p'_{k+1}=c'_{k+1}-c_{k}$. At that stage we still have $f$ degraded RAID arrays: $R_{1},\dots,R_{f}$.
    6. Replace the faulty partitions by adding the new device ones, $p'_{i}, i\in[1,k]$, to their respective arrays $R_{i}$ (mdadm --add). After this step, $R_{k+1},\dots,R_{f}$ are still degraded, that is $f-k$ RAID arrays in total. Two RAID arrays are also made of wrong-sized partitions: $R_{k+1}$ and $R_{f+1}$.
    7. For each array $R_{i}, i\in[k+2,f]$:
      1. Move the data corresponding to $R_{i}$ to other devices (pvmove on the related LVM physical volume $PV_{i}$);
      2. Remove the corresponding LVM physical volume $PV_{i}$ from its volume group (pvremove);
      3. Stop the related array $R_{i}$ (mdadm --stop);
      4. Create a new RAID array from partition $p_{i}$ (except for $i=f$, whose partition is about to be merged in step 12). Note that the new array is made of one less partition: $dev(R_{i})-1$ (mdadm --create);
      5. Create the corresponding LVM physical volume (pvcreate);
      6. Add that new LVM physical volume to its related volume group (vgextend).
    8. At this step, $R_{k+1}$ and $R_{f+1}$ are still made of the wrong-sized old partitions $p_{k+1}$ and $p_{f+1}$.
    9. Move the data corresponding to $R_{f+1}$ to other devices (pvmove on the related LVM volume $PV_{f+1}$);
    10. Remove the corresponding LVM volume $PV_{f+1}$ from its volume group (pvremove);
    11. Stop the related array $R_{f+1}$ (mdadm --stop);
    12. Merge (fdisk) the old partitions $p_{f}$ and $p_{f+1}$ into one single partition $p'_{f+1}$. This should work fine, since other partitions are not impacted by that. It should be done on each device following the failed device $d_{f}$: that is $n-f$ storage devices in total.
    13. Create a new RAID array $R'_{f+1}$ from the merged partition $p'_{f+1}$ (mdadm --create).
    14. Create the corresponding $PV'_{f+1}$ (pvcreate), and add it to the previous VG (vgextend). At that step, only $R_{k+1}$ remains wrong-sized and degraded.
    15. Move the data corresponding to $R_{k+1}$ to other devices (pvmove on the related LVM volume $PV_{k+1}$).
    16. Remove the corresponding LVM volume $PV_{k+1}$ from its volume group (pvremove);
    17. Stop the related array $R_{k+1}$ (mdadm --stop);
    18. Split (fdisk) the old partition $p_{k+1}$ into the new partitions $p'_{k+1}$ and $p'_{k+2}$. This should be done on all the devices that carry the old partition $p_{k+1}$, that is $n-k-1$ devices in total.
    19. Create (mdadm --create) the new RAID arrays $R'_{k+1}$ and $R'_{k+2}$ from partitions $p'_{k+1}$ and $p'_{k+2}$;
    20. Create (pvcreate) the corresponding $PV'_{k+1}$ and $PV'_{k+2}$ and add them (vgextend) to their corresponding VG.
    21. You are back with the new correct layout, with all RAID arrays safe.

    Note that step 7 is done one array at a time. The main idea is to reduce the amount of available storage space required by the algorithm. Another option is to remove all the LVM volumes (PV) at the same time from their related VG, then to remove their corresponding RAID arrays, and then to recreate them with the correct number of partitions (it should be reduced by one). Removing all those arrays in one go may result in a big reduction of the available storage space that might block the whole process while removing PVs from their corresponding VG. Since such a removal results in the move of the data from one PV to others (in the same VG), it also requires that there is enough free space in that VG to accommodate the full copy.

    On the other hand, the algorithm described may result in a vast amount of data transfer. For example, suppose that all PVs are actually in a single VG. The removal of the first PV in the list may result in the move of its data to the next PV. Unfortunately, on the next iteration, that PV will also be removed, resulting in the transfer of the same data once more, and so on. An investigation into a smarter algorithm for that specific step 7 is therefore a must.

    RAID array reconstruction

    Given the size of current hard drives, and the Unrecoverable Bit Error (UBE) rate — typically one error per $10^{15}$ bits read for enterprise class disk drives (SCSI, FC, SAS) and one per $10^{14}$ bits read for desktop class disk drives (IDE/ATA/PATA, SATA) — the reconstruction of a disk array after the failure of a device can be quite challenging. When the array is in degraded mode, during reconstruction, it tries to get data from the remaining devices. But with today's large device capacities, the probability of an error during that step becomes significant. In particular, there is a trend for large RAID5 groups to be unrecoverable after a single disk failure. Hence the design of RAID6, which can handle 2 simultaneous disk failures, but at the cost of a very high write performance hit.
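
    The risk can be estimated with a back-of-the-envelope computation: reading $N$ bits with an unrecoverable bit error rate of $r$ gives a probability of roughly $1-(1-r)^{N}$ of hitting at least one error during the rebuild. An illustrative sketch (ours, using the order-of-magnitude figures quoted above):

        # Probability of hitting at least one unrecoverable bit error while
        # re-reading the surviving members of an array during reconstruction.
        def rebuild_error_probability(bytes_read, bit_error_rate):
            bits = bytes_read * 8
            return 1.0 - (1.0 - bit_error_rate) ** bits

        to_read = 3 * 2 * 10**12    # RAID5 over 4 x 2 Tb partitions: 3 members re-read
        for label, rate in (("desktop (1e-14)", 1e-14), ("enterprise (1e-15)", 1e-15)):
            p = rebuild_error_probability(to_read, rate)
            print("%s: %.0f%% chance of an UBE during rebuild" % (label, 100 * p))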

    Instead of setting up large RAID5 groups, it might be preferable to set up large sets of RAID10 arrays. This gives better results both in terms of reliability (RAID1 is far easier to recover than RAID5) and performance. But the high storage cost — 50% of space lost — often makes this choice irrelevant despite the low cost per megabyte today.

    With PROUHD, given that the wasted space is minimal, the RAID10 option might be an acceptable compromise (compared to the traditional RAID layout, of course).

    Moreover, in PROUHD, RAID components do not cover entire drives but only a portion of them (a partition). Therefore, the probability of another sector error hitting the same array is reduced.

    Adding/removing a device to/from a PROUHD

    As shown by figure 9, adding a new device to the pool is much simpler than the previous replacement cases. The last partition of the new device (inserted at rank $k+1$, i.e. such that $c_{k}\le c'_{k+1}<c_{k+1}$) impacts the previous layout:

    \begin{eqnarray*} p'_{k+1} & = & c'_{k+1}-c_{k}=c'_{k+1}-c'_{k}\\ p'_{k+2} & = & c_{k+1}-c'_{k+1}=c'_{k+2}-c'_{k+1}\end{eqnarray*}

    And all the RAID arrays $R_{1}$ up to $R_{k}$ should see their number of devices increased by one:

    \begin{displaymath} dev(R'_{i})=dev(R_{i})+1, \forall i\in[1,  k]\end{displaymath}

    Figure 9: Adding a device (k) to the pool, general case before (left) and after (right).

    The reverse operation is also much simpler than any replacement procedure, as shown by figure 10. Removing a device $d_{k}$ from the pool also leads to a modification of its related partition $p_{k}$:

    \begin{eqnarray*} p'_{k} & = & c_{k+1}-c_{k-1}=c'_{k}-c'_{k-1}\end{eqnarray*}

    And all the RAID arrays $R_{1}$ up to $R_{k-1}$ should see their number of devices decreased by one:

    \begin{displaymath} dev(R'_{i})=dev(R_{i})-1, \forall i\in[1,  k-1]\end{displaymath}

    Figure 10: Removing a device (k) from the pool, general case before (left) and after (right).

    Both step-by-step algorithms are quite straightforward compared to the replacement ones. They are therefore left to the reader's curiosity.

    Forecasting: Storage Box for Average End-Users

    Taken individually, each storage device answers some requirements the end-user had at one time (for example, a camera needs an XD card). But often, new storage devices are added to the pool for various reasons (a new camera without XD card support, a new USB disk for more storage space, …). The end-user ends up having a global storage space composed of individual disconnected components. Some devices still need their context to be useful (the new camera and its new SD card). But others may not be used any more, even if they still work (the old XD card).

    This study shows that a storage box can be provided with the following features:

    • provides a global storage space, made of any physical storage devices of any size and any technology (disk, SSD, flash, USB sticks, SD cards, XD cards, and so on);
    • supports disk addition, removal and replacement;
    • supports any RAID level;
    • supports a mixture of RAID levels;
    • supports fault tolerance up to a degree that depends on the RAID levels used;
    • when used properly, the box can deliver high performance (for example, if 2 RAID arrays are never used simultaneously);
    • offers good performance for average end-user needs (such as media streaming);
    • is very efficient in terms of storage usage: any single byte can be used (either for storage or for fault tolerance, depending on the user's specific needs). Said otherwise, the storage box reduces the wasted space to the bare minimum (that space can still be used for storing data, but fault tolerance is not supported in such a case).

    Of course, the complexity of our solution has to be hidden from the end-user. As an example, imagine a storage box composed of a vast number of connectors for USB drives and sticks, Firewire disks, SATA/SCSI disks, XD/SD cards and all others, that implements the presented solution. On initialization, when all devices have been connected, the software will detect all storage devices and will propose simple configurations such as the following (a sketch of the selection logic is given after the list):

    • maximize space (choose RAID5 when possible, then RAID10, then RAID1);
    • maximize performance (choose RAID10 when possible, then RAID1);
    • safe config (choose RAID10 when possible, RAID5, then RAID1);
    • custom config.
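
    Such presets boil down to a preference order applied to each vertical RAID array, depending on how many partitions it spans. A possible sketch (ours; the preset names and preference orders are illustrative only):

        # Pick a RAID level for an array of n_partitions, according to a preset.
        # A level is usable only if the array has enough members
        # (RAID5 needs 3, RAID10 needs 4, RAID1 needs 2).
        MINIMUM_MEMBERS = {0: 2, 1: 2, 5: 3, 10: 4}
        PRESETS = {
            "maximize space":       [5, 10, 1],
            "maximize performance": [10, 1],
            "safe config":          [10, 5, 1],
        }

        def choose_level(preset, n_partitions):
            for level in PRESETS[preset]:
                if n_partitions >= MINIMUM_MEMBERS[level]:
                    return level
            return None                      # nothing suitable: leave the array unused

        for preset in PRESETS:
            print(preset, "->", [choose_level(preset, n) for n in (2, 3, 4)])
        # maximize space -> [1, 5, 5]; maximize performance -> [1, 1, 10];
        # safe config -> [1, 5, 10]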

    Presenting those configurations graphically, enabling configuration comparisons, and proposing pre-defined configurations for well-known workloads (multimedia files, system files, log files and so on) will add to the value of the initial solution.

    Finally, the performance (and cost) of such storage boxes will mainly depend on the actual number of controllers. Concurrent requests (which RAID naturally increases) are best served when they come from different controllers.

    Questions, Comments & Suggestions

    If you have any question, comment, and/or suggestion on this document, feel free to contact me at the following address: pierre@vigneras.name.

    Acknowledgment

    The author would like to thank Lubos Rendek for publishing this work and Pascal Grange for his valuable comments and suggestions.


    Footnotes

    … RAID1
    For an introduction on RAID technology, please refer to online articles such as:

    http://en.wikipedia.org/wiki/Standard_RAID_levels

    … article2
    http://www.vigneras.org/pierre/wp/2009/07/21/choosing-the-right-file-system-layout-under-linux/
    … spares3
    By the way, since similar disks may fail at similar time, it may be better to create storage pools from disks of different model or even vendor.
    … Volume4
    This comes from the LVM terminology which is often used with RAID on Linux.
    … 15
    This is the worst case and the one that should be taken into account. Of course, disks hda and hdc may fail, for example, and the PV will remain available, but the best case is not the one that represents the fault tolerance degree.
    … tolerance6
    Note that this is independent of the actual RAID level chosen: each byte in a RAID array is used, either for storage or for fault tolerance. In the example, using RAID1, we only get 1 Tb out of 8 Tb and it may look like a waste. But if RAID1 is chosen for such an array, it actually means that a fault tolerance degree of 3 is required. And such a fault tolerance degree has a storage cost!
    … RAID57
    From the available storage space point of view, RAID5 consumes one partition for fault tolerance. When only 2 partitions are available, RAID1 is the only option with fault tolerance, and it also consumes one partition for that purpose. Therefore, from a maximum available storage space perspective, a 2-device RAID1 array is considered as a RAID5 array.
    8
    RAID0 is only presented if option -unsafe is specified. RAID6 and other RAID levels are not implemented currently. Any help is welcome! 😉
    … parted9
    See http://www.gnu.org/software/parted/index.shtml
    … tolerance10
    Unless RAID0 was used, but in that case, the situation is even worse!

    Copyrights

    This document is licensed under a Creative Commons Attribution-Share Alike 2.0 France License. Please, see for details: http://creativecommons.org/licenses/by-sa/2.0/

    Disclaimer

    The information contained in this document is for general information purposes only. The information is provided by Pierre Vignéras and while I endeavor to keep the information up to date and correct, I make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the document or the information, products, services, or related graphics contained in the document for any purpose.

    Any reliance you place on such information is therefore strictly at your own risk. In no event will I be liable for any loss or damage including, without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this document.

    Through this document you are able to link to other documents which are not under the control of Pierre Vignéras. I have no control over the nature, content and availability of those sites. The inclusion of any links does not necessarily imply a recommendation or an endorsement of the views expressed within them.