The operation of a large distributed storage facility for data-intensive collaboration involves many challenges. Reliability, performance, and ease of access are obviously important. L-Store has demonstrated its ability to provide these attributes in its use by application domains such as CMS and the Vanderbilt TV News archive. We discuss below other challenges that our experience has taught us are important.
Any time one changes a parameter by an order of magnitude, new challenges arise. Managing multi-petabyte data sets is no different. Moving a few terabytes of data between remote sites can be done in under a day with existing tools and hardware. Ten Gb/s or greater networking is not required, and one can most likely use traditional hardware- or software-based RAID systems to ensure data integrity. At the petabyte scale, reliable high-performance networks become vital, and standard data integrity solutions such as RAID5 and even RAID6 become less viable.
Data integrity becomes a much more significant issue at petabyte scales due to the increasing probability of data loss and the lengthening of array rebuild times. At these scales, unrecoverable read errors start to become statistically significant.
Traditional RAID array rebuild times can span days when using multi-terabyte hard drives. One can overcome this by using a distributed RAID array, which is discussed later.
When modeling failure modes it is usually assumed that drive failures are uncorrelated, but in our experience that is not the case. Typically all the drives in an array are not just from the same build batch but will have sequential serial numbers. Manufacturing defects or firmware bugs lead to systematic and correlated problems. In addition, because rack space is typically a precious commodity, one tends to use dense disk enclosures, which can suffer from mechanical vibrations and broken circuit traces on connectors.
Data integrity schemes typically employ either data mirroring or some form of RAID. Data mirroring is nothing more than keeping multiple copies of the data stored on different devices at different locations. Mirroring is great for read-dominated workloads since all replicas can be used to service user requests. Write performance, however, is slowed by the need to update every replica. Mirroring is not very space efficient, which can be quite costly at the petabyte scale. It is also not very good at providing data integrity, as can be seen from the above figure. We assumed a 6% annualized failure probability and a drive rebuild time of 24 hours in generating this plot. Changing this probability will change the scale of the y-axis, but the relative positions of the different schemes will not change.
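The mirroring curve in the figure can be approximated with a simple back-of-envelope model: after one replica fails, data is lost only if every remaining replica also fails before the copy is restored. The sketch below is a minimal illustration assuming the 6% AFR and 24-hour rebuild window quoted above, and assuming independent failures, which, as noted earlier, is optimistic for real arrays.

```python
# Rough data-loss model for n-way mirroring (an illustration,
# not the exact model behind the paper's figure). Assumes a 6%
# annualized failure rate (AFR) and independent drive failures.
AFR = 0.06
HOURS_PER_YEAR = 24 * 365

def p_fail(window_hours, afr=AFR):
    """Probability a single drive fails during the given window."""
    return afr * window_hours / HOURS_PER_YEAR

def p_mirror_loss(replicas, rebuild_hours=24):
    """After the first failure, data is lost only if every one of
    the remaining replicas also fails before the rebuild completes."""
    return p_fail(rebuild_hours) ** (replicas - 1)

print(p_mirror_loss(2))  # 2 copies
print(p_mirror_loss(3))  # 3 copies: dramatically safer, at 3x the space
```

Each extra replica multiplies the loss probability by another small factor, which is why the mirroring curves in the figure separate so widely, but each replica also costs a full copy of the data.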
The other standard approach is to employ some form of RAID. In this case one stores the data along with additional parity information that can be used to reconstruct unrecoverable bit errors or failed drives. The two most common forms of RAID are RAID5 and RAID6. RAID5 uses one additional disk for storing parity and can survive a single device failure. RAID6 has two additional disks for parity and can survive two drive failures. Reed-Solomon encoding is the generalization of RAID5 and RAID6 to support arbitrary numbers of drive failures. There are other RAID encoding schemes, but these are the ones we will focus on without loss of generality.
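The single-parity case is easy to see concretely: a RAID5-style parity block is the XOR of the data blocks, so any one lost block can be recovered by XOR-ing the survivors. The sketch below demonstrates only this single-failure case; Reed-Solomon generalizes it to multiple parity blocks using finite-field arithmetic, which we do not reproduce here.

```python
# Minimal illustration of RAID5-style single parity: the parity
# block is the XOR of the data blocks, so any one lost block can
# be reconstructed by XOR-ing the surviving blocks together.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"ABCD", b"EFGH", b"IJKL"]   # three data blocks
parity = xor_blocks(data)            # one parity block

# Simulate losing data block 1 and rebuilding it from survivors:
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'EFGH'
```

Because XOR is its own inverse, the XOR of all data blocks plus the parity block is zero; rebuilding simply solves that equation for the missing block.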
L-Store has implemented generic Reed-Solomon encoding along with several other data integrity schemes. We typically recommend 6+3 Reed-Solomon (RS-6+3) encoding – 6 data disks and 3 parity disks. More generally we recommend the use of the RS-d+(d/2) family of configurations which can be seen in the figure above as RS-6+3, RS-8+4, and RS-10+5. This family uses 2/3 of the total space for data with the remainder for parity and has excellent reliability. RS-6+3 has the same reliability as keeping 3 copies of the data but uses only half the space.
As alluded to previously, recovering from the failure of a multi-terabyte disk drive can take days. We have witnessed this first hand in a 500 terabyte GPFS file system used to provide home disk space for a large campus cluster. The possibility of correlated drive failures (for the reasons described in the introduction of Section 3.1) increases the probability that additional drive failures could occur, resulting in unrecoverable data loss. The above figure shows the probability of data loss due to the loss of additional disks as a function of the rebuild time for three RAID configurations. RAID5-6 provides very little protection in this scenario. The RAID6-8 configuration is better since two additional drive failures are required in order to lose data. Even so, the chances of data loss are uncomfortably high if the repair takes a few days. The RS-6+3 configuration is substantially more reliable even if repair takes a few days, but keeping rebuild times down to a few hours helps minimize the chance of data loss. The graph shows a distinct knee for this configuration occurring in the first couple of hours. The plot assumes a 6% Annualized Failure Rate (AFR), which is taken from the literature (Pinheiro 2007; Schroder 2007) for 3 and 4 year AFR estimates.
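The shape of those curves can be reproduced with a back-of-envelope binomial model: after the first failure, data is lost if more of the surviving stripe members fail during the rebuild window than the remaining parity can absorb. The sketch below assumes independent failures at the 6% AFR used in the plot; the group sizes match the configurations named above, but the model itself is our illustration, not the paper's exact calculation.

```python
# Back-of-envelope rebuild-window risk: after one drive fails,
# data is lost if more than `spare_parity` of the surviving
# stripe members fail before the rebuild finishes. Assumes
# independent failures at a 6% AFR.
from math import comb

AFR = 0.06
HOURS_PER_YEAR = 24 * 365

def p_loss(survivors, spare_parity, rebuild_hours):
    """P(more than spare_parity of survivors fail in the window)."""
    p = AFR * rebuild_hours / HOURS_PER_YEAR
    return sum(comb(survivors, k) * p**k * (1 - p)**(survivors - k)
               for k in range(spare_parity + 1, survivors + 1))

for hours in (2, 24, 72):
    raid5 = p_loss(5, 0, hours)   # RAID5-6: 1 lost, no parity left
    raid6 = p_loss(7, 1, hours)   # RAID6-8: 1 lost, 1 parity left
    rs63 = p_loss(8, 2, hours)    # RS-6+3: 1 lost, 2 parity left
    print(f"{hours:3d} h  RAID5-6: {raid5:.2e}  "
          f"RAID6-8: {raid6:.2e}  RS-6+3: {rs63:.2e}")
```

Because each spare parity drive multiplies the loss probability by roughly another factor of the (small) per-drive window probability, the RS-6+3 curve sits orders of magnitude below RAID5-6 and flattens out once the rebuild completes within a few hours, matching the knee visible in the figure.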
Traditional RAID arrays completely reconstruct a single failed drive on a single replacement drive. The use of a single replacement drive is a major factor in the rebuild time. Traditionally the entire drive is reconstructed with no regard for used vs. unused space. If the array is active with user I/O requests, this will greatly increase the rebuild time.
Distributed RAID arrays are designed to overcome these limitations. Instead of using the whole disk, each disk is broken up into many smaller blocks. These blocks are combined with blocks on other disks, creating many small logical RAID arrays utilizing a large subset of the available drives, as shown in the figure. These distributed logical RAID arrays are based on space that is actually used. The free space on each drive can be used to store the newly reconstructed data. This allows a large number of drives to be read and written simultaneously, providing significantly faster rebuild times. For L-Store, each file and its associated parity is placed on a random collection of drives based on the fault tolerance scheme used and data placement criteria. We routinely rebuild a single 2 TB drive in a couple of hours using a single host to perform the data reconstruction. Adding more hosts causes the rebuild time to decrease proportionately.
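The rebuild-time advantage follows directly from the bandwidth accounting: a traditional rebuild funnels the entire drive through one replacement drive, while a distributed rebuild writes only the used blocks in parallel across many drives' free space. The sketch below makes that arithmetic explicit; the drive counts and the 100 MB/s per-drive throughput are illustrative assumptions, not measurements, and real rebuilds are further slowed by concurrent user I/O.

```python
# Illustrative rebuild-time arithmetic for traditional vs.
# distributed RAID. Per-drive throughput (100 MB/s) and drive
# counts are assumptions for the sake of the comparison.

def traditional_rebuild_hours(drive_tb, write_mb_s=100):
    """Whole drive rewritten through one replacement drive's
    bandwidth, regardless of how much space is actually used."""
    return drive_tb * 1e6 / write_mb_s / 3600

def distributed_rebuild_hours(used_tb, drives, per_drive_mb_s=100):
    """Only the used space is rebuilt, spread in parallel across
    the free space of many drives."""
    return used_tb * 1e6 / (drives * per_drive_mb_s) / 3600

print(traditional_rebuild_hours(2))        # ~5.6 h, even with no user I/O
print(distributed_rebuild_hours(1.5, 20))  # parallel rebuild: well under 1 h
```

Doubling the number of participating drives (or, as the text notes for L-Store, the number of reconstruction hosts) halves the rebuild time, which is what pushes the system back toward the safe, flat region of the data-loss curves discussed above.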