Data Integrity

What are the challenges of wide-area network and data management at the petabyte scale?

Operating a large distributed storage facility for data-intensive collaboration involves many challenges. Reliability, performance, and ease of access are obviously important. L-Store has demonstrated its ability to provide these attributes in its use by application domains such as CMS and the Vanderbilt TV News archive. Below we discuss other challenges that our experience has taught us are important.

Any time one changes a parameter by an order of magnitude, new challenges arise. Managing multi-petabyte data sets is no different. Moving a few terabytes of data between remote sites can be done in under a day with existing tools and hardware. Ten Gb/s or greater networking is not required, and one can most likely use traditional hardware- or software-based RAID systems to ensure data integrity. At the petabyte scale, reliable high-performance networks become vital, and standard data integrity solutions such as RAID5 and even RAID6 become less viable.

Why is data integrity such a big issue at petabyte scales?

Data integrity becomes a much more significant issue at petabyte scales due to the increased probability of data loss and longer array rebuild times. At these scales, unrecoverable read errors start to become statistically significant.

Traditional RAID array rebuild times can span days when using multi-terabyte hard drives. One can overcome this by using a distributed RAID array, as discussed later.
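A back-of-the-envelope sketch shows why unrecoverable read errors (UREs) become statistically significant at scale. The URE rate used below (1e-14 per bit read, a common consumer-drive spec; enterprise drives are often rated 1e-15) is our assumption, not a figure from the text:

```python
import math

# Assumed URE rate: 1e-14 errors per bit read (typical consumer-drive spec).
URE_RATE = 1e-14

def p_at_least_one_ure(bytes_read: float) -> float:
    """Probability of at least one URE when reading bytes_read bytes,
    treating bit errors as independent."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-URE_RATE))

TB = 1e12
print(f"reading  1 TB: P(URE) ~ {p_at_least_one_ure(1 * TB):.3f}")
print(f"reading 12 TB: P(URE) ~ {p_at_least_one_ure(12 * TB):.3f}")
```

Under this assumption the chance of hitting at least one URE while reading a single modern multi-terabyte drive end to end is already tens of percent, which is why a petabyte-scale system must plan for read errors as routine events rather than rarities.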

When modeling failure modes it is usually assumed that drive failures are uncorrelated, but in our experience that is not the case. Typically the drives in an array are not just from the same build batch but often have sequential serial numbers. Manufacturing defects or firmware bugs lead to systematic, correlated problems. In addition, because rack space is typically a precious commodity, one tends to use dense disk enclosures, which can suffer from mechanical vibrations and broken circuit traces on connectors.

Probability of data loss vs space efficiency

Probability of data loss vs. space efficiency for various data integrity schemes. Space efficiency is defined as (unique data)/(data+parity). We assumed a 6% annualized probability of disk failure and a 24-hour rebuild time; changing this assumption changes the scale of the y-axis but not the relative positions of the results for each scheme. The ideal configuration is in the lower right corner: a low chance of data loss and good use of space. Mirror-xN signifies N replicas of the data. RAID5-N and RAID6-N correspond to N total drives, data and parity, used in the array. RAID6-Nx2 corresponds to RAID6 arrays using N total drives that are then replicated. RS-D+P represents generic Reed-Solomon using D data disks and P parity disks. The y-axis should be scaled by the number of drive arrays to calculate the actual probability. For a fixed storage capacity (1 petabyte, for example), the number of drive arrays required will vary depending on the data integrity scheme used.

Data Integrity Schemes

Data integrity schemes typically employ either data mirroring or some form of RAID. Data mirroring is nothing more than keeping multiple copies of the data stored on different devices at different locations. Mirroring is great for read-dominated workloads since all replicas can be used to service user requests. Write performance is slowed by the making of all the replicas. Mirroring is not very space efficient, which can be quite costly at the petabyte scale. It is also not very good at providing data integrity, as can be seen from the figure above. We assumed a 6% annualized failure probability and a drive rebuild time of 24 hours in generating this plot; changing this probability changes the scale of the y-axis, but the relative positions of the different schemes do not change.
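The qualitative ordering in the figure can be sketched with a simplified binomial failure model. This is our own approximation under the stated assumptions (6% AFR, 24-hour rebuild, independent failures during the rebuild window); the exact model behind the plot may differ:

```python
from math import comb

AFR = 0.06        # annualized failure rate per drive (from the text)
REBUILD_H = 24.0  # rebuild window in hours (from the text)

def p_fail_in_window(hours):
    # Per-drive probability of failing within the rebuild window,
    # scaling the annual rate down to the window length.
    return 1 - (1 - AFR) ** (hours / 8760.0)

def p_loss_during_rebuild(n_total, parity, hours):
    """P(data loss): one drive has already failed; the array is lost if at
    least `parity` more of the remaining n_total-1 drives fail mid-rebuild."""
    p = p_fail_in_window(hours)
    n = n_total - 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(parity, n + 1))

def space_efficiency(data, par):
    return data / (data + par)

for name, d, par in [("Mirror-x3", 1, 2), ("RAID5-6", 5, 1),
                     ("RAID6-8", 6, 2), ("RS-6+3", 6, 3)]:
    print(f"{name:9s} efficiency={space_efficiency(d, par):.2f} "
          f"P(loss)={p_loss_during_rebuild(d + par, par, REBUILD_H):.1e}")
```

Even this crude model reproduces the figure's message: RAID5 is orders of magnitude more fragile than RAID6, which in turn is orders of magnitude more fragile than RS-6+3, while RS-6+3 still keeps two thirds of the raw capacity for data.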

The other standard approach is to employ some form of RAID. In this case one stores the data along with additional parity information that can be used to reconstruct unrecoverable bit errors or failed drives. The two most common forms of RAID are RAID5 and RAID6. RAID5 uses one additional disk for storing parity and can survive a single device failure. RAID6 has two additional disks for parity and can survive two drive failures. Reed-Solomon encoding is the generalization of RAID5 and RAID6 to support arbitrary numbers of drive failures. There are other RAID encoding schemes, but these are the ones we will focus on without loss of generality.
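The core RAID5 idea can be shown in a few lines: the parity block is the bitwise XOR of the data blocks, so any single missing block is recoverable from the survivors. (RAID6 and general Reed-Solomon use Galois-field arithmetic to tolerate more failures; this sketch covers only the single-parity case, with purely illustrative block contents.)

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (the RAID5 parity operation)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Three data blocks plus one parity block (values purely illustrative)
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing data[1]; XORing the survivors with the parity recovers it,
# because each lost bit cancels out of the combined XOR.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt:", rebuilt)
```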

L-Store has implemented generic Reed-Solomon encoding along with several other data integrity schemes. We typically recommend 6+3 Reed-Solomon (RS-6+3) encoding – 6 data disks and 3 parity disks. More generally we recommend the use of the RS-d+(d/2) family of configurations which can be seen in the figure above as RS-6+3, RS-8+4, and RS-10+5. This family uses 2/3 of the total space for data with the remainder for parity and has excellent reliability. RS-6+3 has the same reliability as keeping 3 copies of the data but uses only half the space.
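The space claim in the comparison above is simple arithmetic, sketched here for concreteness:

```python
# RS-6+3 stores 6 data + 3 parity blocks: 2/3 of raw capacity is unique data.
rs_raw_per_byte = (6 + 3) / 6      # raw bytes stored per byte of unique data
mirror_raw_per_byte = 3            # 3 replicas: 3 raw bytes per unique byte

print(rs_raw_per_byte)                        # raw overhead of RS-6+3
print(rs_raw_per_byte / mirror_raw_per_byte)  # RS-6+3 vs 3 copies
```

RS-6+3 needs 1.5 raw bytes per stored byte versus 3 for triple mirroring, i.e. half the space for the same failure tolerance of three lost devices.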

Probability of data loss vs. rebuild time assuming a 6% AFR

Probability of additional drive failures occurring during the initial drive rebuild, leading to data loss, for three common RAID configurations. Ideally repairs should take as little time as possible, but keeping them down to a few hours is almost as good for RS-6+3. The plot assumes a 6% Annualized Failure Rate (AFR) based on the literature. The probability given is for a single array.

Array Rebuild Times

As alluded to previously, recovering from the failure of a multi-terabyte disk drive can take days. We have witnessed this first hand in a 500 terabyte GPFS file system used to provide home disk space for a large campus cluster. The possibility of correlated drive failures (for the reasons described in the introduction of Section 3.1) increases the probability that additional drive failures could occur, resulting in unrecoverable data loss. The above figure shows the probability of data loss due to the loss of additional disks as a function of the rebuild time for three RAID configurations. RAID5-6 provides very little protection in this scenario. The RAID6-8 configuration is better since two additional drive failures are required in order to lose data. Even so, the chances of data loss are uncomfortably high if the repair takes a few days. RS-6+3 is substantially more reliable even if repair takes a few days, but keeping rebuild times down to a few hours helps minimize the chance of data loss. The graph shows a distinct knee for this configuration occurring in the first couple of hours. The plot assumes a 6% Annualized Failure Rate (AFR), which is taken from the literature (Pinheiro 2007; Schroder 2007) for 3- and 4-year AFR estimates.
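The dependence on rebuild time can be approximated with the same simplified binomial model (our assumption; the figure's exact model may differ): shrink the window, and the per-drive failure probability inside it shrinks, which is then raised to the power of the number of additional failures the scheme tolerates.

```python
from math import comb

AFR = 0.06  # annualized failure rate per drive (from the text)

def p_loss(n_total, parity, hours):
    """One drive has already failed; data is lost if at least `parity` more
    of the remaining n_total-1 drives fail before the rebuild finishes."""
    p = 1 - (1 - AFR) ** (hours / 8760.0)
    n = n_total - 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(parity, n + 1))

for hours in (2, 12, 24, 72):
    print(f"{hours:3d} h  RAID5-6: {p_loss(6, 1, hours):.1e}  "
          f"RAID6-8: {p_loss(8, 2, hours):.1e}  "
          f"RS-6+3: {p_loss(9, 3, hours):.1e}")
```

Because RS-6+3 requires three simultaneous additional failures, its loss probability scales roughly with the cube of the window length, which is why cutting the rebuild from days to hours buys so much more for it than for RAID5.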

Diagram showing a traditional RAID array rebuild vs using a distributed approach


Distributed RAID Arrays

Traditional RAID arrays completely reconstruct a single failed drive on a single replacement drive. The use of a single replacement drive is a major factor in the rebuild time. Traditionally the entire drive is reconstructed with no regard for used vs. unused space. If the array is active with user I/O requests, this will greatly increase the rebuild time.

Distributed RAID arrays are designed to overcome these limitations. Instead of using the whole disk, each disk is broken up into many smaller blocks. These blocks are combined with blocks on other disks, creating many small logical RAID arrays that utilize a large subset of the available drives, as shown in the figure. These distributed logical RAID arrays are based on space that is actually used. The free space on each drive can be used to store the newly reconstructed data. This allows a large number of drives to be read and written simultaneously, providing significantly faster rebuild times. For L-Store, each file and its associated parity is placed on a random collection of drives based on the fault tolerance scheme used and the data placement criteria. We routinely rebuild a single 2TB drive in a couple of hours using a single host to perform the data reconstruction. Adding more hosts decreases the rebuild time proportionately.
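A rough throughput model shows why spreading the rebuild helps. The per-drive throughput below is an illustrative assumption, not a measurement from the text, and the model ignores competing user I/O and host-side bottlenecks:

```python
# Assumed per-drive sustained throughput; illustrative, not a measured figure.
DRIVE_MBPS = 120.0

def rebuild_hours(used_tb, parallel_drives):
    """Hours to reconstruct used_tb of data when the write work is spread
    across parallel_drives drives (traditional rebuild: 1 replacement drive)."""
    seconds = used_tb * 1e6 / (DRIVE_MBPS * parallel_drives)
    return seconds / 3600.0

print(f"traditional, 1 replacement drive: {rebuild_hours(2.0, 1):.1f} h")
print(f"distributed over 20 drives:       {rebuild_hours(2.0, 20):.2f} h")
```

In a traditional rebuild the single replacement drive's write bandwidth is the hard ceiling; in the distributed case the free space on many drives absorbs the writes in parallel, so the rebuild window shrinks roughly linearly with the number of participating drives until some other resource (network, host CPU) saturates.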