What is L-Store?

L-Store provides a flexible logistical storage framework for distributed and scalable access to data for a wide spectrum of users. L-Store is designed to provide: virtually unlimited scalability in raw storage; support for arbitrary metadata associated with each file; user-controlled fault tolerance and data reliability at the file and directory level; scalable performance in raw data movement; a virtual file system interface with both a native Linux mount (exportable via NFS and CIFS to other platforms) and a high performance command line interface; and support for the geographical distribution and migration of data to facilitate quick access. These features are accomplished by segregating directory and metadata services from data transfer. Logistical Networking is used as the block-level data abstraction for storage, with L-Store clients performing all data I/O directly rather than going through an intermediate server as is done for NFS and CIFS. This means adding storage adds both capacity and bandwidth.

Logistical Networking

Logistical Networking (LN) is a distributed storage management technology expressly designed to deal with the current flood of data. To make shared storage infrastructure a ubiquitous part of the underlying communication fabric, LN builds on a highly generic, best-effort storage service called the Internet Backplane Protocol (IBP), designed by analogy with the Internet Protocol (IP) to produce a common storage service that maximizes interoperability and scalability. IBP was developed by the Logistical Computing and Internetworking (LoCI) Lab at the University of Tennessee, Knoxville. IBP enables the movement of large data sets via the simultaneous transfer of data fragments rather than requiring the sequential transfer of the entire data set or file. Mirroring, data striping, fault tolerance, and recovery features are also supported by IBP.
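
As a rough illustration of this fragment-parallel style of transfer, the sketch below fetches byte ranges of an object concurrently from several depots and reassembles them in order. It is a minimal sketch under assumed names: fetch_fragment, the fragment size, and the depot list are hypothetical placeholders, not the IBP client API.

    # Minimal sketch of fragment-parallel transfer. fetch_fragment() and the
    # depot list are hypothetical placeholders, not the real IBP client API.
    from concurrent.futures import ThreadPoolExecutor

    FRAGMENT = 64 * 1024 * 1024  # 64 MiB per fragment (arbitrary choice)

    def fetch_fragment(depot, offset, length):
        """Stand-in for reading one byte range of the object from a depot."""
        return bytes(length)  # a real client would issue an IBP read here

    def parallel_fetch(depots, total_size):
        """Fetch all fragments concurrently and reassemble them in order."""
        ranges = [(off, min(FRAGMENT, total_size - off))
                  for off in range(0, total_size, FRAGMENT)]
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = [pool.submit(fetch_fragment, depots[i % len(depots)], off, length)
                       for i, (off, length) in enumerate(ranges)]
            return b"".join(f.result() for f in futures)

    data = parallel_fetch(["depot-a", "depot-b", "depot-c"], 5 * FRAGMENT)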

Probability of data loss vs space efficiency

Probability of data loss vs. space efficiency for various data integrity schemes. Space efficiency is defined as (unique data)/(data+parity). We assumed a 6% annualized probability of disk failure and a 24 hour rebuild time; changing this assumption changes the scale of the y-axis but not the relative positions of the results for each scheme. The ideal configuration is in the lower right corner: a low chance of data loss and good use of space. Mirror-xN signifies N replicas of the data. RAID5-N and RAID6-N correspond to N total drives, data and parity, used in the array. RAID6-Nx2 corresponds to a RAID6 array of N total drives that is then replicated. RS-D+P represents generic Reed-Solomon using D data disks and P parity disks. The y-axis should be scaled by the number of drive arrays to calculate the actual probability. For a fixed storage capacity (1 petabyte, for example) the number of drive arrays required will vary depending on the data integrity scheme used.
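
For readers who want to reproduce this style of comparison, the sketch below evaluates a simple binomial failure model under the caption's assumptions (6% annualized drive failure rate, 24 hour exposure window). The scheme parameters follow the naming above, but the model itself is assumed for illustration and is not necessarily the exact calculation used to produce the figure.

    # Back-of-envelope data-loss model per drive array. Assumes independent
    # failures and a fixed 24 hour exposure window; illustrative only.
    from math import comb

    P_YEAR = 0.06                                    # annualized drive failure probability
    WINDOWS_PER_YEAR = 8760.0 / 24.0                 # 24 hour rebuild/exposure windows
    P_WINDOW = 1.0 - (1.0 - P_YEAR) ** (1.0 / WINDOWS_PER_YEAR)

    def loss_probability(total_drives, tolerated_failures):
        """Annualized probability that more drives fail in one window than the
        scheme can tolerate (rough union bound over all windows in a year)."""
        p_loss = sum(comb(total_drives, k) * P_WINDOW ** k *
                     (1.0 - P_WINDOW) ** (total_drives - k)
                     for k in range(tolerated_failures + 1, total_drives + 1))
        return p_loss * WINDOWS_PER_YEAR

    schemes = {                  # name: (total drives, tolerated failures, data drives)
        "Mirror-x3": (3, 2, 1),
        "RAID5-6":   (6, 1, 5),
        "RAID6-12":  (12, 2, 10),
        "RS-6+3":    (9, 3, 6),
        "RS-10+5":   (15, 5, 10),
    }
    for name, (n, f, d) in schemes.items():
        print(f"{name:10s} efficiency={d / n:.2f}  annual loss prob ~ {loss_probability(n, f):.2e}")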

Data integrity

L-Store intrinsically addresses the question of what fault tolerance strategy to use on a file-by-file basis by storing the data layout strategy in the file's metadata. Thus the most efficient fault tolerance method can be chosen for each file, and changed as conditions change, which is essential for an ongoing, long-term data repository. L-Store can be extended to support new fault tolerance strategies as they become available. Because L-Store directly dictates the data protection strategy, it can implement more exotic and powerful schemes than simple RAID5 and RAID6. We typically recommend a 6+3 Reed-Solomon (RS-6+3) encoding, with 6 data disks and 3 parity disks. This allows any 3 of the 9 disks to fail while all data remains fully accessible. More generally we recommend the RS-d+(d/2) family of configurations, shown in the figure as RS-6+3, RS-8+4, and RS-10+5. This family uses 2/3 of the total space for data, with the remainder for parity, and has excellent reliability. RS-6+3 has the same reliability as keeping 3 copies of the data but uses only half the space.
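
To make the erasure-tolerance claim concrete, the toy example below encodes 6 data bytes with 3 parity symbols using the third-party Python reedsolo package and then recovers the data after erasing any 3 of the 9 positions. This is only a byte-level analogy; L-Store's actual encoder operates on disk blocks spread across depots, and reedsolo is not part of L-Store.

    # Byte-level illustration of the RS-6+3 idea using the third-party
    # `reedsolo` package (an analogy only; real encoding works on disk blocks).
    from reedsolo import RSCodec

    rsc = RSCodec(3)                                 # 3 parity symbols, like 3 parity disks
    codeword = bytearray(rsc.encode(b"STRIPE"))      # 6 data bytes -> 9-byte codeword

    # Simulate losing any 3 of the 9 "disks" (erasures at known positions).
    lost = [1, 4, 8]
    for pos in lost:
        codeword[pos] = 0

    recovered = rsc.decode(codeword, erase_pos=lost)
    # Recent reedsolo versions return a (message, message+ecc, errata) tuple.
    message = recovered[0] if isinstance(recovered, tuple) else recovered
    print(bytes(message))                            # b'STRIPE' -- all data recovered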

Another issue commonly encountered in large scale storage is silent bit errors. A silent bit error occurs when a bit's state is unintentionally changed without the drive detecting it; such errors can also occur in the HBA controlling the drive or during the network transfer to or from the client. Given manufacturer-specified unrecoverable bit error rates of 1 in 10^14 or 10^15 and the size of available drives, these do occur; for example, reading a full 10 TB drive is roughly 8x10^13 bits, so at the 10^-14 rate nearly one unrecoverable error is expected per full pass. A more common source of silent errors, however, is a drive or computer crash resulting in garbage being stored in portions of an allocation. The file system may replay its log and correct some of these errors, but it does not catch all of them. This is even harder in a distributed context, since the IBP write operation will have completed from the remote application's standpoint before the operating system has had a chance to flush the data to disk. One option would be to return success to the remote client only after the data had been flushed to disk, but that would greatly impact performance.

Another option is to interleave block-level checksum information with the data. This has some impact on disk performance, but much less than the alternative. The tradeoff is the additional processing power needed to calculate the checksums; this work is done only on the depot, which has plenty of excess computing capacity. Block-level checksums are enabled at the allocation's creation time. When an allocation has block-level checksums enabled, every read and write operation compares the data against the stored checksum to verify data integrity. If an error is detected, the depot immediately notifies the remote application, allowing it to deal with the issue properly. This normally triggers either a soft or a hard error, typically in the segment_lun driver, which is corrected by another layer. These errors are tracked and used to update the system.soft_errors and system.hard_errors attributes, which in turn trigger more in-depth inspections of the object.
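
The sketch below shows the general idea of interleaving a per-block checksum with the data so corruption is caught at read time and can be reported to the caller. The 64 KiB block size, the CRC32 algorithm, and the layout are assumptions for illustration; the depot's actual on-disk format is not shown here.

    # Illustrative block-level checksum interleaving; not the depot's format.
    import struct
    import zlib

    BLOCK = 64 * 1024          # data bytes per block (arbitrary choice for the sketch)
    CHUNK = BLOCK + 4          # block plus its 4-byte CRC32 as stored

    def write_with_checksums(data: bytes) -> bytes:
        """Interleave a CRC32 after every data block."""
        out = bytearray()
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            out += block + struct.pack("<I", zlib.crc32(block))
        return bytes(out)

    def read_with_checksums(stored: bytes) -> bytes:
        """Verify each block's CRC32 on read; a mismatch is raised immediately
        so a higher layer can repair the block."""
        out = bytearray()
        for i in range(0, len(stored), CHUNK):
            chunk = stored[i:i + CHUNK]
            block, (crc,) = chunk[:-4], struct.unpack("<I", chunk[-4:])
            if zlib.crc32(block) != crc:
                raise IOError(f"checksum mismatch in block starting at offset {i}")
            out += block
        return bytes(out)

    stored = write_with_checksums(b"some payload" * 10000)
    assert read_with_checksums(stored) == b"some payload" * 10000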

L-Store also has the ability to extend this block-level checksum across the network, allowing the detection and correction of silent network errors. Combining this with the IBP block-level checksums provides the client with a strong end-to-end guarantee of data integrity.

Diagram showing a traditional RAID array rebuild vs using a distributed approach.

Distributed RAID Arrays

Traditional RAID arrays completely reconstruct a single failed drive on a single replacement drive. The use of a single replacement drive is a major factor in the rebuild time. Traditionally the entire drive is reconstructed with no regard for used vs. unused space. If the array is active with user I/O requests, this greatly increases the rebuild time.

L-Store uses distributed RAID arrays, which are designed to overcome these limitations. Instead of treating each disk as a whole, the disk is broken up into many smaller blocks. These blocks are combined with blocks on other disks, creating many small logical RAID arrays that utilize a large subset of the available drives, as shown in the figure. These distributed logical arrays are based on space that is actually used, and the free space on each drive can be used to store the newly reconstructed data. This allows a large number of drives to be read and written simultaneously, providing significantly faster rebuild times. In L-Store, each file and its associated parity are placed on a random collection of drives based on the fault tolerance scheme used and the data placement criteria. We routinely rebuild a single 2 TB drive in a couple of hours using a single host to perform the data reconstruction; adding more hosts causes the rebuild time to decrease proportionately.
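
As a back-of-envelope illustration of why spreading the work helps, the sketch below compares rewriting a whole drive onto a single replacement with rebuilding only the used space across many drives. The drive speed, utilization, and helper drive count are assumed values, and the model ignores parity reads, network limits, and competing user I/O.

    # Assumed numbers for illustration only; not L-Store measurements.
    DRIVE_TB = 2.0               # failed drive capacity (TB)
    WRITE_MBS = 150.0            # sustained single-drive write rate (MB/s)
    USED_FRACTION = 0.5          # fraction of the failed drive actually holding data
    HELPER_DRIVES = 50           # drives sharing the distributed reconstruction

    def hours(tb, mbs):
        """Time to move `tb` terabytes at `mbs` MB/s, in hours."""
        return tb * 1e6 / mbs / 3600.0

    # Traditional RAID: the whole drive is rewritten onto one replacement drive.
    traditional = hours(DRIVE_TB, WRITE_MBS)

    # Distributed: only used space is rebuilt, with the I/O spread across the
    # free space of many drives.
    distributed = hours(DRIVE_TB * USED_FRACTION, WRITE_MBS * HELPER_DRIVES)

    print(f"traditional rebuild ~ {traditional:.1f} h")   # about 3.7 h, before user I/O slows it
    print(f"distributed rebuild ~ {distributed:.2f} h")   # minutes when the work is spread out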

L-Store configurations

Flexible Architecture

L-Store provides an extremely flexible framework that can adapt to your needs not just today but as they evolve. Its plugin-based architecture allows even more flexibility.

Entry Level System

L-Store can evolve based on your needs. The entry-level L-Store solution consists of a single L-Server handling metadata operations coupled with a single storage depot appliance. L-Store can then be natively mounted on Linux-based systems, and all operating systems can access data using the LIO command line tools.

NFS & CIFS Support

L-Store supports adding a NAS head to export data natively to NFS and CIFS clients.

Commercial Backup Support

L-Store is fully integrated with the True incremental Backup System for automated backup of defined L-Store storage to TiBS backup servers.

Data Assurance & High Availability

Additional L-Servers can be added as auxiliary metadata servers, introducing redundancy for directory services. Additional L-Store Archive and NAS Depots can also be deployed, permitting data to be distributed across multiple appliances and locations for greater uptime and durability.

Synchronous and Lazy Replication Options

L-Servers can be placed in geographically diverse locations. Data can be synchronously replicated among locations for high availability applications. In addition, L-Servers can facilitate faster access to data in geographically remote regions by employing lazy-replication features to local depots.

Archive Storage

L-Store archive appliances can ingest data from foreign storage platforms and archive it into L-Store. When integrated with the True incremental Backup System, L-Store archive appliances can extend storage capabilities to robotic tape library systems from virtually any manufacturer. File lookup databases are simultaneously available in both the TiBS and L-Store systems.

Mountable Archive

Users can mount the L-Store archive locally as a read-only file system without having to transfer data back to local storage.

Transparent Migration Capabilities

Transparent migration of data is possible in L-Store because name and metadata services are separated from the file service. Files can be replicated and moved across NAS and Archive appliances without users needing to know specific mount points.

Location Independence of L-Store Data

File data can be located anywhere in the fabric or in multiple locations simultaneously.