Applied Distributed Systems Lab (ADSLab)

Department of Computer Science and Engineering, The Chinese University of Hong Kong
 

Erasure Coding for Dependable Storage at Scale


Modern big data storage systems are prone to failures. Erasure coding is a low-cost redundancy mechanism that achieves higher fault tolerance than traditional replication with much less redundancy, and has been widely deployed. Its idea is to encode original data chunks to form extra parity chunks, such that any subset of a sufficient number of data and parity chunks can reconstruct the original data. Erasure coding makes a trade-off of incurring high bandwidth and I/O costs, especially in failure repair and chunk updates.

Our research applies erasure coding as a low-cost redundancy mechanism to support large-scale dependable storage in distributed environments, with emphasis on balancing the design trade-offs between storage savings and performance efficiency in practical deployment, backed by theoretical and empirical analyses. Its research outputs also lead to a commercial multi-cloud storage system, nCloud, which is adopted by industry in Hong Kong.

We study several topics in erasure coding, including (i) network coding for hierarchical storage, (ii) repair-efficient techniques for general erasure codes, (iii) a unified framework for erasure coding management, (iv) storage scaling for erasure coding, and (v) wide-stripe erasure coding.

See also our research impact derived from this research.

Network Coding for Hierarchical Storage

We apply the network coding theory to support performance-efficient dependable storage at scale. Our first work, NCCloud1 realizes the functional minimum-storage regenerating (FMSR) codes in multi-cloud storage and shows that it minimimizes the bandwidth of repairing a single-cloud failure (by up to 50% compared to the conventional repair in traditional erasure codes) without coded computations at the clouds.

Based on NCCloud, we further consider hierarchical storage with diverse bandwidth at different levels. We focus on rack-based data centers where the cross-rack bandwidth is much more limited than the inner-rack bandwidth; the same analysis also applies to geo-distributed storage where the bandwidth across geographic regions is much more limited than within the same region. We propose a new family of erasure codes called Double Regenerating Codes (DRC)2, which provably minimizes the cross-rack repair traffic. DRC extends classical regenerating codes and builds on the network coding theory to allow rack-level encoding during repair. It focuses on hierarchical block placement and places multiple blocks in different nodes per rack. To repair a failed block, DRC splits a repair operation into inner-rack and cross-rack layers: it first dedicates a node in each rack to perform partial repair using the available blocks within the same rack, and then aggregates the partially repaired blocks from multiple racks to reconstruct the failed block. We show that DRC can significantly reduce the cross-rack repair traffic of the classical regenerating codes (e.g., by 45.5% for some parameters). We further realize DRC constructions in a repair layering framework called DoubleR3 and implement DoubleR on Hadoop Distributed File System (HDFS). Our follow-up work, Rack-aware Regenerating Codes (RRC)4, generalizes DRC and rigorously proves an optimal trade-off curve between the cross-rack repair traffic and the storage redundancy via information theoretical analysis.

  1. Henry C. H. Chen, Yuchong Hu, Patrick P. C. Lee, and Yang Tang
    "NCCloud: A Network-Coding-Based Storage System in a Cloud-of-Clouds"
    IEEE Transactions on Computers (TC), 63(1), pp. 31-44, January 2014 (Special Issue: Cloud of Clouds).
    (An earlier version appeared in FAST 2012)
    [pdf] [software] [doi]

  2. Yuchong Hu, Patrick P. C. Lee, and Xiaoyang Zhang.
    "Double Regenerating Codes for Hierarchical Data Centers."
    Proceedings of IEEE International Symposium on Information Theory (ISIT 2016), Barcelona, Spain, July 2016.
    [pdf] [pptx]

  3. Yuchong Hu, Xiaolu Li, Mi Zhang, Patrick P. C. Lee, Xiaoyang Zhang, Pan Zhou, and Dan Feng.
    "Optimal Repair Layering for Erasure-Coded Data Centers: From Theory to Practice."
    ACM Transactions on Storage (TOS), 13(4), pp. 33:1-33:24, December 2017.
    [pdf] [doi] [software]

  4. Hanxu Hou, Patrick P. C. Lee, Kenneth W. Shum, and Yuchong Hu.
    "Rack-Aware Regenerating Codes for Data Centers."
    IEEE Transactions on Information Theory (TIT), 65(8), pp. 4730-4745, August 2019.
    [pdf] [doi]

Repair-efficient Techniques for General Erasure Codes

In addition to proposing new erasure codes, we also propose novel efficient techniques that work for general erasure code constructions. Repair pipelining1 reduces the single-block repair time to about the same as the normal read time for a single block, by decomposing the repair of a block in small-size slices and carefully scheduling the repair of multiple slices in a pipelined manner. POCache2 supports storage-efficient caching by keeping only parity chunks in cache for straggler tolerance, without assuming the accurate detection of which servers are stragglers. FastPR3 builds on the predictive nature of disk failures by relocating the data of soon-to-fail disks in advance, and proposes a fast predictive repair algorithm for erasure-coded storage. We further propose cross-rack-aware repair and update techniques, respectively called CAR4 and CAU5, for the classical Reed-Solomon codes in hierarchical storage.

  1. Runhui Li, Xiaolu Li, Patrick P. C. Lee, and Qun Huang.
    "Repair Pipelining for Erasure-Coded Storage."
    Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 2017), Santa Clara, CA, USA, July 2017.
    (AR: 60/283 = 21.2%)
    [pdf] [pptx] [software] [corrections]

  2. Mi Zhang, Qiuping Wang, Zhirong Shen, and Patrick P. C. Lee.
    "Parity-Only Caching for Robust Straggler Tolerance."
    Proceedings of 35th International Conference on Massive Storage Systems and Technology (MSST 2019) (Full paper), Santa Clara, CA, USA, May 2019.
    (AR: 21/71 = 29.6%)
    [pdf] [pptx] [software]

  3. Zhirong Shen, Xiaolu Li, and Patrick P. C. Lee.
    "Fast Predictive Repair in Erasure-Coded Storage."
    Proceedings of the 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2019), Portland, Oregon, USA, June 2019.
    (AR: 54/212 = 21.4%)
    [pdf] [pptx] [software]

  4. Zhirong Shen, Patrick P. C. Lee, Jiwu Shu, and Wenzhong Guo.
    "Cross-Rack-Aware Single Failure Recovery for Clustered File Systems."
    IEEE Transactions on Dependable and Secure Computing (TDSC), 17(2), pp. 248-261, March/April 2020.
    (An earlier version appeared in DSN 2016)
    [pdf] [doi]

  5. Zhirong Shen and Patrick P. C. Lee.
    "Cross-Rack-Aware Updates in Erasure-Coded Data Centers: Design and Evaluation."
    IEEE Transactions on Parallel and Distributed Systems (TPDS), 31(10), pp. 2315-2328, October 2020.
    (An earlier version appeared in ICPP 2018)
    [pdf] [doi] [software]

A Unified Framework for Erasure Coding Management

To simplify the deployment of various types of erasure coding solutions into distributed storage systems, we design OpenEC1, a unified and configurable middleware framework that decouples erasure coding management from the storage workflows and provides a programming abstraction for erasure coding operations. We prototype OpenEC on HDFS and Quantcast FS (QFS) and show that the integrations only require limited codebase changes.

  1. Xiaolu Li, Runhui Li, Patrick P. C. Lee, and Yuchong Hu.
    "OpenEC: Toward Unified and Configurable Erasure Coding Management in Distributed Storage Systems."
    Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST 2019), Boston, MA, USA, February 2019.
    (AR: 26/145 = 17.9%)
    [pdf] [pptx] [poster] [tech report] [software]

Storage Scaling for Erasure Coding

System operators need to re-parameterize the right redundancy level for erasure coding in distributed storage systems to adapt to different trade-offs of storage efficiency, fault tolerance, access performance, and management complexity. We study storage scaling, which reconfigures the erasure coding parameters and relocates the currently stored data to different storage nodes on-the-fly. However, redundancy transitioning introduces substantial data transfers among storage nodes. We study the optimal storage scaling problem from both theoretical and applied perspectives. We propose NCScale1, which exploits network coding to achieve information-theoretically minimum bandwidth in storage scaling. For decentralized key-value stores such as Memcached, we propose ECHash2 to support efficient data migration when nodes join or leave a storage system, and also propose Elastic Reed-Solomon codes3 for I/O-efficient redundancy transitioning. Furthermore, we study the optimal trade-off between repair and scaling in rack-based data centers4.

  1. Xiaoyang Zhang, Yuchong Hu, Patrick P. C. Lee, Pan Zhou.
    "Toward Optimal Storage Scaling via Network Coding: From Theory to Practice."
    Proceedings of IEEE International Conference on Computer Communications (INFOCOM 2018), Honolulu, HI, USA, April 2018.
    (AR: 309/1606 = 19.2%)
    [pdf] [pptx] [corrections]

  2. Liangfeng Cheng, Yuchong Hu, and Patrick P. C. Lee.
    "Coupling Decentralized Key-Value Stores with Erasure Coding."
    Proceedings of the ACM Symposium on Cloud Computing 2019 (SoCC 2019), Santa Cruz, CA, USA, November 2019.
    (AR: 39/157 = 24.8%)
    [pdf] [pptx] [software]

  3. Si Wu, Zhirong Shen, and Patrick P. C. Lee.
    "Enabling I/O-Efficient Redundancy Transitioning in Erasure-Coded KV Stores via Elastic Reed-Solomon Codes."
    Proceedings of the 39th International Symposium on Reliable Distributed Systems (SRDS 2020), Shanghai, China, September 2020.
    (AR: 32/131 = 24.4%)
    Prof. C. V. Ramamoorthy Best Paper Award.
    [pdf] [pptx] [software]

  4. Si Wu, Zhirong Shen, Patrick P. C. Lee, Yinlong Xu.
    "Optimal Repair-Scaling Trade-off in Locally Repairable Codes: Analysis and Evaluation."
    Accepted for publication in IEEE Transactions on Parallel and Distributed Systems (TPDS).
    (An earlier version appeared in INFOCOM 2020)

Wide-Stripe Erasure Coding

While erasure coding effectively mitigates storage redundancy, we explore further redundancy reduction under erasure coding to achieve extreme storage savings. We consider wide stripes, in which the fraction of data chunks is set to be significantly larger than the fraction of parity chunks, so as to achieve near-optimal redundancy. Wide stripes are particularly attractive for cold storage systems to achieve long-term data durability at extremely low cost. However, wide stripes aggravate the repair penalty, while existing repair-efficient approaches for erasure coding cannot effectively address wide stripes. We propose Combined Locality, the first mechanism that systematically addresses the wide-stripe repair problem via the combination of both parity locality and topology locality. Experiments on Amazon EC2 show that combined locality reduces the single-chunk repair time by up to 90.5% compared to locality-based state-of-the-arts, with only a redundancy of as low as 1.063x. We further propose StripeMerge, a wide-stripe generation mechanism that selects and merges narrow stripes into wide stripes, with the primary objective of minimizing the wide-stripe generation bandwidth.

  1. Yuchong Hu, Liangfeng Cheng, Qiaori Yao, Patrick P. C. Lee, Weichun Wang, and Wei Chen.
    "Exploiting Combined Locality for Wide-Stripe Erasure Coding in Distributed Storage."
    Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 2021), February 2021.
    (AR: 28/130 = 21.5%)
    [pdf] [pptx] [software]

  2. Qiaori Yao, Yuchong Hu, Liangfeng Cheng, Patrick P. C. Lee, Dan Feng, Weichun Wang, Wei Chen.
    "StripeMerge: Efficient Wide-Stripe Generation for Large-Scale Erasure-Coded Storage."
    Proceedings of the 41st IEEE International Conference on Distributed Computing Systems (ICDCS 2021), July 2021.
    (AR: 97/498 = 19.8%)
    [pdf] [software]


Last updated in June 2021.