Erasure coding offers a storage-efficient redundancy mechanism for maintaining data availability guarantees in large-scale storage clusters, yet it also incurs high performance overhead in failure repair. Recent developments in accurate disk failure prediction allow soon-to-fail (STF) nodes to be repaired in advance, thereby opening new opportunities for accelerating failure repair in erasure-coded storage. To this end, we present a fast predictive repair solution called FastPR, which carefully couples two repair methods, namely migration (i.e., relocating the chunks of an STF node) and reconstruction (i.e., decoding the chunks of an STF node through erasure coding), so as to fully parallelize the repair operation across the storage cluster. FastPR solves a bipartite maximum matching problem and schedules both migration and reconstruction in a parallel fashion. We show that FastPR significantly reduces the repair time over the baseline repair approaches for both Reed-Solomon codes and Azure's Local Reconstruction Codes via mathematical analysis, large-scale simulation, and Amazon EC2 experiments.
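To make the matching step more concrete, the following is a minimal sketch (not the FastPR implementation) of how a set of STF chunks that can be reconstructed in parallel might be chosen with bipartite matching. It assumes each STF chunk needs `k` helper chunks read from distinct healthy nodes of its stripe, and that each healthy node serves at most one helper read per repair round. The function name `find_reconstruction_set`, the input layout, and the greedy chunk-by-chunk admission are all illustrative assumptions.

```python
from typing import Dict, Hashable, List, Set, Tuple

Slot = Tuple[Hashable, int]  # (chunk id, helper index in 0..k-1)


def _try_augment(slot: Slot,
                 candidates: Dict[Slot, List[Hashable]],
                 match_of_node: Dict[Hashable, Slot],
                 visited: Set[Hashable]) -> bool:
    """Augmenting-path step (Kuhn's algorithm): try to give `slot` a node."""
    for node in candidates[slot]:
        if node in visited:
            continue
        visited.add(node)
        owner = match_of_node.get(node)
        # Take a free node, or evict the current owner if it can move elsewhere.
        if owner is None or _try_augment(owner, candidates, match_of_node, visited):
            match_of_node[node] = slot
            return True
    return False


def find_reconstruction_set(stripe_nodes: Dict[Hashable, Set[Hashable]],
                            k: int) -> Dict[Hashable, List[Hashable]]:
    """Pick STF chunks that can be reconstructed in parallel (hypothetical helper).

    stripe_nodes maps each STF chunk to the healthy nodes that hold the
    surviving chunks of its stripe. The result maps each admitted chunk to
    k helper nodes, with no node used by more than one chunk.
    """
    candidates: Dict[Slot, List[Hashable]] = {}
    match_of_node: Dict[Hashable, Slot] = {}
    selected: List[Hashable] = []

    for chunk, nodes in stripe_nodes.items():
        snapshot = dict(match_of_node)  # saved so a failed admission can be undone
        slots = [(chunk, i) for i in range(k)]
        for slot in slots:
            candidates[slot] = sorted(nodes)
        # Admit the chunk only if all k of its helper slots can be matched.
        if all(_try_augment(s, candidates, match_of_node, set()) for s in slots):
            selected.append(chunk)
        else:
            match_of_node.clear()
            match_of_node.update(snapshot)
            for slot in slots:
                del candidates[slot]

    helpers: Dict[Hashable, List[Hashable]] = {c: [] for c in selected}
    for node, (chunk, _) in match_of_node.items():
        helpers[chunk].append(node)
    return {c: sorted(h) for c, h in helpers.items()}


if __name__ == "__main__":
    # Toy layout (made up for illustration): three STF chunks, four healthy nodes.
    layout = {
        "c1": {"N1", "N2", "N3"},
        "c2": {"N2", "N3", "N4"},
        "c3": {"N1", "N2"},
    }
    print(find_reconstruction_set(layout, k=2))
```

In this toy layout only two of the three chunks fit in one round, since four healthy nodes can supply at most two disjoint groups of `k = 2` helpers. A chunk left out of the set would be a candidate for migration from the STF node or for a later reconstruction round, which is the kind of coupling the description above refers to.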
Our FastPR prototype uses the Jerasure library, but you do not need to pre-install it on your system because we include the required part of its source code in our codebase. The prototype runs atop Hadoop 3.1.1; to test it, please download Hadoop 3.1.1 from the official website.
FastPR is co-developed by (i) the School of Informatics at Xiamen University and (ii) the Applied Distributed Systems Lab in the Department of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK).