Erasure coding has been widely adopted to protect data storage against failures in production data centers. Given the hierarchical nature of data centers, characterizing how erasure coding and redundancy placement affect the reliability of erasure-coded data centers is critical yet largely unexplored. This paper presents a comprehensive simulation analysis of the reliability of erasure-coded data centers. We conduct the analysis by building a discrete-event simulator called SimEDC, which reports reliability metrics of an erasure-coded data center based on configurable inputs: the data center topology, erasure codes, redundancy placement, and failure/repair patterns of different subsystems obtained from statistical models or production traces. Our simulation results show that placing erasure-coded data in fewer racks generally improves reliability by reducing cross-rack repair traffic, even though it sacrifices rack-level fault tolerance in the face of correlated failures.
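To illustrate why placing a stripe in fewer racks reduces cross-rack repair traffic, here is a minimal back-of-the-envelope sketch. It is not part of SimEDC: the function name `cross_rack_repair_chunks` and the (9, 6) parameters are hypothetical, and it assumes an MDS code whose repair reads k surviving chunks, a repair performed in the rack of the lost chunk, and no in-rack aggregation of partially decoded data.

```python
def cross_rack_repair_chunks(n: int, k: int, chunks_per_rack: int) -> int:
    """Chunks that must cross racks to repair one lost chunk of an (n, k) stripe.

    Assumptions (illustrative only, not SimEDC's model):
    - repair reads k surviving chunks of the stripe;
    - the repair runs in the rack that held the lost chunk;
    - chunks in that rack are read locally, the rest come from other racks.
    """
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    if not 1 <= chunks_per_rack <= n:
        raise ValueError("need 1 <= chunks_per_rack <= n")
    # Surviving chunks of the stripe in the same rack as the lost chunk.
    local_survivors = chunks_per_rack - 1
    # Read as many chunks as possible locally; fetch the remainder across racks.
    return k - min(local_survivors, k)


# Hypothetical (9, 6) stripe: one chunk per rack versus three chunks per rack.
flat = cross_rack_repair_chunks(9, 6, chunks_per_rack=1)
grouped = cross_rack_repair_chunks(9, 6, chunks_per_rack=3)
print(flat, grouped)
```

Under these assumptions, grouping more chunks per rack lowers the number of chunks fetched across racks for a single-chunk repair, at the cost of losing more chunks of a stripe when a whole rack fails.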
SimEDC is developed by the Applied Distributed Systems Lab in the Department of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK).
Please contact Mi Zhang if you have any questions.
SimEDC is built on the High-Fidelity Reliability Simulator (HFRS), developed by Kevin Greenan.