An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers

Introduction

Flash-based solid-state drives (SSDs) are increasingly adopted as the mainstream storage media in modern data centers. However, little is known about how SSD failures in the field are correlated, both spatially and temporally. We argue that characterizing correlated failures of SSDs is critical, especially for guiding the design of redundancy protection for high storage reliability. We present an in-depth data-driven analysis of correlated failures in the SSD-based data centers at Alibaba. We study nearly one million SSDs of 11 drive models based on a dataset of SMART logs, trouble tickets, physical locations, and applications. We show that correlated failures in the same node or rack are common, and we study the factors that may affect those correlated failures. We also evaluate via trace-driven simulation how various redundancy schemes affect storage reliability under correlated failures. In total, we report 15 findings. Our dataset and source code are released for public use.

Publication

Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu.
"An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers"
Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 2021), February 2021
[pdf] [pptx] [software] [dataset] [corrections]

Download

Version 1.0.0 (January 2021): ssdanalysis-1.0.0.tar.gz (md5sum: 80022aa436bd5515ef0101783fdab183)
GitHub (mirror): https://github.com/shujiehan/ssdanalysis
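
The MD5 digest listed above can be used to check that the downloaded archive is intact. The following is a minimal sketch of such a check, not part of the released software; the helper names `md5_of_file` and `verify_download` and the local file path are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 digest published on this page for ssdanalysis-1.0.0.tar.gz.
EXPECTED_MD5 = "80022aa436bd5515ef0101783fdab183"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location: assumes the archive is in the current directory.
    archive = Path("ssdanalysis-1.0.0.tar.gz")
    if archive.exists():
        print("OK" if verify_download(archive) else "MD5 mismatch")
    else:
        print(f"{archive} not found")
```

MD5 is adequate here for detecting a corrupted or truncated download, but it is not a defense against deliberate tampering.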

Please contact Shujie Han (sjhan@cse.cuhk.edu.hk) if you have any questions.

License

The source code is released under the GNU General Public License (GPL).

Acknowledgments

We extend the C++ discrete-event simulator SimEDC to support the reliability evaluation on our dataset.