Deduplication has been widely used to improve storage efficiency in modern primary and secondary storage systems, yet how deduplication fundamentally affects storage system reliability remains debatable. This paper analyzes and compares storage system reliability with and without deduplication in primary workloads, using public file system snapshots from two research groups. We first study the redundancy characteristics of the file system snapshots. We then propose a trace-driven, deduplication-aware simulation framework called SimDedup to analyze data loss at both the chunk and file levels due to sector errors and whole-disk failures. Our analysis shows that, compared to storage without deduplication, deduplication consistently reduces the damage of sector errors through intra-file redundancy elimination, but can amplify the damage of whole-disk failures if highly referenced chunks are not carefully placed on disk. To improve reliability, we examine a deliberate copy technique that stores the most referenced chunks in a small dedicated physical area (e.g., 1% of the physical capacity) and repairs them first, and we demonstrate its effectiveness through our simulation framework.
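For intuition, below is a minimal Python sketch of the chunk-selection step behind the deliberate copy technique described above. The function name `select_deliberate_copies`, the fingerprint-to-count dictionaries, and the greedy stop-at-budget policy are illustrative assumptions, not SimDedup's actual implementation; SimDedup itself drives this kind of decision from real snapshot traces inside the simulator.

```python
def select_deliberate_copies(chunk_refs, chunk_sizes, capacity_bytes, reserved_frac=0.01):
    """Greedily pick the most referenced chunks that fit in the reserved area.

    chunk_refs:  dict mapping chunk fingerprint -> reference count
    chunk_sizes: dict mapping chunk fingerprint -> chunk size in bytes
    Returns the set of fingerprints to store (and repair first) in the
    small dedicated physical area.
    """
    budget = capacity_bytes * reserved_frac   # e.g., 1% of physical capacity
    selected, used = set(), 0
    # Sort by reference count, highest first: losing a highly referenced
    # chunk corrupts every file that shares it, so these chunks go to the
    # dedicated area and are repaired first after a failure.
    for fp, _refs in sorted(chunk_refs.items(), key=lambda kv: kv[1], reverse=True):
        size = chunk_sizes[fp]
        if used + size > budget:
            break  # reserved area is full
        selected.add(fp)
        used += size
    return selected

# Example: with a 1 MiB disk and the default 1% budget (~10 KiB),
# only the hottest chunk fits.
refs  = {"a": 120, "b": 3, "c": 57}
sizes = {"a": 4096, "b": 4096, "c": 8192}
print(select_deliberate_copies(refs, sizes, capacity_bytes=1 << 20))  # {'a'}
```

The sketch stops at the first chunk that does not fit, which keeps it simple; an actual placement policy could instead skip oversized chunks and keep scanning for smaller highly referenced ones.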
This software was developed by (i) the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST) and (ii) the Applied Distributed Systems Lab in the Department of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK).
Please contact Shujie Han (sjhan@cse.cuhk.edu.hk) if you have any questions.
The SimDedup code is built on the High-Fidelity Reliability Simulator (HFRS) developed by Kevin Greenan.