TY - GEN
T1 - Tradeoffs in scalable data routing for deduplication clusters
AU - Dong, Wei
AU - Douglis, Fred
AU - Reddy, Sazzala
AU - Li, Kai
AU - Shilane, Philip
AU - Patterson, Hugo
PY - 2011
Y1 - 2011
AB - As data have been growing rapidly in data centers, deduplication storage systems continuously face challenges in providing the corresponding throughputs and capacities necessary to move backup data within backup and recovery window times. One approach is to build a cluster deduplication storage system with multiple deduplication storage nodes. The goal is to achieve scalable throughput and capacity using extremely high-throughput (e.g., 1.5 GB/s) nodes, with a minimal loss of compression ratio. The key technical issue is to route data intelligently at an appropriate granularity. We present a cluster-based deduplication system that can deduplicate with high throughput, support deduplication ratios comparable to those of a single system, and maintain a low variation in the storage utilization of individual nodes. In experiments with dozens of nodes, we examine tradeoffs between stateless data routing approaches with low overhead and stateful approaches that have higher overhead but avoid imbalances that can adversely affect deduplication effectiveness for some datasets in large clusters. The stateless approach has been deployed in a two-node commercial system that achieves 3 GB/s for multi-stream deduplication throughput and currently scales to 5.6 PB of storage (assuming 20X total compression).
UR - http://www.scopus.com/inward/record.url?scp=85077072489&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077072489&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of FAST 2011: 9th USENIX Conference on File and Storage Technologies
SP - 15
EP - 29
BT - Proceedings of FAST 2011
PB - USENIX Association
T2 - 9th USENIX Conference on File and Storage Technologies, FAST 2011
Y2 - 15 February 2011 through 17 February 2011
ER -