Tradeoffs in scalable data routing for deduplication clusters

Wei Dong, Fred Douglis, Sazzala Reddy, Kai Li, Philip Shilane, Hugo Patterson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

52 Scopus citations

Abstract

As data have been growing rapidly in data centers, deduplication storage systems continuously face challenges in providing the corresponding throughputs and capacities necessary to move backup data within backup and recovery window times. One approach is to build a cluster deduplication storage system with multiple deduplication storage system nodes. The goal is to achieve scalable throughput and capacity using extremely high-throughput (e.g. 1.5 GB/s) nodes, with a minimal loss of compression ratio. The key technical issue is to route data intelligently at an appropriate granularity. We present a cluster-based deduplication system that can deduplicate with high throughput, support deduplication ratios comparable to that of a single system, and maintain a low variation in the storage utilization of individual nodes. In experiments with dozens of nodes, we examine tradeoffs between stateless data routing approaches with low overhead and stateful approaches that have higher overhead but avoid imbalances that can adversely affect deduplication effectiveness for some datasets in large clusters. The stateless approach has been deployed in a two-node commercial system that achieves 3 GB/s for multi-stream deduplication throughput and currently scales to 5.6 PB of storage (assuming 20X total compression).
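To make the stateless/stateful distinction concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the routing granularity, the feature that is hashed, or the balancing rule, so the choices here (SHA-1 fingerprints, the minimum fingerprint of a group of chunks as the routing feature, and a tie-break toward the emptiest node) are all assumptions, and the function names are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> int:
    """Content fingerprint of a chunk (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def route_stateless(chunks: list[bytes], num_nodes: int) -> int:
    """Choose a node from the data alone, with no lookup of node contents.

    Assumption: the routing feature is the smallest chunk fingerprint in the
    group, reduced modulo the number of nodes.
    """
    feature = min(fingerprint(c) for c in chunks)
    return feature % num_nodes


def route_stateful(chunks: list[bytes], node_contents: list[set[int]]) -> int:
    """Choose the node that already stores the most of these chunks.

    Assumption: ties are broken in favor of the node holding the least data,
    which nudges placement toward balanced utilization.
    """
    fps = {fingerprint(c) for c in chunks}

    def score(node: int) -> tuple[int, int]:
        stored = node_contents[node]
        return (len(fps & stored), -len(stored))

    return max(range(len(node_contents)), key=score)
```

The stateless function needs only the incoming data and the node count, which is where its low overhead comes from. The stateful function has to consult what each node already holds, which costs more per decision but lets it take existing placement into account.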

Original language: English (US)
Title of host publication: Proceedings of FAST 2011
Subtitle of host publication: 9th USENIX Conference on File and Storage Technologies
Publisher: USENIX Association
Pages: 15-29
Number of pages: 15
ISBN (Electronic): 9781931971829
State: Published - Jan 1 2019
Event: 9th USENIX Conference on File and Storage Technologies, FAST 2011 - San Jose, United States
Duration: Feb 15 2011 → Feb 17 2011

Publication series

Name: Proceedings of FAST 2011: 9th USENIX Conference on File and Storage Technologies

Conference

Conference: 9th USENIX Conference on File and Storage Technologies, FAST 2011
Country/Territory: United States
City: San Jose
Period: 2/15/11 → 2/17/11

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Networks and Communications
  • Hardware and Architecture
