Avoiding the disk bottleneck in the Data Domain deduplication file system

Benjamin Zhu, Kai Li, Hugo Patterson

Research output: Contribution to conference › Paper › peer-review

608 Scopus citations

Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real-world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.
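
To make the first technique concrete, the following is a minimal sketch of how a compact in-memory structure can identify new segments without touching the on-disk index. It assumes the Summary Vector behaves like a Bloom filter over segment fingerprints; the class names (`SummaryVector`, `DedupStore`), the bit-array size, the number of hash probes, the SHA-1 fingerprint, and the dictionary standing in for the on-disk index are all illustrative assumptions, not the production Data Domain implementation.

```python
import hashlib


class SummaryVector:
    """Bloom-filter-style bit array over segment fingerprints (illustrative)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, fingerprint: bytes):
        # Derive one bit position per probe by salting the fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # False means "definitely new"; True means "possibly already stored".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(fingerprint)
        )


class DedupStore:
    """Toy store: a dict stands in for the (expensive) on-disk index."""

    def __init__(self) -> None:
        self.summary = SummaryVector()
        self.on_disk_index = {}   # hypothetical stand-in for the disk index
        self.index_lookups = 0    # counts simulated disk accesses

    def write(self, segment: bytes) -> bool:
        """Store a segment; return True if it was a duplicate."""
        fp = hashlib.sha1(segment).digest()  # assumed fingerprint function
        if self.summary.might_contain(fp):
            # Only a "possibly stored" answer pays for an index lookup.
            self.index_lookups += 1
            if fp in self.on_disk_index:
                return True
        # New segment: record it and update the in-memory summary.
        self.on_disk_index[fp] = segment
        self.summary.add(fp)
        return False


store = DedupStore()
first = store.write(b"segment A")    # new segment, no index lookup needed
second = store.write(b"segment A")   # duplicate, confirmed via the index
```

In this sketch a new segment is rejected by the in-memory filter alone, so only segments that are possibly duplicates incur an index lookup. A false positive costs one unnecessary lookup but never causes data to be misclassified as a duplicate, because the index remains the final authority.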

Original language: English (US)
Pages: 269-282
Number of pages: 14
State: Published - 2008
Event: 6th USENIX Conference on File and Storage Technologies, FAST 2008 - San Jose, United States
Duration: Feb 26 2008 – Feb 29 2008

Conference

Conference: 6th USENIX Conference on File and Storage Technologies, FAST 2008
Country/Territory: United States
City: San Jose
Period: 2/26/08 – 2/29/08

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software
