Abstract
Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.
Original language | English (US) |
---|---|
Pages | 269-282 |
Number of pages | 14 |
State | Published - 2008 |
Event | 6th USENIX Conference on File and Storage Technologies, FAST 2008 - San Jose, United States Duration: Feb 26 2008 → Feb 29 2008 |
Conference
Conference | 6th USENIX Conference on File and Storage Technologies, FAST 2008 |
---|---|
Country/Territory | United States |
City | San Jose |
Period | 2/26/08 → 2/29/08 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Hardware and Architecture
- Software