TY - GEN
T1 - SAP
T2 - 33rd IEEE Conference on Computer Communications, IEEE INFOCOM 2014
AU - Balasubramanian, Bharath
AU - Lan, Tian
AU - Chiang, Mung
PY - 2014
Y1 - 2014
N2 - Given a set of files that show a certain degree of similarity, we consider a novel problem of deduplicating them (eliminating redundant chunks) across a set of distributed servers in a manner that is: (i) space-efficient: the total space needed to deduplicate and store the files is minimized and, (ii) access-efficient: each file can be accessed by communicating with a bounded number of servers, thereby minimizing networkaccess times in congested data center networks. A spaceoptimal solution in which we first deduplicate all the files and then distribute them across the servers (referred to as chunk-distribution), may require communication with many servers to access each file. On the other hand, an access-efficient solution in which we randomly partition the files cross the servers, and then store their unique chunks on each server may not exploit the similarities across files to reduce the space overhead. In this paper, we first show that finding an access-efficient, space optimal solution is an NP-Hard problem. Following this, we present the similarity-aware-partitioning (SAP) algorithms that find access-efficient solutions within polynomial time complexity and guarantees bounded space overhead for arbitrary files. Our experimental verification on files from Dropbox and CNN confirm that the SAP technique is much more space-efficient than random partitioning, while maintaining compression ratio close to the chunk-distribution solution.
AB - Given a set of files that show a certain degree of similarity, we consider a novel problem of deduplicating them (eliminating redundant chunks) across a set of distributed servers in a manner that is: (i) space-efficient: the total space needed to deduplicate and store the files is minimized and, (ii) access-efficient: each file can be accessed by communicating with a bounded number of servers, thereby minimizing networkaccess times in congested data center networks. A spaceoptimal solution in which we first deduplicate all the files and then distribute them across the servers (referred to as chunk-distribution), may require communication with many servers to access each file. On the other hand, an access-efficient solution in which we randomly partition the files cross the servers, and then store their unique chunks on each server may not exploit the similarities across files to reduce the space overhead. In this paper, we first show that finding an access-efficient, space optimal solution is an NP-Hard problem. Following this, we present the similarity-aware-partitioning (SAP) algorithms that find access-efficient solutions within polynomial time complexity and guarantees bounded space overhead for arbitrary files. Our experimental verification on files from Dropbox and CNN confirm that the SAP technique is much more space-efficient than random partitioning, while maintaining compression ratio close to the chunk-distribution solution.
UR - http://www.scopus.com/inward/record.url?scp=84904411034&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84904411034&partnerID=8YFLogxK
U2 - 10.1109/INFOCOM.2014.6847984
DO - 10.1109/INFOCOM.2014.6847984
M3 - Conference contribution
AN - SCOPUS:84904411034
SN - 9781479933600
T3 - Proceedings - IEEE INFOCOM
SP - 592
EP - 600
BT - IEEE INFOCOM 2014 - IEEE Conference on Computer Communications
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 27 April 2014 through 2 May 2014
ER -