TY - JOUR
T1 - A framework for scalable genome assembly on clusters, clouds, and grids
AU - Moretti, Christopher
AU - Thrasher, Andrew
AU - Yu, Li
AU - Olson, Michael
AU - Emrich, Scott
AU - Thain, Douglas
N1 - Funding Information:
This work was supported in part by a University of Notre Dame strategic initiative for Global Health, by the National Institutes of Health (NIAID contract HHSN266200400039C) and US National Science Foundation (NSF) grants CNS06-43229 and CNS08-55047. The authors thank the staff at the Purdue Rosen Center for Advanced Computing and the Wisconsin Condor Team for sharing their computing resources to make this work possible. They thank Dinesh Rajan for assisting with the testing on the Amazon cloud. They thank the anonymous reviewers for their comments and suggestions. The SAND software is licensed under the GNU General Public License and is available for download at http://www.nd.edu/~ccl/software/sand, along with the data sets used in this paper.
PY - 2012
Y1 - 2012
AB - Bioinformatics researchers need efficient means to process large collections of genomic sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces the Scalable Assembler at Notre Dame (SAND) framework that can achieve significant speedup using large numbers of commodity machines harnessed from clusters, clouds, and grids. SAND interfaces with the Celera open-source assembly toolkit, replacing two independent sequential modules with scalable parallel alternatives: the candidate selector exploits distributed memory capacity, and the sequence aligner exploits distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency. We show results for several data sets ranging from 738 thousand to over 320 million alignments using resources ranging from a small cluster to more than a thousand nodes spanning three institutions.
KW - Distributed systems
KW - bioinformatics
KW - genome assembly
UR - http://www.scopus.com/inward/record.url?scp=84869439839&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869439839&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2012.80
DO - 10.1109/TPDS.2012.80
M3 - Article
AN - SCOPUS:84869439839
SN - 1045-9219
VL - 23
SP - 2189
EP - 2197
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 12
M1 - 6165266
ER -