TY - GEN
T1 - Highly scalable genome assembly on campus grids
AU - Moretti, Christopher
AU - Olson, Michael
AU - Emrich, Scott
AU - Thain, Douglas
PY - 2009
Y1 - 2009
N2 - Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit and replaces two independent sequential modules with scalable counterparts: a scalable candidate selector exploits the distributed memory capacity of a campus grid, while the scalable aligner exploits the distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
UR - http://www.scopus.com/inward/record.url?scp=74049117532&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74049117532&partnerID=8YFLogxK
U2 - 10.1145/1646468.1646480
DO - 10.1145/1646468.1646480
M3 - Conference contribution
AN - SCOPUS:74049117532
SN - 9781605587141
T3 - Proceedings of the 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers 2009, MTAGS '09
BT - Proceedings of the 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers 2009, MTAGS '09
T2 - 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers 2009, MTAGS '09
Y2 - 16 November 2009 through 16 November 2009
ER -