TY - GEN
T1 - Scalable speculative parallelization on commodity clusters
AU - Kim, Hanjun
AU - Raman, Arun
AU - Liu, Feng
AU - Lee, Jae W.
AU - August, David I.
PY - 2010
Y1 - 2010
AB - While clusters of commodity servers and switches are the most popular form of large-scale parallel computers, many programs are not easily parallelized for execution upon them. In particular, high inter-node communication cost and lack of globally shared memory appear to make clusters suitable only for server applications with abundant task-level parallelism and scientific applications with regular and independent units of work. Clever use of pipeline parallelism (DSWP), thread-level speculation (TLS), and speculative pipeline parallelism (Spec-DSWP) can mitigate the costs of inter-thread communication on shared memory multicore machines. This paper presents Distributed Software Multi-threaded Transactional memory (DSMTX), a runtime system which makes these techniques applicable to non-shared memory clusters, allowing them to efficiently address inter-node communication costs. Initial results suggest that DSMTX enables efficient cluster execution of a wider set of application types. For 11 sequential C programs parallelized for a 4-core 32-node (128 total core) cluster without shared memory, DSMTX achieves a geomean speedup of 49×. This compares favorably to the 15× speedup achieved by our implementation of TLS-only support for clusters.
KW - Distributed systems
KW - Loop-level parallelism
KW - Multi-threaded transactions
KW - Pipelined parallelism
KW - Software transactional memory
KW - Thread-level speculation
UR - http://www.scopus.com/inward/record.url?scp=79951708803&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951708803&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2010.19
DO - 10.1109/MICRO.2010.19
M3 - Conference contribution
AN - SCOPUS:79951708803
SN - 9780769542997
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 3
EP - 14
BT - Proceedings - 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010
T2 - 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010
Y2 - 4 December 2010 through 8 December 2010
ER -