TY - GEN
T1 - Automatic thread extraction with decoupled software pipelining
AU - Ottoni, Guilherme
AU - Rangan, Ram
AU - Stoler, Adam
AU - August, David I.
PY - 2005
Y1 - 2005
N2 - Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
AB - Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
UR - http://www.scopus.com/inward/record.url?scp=33749375700&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33749375700&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2005.13
DO - 10.1109/MICRO.2005.13
M3 - Conference contribution
AN - SCOPUS:33749375700
SN - 0769524400
SN - 9780769524405
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 105
EP - 116
BT - MICRO-38
T2 - MICRO-38: 38th Annual IEEE/ACM International Symposium on Microarchitecture
Y2 - 12 November 2005 through 16 November 2005
ER -